William J. Riehl

Research Question

Which microbial species and ecological environments show the highest concentration of antibiotic resistance genes, and can we predict resistance accumulation from phylogenetic and ecological features?

Research Plan

Approach

Phase 1: Data Exploration

  • Explore BERDL pangenome database structure and tables
  • Understand relationships between genomes, genes, orthogroups, and taxonomy
  • Document schemas, row counts, and available metadata
  • Identify functional annotation sources (eggNOG, KEGG, COG, PFAM)

Phase 2: ARG Identification

  • Query gene annotations from eggnog_mapper_annotations table
  • Identify ARGs using known resistance databases (CARD, ResFinder) via keyword matching
  • Cross-reference with COG and KEGG functional categories
  • Build ARG annotation dataset linking genes to resistance mechanisms and drug classes
  • Save curated ARG dataset for downstream analysis
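
A minimal sketch of the Phase 2 keyword-matching step, assuming a live Spark session in BERDL JupyterHub; the table name follows the plan above, but the description column and the keyword list are illustrative assumptions to be confirmed during Phase 1:

ARG_KEYWORDS = ["beta-lactamase", "efflux pump", "aminoglycoside", "tetracycline resistance"]

# Build a case-insensitive LIKE predicate from the keyword list; column and
# table names are assumptions pending schema exploration.
predicate = " OR ".join(f"LOWER(description) LIKE '%{kw}%'" for kw in ARG_KEYWORDS)

candidates = spark.sql(f"""
    SELECT gene_cluster_id, description
    FROM kbase_ke_pangenome.eggnog_mapper_annotations
    WHERE {predicate}
    LIMIT 1000
""")
candidates.show(10, truncate=False)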

Phase 3: Distribution Analysis

  • Calculate ARG prevalence metrics by (see the sketch below):
      • Species (% genomes with ARGs, mean ARGs per genome)
      • Drug class distribution
      • Phylogenetic clade patterns
  • Identify statistical outliers (hotspot species)
  • Analyze patterns by environment (if metadata available)
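
The per-species prevalence metric could be computed along these lines (a sketch; arg_genes is the Phase 2 output registered as a temp view, and all table and column names are assumptions):

# Hedged sketch: fraction of genomes per species carrying at least one ARG.
prevalence = spark.sql("""
    SELECT g.species,
           COUNT(DISTINCT g.genome_id) AS n_genomes,
           COUNT(DISTINCT a.genome_id) AS n_genomes_with_args,
           COUNT(DISTINCT a.genome_id) / COUNT(DISTINCT g.genome_id) AS frac_with_args
    FROM kbase_ke_pangenome.genome g
    LEFT JOIN arg_genes a ON g.genome_id = a.genome_id
    GROUP BY g.species
    ORDER BY frac_with_args DESC
""")
prevalence.show(20)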

Phase 4: Pangenome Characterization

  • Classify genes as core, accessory, or unique for each species
  • Compute pangenome openness metrics
  • Analyze correlation between pangenome openness and ARG diversity
  • Test whether ARGs are preferentially core or accessory
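
For the core/accessory split, one common convention is a 95% occupancy threshold. A sketch, where clusters is a hypothetical pandas DataFrame of per-species cluster occupancy counts; the column names and the cutoff are assumptions, not documented project choices:

import numpy as np
import pandas as pd

def classify_clusters(clusters: pd.DataFrame, core_frac: float = 0.95) -> pd.DataFrame:
    # Hedged sketch: label each gene cluster core / accessory / unique
    # within its species. Assumed columns: n_genomes_with_cluster (genomes
    # carrying the cluster) and n_genomes_in_species.
    frac = clusters["n_genomes_with_cluster"] / clusters["n_genomes_in_species"]
    return clusters.assign(
        category=np.select(
            [clusters["n_genomes_with_cluster"] == 1, frac >= core_frac],
            ["unique", "core"],
            default="accessory",
        )
    )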

Phase 5: Fitness Analysis

  • Cross-reference ARG genes with kescience_fitnessbrowser fitness data
  • Extract fitness effects across experimental conditions
  • Analyze trade-offs between resistance carriage and fitness costs
  • Identify essential vs. non-essential ARGs

Phase 6: Integration & Visualization

  • Create publication-quality figures showing:
      • Top hotspot species by ARG prevalence and diversity
      • Phylogenetic distribution patterns
      • Fitness trade-off landscapes
  • Build interactive dashboards for exploration
  • Generate summary statistics and reports

Revision History

  • v1 (2026-02): Migrated from README.md

Overview

This project maps antibiotic resistance genes (ARGs) across 293,059 genomes in the BERDL pangenome collection, identifies species and environmental hotspots, analyzes whether ARGs are preferentially core or accessory genes, and evaluates fitness trade-offs using the Fitness Browser transposon mutant data. It integrates annotations from eggNOG, CARD, and ResFinder with pangenome structure and ecological metadata.

Review

AI Review: BERIL Automated Review, 2026-02-12. Status: Needs Re-review.

Summary

This is a well-conceived research project with an important scientific question: identifying which microbial species and environments harbor the highest concentrations of antibiotic resistance genes. The project demonstrates thorough planning with clear documentation, a phased approach, and attention to data infrastructure challenges. However, the project is currently in the planning stage with no analysis implemented. All six notebooks contain only skeleton code with TODO placeholders, no data files have been generated, and no actual findings exist. The validation work reveals critical technical constraints that must be addressed before implementation can succeed. The project shows promise but requires substantial implementation work to achieve its stated goals.

Methodology

Research Question and Approach

Strengths:
- Research question is clearly stated, testable, and scientifically important
- Six-phase approach is logical: exploration → ARG identification → distribution analysis → pangenome analysis → fitness analysis → visualization
- Recognizes the need to integrate multiple data sources (pangenome, annotations, fitness, taxonomy)
- Acknowledges key challenges upfront (ARG detection methodology, data source mapping)

Weaknesses:
- No hypothesis stated: The research question asks "which species show highest ARG concentration?" but doesn't propose a testable hypothesis (e.g., "open pangenomes will have higher ARG diversity than closed pangenomes")
- ARG identification methodology is underspecified: The approach relies on "keyword matching" of gene descriptions, which will have high false positive/negative rates. No validation strategy is defined for confirming that matched genes are true ARGs
- Missing validation plan: No description of how to verify that identified ARGs are functionally validated resistance genes vs. homologs or annotations with resistance-related keywords
- Unclear fitness data mapping: The project assumes ARG genes from pangenome can be mapped to fitness browser data, but the two databases use different organisms and gene ID schemes. This mapping challenge is mentioned but not solved

Reproducibility

Strengths:
- Excellent documentation structure with README, QUICKSTART, IMPORTANT_NOTES, ANALYSIS_CHECKLIST
- Clear file organization with designated directories for data, figures, results
- requirements.txt provided with specific version constraints
- Detailed BERDL_VALIDATION_REAL.md documents actual API testing results
- Acknowledges that notebooks must run in BERDL JupyterHub (not locally)

Weaknesses:
- Cannot currently be reproduced: All notebooks are skeleton code with no executable analysis
- External data sources not specified: References CARD and ResFinder databases but provides no download links, version numbers, or parsing instructions
- No sample data: No small test datasets provided to demonstrate the approach
- Missing intermediate data: The pipeline assumes data flows between notebooks (e.g., notebook 02 creates arg_annotations.csv for notebook 03), but with no implemented code, the data schema and format are undefined

Data Sources

Clearly Identified:
- Primary: kbase_ke_pangenome collection (293,059 genomes, GTDB r214)
- Secondary: kescience_fitnessbrowser for fitness data
- External: CARD, ResFinder, PATRIC (mentioned but not integrated)

Issues:
- eggNOG annotations table inaccessible: BERDL_VALIDATION_REAL.md reveals this critical table times out via REST API, but notebooks don't account for this
- orthogroup table missing: Validation shows this table doesn't exist or is inaccessible, yet notebook 01 queries it
- Fitness data integration unclear: No evidence that pangenome gene IDs can be mapped to fitness browser locus IDs across different organisms

Code Quality

SQL Queries

Not Assessed - No Executable Queries:
All SQL queries in notebooks are either commented out, wrapped in TODO blocks, or are template strings that won't execute. Examples:

Notebook 01, cell 7:

print(f"\nRow count: {spark.sql('SELECT COUNT(*) as count FROM kbase_ke_pangenome.pangenome').collect()[0]['count']}")

This query will work for the pangenome table but will time out for large tables like gene or eggnog_mapper_annotations, per the validation findings.
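
Per those findings, a safer pattern is to sample with LIMIT and inspect the schema rather than count; a sketch:

# Hedged sketch: avoid COUNT(*) full scans on billion-row tables.
sample = spark.sql("SELECT * FROM kbase_ke_pangenome.gene LIMIT 100")
sample.printSchema()            # documents the columns without a full scan
sample.show(5, truncate=False)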

Notebook 02, cell 5:

genes_query = """
SELECT gene_id, orthogroup_id, genome_id, gene_description, gene_function
FROM kbase_ke_pangenome.gene
WHERE gene_description IS NOT NULL OR gene_function IS NOT NULL
LIMIT 1000
"""

This query string is defined but never executed. It also references columns (gene_description, gene_function) that may not exist in the actual schema.

Notebook 03, cell 3:
The prevalence query references a table arg_genes that doesn't exist and would need to be created in notebook 02:

LEFT JOIN arg_genes ON g.genome_id = arg_genes.genome_id

Critical Issues from Validation:
- ❌ SHOW TABLES IN kbase_ke_pangenome will time out (55 seconds)
- ❌ COUNT(*) queries on billion-row tables will time out
- ❌ orthogroup table queries will fail (table doesn't exist)
- ❌ eggnog_mapper_annotations inaccessible via REST API

Statistical Methods

Not Assessed - No Implementation:
Notebooks mention statistical tests (Pearson/Spearman correlations, significance tests) but contain no code implementing them. Notebook 03 imports scipy.stats but never uses it.
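
If implemented, the Phase 4 correlation could be as simple as this sketch, where per_species is a hypothetical DataFrame with openness and ARG-diversity columns:

from scipy import stats

# Hedged sketch: Spearman correlation between pangenome openness and ARG
# diversity across species; per_species and its column names are hypothetical.
rho, p_value = stats.spearmanr(per_species["openness"], per_species["n_distinct_args"])
print(f"Spearman rho = {rho:.3f}, p = {p_value:.2e}, n = {len(per_species)}")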

Notebook Organization

Good Structure:
- Logical flow: Setup → Query → Analysis → Visualization
- Markdown cells clearly describe each phase
- Imports at the top of each notebook
- Clear separation of concerns across 6 notebooks

Issues:
- No error handling: No try/except blocks for queries that might fail
- No checkpointing: Long pipelines have no intermediate saves to recover from failures
- Inconsistent .env handling: Notebook 01 loads .env but subsequent notebooks don't (unclear if spark session persists)
- No data validation: No checks that loaded dataframes have expected columns/shapes
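
A sketch of the missing defensive pattern; the run_query helper and the required-column set are illustrative, not project code:

REQUIRED_COLS = {"gene_id", "genome_id"}

def run_query(sql: str):
    # Hedged sketch: surface failures (e.g., timeouts) with context and
    # validate the result shape before any downstream use.
    try:
        df = spark.sql(sql).toPandas()
    except Exception as exc:
        raise RuntimeError(f"Query failed (possible timeout): {sql[:80]}...") from exc
    missing = REQUIRED_COLS - set(df.columns)
    if missing:
        raise ValueError(f"Result is missing expected columns: {missing}")
    return df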

Pitfall Awareness

Pitfalls Documented but Not Addressed:

The project is aware of docs/pitfalls.md but notebooks don't implement the recommended solutions:

  1. String-typed numeric columns (pitfalls.md line 40): Fitness browser columns are all strings requiring CAST, but notebook 05 has no casting code

  2. Annotation join keys (pitfalls.md line 148): eggNOG annotations join on gene_cluster_id not gene_id, but notebook 02 doesn't specify this

  3. Gene clusters are species-specific (pitfalls.md line 165): Cannot compare cluster IDs across species, but no code enforces species-level grouping

  4. API vs Spark SQL (pitfalls.md line 318): Large queries must use Spark SQL not REST API, but notebooks don't configure Spark session appropriately
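
Items 1 and 2 translate into query patterns like this sketch; the table and column names follow pitfalls.md but are unverified assumptions about the live schema:

# Hedged sketch: CAST string-typed fitness columns to numeric, and join
# annotations on gene_cluster_id rather than gene_id (per pitfalls.md).
fitness = spark.sql("""
    SELECT locusId,
           CAST(fit AS DOUBLE) AS fitness,
           CAST(t AS DOUBLE) AS t_score
    FROM kescience_fitnessbrowser.genefitness
    LIMIT 1000
""")

annotations = spark.sql("""
    SELECT gc.gene_cluster_id, ann.description
    FROM kbase_ke_pangenome.gene_cluster gc
    JOIN kbase_ke_pangenome.eggnog_mapper_annotations ann
      ON gc.gene_cluster_id = ann.gene_cluster_id
    LIMIT 1000
""")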

Findings Assessment

No Findings to Assess

The project has zero completed analysis, so there are no conclusions, visualizations, or results to evaluate. Specifically:

  • Timeline status (README lines 143-149): All phases marked incomplete (accurate)
  • data/ directory: Empty except .gitkeep
  • figures/ directory: Empty except .gitkeep
  • results/ directory: Empty
  • Expected outputs (README lines 120-140): None generated

Analysis Completeness

Phase 1 (Data Exploration): 0% complete
- No table schemas documented
- No row counts obtained
- No data quality assessment performed

Phase 2 (ARG Identification): 0% complete
- No CARD/ResFinder databases downloaded
- No keyword matching implemented
- No arg_annotations.csv created

Phase 3 (Distribution Analysis): 0% complete
- No prevalence calculations
- No hotspot species identified
- No hotspot_species.csv created

Phase 4 (Pangenome Analysis): 0% complete
- No core/accessory classification
- No openness metrics calculated
- No correlations tested

Phase 5 (Fitness Analysis): 0% complete
- No fitness browser exploration
- No gene ID mapping attempted
- No fitness_results.csv created

Phase 6 (Visualization): 0% complete
- No figures generated
- No dashboards built
- No summary statistics written

Validation Work

Positive Aspect:
The BERDL_VALIDATION_REAL.md file shows excellent due diligence by actually testing queries against the API. This validation reveals critical issues that would block implementation:

  • eggNOG table inaccessible (needed for Phase 2)
  • COUNT(*) queries timeout on large tables (affects Phase 1)
  • orthogroup table missing (breaks Notebook 01)

However, these findings have not been integrated back into the notebooks. The notebooks still contain queries that the validation proves will fail.

Suggestions

Critical Issues (Must Fix Before Implementation)

  1. Update Notebook 01 to remove failing queries
       • Remove the SHOW TABLES query (times out per validation line 55)
       • Remove orthogroup table queries (table doesn't exist per validation line 75)
       • Replace COUNT(*) on large tables with LIMIT-based sampling
       • Use only tables confirmed in BERDL_VALIDATION_REAL.md (genome, pangenome, gene_cluster)

  2. Fix the ARG identification strategy in Notebook 02
       • Current approach: keyword matching on gene descriptions (high false-positive rate)
       • Recommendation: download the CARD ARO ontology and run sequence similarity searches (BLAST/DIAMOND) against CARD protein sequences with an identity threshold (≥80% for close homologs); a sketch follows this list
       • Implement validation: manually inspect a sample of matches to estimate precision/recall
       • Document limitations: annotation-based detection will miss novel ARGs and may include non-functional homologs

  3. Address the eggNOG table inaccessibility
       • Validation shows eggnog_mapper_annotations times out via the REST API
       • Solution: add an explicit note in the notebook that it MUST run in BERDL JupyterHub with Spark SQL, not via the REST API
       • Alternative: use the gene_cluster table joined with CARD external references instead

  4. Resolve the fitness data mapping problem
       • The Fitness Browser covers different organisms (model organisms like E. coli K-12) than the pangenome (wild-type strains)
       • Gene ID formats don't match between the databases
       • Recommendation: either (a) limit the fitness analysis to organisms present in both databases, or (b) remove Phase 5 entirely and acknowledge it as future work
       • The current plan assumes the mapping works but provides no code or evidence that it is possible

  5. Define a testable hypothesis
       • The current research question is exploratory ("which species have ARGs?")
       • Add a hypothesis to test, e.g.: "Species with open pangenomes (high accessory:core ratio) will have higher ARG diversity than species with closed pangenomes"
       • This enables statistical testing in Phase 4
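
The similarity-based alternative in item 2 could start from a sketch like this, assuming DIAMOND is installed and CARD protein sequences have been downloaded; file paths are placeholders:

import subprocess

# Hedged sketch: build a DIAMOND database from CARD proteins, then search
# pangenome proteins at a stringent identity threshold (80%, per item 2).
subprocess.run(
    ["diamond", "makedb", "--in", "card_proteins.faa", "--db", "card"],
    check=True,
)
subprocess.run(
    ["diamond", "blastp",
     "--query", "pangenome_proteins.faa",
     "--db", "card",
     "--id", "80",            # minimum percent identity
     "--outfmt", "6",         # tabular output
     "--out", "card_hits.tsv"],
    check=True,
)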

High-Priority Improvements

  1. Implement Phase 1 with working queries
       • Use queries confirmed to work in the validation (e.g., SELECT * FROM pangenome LIMIT 10)
       • Document the actual schemas found (don't guess column names)
       • Save schema documentation to data/schemas.json for reference

  2. Create a sample ARG annotation dataset
       • Even 10-100 sample ARGs are enough to exercise the pipeline
       • Download the CARD database v3.2.9 (latest as of 2026) and parse ARO IDs and drug classes
       • Implement keyword search on one species to test feasibility before scaling

  3. Add data validation between phases
       • Notebook 03 expects arg_annotations.csv from Notebook 02
       • Add a cell at the start of Notebook 03 that checks the file exists and has the required columns
       • Fail fast with a clear error message if upstream data is missing

  4. Implement checkpointing for long queries (see the sketch after this list)
       • Validation shows large queries may time out
       • Save intermediate results, e.g. df.write.parquet("data/intermediate/step1.parquet")
       • Skip recomputation when a checkpoint already exists

  5. Use gene_cluster instead of orthogroup
       • Validation confirms the orthogroup table doesn't exist
       • Per pitfalls.md line 148, annotations join on gene_cluster_id
       • Update all notebooks to use the correct table name
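
The checkpointing pattern in item 4, as a sketch with illustrative paths:

from pathlib import Path

checkpoint = Path("data/intermediate/step1.parquet")

# Hedged sketch: recompute only when the checkpoint is absent.
if checkpoint.exists():
    df = spark.read.parquet(str(checkpoint))
else:
    df = spark.sql("SELECT * FROM kbase_ke_pangenome.pangenome LIMIT 10000")
    df.write.parquet(str(checkpoint))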

Nice-to-Have Enhancements

  1. Add example queries that work

    • Include 2-3 fully executable queries in Notebook 01 that successfully return data
    • Use validation test results (pangenome sample query worked)
    • Builds user confidence that infrastructure works
  2. Visualize intermediate results

    • Phase 3 should create at least 1 simple bar chart (top 10 species by genome count)
    • Even without ARG data, you can visualize pangenome structure
    • Shows progress and helps debugging
  3. Document data quality issues

    • AlphaEarth embeddings only cover 28.4% of genomes (pitfalls.md line 85)
    • NCBI environment metadata is sparse and EAV format (pitfalls.md line 103)
    • Acknowledge these will limit environmental analysis in Phase 3
  4. Add expected runtime estimates

    • Based on validation, note which queries are fast (<5s) vs slow (>30s)
    • Helps users know if their notebook is hung or just slow
  5. Create a "quick validation" notebook

    • Notebook 00 that runs 5-10 simple queries to verify BERDL access
    • Tests: list databases, sample pangenome, check genome count
    • Users can run this first to confirm environment before starting Phase 1
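
The suggested Notebook 00 could be as small as this sketch, restricted to query patterns the validation report confirmed:

# Hedged sketch: smoke tests for BERDL access before starting Phase 1.
checks = {
    "sample pangenome": "SELECT * FROM kbase_ke_pangenome.pangenome LIMIT 5",
    "sample genome": "SELECT * FROM kbase_ke_pangenome.genome LIMIT 5",
}

for name, sql in checks.items():
    try:
        rows = spark.sql(sql).collect()
        print(f"OK   {name}: {len(rows)} rows")
    except Exception as exc:
        print(f"FAIL {name}: {exc}")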

This review was generated by an AI system. It should be treated as advisory input, not a definitive assessment.
