William J. Riehl

Research Question

Which microbial species and ecological environments show the highest concentration of antibiotic resistance genes, and can we predict resistance accumulation from phylogenetic and ecological features?

Research Plan

Approach

Phase 1: Data Exploration

  • Explore BERDL pangenome database structure and tables
  • Understand relationships between genomes, genes, orthogroups, and taxonomy
  • Document schemas, row counts, and available metadata
  • Identify functional annotation sources (eggNOG, KEGG, COG, PFAM)

Phase 2: ARG Identification

  • Query gene annotations from eggnog_mapper_annotations table
  • Identify ARGs using known resistance databases (CARD, ResFinder) via keyword matching
  • Cross-reference with COG and KEGG functional categories
  • Build ARG annotation dataset linking genes to resistance mechanisms and drug classes
  • Save curated ARG dataset for downstream analysis
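
A minimal sketch of the Phase 2 keyword-matching step, assuming a live Spark session in BERDL JupyterHub; the table name follows the plan above, but the description column and the keyword list are illustrative assumptions to be confirmed during Phase 1:

ARG_KEYWORDS = ["beta-lactamase", "efflux pump", "aminoglycoside", "tetracycline resistance"]

# Build a case-insensitive LIKE predicate from the keyword list; column and
# table names are assumptions pending schema exploration.
predicate = " OR ".join(f"LOWER(description) LIKE '%{kw}%'" for kw in ARG_KEYWORDS)

candidates = spark.sql(f"""
    SELECT gene_cluster_id, description
    FROM kbase_ke_pangenome.eggnog_mapper_annotations
    WHERE {predicate}
    LIMIT 1000
""")
candidates.show(10, truncate=False)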

Phase 3: Distribution Analysis

  • Calculate ARG prevalence metrics by (see the sketch below):
      • Species (% genomes with ARGs, mean ARGs per genome)
      • Drug class distribution
      • Phylogenetic clade patterns
  • Identify statistical outliers (hotspot species)
  • Analyze patterns by environment (if metadata available)
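
The per-species prevalence metric could be computed along these lines (a sketch; arg_genes is the Phase 2 output registered as a temp view, and all table and column names are assumptions):

# Hedged sketch: fraction of genomes per species carrying at least one ARG.
prevalence = spark.sql("""
    SELECT g.species,
           COUNT(DISTINCT g.genome_id) AS n_genomes,
           COUNT(DISTINCT a.genome_id) AS n_genomes_with_args,
           COUNT(DISTINCT a.genome_id) / COUNT(DISTINCT g.genome_id) AS frac_with_args
    FROM kbase_ke_pangenome.genome g
    LEFT JOIN arg_genes a ON g.genome_id = a.genome_id
    GROUP BY g.species
    ORDER BY frac_with_args DESC
""")
prevalence.show(20)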

Phase 4: Pangenome Characterization

  • Classify genes as core, accessory, or unique for each species
  • Compute pangenome openness metrics
  • Analyze correlation between pangenome openness and ARG diversity
  • Test whether ARGs are preferentially core or accessory
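
For the core/accessory split, one common convention is a 95% occupancy threshold. A sketch, where clusters is a hypothetical pandas DataFrame of per-species cluster occupancy counts; the column names and the cutoff are assumptions, not documented project choices:

import numpy as np
import pandas as pd

def classify_clusters(clusters: pd.DataFrame, core_frac: float = 0.95) -> pd.DataFrame:
    # Hedged sketch: label each gene cluster core / accessory / unique
    # within its species. Assumed columns: n_genomes_with_cluster (genomes
    # carrying the cluster) and n_genomes_in_species.
    frac = clusters["n_genomes_with_cluster"] / clusters["n_genomes_in_species"]
    return clusters.assign(
        category=np.select(
            [clusters["n_genomes_with_cluster"] == 1, frac >= core_frac],
            ["unique", "core"],
            default="accessory",
        )
    )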

Phase 5: Fitness Analysis

  • Cross-reference ARG genes with kescience_fitnessbrowser fitness data
  • Extract fitness effects across experimental conditions
  • Analyze trade-offs between resistance carriage and fitness costs
  • Identify essential vs. non-essential ARGs

Phase 6: Integration & Visualization

  • Create publication-quality figures showing:
      • Top hotspot species by ARG prevalence and diversity
      • Phylogenetic distribution patterns
      • Fitness trade-off landscapes
  • Build interactive dashboards for exploration
  • Generate summary statistics and reports

Revision History

  • v1 (2026-02): Migrated from README.md

Overview

This project maps antibiotic resistance genes (ARGs) across 293,059 genomes in the BERDL pangenome collection, identifies species and environmental hotspots, analyzes whether ARGs are preferentially core or accessory genes, and evaluates fitness trade-offs using the Fitness Browser transposon mutant data. It integrates annotations from eggNOG, CARD, and ResFinder with pangenome structure and ecological metadata.

Review

AI Review: BERIL Automated Review, 2026-02-12. Status: Needs Re-review.

Summary

This is a well-conceived research project with an important scientific question: identifying which microbial species and environments harbor the highest concentrations of antibiotic resistance genes. The project demonstrates thorough planning with clear documentation, a phased approach, and attention to data infrastructure challenges. However, the project is currently in the planning stage with no analysis implemented. All six notebooks contain only skeleton code with TODO placeholders, no data files have been generated, and no actual findings exist. The validation work reveals critical technical constraints that must be addressed before implementation can succeed. The project shows promise but requires substantial implementation work to achieve its stated goals.

Methodology

Research Question and Approach

Strengths:
- Research question is clearly stated, testable, and scientifically important
- Six-phase approach is logical: exploration → ARG identification → distribution analysis → pangenome analysis → fitness analysis → visualization
- Recognizes the need to integrate multiple data sources (pangenome, annotations, fitness, taxonomy)
- Acknowledges key challenges upfront (ARG detection methodology, data source mapping)

Weaknesses:
- No hypothesis stated: The research question asks "which species show highest ARG concentration?" but doesn't propose a testable hypothesis (e.g., "open pangenomes will have higher ARG diversity than closed pangenomes")
- ARG identification methodology is underspecified: The approach relies on "keyword matching" of gene descriptions, which will have high false positive/negative rates. No validation strategy is defined for confirming that matched genes are true ARGs
- Missing validation plan: No description of how to verify that identified ARGs are functionally validated resistance genes vs. homologs or annotations with resistance-related keywords
- Unclear fitness data mapping: The project assumes ARG genes from pangenome can be mapped to fitness browser data, but the two databases use different organisms and gene ID schemes. This mapping challenge is mentioned but not solved

Reproducibility

Strengths:
- Excellent documentation structure with README, QUICKSTART, IMPORTANT_NOTES, ANALYSIS_CHECKLIST
- Clear file organization with designated directories for data, figures, results
- requirements.txt provided with specific version constraints
- Detailed BERDL_VALIDATION_REAL.md documents actual API testing results
- Acknowledges that notebooks must run in BERDL JupyterHub (not locally)

Weaknesses:
- Cannot currently be reproduced: All notebooks are skeleton code with no executable analysis
- External data sources not specified: References CARD and ResFinder databases but provides no download links, version numbers, or parsing instructions
- No sample data: No small test datasets provided to demonstrate the approach
- Missing intermediate data: The pipeline assumes data flows between notebooks (e.g., notebook 02 creates arg_annotations.csv for notebook 03), but with no implemented code, the data schema and format are undefined

Data Sources

Clearly Identified:
- Primary: kbase_ke_pangenome collection (293,059 genomes, GTDB r214)
- Secondary: kescience_fitnessbrowser for fitness data
- External: CARD, ResFinder, PATRIC (mentioned but not integrated)

Issues:
- eggNOG annotations table inaccessible: BERDL_VALIDATION_REAL.md reveals this critical table times out via REST API, but notebooks don't account for this
- orthogroup table missing: Validation shows this table doesn't exist or is inaccessible, yet notebook 01 queries it
- Fitness data integration unclear: No evidence that pangenome gene IDs can be mapped to fitness browser locus IDs across different organisms

Code Quality

SQL Queries

Not Assessed - No Executable Queries:
All SQL queries in notebooks are either commented out, wrapped in TODO blocks, or are template strings that won't execute. Examples:

Notebook 01, cell 7:

print(f"\nRow count: {spark.sql('SELECT COUNT(*) as count FROM kbase_ke_pangenome.pangenome').collect()[0]['count']}")

This query will work for the pangenome table but will time out for large tables like gene or eggnog_mapper_annotations, per the validation findings.
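
Per those findings, a safer pattern is to sample with LIMIT and inspect the schema rather than count; a sketch:

# Hedged sketch: avoid COUNT(*) full scans on billion-row tables.
sample = spark.sql("SELECT * FROM kbase_ke_pangenome.gene LIMIT 100")
sample.printSchema()            # documents the columns without a full scan
sample.show(5, truncate=False)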

Notebook 02, cell 5:

genes_query = """
SELECT gene_id, orthogroup_id, genome_id, gene_description, gene_function
FROM kbase_ke_pangenome.gene
WHERE gene_description IS NOT NULL OR gene_function IS NOT NULL
LIMIT 1000
"""

This query string is defined but never executed. It also references columns (gene_description, gene_function) that may not exist in the actual schema.

Notebook 03, cell 3:
The prevalence query references a table arg_genes that doesn't exist and would need to be created in notebook 02:

LEFT JOIN arg_genes ON g.genome_id = arg_genes.genome_id

Critical Issues from Validation:
- ❌ SHOW TABLES IN kbase_ke_pangenome will time out (55 seconds)
- ❌ COUNT(*) queries on billion-row tables will time out
- ❌ orthogroup table queries will fail (table doesn't exist)
- ❌ eggnog_mapper_annotations inaccessible via REST API

Statistical Methods

Not Assessed - No Implementation:
Notebooks mention statistical tests (Pearson/Spearman correlations, significance tests) but contain no code implementing them. Notebook 03 imports scipy.stats but never uses it.
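
If implemented, the Phase 4 correlation could be as simple as this sketch, where per_species is a hypothetical DataFrame with openness and ARG-diversity columns:

from scipy import stats

# Hedged sketch: Spearman correlation between pangenome openness and ARG
# diversity across species; per_species and its column names are hypothetical.
rho, p_value = stats.spearmanr(per_species["openness"], per_species["n_distinct_args"])
print(f"Spearman rho = {rho:.3f}, p = {p_value:.2e}, n = {len(per_species)}")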

Notebook Organization

Good Structure:
- Logical flow: Setup → Query → Analysis → Visualization
- Markdown cells clearly describe each phase
- Imports at the top of each notebook
- Clear separation of concerns across 6 notebooks

Issues:
- No error handling: No try/except blocks for queries that might fail
- No checkpointing: Long pipelines have no intermediate saves to recover from failures
- Inconsistent .env handling: Notebook 01 loads .env but subsequent notebooks don't (unclear if spark session persists)
- No data validation: No checks that loaded dataframes have expected columns/shapes
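
A sketch of the missing defensive pattern; the run_query helper and the required-column set are illustrative, not project code:

REQUIRED_COLS = {"gene_id", "genome_id"}

def run_query(sql: str):
    # Hedged sketch: surface failures (e.g., timeouts) with context and
    # validate the result shape before any downstream use.
    try:
        df = spark.sql(sql).toPandas()
    except Exception as exc:
        raise RuntimeError(f"Query failed (possible timeout): {sql[:80]}...") from exc
    missing = REQUIRED_COLS - set(df.columns)
    if missing:
        raise ValueError(f"Result is missing expected columns: {missing}")
    return df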

Pitfall Awareness

Pitfalls Documented but Not Addressed:

The project is aware of docs/pitfalls.md but notebooks don't implement the recommended solutions:

  1. String-typed numeric columns (pitfalls.md line 40): Fitness browser columns are all strings requiring CAST, but notebook 05 has no casting code

  2. Annotation join keys (pitfalls.md line 148): eggNOG annotations join on gene_cluster_id not gene_id, but notebook 02 doesn't specify this

  3. Gene clusters are species-specific (pitfalls.md line 165): Cannot compare cluster IDs across species, but no code enforces species-level grouping

  4. API vs Spark SQL (pitfalls.md line 318): Large queries must use Spark SQL not REST API, but notebooks don't configure Spark session appropriately
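
Items 1 and 2 translate into query patterns like this sketch; the table and column names follow pitfalls.md but are unverified assumptions about the live schema:

# Hedged sketch: CAST string-typed fitness columns to numeric, and join
# annotations on gene_cluster_id rather than gene_id (per pitfalls.md).
fitness = spark.sql("""
    SELECT locusId,
           CAST(fit AS DOUBLE) AS fitness,
           CAST(t AS DOUBLE) AS t_score
    FROM kescience_fitnessbrowser.genefitness
    LIMIT 1000
""")

annotations = spark.sql("""
    SELECT gc.gene_cluster_id, ann.description
    FROM kbase_ke_pangenome.gene_cluster gc
    JOIN kbase_ke_pangenome.eggnog_mapper_annotations ann
      ON gc.gene_cluster_id = ann.gene_cluster_id
    LIMIT 1000
""")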

Findings Assessment

No Findings to Assess

The project has zero completed analysis, so there are no conclusions, visualizations, or results to evaluate. Specifically:

  • Timeline status (README lines 143-149): All phases marked incomplete (accurate)
  • data/ directory: Empty except .gitkeep
  • figures/ directory: Empty except .gitkeep
  • results/ directory: Empty
  • Expected outputs (README lines 120-140): None generated

Analysis Completeness

Phase 1 (Data Exploration): 0% complete
- No table schemas documented
- No row counts obtained
- No data quality assessment performed

Phase 2 (ARG Identification): 0% complete
- No CARD/ResFinder databases downloaded
- No keyword matching implemented
- No arg_annotations.csv created

Phase 3 (Distribution Analysis): 0% complete
- No prevalence calculations
- No hotspot species identified
- No hotspot_species.csv created

Phase 4 (Pangenome Analysis): 0% complete
- No core/accessory classification
- No openness metrics calculated
- No correlations tested

Phase 5 (Fitness Analysis): 0% complete
- No fitness browser exploration
- No gene ID mapping attempted
- No fitness_results.csv created

Phase 6 (Visualization): 0% complete
- No figures generated
- No dashboards built
- No summary statistics written

Validation Work

Positive Aspect:
The BERDL_VALIDATION_REAL.md file shows excellent due diligence by actually testing queries against the API. This validation reveals critical issues that would block implementation:

  • eggNOG table inaccessible (needed for Phase 2)
  • COUNT(*) queries timeout on large tables (affects Phase 1)
  • orthogroup table missing (breaks Notebook 01)

However, these findings have not been integrated back into the notebooks. The notebooks still contain queries that the validation proves will fail.

Suggestions

Critical Issues (Must Fix Before Implementation)

  1. Update Notebook 01 to remove failing queries
       • Remove the SHOW TABLES query (times out per validation line 55)
       • Remove orthogroup table queries (table doesn't exist per validation line 75)
       • Replace COUNT(*) on large tables with LIMIT-based sampling
       • Use only tables confirmed in BERDL_VALIDATION_REAL.md (genome, pangenome, gene_cluster)

  2. Fix the ARG identification strategy in Notebook 02
       • Current approach: keyword matching on gene descriptions (high false-positive rate)
       • Recommendation: download the CARD ARO ontology and run sequence similarity searches (BLAST/DIAMOND) against CARD protein sequences with an identity threshold (≥80% for close homologs); a sketch follows this list
       • Implement validation: manually inspect a sample of matches to estimate precision/recall
       • Document limitations: annotation-based detection will miss novel ARGs and may include non-functional homologs

  3. Address the eggNOG table inaccessibility
       • Validation shows eggnog_mapper_annotations times out via the REST API
       • Solution: add an explicit note in the notebook that it MUST run in BERDL JupyterHub with Spark SQL, not via the REST API
       • Alternative: use the gene_cluster table joined with CARD external references instead

  4. Resolve the fitness data mapping problem
       • The Fitness Browser covers different organisms (model organisms like E. coli K-12) than the pangenome (wild-type strains)
       • Gene ID formats don't match between the databases
       • Recommendation: either (a) limit the fitness analysis to organisms present in both databases, or (b) remove Phase 5 entirely and acknowledge it as future work
       • The current plan assumes the mapping works but provides no code or evidence that it is possible

  5. Define a testable hypothesis
       • The current research question is exploratory ("which species have ARGs?")
       • Add a hypothesis to test, e.g.: "Species with open pangenomes (high accessory:core ratio) will have higher ARG diversity than species with closed pangenomes"
       • This enables statistical testing in Phase 4
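
The similarity-based alternative in item 2 could start from a sketch like this, assuming DIAMOND is installed and CARD protein sequences have been downloaded; file paths are placeholders:

import subprocess

# Hedged sketch: build a DIAMOND database from CARD proteins, then search
# pangenome proteins at a stringent identity threshold (80%, per item 2).
subprocess.run(
    ["diamond", "makedb", "--in", "card_proteins.faa", "--db", "card"],
    check=True,
)
subprocess.run(
    ["diamond", "blastp",
     "--query", "pangenome_proteins.faa",
     "--db", "card",
     "--id", "80",            # minimum percent identity
     "--outfmt", "6",         # tabular output
     "--out", "card_hits.tsv"],
    check=True,
)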

High-Priority Improvements

  1. Implement Phase 1 with working queries
       • Use queries confirmed to work in the validation (e.g., SELECT * FROM pangenome LIMIT 10)
       • Document the actual schemas found (don't guess column names)
       • Save schema documentation to data/schemas.json for reference

  2. Create a sample ARG annotation dataset
       • Even 10-100 sample ARGs are enough to exercise the pipeline
       • Download the CARD database v3.2.9 (latest as of 2026) and parse ARO IDs and drug classes
       • Implement keyword search on one species to test feasibility before scaling

  3. Add data validation between phases
       • Notebook 03 expects arg_annotations.csv from Notebook 02
       • Add a cell at the start of Notebook 03 that checks the file exists and has the required columns
       • Fail fast with a clear error message if upstream data is missing

  4. Implement checkpointing for long queries (see the sketch after this list)
       • Validation shows large queries may time out
       • Save intermediate results, e.g. df.write.parquet("data/intermediate/step1.parquet")
       • Skip recomputation when a checkpoint already exists

  5. Use gene_cluster instead of orthogroup
       • Validation confirms the orthogroup table doesn't exist
       • Per pitfalls.md line 148, annotations join on gene_cluster_id
       • Update all notebooks to use the correct table name
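
The checkpointing pattern in item 4, as a sketch with illustrative paths:

from pathlib import Path

checkpoint = Path("data/intermediate/step1.parquet")

# Hedged sketch: recompute only when the checkpoint is absent.
if checkpoint.exists():
    df = spark.read.parquet(str(checkpoint))
else:
    df = spark.sql("SELECT * FROM kbase_ke_pangenome.pangenome LIMIT 10000")
    df.write.parquet(str(checkpoint))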

Nice-to-Have Enhancements

  1. Add example queries that work

    • Include 2-3 fully executable queries in Notebook 01 that successfully return data
    • Use validation test results (pangenome sample query worked)
    • Builds user confidence that infrastructure works
  2. Visualize intermediate results

    • Phase 3 should create at least 1 simple bar chart (top 10 species by genome count)
    • Even without ARG data, you can visualize pangenome structure
    • Shows progress and helps debugging
  3. Document data quality issues

    • AlphaEarth embeddings only cover 28.4% of genomes (pitfalls.md line 85)
    • NCBI environment metadata is sparse and EAV format (pitfalls.md line 103)
    • Acknowledge these will limit environmental analysis in Phase 3
  4. Add expected runtime estimates

    • Based on validation, note which queries are fast (<5s) vs slow (>30s)
    • Helps users know if their notebook is hung or just slow
  5. Create a "quick validation" notebook

    • Notebook 00 that runs 5-10 simple queries to verify BERDL access
    • Tests: list databases, sample pangenome, check genome count
    • Users can run this first to confirm environment before starting Phase 1
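
The suggested Notebook 00 could be as small as this sketch, restricted to query patterns the validation report confirmed:

# Hedged sketch: smoke tests for BERDL access before starting Phase 1.
checks = {
    "sample pangenome": "SELECT * FROM kbase_ke_pangenome.pangenome LIMIT 5",
    "sample genome": "SELECT * FROM kbase_ke_pangenome.genome LIMIT 5",
}

for name, sql in checks.items():
    try:
        rows = spark.sql(sql).collect()
        print(f"OK   {name}: {len(rows)} rows")
    except Exception as exc:
        print(f"FAIL {name}: {exc}")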

This review was generated by an AI system. It should be treated as advisory input, not a definitive assessment.
