03 Distribution Analysis
Jupyter notebook from the Antibiotic Resistance Hotspots in Microbial Pangenomes project.
Phase 3: ARG Distribution Analysis
This notebook analyzes the distribution of antibiotic resistance genes (ARGs) across:
- Individual species and genera
- Phylogenetic clades
- Estimated ecological environments
Metrics Calculated
- ARG prevalence (% genomes with specific ARGs)
- ARG diversity (unique ARGs per species)
- Hotspot scores (combined metric for ranking)
- Statistical tests for significance
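As an illustration of the last metric, a significance test can compare ARG presence between two species. The sketch below applies Fisher's exact test to a 2x2 table of genomes with and without any ARG. The counts are invented for the example, and the choice of Fisher's exact test is an assumption; the notebook does not specify which tests it will use.

```python
from scipy import stats

# Hypothetical counts (not project data): genomes with / without any ARG.
species_a = {"with_arg": 18, "without_arg": 2}
species_b = {"with_arg": 5, "without_arg": 15}

# 2x2 contingency table: rows are species, columns are ARG presence/absence.
table = [
    [species_a["with_arg"], species_a["without_arg"]],
    [species_b["with_arg"], species_b["without_arg"]],
]

# Two-sided test of whether ARG presence is independent of species.
odds_ratio, p_value = stats.fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.3g}")
```

When many species pairs are compared this way, the resulting p-values would need a multiple-testing correction before any of them is interpreted.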
In [ ]:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
# Load ARG annotations from Phase 2
# arg_df = pd.read_csv('../data/arg_annotations.csv')
print("Phase 3: ARG Distribution Analysis")
Compute ARG Prevalence by Species
In [ ]:
# TODO: Query BERDL to calculate:
# - Total genomes per species
# - Genomes with at least one ARG per species
# - Prevalence = (genomes_with_ARG / total_genomes) * 100
# Note: This assumes ARGs have been identified and saved to a temporary table/dataframe
prevalence_query = """
SELECT
    s.GTDB_species,
    COUNT(DISTINCT g.genome_id) as total_genomes,
    COUNT(DISTINCT CASE WHEN arg_genes.gene_id IS NOT NULL THEN g.genome_id END) as genomes_with_arg,
    COUNT(DISTINCT arg_genes.gene_id) as unique_args,
    ROUND(100.0 * COUNT(DISTINCT CASE WHEN arg_genes.gene_id IS NOT NULL THEN g.genome_id END) /
          COUNT(DISTINCT g.genome_id), 2) as prevalence_percent
FROM kbase_ke_pangenome.genome g
JOIN kbase_ke_pangenome.gtdb_species_clade s
    ON g.gtdb_species_clade_id = s.gtdb_species_clade_id
LEFT JOIN arg_genes ON g.genome_id = arg_genes.genome_id
GROUP BY s.GTDB_species
ORDER BY prevalence_percent DESC
"""
print("Prevalence query prepared")
print("Note: This query assumes an arg_genes table/view exists with columns: gene_id, genome_id")
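The same aggregation can be checked locally before running it against BERDL. The following is a minimal pandas sketch that mirrors the query's logic on toy data. The two DataFrames are hypothetical stand-ins for the genome table (already joined to its species name) and the assumed `arg_genes` table; the identifiers in them are invented.

```python
import pandas as pd

# Toy stand-ins (not project data) for the genome and arg_genes tables.
genomes = pd.DataFrame({
    "genome_id": ["g1", "g2", "g3", "g4", "g5"],
    "GTDB_species": ["s__A", "s__A", "s__A", "s__B", "s__B"],
})
arg_genes = pd.DataFrame({
    "gene_id": ["arg1", "arg2", "arg1", "arg3"],
    "genome_id": ["g1", "g1", "g2", "g4"],
})

# LEFT JOIN keeps genomes without any ARG; their gene_id is NaN.
merged = genomes.merge(arg_genes, on="genome_id", how="left")

# COUNT(DISTINCT genome_id) per species.
total = genomes.groupby("GTDB_species")["genome_id"].nunique().rename("total_genomes")

# COUNT(DISTINCT genome_id) restricted to rows that matched an ARG.
with_arg = (
    merged.dropna(subset=["gene_id"])
    .groupby("GTDB_species")["genome_id"]
    .nunique()
    .rename("genomes_with_arg")
)

# COUNT(DISTINCT gene_id); nunique ignores NaN by default.
unique_args = merged.groupby("GTDB_species")["gene_id"].nunique().rename("unique_args")

# Species with no ARG at all would be missing from with_arg, hence fillna(0).
prevalence = pd.concat([total, with_arg, unique_args], axis=1).fillna(0)
prevalence["genomes_with_arg"] = prevalence["genomes_with_arg"].astype(int)
prevalence["prevalence_percent"] = (
    100.0 * prevalence["genomes_with_arg"] / prevalence["total_genomes"]
).round(2)
prevalence = prevalence.sort_values("prevalence_percent", ascending=False)
print(prevalence)
```

Note that `unique_args` counts distinct `gene_id` values, so it measures ARG diversity only if the same resistance gene carries the same identifier across genomes.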
Identify Hotspot Species
In [ ]:
# TODO: Calculate hotspot scores and identify statistical outliers
# Hotspot score = function of (prevalence, diversity, mean ARG genes per genome)
print("Hotspot identification in progress...")
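The scoring function is still open. One possible form, sketched below, standardizes each of the three inputs to a z-score and averages them with equal weights, then flags species above a cutoff. The equal weighting, the cutoff of 1.5, the column name `mean_args_per_genome`, and all values are assumptions made for illustration.

```python
import pandas as pd

# Toy per-species metrics (not project data).
metrics = pd.DataFrame(
    {
        "prevalence_percent": [95.0, 40.0, 35.0, 30.0, 25.0, 20.0],
        "unique_args": [30, 8, 7, 6, 5, 4],
        "mean_args_per_genome": [6.0, 1.5, 1.2, 1.0, 0.8, 0.5],
    },
    index=["s__A", "s__B", "s__C", "s__D", "s__E", "s__F"],
)

# Standardize each metric so that differing units do not dominate the score.
z = (metrics - metrics.mean()) / metrics.std(ddof=0)

# Assumed scoring rule: unweighted mean of the three z-scores.
metrics["hotspot_score"] = z.mean(axis=1)

# Assumed outlier cutoff; it would need tuning on the real distribution.
threshold = 1.5
metrics["is_hotspot"] = metrics["hotspot_score"] > threshold

ranked = metrics.sort_values("hotspot_score", ascending=False)
print(ranked[["hotspot_score", "is_hotspot"]])
```

Z-scores are sensitive to skewed distributions, so a rank-based or median/MAD variant may be a better fit if prevalence turns out to be heavily skewed across species.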
Phylogenetic and Environmental Patterns
In [ ]:
# TODO: Analyze ARG patterns at higher taxonomic levels
# - Distribution across phyla
# - Distribution across orders
# - Correlation with estimated environment (if available)
print("Phylogenetic analysis in progress...")
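A sketch of the higher-rank comparison, assuming a per-species table that already carries a `phylum` column (a hypothetical name, as are all values below). It summarizes prevalence per phylum and uses a Kruskal-Wallis test to ask whether prevalence differs between phyla; the choice of test is an assumption, not something the notebook prescribes.

```python
import pandas as pd
from scipy import stats

# Toy per-species prevalence with an assumed phylum label (not project data).
species = pd.DataFrame({
    "GTDB_species": [f"s__{i}" for i in range(9)],
    "phylum": ["p__X"] * 3 + ["p__Y"] * 3 + ["p__Z"] * 3,
    "prevalence_percent": [80.0, 75.0, 90.0, 20.0, 30.0, 25.0, 50.0, 55.0, 45.0],
})

# Summary of species-level prevalence within each phylum.
by_phylum = species.groupby("phylum")["prevalence_percent"].agg(["count", "median", "mean"])
print(by_phylum)

# One array of prevalence values per phylum for the rank-based test.
groups = [values.to_numpy() for _, values in species.groupby("phylum")["prevalence_percent"]]

# Kruskal-Wallis H-test: do the phyla share the same prevalence distribution?
h_stat, p_value = stats.kruskal(*groups)
print(f"H = {h_stat:.2f}, p = {p_value:.3g}")
```

The same pattern applies to orders or to an environment label by changing the grouping column.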
Visualization
In [ ]:
# TODO: Create visualizations:
# - Top 20 hotspot species
# - ARG prevalence distribution
# - Phylogenetic tree with ARG overlay
print("Visualizations to be generated...")
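For the first planned figure, a horizontal bar chart is one straightforward option. The sketch below plots the 20 highest-scoring species from a Series of hotspot scores; the species names and scores are placeholders generated for the example.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Placeholder hotspot scores for 25 hypothetical species (not project data).
scores = pd.Series(
    np.linspace(3.0, 0.1, 25),
    index=[f"species_{i:02d}" for i in range(25)],
    name="hotspot_score",
)

# Keep the 20 highest scores, then reverse so the top species is drawn at the top.
top20 = scores.nlargest(20).iloc[::-1]

fig, ax = plt.subplots(figsize=(8, 6))
ax.barh(top20.index, top20.to_numpy())
ax.set_xlabel("Hotspot score")
ax.set_ylabel("Species")
ax.set_title("Top 20 hotspot species")
fig.tight_layout()
```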