03 Phylogenetic Distribution
Jupyter notebook from the Pan-Bacterial AMR Gene Landscape project.
NB03: Phylogenetic Distribution of AMR Genes¶
Goal: Test H2 — is AMR gene density phylogenetically structured? Identify AMR hotspot lineages. Analyze how AMR density correlates with pangenome openness and genome count across the tree.
Depends on: NB01 outputs (data/amr_census.csv, data/amr_species_summary.csv)
Outputs:
../data/amr_by_taxonomy.csv— AMR stats at each taxonomic rank../figures/amr_phylogenetic_*.png
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from pathlib import Path
DATA_DIR = Path('../data')
FIG_DIR = Path('../figures')
plt.rcParams.update({
'figure.figsize': (12, 6), 'figure.dpi': 150, 'font.size': 11,
'axes.titlesize': 13, 'savefig.bbox': 'tight', 'savefig.dpi': 150,
})
amr = pd.read_csv(DATA_DIR / 'amr_census.csv')
species = pd.read_csv(DATA_DIR / 'amr_species_summary.csv')
print(f"Loaded {len(amr):,} AMR clusters, {len(species):,} species with AMR")
Loaded 83,008 AMR clusters, 14,723 species with AMR
1. Taxonomic Hierarchy: AMR at Every Rank¶
Aggregate AMR counts at phylum, class, order, family, and genus levels.
In [2]:
# AMR stats at each taxonomic rank
ranks = ['phylum', 'class', 'order', 'family', 'genus']
rank_summaries = {}
for rank in ranks:
summary = species.groupby(rank).agg(
n_species=('gtdb_species_clade_id', 'count'),
total_amr=('n_amr', 'sum'),
mean_amr=('n_amr', 'mean'),
median_amr=('n_amr', 'median'),
max_amr=('n_amr', 'max'),
mean_openness=('openness', 'mean'),
mean_genomes=('no_genomes', 'mean')
).round(2)
summary = summary.sort_values('total_amr', ascending=False)
rank_summaries[rank] = summary
print(f"\n=== {rank.upper()} level: Top 15 by total AMR clusters ===")
print(summary.head(15).to_string())
# Save genus-level (most granular useful rank)
rank_summaries['genus'].to_csv(DATA_DIR / 'amr_by_genus.csv')
rank_summaries['family'].to_csv(DATA_DIR / 'amr_by_family.csv')
print(f"\nSaved genus and family summaries to data/")
=== PHYLUM level: Top 15 by total AMR clusters ===
n_species total_amr mean_amr median_amr max_amr mean_openness mean_genomes
phylum
p__Pseudomonadota 4992 42904 8.59 3.0 1120 0.30 21.36
p__Bacillota 1401 12614 9.00 4.0 645 0.31 45.35
p__Actinomycetota 2036 9027 4.43 3.0 61 0.28 10.95
p__Bacillota_A 2254 8776 3.89 3.0 165 0.41 8.38
p__Bacteroidota 1835 5458 2.97 2.0 75 0.37 7.60
p__Bacillota_C 118 396 3.36 2.0 14 0.39 9.30
p__Cyanobacteriota 176 386 2.19 2.0 11 0.32 4.70
p__Acidobacteriota 175 373 2.13 1.0 16 0.39 3.09
p__Campylobacterota 96 294 3.06 1.0 61 0.29 68.94
p__Chloroflexota 176 291 1.65 1.0 7 0.43 3.73
p__Verrucomicrobiota 140 211 1.51 1.0 18 0.36 6.24
p__Myxococcota 86 189 2.20 2.0 6 0.30 3.55
p__Spirochaetota 117 171 1.46 1.0 11 0.36 11.44
p__Gemmatimonadota 74 164 2.22 1.0 13 0.42 3.65
p__Bacillota_B 40 138 3.45 2.0 23 0.39 3.85
=== CLASS level: Top 15 by total AMR clusters ===
n_species total_amr mean_amr median_amr max_amr mean_openness mean_genomes
class
c__Gammaproteobacteria 3394 37752 11.12 4.0 1120 0.30 28.35
c__Bacilli 1401 12614 9.00 4.0 645 0.31 45.35
c__Clostridia 2245 8765 3.90 3.0 165 0.41 8.39
c__Actinomycetia 1824 8600 4.71 4.0 61 0.26 11.53
c__Bacteroidia 1662 5157 3.10 2.0 75 0.37 7.77
c__Alphaproteobacteria 1587 5140 3.24 2.0 61 0.29 6.54
c__Negativicutes 118 396 3.36 2.0 14 0.39 9.30
c__Cyanobacteriia 145 326 2.25 2.0 11 0.31 4.70
c__Campylobacteria 96 294 3.06 1.0 61 0.29 68.94
c__Coriobacteriia 115 246 2.14 1.0 12 0.39 8.16
c__Terriglobia 89 209 2.35 1.0 16 0.42 3.02
c__Verrucomicrobiae 122 189 1.55 1.0 18 0.36 6.56
c__Gemmatimonadetes 74 164 2.22 1.0 13 0.42 3.65
c__Anaerolineae 92 151 1.64 1.0 6 0.40 3.04
c__Acidimicrobiia 72 134 1.86 1.0 6 0.51 3.56
=== ORDER level: Top 15 by total AMR clusters ===
n_species total_amr mean_amr median_amr max_amr mean_openness mean_genomes
order
o__Enterobacterales 426 15974 37.50 9.0 1120 0.30 97.62
o__Pseudomonadales 958 9838 10.27 5.0 538 0.29 24.88
o__Burkholderiales 912 5844 6.41 3.0 91 0.32 14.51
o__Lachnospirales 839 4173 4.97 4.0 53 0.40 7.38
o__Lactobacillales 474 3963 8.36 4.0 261 0.31 71.50
o__Enterobacterales_A 456 3426 7.51 3.0 184 0.28 23.16
o__Bacillales 313 3317 10.60 6.0 132 0.27 21.66
o__Mycobacteriales 623 3077 4.94 4.0 61 0.23 20.87
o__Bacteroidales 905 3049 3.37 1.0 75 0.43 10.63
o__Staphylococcales 97 2918 30.08 12.0 645 0.31 201.57
o__Streptomycetales 384 2895 7.54 7.0 33 0.29 5.04
o__Rhizobiales 719 2748 3.82 3.0 61 0.27 8.89
o__Oscillospirales 737 2265 3.07 2.0 70 0.42 7.16
o__Actinomycetales 628 1811 2.88 2.0 28 0.27 8.07
o__Xanthomonadales 265 1713 6.46 4.0 109 0.28 12.52
=== FAMILY level: Top 15 by total AMR clusters ===
n_species total_amr mean_amr median_amr max_amr mean_openness mean_genomes
family
f__Enterobacteriaceae 426 15974 37.50 9.0 1120 0.30 97.62
f__Pseudomonadaceae 488 6372 13.06 7.0 538 0.26 27.21
f__Lachnospiraceae 798 4061 5.09 4.0 53 0.40 7.52
f__Streptomycetaceae 381 2881 7.56 7.0 33 0.29 5.06
f__Staphylococcaceae 76 2832 37.26 18.0 645 0.32 256.32
f__Burkholderiaceae 280 2588 9.24 5.0 91 0.27 20.54
f__Moraxellaceae 155 2443 15.76 5.0 382 0.35 59.70
f__Mycobacteriaceae 423 2191 5.18 4.0 61 0.24 29.31
f__Bacteroidaceae 412 1923 4.67 2.0 75 0.46 14.28
f__Rhizobiaceae 377 1798 4.77 3.0 61 0.27 13.04
f__Bacillaceae_G 52 1615 31.06 23.0 132 0.30 59.62
f__Streptococcaceae 200 1551 7.76 4.0 100 0.32 88.24
f__Xanthomonadaceae 199 1457 7.32 4.0 109 0.29 15.73
f__Burkholderiaceae_B 242 1337 5.52 3.0 57 0.36 4.63
f__Enterococcaceae 64 1279 19.98 5.0 261 0.26 96.53
=== GENUS level: Top 15 by total AMR clusters ===
n_species total_amr mean_amr median_amr max_amr mean_openness mean_genomes
genus
g__Pseudomonas_E 398 4789 12.03 7.0 149 0.26 14.29
g__Klebsiella 16 3296 206.00 148.0 1120 0.42 1076.56
g__Enterobacter 32 2963 92.59 48.0 540 0.35 127.72
g__Streptomyces 368 2820 7.66 7.0 33 0.29 5.12
g__Staphylococcus 61 2500 40.98 22.0 645 0.32 310.93
g__Acinetobacter 112 2227 19.88 7.0 382 0.35 78.37
g__Bacillus_A 46 1592 34.61 23.5 132 0.32 67.04
g__Streptococcus 181 1432 7.91 5.0 100 0.32 94.62
g__Citrobacter 10 1343 134.30 111.5 379 0.41 108.30
g__Burkholderia 50 1060 21.20 12.0 91 0.34 87.68
g__Salmonella 5 989 197.80 28.0 840 0.33 2310.40
g__Stenotrophomonas 63 960 15.24 9.0 109 0.34 13.03
g__Vibrio 115 956 8.31 2.0 184 0.28 49.30
g__Aeromonas 26 878 33.77 16.0 157 0.35 34.12
g__Mycobacterium 170 805 4.74 3.0 44 0.22 58.82
Saved genus and family summaries to data/
2. AMR Hotspot Families¶
Which families have disproportionately high AMR density (AMR clusters per species)?
In [3]:
# Family-level hotspot analysis
fam = rank_summaries['family'].copy()
fam_enough = fam[fam['n_species'] >= 5].copy() # at least 5 species
overall_mean = species['n_amr'].mean()
print(f"Overall mean AMR clusters per species: {overall_mean:.1f}")
print(f"Families with >= 5 species: {len(fam_enough):,}")
# Top families by mean AMR density
print(f"\n=== Top 20 Families by Mean AMR Clusters/Species (min 5 species) ===")
top_fam = fam_enough.nlargest(20, 'mean_amr')
print(top_fam[['n_species', 'total_amr', 'mean_amr', 'median_amr', 'mean_openness']].to_string())
# Visualization: Top 20 families
fig, ax = plt.subplots(figsize=(14, 8))
top_fam_sorted = top_fam.sort_values('mean_amr')
colors = plt.cm.Reds(top_fam_sorted['mean_amr'] / top_fam_sorted['mean_amr'].max())
ax.barh(range(len(top_fam_sorted)), top_fam_sorted['mean_amr'], color=colors)
ax.set_yticks(range(len(top_fam_sorted)))
ax.set_yticklabels([idx.replace('f__', '') for idx in top_fam_sorted.index], fontsize=9)
ax.axvline(overall_mean, color='gray', linestyle='--', alpha=0.7, label=f'Overall mean ({overall_mean:.1f})')
ax.set_xlabel('Mean AMR Clusters per Species')
ax.set_title('Top 20 AMR Hotspot Families (min 5 species)')
ax.legend()
for i, (_, row) in enumerate(top_fam_sorted.iterrows()):
ax.text(row['mean_amr'] + 0.2, i, f'n={int(row["n_species"])}', va='center', fontsize=8)
plt.tight_layout()
plt.savefig(FIG_DIR / 'amr_hotspot_families.png')
plt.show()
Overall mean AMR clusters per species: 5.6
Families with >= 5 species: 335
=== Top 20 Families by Mean AMR Clusters/Species (min 5 species) ===
n_species total_amr mean_amr median_amr mean_openness
family
f__Enterobacteriaceae 426 15974 37.50 9.0 0.30
f__Staphylococcaceae 76 2832 37.26 18.0 0.32
f__Bacillaceae_G 52 1615 31.06 23.0 0.30
f__Aeromonadaceae 29 885 30.52 14.0 0.33
f__Enterococcaceae 64 1279 19.98 5.0 0.26
f__Moraxellaceae 155 2443 15.76 5.0 0.35
f__Listeriaceae 25 371 14.84 4.0 0.32
f__Pseudomonadaceae 488 6372 13.06 7.0 0.26
f__Bacillaceae_H 8 98 12.25 11.5 0.34
f__Brevibacillaceae 19 226 11.89 13.0 0.26
f__Burkholderiaceae_C 79 772 9.77 5.0 0.27
f__Bacillaceae 42 401 9.55 7.5 0.30
f__DSM-1321 12 113 9.42 7.5 0.32
f__Burkholderiaceae 280 2588 9.24 5.0 0.27
f__Peptostreptococcaceae 36 331 9.19 5.0 0.37
f__Vagococcaceae 8 71 8.88 7.0 0.19
f__Shewanellaceae 62 520 8.39 2.0 0.22
f__Bacillaceae_C 16 125 7.81 7.0 0.27
f__Streptococcaceae 200 1551 7.76 4.0 0.32
f__DSM-18226 52 402 7.73 5.5 0.22
3. AMR Mechanism Distribution Across Phyla¶
Do different phyla rely on different resistance mechanisms?
In [4]:
# Mechanism classification (re-derive if needed)
if 'mechanism' not in amr.columns:
def classify_mechanism(product):
if pd.isna(product):
return 'Unknown'
p = product.lower()
if any(k in p for k in ['efflux', 'pump', 'transporter', 'exporter', 'permease']):
return 'Efflux'
if any(k in p for k in ['beta-lactamase', 'lactamase', 'carbapenemase']):
return 'Beta-lactamase'
if any(k in p for k in ['acetyltransferase', 'phosphotransferase', 'nucleotidyltransferase',
'aminoglycoside-modifying', 'inactivat', 'hydrolase']):
return 'Enzymatic inactivation'
if any(k in p for k in ['methyltransferase', 'ribosomal protection', 'target protect']):
return 'Target modification'
if any(k in p for k in ['penicillin-binding', 'pbp', 'vancomycin', 'lipid a', 'mcr-']):
return 'Cell wall modification'
return 'Other'
amr['mechanism'] = amr['amr_product'].apply(classify_mechanism)
# Cross-tabulate: phylum x mechanism
top_phyla = amr['phylum'].value_counts().head(8).index
top_mechs = amr['mechanism'].value_counts().head(6).index
ct = pd.crosstab(amr[amr['phylum'].isin(top_phyla)]['phylum'],
amr[amr['mechanism'].isin(top_mechs)]['mechanism'],
normalize='index') * 100
fig, ax = plt.subplots(figsize=(12, 7))
ct.plot(kind='barh', stacked=True, ax=ax, colormap='Set2')
ax.set_xlabel('% of AMR Genes')
ax.set_title('AMR Mechanism Composition by Phylum')
ax.legend(bbox_to_anchor=(1.02, 1), loc='upper left', fontsize=9)
ax.set_yticklabels([t.replace('p__', '') for t in ct.index], fontsize=10)
plt.tight_layout()
plt.savefig(FIG_DIR / 'amr_mechanism_by_phylum.png')
plt.show()
# Chi-squared test for mechanism independence from phylum
ct_counts = pd.crosstab(amr[amr['phylum'].isin(top_phyla)]['phylum'],
amr[amr['mechanism'].isin(top_mechs)]['mechanism'])
chi2, p, dof, expected = stats.chi2_contingency(ct_counts)
print(f"\nChi-squared test (mechanism x phylum independence):")
print(f" chi2={chi2:.1f}, dof={dof}, p={p:.2e}")
print(f" Interpretation: {'Mechanisms differ by phylum' if p < 0.05 else 'No significant difference'}")
Chi-squared test (mechanism x phylum independence): chi2=14494.5, dof=35, p=0.00e+00 Interpretation: Mechanisms differ by phylum
4. AMR Density vs Pangenome Openness — Controlling for Phylogeny¶
The NB01 correlation between AMR count and openness could be driven by phylogeny. Test within-phylum correlations.
In [5]:
# Within-phylum correlations: AMR count vs openness
from scipy.stats import spearmanr
top_phyla_list = species['phylum'].value_counts().head(10).index.tolist()
print("=== Within-Phylum: AMR Count vs Pangenome Openness ===\n")
print(f"{'Phylum':<30} {'N spp':>6} {'rho':>8} {'p-value':>12} {'Sig':>5}")
print("-" * 65)
results = []
for phy in top_phyla_list:
sub = species[species['phylum'] == phy].dropna(subset=['openness', 'n_amr'])
if len(sub) >= 10:
rho, p = spearmanr(sub['openness'], sub['n_amr'])
sig = "***" if p < 0.001 else "**" if p < 0.01 else "*" if p < 0.05 else "ns"
print(f"{phy.replace('p__',''):<30} {len(sub):>6} {rho:>8.3f} {p:>12.2e} {sig:>5}")
results.append({'phylum': phy, 'n': len(sub), 'rho': rho, 'p': p})
# Overall
rho_all, p_all = spearmanr(species['openness'].dropna(), species['n_amr'].dropna())
print(f"\n{'OVERALL':<30} {len(species):>6} {rho_all:>8.3f} {p_all:>12.2e}")
# How many phyla show positive vs negative correlation?
pos = sum(1 for r in results if r['rho'] > 0)
neg = sum(1 for r in results if r['rho'] < 0)
print(f"\nPhyla with positive correlation: {pos}/{len(results)}")
print(f"Phyla with negative correlation: {neg}/{len(results)}")
=== Within-Phylum: AMR Count vs Pangenome Openness === Phylum N spp rho p-value Sig ----------------------------------------------------------------- Pseudomonadota 4992 0.119 2.99e-17 *** Bacillota_A 2254 0.042 4.37e-02 * Actinomycetota 2036 0.073 1.06e-03 ** Bacteroidota 1835 0.087 2.01e-04 *** Bacillota 1401 0.219 1.01e-16 *** Chloroflexota 176 -0.029 7.02e-01 ns Cyanobacteriota 176 0.107 1.58e-01 ns Acidobacteriota 175 -0.025 7.42e-01 ns Verrucomicrobiota 140 0.240 4.33e-03 ** Bacillota_C 118 0.374 3.07e-05 *** OVERALL 14723 0.006 4.73e-01 Phyla with positive correlation: 8/10 Phyla with negative correlation: 2/10
In [6]:
# Faceted scatter: AMR vs openness by phylum (top 6)
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes_flat = axes.flatten()
for i, phy in enumerate(top_phyla_list[:6]):
ax = axes_flat[i]
sub = species[species['phylum'] == phy].dropna(subset=['openness', 'n_amr'])
ax.scatter(sub['openness'], sub['n_amr'], alpha=0.4, s=12, color='#e74c3c')
rho, p = spearmanr(sub['openness'], sub['n_amr'])
ax.set_title(f"{phy.replace('p__', '')} (n={len(sub)}, rho={rho:.2f})", fontsize=10)
ax.set_xlabel('Openness')
ax.set_ylabel('AMR Clusters')
# Trend line
if len(sub) > 10:
z = np.polyfit(sub['openness'], sub['n_amr'], 1)
x_line = np.linspace(sub['openness'].min(), sub['openness'].max(), 100)
ax.plot(x_line, np.polyval(z, x_line), 'k--', alpha=0.5)
plt.suptitle('AMR Count vs Pangenome Openness by Phylum', fontsize=14)
plt.tight_layout()
plt.savefig(FIG_DIR / 'amr_openness_by_phylum.png')
plt.show()