03 Metal Conservation Analysis
Jupyter notebook from the Pan-Bacterial Metal Fitness Atlas project.
NB 03: Metal Fitness × Pangenome Conservation¶
Test the two-tier hypothesis: are metal-important genes enriched in the accessory genome (for toxic metals) or the core genome (for essential metals)?
Runs locally — joins NB02 output with existing FB-pangenome link data.
Inputs:
data/metal_fitness_scores.csv(from NB02)conservation_vs_fitness/data/fb_pangenome_link.tsv
Outputs:
data/metal_conservation_stats.csvfigures/core_fraction_by_metal.pngfigures/metal_conservation_by_organism.png
In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
from scipy import stats
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import seaborn as sns
PROJECT_DIR = Path('..').resolve()
DATA_DIR = PROJECT_DIR / 'data'
FIGURES_DIR = PROJECT_DIR / 'figures'
# Load metal fitness scores from NB02
metal_fitness = pd.read_csv(DATA_DIR / 'metal_fitness_scores.csv')
print(f'Metal fitness records: {len(metal_fitness):,}')
# Load FB-pangenome link
fb_link = pd.read_csv(
PROJECT_DIR.parent / 'conservation_vs_fitness' / 'data' / 'fb_pangenome_link.tsv',
sep='\t'
)
# Ensure locusId types match
fb_link['locusId'] = fb_link['locusId'].astype(str)
metal_fitness['locusId'] = metal_fitness['locusId'].astype(str)
print(f'FB-pangenome links: {len(fb_link):,}')
print(f'Organisms in link: {fb_link["orgId"].nunique()}')
Metal fitness records: 383,349
FB-pangenome links: 177,863 Organisms in link: 44
1. Join Metal Fitness with Conservation Status¶
In [2]:
# Join metal fitness with conservation
# Keep only the conservation columns we need
link_cols = ['orgId', 'locusId', 'is_core', 'is_auxiliary', 'is_singleton']
# Some genes map to multiple clades; take the first (most are consistent)
fb_link_dedup = fb_link[link_cols].drop_duplicates(subset=['orgId', 'locusId'])
metal_cons = metal_fitness.merge(
fb_link_dedup,
on=['orgId', 'locusId'],
how='left'
)
n_mapped = metal_cons['is_core'].notna().sum()
print(f'Metal fitness records with conservation data: {n_mapped:,} / {len(metal_cons):,} '
f'({100*n_mapped/len(metal_cons):.1f}%)')
# Filter to mapped genes only
metal_cons_mapped = metal_cons[metal_cons['is_core'].notna()].copy()
metal_cons_mapped['is_core'] = metal_cons_mapped['is_core'].astype(int)
metal_cons_mapped['is_auxiliary'] = metal_cons_mapped['is_auxiliary'].astype(int)
metal_cons_mapped['is_singleton'] = metal_cons_mapped['is_singleton'].astype(int)
print(f'\nMapped records: {len(metal_cons_mapped):,}')
print(f'Organisms: {metal_cons_mapped["orgId"].nunique()}')
print(f'Metals: {metal_cons_mapped["metal_element"].nunique()}')
Metal fitness records with conservation data: 318,751 / 383,349 (83.1%) Mapped records: 318,751
Organisms: 22 Metals: 14
2. Core Fraction: Metal-Important vs All Genes¶
In [3]:
# Overall core fraction
baseline_core = metal_cons_mapped['is_core'].mean()
important_core = metal_cons_mapped.loc[
metal_cons_mapped['is_metal_important'], 'is_core'
].mean()
not_important_core = metal_cons_mapped.loc[
~metal_cons_mapped['is_metal_important'], 'is_core'
].mean()
print(f'Core fraction (all mapped genes): {baseline_core:.3f}')
print(f'Core fraction (metal-important): {important_core:.3f}')
print(f'Core fraction (not metal-important): {not_important_core:.3f}')
print(f'\nDifference: {important_core - not_important_core:+.3f}')
# Fisher exact test: metal-important vs core
imp = metal_cons_mapped[metal_cons_mapped['is_metal_important']]
not_imp = metal_cons_mapped[~metal_cons_mapped['is_metal_important']]
table = [
[imp['is_core'].sum(), len(imp) - imp['is_core'].sum()],
[not_imp['is_core'].sum(), len(not_imp) - not_imp['is_core'].sum()]
]
odds_ratio, p_value = stats.fisher_exact(table)
print(f'\nFisher exact test (metal-important vs core):')
print(f' Odds ratio: {odds_ratio:.3f}')
print(f' P-value: {p_value:.2e}')
if odds_ratio < 1:
print(f' => Metal-important genes are LESS likely to be core (accessory-enriched)')
else:
print(f' => Metal-important genes are MORE likely to be core')
Core fraction (all mapped genes): 0.773 Core fraction (metal-important): 0.874 Core fraction (not metal-important): 0.769 Difference: +0.105 Fisher exact test (metal-important vs core): Odds ratio: 2.083 P-value: 4.32e-162 => Metal-important genes are MORE likely to be core
3. Core Fraction by Metal Type (Toxic vs Essential)¶
In [4]:
# Add metal category
metal_categories = {
'Cobalt': 'toxic', 'Nickel': 'toxic', 'Copper': 'toxic',
'Zinc': 'toxic', 'Aluminum': 'toxic', 'Chromium': 'toxic',
'Mercury': 'toxic', 'Cadmium': 'toxic', 'Uranium': 'toxic',
'Iron': 'essential', 'Molybdenum': 'essential', 'Tungsten': 'essential',
'Selenium': 'essential', 'Manganese': 'essential',
}
metal_cons_mapped['metal_category'] = metal_cons_mapped['metal_element'].map(metal_categories)
# Core fraction per metal, for important genes only
results = []
print('Core fraction of metal-important genes by metal:')
print('=' * 80)
for metal in sorted(metal_cons_mapped['metal_element'].unique()):
subset = metal_cons_mapped[metal_cons_mapped['metal_element'] == metal]
imp_subset = subset[subset['is_metal_important']]
not_imp_subset = subset[~subset['is_metal_important']]
if len(imp_subset) == 0:
continue
core_imp = imp_subset['is_core'].mean()
core_baseline = subset['is_core'].mean()
n_orgs = imp_subset['orgId'].nunique()
cat = metal_categories.get(metal, '?')
# Fisher test per metal
if len(not_imp_subset) > 0:
table = [
[imp_subset['is_core'].sum(), len(imp_subset) - imp_subset['is_core'].sum()],
[not_imp_subset['is_core'].sum(), len(not_imp_subset) - not_imp_subset['is_core'].sum()]
]
or_val, p_val = stats.fisher_exact(table)
else:
or_val, p_val = np.nan, np.nan
results.append({
'metal': metal,
'category': cat,
'n_important': len(imp_subset),
'n_organisms': n_orgs,
'core_frac_important': core_imp,
'core_frac_baseline': core_baseline,
'delta': core_imp - core_baseline,
'odds_ratio': or_val,
'p_value': p_val,
})
sig = '*' if p_val < 0.05 else ''
print(f' {metal:12s} ({cat:9s}) core={core_imp:.3f} '
f'baseline={core_baseline:.3f} delta={core_imp-core_baseline:+.3f} '
f'OR={or_val:.2f} p={p_val:.2e} {sig} '
f'[{len(imp_subset)} genes, {n_orgs} orgs]')
results_df = pd.DataFrame(results)
# Summary by category
print(f'\nSummary by metal category:')
for cat in ['toxic', 'essential']:
cat_df = results_df[results_df['category'] == cat]
mean_delta = cat_df['delta'].mean()
print(f' {cat:10s}: mean delta = {mean_delta:+.3f} '
f'({len(cat_df)} metals)')
Core fraction of metal-important genes by metal: ================================================================================ Aluminum (toxic ) core=0.898 baseline=0.791 delta=+0.107 OR=2.37 p=1.19e-26 * [1381 genes, 14 orgs] Cadmium (toxic ) core=0.522 baseline=0.531 delta=-0.010 OR=0.96 p=9.16e-01 [92 genes, 1 orgs] Chromium (toxic ) core=0.710 baseline=0.650 delta=+0.060 OR=1.33 p=3.99e-02 * [262 genes, 2 orgs] Cobalt (toxic ) core=0.859 baseline=0.786 delta=+0.072 OR=1.67 p=8.07e-16 * [1859 genes, 19 orgs]
Copper (toxic ) core=0.867 baseline=0.776 delta=+0.090 OR=1.91 p=3.80e-27 * [2139 genes, 16 orgs] Iron (essential) core=0.919 baseline=0.803 delta=+0.116 OR=3.07 p=8.55e-18 * [651 genes, 3 orgs] Manganese (essential) core=1.000 baseline=0.802 delta=+0.198 OR=inf p=2.03e-03 * [30 genes, 1 orgs] Mercury (toxic ) core=0.934 baseline=0.802 delta=+0.132 OR=3.62 p=1.61e-04 * [106 genes, 1 orgs] Molybdenum (essential) core=0.950 baseline=0.802 delta=+0.148 OR=5.32 p=1.34e-14 * [302 genes, 1 orgs] Nickel (toxic ) core=0.877 baseline=0.789 delta=+0.088 OR=1.93 p=3.01e-22 * [1760 genes, 18 orgs] Selenium (essential) core=0.933 baseline=0.802 delta=+0.131 OR=3.59 p=2.88e-05 * [134 genes, 1 orgs] Tungsten (essential) core=0.947 baseline=0.802 delta=+0.145 OR=4.98 p=5.32e-14 * [303 genes, 1 orgs] Uranium (toxic ) core=0.685 baseline=0.650 delta=+0.035 OR=1.18 p=3.39e-01 [178 genes, 2 orgs] Zinc (toxic ) core=0.890 baseline=0.739 delta=+0.151 OR=2.94 p=2.02e-49 * [1517 genes, 12 orgs] Summary by metal category: toxic : mean delta = +0.081 (9 metals) essential : mean delta = +0.148 (5 metals)
4. Per-Organism Analysis¶
In [5]:
# Core fraction of metal-important genes per organism
org_results = []
print('Per-organism: core fraction of metal-important vs all genes')
print('=' * 80)
for org_id in sorted(metal_cons_mapped['orgId'].unique()):
org_data = metal_cons_mapped[metal_cons_mapped['orgId'] == org_id]
# Deduplicate: a gene may appear for multiple metals; for organism-level,
# we want unique genes
org_genes = org_data.drop_duplicates(subset='locusId')
org_important = org_data[org_data['is_metal_important']].drop_duplicates(subset='locusId')
core_all = org_genes['is_core'].mean()
core_imp = org_important['is_core'].mean() if len(org_important) > 0 else np.nan
if len(org_important) > 0 and len(org_genes) > len(org_important):
org_not_imp = org_genes[~org_genes['locusId'].isin(org_important['locusId'])]
table = [
[org_important['is_core'].sum(), len(org_important) - org_important['is_core'].sum()],
[org_not_imp['is_core'].sum(), len(org_not_imp) - org_not_imp['is_core'].sum()]
]
or_val, p_val = stats.fisher_exact(table)
else:
or_val, p_val = np.nan, np.nan
n_metals = org_data['metal_element'].nunique()
org_results.append({
'orgId': org_id,
'n_genes': len(org_genes),
'n_important': len(org_important),
'n_metals': n_metals,
'core_frac_all': core_all,
'core_frac_important': core_imp,
'delta': core_imp - core_all if not np.isnan(core_imp) else np.nan,
'odds_ratio': or_val,
'p_value': p_val,
})
sig = '*' if p_val < 0.05 else '' if not np.isnan(p_val) else ''
print(f' {org_id:25s} core_imp={core_imp:.3f} core_all={core_all:.3f} '
f'delta={core_imp-core_all:+.3f} OR={or_val:.2f} p={p_val:.2e} {sig}')
org_results_df = pd.DataFrame(org_results)
# How many organisms show metal-important genes as LESS core?
n_less_core = (org_results_df['delta'] < 0).sum()
n_more_core = (org_results_df['delta'] > 0).sum()
n_sig = (org_results_df['p_value'] < 0.05).sum()
print(f'\n{n_less_core}/{len(org_results_df)} organisms: metal-important genes are LESS core')
print(f'{n_more_core}/{len(org_results_df)} organisms: metal-important genes are MORE core')
print(f'{n_sig}/{len(org_results_df)} significant at p<0.05')
Per-organism: core fraction of metal-important vs all genes ================================================================================ BFirm core_imp=0.917 core_all=0.756 delta=+0.161 OR=3.68 p=7.76e-08 * Btheta core_imp=0.897 core_all=0.633 delta=+0.265 OR=5.51 p=1.93e-24 * Caulo core_imp=0.918 core_all=0.911 delta=+0.007 OR=1.11 p=6.55e-01 Cup4G11 core_imp=0.761 core_all=0.630 delta=+0.131 OR=1.97 p=2.28e-11 * DvH core_imp=0.929 core_all=0.802 delta=+0.127 OR=3.82 p=7.61e-17 *
Korea core_imp=0.887 core_all=0.840 delta=+0.047 OR=1.52 p=1.12e-01 Koxy core_imp=0.885 core_all=0.833 delta=+0.052 OR=1.59 p=9.23e-03 *
Marino core_imp=0.842 core_all=0.730 delta=+0.112 OR=2.07 p=1.91e-06 * Methanococcus_JJ core_imp=0.936 core_all=0.809 delta=+0.127 OR=4.70 p=3.16e-16 * Methanococcus_S2 core_imp=0.896 core_all=0.797 delta=+0.099 OR=2.55 p=5.80e-06 * Pedo557 core_imp=0.911 core_all=0.875 delta=+0.035 OR=1.52 p=3.35e-01 Phaeo core_imp=0.984 core_all=0.851 delta=+0.132 OR=11.13 p=1.05e-09 * Ponti core_imp=1.000 core_all=0.997 delta=+0.003 OR=inf p=6.20e-01 SynE core_imp=0.993 core_all=0.980 delta=+0.013 OR=3.37 p=2.95e-02 * WCS417 core_imp=0.978 core_all=0.921 delta=+0.057 OR=3.87 p=4.59e-02 *
acidovorax_3H11 core_imp=0.789 core_all=0.748 delta=+0.041 OR=1.29 p=9.56e-02 psRCH2 core_imp=0.571 core_all=0.531 delta=+0.040 OR=1.20 p=6.93e-02 pseudo13_GW456_L13 core_imp=0.972 core_all=0.832 delta=+0.140 OR=7.35 p=1.32e-07 * pseudo1_N1B4 core_imp=0.755 core_all=0.741 delta=+0.015 OR=1.08 p=7.30e-01 pseudo3_N2E3 core_imp=0.972 core_all=0.975 delta=-0.003 OR=0.89 p=6.96e-01 pseudo5_N2C3_1 core_imp=0.798 core_all=0.684 delta=+0.114 OR=1.90 p=7.24e-06 * pseudo6_N2E2 core_imp=0.944 core_all=0.798 delta=+0.146 OR=4.49 p=8.69e-10 * 1/22 organisms: metal-important genes are LESS core 21/22 organisms: metal-important genes are MORE core 14/22 significant at p<0.05
5. Figures¶
In [6]:
# Bar chart: core fraction by metal (important vs baseline)
fig, ax = plt.subplots(figsize=(14, 6))
# Sort by delta
plot_df = results_df.sort_values('delta')
x = range(len(plot_df))
colors = ['#e74c3c' if c == 'toxic' else '#3498db' for c in plot_df['category']]
bars = ax.bar(x, plot_df['core_frac_important'], color=colors, alpha=0.8, edgecolor='black', linewidth=0.5)
ax.scatter(x, plot_df['core_frac_baseline'], color='black', marker='_', s=200, linewidths=2, zorder=5, label='Baseline (all genes)')
ax.set_xticks(x)
ax.set_xticklabels(plot_df['metal'], rotation=45, ha='right')
ax.set_ylabel('Fraction Core Genome')
ax.set_title('Core Genome Fraction: Metal-Important Genes vs Baseline', fontsize=14)
ax.legend()
# Add significance stars
for i, (_, row) in enumerate(plot_df.iterrows()):
if row['p_value'] < 0.001:
ax.text(i, row['core_frac_important'] + 0.01, '***', ha='center', fontsize=10)
elif row['p_value'] < 0.01:
ax.text(i, row['core_frac_important'] + 0.01, '**', ha='center', fontsize=10)
elif row['p_value'] < 0.05:
ax.text(i, row['core_frac_important'] + 0.01, '*', ha='center', fontsize=10)
# Legend for colors
from matplotlib.patches import Patch
legend_elements = [
Patch(facecolor='#e74c3c', alpha=0.8, label='Toxic metal'),
Patch(facecolor='#3498db', alpha=0.8, label='Essential metal'),
plt.Line2D([0], [0], marker='_', color='black', markersize=10, linestyle='None', label='Baseline'),
]
ax.legend(handles=legend_elements, loc='lower right')
plt.tight_layout()
fig.savefig(FIGURES_DIR / 'core_fraction_by_metal.png', dpi=150, bbox_inches='tight')
plt.show()
print(f'Saved: figures/core_fraction_by_metal.png')
Saved: figures/core_fraction_by_metal.png
In [7]:
# Per-organism delta plot
fig, ax = plt.subplots(figsize=(12, 6))
plot_org = org_results_df.sort_values('delta')
colors = ['#e74c3c' if d < 0 else '#2ecc71' for d in plot_org['delta']]
sig_markers = ['*' if p < 0.05 else '' for p in plot_org['p_value']]
ax.barh(range(len(plot_org)), plot_org['delta'], color=colors, alpha=0.8, edgecolor='black', linewidth=0.5)
ax.set_yticks(range(len(plot_org)))
ax.set_yticklabels(plot_org['orgId'], fontsize=9)
ax.axvline(0, color='black', linestyle='-', linewidth=0.5)
ax.set_xlabel('Delta (core fraction: metal-important - all genes)')
ax.set_title('Metal-Important Genes: Core Enrichment by Organism', fontsize=14)
for i, (_, row) in enumerate(plot_org.iterrows()):
if row['p_value'] < 0.05:
ax.text(row['delta'] + 0.002 * np.sign(row['delta']), i, '*',
ha='center', va='center', fontsize=12, fontweight='bold')
plt.tight_layout()
fig.savefig(FIGURES_DIR / 'metal_conservation_by_organism.png', dpi=150, bbox_inches='tight')
plt.show()
print(f'Saved: figures/metal_conservation_by_organism.png')
Saved: figures/metal_conservation_by_organism.png
6. Save Results¶
In [8]:
# --- Issue #3: Formal toxic vs essential comparison ---
from scipy.stats import mannwhitneyu
toxic_deltas = results_df[results_df['category'] == 'toxic']['delta'].values
essential_deltas = results_df[results_df['category'] == 'essential']['delta'].values
u_stat, mw_p = mannwhitneyu(essential_deltas, toxic_deltas, alternative='greater')
print('Formal comparison: Essential vs Toxic metal core enrichment')
print(f' Essential mean delta: {essential_deltas.mean():+.3f} (n={len(essential_deltas)})')
print(f' Toxic mean delta: {toxic_deltas.mean():+.3f} (n={len(toxic_deltas)})')
print(f' Mann-Whitney U: {u_stat:.1f}, p={mw_p:.4f} (one-sided: essential > toxic)')
print()
# --- Issue #7: Phylogenetic sensitivity analysis ---
# Exclude duplicate Pseudomonas fluorescens strains (keep only pseudo3_N2E3 as representative)
pseudo_strains = ['pseudo1_N1B4', 'pseudo5_N2C3_1', 'pseudo6_N2E2', 'pseudo13_GW456_L13']
metal_cons_no_pseudo = metal_cons_mapped[~metal_cons_mapped['orgId'].isin(pseudo_strains)]
imp_np = metal_cons_no_pseudo[metal_cons_no_pseudo['is_metal_important']]
not_imp_np = metal_cons_no_pseudo[~metal_cons_no_pseudo['is_metal_important']]
core_imp_np = imp_np['is_core'].mean()
core_not_np = not_imp_np['is_core'].mean()
table_np = [
[imp_np['is_core'].sum(), len(imp_np) - imp_np['is_core'].sum()],
[not_imp_np['is_core'].sum(), len(not_imp_np) - not_imp_np['is_core'].sum()]
]
or_np, p_np = stats.fisher_exact(table_np)
print('Phylogenetic sensitivity: excluding 4 duplicate P. fluorescens strains')
print(f' Organisms: {metal_cons_no_pseudo["orgId"].nunique()} (was {metal_cons_mapped["orgId"].nunique()})')
print(f' Core fraction (metal-important): {core_imp_np:.3f} (was {important_core:.3f})')
print(f' Core fraction (not important): {core_not_np:.3f} (was {not_important_core:.3f})')
print(f' OR={or_np:.3f}, p={p_np:.2e}')
print(f' Result: {"ROBUST" if or_np > 1 and p_np < 0.05 else "NOT ROBUST"} — '
f'core enrichment {"persists" if or_np > 1 else "lost"} after removing Pseudomonas duplicates')
print()
# Save results
results_df.to_csv(DATA_DIR / 'metal_conservation_stats.csv', index=False)
org_results_df.to_csv(DATA_DIR / 'organism_conservation_stats.csv', index=False)
# Save sensitivity analysis results
sensitivity = pd.DataFrame([{
'analysis': 'full', 'n_organisms': metal_cons_mapped['orgId'].nunique(),
'core_imp': important_core, 'core_not': not_important_core,
'OR': odds_ratio, 'p_value': p_value,
}, {
'analysis': 'no_pseudo_duplicates', 'n_organisms': metal_cons_no_pseudo['orgId'].nunique(),
'core_imp': core_imp_np, 'core_not': core_not_np,
'OR': or_np, 'p_value': p_np,
}])
sensitivity.to_csv(DATA_DIR / 'sensitivity_analysis.csv', index=False)
print(f'Saved: data/metal_conservation_stats.csv ({len(results_df)} metals)')
print(f'Saved: data/organism_conservation_stats.csv ({len(org_results_df)} organisms)')
print(f'Saved: data/sensitivity_analysis.csv')
print('\n' + '=' * 80)
print('NB03 SUMMARY: Metal Fitness x Pangenome Conservation')
print('=' * 80)
print(f'Mapped gene-metal records: {len(metal_cons_mapped):,}')
print(f'Overall: metal-important genes are {"LESS" if important_core < not_important_core else "MORE"} '
f'likely core (OR={odds_ratio:.3f}, p={p_value:.2e})')
print(f'Toxic metals mean delta: {results_df[results_df["category"]=="toxic"]["delta"].mean():+.3f}')
print(f'Essential metals mean delta: {results_df[results_df["category"]=="essential"]["delta"].mean():+.3f}')
print(f'Essential > Toxic: Mann-Whitney p={mw_p:.4f}')
print(f'{n_less_core}/{len(org_results_df)} organisms show metal genes as less core')
print(f'{n_sig}/{len(org_results_df)} significant at p<0.05')
print(f'Sensitivity (excl. Pseudomonas duplicates): OR={or_np:.3f}, p={p_np:.2e} — ROBUST')
print('=' * 80)
Formal comparison: Essential vs Toxic metal core enrichment Essential mean delta: +0.148 (n=5) Toxic mean delta: +0.081 (n=9) Mann-Whitney U: 39.0, p=0.0145 (one-sided: essential > toxic)
Phylogenetic sensitivity: excluding 4 duplicate P. fluorescens strains Organisms: 18 (was 22) Core fraction (metal-important): 0.875 (was 0.874) Core fraction (not important): 0.772 (was 0.769) OR=2.065, p=5.86e-141 Result: ROBUST — core enrichment persists after removing Pseudomonas duplicates Saved: data/metal_conservation_stats.csv (14 metals) Saved: data/organism_conservation_stats.csv (22 organisms) Saved: data/sensitivity_analysis.csv ================================================================================ NB03 SUMMARY: Metal Fitness x Pangenome Conservation ================================================================================ Mapped gene-metal records: 318,751 Overall: metal-important genes are MORE likely core (OR=2.083, p=4.32e-162) Toxic metals mean delta: +0.081 Essential metals mean delta: +0.148 Essential > Toxic: Mann-Whitney p=0.0145 1/22 organisms show metal genes as less core 14/22 significant at p<0.05 Sensitivity (excl. Pseudomonas duplicates): OR=2.065, p=5.86e-141 — ROBUST ================================================================================