02 Metal Fitness Extraction
Jupyter notebook from the Pan-Bacterial Metal Fitness Atlas project.
NB 02: Metal Fitness Extraction¶
Extract gene-level fitness scores for all metal experiments identified in NB01. For each organism × metal, compute per-gene fitness summaries: mean fitness, min fitness, number of sick/beneficial experiments.
Runs locally — reads cached fitness matrices from fitness_modules/data/matrices/.
Inputs: data/metal_experiments.csv (from NB01)
Outputs:
data/metal_fitness_scores.csv— per-gene per-metal fitness summariesdata/metal_important_genes.csv— genes with significant metal fitness defectsfigures/metal_important_genes_by_organism.png
In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import seaborn as sns
PROJECT_DIR = Path('..').resolve()
DATA_DIR = PROJECT_DIR / 'data'
FIGURES_DIR = PROJECT_DIR / 'figures'
# Fitness matrices from fitness_modules project
FM_MATRICES = PROJECT_DIR.parent / 'fitness_modules' / 'data' / 'matrices'
print(f'Fitness matrix dir: {FM_MATRICES}')
print(f'Matrix files: {len(list(FM_MATRICES.glob("*_fitness_matrix.csv")))}')
Fitness matrix dir: /home/psdehal/pangenome_science/BERIL-research-observatory/projects/fitness_modules/data/matrices Matrix files: 32
1. Load Metal Experiments from NB01¶
In [2]:
metal_exps = pd.read_csv(DATA_DIR / 'metal_experiments.csv')
print(f'Metal experiments loaded: {len(metal_exps)}')
print(f'Organisms: {metal_exps["orgId"].nunique()}')
print(f'Metals: {metal_exps["metal_element"].nunique()}')
# Focus on analysis metals (exclude Platinum/Cisplatin and Metal_limitation)
exclude_metals = ['Platinum', 'Metal_limitation']
metal_exps_analysis = metal_exps[~metal_exps['metal_element'].isin(exclude_metals)].copy()
print(f'\nAnalysis subset (excl. Pt, Metal_limitation): {len(metal_exps_analysis)}')
print(f'Metals: {sorted(metal_exps_analysis["metal_element"].unique())}')
Metal experiments loaded: 559 Organisms: 31 Metals: 16 Analysis subset (excl. Pt, Metal_limitation): 483 Metals: ['Aluminum', 'Cadmium', 'Chromium', 'Cobalt', 'Copper', 'Iron', 'Manganese', 'Mercury', 'Molybdenum', 'Nickel', 'Selenium', 'Tungsten', 'Uranium', 'Zinc']
2. Extract Fitness Scores for Metal Experiments¶
For each organism, load the fitness matrix and t-statistic matrix, then extract columns corresponding to metal experiments.
In [3]:
# Thresholds for "significant" fitness effects (standard FB criteria)
FIT_SICK_THRESHOLD = -1.0 # fitness < -1 = important for growth
FIT_BENEFIT_THRESHOLD = 1.0 # fitness > 1 = beneficial
T_THRESHOLD = 4.0 # |t| > 4 = statistically significant
all_scores = []
organisms_processed = []
organisms_skipped = []
for org_id in sorted(metal_exps_analysis['orgId'].unique()):
fit_path = FM_MATRICES / f'{org_id}_fitness_matrix.csv'
t_path = FM_MATRICES / f'{org_id}_t_matrix.csv'
if not fit_path.exists():
organisms_skipped.append(org_id)
continue
# Load matrices
fit_df = pd.read_csv(fit_path, index_col='locusId')
t_df = pd.read_csv(t_path, index_col='locusId')
# Get metal experiments for this organism
org_metal_exps = metal_exps_analysis[metal_exps_analysis['orgId'] == org_id]
# Process each metal separately
for metal in org_metal_exps['metal_element'].unique():
metal_exp_names = org_metal_exps[
org_metal_exps['metal_element'] == metal
]['expName'].tolist()
# Filter to experiments present in both matrices
available_fit = [e for e in metal_exp_names if e in fit_df.columns]
available_t = [e for e in metal_exp_names if e in t_df.columns]
shared_exps = [e for e in available_fit if e in available_t]
if not shared_exps:
continue
# Align on common genes and experiments
common_genes = fit_df.index.intersection(t_df.index)
fit_metal = fit_df.loc[common_genes, shared_exps]
t_metal = t_df.loc[common_genes, shared_exps]
n_genes = len(common_genes)
# Compute per-gene summaries
gene_summary = pd.DataFrame({
'orgId': [org_id] * n_genes,
'locusId': common_genes.tolist(),
'metal_element': [metal] * n_genes,
'n_experiments': [len(shared_exps)] * n_genes,
'mean_fit': fit_metal.mean(axis=1).values,
'min_fit': fit_metal.min(axis=1).values,
'max_fit': fit_metal.max(axis=1).values,
'std_fit': fit_metal.std(axis=1).values,
# Sick: fit < threshold AND |t| > t_threshold
'n_sick': ((fit_metal < FIT_SICK_THRESHOLD) &
(t_metal.abs() > T_THRESHOLD)).sum(axis=1).values,
# Beneficial: fit > threshold AND |t| > t_threshold
'n_beneficial': ((fit_metal > FIT_BENEFIT_THRESHOLD) &
(t_metal.abs() > T_THRESHOLD)).sum(axis=1).values,
'max_abs_t': t_metal.abs().max(axis=1).values,
})
all_scores.append(gene_summary)
organisms_processed.append(org_id)
metal_fitness = pd.concat(all_scores, ignore_index=True)
print(f'Organisms processed: {len(organisms_processed)}')
if organisms_skipped:
print(f'Organisms skipped (no matrix): {organisms_skipped}')
print(f'\nTotal gene × metal records: {len(metal_fitness):,}')
print(f'Unique genes: {metal_fitness["locusId"].nunique():,}')
print(f'Unique metals: {metal_fitness["metal_element"].nunique()}')
print(f'Unique organisms: {metal_fitness["orgId"].nunique()}')
Organisms processed: 31 Total gene × metal records: 383,349 Unique genes: 81,761 Unique metals: 14 Unique organisms: 24
3. Identify Metal-Important Genes¶
A gene is "metal-important" for a given metal if it has a significant fitness
defect in at least one metal experiment: n_sick >= 1 (fit < -1, |t| > 4)
or mean_fit < -1.
In [4]:
# Define metal-important: at least 1 sick experiment OR mean fitness < -1
metal_fitness['is_metal_important'] = (
(metal_fitness['n_sick'] >= 1) |
(metal_fitness['mean_fit'] < FIT_SICK_THRESHOLD)
)
# Stricter definition: at least 2 sick experiments (for metals with >= 3 experiments)
metal_fitness['is_metal_important_strict'] = (
(metal_fitness['n_sick'] >= 2) |
((metal_fitness['n_sick'] >= 1) & (metal_fitness['mean_fit'] < -1.5))
)
n_important = metal_fitness['is_metal_important'].sum()
n_important_strict = metal_fitness['is_metal_important_strict'].sum()
print(f'Metal-important genes (broad, n_sick>=1 or mean<-1): '
f'{n_important:,} / {len(metal_fitness):,} '
f'({100*n_important/len(metal_fitness):.1f}%)')
print(f'Metal-important genes (strict, n_sick>=2 or n_sick>=1 & mean<-1.5): '
f'{n_important_strict:,} / {len(metal_fitness):,} '
f'({100*n_important_strict/len(metal_fitness):.1f}%)')
Metal-important genes (broad, n_sick>=1 or mean<-1): 12,838 / 383,349 (3.3%) Metal-important genes (strict, n_sick>=2 or n_sick>=1 & mean<-1.5): 5,667 / 383,349 (1.5%)
In [5]:
# Summary: metal-important genes by metal and organism
print('\nMetal-important genes per metal (broad definition):')
print('=' * 70)
for metal in sorted(metal_fitness['metal_element'].unique()):
subset = metal_fitness[metal_fitness['metal_element'] == metal]
n_imp = subset['is_metal_important'].sum()
n_total = len(subset)
n_orgs = subset['orgId'].nunique()
n_orgs_imp = subset[subset['is_metal_important']]['orgId'].nunique()
print(f' {metal:12s} {n_imp:5d} / {n_total:6d} genes important '
f'({100*n_imp/n_total:5.1f}%) in {n_orgs_imp}/{n_orgs} organisms')
Metal-important genes per metal (broad definition): ====================================================================== Aluminum 1772 / 57186 genes important ( 3.1%) in 15/15 organisms Cadmium 93 / 3349 genes important ( 2.8%) in 1/1 organisms Chromium 268 / 6090 genes important ( 4.4%) in 2/2 organisms Cobalt 2324 / 86478 genes important ( 2.7%) in 21/21 organisms Copper 2594 / 72281 genes important ( 3.6%) in 18/18 organisms Iron 659 / 5368 genes important ( 12.3%) in 3/3 organisms Manganese 33 / 2741 genes important ( 1.2%) in 1/1 organisms
Mercury 107 / 2741 genes important ( 3.9%) in 1/1 organisms Molybdenum 306 / 2741 genes important ( 11.2%) in 1/1 organisms
Nickel 2271 / 82142 genes important ( 2.8%) in 20/20 organisms
Selenium 138 / 2741 genes important ( 5.0%) in 1/1 organisms Tungsten 306 / 2741 genes important ( 11.2%) in 1/1 organisms Uranium 181 / 6090 genes important ( 3.0%) in 2/2 organisms Zinc 1786 / 50660 genes important ( 3.5%) in 13/13 organisms
In [6]:
# Metal-important genes per organism
print('\nMetal-important genes per organism (broad definition):')
print('=' * 70)
org_imp = metal_fitness.groupby('orgId').agg(
n_important=('is_metal_important', 'sum'),
n_total=('locusId', 'nunique'),
n_metals=('metal_element', 'nunique'),
).sort_values('n_important', ascending=False)
for _, row in org_imp.iterrows():
pct = 100 * row.n_important / row.n_total if row.n_total > 0 else 0
print(f' {row.name:25s} {int(row.n_important):4d} important / '
f'{int(row.n_total):4d} genes ({pct:5.1f}%) '
f'{int(row.n_metals)} metals')
Metal-important genes per organism (broad definition): ====================================================================== DvH 1366 important / 2741 genes ( 49.8%) 13 metals psRCH2 1055 important / 3349 genes ( 31.5%) 8 metals Pedo557 1039 important / 4423 genes ( 23.5%) 5 metals Caulo 869 important / 3312 genes ( 26.2%) 4 metals Cup4G11 825 important / 6384 genes ( 12.9%) 4 metals acidovorax_3H11 671 important / 3935 genes ( 17.1%) 5 metals SynE 638 important / 1899 genes ( 33.6%) 2 metals Ponti 617 important / 3685 genes ( 16.7%) 3 metals Btheta 602 important / 4055 genes ( 14.8%) 4 metals pseudo5_N2C3_1 584 important / 5193 genes ( 11.2%) 4 metals pseudo3_N2E3 459 important / 5028 genes ( 9.1%) 3 metals Koxy 441 important / 4608 genes ( 9.6%) 3 metals Marino 441 important / 3650 genes ( 12.1%) 5 metals Methanococcus_JJ 395 important / 1383 genes ( 28.6%) 1 metals Cola 381 important / 3954 genes ( 9.6%) 5 metals pseudo6_N2E2 372 important / 5133 genes ( 7.2%) 4 metals Phaeo 321 important / 3099 genes ( 10.4%) 5 metals Kang 319 important / 2003 genes ( 15.9%) 3 metals Korea 319 important / 3393 genes ( 9.4%) 5 metals pseudo13_GW456_L13 298 important / 4350 genes ( 6.9%) 4 metals Methanococcus_S2 255 important / 1244 genes ( 20.5%) 1 metals BFirm 246 important / 5428 genes ( 4.5%) 3 metals pseudo1_N1B4 213 important / 4336 genes ( 4.9%) 3 metals WCS417 112 important / 4419 genes ( 2.5%) 3 metals
4. Visualization¶
In [7]:
# Heatmap: fraction of genes that are metal-important, per organism × metal
imp_by_org_metal = metal_fitness.groupby(['orgId', 'metal_element']).agg(
frac_important=('is_metal_important', 'mean'),
n_important=('is_metal_important', 'sum'),
n_genes=('locusId', 'count'),
).reset_index()
# Pivot for heatmap
heatmap_data = imp_by_org_metal.pivot(
index='orgId', columns='metal_element', values='frac_important'
) * 100 # Convert to percentage
# Sort by mean importance across metals
heatmap_data = heatmap_data.loc[
heatmap_data.mean(axis=1).sort_values(ascending=False).index
]
# Sort columns by number of organisms
col_order = heatmap_data.notna().sum().sort_values(ascending=False).index
heatmap_data = heatmap_data[col_order]
fig, ax = plt.subplots(figsize=(12, 10))
sns.heatmap(
heatmap_data,
cmap='YlOrRd',
annot=True,
fmt='.1f',
linewidths=0.5,
ax=ax,
mask=heatmap_data.isna(),
cbar_kws={'label': '% genes with fitness defect'},
vmin=0,
vmax=20,
)
ax.set_title('% Metal-Important Genes per Organism × Metal', fontsize=14)
ax.set_xlabel('Metal')
ax.set_ylabel('Organism')
plt.tight_layout()
fig.savefig(FIGURES_DIR / 'metal_important_genes_by_organism.png', dpi=150, bbox_inches='tight')
plt.show()
print(f'Saved: figures/metal_important_genes_by_organism.png')
Saved: figures/metal_important_genes_by_organism.png
In [8]:
# Distribution of mean fitness scores by metal (across all organisms)
# Focus on the 6 cross-species metals
cross_species = ['Cobalt', 'Nickel', 'Copper', 'Aluminum', 'Zinc', 'Iron']
cs_data = metal_fitness[metal_fitness['metal_element'].isin(cross_species)]
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, metal in zip(axes.flat, cross_species):
subset = cs_data[cs_data['metal_element'] == metal]['mean_fit']
ax.hist(subset, bins=50, color='steelblue', alpha=0.7, edgecolor='black', linewidth=0.3)
ax.axvline(-1, color='red', linestyle='--', alpha=0.7, label='sick threshold')
ax.set_title(f'{metal} (n={len(subset):,})', fontsize=12)
ax.set_xlabel('Mean fitness score')
n_sick = (subset < -1).sum()
ax.text(0.95, 0.95, f'{n_sick} sick\n({100*n_sick/len(subset):.1f}%)',
transform=ax.transAxes, ha='right', va='top', fontsize=9,
bbox=dict(boxstyle='round', facecolor='lightyellow'))
plt.suptitle('Mean Fitness Score Distributions Under Metal Stress', fontsize=14, y=1.02)
plt.tight_layout()
fig.savefig(FIGURES_DIR / 'metal_fitness_distributions.png', dpi=150, bbox_inches='tight')
plt.show()
print(f'Saved: figures/metal_fitness_distributions.png')
Saved: figures/metal_fitness_distributions.png
5. Save Outputs¶
In [9]:
# Save full metal fitness scores
metal_fitness.to_csv(DATA_DIR / 'metal_fitness_scores.csv', index=False)
print(f'Saved: data/metal_fitness_scores.csv')
print(f' Rows: {len(metal_fitness):,}')
print(f' Columns: {list(metal_fitness.columns)}')
# Save metal-important genes only
important = metal_fitness[metal_fitness['is_metal_important']].copy()
important.to_csv(DATA_DIR / 'metal_important_genes.csv', index=False)
print(f'\nSaved: data/metal_important_genes.csv')
print(f' Rows: {len(important):,}')
print(f' Unique genes: {important["locusId"].nunique():,}')
print(f' Organisms: {important["orgId"].nunique()}')
Saved: data/metal_fitness_scores.csv Rows: 383,349 Columns: ['orgId', 'locusId', 'metal_element', 'n_experiments', 'mean_fit', 'min_fit', 'max_fit', 'std_fit', 'n_sick', 'n_beneficial', 'max_abs_t', 'is_metal_important', 'is_metal_important_strict'] Saved: data/metal_important_genes.csv Rows: 12,838 Unique genes: 7,544 Organisms: 24
In [10]:
print('=' * 80)
print('NB02 SUMMARY: Metal Fitness Extraction')
print('=' * 80)
print(f'Organisms processed: {len(organisms_processed)}')
print(f'Gene × metal records: {len(metal_fitness):,}')
print(f'Metal-important genes (broad): {n_important:,} ({100*n_important/len(metal_fitness):.1f}%)')
print(f'Metal-important genes (strict): {n_important_strict:,} ({100*n_important_strict/len(metal_fitness):.1f}%)')
print(f'\nOutputs:')
print(f' data/metal_fitness_scores.csv — {len(metal_fitness):,} gene × metal fitness summaries')
print(f' data/metal_important_genes.csv — {len(important):,} metal-important gene records')
print(f' figures/metal_important_genes_by_organism.png')
print(f' figures/metal_fitness_distributions.png')
print('=' * 80)
================================================================================ NB02 SUMMARY: Metal Fitness Extraction ================================================================================ Organisms processed: 31 Gene × metal records: 383,349 Metal-important genes (broad): 12,838 (3.3%) Metal-important genes (strict): 5,667 (1.5%) Outputs: data/metal_fitness_scores.csv — 383,349 gene × metal fitness summaries data/metal_important_genes.csv — 12,838 metal-important gene records figures/metal_important_genes_by_organism.png figures/metal_fitness_distributions.png ================================================================================