01 Metal Experiment Classification
Jupyter notebook from the Pan-Bacterial Metal Fitness Atlas project.
NB 01: Metal Experiment Classification
Identify and classify all metal-related experiments in the Fitness Browser (48 organisms in total; the 32 with cached experiment files are scanned here). Build a master table mapping each metal experiment to its metal element, organism, concentration, and experiment metadata.
Runs locally — reads cached experiment files from fitness_modules/data/annotations/.
Outputs: data/metal_experiments.csv — master table of all metal experiments.
import pandas as pd
import numpy as np
from pathlib import Path
import re
# Paths
PROJECT_DIR = Path('..').resolve()
DATA_DIR = PROJECT_DIR / 'data'
DATA_DIR.mkdir(exist_ok=True)
FIGURES_DIR = PROJECT_DIR / 'figures'
FIGURES_DIR.mkdir(exist_ok=True)
# Cached experiment files from fitness_modules project
FM_ANNOTATIONS = PROJECT_DIR.parent / 'fitness_modules' / 'data' / 'annotations'
print(f'Project dir: {PROJECT_DIR}')
print(f'Experiment annotations dir: {FM_ANNOTATIONS}')
print(f'Annotation files exist: {FM_ANNOTATIONS.exists()}')
Project dir: /home/psdehal/pangenome_science/BERIL-research-observatory/projects/metal_fitness_atlas
Experiment annotations dir: /home/psdehal/pangenome_science/BERIL-research-observatory/projects/fitness_modules/data/annotations
Annotation files exist: True
1. Define Metal Compound → Element Mapping
Map compound names in the FB condition_1 field to standardized metal element names.
Also flag the metal category (toxic vs essential) and whether it's on the USGS
critical minerals list.
# Compound name → (metal_element, metal_category, usgs_critical)
# metal_category: 'toxic' = external stress, 'essential' = nutrient/cofactor
METAL_COMPOUND_MAP = {
    # Toxic/stress metals — broad panel
    'Nickel (II) chloride hexahydrate': ('Nickel', 'toxic', True),
    'Cobalt chloride hexahydrate': ('Cobalt', 'toxic', True),
    'copper (II) chloride dihydrate': ('Copper', 'toxic', False),
    'Copper (II) sulfate pentahydrate': ('Copper', 'toxic', False),
    'Zinc sulfate heptahydrate': ('Zinc', 'toxic', False),
    'Zinc Pyrithione': ('Zinc', 'toxic', False),  # antimicrobial, not pure metal
    'Aluminum chloride hydrate': ('Aluminum', 'toxic', True),
    'Uranyl acetate': ('Uranium', 'toxic', True),
    'Sodium Chromate': ('Chromium', 'toxic', True),
    'Potassium dichromate': ('Chromium', 'toxic', True),
    'mercury (II) chloride': ('Mercury', 'toxic', False),
    'Cadmium chloride hemipentahydrate': ('Cadmium', 'toxic', False),
    'Cisplatin': ('Platinum', 'toxic', False),  # DNA-damaging agent
    # Essential/nutrient metals
    'Iron (II) chloride tetrahydrate': ('Iron', 'essential', False),
    'Sodium molybdate': ('Molybdenum', 'essential', False),
    'Sodium tungstate dihydrate': ('Tungsten', 'essential', True),
    'Sodium selenate': ('Selenium', 'essential', False),
    'Manganese (II) chloride tetrahydrate': ('Manganese', 'essential', True),
}
# Also match by keyword in condition_1 or expDesc for edge cases
METAL_KEYWORDS = {
    'nickel': 'Nickel',
    'cobalt': 'Cobalt',
    'copper': 'Copper',
    'zinc': 'Zinc',
    'aluminum': 'Aluminum',
    'uranyl': 'Uranium',
    'uranium': 'Uranium',
    'chromat': 'Chromium',
    'dichromat': 'Chromium',
    'mercury': 'Mercury',
    'cadmium': 'Cadmium',
    'molybdat': 'Molybdenum',
    'tungstat': 'Tungsten',
    'selenat': 'Selenium',
    'selenite': 'Selenium',
    'manganese': 'Manganese',
    'iron': 'Iron',
    'cisplatin': 'Platinum',
}
print(f'Defined {len(METAL_COMPOUND_MAP)} compound mappings')
print(f'Defined {len(METAL_KEYWORDS)} keyword patterns')
Defined 18 compound mappings
Defined 18 keyword patterns
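The exact-match lookup on the compound map is case-sensitive, and the keys above mix capitalisation ('copper (II) chloride dihydrate' next to 'Copper (II) sulfate pentahydrate'). The following is a minimal sketch of a normalized lookup that tolerates case and spacing variants. The names COMPOUND_MAP, normalize_compound, and lookup_compound are hypothetical and are not part of the notebook; the two entries are copied from the map above for illustration.

```python
# Two entries copied from the notebook's METAL_COMPOUND_MAP, for illustration only.
COMPOUND_MAP = {
    'Nickel (II) chloride hexahydrate': ('Nickel', 'toxic', True),
    'copper (II) chloride dihydrate': ('Copper', 'toxic', False),
}


def normalize_compound(name):
    """Lowercase and collapse whitespace so spelling variants compare equal."""
    return ' '.join(str(name).lower().split())


# Hypothetical index keyed by the normalized compound name.
NORMALIZED_MAP = {normalize_compound(k): v for k, v in COMPOUND_MAP.items()}


def lookup_compound(name):
    """Return (element, category, usgs_critical), or None if the name is unknown."""
    return NORMALIZED_MAP.get(normalize_compound(name))


print(lookup_compound('Copper (II) chloride dihydrate'))     # capitalised variant
print(lookup_compound('nickel (ii)  chloride hexahydrate'))  # lower case, extra space
print(lookup_compound('Sodium chloride'))                    # not in the map
```

Whether normalization would change the classification counts in this notebook is not checked here; a keyword fallback may already absorb such variants.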
2. Load All Cached Experiment Files
# Load experiment metadata for all 32 organisms with cached data
exp_files = sorted(FM_ANNOTATIONS.glob('*_experiments.csv'))
print(f'Found {len(exp_files)} experiment files')
all_experiments = []
for f in exp_files:
    org_id = f.stem.replace('_experiments', '')
    df = pd.read_csv(f)
    df['orgId'] = org_id
    all_experiments.append(df)
experiments = pd.concat(all_experiments, ignore_index=True)
print(f'\nTotal experiments loaded: {len(experiments):,}')
print(f'Organisms: {experiments["orgId"].nunique()}')
print(f'\nColumns: {list(experiments.columns)}')
print(f'\nExperiments per organism:')
print(experiments.groupby('orgId').size().sort_values(ascending=False).to_string())
Found 32 experiment files

Total experiments loaded: 6,804
Organisms: 32

Columns: ['expName', 'expDesc', 'expGroup', 'condition_1', 'media', 'cor12', 'mad12', 'nMapped', 'orgId']

Experiments per organism:
orgId
DvH                   757
Btheta                519
Methanococcus_S2      371
psRCH2                350
Putida                300
Phaeo                 274
Marino                255
pseudo3_N2E3          211
Koxy                  208
Cola                  202
WCS417                201
Caulo                 198
SB2B                  190
pseudo6_N2E2          188
Dino                  186
pseudo5_N2C3_1        184
Miya                  178
Pedo557               177
MR1                   176
Keio                  168
Korea                 162
PV4                   160
pseudo1_N1B4          147
acidovorax_3H11       140
Methanococcus_JJ      129
SynE                  129
BFirm                 113
Kang                  108
ANA3                  107
Cup4G11               106
pseudo13_GW456_L13    106
Ponti                 104
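The organism ID is derived from each file name by deleting the text '_experiments' from the stem. A minimal sketch of that step, assuming Python 3.9 or later for str.removesuffix; the helper name org_id_from_path and the example paths are hypothetical. removesuffix strips only a trailing match, whereas str.replace would also delete the text if it ever appeared inside an organism ID.

```python
from pathlib import Path


def org_id_from_path(path):
    """Return the organism ID encoded in '<orgId>_experiments.csv' (hypothetical helper)."""
    return Path(path).stem.removesuffix('_experiments')


# Example paths are illustrative; only the file-name pattern comes from the notebook.
print(org_id_from_path('annotations/DvH_experiments.csv'))
print(org_id_from_path('annotations/pseudo5_N2C3_1_experiments.csv'))
```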
3. Classify Metal Experiments
Match experiments to metals using compound name mapping and keyword fallback.
def classify_metal(row):
    """Classify an experiment row as metal-related or not.

    Returns (metal_element, metal_category, usgs_critical, match_method)
    or (None, None, None, None) if not metal-related.
    """
    condition = str(row.get('condition_1', ''))
    desc = str(row.get('expDesc', ''))
    group = str(row.get('expGroup', ''))

    essential = ('Iron', 'Molybdenum', 'Tungsten', 'Selenium', 'Manganese')
    critical = ('Nickel', 'Cobalt', 'Aluminum', 'Uranium', 'Chromium', 'Tungsten', 'Manganese')

    def keyword_match(text, method):
        """Classify by the first keyword found in text; return None if no keyword matches."""
        lowered = text.lower()
        for keyword, metal in METAL_KEYWORDS.items():
            if keyword in lowered:
                cat = 'essential' if metal in essential else 'toxic'
                return metal, cat, metal in critical, method
        return None

    # Method 1: Exact compound match
    if condition in METAL_COMPOUND_MAP:
        metal, cat, crit = METAL_COMPOUND_MAP[condition]
        return metal, cat, crit, 'compound_match'

    # Methods 2 and 3: Keyword match in condition_1, then in expDesc
    for text, method in ((condition, 'keyword_condition'), (desc, 'keyword_desc')):
        hit = keyword_match(text, method)
        if hit is not None:
            return hit

    # Method 4: expGroup = 'metal limitation' or 'iron'
    if group == 'metal limitation':
        # psRCH2 metal limitation experiments. Method 3 has already searched
        # expDesc for a specific metal, so classify as general metal limitation.
        return 'Metal_limitation', 'essential', False, 'metal_limitation_group'
    if group == 'iron':
        return 'Iron', 'essential', False, 'iron_group'

    return None, None, None, None
# Apply classification
results = experiments.apply(classify_metal, axis=1, result_type='expand')
results.columns = ['metal_element', 'metal_category', 'usgs_critical', 'match_method']
experiments_annotated = pd.concat([experiments, results], axis=1)
# Filter to metal experiments only
metal_exps = experiments_annotated[experiments_annotated['metal_element'].notna()].copy()
print(f'Total experiments: {len(experiments):,}')
print(f'Metal experiments: {len(metal_exps):,} ({100*len(metal_exps)/len(experiments):.1f}%)')
print(f'\nMetal experiments by match method:')
print(metal_exps['match_method'].value_counts().to_string())
Total experiments: 6,804
Metal experiments: 559 (8.2%)

Metal experiments by match method:
match_method
compound_match            463
keyword_desc               87
metal_limitation_group      9
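The keyword fallback uses plain substring tests, so a short keyword such as 'iron' also matches inside unrelated words like 'environmental'. The sketch below contrasts the substring test with a word-start regular expression. The function names and the three descriptions are invented for illustration, and whether any Fitness Browser description actually triggers such a false positive is not checked here.

```python
import re

# Subset of the notebook's METAL_KEYWORDS, copied for illustration.
KEYWORDS = {'iron': 'Iron', 'chromat': 'Chromium', 'dichromat': 'Chromium'}


def substring_match(text):
    """Plain substring test, mirroring the notebook's keyword fallback."""
    lowered = text.lower()
    for keyword, metal in KEYWORDS.items():
        if keyword in lowered:
            return metal
    return None


def word_start_match(text):
    """Require the keyword to begin a word; stems like 'chromat' still match 'chromate'."""
    for keyword, metal in KEYWORDS.items():
        if re.search(r'\b' + re.escape(keyword), text, re.IGNORECASE):
            return metal
    return None


# Invented descriptions, not taken from the Fitness Browser.
for desc in ['Iron (II) chloride 1 mM',
             'Potassium dichromate 0.05 mM',
             'environmental isolate medium']:
    print(f'{desc!r}: substring={substring_match(desc)}, '
          f'word_start={word_start_match(desc)}')
```

With word-start matching, 'dichromate' is caught only by the 'dichromat' keyword, which is one reason to keep both stems in the keyword table.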
4. Extract Concentrations
Parse concentration from expDesc where possible.
def extract_concentration(desc):
    """Extract concentration from experiment description.

    Returns (value, unit) or (None, None).
    """
    desc = str(desc)

    # Pattern: number followed by a concentration unit (mM, uM, µM, mg/L, ppm)
    match = re.search(r'(\d+\.?\d*)\s*(mM|uM|µM|mg/L|ppm)', desc, re.IGNORECASE)
    if match:
        return float(match.group(1)), match.group(2)

    # Pattern: relative dose such as "limitation (0.2x)" for iron
    match = re.search(r'(\d+\.?\d*)x', desc)
    if match and ('limitation' in desc.lower() or 'excess' in desc.lower()):
        return float(match.group(1)), 'x_relative'

    return None, None

metal_exps[['concentration', 'conc_unit']] = metal_exps['expDesc'].apply(
    lambda x: pd.Series(extract_concentration(x))
)
n_with_conc = metal_exps['concentration'].notna().sum()
print(f'Experiments with parsed concentration: {n_with_conc}/{len(metal_exps)} '
f'({100*n_with_conc/len(metal_exps):.0f}%)')
print(f'\nConcentration units found:')
print(metal_exps['conc_unit'].value_counts(dropna=False).to_string())
Experiments with parsed concentration: 411/559 (74%)

Concentration units found:
conc_unit
mM            354
NaN           148
uM             39
x_relative     18
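The parsed concentrations come in mixed units (mM and uM above), so values cannot be compared across experiments until they share a unit. A minimal sketch of converting molar units to millimolar follows. The names TO_MILLIMOLAR, MOLAR_PATTERN, and to_millimolar are hypothetical, the example descriptions are invented, and mass-based units (mg/L, ppm) are deliberately skipped because converting them requires each compound's molecular weight.

```python
import re

# Assumed conversion factors to millimolar, keyed by the lowercased unit.
# Both the micro sign (U+00B5) and the Greek letter mu (U+03BC) are accepted.
TO_MILLIMOLAR = {'mm': 1.0, 'um': 0.001, '\u00b5m': 0.001, '\u03bcm': 0.001}

MOLAR_PATTERN = re.compile(r'(\d+\.?\d*)\s*(mM|uM|[\u00b5\u03bc]M)', re.IGNORECASE)


def to_millimolar(desc):
    """Return the first molar concentration in desc as mM, or None if none is found."""
    match = MOLAR_PATTERN.search(str(desc))
    if match is None:
        return None
    factor = TO_MILLIMOLAR.get(match.group(2).lower())
    if factor is None:
        return None
    return float(match.group(1)) * factor


# Invented descriptions, not taken from the Fitness Browser.
for desc in ['Nickel (II) chloride hexahydrate 0.5 mM',
             'Cobalt chloride 250 uM',
             'Iron limitation (0.2x)']:
    print(desc, '->', to_millimolar(desc))
```

Relative doses such as "0.2x" return None here; they would need the baseline medium concentration to be converted.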
5. Summary Statistics
# Summary by metal element
metal_summary = metal_exps.groupby('metal_element').agg(
    n_experiments=('expName', 'count'),
    n_organisms=('orgId', 'nunique'),
    organisms=('orgId', lambda x: ', '.join(sorted(x.unique()))),
    category=('metal_category', 'first'),
    usgs_critical=('usgs_critical', 'first'),
).sort_values('n_experiments', ascending=False)
print('=' * 80)
print('METAL EXPERIMENT SUMMARY')
print('=' * 80)
print(f'\nTotal metal experiments: {len(metal_exps)}')
print(f'Metals: {metal_exps["metal_element"].nunique()}')
print(f'Organisms with metal data: {metal_exps["orgId"].nunique()}')
print()
# Display summary table (without organism list for readability)
display_df = metal_summary[['n_experiments', 'n_organisms', 'category', 'usgs_critical']].copy()
display_df.columns = ['Experiments', 'Organisms', 'Category', 'USGS Critical']
print(display_df.to_string())
================================================================================
METAL EXPERIMENT SUMMARY
================================================================================
Total metal experiments: 559
Metals: 16
Organisms with metal data: 31
                  Experiments  Organisms   Category  USGS Critical
metal_element
Cobalt                     89         27      toxic           True
Nickel                     79         26      toxic           True
Platinum                   67         24      toxic          False
Copper                     60         23      toxic          False
Aluminum                   54         22      toxic           True
Zinc                       52         17      toxic          False
Iron                       45          3  essential          False
Molybdenum                 39          1  essential          False
Tungsten                   22          1  essential           True
Chromium                   11          2      toxic           True
Selenium                    9          1  essential          False
Metal_limitation            9          1  essential          False
Uranium                     9          2      toxic           True
Manganese                   6          1  essential           True
Mercury                     5          1      toxic          False
Cadmium                     3          1      toxic          False
# Summary by organism
org_summary = metal_exps.groupby('orgId').agg(
    n_metal_experiments=('expName', 'count'),
    n_metals=('metal_element', 'nunique'),
    metals=('metal_element', lambda x: ', '.join(sorted(x.unique()))),
).sort_values('n_metals', ascending=False)
print('\nMetal experiments by organism (sorted by number of metals tested):')
print('=' * 80)
for _, row in org_summary.iterrows():
    print(f'{row.name:25s} {row.n_metal_experiments:3d} exps '
          f'{row.n_metals:2d} metals [{row.metals}]')
Metal experiments by organism (sorted by number of metals tested):
================================================================================
DvH                  149 exps 13 metals [Aluminum, Chromium, Cobalt, Copper, Iron, Manganese, Mercury, Molybdenum, Nickel, Selenium, Tungsten, Uranium, Zinc]
psRCH2                61 exps 10 metals [Aluminum, Cadmium, Chromium, Cobalt, Copper, Metal_limitation, Nickel, Platinum, Uranium, Zinc]
Dino                  19 exps  6 metals [Aluminum, Cobalt, Copper, Nickel, Platinum, Zinc]
MR1                   12 exps  6 metals [Aluminum, Cobalt, Copper, Nickel, Platinum, Zinc]
Marino                24 exps  6 metals [Aluminum, Cobalt, Copper, Nickel, Platinum, Zinc]
Korea                  8 exps  6 metals [Aluminum, Cobalt, Copper, Nickel, Platinum, Zinc]
Cola                  20 exps  6 metals [Aluminum, Cobalt, Copper, Nickel, Platinum, Zinc]
PV4                   15 exps  6 metals [Aluminum, Cobalt, Copper, Nickel, Platinum, Zinc]
SB2B                  12 exps  6 metals [Aluminum, Cobalt, Copper, Nickel, Platinum, Zinc]
Phaeo                 19 exps  6 metals [Aluminum, Cobalt, Copper, Nickel, Platinum, Zinc]
acidovorax_3H11       16 exps  6 metals [Aluminum, Cobalt, Copper, Nickel, Platinum, Zinc]
Pedo557               18 exps  6 metals [Aluminum, Cobalt, Copper, Nickel, Platinum, Zinc]
Cup4G11               11 exps  5 metals [Cobalt, Copper, Nickel, Platinum, Zinc]
Keio                   7 exps  5 metals [Aluminum, Cobalt, Copper, Nickel, Platinum]
Caulo                 16 exps  5 metals [Aluminum, Cobalt, Copper, Nickel, Platinum]
pseudo5_N2C3_1         8 exps  5 metals [Aluminum, Cobalt, Copper, Nickel, Platinum]
pseudo13_GW456_L13     6 exps  5 metals [Aluminum, Cobalt, Nickel, Platinum, Zinc]
pseudo6_N2E2           7 exps  5 metals [Aluminum, Cobalt, Copper, Nickel, Platinum]
WCS417                 4 exps  4 metals [Aluminum, Cobalt, Nickel, Platinum]
Btheta                26 exps  4 metals [Cobalt, Copper, Nickel, Zinc]
ANA3                   4 exps  4 metals [Aluminum, Cobalt, Nickel, Platinum]
pseudo3_N2E3          10 exps  4 metals [Cobalt, Copper, Nickel, Platinum]
Kang                   8 exps  4 metals [Cobalt, Copper, Nickel, Platinum]
pseudo1_N1B4           6 exps  4 metals [Aluminum, Cobalt, Copper, Platinum]
Ponti                 14 exps  4 metals [Cobalt, Copper, Nickel, Platinum]
BFirm                  4 exps  3 metals [Cobalt, Nickel, Zinc]
Koxy                   8 exps  3 metals [Cobalt, Copper, Nickel]
SynE                   9 exps  3 metals [Aluminum, Platinum, Zinc]
Methanococcus_JJ      18 exps  1 metals [Iron]
Miya                   2 exps  1 metals [Aluminum]
Methanococcus_S2      18 exps  1 metals [Iron]
# Focus: metals tested in >= 3 organisms (usable for cross-species comparison)
cross_species_metals = metal_summary[metal_summary['n_organisms'] >= 3].index.tolist()
print('\nMetals with >= 3 organisms (usable for cross-species analysis):')
print('=' * 80)
for metal in cross_species_metals:
    row = metal_summary.loc[metal]
    print(f' {metal:12s} {row.n_experiments:3d} experiments '
          f'{row.n_organisms:2d} organisms '
          f'category={row.category} critical={row.usgs_critical}')
print(f'\n{len(cross_species_metals)} metals suitable for cross-species analysis')
# Exclude Platinum/Cisplatin (DNA damage agent, not metal stress)
analysis_metals = [m for m in cross_species_metals if m != 'Platinum']
print(f'{len(analysis_metals)} metals after excluding Platinum/Cisplatin: {analysis_metals}')
Metals with >= 3 organisms (usable for cross-species analysis):
================================================================================
 Cobalt        89 experiments 27 organisms category=toxic critical=True
 Nickel        79 experiments 26 organisms category=toxic critical=True
 Platinum      67 experiments 24 organisms category=toxic critical=False
 Copper        60 experiments 23 organisms category=toxic critical=False
 Aluminum      54 experiments 22 organisms category=toxic critical=True
 Zinc          52 experiments 17 organisms category=toxic critical=False
 Iron          45 experiments  3 organisms category=essential critical=False

7 metals suitable for cross-species analysis
6 metals after excluding Platinum/Cisplatin: ['Cobalt', 'Nickel', 'Copper', 'Aluminum', 'Zinc', 'Iron']
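The cross-species filter counts distinct organisms per metal, not experiments, so a metal tested many times in one organism does not qualify. A small sketch on toy data, assuming pandas is installed; the organism IDs orgA to orgC and the rows are invented and are not Fitness Browser records.

```python
import pandas as pd

# Toy data: invented organisms, one row per experiment.
toy = pd.DataFrame({
    'orgId': ['orgA', 'orgA', 'orgB', 'orgC', 'orgA', 'orgB'],
    'metal_element': ['Cobalt', 'Cobalt', 'Cobalt', 'Cobalt', 'Mercury', 'Mercury'],
})

# Count distinct organisms per metal, then keep metals seen in at least 3 organisms.
n_orgs = toy.groupby('metal_element')['orgId'].nunique()
eligible = n_orgs[n_orgs >= 3].index.tolist()

print(n_orgs.to_dict())
print(eligible)
```

Here Cobalt has four experiments across three organisms and passes, while Mercury has two organisms and is dropped.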
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import seaborn as sns
# Build organism × metal experiment count matrix
# Focus on the analysis metals (exclude Platinum)
metal_exps_analysis = metal_exps[metal_exps['metal_element'].isin(analysis_metals)]
org_metal_matrix = metal_exps_analysis.groupby(
    ['orgId', 'metal_element']
).size().unstack(fill_value=0)
# Sort: metals by total experiments (descending), organisms by total metals tested
metal_order = org_metal_matrix.sum().sort_values(ascending=False).index.tolist()
org_order = (org_metal_matrix > 0).sum(axis=1).sort_values(ascending=False).index.tolist()
org_metal_matrix = org_metal_matrix.loc[org_order, metal_order]
# Plot heatmap
fig, ax = plt.subplots(figsize=(14, 10))
sns.heatmap(
    org_metal_matrix,
    cmap='YlOrRd',
    annot=True,
    fmt='d',
    linewidths=0.5,
    ax=ax,
    cbar_kws={'label': 'Number of experiments'}
)
ax.set_title('Fitness Browser: Metal Experiments per Organism', fontsize=14)
ax.set_xlabel('Metal Element')
ax.set_ylabel('Organism (orgId)')
plt.tight_layout()
fig.savefig(FIGURES_DIR / 'organism_metal_matrix.png', dpi=150, bbox_inches='tight')
plt.show()
print(f'Saved: figures/organism_metal_matrix.png')
Saved: figures/organism_metal_matrix.png
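The heatmap input is a count matrix built by grouping on two keys and unstacking one of them into columns. A minimal sketch of that reshaping on toy data, assuming pandas is installed; the organism IDs and rows are invented for illustration.

```python
import pandas as pd

# Toy data: invented organisms, one row per experiment.
toy = pd.DataFrame({
    'orgId': ['orgA', 'orgA', 'orgA', 'orgB', 'orgC'],
    'metal_element': ['Cobalt', 'Cobalt', 'Zinc', 'Cobalt', 'Zinc'],
})

# Rows = organisms, columns = metals, cells = experiment counts (0 where untested).
matrix = toy.groupby(['orgId', 'metal_element']).size().unstack(fill_value=0)

# Same ordering rule as the notebook: metals by total experiments,
# organisms by the number of distinct metals tested.
metal_order = matrix.sum().sort_values(ascending=False).index.tolist()
org_order = (matrix > 0).sum(axis=1).sort_values(ascending=False).index.tolist()
matrix = matrix.loc[org_order, metal_order]

print(matrix)
```

Organisms with the same number of metals are tied in this sort, so their relative order in the figure should not be read as meaningful.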
# Summary by metal category
print('\nMetal experiments by category:')
print('=' * 60)
for cat in ['toxic', 'essential']:
    subset = metal_exps_analysis[metal_exps_analysis['metal_category'] == cat]
    metals = sorted(subset['metal_element'].unique())
    print(f'\n {cat.upper()} metals ({len(metals)}): {metals}')
    print(f' Experiments: {len(subset)}')
    print(f' Organisms: {subset["orgId"].nunique()}')
# USGS critical minerals
critical = metal_exps_analysis[metal_exps_analysis['usgs_critical'] == True]
critical_metals = sorted(critical['metal_element'].unique())
print(f'\n USGS CRITICAL MINERALS in FB: {critical_metals}')
print(f' Experiments: {len(critical)}')
print(f' Organisms: {critical["orgId"].nunique()}')
Metal experiments by category:
============================================================
TOXIC metals (5): ['Aluminum', 'Cobalt', 'Copper', 'Nickel', 'Zinc']
Experiments: 334
Organisms: 29
ESSENTIAL metals (1): ['Iron']
Experiments: 45
Organisms: 3
USGS CRITICAL MINERALS in FB: ['Aluminum', 'Cobalt', 'Nickel']
Experiments: 222
Organisms: 29
6. Save Master Table
# Save the full metal experiments table
output_cols = [
    'orgId', 'expName', 'expDesc', 'expGroup', 'condition_1', 'media',
    'metal_element', 'metal_category', 'usgs_critical', 'match_method',
    'concentration', 'conc_unit',
    'cor12', 'mad12', 'nMapped'
]
metal_exps_out = metal_exps[output_cols].sort_values(['metal_element', 'orgId', 'expName'])
metal_exps_out.to_csv(DATA_DIR / 'metal_experiments.csv', index=False)
print(f'Saved: data/metal_experiments.csv')
print(f' Rows: {len(metal_exps_out):,}')
print(f' Metals: {metal_exps_out["metal_element"].nunique()}')
print(f' Organisms: {metal_exps_out["orgId"].nunique()}')
# Also save the analysis-ready subset (excluding Platinum)
analysis_out = metal_exps_out[metal_exps_out['metal_element'].isin(analysis_metals)]
analysis_out.to_csv(DATA_DIR / 'metal_experiments_analysis.csv', index=False)
print(f'\nSaved: data/metal_experiments_analysis.csv (excluding Platinum/Cisplatin)')
print(f' Rows: {len(analysis_out):,}')
print(f' Metals: {analysis_out["metal_element"].nunique()}')
print(f' Organisms: {analysis_out["orgId"].nunique()}')
Saved: data/metal_experiments.csv
 Rows: 559
 Metals: 16
 Organisms: 31

Saved: data/metal_experiments_analysis.csv (excluding Platinum/Cisplatin)
 Rows: 379
 Metals: 6
 Organisms: 31
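Downstream notebooks depend on the column layout of the saved tables. One possible sanity check, sketched with the standard library only, compares a CSV header against the column list used when saving. The helper name missing_columns is hypothetical, and the sample text stands in for the real file so the sketch runs without any project data.

```python
import csv
import io

# Column names copied from the notebook's output_cols list.
EXPECTED_COLUMNS = [
    'orgId', 'expName', 'expDesc', 'expGroup', 'condition_1', 'media',
    'metal_element', 'metal_category', 'usgs_critical', 'match_method',
    'concentration', 'conc_unit', 'cor12', 'mad12', 'nMapped',
]


def missing_columns(csv_text):
    """Return the expected columns that are absent from the CSV header (hypothetical helper)."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return [col for col in EXPECTED_COLUMNS if col not in header]


# Stand-in CSV text; a real check would read data/metal_experiments.csv instead.
complete = ','.join(EXPECTED_COLUMNS) + '\n'
truncated = 'orgId,expName,metal_element\n'

print(missing_columns(complete))
print(missing_columns(truncated))
```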
print('=' * 80)
print('NB01 SUMMARY: Metal Experiment Classification')
print('=' * 80)
print(f'Total FB experiments scanned: {len(experiments):,}')
print(f'Metal experiments identified: {len(metal_exps):,} ({100*len(metal_exps)/len(experiments):.1f}%)')
print(f'Unique metals: {metal_exps["metal_element"].nunique()}')
print(f'Organisms with metal data: {metal_exps["orgId"].nunique()}')
print(f'\nCross-species analysis metals (>= 3 orgs, excl. Pt): {len(analysis_metals)}')
print(f' {analysis_metals}')
print(f'\nUSGS critical minerals covered: {critical_metals}')
print(f'\nOutputs:')
print(f' data/metal_experiments.csv — all metal experiments')
print(f' data/metal_experiments_analysis.csv — analysis subset')
print(f' figures/organism_metal_matrix.png — organism x metal heatmap')
print('=' * 80)
================================================================================
NB01 SUMMARY: Metal Experiment Classification
================================================================================
Total FB experiments scanned: 6,804
Metal experiments identified: 559 (8.2%)
Unique metals: 16
Organisms with metal data: 31

Cross-species analysis metals (>= 3 orgs, excl. Pt): 6
 ['Cobalt', 'Nickel', 'Copper', 'Aluminum', 'Zinc', 'Iron']

USGS critical minerals covered: ['Aluminum', 'Cobalt', 'Nickel']

Outputs:
 data/metal_experiments.csv — all metal experiments
 data/metal_experiments_analysis.csv — analysis subset
 figures/organism_metal_matrix.png — organism x metal heatmap
================================================================================