04 Cross Species Metal Families
Jupyter notebook from the Pan-Bacterial Metal Fitness Atlas project.
NB 04: Cross-Species Metal Fitness Families
Map metal-important genes to ortholog groups and identify gene families with conserved metal fitness phenotypes across multiple species.
Runs locally. Ortholog groups are read from essential_genome/data/, with fitness_modules/data/orthologs/ as a fallback.
Inputs:
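- data/metal_important_genes.csv (from NB02)
- essential_genome/data/all_ortholog_groups.csv (fallback: fitness_modules/data/orthologs/ortholog_groups.csv)
- essential_genome/data/family_conservation.tsv
- essential_genome/data/essential_families.tsv
- essential_genome/data/all_seed_annotations.tsv (optional, used for functional annotation if present)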
Outputs:
- data/conserved_metal_families.csv
- data/novel_metal_candidates.csv
- figures/metal_family_conservation_heatmap.png
In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
from scipy import stats
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import seaborn as sns
PROJECT_DIR = Path('..').resolve()
DATA_DIR = PROJECT_DIR / 'data'
FIGURES_DIR = PROJECT_DIR / 'figures'
EG_DATA = PROJECT_DIR.parent / 'essential_genome' / 'data'
FM_DATA = PROJECT_DIR.parent / 'fitness_modules' / 'data'
# Load data
metal_important = pd.read_csv(DATA_DIR / 'metal_important_genes.csv')
metal_important['locusId'] = metal_important['locusId'].astype(str)
print(f'Metal-important gene records: {len(metal_important):,}')
# Ortholog groups: try essential_genome first, then fitness_modules
og_path = EG_DATA / 'all_ortholog_groups.csv'
if not og_path.exists():
    og_path = FM_DATA / 'orthologs' / 'ortholog_groups.csv'
ortholog_groups = pd.read_csv(og_path)
ortholog_groups['locusId'] = ortholog_groups['locusId'].astype(str)
print(f'Ortholog group assignments: {len(ortholog_groups):,}')
print(f'Unique ortholog groups: {ortholog_groups["OG_id"].nunique():,}')
family_cons = pd.read_csv(EG_DATA / 'family_conservation.tsv', sep='\t')
print(f'Family conservation records: {len(family_cons):,}')
essential_fam = pd.read_csv(EG_DATA / 'essential_families.tsv', sep='\t')
print(f'Essential family records: {len(essential_fam):,}')
Metal-important gene records: 12,838
Ortholog group assignments: 179,237
Unique ortholog groups: 17,222
Family conservation records: 16,758
Essential family records: 17,222
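The "try one location, then another" logic in the cell above could be written as a small helper that fails loudly when neither file is present. This is only a sketch: `first_existing` is a hypothetical name, not part of the project, and the temporary directory stands in for the real data folders.

```python
from pathlib import Path
import tempfile


def first_existing(candidates):
    """Return the first candidate path that exists; raise if none do.

    Hypothetical helper for illustration, not part of the notebook.
    """
    for path in candidates:
        if Path(path).exists():
            return Path(path)
    raise FileNotFoundError(
        f"None of the candidate files exist: {[str(p) for p in candidates]}"
    )


# Demo in a throwaway directory: only the fallback file is present.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    fallback = tmp / "ortholog_groups.csv"
    fallback.write_text("orgId,locusId,OG_id\n")
    chosen_name = first_existing([tmp / "all_ortholog_groups.csv", fallback]).name

print(chosen_name)
```

Compared with the bare `if not og_path.exists()` check, this raises an explicit error when both locations are missing instead of letting `pd.read_csv` fail on the fallback path.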
1. Map Metal-Important Genes to Ortholog Groups
In [2]:
# Join metal-important genes with ortholog groups
metal_og = metal_important.merge(
    ortholog_groups[['orgId', 'locusId', 'OG_id']],
    on=['orgId', 'locusId'],
    how='left'
)
n_mapped = metal_og['OG_id'].notna().sum()
print(f'Metal-important genes mapped to OGs: {n_mapped:,} / {len(metal_og):,} '
      f'({100*n_mapped/len(metal_og):.1f}%)')
metal_og_mapped = metal_og[metal_og['OG_id'].notna()].copy()
print(f'Unique OGs with metal phenotype: {metal_og_mapped["OG_id"].nunique():,}')
Metal-important genes mapped to OGs: 10,876 / 12,838 (84.7%)
Unique OGs with metal phenotype: 2,891
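A left join like the one above silently duplicates rows if a gene appears in more than one ortholog group. One way to guard against that, sketched here on made-up data, is to use the `validate` and `indicator` options of `DataFrame.merge`. The toy frames and their values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for metal_important and ortholog_groups (values are invented).
genes = pd.DataFrame({
    "orgId": ["orgA", "orgA", "orgB"],
    "locusId": ["1", "2", "7"],
})
groups = pd.DataFrame({
    "orgId": ["orgA", "orgB"],
    "locusId": ["1", "7"],
    "OG_id": ["OG00001", "OG00002"],
})

merged = genes.merge(
    groups,
    on=["orgId", "locusId"],
    how="left",
    validate="many_to_one",  # raises MergeError if a gene maps to more than one OG
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

n_unmapped = int((merged["_merge"] == "left_only").sum())
print(f"rows: {len(merged)}, unmapped: {n_unmapped}")
```

With `validate="many_to_one"`, the row count after the merge is guaranteed to equal the row count of the left table, so the mapped fraction is computed over the same set of gene records.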
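2. Identify Conserved Metal Families
A "conserved metal family" is an ortholog group with metal fitness defects in ≥2 organisms. The cell below counts this two ways: per metal (the same metal in ≥2 organisms) and across any metal. The any-metal count is the one carried forward to annotation and to the saved outputs.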
In [3]:
# For each OG × metal, count organisms with metal phenotype
og_metal_counts = metal_og_mapped.groupby(['OG_id', 'metal_element']).agg(
    n_organisms=('orgId', 'nunique'),
    organisms=('orgId', lambda x: ','.join(sorted(x.unique()))),
    mean_fitness=('mean_fit', 'mean'),
    min_fitness=('min_fit', 'min'),
).reset_index()
# Conserved: metal phenotype in >= 2 organisms
conserved_2 = og_metal_counts[og_metal_counts['n_organisms'] >= 2]
conserved_3 = og_metal_counts[og_metal_counts['n_organisms'] >= 3]
print(f'OG × metal combinations: {len(og_metal_counts):,}')
print(f'Conserved (≥2 organisms): {len(conserved_2):,} families × metals '
      f'({conserved_2["OG_id"].nunique()} unique OGs)')
print(f'Conserved (≥3 organisms): {len(conserved_3):,} families × metals '
      f'({conserved_3["OG_id"].nunique()} unique OGs)')
# Also count across ANY metal (organism has ANY metal phenotype)
og_any_metal = metal_og_mapped.groupby('OG_id').agg(
    n_organisms_any=('orgId', 'nunique'),
    n_metals=('metal_element', 'nunique'),
    metals=('metal_element', lambda x: ','.join(sorted(x.unique()))),
).reset_index()
conserved_any_2 = og_any_metal[og_any_metal['n_organisms_any'] >= 2]
conserved_any_3 = og_any_metal[og_any_metal['n_organisms_any'] >= 3]
print(f'\nOGs with metal phenotype in ≥2 organisms (any metal): {len(conserved_any_2):,}')
print(f'OGs with metal phenotype in ≥3 organisms (any metal): {len(conserved_any_3):,}')
OG × metal combinations: 7,231
Conserved (≥2 organisms): 1,836 families × metals (906 unique OGs)
Conserved (≥3 organisms): 704 families × metals (353 unique OGs)

OGs with metal phenotype in ≥2 organisms (any metal): 1,182
OGs with metal phenotype in ≥3 organisms (any metal): 601
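The per-metal count (906 unique OGs) is smaller than the any-metal count (1,182) because a family can be hit in two organisms by two different metals. The toy example below, using invented OG and organism names, shows how the two `groupby` definitions diverge.

```python
import pandas as pd

# Invented records: one row per (family, organism, metal) phenotype.
toy = pd.DataFrame({
    "OG_id":         ["OG_A", "OG_A", "OG_B", "OG_B",   "OG_C"],
    "orgId":         ["org1", "org2", "org1", "org2",   "org1"],
    "metal_element": ["Zinc", "Zinc", "Zinc", "Copper", "Nickel"],
})

# Per-metal definition: the same metal must hit >= 2 organisms.
per_metal = toy.groupby(["OG_id", "metal_element"])["orgId"].nunique()
conserved_same_metal = set(per_metal[per_metal >= 2].index.get_level_values("OG_id"))

# Any-metal definition: >= 2 organisms with a phenotype for any metal.
any_metal = toy.groupby("OG_id")["orgId"].nunique()
conserved_any_metal = set(any_metal[any_metal >= 2].index)

print(sorted(conserved_same_metal))
print(sorted(conserved_any_metal))
```

Here `OG_B` counts as conserved only under the any-metal definition, since its two organisms respond to different metals.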
In [4]:
# Top conserved families by organism breadth
print('\nTop 30 conserved metal families (by # organisms with metal phenotype):')
print('=' * 90)
top_families = og_any_metal.sort_values('n_organisms_any', ascending=False).head(30)
for _, row in top_families.iterrows():
    print(f' {row.OG_id:12s} {row.n_organisms_any:2d} organisms '
          f'{row.n_metals:2d} metals [{row.metals}]')
Top 30 conserved metal families (by # organisms with metal phenotype):
==========================================================================================
 OG00424 18 organisms  7 metals [Aluminum,Cobalt,Copper,Molybdenum,Nickel,Tungsten,Zinc]
 OG00245 18 organisms  5 metals [Aluminum,Cobalt,Copper,Nickel,Zinc]
 OG00128 17 organisms  9 metals [Aluminum,Chromium,Cobalt,Copper,Manganese,Molybdenum,Nickel,Tungsten,Zinc]
 OG00605 17 organisms  7 metals [Aluminum,Cobalt,Copper,Molybdenum,Nickel,Tungsten,Zinc]
 OG00016 16 organisms  5 metals [Aluminum,Cobalt,Copper,Nickel,Zinc]
 OG00082 15 organisms  8 metals [Aluminum,Chromium,Cobalt,Copper,Molybdenum,Nickel,Tungsten,Zinc]
 OG00342 15 organisms  5 metals [Aluminum,Cobalt,Copper,Nickel,Zinc]
 OG00019 15 organisms  9 metals [Aluminum,Cadmium,Chromium,Cobalt,Copper,Iron,Nickel,Uranium,Zinc]
 OG00151 14 organisms  5 metals [Aluminum,Cobalt,Copper,Nickel,Zinc]
 OG00075 14 organisms  7 metals [Aluminum,Cobalt,Copper,Iron,Molybdenum,Nickel,Zinc]
 OG01082 13 organisms  6 metals [Aluminum,Cobalt,Copper,Iron,Nickel,Zinc]
 OG00742 13 organisms  5 metals [Aluminum,Cobalt,Copper,Nickel,Zinc]
 OG00143 13 organisms  5 metals [Aluminum,Cobalt,Copper,Nickel,Zinc]
 OG00270 13 organisms  5 metals [Aluminum,Cobalt,Copper,Nickel,Zinc]
 OG00898 12 organisms  6 metals [Aluminum,Cadmium,Cobalt,Copper,Nickel,Zinc]
 OG01509 12 organisms 11 metals [Aluminum,Chromium,Cobalt,Copper,Iron,Mercury,Molybdenum,Nickel,Selenium,Tungsten,Zinc]
 OG00231 12 organisms  5 metals [Aluminum,Cobalt,Copper,Nickel,Zinc]
 OG00079 12 organisms  7 metals [Aluminum,Cobalt,Copper,Iron,Nickel,Uranium,Zinc]
 OG00412 12 organisms  5 metals [Aluminum,Cobalt,Copper,Nickel,Zinc]
 OG00798 12 organisms  7 metals [Aluminum,Cobalt,Copper,Iron,Nickel,Uranium,Zinc]
 OG00102 12 organisms  6 metals [Aluminum,Cobalt,Copper,Iron,Nickel,Zinc]
 OG00666 11 organisms  5 metals [Aluminum,Cobalt,Copper,Nickel,Zinc]
 OG01714 11 organisms  6 metals [Aluminum,Cobalt,Copper,Nickel,Tungsten,Zinc]
 OG01424 11 organisms  7 metals [Aluminum,Cobalt,Copper,Molybdenum,Nickel,Tungsten,Zinc]
 OG00302 11 organisms  5 metals [Aluminum,Cobalt,Copper,Nickel,Zinc]
 OG00345 11 organisms  5 metals [Aluminum,Cobalt,Copper,Nickel,Zinc]
 OG00003 11 organisms  5 metals [Aluminum,Cobalt,Copper,Nickel,Zinc]
 OG01101 11 organisms 12 metals [Aluminum,Chromium,Cobalt,Copper,Iron,Mercury,Molybdenum,Nickel,Selenium,Tungsten,Uranium,Zinc]
 OG00164 11 organisms  9 metals [Aluminum,Cadmium,Chromium,Cobalt,Copper,Nickel,Selenium,Uranium,Zinc]
 OG01383 11 organisms  6 metals [Aluminum,Chromium,Cobalt,Copper,Tungsten,Zinc]
3. Annotate Conserved Families
In [5]:
# Load SEED annotations for functional context
seed_path = EG_DATA / 'all_seed_annotations.tsv'
if seed_path.exists():
    seed_annot = pd.read_csv(seed_path, sep='\t')
    seed_annot['locusId'] = seed_annot['locusId'].astype(str)
    print(f'SEED annotations: {len(seed_annot):,}')
    og_seed = ortholog_groups.merge(
        seed_annot[['orgId', 'locusId', 'seed_desc']].drop_duplicates(),
        on=['orgId', 'locusId'], how='left'
    )
    # Most frequent SEED description per OG; 'hypothetical' if the OG has none
    og_seed_summary = og_seed.groupby('OG_id')['seed_desc'].agg(
        lambda x: x.dropna().mode().iloc[0] if len(x.dropna()) > 0 and len(x.dropna().mode()) > 0 else 'hypothetical'
    ).reset_index()
    og_seed_summary.columns = ['OG_id', 'seed_description']
else:
    print('No SEED annotations available')
    og_seed_summary = pd.DataFrame(columns=['OG_id', 'seed_description'])
# Add conservation and essentiality
# Note: column names shared between the merged tables receive pandas' default
# _x/_y suffixes, which is why 'essentiality_class_x' is used further down.
conserved_annotated = og_any_metal.merge(family_cons, on='OG_id', how='left')
conserved_annotated = conserved_annotated.merge(og_seed_summary, on='OG_id', how='left')
conserved_annotated = conserved_annotated.merge(
    essential_fam[['OG_id', 'essentiality_class', 'frac_essential']],
    on='OG_id', how='left'
)
print(f'\nAnnotated families: {len(conserved_annotated):,}')
# Identify novel candidates with finer categories
conserved_with_data = conserved_annotated[conserved_annotated['n_organisms_any'] >= 2].copy()

def classify_novelty(row):
    """Classify a family as annotated, novel_domain, novel_metal_function, or truly_unknown.

    Matching is by lowercase substring on rep_desc + seed_description. Missing
    descriptions become the string 'nan' and therefore fall through to 'annotated'.
    """
    desc = str(row.get('rep_desc', ''))
    seed = str(row.get('seed_description', ''))
    combined = (desc + ' ' + seed).lower()
    if any(k in combined for k in ['hypothetical', 'unknown', 'uncharacterized', 'duf', 'upf']):
        if any(k in combined for k in ['duf', 'upf']):
            return 'novel_domain'  # Has a DUF/UPF domain: known domain, unknown metal function
        elif any(k in combined for k in ['transporter', 'regulator', 'kinase', 'transferase',
                                         'oxidoreductase', 'hydrolase', 'permease', 'efflux']):
            return 'novel_metal_function'  # Has functional hint, unknown metal role
        else:
            return 'truly_unknown'  # No functional annotation at all
    return 'annotated'
conserved_with_data['novelty_class'] = conserved_with_data.apply(classify_novelty, axis=1)
novel = conserved_with_data[conserved_with_data['novelty_class'] != 'annotated'].copy()
truly_unknown = conserved_with_data[conserved_with_data['novelty_class'] == 'truly_unknown']
novel_domain = conserved_with_data[conserved_with_data['novelty_class'] == 'novel_domain']
novel_metal = conserved_with_data[conserved_with_data['novelty_class'] == 'novel_metal_function']
print(f'Conserved families (>=2 orgs): {len(conserved_with_data):,}')
print(f'Novel candidates total: {len(novel):,}')
print(f' Truly unknown (no annotation): {len(truly_unknown)}')
print(f' Novel domain (DUF/UPF, unknown metal role): {len(novel_domain)}')
print(f' Novel metal function (known domain, uncharacterized metal role): {len(novel_metal)}')
SEED annotations: 177,519

Annotated families: 2,891
Conserved families (>=2 orgs): 1,182
Novel candidates total: 149
  Truly unknown (no annotation): 89
  Novel domain (DUF/UPF, unknown metal role): 43
  Novel metal function (known domain, uncharacterized metal role): 17
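To see how the keyword rules in `classify_novelty` behave on individual descriptions, the sketch below restates them as a standalone function on two strings. `classify_text` is a hypothetical name introduced here for illustration; the keyword lists are copied from the cell above, and the example descriptions are either taken from the notebook output or invented.

```python
# Keyword lists copied from classify_novelty; classify_text itself is hypothetical.
UNKNOWN_KEYS = ("hypothetical", "unknown", "uncharacterized", "duf", "upf")
DOMAIN_KEYS = ("duf", "upf")
FUNCTION_KEYS = ("transporter", "regulator", "kinase", "transferase",
                 "oxidoreductase", "hydrolase", "permease", "efflux")


def classify_text(rep_desc, seed_description):
    """Apply the same substring rules to a pair of description values."""
    combined = f"{rep_desc} {seed_description}".lower()
    if not any(k in combined for k in UNKNOWN_KEYS):
        return "annotated"
    if any(k in combined for k in DOMAIN_KEYS):
        return "novel_domain"
    if any(k in combined for k in FUNCTION_KEYS):
        return "novel_metal_function"
    return "truly_unknown"


results = {
    "duf": classify_text("DUF1043 family protein", "hypothetical protein"),
    "permease": classify_text("ABC transporter permease", "hypothetical protein"),
    "membrane": classify_text("membrane protein", "hypothetical protein"),
    "known": classify_text("copper efflux pump", "Copper-translocating P-type ATPase"),
    # Missing values stringify to "nan", which matches no keyword.
    "missing": classify_text(float("nan"), float("nan")),
}
for name, label in results.items():
    print(f"{name:10s} -> {label}")
```

The last case is worth noting: a family with no description in either column is labelled `annotated` rather than `truly_unknown`, so the novel-candidate count may be an undercount if such families exist in the data.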
4. Figures
In [6]:
# Distribution of organism breadth for metal OGs
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Left: histogram of organism count per OG
ax = axes[0]
ax.hist(og_any_metal['n_organisms_any'], bins=range(1, og_any_metal['n_organisms_any'].max()+2),
        color='steelblue', alpha=0.8, edgecolor='black', linewidth=0.5, align='left')
ax.set_xlabel('Number of organisms with metal fitness defect')
ax.set_ylabel('Number of ortholog families')
ax.set_title('Metal Fitness Gene Family Breadth')
ax.axvline(2, color='red', linestyle='--', alpha=0.7, label='≥2 = conserved')
ax.legend()
# Right: conservation (pct_core) vs organism breadth
ax = axes[1]
plot_data = conserved_annotated[conserved_annotated['pct_core'].notna()]
ax.scatter(plot_data['n_organisms_any'], plot_data['pct_core'],
           alpha=0.3, s=10, color='steelblue')
# Add trend: mean pct_core at each organism count
if len(plot_data) > 10:
    bins = plot_data.groupby('n_organisms_any')['pct_core'].mean()
    ax.plot(bins.index, bins.values, 'ro-', markersize=6, linewidth=2, label='Mean')
ax.set_xlabel('# organisms with metal phenotype')
ax.set_ylabel('Pangenome conservation (% core)')
ax.set_title('Metal Family Breadth vs Conservation')
ax.legend()
plt.tight_layout()
fig.savefig(FIGURES_DIR / 'metal_family_conservation_heatmap.png', dpi=150, bbox_inches='tight')
plt.show()
print(f'Saved: figures/metal_family_conservation_heatmap.png')
Saved: figures/metal_family_conservation_heatmap.png
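The right-hand panel shows the trend only visually. One possible way to put a number on it, not done in the notebook itself, is a Spearman rank correlation between organism breadth and percent core. The sketch below computes it as the Pearson correlation of ranks on invented values; `scipy.stats.spearmanr` would be an alternative, since the notebook already imports `scipy.stats`.

```python
import pandas as pd

# Invented values standing in for plot_data; not results from the atlas.
toy = pd.DataFrame({
    "n_organisms_any": [1, 1, 2, 3, 5, 8, 12],
    "pct_core": [40.0, 55.0, 70.0, 80.0, 92.0, 97.0, 100.0],
})

# Spearman's rho equals the Pearson correlation of the (average) ranks.
rho = toy["n_organisms_any"].rank().corr(toy["pct_core"].rank())
print(f"Spearman rho on toy data: {rho:.3f}")
```

A rank-based measure is a reasonable choice here because `pct_core` is bounded at 100 and the relationship need not be linear.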
5. Save Results
In [7]:
# --- Issue #8: Characterize novel candidates further ---
print('Top 20 Novel Metal Biology Candidates (hypothetical + conserved):')
print('=' * 100)
novel_sorted = novel.sort_values('n_organisms_any', ascending=False)
for _, row in novel_sorted.head(20).iterrows():
    # Prefer the representative description, then SEED, then a placeholder
    desc = row.get('rep_desc', 'unknown')
    if pd.isna(desc):
        desc = row.get('seed_description', 'hypothetical')
    if pd.isna(desc):
        desc = 'hypothetical'
    ess = row.get('essentiality_class_x', 'unknown')
    pct = row.get('pct_core', 0)
    print(f' {row.OG_id:10s} {int(row.n_organisms_any):2d} orgs '
          f'{int(row.n_metals):2d} metals core={pct:.0f}% '
          f'ess={ess} [{str(desc)[:60]}]')
    print(f' metals: {row.metals}')
# Summarize novel candidates by functional hints
# Note: these are substring matches and the three groups can overlap (the bare
# 'ase' pattern also matches inside longer words), so the "Completely unknown"
# remainder below is an approximation rather than an exact count.
print(f'\nNovel candidate summary:')
print(f' Total: {len(novel)}')
duf_count = novel['rep_desc'].fillna('').str.contains('DUF', case=False).sum()
membrane_count = novel['rep_desc'].fillna('').str.contains('membrane|transport', case=False).sum()
enzyme_count = novel['rep_desc'].fillna('').str.contains('ase|enzyme|oxidoreductase|hydrolase', case=False).sum()
print(f' With DUF domain: {duf_count}')
print(f' Membrane/transport hint: {membrane_count}')
print(f' Enzyme hint: {enzyme_count}')
print(f' Completely unknown: {len(novel) - duf_count - membrane_count - enzyme_count}')
# Essentiality distribution of novel candidates
if 'essentiality_class_x' in novel.columns:
    print(f'\n Essentiality classes:')
    print(novel['essentiality_class_x'].value_counts().to_string())
# Save
conserved_with_data.to_csv(DATA_DIR / 'conserved_metal_families.csv', index=False)
novel.to_csv(DATA_DIR / 'novel_metal_candidates.csv', index=False)
print(f'\nSaved: data/conserved_metal_families.csv ({len(conserved_with_data)} families)')
print(f'Saved: data/novel_metal_candidates.csv ({len(novel)} hypothetical families)')
print('\n' + '=' * 80)
print('NB04 SUMMARY: Cross-Species Metal Fitness Families')
print('=' * 80)
print(f'Metal-important genes mapped to OGs: {n_mapped:,}')
print(f'Unique OGs with metal phenotype: {metal_og_mapped["OG_id"].nunique():,}')
print(f'Conserved families (>=2 organisms): {len(conserved_with_data):,}')
print(f'Conserved families (>=3 organisms): {len(conserved_any_3):,}')
print(f'Novel candidates (hypothetical + conserved): {len(novel):,}')
print(f' DUF-containing: {duf_count}, Membrane/transport: {membrane_count}, Enzyme: {enzyme_count}')
print('=' * 80)
Top 20 Novel Metal Biology Candidates (hypothetical + conserved):
====================================================================================================
OG01383 11 orgs 6 metals core=100% ess=variably_essential [YebC/PmpR transcriptional regulator]
metals: Aluminum,Chromium,Cobalt,Copper,Tungsten,Zinc
OG02233 8 orgs 4 metals core=92% ess=variably_essential [phospholipid transport system substrate-binding protein]
metals: Cobalt,Copper,Nickel,Zinc
OG02094 8 orgs 7 metals core=97% ess=variably_essential [Uncharacterized P-loop ATPase protein UPF0042]
metals: Aluminum,Cobalt,Copper,Molybdenum,Nickel,Tungsten,Zinc
OG00391 7 orgs 9 metals core=95% ess=variably_essential [Uncharacterized PLP-dependent aminotransferase YfdZ]
metals: Cobalt,Copper,Iron,Mercury,Molybdenum,Nickel,Selenium,Tungsten,Zinc
OG02716 7 orgs 4 metals core=92% ess=never_essential [Protein of unknown function (DUF3108).]
metals: Aluminum,Copper,Nickel,Zinc
OG03264 6 orgs 5 metals core=100% ess=variably_essential [DUF1043 family protein]
metals: Aluminum,Chromium,Copper,Nickel,Zinc
OG00660 6 orgs 3 metals core=97% ess=never_essential [membrane protein, putative]
metals: Aluminum,Copper,Zinc
OG03534 6 orgs 5 metals core=97% ess=variably_essential [ABC transporter permease]
metals: Aluminum,Cobalt,Copper,Nickel,Zinc
OG03229 5 orgs 3 metals core=100% ess=never_essential [Uncharacterized protein conserved in bacteria]
metals: Aluminum,Nickel,Zinc
OG00116 5 orgs 4 metals core=95% ess=variably_essential [YggS family pyridoxal phosphate enzyme]
metals: Cobalt,Copper,Molybdenum,Nickel
OG13545 5 orgs 3 metals core=50% ess=variably_essential [MYND finger.]
metals: Copper,Nickel,Zinc
OG02641 4 orgs 5 metals core=100% ess=variably_essential [Transcription factor jumonji, jmjC]
metals: Aluminum,Cobalt,Copper,Nickel,Zinc
OG03214 4 orgs 4 metals core=74% ess=never_essential [putative exported protein]
metals: Aluminum,Cobalt,Copper,Nickel
OG06816 4 orgs 1 metals core=90% ess=never_essential [membrane protein]
metals: Cobalt
OG04003 4 orgs 4 metals core=100% ess=variably_essential [outer membrane lipid asymmetry maintenance protein MlaD]
metals: Cobalt,Copper,Nickel,Zinc
OG02034 4 orgs 3 metals core=97% ess=variably_essential [Protein of unknown function DUF484]
metals: Cobalt,Nickel,Uranium
OG02399 4 orgs 2 metals core=92% ess=variably_essential [DUF4115 domain-containing protein]
metals: Aluminum,Zinc
OG00173 4 orgs 4 metals core=100% ess=variably_essential [Uncharacterized protein EC-HemY, likely associated with heme]
metals: Cobalt,Copper,Nickel,Zinc
OG01917 3 orgs 3 metals core=97% ess=variably_essential [UPF0301 protein YqgE]
metals: Aluminum,Cobalt,Copper
OG01015 3 orgs 7 metals core=96% ess=variably_essential [Uncharacterized conserved protein UCP030820]
metals: Aluminum,Cadmium,Chromium,Cobalt,Copper,Nickel,Zinc
Novel candidate summary:
Total: 149
With DUF domain: 28
Membrane/transport hint: 12
Enzyme hint: 27
Completely unknown: 82
Essentiality classes:
essentiality_class_x
never_essential 86
variably_essential 61
Saved: data/conserved_metal_families.csv (1182 families)
Saved: data/novel_metal_candidates.csv (149 hypothetical families)
================================================================================
NB04 SUMMARY: Cross-Species Metal Fitness Families
================================================================================
Metal-important genes mapped to OGs: 10,876
Unique OGs with metal phenotype: 2,891
Conserved families (>=2 organisms): 1,182
Conserved families (>=3 organisms): 601
Novel candidates (hypothetical + conserved): 149
DUF-containing: 28, Membrane/transport: 12, Enzyme: 27
================================================================================
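The DUF, membrane/transport, and enzyme counts in the summary come from independent substring matches, so one description can fall into several groups and the "Completely unknown" remainder is only approximate. The sketch below, on invented descriptions, contrasts that subtraction with a mutually exclusive assignment using `numpy.select`. The priority order (DUF, then membrane, then enzyme) is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Invented descriptions chosen to expose overlaps and the loose 'ase' pattern.
desc = pd.Series([
    "DUF3108 membrane protein",   # matches both DUF and membrane
    "ABC transporter permease",   # matches transport and 'ase'
    "phase-variable protein",     # 'ase' matches inside 'phase'
    "putative exported protein",  # matches nothing
])

is_duf = desc.str.contains("DUF", case=False)
is_membrane = desc.str.contains("membrane|transport", case=False)
is_enzyme = desc.str.contains("ase|enzyme|oxidoreductase|hydrolase", case=False)

# Subtraction as in the notebook: overlapping matches are counted more than once.
naive_unknown = int(len(desc) - is_duf.sum() - is_membrane.sum() - is_enzyme.sum())

# Exclusive assignment: the first matching condition wins.
category = np.select(
    [is_duf, is_membrane, is_enzyme],
    ["duf", "membrane", "enzyme"],
    default="unknown",
)
exclusive_unknown = int((category == "unknown").sum())

print(f"naive remainder: {naive_unknown}, exclusive unknown: {exclusive_unknown}")
```

On this toy input the subtraction even goes negative while the exclusive assignment reports one unmatched description, which is why the 82 "Completely unknown" figure above is best read as a lower bound on unmatched families.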