01 Database Overview
Jupyter notebook from the Web of Microbes Data Explorer project.
NB01: Web of Microbes — Database Overview & Fitness Browser Overlap¶
Characterize the kescience_webofmicrobes collection:
- Inventory: organisms, metabolites, environments, projects
- Categorize metabolites by chemical class
- Profile metabolite actions (consumed vs produced) per organism
- Match WoM organisms to Fitness Browser organisms
- For matched organisms: compare metabolite profiles to fitness data coverage
In [1]:
spark = get_spark_session()
import pandas as pd
import os
DATA_DIR = '../data'
FIG_DIR = '../figures'
os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(FIG_DIR, exist_ok=True)
1. Database Inventory¶
Quick census of all 5 tables.
In [2]:
for table in ['compound', 'environment', 'organism', 'project', 'observation']:
n = spark.sql(f"SELECT COUNT(*) as n FROM kescience_webofmicrobes.{table}").collect()[0]['n']
print(f" {table:15s} {n:>8,} rows")
compound 589 rows
environment 10 rows
organism 37 rows
project 5 rows
observation 10,744 rows
2. Organisms¶
Who's in the database? Categorize by type.
In [3]:
organisms_df = spark.sql("""
SELECT o.id, o.common_name, o.NCBI_taxid,
COUNT(obs.id) as n_observations,
SUM(CASE WHEN obs.action = 'D' THEN 1 ELSE 0 END) as n_decreased,
SUM(CASE WHEN obs.action = 'I' THEN 1 ELSE 0 END) as n_increased,
SUM(CASE WHEN obs.action = 'N' THEN 1 ELSE 0 END) as n_nochange,
SUM(CASE WHEN obs.action = 'E' THEN 1 ELSE 0 END) as n_excreted
FROM kescience_webofmicrobes.organism o
LEFT JOIN kescience_webofmicrobes.observation obs ON o.id = obs.organism_id
GROUP BY o.id, o.common_name, o.NCBI_taxid
ORDER BY n_observations DESC
""").toPandas()
# Categorize organisms
def categorize_organism(name):
if name == 'The Environment':
return 'Control'
elif name.startswith('Theoretical'):
return 'Theoretical auxotroph'
elif name.startswith('Native microbiome'):
return 'Native community'
elif 'Triculture' in name:
return 'Triculture time-series'
elif any(tag in name for tag in ['FW', 'GW']):
return 'ENIGMA isolate'
elif any(tag in name for tag in ['D1B', 'L1B', 'L2B']):
return 'Biocrust isolate'
else:
return 'Other isolate'
organisms_df['category'] = organisms_df['common_name'].apply(categorize_organism)
print("Organism categories:")
print(organisms_df.groupby('category').agg(
n_organisms=('id', 'count'),
total_observations=('n_observations', 'sum')
).sort_values('total_observations', ascending=False).to_string())
print(f"\nTotal: {len(organisms_df)} organisms, {organisms_df['n_observations'].sum():,} observations")
Organism categories:
n_organisms total_observations
category
Biocrust isolate 6 5604
Control 1 1765
Other isolate 4 1176
ENIGMA isolate 10 1050
Native community 2 934
Triculture time-series 10 199
Theoretical auxotroph 4 16
Total: 37 organisms, 10,744 observations
In [4]:
# Show experimental organisms (exclude controls and theoretical)
exp_orgs = organisms_df[organisms_df['category'].isin(['ENIGMA isolate', 'Biocrust isolate', 'Other isolate'])].copy()
exp_orgs['pct_increased'] = (exp_orgs['n_increased'] / exp_orgs['n_observations'] * 100).round(1)
exp_orgs['pct_decreased'] = (exp_orgs['n_decreased'] / exp_orgs['n_observations'] * 100).round(1)
print(f"Experimental organisms: {len(exp_orgs)}")
display(exp_orgs[['common_name', 'category', 'n_observations', 'n_decreased', 'n_increased', 'n_nochange', 'pct_increased', 'pct_decreased']].to_string(index=False))
Experimental organisms: 20
' common_name category n_observations n_decreased n_increased n_nochange pct_increased pct_decreased\n Arthrobacter sp. (D1B45) Biocrust isolate 934 0 110 766 11.8 0.0\n Modestobacter sp. (L1B44) Biocrust isolate 934 0 100 800 10.7 0.0\n Microcoleus vaginatus (PCC9802) Other isolate 934 0 92 483 9.9 0.0\n Bacillus sp. 2 (D1B51) Biocrust isolate 934 0 64 819 6.9 0.0\n Bacillus sp. 2 (L2B47) Biocrust isolate 934 0 81 814 8.7 0.0\n Methylobacterium sp. (D1B20) Biocrust isolate 934 0 66 842 7.1 0.0\n Bosea sp. (L1B56) Biocrust isolate 934 0 108 786 11.6 0.0\n Synechococcus sp. PCC7002 Other isolate 208 0 21 137 10.1 0.0\n Pseudomonas sp. (FW300-N2E3) ENIGMA isolate 105 0 31 47 29.5 0.0\n Pseudomonas sp. (FW300-N2F2) ENIGMA isolate 105 0 33 40 31.4 0.0\n Pseudomonas sp. (FW300-N2A2) ENIGMA isolate 105 0 26 47 24.8 0.0\n Acidovorax sp. (GW101-3E06) ENIGMA isolate 105 0 48 27 45.7 0.0\n Pseudomonas sp. (FW507-14TSA) ENIGMA isolate 105 0 44 28 41.9 0.0\n Phenylobacterium sp. (GW123-8A04) ENIGMA isolate 105 0 50 27 47.6 0.0\n Pseudomonas sp. (GW456-L15) ENIGMA isolate 105 0 44 36 41.9 0.0\n Pseudomonas sp. (GW456-L13) ENIGMA isolate 105 0 49 22 46.7 0.0\n Rhizobium sp. (GW101-3B10) ENIGMA isolate 105 0 49 31 46.7 0.0\n Bacillus sp. (FW507-8R2A) ENIGMA isolate 105 0 39 50 37.1 0.0\nZymomonas mobilis strain ZM4 (ATCC 31821) Other isolate 22 0 0 13 0.0 0.0\n Escherichia coli strain BW25113 Other isolate 12 0 0 8 0.0 0.0'
3. Environments (Growth Media)¶
In [5]:
env_df = spark.sql("""
SELECT e.id, e.env_name, e.method,
COUNT(DISTINCT obs.organism_id) as n_organisms,
COUNT(obs.id) as n_observations
FROM kescience_webofmicrobes.environment e
LEFT JOIN kescience_webofmicrobes.observation obs ON e.id = obs.environment_id
GROUP BY e.id, e.env_name, e.method
ORDER BY n_observations DESC
""").toPandas()
display(env_df.to_string(index=False))
' id env_name method n_organisms n_observations\n 4 BG11+MvExtract WoM 8 3736\n 3 BG11+6crustisoExtract WoM 8 3736\n 2 Biological Soil Crust Porewater (Green Butte Site, Collection: Rajeev 2013 ISME J. 7(11):2178) WoM 3 1401\n 10 R2A NA 11 1155\n 9 OE-MinimalGlucoseAA NA 11 219\n 6 A+:MEBM (1:1) NA 2 104\n 5 A+ minimal NA 2 104\n 7 A+ with yeast extract NA 2 104\n 8 A+ with Synechococcus extract NA 2 104\n 1 ZMMG NA 7 81'
4. Metabolites¶
Categorize the 589 compounds by chemical class.
In [6]:
compounds_df = spark.sql("""
SELECT c.*,
COUNT(CASE WHEN obs.action = 'D' THEN 1 END) as times_decreased,
COUNT(CASE WHEN obs.action = 'I' THEN 1 END) as times_increased,
COUNT(CASE WHEN obs.action = 'N' THEN 1 END) as times_nochange,
COUNT(DISTINCT obs.organism_id) as n_organisms_tested
FROM kescience_webofmicrobes.compound c
LEFT JOIN kescience_webofmicrobes.observation obs ON c.id = obs.compound_id
GROUP BY c.id, c.metabolite_atlas_id, c.formula, c.fragments, c.compound_name,
c.synonyms, c.masscharge, c.neutralmass, c.retention, c.protocol,
c.pubchem_id, c.inchi_key, c.smiles_string
ORDER BY (times_decreased + times_increased) DESC
""").toPandas()
# Categorize metabolites
amino_acids = {'glycine','alanine','valine','leucine','isoleucine','proline',
'phenylalanine','tryptophan','methionine','serine','threonine',
'cysteine','tyrosine','asparagine','glutamine','aspartate',
'glutamate','lysine','arginine','histidine'}
def classify_metabolite(name):
lower = name.lower()
if name.startswith('Unk_'):
return 'Unknown'
elif any(aa in lower for aa in amino_acids) or 'amino' in lower:
return 'Amino acid / derivative'
elif any(n in lower for n in ['adenine','adenosine','guanine','guanosine',
'cytosine','uracil','thymine','inosine','xanthine','hypoxanthine',
'nucleotide','nucleoside']):
return 'Nucleotide / base'
elif any(s in lower for s in ['glucose','fructose','sucrose','trehalose',
'hexose','galactose','ribose','xylose','maltose','dihexose','sugar']):
return 'Sugar'
elif any(v in lower for v in ['vitamin','thiamine','riboflavin','niacin',
'pantothen','biotin','folate','cobalamin','pyridox','nicotinamide']):
return 'Vitamin / cofactor'
elif any(o in lower for o in ['malate','succinate','citrate','fumarate',
'pyruvate','lactate','oxaloacetate','ketoglutarate']):
return 'Organic acid (TCA)'
elif any(l in lower for l in ['phospholipid','lipid','fatty','sterol',
'sphingo','ceramide']):
return 'Lipid'
else:
return 'Other identified'
compounds_df['category'] = compounds_df['compound_name'].apply(classify_metabolite)
cat_summary = compounds_df.groupby('category').agg(
n_compounds=('id', 'count'),
total_decreased=('times_decreased', 'sum'),
total_increased=('times_increased', 'sum')
).sort_values('n_compounds', ascending=False)
print("Metabolite categories:")
display(cat_summary.to_string())
print(f"\nTotal: {len(compounds_df)} metabolites")
print(f" With any change (D or I): {(compounds_df['times_decreased'] + compounds_df['times_increased'] > 0).sum()}")
print(f" Never changed: {(compounds_df['times_decreased'] + compounds_df['times_increased'] == 0).sum()}")
Metabolite categories:
' n_compounds total_decreased total_increased\ncategory \nUnknown 332 299 369\nOther identified 162 212 330\nAmino acid / derivative 52 133 400\nNucleotide / base 21 54 141\nSugar 6 11 37\nOrganic acid (TCA) 6 9 19\nVitamin / cofactor 6 12 34\nLipid 4 12 8'
Total: 589 metabolites With any change (D or I): 408 Never changed: 181
In [7]:
# Top 30 most active metabolites (most organisms show a change)
active = compounds_df[compounds_df['times_decreased'] + compounds_df['times_increased'] > 0].copy()
active['total_changes'] = active['times_decreased'] + active['times_increased']
active = active.sort_values('total_changes', ascending=False).head(30)
print("Top 30 most metabolically active compounds:")
print(f"{'Compound':40s} {'Category':25s} {'Decreased':>10s} {'Increased':>10s} {'Total':>8s}")
print('-' * 95)
for _, row in active.iterrows():
print(f"{row['compound_name'][:40]:40s} {row['category']:25s} {row['times_decreased']:>10d} {row['times_increased']:>10d} {row['total_changes']:>8d}")
Top 30 most metabolically active compounds: Compound Category Decreased Increased Total ----------------------------------------------------------------------------------------------- glutamine Amino acid / derivative 8 28 36 proline Amino acid / derivative 8 27 35 phenylalanine Amino acid / derivative 8 26 34 glutamate Amino acid / derivative 7 25 32 Adenine Nucleotide / base 7 22 29 methionine Amino acid / derivative 6 20 26 Adenosine Nucleotide / base 6 18 24 aspartate Amino acid / derivative 4 19 23 tryptophan Amino acid / derivative 6 17 23 isoleucine Amino acid / derivative 5 17 22 alanine Amino acid / derivative 5 17 22 arginine Amino acid / derivative 5 17 22 histidine Amino acid / derivative 5 16 21 Guanine Nucleotide / base 4 15 19 tyrosine Amino acid / derivative 5 14 19 Unk_331.2834p_0n_4.1_RB_BiocrustExp Unknown 3 16 19 Hypoxanthine Nucleotide / base 5 13 18 nicotinamide Vitamin / cofactor 5 13 18 valine Amino acid / derivative 5 13 18 citrulline Other identified 4 14 18 Dihexose Sugar 3 14 17 Uracil Nucleotide / base 5 12 17 Malate Organic acid (TCA) 3 13 16 lysine Amino acid / derivative 5 10 15 glycine Amino acid / derivative 2 13 15 Xanthine Nucleotide / base 4 10 14 Leucine/Isoleucine Amino acid / derivative 3 10 13 glucose Sugar 3 10 13 guanosine Nucleotide / base 3 10 13 Unk_359.3152p_0n_4.1_RB_BiocrustExp Unknown 3 10 13
5. Fitness Browser Organism Overlap¶
The critical question: which WoM organisms are also in the Fitness Browser?
In [8]:
# Get all FB organisms
fb_orgs = spark.sql("""
SELECT orgId, genus, species, strain, taxonomyId
FROM kescience_fitnessbrowser.organism
""").toPandas()
print(f"Fitness Browser organisms: {len(fb_orgs)}")
print(f"WoM organisms: {len(organisms_df)}")
print()
# Try matching WoM ENIGMA isolates to FB by strain ID
import re
matches = []
wom_enigma = organisms_df[organisms_df['category'] == 'ENIGMA isolate'].copy()
for _, wom_row in wom_enigma.iterrows():
# Extract strain ID from parentheses, e.g. "Pseudomonas sp. (FW300-N2E3)" -> "FW300-N2E3"
m = re.search(r'\(([^)]+)\)', wom_row['common_name'])
if not m:
continue
strain_id = m.group(1)
# Normalize: remove hyphens, lowercase
strain_norm = strain_id.replace('-', '').replace('_', '').lower()
for _, fb_row in fb_orgs.iterrows():
fb_strain_norm = str(fb_row['strain']).replace('-', '').replace('_', '').lower()
fb_org_norm = str(fb_row['orgId']).replace('-', '').replace('_', '').lower()
if strain_norm in fb_strain_norm or strain_norm in fb_org_norm:
matches.append({
'wom_name': wom_row['common_name'],
'wom_strain': strain_id,
'fb_orgId': fb_row['orgId'],
'fb_genus': fb_row['genus'],
'fb_species': fb_row['species'],
'fb_strain': fb_row['strain'],
})
if matches:
match_df = pd.DataFrame(matches)
print(f"\nDirect matches found: {len(match_df)}")
display(match_df.to_string(index=False))
else:
print("\nNo direct strain-ID matches. Trying genus-level matching...")
# Fall back to genus-level
for _, wom_row in wom_enigma.iterrows():
genus = wom_row['common_name'].split()[0]
genus_matches = fb_orgs[fb_orgs['genus'].str.lower() == genus.lower()]
if not genus_matches.empty:
for _, fb_row in genus_matches.iterrows():
matches.append({
'wom_name': wom_row['common_name'],
'fb_orgId': fb_row['orgId'],
'fb_strain': fb_row['strain'],
'match_type': 'genus'
})
if matches:
match_df = pd.DataFrame(matches)
print(f"Genus-level matches: {len(match_df)}")
display(match_df.to_string(index=False))
else:
print("No matches found at any level.")
Fitness Browser organisms: 48 WoM organisms: 37 Direct matches found: 2
' wom_name wom_strain fb_orgId fb_genus fb_species fb_strain\nPseudomonas sp. (FW300-N2E3) FW300-N2E3 pseudo3_N2E3 Pseudomonas fluorescens FW300-N2E3\n Pseudomonas sp. (GW456-L13) GW456-L13 pseudo13_GW456_L13 Pseudomonas fluorescens GW456-L13'
In [9]:
# Also check non-ENIGMA organisms (E. coli, Synechococcus, Zymomonas)
print("Checking all WoM organisms for FB matches (broader search):")
print()
other_orgs = organisms_df[organisms_df['category'].isin(['Other isolate'])]
for _, wom_row in other_orgs.iterrows():
name = wom_row['common_name']
# Try matching genus
genus = name.split()[0]
genus_matches = fb_orgs[fb_orgs['genus'].str.lower() == genus.lower()]
if not genus_matches.empty:
print(f" WoM: {name}")
for _, fb_row in genus_matches.iterrows():
print(f" FB match: {fb_row['orgId']} ({fb_row['genus']} {fb_row['species']} {fb_row['strain']})")
print()
Checking all WoM organisms for FB matches (broader search):
WoM: Synechococcus sp. PCC7002
FB match: SynE (Synechococcus elongatus PCC 7942)
WoM: Escherichia coli strain BW25113
FB match: Keio (Escherichia coli BW25113)
6. Projects & Publications¶
In [10]:
projects_df = spark.sql("""
SELECT p.*, COUNT(DISTINCT obs.organism_id) as n_organisms,
COUNT(DISTINCT obs.compound_id) as n_metabolites,
COUNT(obs.id) as n_observations
FROM kescience_webofmicrobes.project p
LEFT JOIN kescience_webofmicrobes.observation obs ON p.id = obs.project_id
GROUP BY p.id, p.contributor, p.project_description, p.project_name
ORDER BY n_observations DESC
""").toPandas()
display(projects_df.to_string(index=False))
' id contributor project_description project_name n_organisms n_metabolites n_observations\n 2 Richard Baran 10.1038/ncomms9289 EarlyCareer_RB_BioSoilCrustCommunity2015 10 467 8873\n 5 Suzie Kosina WoM manuscript - submitted ENIGMA_SK_BEMC2016 11 105 1155\n 3 Richard Baran 10.1039/c1mb05196b ENIGMA_RB_Synechococcus2011 2 52 416\n 4 Onur Erbilgin 10.1186/s12859-017-1478-2 ENIGMA_OE_TricultureModeling2015 11 20 219\n 1 Suzie Kosina 10.1021/acssynbio.5b00236 ENIGMA_SK_SyntheticMutualism2015 7 32 81'
7. Action Distribution by Organism Category¶
In [11]:
# Merge organism categories into observation data
obs_summary = spark.sql("""
SELECT o.common_name, obs.action,
e.env_name,
COUNT(*) as n
FROM kescience_webofmicrobes.observation obs
JOIN kescience_webofmicrobes.organism o ON obs.organism_id = o.id
JOIN kescience_webofmicrobes.environment e ON obs.environment_id = e.id
WHERE o.common_name != 'The Environment'
AND o.common_name NOT LIKE 'Theoretical%'
GROUP BY o.common_name, obs.action, e.env_name
ORDER BY o.common_name, obs.action
""").toPandas()
# Pivot: organism × action
pivot = obs_summary.groupby(['common_name', 'action'])['n'].sum().unstack(fill_value=0)
pivot['total'] = pivot.sum(axis=1)
pivot = pivot.sort_values('total', ascending=False)
print("Observations by organism and action type:")
display(pivot.to_string())
Observations by organism and action type:
'action E I N total\ncommon_name \nArthrobacter sp. (D1B45) 58 110 766 934\nBacillus sp. 2 (D1B51) 51 64 819 934\nBosea sp. (L1B56) 40 108 786 934\nBacillus sp. 2 (L2B47) 39 81 814 934\nMethylobacterium sp. (D1B20) 26 66 842 934\nMicrocoleus vaginatus (PCC9802) 359 92 483 934\nModestobacter sp. (L1B44) 34 100 800 934\nNative microbiome: 018min Incubation 85 72 310 467\nNative microbiome: 540min Incubation 105 56 306 467\nSynechococcus sp. PCC7002 50 21 137 208\nAcidovorax sp. (GW101-3E06) 30 48 27 105\nPhenylobacterium sp. (GW123-8A04) 28 50 27 105\nPseudomonas sp. (FW300-N2E3) 27 31 47 105\nPseudomonas sp. (FW300-N2A2) 32 26 47 105\nPseudomonas sp. (FW300-N2F2) 32 33 40 105\nBacillus sp. (FW507-8R2A) 16 39 50 105\nRhizobium sp. (GW101-3B10) 25 49 31 105\nPseudomonas sp. (FW507-14TSA) 33 44 28 105\nPseudomonas sp. (GW456-L15) 25 44 36 105\nPseudomonas sp. (GW456-L13) 34 49 22 105\nZymomonas mobilis strain ZM4 (ATCC 31821) 9 0 13 22\nt2 Triculture: Bacillus cereus, Pseudomonas lini and Pseudomonas baetica 0 3 17 20\nt0 Triculture: Bacillus cereus, Pseudomonas lini and Pseudomonas baetica 0 1 19 20\nt1 Triculture: Bacillus cereus, Pseudomonas lini and Pseudomonas baetica 0 13 7 20\nt8 Triculture: Bacillus cereus, Pseudomonas lini and Pseudomonas baetica 0 20 0 20\nt3 Triculture: Bacillus cereus, Pseudomonas lini and Pseudomonas baetica 1 16 3 20\nt5 Triculture: Bacillus cereus, Pseudomonas lini and Pseudomonas baetica 0 19 1 20\nt6 Triculture: Bacillus cereus, Pseudomonas lini and Pseudomonas baetica 0 20 0 20\nt9 Triculture: Bacillus cereus, Pseudomonas lini and Pseudomonas baetica 0 20 0 20\nt7 Triculture: Bacillus cereus, Pseudomonas lini and Pseudomonas baetica 0 20 0 20\nt4 Triculture: Bacillus cereus, Pseudomonas lini and Pseudomonas baetica 0 19 0 19\nEscherichia coli strain BW25113 4 0 8 12'
9. Save Outputs¶
In [12]:
organisms_df.to_csv(f'{DATA_DIR}/wom_organisms.csv', index=False)
compounds_df.to_csv(f'{DATA_DIR}/wom_compounds.csv', index=False)
env_df.to_csv(f'{DATA_DIR}/wom_environments.csv', index=False)
projects_df.to_csv(f'{DATA_DIR}/wom_projects.csv', index=False)
if matches:
match_df.to_csv(f'{DATA_DIR}/fb_overlap.csv', index=False)
print(f"Saved FB overlap: {len(match_df)} matches")
print(f"Saved {len(organisms_df)} organisms, {len(compounds_df)} compounds, {len(env_df)} environments, {len(projects_df)} projects")
Saved FB overlap: 2 matches Saved 37 organisms, 589 compounds, 10 environments, 5 projects
In [13]:
print("="*60)
print("NB01 SUMMARY")
print("="*60)
print(f"Organisms: {len(organisms_df)} total, {len(exp_orgs)} experimental")
print(f" ENIGMA: {len(wom_enigma)} groundwater isolates")
print(f" Biocrust: {(organisms_df['category'] == 'Biocrust isolate').sum()} isolates")
print(f"Metabolites: {len(compounds_df)} total")
n_active = (compounds_df['times_decreased'] + compounds_df['times_increased'] > 0).sum()
print(f" Active: {n_active} ({n_active/len(compounds_df)*100:.1f}%) show at least 1 change")
n_unknown = (compounds_df['category'] == 'Unknown').sum()
print(f" Unknown: {n_unknown} ({n_unknown/len(compounds_df)*100:.1f}%)")
print(f"Environments: {len(env_df)}")
print(f"Projects: {len(projects_df)} (all ENIGMA-related)")
print(f"Observations: {organisms_df['n_observations'].sum():,}")
print(f"FB overlap: {len(matches)} organism matches")
============================================================ NB01 SUMMARY ============================================================ Organisms: 37 total, 20 experimental ENIGMA: 10 groundwater isolates Biocrust: 6 isolates Metabolites: 589 total Active: 408 (69.3%) show at least 1 change Unknown: 332 (56.4%) Environments: 10 Projects: 5 (all ENIGMA-related) Observations: 10,744 FB overlap: 2 organism matches
8. Visualization: ENIGMA Isolate Metabolite Profiles¶
Heatmap of metabolite actions for the 10 ENIGMA groundwater isolates in R2A medium.
In [14]:
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import numpy as np
# Build ENIGMA isolate x metabolite matrix from the observation data
enigma_obs = spark.sql("""
SELECT o.common_name, c.compound_name, obs.action
FROM kescience_webofmicrobes.observation obs
JOIN kescience_webofmicrobes.organism o ON obs.organism_id = o.id
JOIN kescience_webofmicrobes.compound c ON obs.compound_id = c.id
JOIN kescience_webofmicrobes.environment e ON obs.environment_id = e.id
WHERE e.env_name = 'R2A'
AND o.common_name != 'The Environment'
AND obs.action IN ('I', 'E', 'N')
""").toPandas()
# Encode actions numerically: E=2 (emerged), I=1 (increased), N=0 (no change)
action_map = {'E': 2, 'I': 1, 'N': 0}
enigma_obs['action_num'] = enigma_obs['action'].map(action_map)
# Pivot to organism x metabolite matrix
matrix = enigma_obs.pivot_table(
index='common_name', columns='compound_name',
values='action_num', aggfunc='first'
).fillna(0)
# Shorten organism names for display
matrix.index = [n.replace('Pseudomonas sp. ', 'P. ').replace('Acidovorax sp. ', 'Acid. ')
.replace('Phenylobacterium sp. ', 'Phen. ').replace('Rhizobium sp. ', 'Rhiz. ')
.replace('Bacillus sp. ', 'Bac. ').replace('(', '').replace(')', '')
for n in matrix.index]
# Filter to metabolites with at least 1 change
active_cols = matrix.columns[matrix.sum(axis=0) > 0]
matrix_active = matrix[active_cols]
# Sort organisms by total activity and metabolites by frequency
org_order = matrix_active.sum(axis=1).sort_values(ascending=False).index
met_order = matrix_active.sum(axis=0).sort_values(ascending=False).index
matrix_active = matrix_active.loc[org_order, met_order]
# Create heatmap
fig, ax = plt.subplots(figsize=(16, 5))
cmap = matplotlib.colors.ListedColormap(['#f0f0f0', '#4393c3', '#d6604d'])
im = ax.imshow(matrix_active.values, aspect='auto', cmap=cmap, vmin=0, vmax=2)
ax.set_yticks(range(len(matrix_active.index)))
ax.set_yticklabels(matrix_active.index, fontsize=9)
ax.set_xlabel(f'Metabolites ({len(active_cols)} with changes)', fontsize=10)
ax.set_title('ENIGMA Isolate Metabolite Profiles in R2A Medium', fontsize=12, fontweight='bold')
# Remove x tick labels (too many metabolites)
ax.set_xticks([])
# Legend
from matplotlib.patches import Patch
legend_elements = [
Patch(facecolor='#f0f0f0', edgecolor='gray', label='No change'),
Patch(facecolor='#4393c3', edgecolor='gray', label='Increased (I)'),
Patch(facecolor='#d6604d', edgecolor='gray', label='Emerged (E)'),
]
ax.legend(handles=legend_elements, loc='lower right', fontsize=9)
plt.tight_layout()
plt.savefig(f'{FIG_DIR}/enigma_metabolite_heatmap.png', dpi=150, bbox_inches='tight')
plt.show()
print(f"Saved: {FIG_DIR}/enigma_metabolite_heatmap.png")
print(f"Matrix: {matrix_active.shape[0]} organisms x {matrix_active.shape[1]} active metabolites")
Saved: ../figures/enigma_metabolite_heatmap.png Matrix: 10 organisms x 96 active metabolites
In [15]:
spark.stop()