02 Cross Collection Links
Jupyter notebook from the Web of Microbes Data Explorer project.
NB02: Cross-Collection Linking — WoM ↔ Fitness Browser ↔ Pangenome ↔ ModelSEED¶
Now that we understand WoM's contents (NB01), test how well it connects to other BERDL collections:
- WoM ↔ Fitness Browser: For the 2 direct-match organisms (pseudo3_N2E3, pseudo13_GW456_L13), compare metabolite profiles to fitness conditions
- WoM ↔ ModelSEED: Match WoM compound names to ModelSEED compound database
- WoM ↔ GapMind: For matched organisms, check if produced metabolites correspond to predicted pathway products
- WoM ↔ Pangenome: Check if WoM organisms have pangenome species matches
In [1]:
spark = get_spark_session()
import pandas as pd
import os
DATA_DIR = '../data'
FIG_DIR = '../figures'
os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(FIG_DIR, exist_ok=True)
1. WoM ↔ Fitness Browser: Matched Organism Deep Dive¶
Two direct strain matches from NB01:
- Pseudomonas sp. FW300-N2E3 →
pseudo3_N2E3 - Pseudomonas sp. GW456-L13 →
pseudo13_GW456_L13
Plus two genus-level matches:
- E. coli BW25113 →
Keio(same strain) - Synechococcus PCC7002 →
SynE(different strain: PCC 7942)
In [2]:
# Scale of FB data for matched organisms
matched_orgs = ['pseudo3_N2E3', 'pseudo13_GW456_L13', 'Keio', 'SynE']
print("Fitness Browser data scale for WoM-matched organisms:")
print(f"{'orgId':25s} {'Genes':>8s} {'Experiments':>12s}")
print('-' * 50)
for org in matched_orgs:
n_genes = spark.sql(f"SELECT COUNT(*) as n FROM kescience_fitnessbrowser.gene WHERE orgId = '{org}'").collect()[0]['n']
n_exps = spark.sql(f"SELECT COUNT(DISTINCT expName) as n FROM kescience_fitnessbrowser.experiment WHERE orgId = '{org}'").collect()[0]['n']
print(f"{org:25s} {n_genes:>8,} {n_exps:>12,}")
Fitness Browser data scale for WoM-matched organisms: orgId Genes Experiments --------------------------------------------------
pseudo3_N2E3 5,854 211
pseudo13_GW456_L13 5,243 106
Keio 4,610 168
SynE 2,722 129
In [3]:
# What conditions were tested in FB for the Pseudomonas matches?
for org in ['pseudo3_N2E3', 'pseudo13_GW456_L13']:
print(f"\n{'='*60}")
print(f"FB conditions for {org}:")
print('='*60)
spark.sql(f"""
SELECT condition_1, expGroup, COUNT(*) as n_exps
FROM kescience_fitnessbrowser.experiment
WHERE orgId = '{org}'
GROUP BY condition_1, expGroup
ORDER BY n_exps DESC
""").show(30, truncate=False)
============================================================ FB conditions for pseudo3_N2E3: ============================================================
+------------------------------------------+---------------+------+ |condition_1 |expGroup |n_exps| +------------------------------------------+---------------+------+ |NULL |pH |7 | |NULL |temperature |6 | |L-Citrulline |carbon source |5 | |Thallium(I) acetate |stress |4 | |Cobalt chloride hexahydrate |stress |4 | |L-Ornithine |carbon source |4 | |D-Glucose |carbon source |4 | |NULL |lb |4 | |Furfuryl Alcohol |stress |3 | |NULL |survival |3 | |Spectinomycin dihydrochloride pentahydrate|stress |3 | |Agar |motility |3 | |Sodium acetate |stress |3 | |Cholin acetate |stress |3 | |1-ethyl-3-methylimidazolium acetate |stress |3 | |Potassium acetate |carbon source |2 | |D-Fructose |carbon source |2 | |Sodium D-Lactate |carbon source |2 | |L-Glutamine |carbon source |2 | |D-Gluconic Acid sodium salt |carbon source |2 | |L-Malic acid disodium salt monohydrate |carbon source |2 | |L-Histidine |nitrogen source|2 | |Tetracycline hydrochloride |stress |2 | |Sodium Fumarate dibasic |carbon source |2 | |L-Arginine |nitrogen source|2 | |Ethanol |carbon source |2 | |L-Lysine |nitrogen source|2 | |m-Inositol |carbon source |2 | |L-Histidine |carbon source |2 | |D-Ribose |carbon source |2 | +------------------------------------------+---------------+------+ only showing top 30 rows ============================================================ FB conditions for pseudo13_GW456_L13: ============================================================
+-----------------------------------------+---------------+------+ |condition_1 |expGroup |n_exps| +-----------------------------------------+---------------+------+ |Thallium(I) acetate |stress |3 | |Vancomycin Hydrochloride Hydrate |stress |3 | |Fusidic acid sodium salt |stress |2 | |D-Fructose |carbon source |2 | |L-Glutamine |carbon source |2 | |D-Glucosamine Hydrochloride |carbon source |2 | |L-Asparagine |carbon source |2 | |L-Histidine |carbon source |2 | |Tween 20 |carbon source |2 | |L-Valine |carbon source |2 | |D-Galactose |carbon source |2 | |Sodium D-Lactate |carbon source |2 | |D-Gluconic Acid sodium salt |carbon source |2 | |D-Serine |carbon source |2 | |L-Malic acid disodium salt monohydrate |carbon source |2 | |a-Ketoglutaric acid disodium salt hydrate|carbon source |2 | |L-Phenylalanine |carbon source |2 | |Bacitracin |stress |2 | |L-Alanine |carbon source |2 | |Glycerol |carbon source |2 | |casamino acids |carbon source |2 | |L-Arabinose |carbon source |2 | |D-Alanine |carbon source |2 | |Aluminum chloride hydrate |stress |2 | |Sodium nitrite |stress |2 | |L-Tryptophan |carbon source |2 | |D-Mannose |carbon source |2 | |Ammonium chloride |nitrogen source|1 | |L-Alanine |nitrogen source|1 | |L-Arginine |nitrogen source|1 | +-----------------------------------------+---------------+------+ only showing top 30 rows
In [4]:
# Key question: Do any FB conditions test the SAME metabolites that WoM detected?
# Get WoM metabolites produced by pseudo3_N2E3
wom_n2e3 = spark.sql("""
SELECT c.compound_name, obs.action
FROM kescience_webofmicrobes.observation obs
JOIN kescience_webofmicrobes.organism o ON obs.organism_id = o.id
JOIN kescience_webofmicrobes.compound c ON obs.compound_id = c.id
WHERE o.common_name = 'Pseudomonas sp. (FW300-N2E3)'
AND obs.action IN ('I', 'E')
""").toPandas()
# Get FB conditions for pseudo3_N2E3
fb_n2e3_conditions = spark.sql("""
SELECT DISTINCT condition_1, condition_2, expGroup
FROM kescience_fitnessbrowser.experiment
WHERE orgId = 'pseudo3_N2E3'
""").toPandas()
print(f"WoM metabolites produced by FW300-N2E3: {len(wom_n2e3)}")
print(f"FB unique conditions for pseudo3_N2E3: {len(fb_n2e3_conditions)}")
# Curated matching: map WoM metabolite names to FB condition names
# Only count matches where the metabolite IS the condition (not substring noise)
curated_map = {
'alanine': ['L-Alanine', 'D-Alanine'],
'arginine': ['L-Arginine'],
'glycine': ['Glycine'],
'lactate': ['Sodium D-Lactate', 'Sodium L-Lactate', 'Sodium D,L-Lactate'],
'proline': ['L-Proline'],
'phenylalanine': ['L-Phenylalanine'],
'tryptophan': ['L-Tryptophan'],
'tyrosine': ['L-Tyrosine disodium salt'],
'valine': ['L-Valine'],
'lysine': ['L-Lysine'],
'threonine': ['L-Threonine'],
'trehalose': ['D-Trehalose dihydrate'],
'adenine': ['Adenine hydrochloride hydrate'],
'adenosine': ['Adenosine'],
'inosine': ['Inosine'],
'thymine': ['Thymine'],
'nicotinamide': ['Nicotinamide'],
'carnitine': ['Carnitine hydrochloride'],
'malate': ['L-Malic acid disodium salt monohydrate'],
}
fb_cond1_set = set(fb_n2e3_conditions['condition_1'].dropna().str.strip())
wom_metabolite_set = set(wom_n2e3['compound_name'].str.lower())
overlaps = []
for wom_key, fb_conds in curated_map.items():
# Check if WoM metabolite is actually produced
wom_rows = wom_n2e3[wom_n2e3['compound_name'].str.lower().str.contains(wom_key, na=False)]
if len(wom_rows) == 0:
continue
wom_name = wom_rows.iloc[0]['compound_name']
wom_action = wom_rows.iloc[0]['action']
# Check if any FB condition matches
for fb_cond in fb_conds:
if fb_cond in fb_cond1_set:
fb_rows = fb_n2e3_conditions[fb_n2e3_conditions['condition_1'] == fb_cond]
fb_group = fb_rows.iloc[0]['expGroup'] if len(fb_rows) > 0 else 'unknown'
overlaps.append({
'wom_metabolite': wom_name,
'wom_action': wom_action,
'fb_condition': fb_cond,
'fb_group': fb_group
})
overlap_df = pd.DataFrame(overlaps)
print(f"\nCurated WoM-metabolite <-> FB-condition matches: {len(overlap_df)}")
print(f"\n{'WoM Metabolite':45s} {'Action':>7s} {'FB Condition':35s} {'FB Type'}")
print('-' * 100)
for _, row in overlap_df.iterrows():
print(f"{row['wom_metabolite']:45s} {row['wom_action']:>7s} {row['fb_condition']:35s} {row['fb_group']}")
WoM metabolites produced by FW300-N2E3: 58 FB unique conditions for pseudo3_N2E3: 123 Curated WoM-metabolite <-> FB-condition matches: 19 WoM Metabolite Action FB Condition FB Type ---------------------------------------------------------------------------------------------------- alanine I L-Alanine carbon source alanine I D-Alanine nitrogen source arginine I L-Arginine nitrogen source glycine I Glycine nitrogen source lactate E Sodium D-Lactate carbon source lactate E Sodium L-Lactate carbon source lactate E Sodium D,L-Lactate carbon source proline I L-Proline carbon source phenylalanine I L-Phenylalanine carbon source tryptophan I L-Tryptophan nitrogen source valine E L-Valine carbon source lysine E L-Lysine nitrogen source threonine isomers (coeluters: threonine, homoserine, allothreonine) I L-Threonine nitrogen source trehalose I D-Trehalose dihydrate carbon source Adenine I Adenine hydrochloride hydrate nitrogen source Adenosine I Adenosine nitrogen source inosine I Inosine nitrogen source thymine E Thymine nitrogen source Malate I L-Malic acid disodium salt monohydrate carbon source
2. WoM ↔ ModelSEED Biochemistry: Compound Name Matching¶
Match WoM compound names to kbase_msd_biochemistry.compound to assess
how many WoM metabolites have reaction-level annotations.
In [5]:
# Check what columns ModelSEED molecule table has
spark.sql("DESCRIBE kbase_msd_biochemistry.molecule").show(30, truncate=False)
+------------+---------+-------+ |col_name |data_type|comment| +------------+---------+-------+ |abbreviation|string |NULL | |charge |int |NULL | |deltag |float |NULL | |deltagerr |float |NULL | |formula |string |NULL | |id |string |NULL | |inchikey |string |NULL | |mass |float |NULL | |name |string |NULL | |pka |string |NULL | |pkb |string |NULL | |smiles |string |NULL | |source |string |NULL | +------------+---------+-------+
In [6]:
# Match WoM compounds to ModelSEED by name
# First, get WoM compounds (excluding unknowns)
wom_compounds = spark.sql("""
SELECT id, compound_name, formula
FROM kescience_webofmicrobes.compound
WHERE compound_name NOT LIKE 'Unk_%'
""").toPandas()
# Get ModelSEED molecules
ms_compounds = spark.sql("""
SELECT id as ms_id, name as ms_name, formula as ms_formula, inchikey as ms_inchikey
FROM kbase_msd_biochemistry.molecule
""").toPandas()
print(f"WoM identified compounds: {len(wom_compounds)}")
print(f"ModelSEED molecules: {len(ms_compounds):,}")
# Exact name match (case-insensitive)
wom_compounds['name_lower'] = wom_compounds['compound_name'].str.lower().str.strip()
ms_compounds['name_lower'] = ms_compounds['ms_name'].str.lower().str.strip()
exact_matches = wom_compounds.merge(ms_compounds, on='name_lower', how='inner')
n_exact_wom = exact_matches['compound_name'].nunique()
print(f"\nExact name matches: {n_exact_wom} WoM compounds (mapping to {len(exact_matches)} MS molecules)")
# Show matches
if len(exact_matches) > 0:
# Deduplicate to one MS match per WoM compound
deduped = exact_matches.groupby('compound_name').first().reset_index()
print(f"\n{'WoM Name':35s} {'ModelSEED Name':35s} {'MS ID':25s}")
print('-' * 95)
for _, row in deduped.sort_values('compound_name').head(30).iterrows():
print(f"{row['compound_name'][:35]:35s} {str(row['ms_name'])[:35]:35s} {str(row['ms_id']):25s}")
WoM identified compounds: 257 ModelSEED molecules: 45,708 Exact name matches: 69 WoM compounds (mapping to 75 MS molecules) WoM Name ModelSEED Name MS ID ----------------------------------------------------------------------------------------------- 2,3-dihydroxybenzoate 2,3-Dihydroxybenzoate seed.compound:cpd00168 2-isopropylmalate 2-Isopropylmalate seed.compound:cpd01646 3-hydroxybenzoate 3-Hydroxybenzoate seed.compound:cpd00456 4-acetamidobutanoate 4-Acetamidobutanoate seed.compound:cpd01889 4-guanidinobutanoate 4-Guanidinobutanoate seed.compound:cpd00762 4-hydroxy-L-proline 4-hydroxy-L-proline seed.compound:cpd29747 4-hydroxyphenylacetate 4-Hydroxyphenylacetate seed.compound:cpd00489 5'-deoxyadenosine 5'-Deoxyadenosine seed.compound:cpd03091 5-aminopentanoate 5-Aminopentanoate seed.compound:cpd00339 5-hydroxylysine 5-Hydroxylysine seed.compound:cpd00889 6-hydroxynicotinate 6-Hydroxynicotinate seed.compound:cpd00752 AMP AMP seed.compound:cpd00018 Adenine Adenine seed.compound:cpd00128 Adenosine Adenosine seed.compound:cpd00182 Ala-Ala Ala-Ala seed.compound:cpd34361 Cytosine Cytosine seed.compound:cpd00307 Gly-Phe Gly-Phe seed.compound:cpd35679 Glycerophosphocholine Glycerophosphocholine seed.compound:cpd00507 Guanine Guanine seed.compound:cpd00207 N-acetylputrescine N-Acetylputrescine seed.compound:cpd01758 Oligosaccharide Oligosaccharide seed.compound:cpd11781 Succinate Succinate seed.compound:cpd00036 Uracil Uracil seed.compound:cpd00092 alanine Alanine seed.compound:cpd01003 alpha-aminoadipate alpha-aminoadipate seed.compound:cpd29948 anthranilate Anthranilate seed.compound:cpd00093 asparagine Asparagine seed.compound:cpd15142 aspartate Aspartate seed.compound:cpd19181 carnitine Carnitine seed.compound:cpd00266 carnosine Carnosine seed.compound:cpd00310
In [7]:
# Try formula-based matching for compounds that didn't match by name
unmatched = wom_compounds[~wom_compounds['name_lower'].isin(exact_matches['name_lower'])].copy()
unmatched = unmatched[unmatched['formula'].notna() & (unmatched['formula'] != '')].copy()
# Formula match
ms_with_formula = ms_compounds[ms_compounds['ms_formula'].notna() & (ms_compounds['ms_formula'] != '')].copy()
formula_matches = unmatched.merge(ms_with_formula, left_on='formula', right_on='ms_formula', how='inner')
# Multiple MS compounds may match same formula - count unique WoM compounds
n_wom_formula_matched = formula_matches['compound_name'].nunique()
print(f"Additional formula-only matches: {n_wom_formula_matched} WoM compounds")
print(f" (mapping to {len(formula_matches)} ModelSEED molecules — multiple MS per formula)")
# Show sample of formula matches
if len(formula_matches) > 0:
sample = formula_matches.groupby('compound_name').first().reset_index()
print(f"\nSample formula matches (first MS match per WoM compound, first 20):")
print(f"{'WoM Name':35s} {'Formula':12s} {'MS Name (first match)':35s}")
print('-' * 85)
for _, row in sample.head(20).iterrows():
print(f"{row['compound_name'][:35]:35s} {str(row['formula']):12s} {str(row['ms_name'])[:35]:35s}")
# Summary
total_wom = len(wom_compounds)
total_matched = n_exact_wom + n_wom_formula_matched
print(f"\n{'='*60}")
print(f"ModelSEED matching summary:")
print(f" WoM identified compounds: {total_wom}")
print(f" Exact name matches: {n_exact_wom} ({n_exact_wom/total_wom*100:.1f}%)")
print(f" Formula-only matches: {n_wom_formula_matched} ({n_wom_formula_matched/total_wom*100:.1f}%)")
print(f" Total matched: {total_matched} ({total_matched/total_wom*100:.1f}%)")
print(f" Unmatched: {total_wom - total_matched} ({(total_wom - total_matched)/total_wom*100:.1f}%)")
Additional formula-only matches: 107 WoM compounds (mapping to 900 ModelSEED molecules — multiple MS per formula) Sample formula matches (first MS match per WoM compound, first 20): WoM Name Formula MS Name (first match) ------------------------------------------------------------------------------------- 2'-deoxyadenosine C10H13N5O3 Deoxyadenosine 2-methylmaleate C5H6O4 methylsuccinate 3',5'-cyclic AMP C10H12N5O6P 3'-dAMP 3-hydroxy-3-methylglutarate C6H10O5 L-streptose 4-aminobutanoate C4H9NO2 n-Propyl carbamate 4-hydroxy-2-quinolinecarboxylic aci C10H7NO3 (2E)-3-(4-hydroxyphenyl)-2-isocyano 4-imidazoleacetic acid C5H6N2O2 Thymine 5'-methylthioadenosine C11H15N5O3S 5'-Mtf 5-oxo-proline C5H7NO3 (S)-4-amino-4,5-dihydro-2-furancarb C10:0 fatty acid C10H20O2 octyl acetate C10H7NO3_190.0461p_0n_14.7_RB_Biocr C10H7NO3 (2E)-3-(4-hydroxyphenyl)-2-isocyano C11H17NO7_276.1152p_0n_16.5_RB_Bioc C11H17NO7 sutherlandin C11H22N2O3_231.1709p_0n_24_RB_Biocr C11H22N2O3 beta-homoalanine-beta-homoleucine C11H23NO2_202.1758p_0n_16.4_RB_Bioc C11H23NO2 11-Aminoundecanoic acid C12H25NO2_216.1965p_0n_12.4_RB_Bioc C12H25NO2 N-decanoyl ethanolamine C14:0 fatty acid C14H28O2 hexyl octanoate C15H22O3_0p_249.149n_6_RB_BiocrustE C15H22O3 2-trans,4-trans-xanthoxin C16:0 fatty acid C16H32O2 tetradecan-1-yl acetate C18:0 fatty acid C18H36O2 heptadecanoate, methyl ester C21H42O4_376.34p_0n_4.2_RB_Biocrust C21H42O4 1-glyceryl stearate ============================================================ ModelSEED matching summary: WoM identified compounds: 257 Exact name matches: 69 (26.8%) Formula-only matches: 107 (41.6%) Total matched: 176 (68.5%) Unmatched: 81 (31.5%)
3. WoM ↔ GapMind: Pathway Mapping¶
Check which WoM metabolites correspond to GapMind amino acid biosynthesis or carbon source utilization pathways.
In [8]:
# Get distinct GapMind pathway names
gapmind_pathways = spark.sql("""
SELECT DISTINCT pathway
FROM kbase_ke_pangenome.gapmind_pathways
LIMIT 200
""").toPandas()
print(f"GapMind pathways: {len(gapmind_pathways)}")
print()
# Map WoM amino acid metabolites to GapMind biosynthesis pathways
# GapMind pathway names are like: L-arginine biosynthesis, L-histidine biosynthesis, etc.
aa_map = {
'alanine': 'L-alanine',
'arginine': 'L-arginine',
'asparagine': 'L-asparagine',
'aspartate': 'L-aspartate',
'cysteine': 'L-cysteine',
'glutamate': 'L-glutamate',
'glutamine': 'L-glutamine',
'glycine': 'glycine',
'histidine': 'L-histidine',
'isoleucine': 'L-isoleucine',
'leucine': 'L-leucine',
'lysine': 'L-lysine',
'methionine': 'L-methionine',
'phenylalanine': 'L-phenylalanine',
'proline': 'L-proline',
'serine': 'L-serine',
'threonine': 'L-threonine',
'tryptophan': 'L-tryptophan',
'tyrosine': 'L-tyrosine',
'valine': 'L-valine',
}
# Find which GapMind pathways mention these amino acids
gm_names = set(gapmind_pathways['pathway'].str.lower())
print("WoM amino acids → GapMind pathway mapping:")
print(f"{'WoM metabolite':25s} {'GapMind pathway':50s} {'Match?':>8s}")
print('-' * 85)
aa_pathway_map = []
for wom_aa, gm_prefix in sorted(aa_map.items()):
# Find GapMind pathway containing this amino acid name
matches = [p for p in gapmind_pathways['pathway']
if gm_prefix.lower() in p.lower()]
if matches:
for m in matches:
print(f"{wom_aa:25s} {m:50s} {'YES':>8s}")
aa_pathway_map.append({'wom_metabolite': wom_aa, 'gapmind_pathway': m})
else:
print(f"{wom_aa:25s} {'(no match)':50s} {'NO':>8s}")
print(f"\nMapped: {len(set(m['wom_metabolite'] for m in aa_pathway_map))} / {len(aa_map)} amino acids")
GapMind pathways: 80 WoM amino acids → GapMind pathway mapping: WoM metabolite GapMind pathway Match? ------------------------------------------------------------------------------------- alanine (no match) NO arginine (no match) NO asparagine (no match) NO aspartate (no match) NO cysteine (no match) NO glutamate (no match) NO glutamine (no match) NO glycine (no match) NO histidine (no match) NO isoleucine (no match) NO leucine (no match) NO lysine (no match) NO methionine (no match) NO phenylalanine (no match) NO proline (no match) NO serine (no match) NO threonine (no match) NO tryptophan (no match) NO tyrosine (no match) NO valine (no match) NO Mapped: 0 / 20 amino acids
In [9]:
# Also check carbon source utilization pathways
carbon_metabolites = {
'glucose': 'glucose',
'lactate': 'lactate',
'glycine': 'glycine',
'trehalose': 'trehalose',
'sucrose': 'sucrose',
}
print("WoM carbon-related metabolites → GapMind utilization pathways:")
for wom_met, search_term in sorted(carbon_metabolites.items()):
matches = [p for p in gapmind_pathways['pathway']
if search_term.lower() in p.lower() and 'utilization' in p.lower()]
if matches:
for m in matches:
print(f" {wom_met:25s} → {m}")
else:
print(f" {wom_met:25s} → (no utilization pathway)")
WoM carbon-related metabolites → GapMind utilization pathways: glucose → (no utilization pathway) glycine → (no utilization pathway) lactate → (no utilization pathway) sucrose → (no utilization pathway) trehalose → (no utilization pathway)
4. WoM ↔ Pangenome: Species-Level Matching¶
Check if WoM organisms (especially the ENIGMA Pseudomonas isolates) appear in the pangenome species clades.
In [10]:
# Search for WoM organism genera in the pangenome
print("Pangenome species clades matching WoM organism genera:")
genera_to_check = [
('Pseudomonas', 'fluorescens'),
('Acidovorax', None),
('Phenylobacterium', None),
('Rhizobium', None),
('Bacillus', None),
('Escherichia', 'coli'),
('Synechococcus', None),
('Zymomonas', 'mobilis'),
]
for genus, species in genera_to_check:
if species:
query = f"""SELECT gtdb_species_clade_id, no_genomes, no_core, no_gene_clusters
FROM kbase_ke_pangenome.pangenome
WHERE gtdb_species_clade_id LIKE '%{genus}%{species}%'
ORDER BY CAST(no_genomes AS INT) DESC
LIMIT 5"""
else:
query = f"""SELECT gtdb_species_clade_id, no_genomes, no_core, no_gene_clusters
FROM kbase_ke_pangenome.pangenome
WHERE gtdb_species_clade_id LIKE '%{genus}%'
ORDER BY CAST(no_genomes AS INT) DESC
LIMIT 5"""
results = spark.sql(query).toPandas()
n_species = len(results)
if n_species > 0:
total_genomes = results['no_genomes'].astype(int).sum()
print(f" {genus} {species or '(any)'}: {n_species} species, {total_genomes:,} genomes (top 5)")
for _, row in results.head(3).iterrows():
print(f" {row['gtdb_species_clade_id'][:60]:60s} genomes={row['no_genomes']}")
else:
print(f" {genus} {species or '(any)'}: NOT FOUND in pangenome")
print()
Pangenome species clades matching WoM organism genera:
Pseudomonas fluorescens: 5 species, 139 genomes (top 5)
s__Pseudomonas_E_fluorescens_E--RS_GCF_001307155.1 genomes=40
s__Pseudomonas_E_fluorescens_AN--RS_GCF_001708445.1 genomes=34
s__Pseudomonas_E_fluorescens_BV--RS_GCF_001902145.1 genomes=33
Acidovorax (any): 5 species, 79 genomes (top 5)
s__Acidovorax_facilis--RS_GCF_023913775.1 genomes=19
s__Acidovorax_A_avenae--RS_GCF_000176855.2 genomes=18
s__Acidovorax_A_citrulli--RS_GCF_900100305.1 genomes=14
Phenylobacterium (any): 5 species, 80 genomes (top 5)
s__Phenylobacterium_sp020401865--GB_GCA_020401865.1 genomes=51
s__Phenylobacterium_sp020402745--GB_GCA_020402745.1 genomes=14
s__Phenylobacterium_sp018821435--GB_GCA_018821435.1 genomes=6
Rhizobium (any): 5 species, 449 genomes (top 5)
s__Rhizobium_laguerreae--RS_GCF_002008165.1 genomes=175
s__Rhizobium_leguminosarum--RS_GCF_002008365.1 genomes=97
s__Rhizobium_leguminosarum_L--RS_GCF_000009265.1 genomes=71
Bacillus (any): 5 species, 2,557 genomes (top 5)
s__Bacillus_velezensis--RS_GCF_001461825.1 genomes=647
s__Bacillus_A_bombysepticus--RS_GCF_006384875.1 genomes=622
s__Bacillus_subtilis--RS_GCF_000009045.1 genomes=559
Escherichia coli: 1 species, 2 genomes (top 5)
s__Escherichia_coli_E--RS_GCF_011881725.1 genomes=2
Synechococcus (any): 5 species, 88 genomes (top 5)
s__Synechococcus_D_lacustris_A--GB_GCA_903943975.1 genomes=29
s__Synechococcus_D_lacustris--RS_GCF_003011125.1 genomes=28
s__Synechococcus_B_sp009836025--GB_GCA_009836025.1 genomes=12
Zymomonas mobilis: 1 species, 26 genomes (top 5)
s__Zymomonas_mobilis--RS_GCF_000175255.2 genomes=26
5. Integration: What Can We Do With These Links?¶
For FB-matched organisms, can we connect:
- WoM metabolite production → GapMind pathway predictions → gene fitness?
In [11]:
# Integration test: pseudo3_N2E3 metabolite production -> FB carbon source fitness
# This is the direct link: WoM shows what the organism produces,
# FB shows which genes matter when using those metabolites as carbon sources.
print("Integration: WoM metabolite production -> FB gene fitness")
print("=" * 70)
if len(overlap_df) > 0:
print(f"\n{len(overlap_df)} metabolites that pseudo3_N2E3 PRODUCES (WoM)")
print(f"are also tested as CARBON/NITROGEN sources in FB.")
print(f"\nThis enables the question: which genes are fitness-important")
print(f"for utilizing a metabolite that this organism itself produces?")
print(f"\nExample: pseudo3_N2E3 produces lactate (WoM action=E, de novo).")
print(f"In FB, lactate is tested as a carbon source. Genes with strong")
print(f"negative fitness on lactate are required for lactate utilization.")
print(f"This connects metabolite output -> gene function.")
print(f"\nNote: The WoM-GapMind link is currently BLOCKED because GapMind")
print(f"pathway names use internal IDs (e.g., 'L-arginine biosynthesis'")
print(f"is not in the pathway table). A dedicated pathway-to-metabolite")
print(f"lookup table is needed to complete this integration layer.")
else:
print("No metabolite-condition overlaps found.")
Integration: WoM metabolite production -> FB gene fitness ====================================================================== 19 metabolites that pseudo3_N2E3 PRODUCES (WoM) are also tested as CARBON/NITROGEN sources in FB. This enables the question: which genes are fitness-important for utilizing a metabolite that this organism itself produces? Example: pseudo3_N2E3 produces lactate (WoM action=E, de novo). In FB, lactate is tested as a carbon source. Genes with strong negative fitness on lactate are required for lactate utilization. This connects metabolite output -> gene function. Note: The WoM-GapMind link is currently BLOCKED because GapMind pathway names use internal IDs (e.g., 'L-arginine biosynthesis' is not in the pathway table). A dedicated pathway-to-metabolite lookup table is needed to complete this integration layer.
6. Summary & Cross-Collection Link Assessment¶
In [12]:
print('=' * 70)
print('NB02 CROSS-COLLECTION LINK ASSESSMENT')
print('=' * 70)
print("\n1. WoM <-> Fitness Browser")
print(" Direct strain matches: 2 (pseudo3_N2E3, pseudo13_GW456_L13)")
print(" Same-strain genus match: 1 (E. coli BW25113 = Keio)")
print(" Different-strain genus: 1 (Synechococcus PCC7002 vs PCC7942)")
print(" -> 3 organisms with direct gene-metabolite linking potential")
n_overlaps = len(overlap_df) if 'overlap_df' in dir() and len(overlap_df) > 0 else 0
print(f"\n Curated metabolite-condition matches: {n_overlaps}")
print(f" (metabolites N2E3 produces that FB tests as carbon/nitrogen sources)")
n_exact = n_exact_wom
n_formula = n_wom_formula_matched
print(f"\n2. WoM <-> ModelSEED Biochemistry")
print(f" Exact name matches: {n_exact} compounds (high confidence)")
print(f" Formula-only matches: {n_formula} compounds (ambiguous: 1:{len(formula_matches)//n_formula:.0f} avg expansion)")
print(f" Total with any link: {n_exact + n_formula} / {len(wom_compounds)} identified ({(n_exact+n_formula)/len(wom_compounds)*100:.0f}%)")
print(f" -> Name matches are definitive; formula matches provide candidate sets")
n_aa_mapped = len(set(m['wom_metabolite'] for m in aa_pathway_map)) if aa_pathway_map else 0
print(f"\n3. WoM <-> GapMind Pathways")
print(f" Amino acid matches: {n_aa_mapped} / 20 (BLOCKED: pathway names use internal IDs)")
print(f" -> Needs a pathway-to-metabolite lookup table to enable this link")
print(f"\n4. WoM <-> Pangenome")
print(f" All WoM genera found in pangenome species clades")
print(f" -> Pangenome context available at genus level")
print("\n" + '=' * 70)
print("BOTTOM LINE")
print('=' * 70)
print("The 2018 WoM snapshot is small (37 organisms, no consumption data)")
print("but the cross-collection links are REAL:")
print(f" - 3 organisms with direct WoM<->FB linking")
print(f" - {n_overlaps} curated metabolite-condition matches for pseudo3_N2E3")
print(f" - {n_exact} metabolites with definitive ModelSEED links ({n_exact/len(wom_compounds)*100:.0f}%)")
print(f" - {n_exact+n_formula} with any ModelSEED link ({(n_exact+n_formula)/len(wom_compounds)*100:.0f}%, but formula matches are ambiguous)")
print(f" - All WoM genera in pangenome; GapMind link blocked by naming convention")
print("\nThe main limitations are ABSENT consumption data and SMALL organism set.")
====================================================================== NB02 CROSS-COLLECTION LINK ASSESSMENT ====================================================================== 1. WoM <-> Fitness Browser Direct strain matches: 2 (pseudo3_N2E3, pseudo13_GW456_L13) Same-strain genus match: 1 (E. coli BW25113 = Keio) Different-strain genus: 1 (Synechococcus PCC7002 vs PCC7942) -> 3 organisms with direct gene-metabolite linking potential Curated metabolite-condition matches: 19 (metabolites N2E3 produces that FB tests as carbon/nitrogen sources) 2. WoM <-> ModelSEED Biochemistry Exact name matches: 69 compounds (high confidence) Formula-only matches: 107 compounds (ambiguous: 1:8 avg expansion) Total with any link: 176 / 257 identified (68%) -> Name matches are definitive; formula matches provide candidate sets 3. WoM <-> GapMind Pathways Amino acid matches: 0 / 20 (BLOCKED: pathway names use internal IDs) -> Needs a pathway-to-metabolite lookup table to enable this link 4. WoM <-> Pangenome All WoM genera found in pangenome species clades -> Pangenome context available at genus level ====================================================================== BOTTOM LINE ====================================================================== The 2018 WoM snapshot is small (37 organisms, no consumption data) but the cross-collection links are REAL: - 3 organisms with direct WoM<->FB linking - 19 curated metabolite-condition matches for pseudo3_N2E3 - 69 metabolites with definitive ModelSEED links (27%) - 176 with any ModelSEED link (68%, but formula matches are ambiguous) - All WoM genera in pangenome; GapMind link blocked by naming convention The main limitations are ABSENT consumption data and SMALL organism set.
In [13]:
# Save cross-collection link data
if len(exact_matches) > 0:
exact_matches.to_csv(f'{DATA_DIR}/modelseed_name_matches.csv', index=False)
if len(formula_matches) > 0:
formula_matches.to_csv(f'{DATA_DIR}/modelseed_formula_matches.csv', index=False)
if aa_pathway_map:
pd.DataFrame(aa_pathway_map).to_csv(f'{DATA_DIR}/gapmind_aa_map.csv', index=False)
print("Saved cross-collection link data to data/")
Saved cross-collection link data to data/
In [14]:
spark.stop()