01 Data Extraction
Jupyter notebook from the Metabolic Consistency of Pseudomonas FW300-N2E3 project.
NB01: Data Extraction and Metabolite Name Harmonization
Extract metabolic data for Pseudomonas fluorescens FW300-N2E3 from four BERDL databases and build a unified metabolite crosswalk table.
Databases:
- Web of Microbes (kescience_webofmicrobes): exometabolomic profile
- Fitness Browser (kescience_fitnessbrowser): carbon/nitrogen source experiments
- BacDive (kescience_bacdive): species-level metabolite utilization
- Pangenome (kbase_ke_pangenome): GapMind pathway predictions
Outputs:
- data/wom_profile.tsv: full WoM exometabolomic profile
- data/fb_experiments.tsv: FB carbon/nitrogen source experiments
- data/bacdive_utilization.tsv: BacDive P. fluorescens utilization
- data/gapmind_pathways.tsv: GapMind pathway predictions (best score per pathway for every genome in the FW300-N2E3 clade)
- data/metabolite_crosswalk.tsv: unified metabolite mapping table
In [1]:
import os
import pandas as pd
import numpy as np
# Spark session
spark = get_spark_session()
DATA_DIR = '../data'
os.makedirs(DATA_DIR, exist_ok=True)
# Strain identifiers
WOM_NAME = 'Pseudomonas sp. (FW300-N2E3)'
FB_ORG = 'pseudo3_N2E3'
BACDIVE_SPECIES = 'Pseudomonas fluorescens'
PANGENOME_GENOME = 'RS_GCF_001307155.1'
PANGENOME_CLADE = 's__Pseudomonas_E_fluorescens_E--RS_GCF_001307155.1'
1. Web of Microbes: Exometabolomic Profile
In [2]:
wom_df = spark.sql(f"""
SELECT c.compound_name, obs.action, c.formula, c.pubchem_id,
c.inchi_key, c.smiles_string, c.metabolite_atlas_id,
e.env_name, p.project_name
FROM kescience_webofmicrobes.observation obs
JOIN kescience_webofmicrobes.organism o ON o.id = obs.organism_id
JOIN kescience_webofmicrobes.compound c ON c.id = obs.compound_id
JOIN kescience_webofmicrobes.environment e ON e.id = obs.environment_id
LEFT JOIN kescience_webofmicrobes.project p ON p.id = obs.project_id
WHERE o.common_name = '{WOM_NAME}'
ORDER BY obs.action, c.compound_name
""").toPandas()
print(f"WoM observations: {len(wom_df)}")
print(f"\nAction counts:")
print(wom_df['action'].value_counts().to_string())
print(f"\nEnvironment: {wom_df['env_name'].unique()}")
print(f"\nProduced metabolites (E+I): {len(wom_df[wom_df['action'].isin(['E','I'])])}")
wom_df.to_csv(f'{DATA_DIR}/wom_profile.tsv', sep='\t', index=False)
wom_df[wom_df['action'].isin(['E','I'])].head(20)
WoM observations: 105

Action counts:
action
N    47
I    31
E    27

Environment: <ArrowStringArray>
['R2A']
Length: 1, dtype: str

Produced metabolites (E+I): 58
Out[2]:
| | compound_name | action | formula | pubchem_id | inchi_key | smiles_string | metabolite_atlas_id | env_name | project_name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2-hydroxy-4-(methylthio)butyric acid | E | C5H10O3S | NaN | NaN | NaN | NaN | R2A | ENIGMA_SK_BEMC2016 |
| 1 | 3-hydroxybenzoate | E | C7H6O3 | NaN | NaN | NaN | NaN | R2A | ENIGMA_SK_BEMC2016 |
| 2 | 4-acetamidobutanoate | E | C6H11NO3 | NaN | NaN | NaN | NaN | R2A | ENIGMA_SK_BEMC2016 |
| 3 | 4-hydroxy-2-quinolinecarboxylic acid | E | C10H7NO3 | NaN | NaN | NaN | NaN | R2A | ENIGMA_SK_BEMC2016 |
| 4 | 4-imidazoleacetic acid | E | C5H6N2O2 | NaN | NaN | NaN | NaN | R2A | ENIGMA_SK_BEMC2016 |
| 5 | 5-aminopentanoate | E | C5H11NO2 | NaN | NaN | NaN | NaN | R2A | ENIGMA_SK_BEMC2016 |
| 6 | 5-hydroxylysine | E | C6H14N2O3 | NaN | NaN | NaN | NaN | R2A | ENIGMA_SK_BEMC2016 |
| 7 | Cytosine | E | NA | NaN | NaN | NaN | NaN | R2A | ENIGMA_SK_BEMC2016 |
| 8 | N-acetyl-alanine | E | C5H9NO3 | NaN | NaN | NaN | NaN | R2A | ENIGMA_SK_BEMC2016 |
| 9 | N-acetyl-glutamic acid | E | C7H11NO5 | NaN | NaN | NaN | NaN | R2A | ENIGMA_SK_BEMC2016 |
| 10 | N-acetyl-serine | E | C5H9NO4 | NaN | NaN | NaN | NaN | R2A | ENIGMA_SK_BEMC2016 |
| 11 | N-alpha-acetyl-asparagine | E | C6H10N2O4 | NaN | NaN | NaN | NaN | R2A | ENIGMA_SK_BEMC2016 |
| 12 | Pipecolic acid | E | C6H11NO2 | NaN | NaN | NaN | NaN | R2A | ENIGMA_SK_BEMC2016 |
| 13 | betaine | E | C5H11NO2 | NaN | NaN | NaN | NaN | R2A | ENIGMA_SK_BEMC2016 |
| 14 | carnitine | E | C7H15NO3 | NaN | NaN | NaN | NaN | R2A | ENIGMA_SK_BEMC2016 |
| 15 | lactate | E | C3H6O3 | NaN | NaN | NaN | NaN | R2A | ENIGMA_SK_BEMC2016 |
| 16 | lysine | E | C6H14N2O2 | NaN | NaN | NaN | NaN | R2A | ENIGMA_SK_BEMC2016 |
| 17 | nicotinate | E | C6H5NO2 | NaN | NaN | NaN | NaN | R2A | ENIGMA_SK_BEMC2016 |
| 18 | salicylate | E | C7H6O3 | NaN | NaN | NaN | NaN | R2A | ENIGMA_SK_BEMC2016 |
| 19 | sarcosine | E | C3H7NO2 | NaN | NaN | NaN | NaN | R2A | ENIGMA_SK_BEMC2016 |
2. Fitness Browser: Carbon/Nitrogen Source Experiments
In [3]:
# Get all carbon/nitrogen source experiments for FW300-N2E3
fb_exps = spark.sql(f"""
SELECT expName, expGroup, condition_1
FROM kescience_fitnessbrowser.experiment
WHERE orgId = '{FB_ORG}'
AND expGroup IN ('carbon source', 'nitrogen source')
ORDER BY expGroup, condition_1
""").toPandas()
print(f"FB experiments: {len(fb_exps)}")
print(f" Carbon source: {len(fb_exps[fb_exps['expGroup']=='carbon source'])}")
print(f" Nitrogen source: {len(fb_exps[fb_exps['expGroup']=='nitrogen source'])}")
print(f"\nUnique conditions: {fb_exps['condition_1'].nunique()}")
# Deduplicate conditions (some have replicates)
fb_conditions = fb_exps.groupby(['condition_1', 'expGroup']).size().reset_index(name='n_replicates')
print(f"\nUnique condition × type pairs: {len(fb_conditions)}")
fb_exps.to_csv(f'{DATA_DIR}/fb_experiments.tsv', sep='\t', index=False)
fb_conditions.sort_values('condition_1')
FB experiments: 120
  Carbon source: 82
  Nitrogen source: 38

Unique conditions: 62

Unique condition × type pairs: 76
Out[3]:
| | condition_1 | expGroup | n_replicates |
|---|---|---|---|
| 0 | Adenine hydrochloride hydrate | nitrogen source | 1 |
| 1 | Adenosine | nitrogen source | 1 |
| 2 | Ammonium chloride | nitrogen source | 1 |
| 3 | Carnitine Hydrochloride | carbon source | 1 |
| 4 | Carnitine Hydrochloride | nitrogen source | 1 |
| ... | ... | ... | ... |
| 71 | Uridine | carbon source | 1 |
| 72 | Uridine | nitrogen source | 1 |
| 73 | a-Ketoglutaric acid disodium salt hydrate | carbon source | 2 |
| 74 | casamino acids | carbon source | 1 |
| 75 | m-Inositol | carbon source | 2 |
76 rows × 3 columns
In [4]:
# Get gene annotations for FW300-N2E3
fb_genes = spark.sql(f"""
SELECT g.locusId, g.sysName, g.gene, g.desc, g.type
FROM kescience_fitnessbrowser.gene g
WHERE g.orgId = '{FB_ORG}' AND g.type = '1'
""").toPandas()
print(f"Protein-coding genes: {len(fb_genes)}")
fb_genes.head()
Protein-coding genes: 5766
Out[4]:
| | locusId | sysName | gene | desc | type |
|---|---|---|---|---|---|
| 0 | AO353_00005 | AO353_00005 | NaN | isoprenylcysteine carboxyl methyltransferase | 1 |
| 1 | AO353_00010 | AO353_00010 | NaN | hypothetical protein | 1 |
| 2 | AO353_00015 | AO353_00015 | NaN | FAD-dependent oxidoreductase | 1 |
| 3 | AO353_00020 | AO353_00020 | NaN | delta-aminolevulinic acid dehydratase | 1 |
| 4 | AO353_00025 | AO353_00025 | NaN | phenazine biosynthesis protein PhzF | 1 |
3. BacDive: P. fluorescens Metabolite Utilization
In [5]:
bacdive_df = spark.sql(f"""
SELECT mu.compound_name, mu.chebi_id, mu.utilization,
s.bacdive_id, s.strain_designation, s.type_strain
FROM kescience_bacdive.metabolite_utilization mu
JOIN kescience_bacdive.strain s ON mu.bacdive_id = s.bacdive_id
JOIN kescience_bacdive.taxonomy t ON t.bacdive_id = s.bacdive_id
WHERE t.species = '{BACDIVE_SPECIES}'
ORDER BY mu.compound_name
""").toPandas()
print(f"BacDive records: {len(bacdive_df)}")
print(f"Unique compounds: {bacdive_df['compound_name'].nunique()}")
print(f"Unique strains: {bacdive_df['bacdive_id'].nunique()}")
# Show all distinct utilization values — BacDive uses +, -, produced, +/-
print(f"\nUtilization value counts:")
print(bacdive_df['utilization'].value_counts().to_string())
# Per-strain consensus: collapse multiple records per strain-compound pair
# Rule: majority vote among explicit '+' and '-' tests
# A tie between '+' and '-' counts as ambiguous ('+/-')
def strain_consensus(group):
    """Compute one consensus utilization per strain per compound."""
    n_pos = (group['utilization'] == '+').sum()
    n_neg = (group['utilization'] == '-').sum()
    n_prod = (group['utilization'] == 'produced').sum()
    n_ambig = (group['utilization'] == '+/-').sum()
    # If any explicit '+' or '-' tests exist, take the majority
    if n_pos + n_neg > 0:
        if n_pos > n_neg:
            return '+'
        elif n_neg > n_pos:
            return '-'
        else:
            return '+/-'  # tied
    elif n_prod > 0:
        return 'produced'
    elif n_ambig > 0:
        return '+/-'
    return group['utilization'].iloc[0]  # fallback

strain_level = bacdive_df.groupby(['compound_name', 'bacdive_id']).apply(
    strain_consensus, include_groups=False
).reset_index(name='strain_consensus')
n_raw = len(bacdive_df)
n_deduped = len(strain_level)
print(f"\nPer-strain deduplication: {n_raw} raw records → {n_deduped} strain-compound pairs")
# Example: D-glucose had 104 records from 51 strains
glucose = bacdive_df[bacdive_df['compound_name'] == 'D-glucose']
glucose_dedup = strain_level[strain_level['compound_name'] == 'D-glucose']
print(f" D-glucose: {len(glucose)} records → {len(glucose_dedup)} strain consensus values")
# Summarize per compound using strain-level consensus
bacdive_summary = strain_level.groupby('compound_name').agg(
n_strains=('bacdive_id', 'nunique'),
n_positive=('strain_consensus', lambda x: (x == '+').sum()),
n_negative=('strain_consensus', lambda x: (x == '-').sum()),
n_produced=('strain_consensus', lambda x: (x == 'produced').sum()),
n_ambiguous=('strain_consensus', lambda x: (x == '+/-').sum()),
n_total=('strain_consensus', 'count')
).reset_index()
# pct_positive uses only strains whose consensus is an explicit '+' or '-' (not 'produced' or '+/-')
n_tested = bacdive_summary['n_positive'] + bacdive_summary['n_negative']
bacdive_summary['n_utilization_tested'] = n_tested
bacdive_summary['pct_positive'] = (bacdive_summary['n_positive'] / n_tested).round(3)
# NaN where n_tested == 0 (e.g., compound only has 'produced' entries)
bacdive_summary.to_csv(f'{DATA_DIR}/bacdive_utilization.tsv', sep='\t', index=False)
print(f"\nTop compounds by test frequency (per-strain consensus):")
bacdive_summary.sort_values('n_total', ascending=False).head(20)
BacDive records: 1262
Unique compounds: 83
Unique strains: 105

Utilization value counts:
utilization
-           633
+           513
produced    115
+/-           1

Per-strain deduplication: 1262 raw records → 1095 strain-compound pairs
  D-glucose: 104 records → 51 strain consensus values

Top compounds by test frequency (per-strain consensus):
Out[5]:
| | compound_name | n_strains | n_positive | n_negative | n_produced | n_ambiguous | n_total | n_utilization_tested | pct_positive |
|---|---|---|---|---|---|---|---|---|---|
| 49 | indole | 53 | 0 | 1 | 52 | 0 | 53 | 1 | 0.000 |
| 36 | esculin | 52 | 2 | 50 | 0 | 0 | 52 | 52 | 0.038 |
| 64 | nitrate | 52 | 32 | 20 | 0 | 0 | 52 | 52 | 0.615 |
| 22 | N-acetylglucosamine | 51 | 33 | 16 | 0 | 2 | 51 | 49 | 0.673 |
| 16 | L-arabinose | 51 | 35 | 14 | 0 | 2 | 51 | 49 | 0.714 |
| 8 | D-glucose | 51 | 3 | 6 | 0 | 42 | 51 | 9 | 0.333 |
| 55 | maltose | 51 | 2 | 47 | 0 | 2 | 51 | 49 | 0.041 |
| 10 | D-mannitol | 51 | 43 | 6 | 0 | 2 | 51 | 49 | 0.878 |
| 11 | D-mannose | 50 | 45 | 3 | 0 | 2 | 50 | 48 | 0.938 |
| 77 | tryptophan | 50 | 0 | 50 | 0 | 0 | 50 | 50 | 0.000 |
| 43 | gluconate | 50 | 47 | 1 | 0 | 2 | 50 | 48 | 0.979 |
| 30 | arginine | 50 | 40 | 8 | 0 | 2 | 50 | 48 | 0.833 |
| 79 | urea | 50 | 2 | 47 | 0 | 1 | 50 | 49 | 0.041 |
| 34 | decanoate | 49 | 49 | 0 | 0 | 0 | 49 | 49 | 1.000 |
| 26 | adipate | 49 | 5 | 44 | 0 | 0 | 49 | 49 | 0.102 |
| 53 | malate | 49 | 49 | 0 | 0 | 0 | 49 | 49 | 1.000 |
| 41 | gelatin | 47 | 18 | 28 | 0 | 1 | 47 | 46 | 0.391 |
| 65 | nitrite | 8 | 2 | 6 | 0 | 0 | 8 | 8 | 0.250 |
| 33 | citrate | 7 | 6 | 1 | 0 | 0 | 7 | 7 | 0.857 |
| 71 | ribitol | 7 | 1 | 6 | 0 | 0 | 7 | 7 | 0.143 |
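The D-glucose row above reports 42 of 51 strains as ambiguous even though only one raw record carries the value +/-. This is consistent with the tie rule: a strain with equal numbers of + and - records for a compound gets a consensus of +/-. The following is a minimal sketch of that vote on invented records; the strain IDs and values are hypothetical, and the fallback is simplified relative to the cell above.

```python
import pandas as pd

# Toy records: strain IDs and test results are invented for illustration.
toy = pd.DataFrame({
    "compound_name": ["D-glucose"] * 6,
    "bacdive_id": [101, 101, 102, 102, 102, 103],
    "utilization": ["+", "-", "+", "+", "-", "produced"],
})

def vote(values: pd.Series) -> str:
    """Majority vote over explicit '+' and '-' records; ties become '+/-'."""
    n_pos = int((values == "+").sum())
    n_neg = int((values == "-").sum())
    if n_pos + n_neg > 0:
        if n_pos > n_neg:
            return "+"
        if n_neg > n_pos:
            return "-"
        return "+/-"
    # No explicit tests: fall back to the first recorded value (e.g. 'produced').
    return values.iloc[0]

# One consensus value per (compound, strain) pair.
toy_consensus = toy.groupby(["compound_name", "bacdive_id"])["utilization"].agg(vote)
print(toy_consensus)
```

Strain 101 has one + and one - record and is therefore counted as ambiguous, which is why pct_positive for D-glucose rests on only 9 strains.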
4. GapMind: Pathway Predictions for FW300-N2E3
In [6]:
# GapMind: take best score per genome-pathway pair (multiple rows per pair)
gapmind_df = spark.sql(f"""
WITH scored AS (
SELECT pathway, genome_id, score_category,
CASE score_category
WHEN 'complete' THEN 5
WHEN 'likely_complete' THEN 4
WHEN 'steps_missing_low' THEN 3
WHEN 'steps_missing_medium' THEN 2
WHEN 'not_present' THEN 1
ELSE 0
END as score_value
FROM kbase_ke_pangenome.gapmind_pathways
WHERE clade_name = '{PANGENOME_CLADE}'
)
SELECT pathway, genome_id, MAX(score_value) as best_score
FROM scored
GROUP BY pathway, genome_id
ORDER BY pathway, genome_id
""").toPandas()
# Map numeric scores back to category names
score_map = {5: 'complete', 4: 'likely_complete', 3: 'steps_missing_low',
2: 'steps_missing_medium', 1: 'not_present', 0: 'unknown'}
gapmind_df['best_category'] = gapmind_df['best_score'].map(score_map)
print(f"GapMind pathway-genome pairs: {len(gapmind_df)}")
print(f"Unique pathways: {gapmind_df['pathway'].nunique()}")
print(f"Genomes in clade: {gapmind_df['genome_id'].nunique()}")
# Debug: show sample genome_id values to understand format
sample_ids = sorted(gapmind_df['genome_id'].unique())[:5]
print(f"\nSample genome IDs: {sample_ids}")
print(f"Looking for: {PANGENOME_GENOME}")
# Try matching with and without RS_ prefix
gapmind_n2e3 = gapmind_df[gapmind_df['genome_id'] == PANGENOME_GENOME].copy()
if len(gapmind_n2e3) == 0:
    # Try without RS_ prefix
    alt_id = PANGENOME_GENOME.replace('RS_', '')
    gapmind_n2e3 = gapmind_df[gapmind_df['genome_id'] == alt_id].copy()
    if len(gapmind_n2e3) > 0:
        print(f"Found with alternate ID: {alt_id}")
if len(gapmind_n2e3) == 0:
    # Try partial match
    matches = gapmind_df[gapmind_df['genome_id'].str.contains('001307155', na=False)]
    if len(matches) > 0:
        matched_id = matches['genome_id'].iloc[0]
        gapmind_n2e3 = gapmind_df[gapmind_df['genome_id'] == matched_id].copy()
        print(f"Found with partial match: {matched_id}")
print(f"\nPathways for FW300-N2E3: {len(gapmind_n2e3)}")
if len(gapmind_n2e3) == 0:
    print("WARNING: FW300-N2E3 genome not in GapMind results, using clade-level data")
    # Fall back to clade-level summary (median score across genomes)
    gapmind_n2e3 = gapmind_df.groupby('pathway').agg(
        best_score=('best_score', lambda x: int(x.median())),
        n_genomes=('genome_id', 'nunique')
    ).reset_index()
    gapmind_n2e3['best_category'] = gapmind_n2e3['best_score'].map(score_map)
gapmind_df.to_csv(f'{DATA_DIR}/gapmind_pathways.tsv', sep='\t', index=False)
print("\nPathway completeness for FW300-N2E3:")
print(gapmind_n2e3['best_category'].value_counts().to_string())
gapmind_n2e3.head(20)
GapMind pathway-genome pairs: 3200
Unique pathways: 80
Genomes in clade: 40

Sample genome IDs: ['GCA_002883675.1', 'GCA_002883775.1', 'GCF_001307155.1', 'GCF_002883595.1', 'GCF_002883635.1']
Looking for: RS_GCF_001307155.1
Found with alternate ID: GCF_001307155.1

Pathways for FW300-N2E3: 80

Pathway completeness for FW300-N2E3:
best_category
complete                73
steps_missing_low        5
steps_missing_medium     1
likely_complete          1
Out[6]:
| | pathway | genome_id | best_score | best_category |
|---|---|---|---|---|
| 2 | 2-oxoglutarate | GCF_001307155.1 | 5 | complete |
| 42 | 4-hydroxybenzoate | GCF_001307155.1 | 5 | complete |
| 82 | D-alanine | GCF_001307155.1 | 5 | complete |
| 122 | D-lactate | GCF_001307155.1 | 5 | complete |
| 162 | D-serine | GCF_001307155.1 | 5 | complete |
| 202 | L-lactate | GCF_001307155.1 | 5 | complete |
| 242 | L-malate | GCF_001307155.1 | 5 | complete |
| 282 | NAG | GCF_001307155.1 | 5 | complete |
| 322 | acetate | GCF_001307155.1 | 5 | complete |
| 362 | alanine | GCF_001307155.1 | 5 | complete |
| 402 | arabinose | GCF_001307155.1 | 3 | steps_missing_low |
| 442 | arg | GCF_001307155.1 | 5 | complete |
| 482 | arginine | GCF_001307155.1 | 5 | complete |
| 522 | asn | GCF_001307155.1 | 5 | complete |
| 562 | asparagine | GCF_001307155.1 | 5 | complete |
| 602 | aspartate | GCF_001307155.1 | 5 | complete |
| 642 | cellobiose | GCF_001307155.1 | 5 | complete |
| 682 | chorismate | GCF_001307155.1 | 5 | complete |
| 722 | citrate | GCF_001307155.1 | 5 | complete |
| 762 | citrulline | GCF_001307155.1 | 5 | complete |
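The lookup above succeeded only after dropping the RS_ prefix: the configured ID follows the GTDB convention of prefixing RefSeq assemblies with RS_ (and GenBank assemblies with GB_), while the genome_id values returned by the query are bare accessions. A minimal sketch of a lookup that normalizes both sides before comparing is shown below; the helper names are hypothetical and not part of the notebook.

```python
import re
from typing import Iterable, Optional

def strip_gtdb_prefix(genome_id: str) -> str:
    """Drop a leading GTDB source prefix ('RS_' or 'GB_') if present."""
    return re.sub(r"^(?:RS|GB)_", "", genome_id)

def find_genome(target: str, available: Iterable[str]) -> Optional[str]:
    """Return the entry of `available` that names the same assembly as `target`."""
    wanted = strip_gtdb_prefix(target)
    for genome_id in available:
        if strip_gtdb_prefix(genome_id) == wanted:
            return genome_id
    return None

# Sample IDs copied from the output above.
sample_ids = ["GCA_002883675.1", "GCA_002883775.1", "GCF_001307155.1"]
print(find_genome("RS_GCF_001307155.1", sample_ids))
```

Anchoring the pattern at the start of the string avoids altering an accession that happens to contain "RS_" elsewhere, which a plain str.replace would do.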
5. Metabolite Name Harmonization
The hardest part: mapping metabolite names across four databases with different nomenclature.
Strategy:
- Exact case-insensitive name match
- Manual curation for known aliases (e.g., "Sodium D,L-Lactate" → "lactate")
- Formula match as fallback for unmatched compounds
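The cells shown in this notebook cover the name-based steps; the formula fallback could be sketched roughly as follows. This is an illustration under assumptions: the reference table, its ref_name column, and the presence of a formula column on the other side are hypothetical. Because isomers share a formula (salicylate and 3-hydroxybenzoate are both C7H6O3 in the WoM profile), formula hits are treated as candidates to review, not as confirmed matches.

```python
import pandas as pd

# WoM compounds left unmatched by name (formulas as listed in the WoM profile).
unmatched = pd.DataFrame({
    "compound_name": ["salicylate", "3-hydroxybenzoate", "taurine", "Xanthine"],
    "formula": ["C7H6O3", "C7H6O3", "C2H7NO3S", "NA"],
})

# Hypothetical reference table; its names and formula column are assumptions.
reference = pd.DataFrame({
    "ref_name": ["Salicylic acid", "Taurine"],
    "formula": ["C7H6O3", "C2H7NO3S"],
})

def formula_candidates(unmatched: pd.DataFrame, reference: pd.DataFrame) -> pd.DataFrame:
    """Join on molecular formula and flag formulas that yield more than one pairing."""
    # Skip rows without a usable formula (missing values or the literal string 'NA').
    usable = unmatched[unmatched["formula"].notna() & (unmatched["formula"] != "NA")]
    merged = usable.merge(reference, on="formula", how="inner")
    # A formula with several pairings cannot be resolved without manual review.
    merged["ambiguous"] = merged.groupby("formula")["ref_name"].transform("count") > 1
    return merged

candidates = formula_candidates(unmatched, reference)
print(candidates)
```

Rows flagged as ambiguous would still need manual curation, for example by comparing InChIKeys where they are available.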
In [7]:
import re
def normalize_compound_name(name):
    """Normalize compound names for cross-database matching."""
    if pd.isna(name):
        return ''
    s = name.lower().strip()
    # Remove salt forms, stereochemistry prefixes, and common suffixes
    s = re.sub(r'\b(sodium|potassium|calcium|di?sodium|dibasic|monohydrate|hexahydrate|hydrochloride|dihydrate|monobasic|salt)\b', '', s)
    # Note: this pattern is anchored at the start of the string, so it does not fire
    # when a removed salt word leaves leading whitespace (e.g. 'sodium d-lactate')
    s = re.sub(r'^(l-|d-|dl-|d,l-|d/l-)', '', s)
    s = re.sub(r'\b(acid|monopotassium)\b', '', s)
    # Clean up whitespace and punctuation
    s = re.sub(r'[,;()]+', ' ', s)
    s = re.sub(r'\s+', ' ', s).strip()
    return s

# Manual crosswalk for known correspondences between WoM and FB condition names
MANUAL_WOM_TO_FB = {
'lactate': ['Sodium D,L-Lactate', 'Sodium D-Lactate', 'Sodium L-Lactate'],
'carnitine': ['Carnitine Hydrochloride'],
'valine': ['L-Valine'],
'alanine': ['L-Alanine', 'D-Alanine'],
'arginine': ['L-Arginine'],
'proline': ['L-Proline'],
'phenylalanine': ['L-Phenylalanine'],
'tryptophan': ['L-Tryptophan'],
'trehalose': ['D-Trehalose dihydrate'],
'Malate': ['L-Malic acid disodium salt monohydrate'],
'glutamic acid': ['L-Glutamic acid monopotassium salt monohydrate'],
'Adenine': ['Adenine hydrochloride hydrate'],
'Adenosine': ['Adenosine'],
'inosine': ['Inosine'],
'aspartate': ['L-Aspartic Acid'],
'glycine': ['Glycine'],
'Guanine': ['Guanine'],
'Cytosine': ['Cytidine'], # Note: cytosine vs cytidine — close but not identical
'Uracil': ['Uridine'], # Note: uracil vs uridine — base vs nucleoside
'thymine': ['Thymine'],
'nicotinamide': ['Nicotinamide'],
'4-aminobutanoate': ['4-aminobutanoate'], # GABA
'sarcosine': ['Sarcosine'],
'trans-aconitate': ['trans-Aconitate'],
'5-oxo-proline': ['5-oxo-proline'],
'betaine': ['Betaine'],
}
# Manual crosswalk for WoM to BacDive
MANUAL_WOM_TO_BACDIVE = {
'lysine': 'lysine',
'valine': 'valine',
'Malate': 'malate',
'arginine': 'arginine',
'trehalose': 'trehalose',
'tryptophan': 'tryptophan',
'glycine': 'glycine',
'alanine': 'D-alanine', # BacDive may test D-alanine separately
}
# Manual crosswalk for WoM/FB to GapMind pathway names
MANUAL_TO_GAPMIND = {
'lactate': ['D-lactate', 'L-lactate'],
'alanine': ['alanine', 'D-alanine'],
'arginine': ['arginine', 'arg'],
'proline': ['proline'],
'phenylalanine': ['phenylalanine', 'phe'],
'tryptophan': ['tryptophan', 'trp'],
'trehalose': ['trehalose'],
'malate': ['L-malate'],
'glutamic acid': ['glutamate'],
'valine': ['valine', 'val'],
'carnitine': ['carnitine'],
'aspartate': ['aspartate'],
'glycine': ['glycine', 'gly'],
'serine': ['serine', 'ser', 'D-serine'],
'histidine': ['histidine', 'his'],
'isoleucine': ['isoleucine', 'ile'],
'leucine': ['leucine', 'leu'],
'lysine': ['lysine', 'lys'],
'asparagine': ['asparagine', 'asn'],
'glutamine': ['glutamine', 'gln'],
'citrulline': ['citrulline'],
'ornithine': ['ornithine'],
'acetate': ['acetate'],
'citrate': ['citrate'],
'fumarate': ['fumarate'],
'succinate': ['succinate'],
'pyruvate': ['pyruvate'],
'glucose': ['glucose'],
'fructose': ['fructose'],
'ribose': ['ribose'],
'mannose': ['mannose'],
'gluconate': ['gluconate'],
'glycerol': ['glycerol'],
'ethanol': ['ethanol'],
'inositol': ['myo-inositol'],
}
print(f"Manual WoM→FB mappings: {len(MANUAL_WOM_TO_FB)}")
print(f"Manual WoM→BacDive mappings: {len(MANUAL_WOM_TO_BACDIVE)}")
print(f"Manual metabolite→GapMind mappings: {len(MANUAL_TO_GAPMIND)}")
Manual WoM→FB mappings: 26
Manual WoM→BacDive mappings: 8
Manual metabolite→GapMind mappings: 35
In [8]:
# Build the unified crosswalk table
# Start with WoM metabolites that are produced (E or I)
wom_produced = wom_df[wom_df['action'].isin(['E', 'I'])].copy()
# Approximate FB matches: base→nucleoside mappings that are related but not identical
APPROXIMATE_FB_MATCHES = {'Cytosine', 'Uracil'} # Cytosine→Cytidine, Uracil→Uridine
crosswalk_rows = []
for _, row in wom_produced.iterrows():
    wom_name = row['compound_name']
    wom_norm = normalize_compound_name(wom_name)

    # FB match
    fb_matches = MANUAL_WOM_TO_FB.get(wom_name, [])
    if not fb_matches:
        # Try normalized matching against FB conditions
        for cond in fb_conditions['condition_1'].unique():
            if normalize_compound_name(cond) == wom_norm:
                fb_matches.append(cond)

    # Flag match quality: approximate for base→nucleoside mappings
    fb_match_quality = None
    if fb_matches:
        fb_match_quality = 'approximate' if wom_name in APPROXIMATE_FB_MATCHES else 'exact'

    # BacDive match
    bd_match = MANUAL_WOM_TO_BACDIVE.get(wom_name)
    if bd_match is None:
        # Try exact case-insensitive
        bd_candidates = bacdive_summary[
            bacdive_summary['compound_name'].str.lower() == wom_name.lower()
        ]
        if len(bd_candidates) > 0:
            bd_match = bd_candidates.iloc[0]['compound_name']

    bd_util = None
    bd_pct = None
    bd_n_tested = None
    bd_n_positive = None
    bd_n_negative = None
    bd_n_produced = None
    if bd_match:
        bd_row = bacdive_summary[bacdive_summary['compound_name'] == bd_match]
        if len(bd_row) > 0:
            r = bd_row.iloc[0]
            bd_n_tested = int(r['n_utilization_tested'])
            bd_n_positive = int(r['n_positive'])
            bd_n_negative = int(r['n_negative'])
            bd_n_produced = int(r['n_produced'])
            bd_pct = r['pct_positive']
            if bd_n_tested > 0:
                bd_util = '+' if bd_pct > 0.5 else '-'
            elif bd_n_produced > 0:
                bd_util = 'produced'  # only 'produced' entries, no +/- tests
            # else: no usable data

    # GapMind match
    gm_matches = MANUAL_TO_GAPMIND.get(wom_name.lower(), [])
    gm_best = None
    if gm_matches and len(gapmind_n2e3) > 0:
        for gm_path in gm_matches:
            gm_row = gapmind_n2e3[gapmind_n2e3['pathway'] == gm_path]
            if len(gm_row) > 0:
                score = gm_row.iloc[0].get('best_score', None)
                if score is not None:
                    gm_best = score_map.get(int(score), str(score))
                    break

    crosswalk_rows.append({
        'wom_compound': wom_name,
        'wom_action': row['action'],
        'wom_formula': row['formula'],
        'fb_condition': '; '.join(fb_matches) if fb_matches else None,
        'fb_matched': len(fb_matches) > 0,
        'fb_match_quality': fb_match_quality,
        'bacdive_compound': bd_match,
        'bacdive_consensus': bd_util,
        'bacdive_pct_positive': bd_pct,
        'bacdive_n_tested': bd_n_tested,
        'bacdive_n_positive': bd_n_positive,
        'bacdive_n_negative': bd_n_negative,
        'bacdive_n_produced': bd_n_produced,
        'gapmind_pathway': '; '.join(gm_matches) if gm_matches else None,
        'gapmind_prediction': gm_best,
    })

crosswalk = pd.DataFrame(crosswalk_rows)
print(f"Crosswalk table: {len(crosswalk)} metabolites")
print(f" Matched to FB: {crosswalk['fb_matched'].sum()}")
print(f" Exact matches: {(crosswalk['fb_match_quality'] == 'exact').sum()}")
print(f" Approximate matches: {(crosswalk['fb_match_quality'] == 'approximate').sum()}")
print(f" (Cytosine→Cytidine, Uracil→Uridine: base→nucleoside, related but not identical)")
print(f" Matched to BacDive: {crosswalk['bacdive_compound'].notna().sum()}")
print(f" Matched to GapMind: {crosswalk['gapmind_prediction'].notna().sum()}")
# Show BacDive matches with full detail
bd_matched = crosswalk[crosswalk['bacdive_compound'].notna()][
['wom_compound', 'wom_action', 'bacdive_compound', 'bacdive_consensus',
'bacdive_n_tested', 'bacdive_n_positive', 'bacdive_n_negative', 'bacdive_n_produced']
]
print(f"\nBacDive matches (detailed):")
print(bd_matched.to_string(index=False))
crosswalk.to_csv(f'{DATA_DIR}/metabolite_crosswalk.tsv', sep='\t', index=False)
crosswalk
Crosswalk table: 58 metabolites
Matched to FB: 28
Exact matches: 26
Approximate matches: 2
(Cytosine→Cytidine, Uracil→Uridine: base→nucleoside, related but not identical)
Matched to BacDive: 8
Matched to GapMind: 13
BacDive matches (detailed):
wom_compound wom_action bacdive_compound bacdive_consensus bacdive_n_tested bacdive_n_positive bacdive_n_negative bacdive_n_produced
lysine E lysine - 3.0 0.0 3.0 0.0
valine E valine + 1.0 1.0 0.0 0.0
Malate I malate + 49.0 49.0 0.0 0.0
alanine I D-alanine NaN NaN NaN NaN NaN
arginine I arginine + 48.0 40.0 8.0 0.0
glycine I glycine - 1.0 0.0 1.0 0.0
trehalose I trehalose - 6.0 1.0 5.0 0.0
tryptophan I tryptophan - 50.0 0.0 50.0 0.0
Out[8]:
| | wom_compound | wom_action | wom_formula | fb_condition | fb_matched | fb_match_quality | bacdive_compound | bacdive_consensus | bacdive_pct_positive | bacdive_n_tested | bacdive_n_positive | bacdive_n_negative | bacdive_n_produced | gapmind_pathway | gapmind_prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2-hydroxy-4-(methylthio)butyric acid | E | C5H10O3S | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 3-hydroxybenzoate | E | C7H6O3 | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 4-acetamidobutanoate | E | C6H11NO3 | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 4-hydroxy-2-quinolinecarboxylic acid | E | C10H7NO3 | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 4-imidazoleacetic acid | E | C5H6N2O2 | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | 5-aminopentanoate | E | C5H11NO2 | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 6 | 5-hydroxylysine | E | C6H14N2O3 | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 7 | Cytosine | E | NA | Cytidine | True | approximate | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 8 | N-acetyl-alanine | E | C5H9NO3 | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 9 | N-acetyl-glutamic acid | E | C7H11NO5 | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 10 | N-acetyl-serine | E | C5H9NO4 | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 11 | N-alpha-acetyl-asparagine | E | C6H10N2O4 | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 12 | Pipecolic acid | E | C6H11NO2 | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 13 | betaine | E | C5H11NO2 | Betaine | True | exact | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 14 | carnitine | E | C7H15NO3 | Carnitine Hydrochloride | True | exact | NaN | NaN | NaN | NaN | NaN | NaN | NaN | carnitine | NaN |
| 15 | lactate | E | C3H6O3 | Sodium D,L-Lactate; Sodium D-Lactate; Sodium L... | True | exact | NaN | NaN | NaN | NaN | NaN | NaN | NaN | D-lactate; L-lactate | complete |
| 16 | lysine | E | C6H14N2O2 | L-Lysine | True | exact | lysine | - | 0.000 | 3.0 | 0.0 | 3.0 | 0.0 | lysine; lys | complete |
| 17 | nicotinate | E | C6H5NO2 | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 18 | salicylate | E | C7H6O3 | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 19 | sarcosine | E | C3H7NO2 | Sarcosine | True | exact | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 20 | taurine | E | C2H7NO3S | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 21 | thymine | E | C5H6N2O2 | Thymine | True | exact | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 22 | trans-4-hydroxyproline | E | C5H9NO3 | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 23 | trans-aconitate | E | C6H6O6 | trans-Aconitate | True | exact | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 24 | tyrosine | E | C9H11NO3 | L-tyrosine disodium salt | True | exact | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 25 | urate | E | C5H4N4O3 | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 26 | valine | E | C5H11NO2 | L-Valine | True | exact | valine | + | 1.000 | 1.0 | 1.0 | 0.0 | 0.0 | valine; val | complete |
| 27 | 2'-deoxyadenosine | I | C10H13N5O3 | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 28 | 3',5'-cyclic AMP | I | C10H12N5O6P | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 29 | 4-aminobutanoate | I | C4H9NO2 | 4-aminobutanoate | True | exact | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 30 | 5'-methylthioadenosine | I | C11H15N5O3S | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 31 | 5-oxo-proline | I | C5H7NO3 | 5-oxo-proline | True | exact | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 32 | Adenine | I | NA | Adenine hydrochloride hydrate | True | exact | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 33 | Adenosine | I | NA | Adenosine | True | exact | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 34 | Guanine | I | NA | Guanine | True | exact | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 35 | Malate | I | C4H6O5 | L-Malic acid disodium salt monohydrate | True | exact | malate | + | 1.000 | 49.0 | 49.0 | 0.0 | 0.0 | L-malate | complete |
| 36 | N-acetyl-leucine | I | C8H15NO3 | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 37 | Uracil | I | C4H4N2O2 | Uridine | True | approximate | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 38 | Xanthine | I | NA | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 39 | alanine | I | C3H7NO2 | L-Alanine; D-Alanine | True | exact | D-alanine | NaN | NaN | NaN | NaN | NaN | NaN | alanine; D-alanine | complete |
| 40 | arginine | I | C6H14N4O2 | L-Arginine | True | exact | arginine | + | 0.833 | 48.0 | 40.0 | 8.0 | 0.0 | arginine; arg | complete |
| 41 | aspartate | I | C4H7NO4 | L-Aspartic Acid | True | exact | NaN | NaN | NaN | NaN | NaN | NaN | NaN | aspartate | complete |
| 42 | glutamic acid | I | C5H9NO4 | L-Glutamic acid monopotassium salt monohydrate | True | exact | NaN | NaN | NaN | NaN | NaN | NaN | NaN | glutamate | complete |
| 43 | glycine | I | C2H5NO2 | Glycine | True | exact | glycine | - | 0.000 | 1.0 | 0.0 | 1.0 | 0.0 | glycine; gly | complete |
| 44 | guanosine | I | C10H13N5O5 | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 45 | inosine | I | C10H12N4O5 | Inosine | True | exact | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 46 | leucine/norleucine | I | C6H13NO2 | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 47 | maleic acid | I | C4H4O4 | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 48 | nicotinamide | I | C6H6N2O | Nicotinamide | True | exact | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 49 | pantothenic acid | I | C9H17NO5 | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 50 | phenylalanine | I | C9H11NO2 | L-Phenylalanine | True | exact | NaN | NaN | NaN | NaN | NaN | NaN | NaN | phenylalanine; phe | complete |
| 51 | proline | I | C5H9NO2 | L-Proline | True | exact | NaN | NaN | NaN | NaN | NaN | NaN | NaN | proline | complete |
| 52 | sn-glycero-3-phosphocholine | I | C8H20NO6P | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 53 | thiamine | I | C12H16N4OS | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 54 | threonine isomers (coeluters: threonine, homos... | I | C4H9NO3 | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 55 | trehalose | I | C12H22O11 | D-Trehalose dihydrate | True | exact | trehalose | - | 0.167 | 6.0 | 1.0 | 5.0 | 0.0 | trehalose | complete |
| 56 | tryptophan | I | C11H12N2O2 | L-Tryptophan | True | exact | tryptophan | - | 0.000 | 50.0 | 0.0 | 50.0 | 0.0 | tryptophan; trp | complete |
| 57 | xanthosine | I | C10H12N4O6 | NaN | False | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
In [9]:
# Also build crosswalk for FB conditions that are NOT in WoM
# (compounds tested in FB but not detected as produced by WoM)
fb_only = []
wom_names_lower = set(wom_produced['compound_name'].str.lower())
matched_fb_conditions = set()
for matches in MANUAL_WOM_TO_FB.values():
    matched_fb_conditions.update(matches)

for _, row in fb_conditions.iterrows():
    cond = row['condition_1']
    if cond not in matched_fb_conditions:
        norm = normalize_compound_name(cond)
        if norm not in {normalize_compound_name(n) for n in wom_names_lower}:
            # Check GapMind
            gm_matches = MANUAL_TO_GAPMIND.get(norm, [])
            gm_best = None
            if gm_matches and len(gapmind_n2e3) > 0:
                for gm_path in gm_matches:
                    gm_row = gapmind_n2e3[gapmind_n2e3['pathway'] == gm_path]
                    if len(gm_row) > 0:
                        score = gm_row.iloc[0].get('best_score', None)
                        if score is not None:
                            gm_best = score_map.get(int(score), str(score))
                            break
            fb_only.append({
                'fb_condition': cond,
                'fb_expGroup': row['expGroup'],
                'wom_action': 'not_detected',
                'gapmind_pathway': '; '.join(gm_matches) if gm_matches else None,
                'gapmind_prediction': gm_best,
            })

fb_only_df = pd.DataFrame(fb_only)
print(f"\nFB conditions NOT in WoM produced set: {len(fb_only_df)}")
fb_only_df
FB conditions NOT in WoM produced set: 45
Out[9]:
| | fb_condition | fb_expGroup | wom_action | gapmind_pathway | gapmind_prediction |
|---|---|---|---|---|---|
| 0 | Ammonium chloride | nitrogen source | not_detected | NaN | NaN |
| 1 | D-Fructose | carbon source | not_detected | fructose | complete |
| 2 | D-Gluconic Acid sodium salt | carbon source | not_detected | NaN | NaN |
| 3 | D-Glucosamine Hydrochloride | carbon source | not_detected | NaN | NaN |
| 4 | D-Glucosamine Hydrochloride | nitrogen source | not_detected | NaN | NaN |
| 5 | D-Glucose | carbon source | not_detected | glucose | complete |
| 6 | D-Mannose | carbon source | not_detected | mannose | likely_complete |
| 7 | D-Ribose | carbon source | not_detected | ribose | complete |
| 8 | D-Serine | carbon source | not_detected | serine; ser; D-serine | complete |
| 9 | D-Serine | nitrogen source | not_detected | serine; ser; D-serine | complete |
| 10 | Ethanol | carbon source | not_detected | ethanol | complete |
| 11 | Gly-DL-Asp | nitrogen source | not_detected | NaN | NaN |
| 12 | Gly-Glu | nitrogen source | not_detected | NaN | NaN |
| 13 | Glycerol | carbon source | not_detected | glycerol | complete |
| 14 | L-Asparagine | carbon source | not_detected | asparagine; asn | complete |
| 15 | L-Citrulline | carbon source | not_detected | citrulline | complete |
| 16 | L-Citrulline | nitrogen source | not_detected | citrulline | complete |
| 17 | L-Glutamine | carbon source | not_detected | glutamine; gln | complete |
| 18 | L-Glutamine | nitrogen source | not_detected | glutamine; gln | complete |
| 19 | L-Histidine | carbon source | not_detected | histidine; his | complete |
| 20 | L-Histidine | nitrogen source | not_detected | histidine; his | complete |
| 21 | L-Isoleucine | carbon source | not_detected | isoleucine; ile | complete |
| 22 | L-Leucine | carbon source | not_detected | leucine; leu | complete |
| 23 | L-Ornithine | carbon source | not_detected | ornithine | NaN |
| 24 | L-Serine | carbon source | not_detected | serine; ser; D-serine | complete |
| 25 | L-Serine | nitrogen source | not_detected | serine; ser; D-serine | complete |
| 26 | L-Threonine | nitrogen source | not_detected | NaN | NaN |
| 27 | N-Acetyl-D-Glucosamine | nitrogen source | not_detected | NaN | NaN |
| 28 | Parabanic Acid | nitrogen source | not_detected | NaN | NaN |
| 29 | Potassium acetate | carbon source | not_detected | acetate | complete |
| 30 | Putrescine Dihydrochloride | carbon source | not_detected | NaN | NaN |
| 31 | Putrescine Dihydrochloride | nitrogen source | not_detected | NaN | NaN |
| 32 | Sodium Fumarate dibasic | carbon source | not_detected | fumarate | complete |
| 33 | Sodium butyrate | carbon source | not_detected | NaN | NaN |
| 34 | Sodium nitrate | nitrogen source | not_detected | NaN | NaN |
| 35 | Sodium nitrite | nitrogen source | not_detected | NaN | NaN |
| 36 | Sodium propionate | carbon source | not_detected | NaN | NaN |
| 37 | Sodium pyruvate | carbon source | not_detected | pyruvate | complete |
| 38 | Sodium succinate dibasic hexahydrate | carbon source | not_detected | succinate | complete |
| 39 | Trisodium citrate dihydrate | carbon source | not_detected | NaN | NaN |
| 40 | Tween 20 | carbon source | not_detected | NaN | NaN |
| 41 | Urea | nitrogen source | not_detected | NaN | NaN |
| 42 | a-Ketoglutaric acid disodium salt hydrate | carbon source | not_detected | NaN | NaN |
| 43 | casamino acids | carbon source | not_detected | NaN | NaN |
| 44 | m-Inositol | carbon source | not_detected | NaN | NaN |
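The table above has one row per FB condition and experiment group, so a compound tested as both a carbon and a nitrogen source (D-Serine, for example) appears twice. The rows can be tallied by experiment group and by GapMind prediction, as in the minimal sketch below. It uses a handful of rows copied from the table rather than the live `fb_only_df`; `sample_rows` and `tally` are illustrative names, not part of the notebook, and `None` stands in for NaN.

```python
from collections import Counter

# A few rows copied from the fb_only_df output above:
# (fb_condition, fb_expGroup, gapmind_prediction); None stands in for NaN.
sample_rows = [
    ("Ammonium chloride", "nitrogen source", None),
    ("D-Fructose", "carbon source", "complete"),
    ("D-Mannose", "carbon source", "likely_complete"),
    ("D-Serine", "carbon source", "complete"),
    ("D-Serine", "nitrogen source", "complete"),
    ("L-Ornithine", "carbon source", None),
    ("Urea", "nitrogen source", None),
]


def tally(rows):
    """Count rows per experiment group and per GapMind prediction label."""
    by_group = Counter(group for _, group, _ in rows)
    by_prediction = Counter(
        pred if pred is not None else "no_prediction" for _, _, pred in rows
    )
    return by_group, by_prediction


by_group, by_prediction = tally(sample_rows)

# Conditions that occur more than once, i.e. tested in both experiment groups
cond_counts = Counter(cond for cond, _, _ in sample_rows)
dual_use = sorted(c for c, n in cond_counts.items() if n > 1)

print(by_group)
print(by_prediction)
print(dual_use)
```

On the real DataFrame, `fb_only_df['fb_expGroup'].value_counts()` would give the per-group counts directly.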
6. Summary Statistics
In [10]:
print("=" * 60)
print("DATA EXTRACTION SUMMARY")
print("=" * 60)
print(f"\nStrain: Pseudomonas fluorescens FW300-N2E3")
print(f"GTDB: Pseudomonas_E fluorescens_E (RS_GCF_001307155.1)")
print(f"\n--- Web of Microbes ---")
print(f" Total observations: {len(wom_df)}")
print(f" Emerged (de novo): {(wom_df['action']=='E').sum()}")
print(f" Increased: {(wom_df['action']=='I').sum()}")
print(f" No change: {(wom_df['action']=='N').sum()}")
print(f" Medium: R2A")
print(f"\n--- Fitness Browser ---")
print(f" Carbon source experiments: {len(fb_exps[fb_exps['expGroup']=='carbon source'])}")
print(f" Nitrogen source experiments: {len(fb_exps[fb_exps['expGroup']=='nitrogen source'])}")
print(f" Unique C/N conditions: {fb_conditions['condition_1'].nunique()}")
print(f" Protein-coding genes: {len(fb_genes)}")
print(f"\n--- BacDive ---")
print(f" P. fluorescens strains tested: {bacdive_df['bacdive_id'].nunique()}")
print(f" Compounds tested: {bacdive_summary['compound_name'].nunique()}")
print(f"\n--- GapMind ---")
print(f" Pathways predicted: {gapmind_n2e3['pathway'].nunique() if len(gapmind_n2e3) > 0 else 'N/A'}")
print(f" Genomes in clade: {gapmind_df['genome_id'].nunique()}")
print(f"\n--- Cross-Database Overlap ---")
print(f" WoM metabolites matched to FB: {crosswalk['fb_matched'].sum()} / {len(crosswalk)}")
print(f" WoM metabolites matched to BacDive: {crosswalk['bacdive_compound'].notna().sum()} / {len(crosswalk)}")
print(f" WoM metabolites matched to GapMind: {crosswalk['gapmind_prediction'].notna().sum()} / {len(crosswalk)}")
============================================================
DATA EXTRACTION SUMMARY
============================================================

Strain: Pseudomonas fluorescens FW300-N2E3
GTDB: Pseudomonas_E fluorescens_E (RS_GCF_001307155.1)

--- Web of Microbes ---
  Total observations: 105
  Emerged (de novo): 27
  Increased: 31
  No change: 47
  Medium: R2A

--- Fitness Browser ---
  Carbon source experiments: 82
  Nitrogen source experiments: 38
  Unique C/N conditions: 62
  Protein-coding genes: 5766

--- BacDive ---
  P. fluorescens strains tested: 105
  Compounds tested: 83

--- GapMind ---
  Pathways predicted: 80
  Genomes in clade: 40

--- Cross-Database Overlap ---
  WoM metabolites matched to FB: 28 / 58
  WoM metabolites matched to BacDive: 8 / 58
  WoM metabolites matched to GapMind: 13 / 58
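The printed totals can be cross-checked against each other. The sketch below copies the numbers from the output above into a plain dictionary; `summary` and `check_summary` are illustrative names, not part of the notebook. It assumes the 58-row crosswalk holds exactly one row per emerged or increased WoM metabolite. The counts are consistent with that, but the notebook output shown here does not state it outright.

```python
# Numbers copied from the summary output above.
summary = {
    "wom_total": 105,
    "wom_emerged": 27,
    "wom_increased": 31,
    "wom_no_change": 47,
    "crosswalk_rows": 58,
    "matched_fb": 28,
    "matched_bacdive": 8,
    "matched_gapmind": 13,
}


def check_summary(s):
    """Return a list of inconsistencies found in the summary counts."""
    problems = []
    actions = s["wom_emerged"] + s["wom_increased"] + s["wom_no_change"]
    if actions != s["wom_total"]:
        problems.append("WoM action counts do not sum to the total")
    # Assumption: the crosswalk has one row per produced (E or I) metabolite.
    if s["wom_emerged"] + s["wom_increased"] != s["crosswalk_rows"]:
        problems.append("crosswalk rows differ from emerged + increased")
    # A match count can never exceed the number of crosswalk rows.
    for key in ("matched_fb", "matched_bacdive", "matched_gapmind"):
        if not 0 <= s[key] <= s["crosswalk_rows"]:
            problems.append(f"{key} is outside 0..crosswalk_rows")
    return problems


problems = check_summary(summary)
print(problems or "summary counts are internally consistent")
```

In the notebook itself the same checks could be written against `wom_df` and `crosswalk` directly, before `spark.stop()` is called.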
In [11]:
spark.stop()
print("Spark session closed. Data files saved to ../data/")
Spark session closed. Data files saved to ../data/