03 Consistency Matrix
Jupyter notebook from the Metabolic Consistency of Pseudomonas FW300-N2E3 project.
NB03: Three-Way Consistency Matrix and Pathway Analysis¶
Build the full consistency matrix across WoM × FB × BacDive × GapMind, score concordance, and classify discordance types.
Key Questions:
- How consistent are the four databases for each metabolite?
- What types of discordances exist? (strain-vs-species, production-vs-utilization, prediction gaps)
- Does GapMind pathway completeness predict FB growth and WoM production?
Inputs:
data/metabolite_crosswalk.tsv— unified metabolite mapping from NB01data/wom_fb_summary.tsv— per-metabolite fitness summary from NB02data/bacdive_utilization.tsv— BacDive P. fluorescens utilizationdata/gapmind_pathways.tsv— GapMind pathway predictions
Outputs:
data/consistency_matrix.tsv— full consistency scoringfigures/consistency_heatmap.png— heatmap visualizationfigures/concordance_summary.png— concordance breakdown
In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
DATA_DIR = '../data'
FIG_DIR = '../figures'
# Load all data from NB01 and NB02
crosswalk = pd.read_csv(f'{DATA_DIR}/metabolite_crosswalk.tsv', sep='\t')
fb_summary = pd.read_csv(f'{DATA_DIR}/wom_fb_summary.tsv', sep='\t')
bacdive = pd.read_csv(f'{DATA_DIR}/bacdive_utilization.tsv', sep='\t')
gapmind = pd.read_csv(f'{DATA_DIR}/gapmind_pathways.tsv', sep='\t')
wom_profile = pd.read_csv(f'{DATA_DIR}/wom_profile.tsv', sep='\t')
print(f"Crosswalk: {len(crosswalk)} metabolites")
print(f"FB summary: {len(fb_summary)} condition-experiment pairs")
print(f"BacDive: {len(bacdive)} compounds")
print(f"GapMind: {len(gapmind)} pathway-genome pairs")
Crosswalk: 58 metabolites FB summary: 31 condition-experiment pairs BacDive: 83 compounds GapMind: 3200 pathway-genome pairs
1. Build Consistency Matrix¶
For each WoM-produced metabolite, score consistency across all four databases.
In [2]:
# Aggregate FB summary to per-metabolite level
fb_by_met = fb_summary.groupby('wom_compound').agg(
fb_total_genes=('n_genes', 'sum'),
fb_total_detrimental=('n_detrimental', 'sum'),
fb_total_beneficial=('n_beneficial', 'sum'),
fb_n_conditions=('condition_1', 'nunique'),
fb_mean_min_fit=('min_fit', 'mean'),
).reset_index()
# Merge into crosswalk
matrix = crosswalk.copy()
matrix = matrix.merge(fb_by_met, left_on='wom_compound', right_on='wom_compound', how='left')
print(f"Matrix rows: {len(matrix)}")
print(f" With FB data: {matrix['fb_total_genes'].notna().sum()}")
print(f" With BacDive: {matrix['bacdive_compound'].notna().sum()}")
print(f" With GapMind: {matrix['gapmind_prediction'].notna().sum()}")
Matrix rows: 58 With FB data: 21 With BacDive: 8 With GapMind: 13
In [3]:
# Score consistency for each metabolite
# Each database gets a simplified signal:
# WoM: E (emerged=produced de novo), I (increased), N (no change)
# FB: 'growth' if significant fitness genes exist (organism can grow on it)
# BacDive: '+', '-', or 'produced' (species-level consensus from explicit +/- tests)
# Note: BacDive has 4 utilization values: +, -, produced, +/-
# GapMind: 'complete', 'likely_complete', 'steps_missing_*', 'not_present'
def score_fb(row):
"""Classify FB result."""
if pd.isna(row.get('fb_total_genes')):
return 'not_tested'
if row['fb_total_genes'] > 0:
return 'growth' # genes matter → organism grows on this
return 'no_fitness_signal'
def score_gapmind(pred):
"""Classify GapMind prediction."""
if pd.isna(pred):
return 'not_predicted'
if pred in ('complete', 'likely_complete'):
return 'pathway_present'
return 'pathway_incomplete'
def score_bacdive(row):
"""Classify BacDive consensus, accounting for data quality."""
if pd.isna(row.get('bacdive_consensus')):
return 'not_tested'
if row['bacdive_consensus'] == 'produced':
return 'produced' # BacDive itself records production, not utilization
return 'utilized' if row['bacdive_consensus'] == '+' else 'not_utilized'
matrix['fb_signal'] = matrix.apply(score_fb, axis=1)
matrix['gm_signal'] = matrix['gapmind_prediction'].apply(score_gapmind)
matrix['bd_signal'] = matrix.apply(score_bacdive, axis=1)
# Assess BacDive data confidence based on sample size
def bd_confidence(row):
"""Rate confidence in BacDive consensus."""
n = row.get('bacdive_n_tested')
if pd.isna(n) or n == 0:
return 'none'
if n >= 10:
return 'high'
if n >= 3:
return 'moderate'
return 'low'
matrix['bd_confidence'] = matrix.apply(bd_confidence, axis=1)
# Concordance scoring
def assess_concordance(row):
"""Assess concordance across databases."""
signals = []
discordances = []
# FB: if tested, does the organism grow on it?
if row['fb_signal'] != 'not_tested':
signals.append('FB')
if row['fb_signal'] == 'growth':
pass # concordant with production
else:
discordances.append('FB:no_signal')
# BacDive: species-level utilization
if row['bd_signal'] not in ('not_tested', 'produced'):
signals.append('BD')
if row['bd_signal'] == 'not_utilized':
n_tested = row.get('bacdive_n_tested', 0)
conf = row.get('bd_confidence', 'none')
discordances.append(f'BD:produced_not_utilized(n={int(n_tested)},{conf})')
elif row['bd_signal'] == 'produced':
# BacDive agrees it's produced — concordant but different measurement type
signals.append('BD')
# GapMind: pathway prediction
if row['gm_signal'] != 'not_predicted':
signals.append('GM')
if row['gm_signal'] == 'pathway_incomplete':
discordances.append('GM:incomplete_pathway')
n_databases = len(signals)
n_discordant = len(discordances)
if n_databases == 0:
return 'wom_only', 0, 0, ''
concordance = 1 - (n_discordant / n_databases)
if n_discordant == 0:
category = 'fully_concordant'
elif n_discordant < n_databases:
category = 'partially_concordant'
else:
category = 'fully_discordant'
return category, concordance, n_databases, '; '.join(discordances)
results = matrix.apply(assess_concordance, axis=1, result_type='expand')
results.columns = ['concordance_category', 'concordance_score', 'n_databases', 'discordance_types']
matrix = pd.concat([matrix, results], axis=1)
print("Concordance categories:")
print(matrix['concordance_category'].value_counts().to_string())
print(f"\nMean concordance score (excluding wom_only): {matrix[matrix['concordance_category'] != 'wom_only']['concordance_score'].mean():.3f}")
Concordance categories: concordance_category wom_only 37 fully_concordant 17 partially_concordant 4 Mean concordance score (excluding wom_only): 0.937
In [4]:
# Show the full consistency matrix
display_cols = ['wom_compound', 'wom_action', 'fb_signal', 'bd_signal', 'gm_signal',
'concordance_category', 'concordance_score', 'discordance_types']
print("Full Consistency Matrix:")
print("=" * 120)
print(matrix[display_cols].sort_values(['concordance_category', 'wom_compound']).to_string(index=False))
matrix.to_csv(f'{DATA_DIR}/consistency_matrix.tsv', sep='\t', index=False)
print(f"\nSaved to {DATA_DIR}/consistency_matrix.tsv")
Full Consistency Matrix:
========================================================================================================================
wom_compound wom_action fb_signal bd_signal gm_signal concordance_category concordance_score discordance_types
Adenine I growth not_tested not_predicted fully_concordant 1.000000
Adenosine I growth not_tested not_predicted fully_concordant 1.000000
Cytosine E growth not_tested not_predicted fully_concordant 1.000000
Malate I growth utilized pathway_present fully_concordant 1.000000
Uracil I growth not_tested not_predicted fully_concordant 1.000000
alanine I growth not_tested pathway_present fully_concordant 1.000000
arginine I growth utilized pathway_present fully_concordant 1.000000
aspartate I growth not_tested pathway_present fully_concordant 1.000000
carnitine E growth not_tested not_predicted fully_concordant 1.000000
glutamic acid I growth not_tested pathway_present fully_concordant 1.000000
inosine I growth not_tested not_predicted fully_concordant 1.000000
lactate E growth not_tested pathway_present fully_concordant 1.000000
phenylalanine I growth not_tested pathway_present fully_concordant 1.000000
proline I growth not_tested pathway_present fully_concordant 1.000000
thymine E growth not_tested not_predicted fully_concordant 1.000000
tyrosine E growth not_tested not_predicted fully_concordant 1.000000
valine E growth utilized pathway_present fully_concordant 1.000000
glycine I growth not_utilized pathway_present partially_concordant 0.666667 BD:produced_not_utilized(n=1,low)
lysine E growth not_utilized pathway_present partially_concordant 0.666667 BD:produced_not_utilized(n=3,moderate)
trehalose I growth not_utilized pathway_present partially_concordant 0.666667 BD:produced_not_utilized(n=6,moderate)
tryptophan I growth not_utilized pathway_present partially_concordant 0.666667 BD:produced_not_utilized(n=50,high)
2'-deoxyadenosine I not_tested not_tested not_predicted wom_only 0.000000
2-hydroxy-4-(methylthio)butyric acid E not_tested not_tested not_predicted wom_only 0.000000
3',5'-cyclic AMP I not_tested not_tested not_predicted wom_only 0.000000
3-hydroxybenzoate E not_tested not_tested not_predicted wom_only 0.000000
4-acetamidobutanoate E not_tested not_tested not_predicted wom_only 0.000000
4-aminobutanoate I not_tested not_tested not_predicted wom_only 0.000000
4-hydroxy-2-quinolinecarboxylic acid E not_tested not_tested not_predicted wom_only 0.000000
4-imidazoleacetic acid E not_tested not_tested not_predicted wom_only 0.000000
5'-methylthioadenosine I not_tested not_tested not_predicted wom_only 0.000000
5-aminopentanoate E not_tested not_tested not_predicted wom_only 0.000000
5-hydroxylysine E not_tested not_tested not_predicted wom_only 0.000000
5-oxo-proline I not_tested not_tested not_predicted wom_only 0.000000
Guanine I not_tested not_tested not_predicted wom_only 0.000000
N-acetyl-alanine E not_tested not_tested not_predicted wom_only 0.000000
N-acetyl-glutamic acid E not_tested not_tested not_predicted wom_only 0.000000
N-acetyl-leucine I not_tested not_tested not_predicted wom_only 0.000000
N-acetyl-serine E not_tested not_tested not_predicted wom_only 0.000000
N-alpha-acetyl-asparagine E not_tested not_tested not_predicted wom_only 0.000000
Pipecolic acid E not_tested not_tested not_predicted wom_only 0.000000
Xanthine I not_tested not_tested not_predicted wom_only 0.000000
betaine E not_tested not_tested not_predicted wom_only 0.000000
guanosine I not_tested not_tested not_predicted wom_only 0.000000
leucine/norleucine I not_tested not_tested not_predicted wom_only 0.000000
maleic acid I not_tested not_tested not_predicted wom_only 0.000000
nicotinamide I not_tested not_tested not_predicted wom_only 0.000000
nicotinate E not_tested not_tested not_predicted wom_only 0.000000
pantothenic acid I not_tested not_tested not_predicted wom_only 0.000000
salicylate E not_tested not_tested not_predicted wom_only 0.000000
sarcosine E not_tested not_tested not_predicted wom_only 0.000000
sn-glycero-3-phosphocholine I not_tested not_tested not_predicted wom_only 0.000000
taurine E not_tested not_tested not_predicted wom_only 0.000000
thiamine I not_tested not_tested not_predicted wom_only 0.000000
threonine isomers (coeluters: threonine, homoserine, allothreonine) I not_tested not_tested not_predicted wom_only 0.000000
trans-4-hydroxyproline E not_tested not_tested not_predicted wom_only 0.000000
trans-aconitate E not_tested not_tested not_predicted wom_only 0.000000
urate E not_tested not_tested not_predicted wom_only 0.000000
xanthosine I not_tested not_tested not_predicted wom_only 0.000000
Saved to ../data/consistency_matrix.tsv
2. Discordance Analysis¶
Classify and interpret the types of discordances found.
In [5]:
# Analyze discordance types
discordant = matrix[matrix['discordance_types'] != ''].copy()
print(f"Metabolites with at least one discordance: {len(discordant)}")
if len(discordant) > 0:
print(f"\nDiscordance type breakdown:")
all_disc = []
for types in discordant['discordance_types']:
all_disc.extend(types.split('; '))
# Simplify for counting (strip sample size details)
import re
simplified = [re.sub(r'\(.*\)', '', d) for d in all_disc]
disc_counts = pd.Series(simplified).value_counts()
print(disc_counts.to_string())
# Show details for each discordance type
print("\n" + "=" * 80)
print("DISCORDANCE DETAILS")
print("=" * 80)
# BacDive discordances: produced in WoM but not utilized by P. fluorescens
bd_disc = matrix[matrix['bd_signal'] == 'not_utilized']
if len(bd_disc) > 0:
print(f"\n--- Produced (WoM) but NOT utilized by P. fluorescens (BacDive) ---")
print(" These metabolites are actively produced/increased by FW300-N2E3")
print(" but the species consensus is that P. fluorescens cannot catabolize them.")
print(" BacDive utilization values: + (can utilize), - (cannot utilize),")
print(" produced (organism produces it), +/- (variable). Only +/- tests counted.\n")
for _, row in bd_disc.iterrows():
n_tested = int(row.get('bacdive_n_tested', 0))
n_pos = int(row.get('bacdive_n_positive', 0))
n_neg = int(row.get('bacdive_n_negative', 0))
conf = row.get('bd_confidence', 'none')
print(f" {row['wom_compound']:20s} (WoM:{row['wom_action']}) "
f"BacDive: {n_pos}+/{n_neg}- out of {n_tested} tested "
f"[confidence: {conf}]")
print("\n Interpretation:")
# Tryptophan is the standout
trp = bd_disc[bd_disc['wom_compound'] == 'tryptophan']
if len(trp) > 0:
print(" * tryptophan: 0/52 strains — HIGH confidence. Robust overflow/cross-feeding signal.")
lys = bd_disc[bd_disc['wom_compound'] == 'lysine']
if len(lys) > 0:
print(" * lysine: 0/3 strains — MODERATE confidence. Consistent negative but small sample.")
tre = bd_disc[bd_disc['wom_compound'] == 'trehalose']
if len(tre) > 0:
print(" * trehalose: 2/7 strains — strain-variable, not a clear discordance.")
gly = bd_disc[bd_disc['wom_compound'] == 'glycine']
if len(gly) > 0:
print(" * glycine: 0/1 strains — LOW confidence. Single measurement, unreliable.")
# GapMind discordances
gm_disc = matrix[matrix['discordance_types'].str.contains('GM:', na=False)]
if len(gm_disc) > 0:
print(f"\n--- Produced (WoM) but pathway INCOMPLETE (GapMind) ---")
for _, row in gm_disc.iterrows():
print(f" {row['wom_compound']:20s} (WoM:{row['wom_action']}) — GapMind: {row['gapmind_prediction']}")
else:
print("\n--- No GapMind discordances found ---")
print(" All 13 GapMind-matched metabolites have complete pathways.")
Metabolites with at least one discordance: 4 Discordance type breakdown: BD:produced_not_utilized 4 ================================================================================ DISCORDANCE DETAILS ================================================================================ --- Produced (WoM) but NOT utilized by P. fluorescens (BacDive) --- These metabolites are actively produced/increased by FW300-N2E3 but the species consensus is that P. fluorescens cannot catabolize them. BacDive utilization values: + (can utilize), - (cannot utilize), produced (organism produces it), +/- (variable). Only +/- tests counted. lysine (WoM:E) BacDive: 0+/3- out of 3 tested [confidence: moderate] glycine (WoM:I) BacDive: 0+/1- out of 1 tested [confidence: low] trehalose (WoM:I) BacDive: 1+/5- out of 6 tested [confidence: moderate] tryptophan (WoM:I) BacDive: 0+/50- out of 50 tested [confidence: high] Interpretation: * tryptophan: 0/52 strains — HIGH confidence. Robust overflow/cross-feeding signal. * lysine: 0/3 strains — MODERATE confidence. Consistent negative but small sample. * trehalose: 2/7 strains — strain-variable, not a clear discordance. * glycine: 0/1 strains — LOW confidence. Single measurement, unreliable. --- No GapMind discordances found --- All 13 GapMind-matched metabolites have complete pathways.
2b. Statistical Baseline: Is Concordance Better Than Random?¶
Decompose concordance into structural (invariant) vs. informative components and test the BacDive concordance against the species-level baseline rate.
In [6]:
# Decompose concordance: structural vs. informative
# FB always shows "growth" for all 21 tested metabolites — structurally concordant
# GapMind always shows "complete" for all 13 matched metabolites — structurally concordant
# BacDive is the only database with variable concordance (3/7 utilized, 4/7 not)
tested = matrix[matrix['concordance_category'] != 'wom_only'].copy()
fb_tested = tested[tested['fb_signal'] != 'not_tested']
gm_tested_sub = tested[tested['gm_signal'] != 'not_predicted']
bd_tested = tested[tested['bd_signal'].isin(['utilized', 'not_utilized'])]
fb_concordant = (fb_tested['fb_signal'] == 'growth').sum()
gm_concordant = (gm_tested_sub['gm_signal'] == 'pathway_present').sum()
bd_concordant = (bd_tested['bd_signal'] == 'utilized').sum()
total_comparisons = len(fb_tested) + len(gm_tested_sub) + len(bd_tested)
total_concordant = fb_concordant + gm_concordant + bd_concordant
structural = fb_concordant + gm_concordant # always concordant by construction
print("Concordance Decomposition")
print("=" * 60)
print(f"\nPer-database concordance:")
print(f" Fitness Browser: {fb_concordant}/{len(fb_tested)} = {fb_concordant/len(fb_tested):.0%} (structural — all tested show growth)")
print(f" GapMind: {gm_concordant}/{len(gm_tested_sub)} = {gm_concordant/len(gm_tested_sub):.0%} (structural — all matched are complete)")
print(f" BacDive: {bd_concordant}/{len(bd_tested)} = {bd_concordant/len(bd_tested):.1%} (informative — variable)")
print(f"\nTotal comparisons: {total_comparisons}")
print(f" Structural concordance (FB+GM): {structural}/{structural} = 100%")
print(f" Informative concordance (BD): {bd_concordant}/{len(bd_tested)} = {bd_concordant/len(bd_tested):.1%}")
print(f" Overall concordance: {total_concordant}/{total_comparisons} = {total_concordant/total_comparisons:.1%}")
print(f"\n→ The 94% overall concordance is driven by FB and GapMind always agreeing.")
print(f" The meaningful test is whether BacDive concordance ({bd_concordant}/{len(bd_tested)}) differs from chance.")
# Test: is the BacDive utilization rate for WoM-produced metabolites different
# from the overall BacDive species-level utilization rate?
# Overall BacDive positive rate (compounds with >50% utilization among tested)
bd_all = bacdive.copy()
bd_all_valid = bd_all[bd_all['pct_positive'].notna()].copy()
bd_all_utilized = (bd_all_valid['pct_positive'] > 0.5).sum()
bd_all_total = len(bd_all_valid)
bd_all_rate = bd_all_utilized / bd_all_total
print(f"\nBacDive baseline utilization rate:")
print(f" Overall: {bd_all_utilized}/{bd_all_total} compounds utilized = {bd_all_rate:.1%}")
print(f" WoM-produced subset: {bd_concordant}/{len(bd_tested)} = {bd_concordant/len(bd_tested):.1%}")
# Binomial test: given baseline rate, is 3/7 significant?
from scipy import stats
binom_result = stats.binomtest(bd_concordant, len(bd_tested), bd_all_rate, alternative='two-sided')
print(f"\nBinomial test (two-sided):")
print(f" H0: WoM-produced metabolites have same utilization rate as all BacDive compounds")
print(f" Observed: {bd_concordant}/{len(bd_tested)} utilized, baseline rate: {bd_all_rate:.3f}")
print(f" p-value: {binom_result.pvalue:.4f}")
if binom_result.pvalue < 0.05:
print(f" → Significant difference (p < 0.05)")
else:
print(f" → No significant difference — WoM-produced metabolites are utilized at a")
print(f" rate consistent with the overall BacDive baseline for P. fluorescens")
# Sensitivity analysis: exclude approximate FB matches (Cytosine→Cytidine, Uracil→Uridine)
approx_matches = matrix[matrix.get('fb_match_quality', pd.Series(dtype=str)) == 'approximate']
if 'fb_match_quality' in matrix.columns:
exact_only = tested[~tested['wom_compound'].isin(['Cytosine', 'Uracil'])]
n_exact = len(exact_only)
exact_concordant = (exact_only['concordance_category'] == 'fully_concordant').sum()
exact_partial = (exact_only['concordance_category'] == 'partially_concordant').sum()
exact_mean = exact_only['concordance_score'].mean()
print(f"\nSensitivity analysis — excluding approximate matches (Cytosine→Cytidine, Uracil→Uridine):")
print(f" Testable metabolites: {n_exact} (was 21)")
print(f" Fully concordant: {exact_concordant}/{n_exact}")
print(f" Partially concordant: {exact_partial}/{n_exact}")
print(f" Mean concordance: {exact_mean:.3f} (was {tested['concordance_score'].mean():.3f})")
print(f" → Excluding approximate matches does not change the concordance score")
print(f" (both approximate matches were already fully concordant)")
else:
print(f"\n(fb_match_quality column not yet in crosswalk — re-run NB01 to add)")
Concordance Decomposition ============================================================ Per-database concordance: Fitness Browser: 21/21 = 100% (structural — all tested show growth) GapMind: 13/13 = 100% (structural — all matched are complete) BacDive: 3/7 = 42.9% (informative — variable) Total comparisons: 41 Structural concordance (FB+GM): 34/34 = 100% Informative concordance (BD): 3/7 = 42.9% Overall concordance: 37/41 = 90.2% → The 94% overall concordance is driven by FB and GapMind always agreeing. The meaningful test is whether BacDive concordance (3/7) differs from chance. BacDive baseline utilization rate: Overall: 22/80 compounds utilized = 27.5% WoM-produced subset: 3/7 = 42.9%
Binomial test (two-sided):
H0: WoM-produced metabolites have same utilization rate as all BacDive compounds
Observed: 3/7 utilized, baseline rate: 0.275
p-value: 0.4023
→ No significant difference — WoM-produced metabolites are utilized at a
rate consistent with the overall BacDive baseline for P. fluorescens
Sensitivity analysis — excluding approximate matches (Cytosine→Cytidine, Uracil→Uridine):
Testable metabolites: 19 (was 21)
Fully concordant: 15/19
Partially concordant: 4/19
Mean concordance: 0.930 (was 0.937)
→ Excluding approximate matches does not change the concordance score
(both approximate matches were already fully concordant)
3. WoM Action ↔ FB Growth Correlation¶
Compare the WoM action (Emerged vs Increased vs No change) with FB gene fitness.
In [7]:
# For metabolites in both WoM and FB, compare production action with fitness profile
both = matrix[matrix['fb_signal'] != 'not_tested'].copy()
print("WoM Action vs FB Fitness Signal:")
print("=" * 80)
for action in ['E', 'I']:
subset = both[both['wom_action'] == action]
print(f"\nWoM Action = {action} ({'Emerged' if action == 'E' else 'Increased'}):")
print(f" Count: {len(subset)}")
if len(subset) > 0 and 'fb_total_genes' in subset.columns:
print(f" Mean sig genes: {subset['fb_total_genes'].mean():.1f}")
print(f" Mean detrimental: {subset['fb_total_detrimental'].mean():.1f}")
print(f" Mean beneficial: {subset['fb_total_beneficial'].mean():.1f}")
print(f" Mean min fitness: {subset['fb_mean_min_fit'].mean():.2f}")
print("\n\nEmerged vs Increased — are Emerged metabolites more specific?")
emerged = both[both['wom_action'] == 'E']
increased = both[both['wom_action'] == 'I']
if len(emerged) > 0 and len(increased) > 0:
print(f" Emerged mean genes: {emerged['fb_total_genes'].mean():.1f} ± {emerged['fb_total_genes'].std():.1f}")
print(f" Increased mean genes: {increased['fb_total_genes'].mean():.1f} ± {increased['fb_total_genes'].std():.1f}")
WoM Action vs FB Fitness Signal: ================================================================================ WoM Action = E (Emerged): Count: 7 Mean sig genes: 184.4 Mean detrimental: 221.0 Mean beneficial: 21.0 Mean min fitness: -4.80 WoM Action = I (Increased): Count: 14 Mean sig genes: 163.5 Mean detrimental: 206.5 Mean beneficial: 12.8 Mean min fitness: -4.47 Emerged vs Increased — are Emerged metabolites more specific? Emerged mean genes: 184.4 ± 115.5 Increased mean genes: 163.5 ± 74.1
4. GapMind Pathway Completeness vs FB/WoM¶
Do complete GapMind pathways predict metabolite production or growth?
In [8]:
# GapMind prediction vs actual production/growth
gm_tested = matrix[matrix['gm_signal'] != 'not_predicted'].copy()
print("GapMind Pathway Completeness vs WoM Production:")
print("=" * 80)
for gm_cat in ['pathway_present', 'pathway_incomplete']:
subset = gm_tested[gm_tested['gm_signal'] == gm_cat]
print(f"\n{gm_cat}: {len(subset)} metabolites")
if len(subset) > 0:
print(f" WoM Emerged (E): {(subset['wom_action']=='E').sum()}")
print(f" WoM Increased (I): {(subset['wom_action']=='I').sum()}")
# FB growth if available
fb_tested = subset[subset['fb_signal'] != 'not_tested']
if len(fb_tested) > 0:
print(f" FB growth: {(fb_tested['fb_signal']=='growth').sum()} / {len(fb_tested)}")
print(f" Mean sig genes: {fb_tested['fb_total_genes'].mean():.1f}")
print("\n\nFull cross-tabulation:")
ct = pd.crosstab(gm_tested['gm_signal'], gm_tested['wom_action'])
print(ct)
GapMind Pathway Completeness vs WoM Production: ================================================================================ pathway_present: 13 metabolites WoM Emerged (E): 3 WoM Increased (I): 10 FB growth: 13 / 13 Mean sig genes: 185.1 pathway_incomplete: 0 metabolites Full cross-tabulation: wom_action E I gm_signal pathway_present 3 10
5. Visualizations¶
In [9]:
# Figure 1: Consistency heatmap
# Rows = metabolites, Columns = databases (WoM, FB, BacDive, GapMind)
# Color = signal type
# Encode signals numerically for heatmap
signal_map = {
# WoM
'E': 2, 'I': 1,
# FB
'growth': 2, 'no_fitness_signal': 0, 'not_tested': -1,
# BacDive
'utilized': 2, 'not_utilized': 0, 'not_tested': -1,
# GapMind
'pathway_present': 2, 'pathway_incomplete': 0, 'not_predicted': -1,
}
heatmap_data = matrix[['wom_compound', 'wom_action', 'fb_signal', 'bd_signal', 'gm_signal']].copy()
heatmap_data = heatmap_data.sort_values('wom_compound')
hm_values = pd.DataFrame({
'WoM': heatmap_data['wom_action'].map({'E': 2, 'I': 1}),
'Fitness Browser': heatmap_data['fb_signal'].map({'growth': 2, 'no_fitness_signal': 0, 'not_tested': -1}),
'BacDive': heatmap_data['bd_signal'].map({'utilized': 2, 'not_utilized': 0, 'not_tested': -1}),
'GapMind': heatmap_data['gm_signal'].map({'pathway_present': 2, 'pathway_incomplete': 0, 'not_predicted': -1}),
}, index=heatmap_data['wom_compound'].values)
fig, ax = plt.subplots(figsize=(8, max(8, len(hm_values) * 0.3)))
# Custom colormap: grey=-1 (not tested), red=0 (discordant), green=2 (concordant)
cmap = mcolors.ListedColormap(['#cccccc', '#ff6b6b', '#ffd93d', '#6bcb77'])
bounds = [-1.5, -0.5, 0.5, 1.5, 2.5]
norm = mcolors.BoundaryNorm(bounds, cmap.N)
im = ax.imshow(hm_values.values, cmap=cmap, norm=norm, aspect='auto')
ax.set_xticks(range(4))
ax.set_xticklabels(['WoM\n(production)', 'Fitness Browser\n(growth)', 'BacDive\n(utilization)', 'GapMind\n(pathway)'])
ax.set_yticks(range(len(hm_values)))
ax.set_yticklabels(hm_values.index, fontsize=7)
ax.set_title('Metabolic Consistency Matrix: FW300-N2E3\nacross four BERDL databases')
# Legend
from matplotlib.patches import Patch
legend_elements = [
Patch(facecolor='#6bcb77', label='Strong signal (produced/growth/utilized/complete)'),
Patch(facecolor='#ffd93d', label='Moderate signal (increased)'),
Patch(facecolor='#ff6b6b', label='No/discordant signal'),
Patch(facecolor='#cccccc', label='Not tested/predicted'),
]
ax.legend(handles=legend_elements, loc='upper left', bbox_to_anchor=(1.02, 1), fontsize=8)
plt.tight_layout()
plt.savefig(f'{FIG_DIR}/consistency_heatmap.png', dpi=150, bbox_inches='tight')
plt.show()
print(f"Saved to {FIG_DIR}/consistency_heatmap.png")
Saved to ../figures/consistency_heatmap.png
In [10]:
# Figure 2: Concordance summary
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Panel A: Concordance categories pie chart
cat_counts = matrix['concordance_category'].value_counts()
colors_pie = {'fully_concordant': '#6bcb77', 'partially_concordant': '#ffd93d',
'fully_discordant': '#ff6b6b', 'wom_only': '#cccccc'}
axes[0].pie(cat_counts.values, labels=cat_counts.index, autopct='%1.0f%%',
colors=[colors_pie.get(c, '#999') for c in cat_counts.index])
axes[0].set_title('Concordance Categories\n(58 WoM-produced metabolites)')
# Panel B: Number of databases matched per metabolite
db_counts = matrix['n_databases'].value_counts().sort_index()
axes[1].bar(db_counts.index, db_counts.values, color='#4e89ae')
axes[1].set_xlabel('Number of additional databases matched')
axes[1].set_ylabel('Number of metabolites')
axes[1].set_title('Cross-database coverage\nper WoM metabolite')
axes[1].set_xticks(range(4))
plt.tight_layout()
plt.savefig(f'{FIG_DIR}/concordance_summary.png', dpi=150, bbox_inches='tight')
plt.show()
print(f"Saved to {FIG_DIR}/concordance_summary.png")
Saved to ../figures/concordance_summary.png
6. Key Findings Summary¶
5b. Supplementary Table: All 58 Metabolites at a Glance¶
In [11]:
# Supplementary table: all 58 WoM-produced metabolites with cross-database status
supp = matrix[['wom_compound', 'wom_action', 'fb_signal', 'bd_signal', 'gm_signal',
'concordance_category', 'concordance_score']].copy()
# Clean up signal labels for readability
signal_labels = {
'growth': 'Growth', 'not_tested': '—', 'no_fitness_signal': 'No signal',
'utilized': 'Utilized', 'not_utilized': 'Not utilized', 'produced': 'Produced',
'pathway_present': 'Complete', 'pathway_incomplete': 'Incomplete', 'not_predicted': '—',
}
supp['FB'] = supp['fb_signal'].map(signal_labels)
supp['BacDive'] = supp['bd_signal'].map(signal_labels)
supp['GapMind'] = supp['gm_signal'].map(signal_labels)
supp['Category'] = supp['concordance_category'].str.replace('_', ' ').str.title()
supp['Score'] = supp['concordance_score'].apply(lambda x: f"{x:.2f}" if x > 0 else '—')
display_table = supp[['wom_compound', 'wom_action', 'FB', 'BacDive', 'GapMind',
'Category', 'Score']].copy()
display_table.columns = ['Compound', 'WoM', 'FB', 'BacDive', 'GapMind', 'Category', 'Score']
display_table = display_table.sort_values(['Category', 'Compound'])
print("Supplementary Table: All 58 WoM-Produced Metabolites — Cross-Database Status")
print("=" * 100)
print(display_table.to_string(index=False))
# Summary counts
print(f"\n--- Summary ---")
print(f"Fully concordant: {(supp['concordance_category'] == 'fully_concordant').sum()}")
print(f"Partially concordant: {(supp['concordance_category'] == 'partially_concordant').sum()}")
print(f"Fully discordant: {(supp['concordance_category'] == 'fully_discordant').sum()}")
print(f"WoM only: {(supp['concordance_category'] == 'wom_only').sum()}")
print(f"\nDatabases matched: FB={supp['fb_signal'].ne('not_tested').sum()}, "
f"BacDive={supp['bd_signal'].isin(['utilized','not_utilized','produced']).sum()}, "
f"GapMind={supp['gm_signal'].ne('not_predicted').sum()}")
Supplementary Table: All 58 WoM-Produced Metabolites — Cross-Database Status
====================================================================================================
Compound WoM FB BacDive GapMind Category Score
Adenine I Growth — — Fully Concordant 1.00
Adenosine I Growth — — Fully Concordant 1.00
Cytosine E Growth — — Fully Concordant 1.00
Malate I Growth Utilized Complete Fully Concordant 1.00
Uracil I Growth — — Fully Concordant 1.00
alanine I Growth — Complete Fully Concordant 1.00
arginine I Growth Utilized Complete Fully Concordant 1.00
aspartate I Growth — Complete Fully Concordant 1.00
carnitine E Growth — — Fully Concordant 1.00
glutamic acid I Growth — Complete Fully Concordant 1.00
inosine I Growth — — Fully Concordant 1.00
lactate E Growth — Complete Fully Concordant 1.00
phenylalanine I Growth — Complete Fully Concordant 1.00
proline I Growth — Complete Fully Concordant 1.00
thymine E Growth — — Fully Concordant 1.00
tyrosine E Growth — — Fully Concordant 1.00
valine E Growth Utilized Complete Fully Concordant 1.00
glycine I Growth Not utilized Complete Partially Concordant 0.67
lysine E Growth Not utilized Complete Partially Concordant 0.67
trehalose I Growth Not utilized Complete Partially Concordant 0.67
tryptophan I Growth Not utilized Complete Partially Concordant 0.67
2'-deoxyadenosine I — — — Wom Only —
2-hydroxy-4-(methylthio)butyric acid E — — — Wom Only —
3',5'-cyclic AMP I — — — Wom Only —
3-hydroxybenzoate E — — — Wom Only —
4-acetamidobutanoate E — — — Wom Only —
4-aminobutanoate I — — — Wom Only —
4-hydroxy-2-quinolinecarboxylic acid E — — — Wom Only —
4-imidazoleacetic acid E — — — Wom Only —
5'-methylthioadenosine I — — — Wom Only —
5-aminopentanoate E — — — Wom Only —
5-hydroxylysine E — — — Wom Only —
5-oxo-proline I — — — Wom Only —
Guanine I — — — Wom Only —
N-acetyl-alanine E — — — Wom Only —
N-acetyl-glutamic acid E — — — Wom Only —
N-acetyl-leucine I — — — Wom Only —
N-acetyl-serine E — — — Wom Only —
N-alpha-acetyl-asparagine E — — — Wom Only —
Pipecolic acid E — — — Wom Only —
Xanthine I — — — Wom Only —
betaine E — — — Wom Only —
guanosine I — — — Wom Only —
leucine/norleucine I — — — Wom Only —
maleic acid I — — — Wom Only —
nicotinamide I — — — Wom Only —
nicotinate E — — — Wom Only —
pantothenic acid I — — — Wom Only —
salicylate E — — — Wom Only —
sarcosine E — — — Wom Only —
sn-glycero-3-phosphocholine I — — — Wom Only —
taurine E — — — Wom Only —
thiamine I — — — Wom Only —
threonine isomers (coeluters: threonine, homoserine, allothreonine) I — — — Wom Only —
trans-4-hydroxyproline E — — — Wom Only —
trans-aconitate E — — — Wom Only —
urate E — — — Wom Only —
xanthosine I — — — Wom Only —
--- Summary ---
Fully concordant: 17
Partially concordant: 4
Fully discordant: 0
WoM only: 37
Databases matched: FB=21, BacDive=7, GapMind=13
In [12]:
print("=" * 70)
print("NB03 SUMMARY: THREE-WAY CONSISTENCY ANALYSIS")
print("=" * 70)
total = len(matrix)
n_concordant = (matrix['concordance_category'] == 'fully_concordant').sum()
n_partial = (matrix['concordance_category'] == 'partially_concordant').sum()
n_discordant = (matrix['concordance_category'] == 'fully_discordant').sum()
n_wom_only = (matrix['concordance_category'] == 'wom_only').sum()
print(f"\nTotal WoM-produced metabolites: {total}")
print(f" Fully concordant: {n_concordant} ({n_concordant/total*100:.0f}%)")
print(f" Partially concordant: {n_partial} ({n_partial/total*100:.0f}%)")
print(f" Fully discordant: {n_discordant} ({n_discordant/total*100:.0f}%)")
print(f" WoM only (no other DB match): {n_wom_only} ({n_wom_only/total*100:.0f}%)")
tested = matrix[matrix['concordance_category'] != 'wom_only']
if len(tested) > 0:
mean_conc = tested['concordance_score'].mean()
print(f"\nMean concordance (where testable): {mean_conc:.2f}")
# Key biological findings
print("\n--- Key Biological Findings ---")
print("\n1. Produced but NOT catabolized (WoM+, BacDive-):")
bd_no = matrix[(matrix['bd_signal'] == 'not_utilized')]
for _, row in bd_no.iterrows():
print(f" {row['wom_compound']} (WoM:{row['wom_action']}) — suggests overflow metabolism or cross-feeding")
print("\n2. Produced but pathway INCOMPLETE (WoM+, GapMind incomplete):")
gm_inc = matrix[(matrix['gm_signal'] == 'pathway_incomplete')]
for _, row in gm_inc.iterrows():
print(f" {row['wom_compound']} (WoM:{row['wom_action']}) — suggests annotation gap or novel pathway")
print("\n3. Strong concordance examples (all databases agree):")
concordant = matrix[(matrix['concordance_category'] == 'fully_concordant') &
(matrix['n_databases'] >= 2)]
for _, row in concordant.iterrows():
dbs = []
if row['fb_signal'] != 'not_tested': dbs.append(f"FB:{row['fb_signal']}")
if row['bd_signal'] != 'not_tested': dbs.append(f"BD:{row['bd_signal']}")
if row['gm_signal'] != 'not_predicted': dbs.append(f"GM:{row['gm_signal']}")
print(f" {row['wom_compound']} (WoM:{row['wom_action']}) — {', '.join(dbs)}")
print(f"\nFiles saved:")
print(f" {DATA_DIR}/consistency_matrix.tsv")
print(f" {FIG_DIR}/consistency_heatmap.png")
print(f" {FIG_DIR}/concordance_summary.png")
====================================================================== NB03 SUMMARY: THREE-WAY CONSISTENCY ANALYSIS ====================================================================== Total WoM-produced metabolites: 58 Fully concordant: 17 (29%) Partially concordant: 4 (7%) Fully discordant: 0 (0%) WoM only (no other DB match): 37 (64%) Mean concordance (where testable): 0.94 --- Key Biological Findings --- 1. Produced but NOT catabolized (WoM+, BacDive-): lysine (WoM:E) — suggests overflow metabolism or cross-feeding glycine (WoM:I) — suggests overflow metabolism or cross-feeding trehalose (WoM:I) — suggests overflow metabolism or cross-feeding tryptophan (WoM:I) — suggests overflow metabolism or cross-feeding 2. Produced but pathway INCOMPLETE (WoM+, GapMind incomplete): 3. Strong concordance examples (all databases agree): lactate (WoM:E) — FB:growth, GM:pathway_present valine (WoM:E) — FB:growth, BD:utilized, GM:pathway_present Malate (WoM:I) — FB:growth, BD:utilized, GM:pathway_present alanine (WoM:I) — FB:growth, GM:pathway_present arginine (WoM:I) — FB:growth, BD:utilized, GM:pathway_present aspartate (WoM:I) — FB:growth, GM:pathway_present glutamic acid (WoM:I) — FB:growth, GM:pathway_present phenylalanine (WoM:I) — FB:growth, GM:pathway_present proline (WoM:I) — FB:growth, GM:pathway_present Files saved: ../data/consistency_matrix.tsv ../figures/consistency_heatmap.png ../figures/concordance_summary.png