02 Pathway Gene Linking
Jupyter notebook from the Metabolic Capability vs Metabolic Dependency project.
NB02: Pathway-Gene Linking¶
Map GapMind pathway names to FB genes via KEGG pathway and EC number annotations from eggNOG, then merge with fitness data to create the analysis-ready dataset.
Requires BERDL JupyterHub — uses Spark to build the gene-cluster → FB-gene link.
Inputs (from NB01)¶
data/organism_mapping.csvdata/fb_gapmind_scores.csvdata/fb_eggnog_annotations.csvdata/fb_fitness_stats.csvdata/fb_essential_genes.csv
Outputs¶
data/pathway_gene_fitness.csv— each row = (organism, pathway, gene, fitness, conservation)data/kegg_gapmind_mapping.csv— KEGG pathway → GapMind pathway mapping table
Approach¶
- Define KEGG pathway → GapMind pathway mapping (amino acid + carbon source)
- Build FB gene → pangenome cluster link via Spark (scaffold ID matching)
- Assign genes to GapMind pathways via eggNOG EC/KEGG annotations
- Merge with fitness data
- Quality control and output
import pandas as pd
import numpy as np
from pathlib import Path
import re
spark = get_spark_session()
DATA_DIR = Path('../data')
# Load NB01 outputs
org_mapping = pd.read_csv(DATA_DIR / 'organism_mapping.csv')
gapmind_fb = pd.read_csv(DATA_DIR / 'fb_gapmind_scores.csv')
eggnog = pd.read_csv(DATA_DIR / 'fb_eggnog_annotations.csv')
fitness_stats = pd.read_csv(DATA_DIR / 'fb_fitness_stats.csv')
essential_genes = pd.read_csv(DATA_DIR / 'fb_essential_genes.csv')
print(f"Organisms: {org_mapping['orgId'].nunique()}")
print(f"GapMind scores: {len(gapmind_fb)}")
print(f"eggNOG annotations: {len(eggnog):,}")
print(f"Fitness stats: {len(fitness_stats):,}")
print(f"Essential genes: {len(essential_genes):,}")
1. Define KEGG → GapMind Pathway Mapping¶
GapMind covers 80 pathways: ~18 amino acid biosynthesis + ~62 carbon source utilization. We map these to KEGG pathway IDs and EC number patterns from eggNOG annotations.
The mapping uses three strategies:
- KEGG Pathway IDs (from eggNOG
KEGG_Pathwaycolumn) — most reliable - EC number patterns — for pathways with well-defined enzyme sets
- KEGG Ortholog (KO) — for specific enzymes not captured by pathway IDs
# KEGG pathway ID → GapMind pathway name mapping
# Each GapMind pathway may map to one or more KEGG pathways
# Amino acid biosynthesis pathways
AA_KEGG_MAP = {
# GapMind pathway: [KEGG map IDs that contain relevant biosynthesis genes]
'alanine': ['map00250'], # Alanine, aspartate, glutamate metabolism
'arginine': ['map00220', 'map00330'], # Arginine biosynthesis; Arg+Pro metabolism
'asparagine': ['map00250'], # Alanine, aspartate, glutamate metabolism
'aspartate': ['map00250'], # Alanine, aspartate, glutamate metabolism
'cysteine': ['map00270'], # Cysteine and methionine metabolism
'glutamate': ['map00250'], # Alanine, aspartate, glutamate metabolism
'glutamine': ['map00250'], # Alanine, aspartate, glutamate metabolism
'glycine': ['map00260'], # Glycine, serine, threonine metabolism
'histidine': ['map00340'], # Histidine metabolism
'isoleucine': ['map00290'], # Val, Leu, Ile biosynthesis
'leucine': ['map00290'], # Val, Leu, Ile biosynthesis
'lysine': ['map00300'], # Lysine biosynthesis
'methionine': ['map00270'], # Cysteine and methionine metabolism
'phenylalanine': ['map00400'], # Phe, Tyr, Trp biosynthesis
'proline': ['map00330'], # Arginine and proline metabolism
'serine': ['map00260'], # Glycine, serine, threonine metabolism
'threonine': ['map00260'], # Glycine, serine, threonine metabolism
'tryptophan': ['map00400'], # Phe, Tyr, Trp biosynthesis
'tyrosine': ['map00400'], # Phe, Tyr, Trp biosynthesis
'valine': ['map00290'], # Val, Leu, Ile biosynthesis
}
# Carbon source utilization pathways (subset — most common ones)
CARBON_KEGG_MAP = {
'glucose': ['map00010', 'map00030'], # Glycolysis; Pentose phosphate
'galactose': ['map00052'], # Galactose metabolism
'mannose': ['map00051'], # Fructose and mannose metabolism
'fructose': ['map00051'], # Fructose and mannose metabolism
'xylose': ['map00040'], # Pentose and glucuronate interconversions
'arabinose': ['map00040'], # Pentose and glucuronate interconversions
'ribose': ['map00030'], # Pentose phosphate pathway
'gluconate': ['map00030'], # Pentose phosphate pathway
'glucuronate': ['map00040'], # Pentose and glucuronate interconversions
'galacturonate': ['map00040'], # Pentose and glucuronate interconversions
'acetate': ['map00620', 'map00630'], # Pyruvate; Glyoxylate metabolism
'citrate': ['map00020'], # Citrate cycle
'succinate': ['map00020'], # Citrate cycle
'fumarate': ['map00020'], # Citrate cycle
'malate': ['map00020'], # Citrate cycle
'pyruvate': ['map00620'], # Pyruvate metabolism
'lactate': ['map00620'], # Pyruvate metabolism
'glycerol': ['map00561'], # Glycerolipid metabolism
'mannitol': ['map00051'], # Fructose and mannose metabolism
'sorbitol': ['map00051'], # Fructose and mannose metabolism
'trehalose': ['map00500'], # Starch and sucrose metabolism
'maltose': ['map00500'], # Starch and sucrose metabolism
'sucrose': ['map00500'], # Starch and sucrose metabolism
'cellobiose': ['map00500'], # Starch and sucrose metabolism
'lactose': ['map00052'], # Galactose metabolism
'N-acetylglucosamine': ['map00520'], # Amino sugar metabolism
'myo-inositol': ['map00562'], # Inositol phosphate metabolism
'putrescine': ['map00330'], # Arginine and proline metabolism
'4-aminobutyrate': ['map00250'], # Alanine, aspartate, glutamate metabolism
'phenylacetate': ['map00360'], # Phenylalanine metabolism
'benzoate': ['map00362'], # Benzoate degradation
'4-hydroxybenzoate': ['map00362'], # Benzoate degradation
'vanillate': ['map00362'], # Benzoate degradation
}
# Combine all mappings
ALL_KEGG_MAP = {}
for pathway, kegg_ids in AA_KEGG_MAP.items():
ALL_KEGG_MAP[pathway] = {'kegg_pathways': kegg_ids, 'category': 'amino_acid'}
for pathway, kegg_ids in CARBON_KEGG_MAP.items():
ALL_KEGG_MAP[pathway] = {'kegg_pathways': kegg_ids, 'category': 'carbon'}
print(f"GapMind → KEGG mapping defined for {len(ALL_KEGG_MAP)} pathways")
print(f" Amino acid: {len(AA_KEGG_MAP)}")
print(f" Carbon source: {len(CARBON_KEGG_MAP)}")
# Check which GapMind pathways we can map
gapmind_pathways = gapmind_fb['pathway'].unique()
mapped = set(ALL_KEGG_MAP.keys()) & set(gapmind_pathways)
unmapped = set(gapmind_pathways) - set(ALL_KEGG_MAP.keys())
print(f"\nGapMind pathways with KEGG mapping: {len(mapped)}/{len(gapmind_pathways)}")
if unmapped:
print(f"Unmapped GapMind pathways: {sorted(unmapped)}")
# Build inverted index: KEGG pathway ID → list of GapMind pathways
kegg_to_gapmind = {}
for gm_pathway, info in ALL_KEGG_MAP.items():
for kegg_id in info['kegg_pathways']:
if kegg_id not in kegg_to_gapmind:
kegg_to_gapmind[kegg_id] = []
kegg_to_gapmind[kegg_id].append(gm_pathway)
# Save mapping for reference
mapping_rows = []
for gm_pathway, info in ALL_KEGG_MAP.items():
for kegg_id in info['kegg_pathways']:
mapping_rows.append({
'gapmind_pathway': gm_pathway,
'kegg_pathway_id': kegg_id,
'metabolic_category': info['category']
})
kegg_mapping_df = pd.DataFrame(mapping_rows)
kegg_mapping_df.to_csv(DATA_DIR / 'kegg_gapmind_mapping.csv', index=False)
print(f"Saved KEGG-GapMind mapping: {len(kegg_mapping_df)} entries")
2. Build FB Gene → Pangenome Cluster Link¶
We need to connect FB genes (locusId) to pangenome gene clusters. This uses the scaffold accession matching approach: FB scaffold IDs are embedded in pangenome gene IDs.
If the conservation_vs_fitness link table is available, we use that instead
(it was built with DIAMOND for higher accuracy).
# Check if upstream link table exists
upstream_link = Path('../../conservation_vs_fitness/data/fb_pangenome_link.tsv')
if upstream_link.exists():
print("Found upstream link table from conservation_vs_fitness")
fb_link = pd.read_csv(upstream_link, sep='\t')
print(f"Loaded: {len(fb_link):,} gene-cluster links")
print(f"Organisms: {fb_link['orgId'].nunique()}")
else:
print("Upstream link table not found — building from scaffold matching...")
print("(For higher accuracy, run conservation_vs_fitness NB01-03 first)")
# Build link via scaffold matching
# Get FB scaffolds per organism
fb_scaffolds = spark.sql(f"""
SELECT DISTINCT orgId, locusId, scaffoldId
FROM kescience_fitnessbrowser.gene
WHERE orgId IN ('{",'".join(org_mapping['orgId'].tolist())}')
AND type = '1'
""").toPandas()
# For each organism's clade, find gene clusters matching the scaffold
link_rows = []
for _, org_row in org_mapping.iterrows():
orgId = org_row['orgId']
clade = org_row['gtdb_species_clade_id']
# Get this organism's scaffolds
org_scaffs = fb_scaffolds[fb_scaffolds['orgId'] == orgId]['scaffoldId'].unique()
if len(org_scaffs) == 0:
continue
# Find gene clusters in this clade whose IDs start with any of the scaffolds
scaff_conditions = " OR ".join([f"gene_cluster_id LIKE '{s}%'" for s in org_scaffs[:5]])
clusters = spark.sql(f"""
SELECT gene_cluster_id, is_core, is_auxiliary, is_singleton
FROM kbase_ke_pangenome.gene_cluster
WHERE gtdb_species_clade_id = '{clade}'
AND ({scaff_conditions})
""").toPandas()
if len(clusters) > 0:
clusters['orgId'] = orgId
clusters['gtdb_species_clade_id'] = clade
link_rows.append(clusters)
print(f" {orgId}: {len(clusters)} gene clusters matched")
if link_rows:
fb_link = pd.concat(link_rows, ignore_index=True)
print(f"\nTotal links: {len(fb_link):,}")
else:
fb_link = pd.DataFrame()
print("WARNING: No gene-cluster links found")
# Show coverage per organism
if len(fb_link) > 0:
print(f"\nLink table: {len(fb_link):,} entries across {fb_link['orgId'].nunique()} organisms")
3. Assign Genes to GapMind Pathways via eggNOG¶
Use the KEGG_Pathway annotations from eggNOG to assign gene clusters to GapMind metabolic pathways. Then connect to FB genes via the link table.
# Parse eggNOG KEGG_Pathway column
# Format: comma-separated KEGG pathway IDs (e.g., "map00010,map00020")
# or may contain 'ko' prefix entries
def extract_kegg_pathways(kegg_str):
"""Extract KEGG map IDs from eggNOG KEGG_Pathway string."""
if pd.isna(kegg_str) or kegg_str == '-' or kegg_str == '':
return []
return [p.strip() for p in str(kegg_str).split(',') if p.strip().startswith('map')]
def extract_ec_numbers(ec_str):
"""Extract EC numbers from eggNOG EC string."""
if pd.isna(ec_str) or ec_str == '-' or ec_str == '':
return []
return [e.strip() for e in str(ec_str).split(',') if re.match(r'\d+\.\d+\.\d+\.\d+', e.strip())]
# Explode eggNOG to one row per gene_cluster × KEGG pathway
eggnog_with_kegg = eggnog[eggnog['KEGG_Pathway'].notna()].copy()
eggnog_with_kegg['kegg_pathway_list'] = eggnog_with_kegg['KEGG_Pathway'].apply(extract_kegg_pathways)
eggnog_exploded = eggnog_with_kegg.explode('kegg_pathway_list')
eggnog_exploded = eggnog_exploded.dropna(subset=['kegg_pathway_list'])
eggnog_exploded = eggnog_exploded.rename(columns={'kegg_pathway_list': 'kegg_pathway_id'})
print(f"eggNOG entries with KEGG pathway: {len(eggnog_with_kegg):,}")
print(f"After exploding to individual pathways: {len(eggnog_exploded):,}")
print(f"Unique KEGG pathways in data: {eggnog_exploded['kegg_pathway_id'].nunique()}")
# Map gene clusters to GapMind pathways via KEGG
# A gene cluster is assigned to a GapMind pathway if its KEGG pathway annotation
# matches any of the KEGG IDs in our mapping
gapmind_assignments = []
for _, row in eggnog_exploded.iterrows():
kegg_id = row['kegg_pathway_id']
if kegg_id in kegg_to_gapmind:
for gm_pathway in kegg_to_gapmind[kegg_id]:
gapmind_assignments.append({
'gene_cluster_id': row['gene_cluster_id'],
'gtdb_species_clade_id': row['gtdb_species_clade_id'],
'gapmind_pathway': gm_pathway,
'kegg_pathway_id': kegg_id,
'EC': row['EC'],
'is_core': row['is_core'],
'is_auxiliary': row['is_auxiliary'],
'is_singleton': row['is_singleton'],
'Description': row['Description']
})
pathway_genes = pd.DataFrame(gapmind_assignments)
print(f"Gene cluster → GapMind pathway assignments: {len(pathway_genes):,}")
print(f"Unique gene clusters assigned: {pathway_genes['gene_cluster_id'].nunique():,}")
print(f"GapMind pathways with gene assignments: {pathway_genes['gapmind_pathway'].nunique()}")
print(f"\nAssignments per pathway:")
print(pathway_genes['gapmind_pathway'].value_counts().head(20))
# Connect pathway gene clusters to FB genes via the link table
if 'locusId' in fb_link.columns:
# Full link table with locusId (from DIAMOND)
pathway_fb = pathway_genes.merge(
fb_link[['orgId', 'locusId', 'gene_cluster_id']],
on='gene_cluster_id',
how='inner'
)
else:
# Scaffold-based link (no locusId — need to join differently)
pathway_fb = pathway_genes.merge(
fb_link[['orgId', 'gene_cluster_id']],
on='gene_cluster_id',
how='inner'
)
print(f"Pathway genes linked to FB organisms: {len(pathway_fb):,}")
print(f"Organisms: {pathway_fb['orgId'].nunique()}")
print(f"Pathways: {pathway_fb['gapmind_pathway'].nunique()}")
4. Merge with Fitness Data¶
For each pathway gene, attach its fitness statistics. Also flag essential genes.
if 'locusId' in pathway_fb.columns:
# Merge with fitness stats
pathway_fitness = pathway_fb.merge(
fitness_stats[['orgId', 'locusId', 'mean_fit', 'mean_abs_fit',
'min_fit', 'max_fit', 'n_sig_important', 'n_sig_sick',
'n_sig_benefit', 'n_experiments']],
on=['orgId', 'locusId'],
how='left'
)
# Flag essential genes (in essential_genes but not in fitness_stats)
ess_set = set(zip(essential_genes['orgId'], essential_genes['locusId']))
pathway_fitness['is_essential'] = pathway_fitness.apply(
lambda r: (r['orgId'], r.get('locusId', '')) in ess_set, axis=1
)
# Classify gene fitness importance
pathway_fitness['fitness_category'] = 'neutral'
pathway_fitness.loc[pathway_fitness['is_essential'], 'fitness_category'] = 'essential'
pathway_fitness.loc[
(~pathway_fitness['is_essential']) &
(pathway_fitness['n_sig_important'].fillna(0) > 0),
'fitness_category'
] = 'important'
print(f"Pathway-gene-fitness table: {len(pathway_fitness):,} rows")
print(f"\nFitness categories:")
print(pathway_fitness['fitness_category'].value_counts())
print(f"\nWith fitness data: {pathway_fitness['mean_fit'].notna().sum():,}")
print(f"Essential: {pathway_fitness['is_essential'].sum():,}")
print(f"No data (unlinked): {pathway_fitness['mean_fit'].isna().sum() - pathway_fitness['is_essential'].sum():,}")
else:
print("WARNING: No locusId in link table — cannot merge fitness data directly")
print("Re-run with DIAMOND-based link table for gene-level analysis")
pathway_fitness = pathway_fb.copy()
pathway_fitness['fitness_category'] = 'unknown'
5. Add GapMind Pathway Completeness¶
Merge each gene's pathway assignment with the GapMind completeness score for that organism-pathway combination.
# Add GapMind completeness for each organism-pathway pair
gapmind_lookup = gapmind_fb[['orgId', 'pathway', 'best_score_rank', 'best_score_category', 'metabolic_category']].copy()
gapmind_lookup = gapmind_lookup.rename(columns={'pathway': 'gapmind_pathway'})
pathway_fitness = pathway_fitness.merge(
gapmind_lookup,
on=['orgId', 'gapmind_pathway'],
how='left'
)
# Classify pathway completeness
pathway_fitness['pathway_status'] = 'unknown'
pathway_fitness.loc[pathway_fitness['best_score_rank'] >= 5, 'pathway_status'] = 'complete'
pathway_fitness.loc[pathway_fitness['best_score_rank'] == 4, 'pathway_status'] = 'likely_complete'
pathway_fitness.loc[pathway_fitness['best_score_rank'] == 3, 'pathway_status'] = 'steps_missing_low'
pathway_fitness.loc[pathway_fitness['best_score_rank'] == 2, 'pathway_status'] = 'steps_missing_medium'
pathway_fitness.loc[pathway_fitness['best_score_rank'] <= 1, 'pathway_status'] = 'not_present'
print(f"\nPathway completeness for linked genes:")
print(pathway_fitness['pathway_status'].value_counts())
6. Quality Control¶
# Per-organism summary
org_qc = pathway_fitness.groupby('orgId').agg(
n_genes=('gene_cluster_id', 'nunique'),
n_pathways=('gapmind_pathway', 'nunique'),
n_essential=('is_essential', 'sum') if 'is_essential' in pathway_fitness.columns else ('orgId', 'size'),
n_important=('fitness_category', lambda x: (x == 'important').sum()),
n_neutral=('fitness_category', lambda x: (x == 'neutral').sum()),
n_complete_pathways=('pathway_status', lambda x: ((x == 'complete') | (x == 'likely_complete')).sum())
).reset_index()
print("Per-organism QC summary:")
print(org_qc.to_string(index=False))
# Pathway coverage check
pathway_coverage = pathway_fitness.groupby('gapmind_pathway').agg(
n_organisms=('orgId', 'nunique'),
n_genes=('gene_cluster_id', 'nunique'),
pct_important=('fitness_category', lambda x: (x.isin(['important', 'essential'])).mean() * 100)
).reset_index().sort_values('n_organisms', ascending=False)
print(f"\nPathway coverage (top 20):")
print(pathway_coverage.head(20).to_string(index=False))
7. Save Output¶
output_path = DATA_DIR / 'pathway_gene_fitness.csv'
pathway_fitness.to_csv(output_path, index=False)
print(f"Saved: {output_path} ({len(pathway_fitness):,} rows)")
print("\n" + "=" * 60)
print("NB02 SUMMARY")
print("=" * 60)
print(f"Pathway-gene-fitness links: {len(pathway_fitness):,}")
print(f"Organisms: {pathway_fitness['orgId'].nunique()}")
print(f"GapMind pathways mapped: {pathway_fitness['gapmind_pathway'].nunique()}")
print(f"Gene clusters: {pathway_fitness['gene_cluster_id'].nunique():,}")
if 'is_essential' in pathway_fitness.columns:
print(f"Essential genes in pathways: {pathway_fitness['is_essential'].sum():,}")
print(f"\nFitness categories: {dict(pathway_fitness['fitness_category'].value_counts())}")
print(f"Pathway status: {dict(pathway_fitness['pathway_status'].value_counts())}")
print("=" * 60)
print(f"\nNext: Run NB03 (pathway classification) — can run locally")