01 Extract Gapmind Pathways
Jupyter notebook from the Metabolic Capability vs Metabolic Dependency project.
NB01: Extract GapMind Pathway Data¶
Requires: BERDL JupyterHub with Spark access
Purpose: Extract GapMind pathway completeness predictions from the pangenome database and aggregate to genome level.
Output:
data/gapmind_genome_pathways.csv— Genome × pathway matrix with best scoresdata/gapmind_pathway_summary.csv— Pathway statistics (prevalence, completeness rate)
Runtime: ~5-10 minutes on Spark cluster
In [1]:
import os
from pathlib import Path
import pandas as pd
# Initialize Spark session (available on BERDL JupyterHub)
spark = get_spark_session()
# Set up paths
PROJECT_ROOT = Path('/home/cjneely/repos/BERIL-research-observatory/projects/metabolic_capability_dependency')
DATA_DIR = PROJECT_ROOT / 'data'
DATA_DIR.mkdir(exist_ok=True, parents=True)
print(f"Spark session initialized: {spark}")
print(f"Output directory: {DATA_DIR}")
Spark session initialized: <pyspark.sql.connect.session.SparkSession object at 0x7ca804b892b0> Output directory: /home/cjneely/repos/BERIL-research-observatory/projects/metabolic_capability_dependency/data
1. Inspect GapMind Schema¶
First, let's verify the schema and check available score categories.
In [2]:
# Check schema
spark.sql("DESCRIBE kbase_ke_pangenome.gapmind_pathways").show(50, truncate=False)
# Check available score categories
score_cats = spark.sql("""
SELECT score_category, COUNT(*) as n
FROM kbase_ke_pangenome.gapmind_pathways
GROUP BY score_category
ORDER BY n DESC
""").toPandas()
print("\nScore categories:")
print(score_cats)
+------------------+---------+-------+
|col_name |data_type|comment|
+------------------+---------+-------+
|genome_id |string |NULL |
|pathway |string |NULL |
|clade_name |string |NULL |
|metabolic_category|string |NULL |
|sequence_scope |string |NULL |
|nHi |int |NULL |
|nMed |int |NULL |
|nLo |int |NULL |
|score |double |NULL |
|score_category |string |NULL |
|score_simplified |double |NULL |
+------------------+---------+-------+
Score categories:
score_category n
0 not_present 161589717
1 complete 63786471
2 steps_missing_low 57194541
3 likely_complete 12396275
4 steps_missing_medium 10504276
In [3]:
# Check pathway names and counts
pathway_counts = spark.sql("""
SELECT pathway, COUNT(DISTINCT genome_id) as n_genomes
FROM kbase_ke_pangenome.gapmind_pathways
GROUP BY pathway
ORDER BY pathway
""").toPandas()
print(f"\nTotal pathways: {len(pathway_counts)}")
print("\nSample pathways:")
print(pathway_counts.head(20))
Total pathways: 80
Sample pathways:
pathway n_genomes
0 2-oxoglutarate 292806
1 4-hydroxybenzoate 292806
2 D-alanine 292806
3 D-lactate 292806
4 D-serine 292806
5 L-lactate 292806
6 L-malate 292806
7 NAG 292806
8 acetate 292806
9 alanine 292806
10 arabinose 292806
11 arg 292806
12 arginine 292806
13 asn 292806
14 asparagine 292806
15 aspartate 292806
16 cellobiose 292806
17 chorismate 292806
18 citrate 292806
19 citrulline 292806
2. Extract Best Pathway Scores per Genome¶
GapMind has multiple rows per genome-pathway pair (one per step/component). We take the BEST score for each pair.
Score mapping:
complete→ 5 (pathway fully present)likely_complete→ 4 (high confidence)steps_missing_low→ 3 (minor gaps)steps_missing_medium→ 2 (significant gaps)not_present→ 1 (pathway absent)
In [4]:
# Extract best scores per genome-pathway pair.
# Query one pathway at a time to stay well under spark.driver.maxResultSize (1 GiB).
# Each per-pathway result is ~290K rows (~30 MB), making 80 total batches.
# This avoids both the driver result size limit (hit by full toPandas()) and
# the silent empty-write issue seen with write.parquet() in Spark Connect.
QUERY_TEMPLATE = """
WITH pathway_scores AS (
SELECT
genome_id,
clade_name AS species,
pathway,
CASE score_category
WHEN 'complete' THEN 5
WHEN 'likely_complete' THEN 4
WHEN 'steps_missing_low' THEN 3
WHEN 'steps_missing_medium' THEN 2
WHEN 'not_present' THEN 1
ELSE 0
END as score_value
FROM kbase_ke_pangenome.gapmind_pathways
WHERE clade_name IS NOT NULL
AND pathway = '{pathway}'
)
SELECT
genome_id,
species,
pathway,
MAX(score_value) as best_score,
CASE WHEN MAX(score_value) >= 5 THEN 1 ELSE 0 END as is_complete
FROM pathway_scores
GROUP BY genome_id, species, pathway
"""
# Fetch distinct pathway list (small query, safe for toPandas)
pathways = spark.sql("""
SELECT DISTINCT pathway
FROM kbase_ke_pangenome.gapmind_pathways
ORDER BY pathway
""").toPandas()['pathway'].tolist()
print(f"Extracting {len(pathways)} pathways one at a time...")
batches = []
for i, pathway in enumerate(pathways):
batch = spark.sql(QUERY_TEMPLATE.format(pathway=pathway)).toPandas()
batches.append(batch)
if (i + 1) % 10 == 0 or (i + 1) == len(pathways):
print(f" {i + 1}/{len(pathways)} pathways complete")
gapmind_pd = pd.concat(batches, ignore_index=True)
output_path = DATA_DIR / 'gapmind_genome_pathways.csv'
gapmind_pd.to_csv(output_path, index=False)
print(f"\nExtracted {len(gapmind_pd):,} genome-pathway records")
print(f"Saved to: {output_path}")
print(f"\nUnique genomes: {gapmind_pd['genome_id'].nunique():,}")
print(f"Unique species: {gapmind_pd['species'].nunique():,}")
print(f"Unique pathways: {gapmind_pd['pathway'].nunique()}")
Extracting 80 pathways one at a time... 10/80 pathways complete 20/80 pathways complete 30/80 pathways complete 40/80 pathways complete 50/80 pathways complete 60/80 pathways complete 70/80 pathways complete 80/80 pathways complete Extracted 23,424,480 genome-pathway records Saved to: /home/cjneely/repos/BERIL-research-observatory/projects/metabolic_capability_dependency/data/gapmind_genome_pathways.csv Unique genomes: 292,806 Unique species: 27,690 Unique pathways: 80
In [5]:
# Preview the data
print("Sample records:")
print(gapmind_pd.head(20))
print("\nScore distribution:")
print(gapmind_pd['best_score'].value_counts().sort_index())
print("\nComplete pathway rate:")
print(f"{gapmind_pd['is_complete'].mean():.1%} of genome-pathway pairs are complete")
Sample records:
genome_id species \
0 GCF_016017725.1 s__Salmonella_enterica--RS_GCF_000006945.2
1 GCF_001716135.1 s__Salmonella_enterica--RS_GCF_000006945.2
2 GCF_001244185.1 s__Salmonella_enterica--RS_GCF_000006945.2
3 GCF_001680585.1 s__Salmonella_enterica--RS_GCF_000006945.2
4 GCF_003689575.1 s__Salmonella_enterica--RS_GCF_000006945.2
5 GCF_005518525.1 s__Salmonella_enterica--RS_GCF_000006945.2
6 GCF_002047885.1 s__Salmonella_enterica--RS_GCF_000006945.2
7 GCF_004292485.1 s__Salmonella_enterica--RS_GCF_000006945.2
8 GCF_002052035.1 s__Salmonella_enterica--RS_GCF_000006945.2
9 GCF_005144555.1 s__Salmonella_enterica--RS_GCF_000006945.2
10 GCF_001478105.1 s__Salmonella_enterica--RS_GCF_000006945.2
11 GCA_004337745.2 s__Salmonella_enterica--RS_GCF_000006945.2
12 GCF_017571765.1 s__Salmonella_enterica--RS_GCF_000006945.2
13 GCF_002047475.1 s__Salmonella_enterica--RS_GCF_000006945.2
14 GCF_024220255.1 s__Campylobacter_D_coli--RS_GCF_000254135.1
15 GCF_001487615.1 s__Campylobacter_D_coli--RS_GCF_000254135.1
16 GCF_001563565.1 s__Campylobacter_D_jejuni--RS_GCF_001457695.1
17 GCF_001224345.1 s__Campylobacter_D_jejuni--RS_GCF_001457695.1
18 GCF_007845785.1 s__Campylobacter_D_jejuni--RS_GCF_001457695.1
19 GCF_003961085.1 s__Campylobacter_D_jejuni--RS_GCF_001457695.1
pathway best_score is_complete
0 2-oxoglutarate 5 1
1 2-oxoglutarate 1 0
2 2-oxoglutarate 5 1
3 2-oxoglutarate 5 1
4 2-oxoglutarate 5 1
5 2-oxoglutarate 5 1
6 2-oxoglutarate 5 1
7 2-oxoglutarate 5 1
8 2-oxoglutarate 5 1
9 2-oxoglutarate 1 0
10 2-oxoglutarate 5 1
11 2-oxoglutarate 5 1
12 2-oxoglutarate 5 1
13 2-oxoglutarate 5 1
14 2-oxoglutarate 5 1
15 2-oxoglutarate 5 1
16 2-oxoglutarate 5 1
17 2-oxoglutarate 5 1
18 2-oxoglutarate 5 1
19 2-oxoglutarate 5 1
Score distribution:
best_score
1 3035627
2 322321
3 5025213
4 1626710
5 13414609
Name: count, dtype: int64
Complete pathway rate:
57.3% of genome-pathway pairs are complete
3. Generate Pathway Summary Statistics¶
In [6]:
# Compute pathway-level statistics
pathway_summary = gapmind_pd.groupby('pathway').agg({
'genome_id': 'count', # total records
'is_complete': ['sum', 'mean'], # complete count and rate
'best_score': ['mean', 'median', 'std']
}).reset_index()
pathway_summary.columns = [
'pathway',
'n_genomes',
'n_complete',
'pct_complete',
'mean_score',
'median_score',
'std_score'
]
pathway_summary['pct_complete'] *= 100 # Convert to percentage
pathway_summary = pathway_summary.sort_values('n_complete', ascending=False)
# Save summary
summary_path = DATA_DIR / 'gapmind_pathway_summary.csv'
pathway_summary.to_csv(summary_path, index=False)
print(f"Saved pathway summary to: {summary_path}")
print("\nTop 20 most complete pathways:")
print(pathway_summary.head(20))
Saved pathway summary to: /home/cjneely/repos/BERIL-research-observatory/projects/metabolic_capability_dependency/data/gapmind_pathway_summary.csv
Top 20 most complete pathways:
pathway n_genomes n_complete pct_complete mean_score \
37 gly 292806 283922 96.965909 4.900313
30 gln 292806 279305 95.389097 4.899387
17 chorismate 292806 274857 93.870003 4.839611
13 asn 292806 274354 93.698217 4.884039
51 met 292806 271207 92.623444 4.812876
69 threonine 292806 269468 92.029535 4.866150
68 thr 292806 262782 89.746112 4.732628
46 lys 292806 262782 89.746112 4.757259
20 cys 292806 248328 84.809738 4.682653
64 serine 292806 243218 83.064555 4.470482
23 deoxyribose 292806 232389 79.366202 4.614885
24 ethanol 292806 231465 79.050634 4.573048
41 ile 292806 227702 77.765483 4.501646
21 deoxyinosine 292806 216329 73.881341 4.571047
73 tryptophan 292806 216145 73.818501 4.487019
5 L-lactate 292806 214043 73.100620 4.426928
70 thymidine 292806 212730 72.652200 4.532404
57 proline 292806 212624 72.615998 4.361707
1 4-hydroxybenzoate 292806 212136 72.449335 4.469994
67 sucrose 292806 210620 71.931586 4.463266
median_score std_score
37 5.0 0.589304
30 5.0 0.501911
17 5.0 0.697537
13 5.0 0.481394
51 5.0 0.735292
69 5.0 0.500590
68 5.0 0.869882
46 5.0 0.788947
20 5.0 0.877506
64 5.0 1.245964
23 5.0 0.846964
24 5.0 0.931603
41 5.0 1.034995
21 5.0 0.843291
73 5.0 0.984676
5 5.0 1.044931
70 5.0 0.889139
57 5.0 1.202968
1 5.0 0.987630
67 5.0 0.950927
4. Species-Level Summary¶
In [7]:
# Compute per-species pathway completeness
species_summary = gapmind_pd.groupby('species').agg({
'genome_id': 'nunique',
'pathway': 'nunique',
'is_complete': 'sum'
}).reset_index()
species_summary.columns = ['species', 'n_genomes', 'n_pathways', 'n_complete_pathways']
species_summary = species_summary.sort_values('n_genomes', ascending=False)
# Save
species_path = DATA_DIR / 'gapmind_species_summary.csv'
species_summary.to_csv(species_path, index=False)
print(f"Saved species summary to: {species_path}")
print("\nTop 20 species by genome count:")
print(species_summary.head(20))
Saved species summary to: /home/cjneely/repos/BERIL-research-observatory/projects/metabolic_capability_dependency/data/gapmind_species_summary.csv
Top 20 species by genome count:
species n_genomes \
21956 s__Staphylococcus_aureus--RS_GCF_001027105.1 14505
11881 s__Klebsiella_pneumoniae--RS_GCF_000742135.1 14240
21202 s__Salmonella_enterica--RS_GCF_000006945.2 11396
22248 s__Streptococcus_pneumoniae--RS_GCF_001457635.1 8434
14804 s__Mycobacterium_tuberculosis--RS_GCF_000195955.2 6903
18681 s__Pseudomonas_aeruginosa--RS_GCF_001457615.1 6760
763 s__Acinetobacter_baumannii--RS_GCF_009759685.1 6644
5532 s__Clostridioides_difficile--RS_GCF_001077535.1 2603
7453 s__Enterococcus_B_faecium--RS_GCF_001544255.1 2533
7406 s__Enterobacter_hormaechei_A--RS_GCF_001729745.1 2453
5023 s__Campylobacter_D_jejuni--RS_GCF_001457695.1 2312
7488 s__Enterococcus_faecalis--RS_GCF_000392875.1 2249
12759 s__Listeria_monocytogenes--RS_GCF_900187225.1 2187
15025 s__Neisseria_meningitidis--RS_GCF_900638555.1 2121
22259 s__Streptococcus_pyogenes--RS_GCF_002055535.1 2079
12760 s__Listeria_monocytogenes_B--RS_GCF_000307025.1 1917
14623 s__Mycobacterium_abscessus--RS_GCF_000069185.1 1825
27099 s__Vibrio_parahaemolyticus--RS_GCF_900460535.1 1784
3421 s__Burkholderia_mallei--RS_GCF_000011705.1 1741
22319 s__Streptococcus_suis--RS_GCF_000294495.1 1618
n_pathways n_complete_pathways
21956 80 751639
11881 80 1123574
21202 80 697242
22248 80 361881
14804 80 337821
18681 80 414934
763 80 380370
5532 80 88304
7453 80 119906
7406 80 188780
5023 80 64523
7488 80 99967
12759 80 98387
15025 80 101535
22259 80 74724
12760 80 86222
14623 80 105569
27099 80 104262
3421 80 127868
22319 80 66362
Completion¶
Outputs generated:
data/gapmind_genome_pathways.csv— Full genome × pathway matrixdata/gapmind_pathway_summary.csv— Pathway-level statisticsdata/gapmind_species_summary.csv— Species-level statistics
Next step: Run NB02 locally to map pathways to Fitness Browser genes.
In [8]:
# Final verification
print("Files created:")
for f in DATA_DIR.glob('*.csv'):
size_mb = f.stat().st_size / 1024 / 1024
print(f" {f.name}: {size_mb:.1f} MB")
Files created: gapmind_genome_pathways.csv: 1670.0 MB gapmind_pathway_summary.csv: 0.0 MB gapmind_species_summary.csv: 1.4 MB