01 Data Extraction
Jupyter notebook from the Pangenome Openness, Metabolic Pathways, and Biogeography project.
Data Extraction: Pangenome, Pathways, and Biogeography¶
This notebook extracts data from BERDL for analyzing relationships between:
- Pangenome characteristics (open/closed)
- Metabolic pathway diversity (gapmind)
- Biogeographic distribution (AlphaEarth embeddings)
Run on BERDL JupyterHub for best performance with large tables.
In [1]:
# Initialize Spark session
import pandas as pd
import numpy as np
from pathlib import Path
spark = get_spark_session()
print(f"Spark version: {spark.version}")
Spark version: 4.0.1
1. Extract Pangenome Statistics¶
Get species-level pangenome metrics from the pangenome table.
In [2]:
# Query pangenome statistics with species names
pangenome_query = """
SELECT
p.gtdb_species_clade_id,
s.GTDB_species,
s.GTDB_taxonomy,
p.no_genomes,
p.no_core,
p.no_aux_genome,
p.no_singleton_gene_clusters,
p.no_gene_clusters,
s.mean_intra_species_ANI,
s.ANI_circumscription_radius
FROM kbase_ke_pangenome.pangenome p
JOIN kbase_ke_pangenome.gtdb_species_clade s
ON p.gtdb_species_clade_id = s.gtdb_species_clade_id
ORDER BY p.no_genomes DESC
"""
print("Querying pangenome statistics...")
pangenome_df = spark.sql(pangenome_query).toPandas()
print(f"Retrieved {len(pangenome_df)} species")
pangenome_df.head()
Querying pangenome statistics... Retrieved 27690 species
Out[2]:
| gtdb_species_clade_id | GTDB_species | GTDB_taxonomy | no_genomes | no_core | no_aux_genome | no_singleton_gene_clusters | no_gene_clusters | mean_intra_species_ANI | ANI_circumscription_radius | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s__Staphylococcus_aureus--RS_GCF_001027105.1 | s__Staphylococcus_aureus | d__Bacteria;p__Bacillota;c__Bacilli;o__Staphyl... | 14526 | 2083 | 145831 | 86127.0 | 147914 | 98.76 | 95.0000 |
| 1 | s__Klebsiella_pneumoniae--RS_GCF_000742135.1 | s__Klebsiella_pneumoniae | d__Bacteria;p__Pseudomonadota;c__Gammaproteoba... | 14240 | 4199 | 438925 | 276743.0 | 443124 | 98.97 | 95.2390 |
| 2 | s__Salmonella_enterica--RS_GCF_000006945.2 | s__Salmonella_enterica | d__Bacteria;p__Pseudomonadota;c__Gammaproteoba... | 11402 | 3639 | 262732 | 149757.0 | 266371 | 98.82 | 95.0604 |
| 3 | s__Streptococcus_pneumoniae--RS_GCF_001457635.1 | s__Streptococcus_pneumoniae | d__Bacteria;p__Bacillota;c__Bacilli;o__Lactoba... | 8434 | 1475 | 115370 | 67191.0 | 116845 | 98.62 | 95.0000 |
| 4 | s__Mycobacterium_tuberculosis--RS_GCF_000195955.2 | s__Mycobacterium_tuberculosis | d__Bacteria;p__Actinomycetota;c__Actinomycetia... | 6903 | 3741 | 139929 | 97549.0 | 143670 | 99.92 | 95.0000 |
In [3]:
# Calculate pangenome openness metrics
pangenome_df['accessory_core_ratio'] = pangenome_df['no_aux_genome'] / pangenome_df['no_core']
pangenome_df['singleton_ratio'] = pangenome_df['no_singleton_gene_clusters'] / pangenome_df['no_gene_clusters']
pangenome_df['pangenome_size'] = pangenome_df['no_gene_clusters']
pangenome_df['core_fraction'] = pangenome_df['no_core'] / pangenome_df['no_gene_clusters']
# Summary statistics
print("\nPangenome openness metrics:")
print(pangenome_df[['accessory_core_ratio', 'singleton_ratio', 'core_fraction']].describe())
Pangenome openness metrics:
accessory_core_ratio singleton_ratio core_fraction
count 27690.000000 27690.000000 27690.000000
mean 1.437290 0.353522 0.534242
std 2.937666 0.171431 0.223095
min 0.000000 0.000000 0.003307
25% 0.448855 0.236438 0.363399
50% 0.928612 0.362810 0.518508
75% 1.751796 0.477794 0.690200
max 301.357143 0.972192 1.000000
2. Extract Pathway Diversity¶
Aggregate gapmind pathway data to species level.
Note: The gapmind_pathways table has 305M rows, so we aggregate in Spark before converting to pandas.
In [4]:
# Aggregate pathway data to species level
# Count distinct pathways and calculate diversity metrics
pathway_query = """
SELECT
clade_name as gtdb_species_clade_id,
COUNT(DISTINCT pathway) as pathway_count,
COUNT(DISTINCT metabolic_category) as metabolic_category_count,
COUNT(DISTINCT genome_id) as genomes_with_pathways,
SUM(CASE WHEN score_category = 'present' THEN 1 ELSE 0 END) as present_count,
SUM(CASE WHEN score_category = 'partial' THEN 1 ELSE 0 END) as partial_count,
SUM(CASE WHEN score_category = 'not_present' THEN 1 ELSE 0 END) as not_present_count,
COUNT(*) as total_pathway_genome_pairs
FROM kbase_ke_pangenome.gapmind_pathways
GROUP BY clade_name
ORDER BY pathway_count DESC
"""
print("Aggregating pathway data to species level...")
print("(This may take 5-10 minutes for 305M rows)")
pathway_df = spark.sql(pathway_query).toPandas()
print(f"Retrieved pathway data for {len(pathway_df)} species")
pathway_df.head()
Aggregating pathway data to species level... (This may take 5-10 minutes for 305M rows) Retrieved pathway data for 27690 species
Out[4]:
| gtdb_species_clade_id | pathway_count | metabolic_category_count | genomes_with_pathways | present_count | partial_count | not_present_count | total_pathway_genome_pairs | |
|---|---|---|---|---|---|---|---|---|
| 0 | s__UBA2128_sp002402135--GB_GCA_002402135.1 | 80 | 2 | 5 | 0 | 0 | 2509 | 4164 |
| 1 | s__CFX10_sp013112715--GB_GCA_013112715.1 | 80 | 2 | 2 | 0 | 0 | 872 | 1735 |
| 2 | s__Aestuariivirga_sp016792825--GB_GCA_016792825.1 | 80 | 2 | 3 | 0 | 0 | 921 | 2429 |
| 3 | s__Rubrivivax_sp020248915--GB_GCA_020248915.1 | 80 | 2 | 3 | 0 | 0 | 1258 | 2429 |
| 4 | s__Heliomarina_baculiformis--RS_GCF_019966545.1 | 80 | 2 | 3 | 0 | 0 | 1158 | 2429 |
In [5]:
# Calculate pathway diversity metrics
pathway_df['present_ratio'] = pathway_df['present_count'] / pathway_df['total_pathway_genome_pairs']
pathway_df['partial_ratio'] = pathway_df['partial_count'] / pathway_df['total_pathway_genome_pairs']
# Summary
print("\nPathway diversity metrics:")
print(pathway_df[['pathway_count', 'metabolic_category_count', 'present_ratio']].describe())
Pathway diversity metrics:
pathway_count metabolic_category_count present_ratio
count 27690.0 27690.0 27690.0
mean 80.0 2.0 0.0
std 0.0 0.0 0.0
min 80.0 2.0 0.0
25% 80.0 2.0 0.0
50% 80.0 2.0 0.0
75% 80.0 2.0 0.0
max 80.0 2.0 0.0
3. Extract AlphaEarth Biogeography Data¶
Get genomes with AlphaEarth embeddings and calculate species-level geographic metrics.
In [6]:
# Get AlphaEarth embeddings with species linkage
# Join with genome table to get species information
alphaearth_query = """
SELECT
a.genome_id,
g.gtdb_species_clade_id,
a.cleaned_lat,
a.cleaned_lon,
a.cleaned_year,
a.A00, a.A01, a.A02, a.A03, a.A04, a.A05, a.A06, a.A07, a.A08, a.A09,
a.A10, a.A11, a.A12, a.A13, a.A14, a.A15, a.A16, a.A17, a.A18, a.A19,
a.A20, a.A21, a.A22, a.A23, a.A24, a.A25, a.A26, a.A27, a.A28, a.A29,
a.A30, a.A31, a.A32, a.A33, a.A34, a.A35, a.A36, a.A37, a.A38, a.A39,
a.A40, a.A41, a.A42, a.A43, a.A44, a.A45, a.A46, a.A47, a.A48, a.A49,
a.A50, a.A51, a.A52, a.A53, a.A54, a.A55, a.A56, a.A57, a.A58, a.A59,
a.A60, a.A61, a.A62, a.A63
FROM kbase_ke_pangenome.alphaearth_embeddings_all_years a
JOIN kbase_ke_pangenome.genome g
ON a.genome_id = g.genome_id
WHERE a.cleaned_lat IS NOT NULL
AND a.cleaned_lon IS NOT NULL
"""
print("Querying AlphaEarth embeddings...")
alphaearth_df = spark.sql(alphaearth_query).toPandas()
print(f"Retrieved {len(alphaearth_df)} genomes with embeddings")
alphaearth_df.head()
Querying AlphaEarth embeddings... Retrieved 83286 genomes with embeddings
Out[6]:
| genome_id | gtdb_species_clade_id | cleaned_lat | cleaned_lon | cleaned_year | A00 | A01 | A02 | A03 | A04 | ... | A54 | A55 | A56 | A57 | A58 | A59 | A60 | A61 | A62 | A63 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RS_GCF_024106275.1 | s__Staphylococcus_lugdunensis--RS_GCF_002901705.1 | 29.637200 | -82.352000 | 2019 | -0.038447 | -0.186082 | 0.192910 | 0.098424 | -0.093564 | ... | 0.228897 | -0.276140 | 0.007443 | 0.038447 | 0.228897 | -0.192910 | 0.038447 | -0.059116 | 0.048228 | -0.088827 |
| 1 | GB_GCA_022483435.1 | s__Pediococcus_pentosaceus--RS_GCF_001437285.1 | 43.073847 | -89.413972 | 2020 | 0.051734 | -0.284444 | 0.066990 | -0.108512 | 0.027128 | ... | 0.221453 | -0.022207 | 0.084214 | 0.015748 | 0.022207 | -0.113741 | 0.035433 | -0.113741 | -0.166336 | 0.113741 |
| 2 | GB_GCA_022483705.1 | s__Pediococcus_pentosaceus--RS_GCF_001437285.1 | 43.073847 | -89.413972 | 2019 | 0.055363 | -0.259900 | 0.075356 | -0.075356 | 0.024606 | ... | 0.192910 | -0.048228 | 0.093564 | -0.003014 | 0.000984 | -0.103406 | 0.017778 | -0.098424 | -0.206936 | 0.108512 |
| 3 | GB_GCA_022484085.1 | s__Pediococcus_pentosaceus--RS_GCF_001437285.1 | 43.073847 | -89.413972 | 2020 | 0.051734 | -0.284444 | 0.066990 | -0.108512 | 0.027128 | ... | 0.221453 | -0.022207 | 0.084214 | 0.015748 | 0.022207 | -0.113741 | 0.035433 | -0.113741 | -0.166336 | 0.113741 |
| 4 | RS_GCF_001411765.2 | s__Pediococcus_pentosaceus--RS_GCF_001437285.1 | 35.100000 | 126.860000 | 2015 | -0.119093 | -0.051734 | 0.179377 | -0.124567 | -0.113741 | ... | 0.079723 | -0.214133 | -0.093564 | 0.062991 | -0.141730 | -0.066990 | -0.199862 | -0.318893 | -0.006151 | -0.071111 |
5 rows × 69 columns
In [7]:
# Calculate species-level biogeographic metrics
from scipy.spatial.distance import pdist, squareform
from geopy.distance import geodesic
def calculate_species_geography(species_df):
"""Calculate geographic range metrics for a species."""
if len(species_df) < 2:
return {
'n_genomes': len(species_df),
'mean_lat': species_df['cleaned_lat'].iloc[0],
'mean_lon': species_df['cleaned_lon'].iloc[0],
'lat_range': 0,
'lon_range': 0,
'max_geodesic_dist_km': 0,
'mean_geodesic_dist_km': 0,
'max_embedding_dist': 0,
'mean_embedding_dist': 0
}
# Geographic coordinates
coords = species_df[['cleaned_lat', 'cleaned_lon']].values
# Calculate geodesic distances (km)
geodesic_dists = []
for i in range(len(coords)):
for j in range(i+1, len(coords)):
dist = geodesic(coords[i], coords[j]).kilometers
geodesic_dists.append(dist)
# Extract embedding vectors
embedding_cols = [f'A{i:02d}' for i in range(64)]
embeddings = species_df[embedding_cols].values
# Calculate embedding distances (Euclidean)
embedding_dists = pdist(embeddings, metric='euclidean')
return {
'n_genomes': len(species_df),
'mean_lat': species_df['cleaned_lat'].mean(),
'mean_lon': species_df['cleaned_lon'].mean(),
'lat_range': species_df['cleaned_lat'].max() - species_df['cleaned_lat'].min(),
'lon_range': species_df['cleaned_lon'].max() - species_df['cleaned_lon'].min(),
'max_geodesic_dist_km': max(geodesic_dists) if geodesic_dists else 0,
'mean_geodesic_dist_km': np.mean(geodesic_dists) if geodesic_dists else 0,
'max_embedding_dist': embedding_dists.max(),
'mean_embedding_dist': embedding_dists.mean()
}
print("Calculating species-level geographic metrics...")
biogeography_metrics = []
for species_id, group in alphaearth_df.groupby('gtdb_species_clade_id'):
metrics = calculate_species_geography(group)
metrics['gtdb_species_clade_id'] = species_id
biogeography_metrics.append(metrics)
biogeography_df = pd.DataFrame(biogeography_metrics)
print(f"Calculated biogeography metrics for {len(biogeography_df)} species")
biogeography_df.head()
Calculating species-level geographic metrics... Calculated biogeography metrics for 15046 species
Out[7]:
| n_genomes | mean_lat | mean_lon | lat_range | lon_range | max_geodesic_dist_km | mean_geodesic_dist_km | max_embedding_dist | mean_embedding_dist | gtdb_species_clade_id | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 61.142067 | 21.291533 | 0.0031 | 0.0023 | 0.366983 | 0.244655 | 0.099606 | 0.066404 | s__0-14-0-80-60-11_sp018897875--GB_GCA_0188978... |
| 1 | 7 | 38.614043 | -110.704886 | 0.3783 | 0.6658 | 71.508967 | 20.431134 | 0.654792 | 0.187083 | s__0-14-3-00-41-53_sp002780895--GB_GCA_0027808... |
| 2 | 2 | 39.536900 | -107.782800 | 0.0000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | s__01-FULL-36-15b_sp001782035--GB_GCA_001782035.1 |
| 3 | 2 | 39.536900 | -107.782800 | 0.0000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | s__01-FULL-44-24b_sp001793235--GB_GCA_001793235.1 |
| 4 | 2 | 39.536900 | -107.782800 | 0.0000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | s__01-FULL-45-10b_sp001804205--GB_GCA_001804205.1 |
In [8]:
# Summary of biogeographic metrics
print("\nBiogeographic metrics summary:")
print(biogeography_df[['n_genomes', 'max_geodesic_dist_km', 'mean_embedding_dist']].describe())
# Filter for species with sufficient sampling (≥5 genomes with embeddings)
biogeography_filtered = biogeography_df[biogeography_df['n_genomes'] >= 5]
print(f"\nSpecies with ≥5 genomes with embeddings: {len(biogeography_filtered)}")
Biogeographic metrics summary:
n_genomes max_geodesic_dist_km mean_embedding_dist
count 15046.000000 15046.000000 13835.000000
mean 5.535425 2416.784184 0.311758
std 71.882888 4973.032794 0.451525
min 1.000000 0.000000 0.000000
25% 1.000000 0.000000 0.000000
50% 2.000000 0.000000 0.000000
75% 3.000000 790.065351 0.684116
max 4948.000000 19982.872140 1.560929
Species with ≥5 genomes with embeddings: 2159
4. Integrate Datasets¶
Merge pangenome, pathway, and biogeography data into a single analysis dataset.
In [9]:
# Merge all datasets
print("Merging datasets...")
# Start with pangenome data (all species)
integrated_df = pangenome_df.copy()
# Add pathway data (left join - not all species may have pathway data)
integrated_df = integrated_df.merge(
pathway_df,
on='gtdb_species_clade_id',
how='left'
)
# Add biogeography data (left join - only 28% coverage)
integrated_df = integrated_df.merge(
biogeography_filtered,
on='gtdb_species_clade_id',
how='left',
suffixes=('', '_bio')
)
print(f"\nIntegrated dataset: {len(integrated_df)} species")
print(f"Species with pathway data: {integrated_df['pathway_count'].notna().sum()}")
print(f"Species with biogeography data: {integrated_df['n_genomes'].notna().sum()}")
print(f"Species with all three data types: {integrated_df[['pathway_count', 'n_genomes']].notna().all(axis=1).sum()}")
integrated_df.head()
Merging datasets... Integrated dataset: 27690 species Species with pathway data: 27690 Species with biogeography data: 2159 Species with all three data types: 2159
Out[9]:
| gtdb_species_clade_id | GTDB_species | GTDB_taxonomy | no_genomes | no_core | no_aux_genome | no_singleton_gene_clusters | no_gene_clusters | mean_intra_species_ANI | ANI_circumscription_radius | ... | partial_ratio | n_genomes | mean_lat | mean_lon | lat_range | lon_range | max_geodesic_dist_km | mean_geodesic_dist_km | max_embedding_dist | mean_embedding_dist | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s__Staphylococcus_aureus--RS_GCF_001027105.1 | s__Staphylococcus_aureus | d__Bacteria;p__Bacillota;c__Bacilli;o__Staphyl... | 14526 | 2083 | 145831 | 86127.0 | 147914 | 98.76 | 95.0000 | ... | 0.0 | 3822.0 | 24.804048 | 21.737536 | 100.746850 | 311.1964 | 19950.223137 | 8517.202820 | NaN | NaN |
| 1 | s__Klebsiella_pneumoniae--RS_GCF_000742135.1 | s__Klebsiella_pneumoniae | d__Bacteria;p__Pseudomonadota;c__Gammaproteoba... | 14240 | 4199 | 438925 | 276743.0 | 443124 | 98.97 | 95.2390 | ... | 0.0 | 4948.0 | 24.317957 | 44.043047 | 103.479176 | 310.9133 | 19941.671766 | 7954.427146 | NaN | NaN |
| 2 | s__Salmonella_enterica--RS_GCF_000006945.2 | s__Salmonella_enterica | d__Bacteria;p__Pseudomonadota;c__Gammaproteoba... | 11402 | 3639 | 262732 | 149757.0 | 266371 | 98.82 | 95.0604 | ... | 0.0 | 3208.0 | 24.504208 | 27.242457 | 124.952380 | 276.1451 | 19953.606023 | 8222.034417 | NaN | NaN |
| 3 | s__Streptococcus_pneumoniae--RS_GCF_001457635.1 | s__Streptococcus_pneumoniae | d__Bacteria;p__Bacillota;c__Bacilli;o__Lactoba... | 8434 | 1475 | 115370 | 67191.0 | 116845 | 98.62 | 95.0000 | ... | 0.0 | 532.0 | 27.668334 | 18.353585 | 94.674421 | 235.8000 | 18784.952385 | 7397.429695 | 1.591833 | 1.081545 |
| 4 | s__Mycobacterium_tuberculosis--RS_GCF_000195955.2 | s__Mycobacterium_tuberculosis | d__Bacteria;p__Actinomycetota;c__Actinomycetia... | 6903 | 3741 | 139929 | 97549.0 | 143670 | 99.92 | 95.0000 | ... | 0.0 | 2556.0 | -6.971883 | -37.978282 | 101.275879 | 302.9693 | 19910.410057 | 6327.529730 | NaN | NaN |
5 rows × 32 columns
5. Save Extracted Data¶
Save all datasets to CSV for downstream analysis.
In [10]:
# Create data directory if it doesn't exist
data_dir = Path('../data')
data_dir.mkdir(exist_ok=True)
# Save individual datasets
print("Saving datasets...")
pangenome_df.to_csv(data_dir / 'pangenome_metrics.csv', index=False)
print(f"Saved: pangenome_metrics.csv ({len(pangenome_df)} species)")
pathway_df.to_csv(data_dir / 'pathway_diversity.csv', index=False)
print(f"Saved: pathway_diversity.csv ({len(pathway_df)} species)")
biogeography_df.to_csv(data_dir / 'biogeography_metrics.csv', index=False)
print(f"Saved: biogeography_metrics.csv ({len(biogeography_df)} species)")
# Save integrated dataset
integrated_df.to_csv(data_dir / 'integrated_dataset.csv', index=False)
print(f"Saved: integrated_dataset.csv ({len(integrated_df)} species)")
# Save genome-level AlphaEarth data (for detailed analysis)
alphaearth_df.to_csv(data_dir / 'alphaearth_genome_embeddings.csv', index=False)
print(f"Saved: alphaearth_genome_embeddings.csv ({len(alphaearth_df)} genomes)")
print("\nAll data extracted successfully!")
Saving datasets... Saved: pangenome_metrics.csv (27690 species) Saved: pathway_diversity.csv (27690 species) Saved: biogeography_metrics.csv (15046 species) Saved: integrated_dataset.csv (27690 species) Saved: alphaearth_genome_embeddings.csv (83286 genomes) All data extracted successfully!
Summary Statistics¶
Quick overview of the extracted data.
In [11]:
print("=" * 60)
print("DATA EXTRACTION SUMMARY")
print("=" * 60)
print(f"\nPangenome Statistics:")
print(f" Total species: {len(pangenome_df)}")
print(f" Total genomes: {pangenome_df['no_genomes'].sum():,}")
print(f" Median genomes/species: {pangenome_df['no_genomes'].median():.0f}")
print(f" Median accessory/core ratio: {pangenome_df['accessory_core_ratio'].median():.2f}")
print(f"\nPathway Diversity:")
print(f" Species with pathway data: {len(pathway_df)}")
print(f" Median pathways/species: {pathway_df['pathway_count'].median():.0f}")
print(f" Median present ratio: {pathway_df['present_ratio'].median():.2f}")
print(f"\nBiogeography (AlphaEarth):")
print(f" Species with embeddings (≥5 genomes): {len(biogeography_filtered)}")
print(f" Coverage: {len(biogeography_filtered) / len(pangenome_df) * 100:.1f}% of species")
print(f" Median geographic range: {biogeography_filtered['max_geodesic_dist_km'].median():.0f} km")
print(f"\nIntegrated Dataset:")
complete = integrated_df[['pathway_count', 'n_genomes']].notna().all(axis=1)
print(f" Species with all three data types: {complete.sum()}")
print(f" Ready for comparative analysis: {complete.sum() / len(integrated_df) * 100:.1f}%")
print("\n" + "=" * 60)
============================================================ DATA EXTRACTION SUMMARY ============================================================ Pangenome Statistics: Total species: 27690 Total genomes: 293,059 Median genomes/species: 3 Median accessory/core ratio: 0.93 Pathway Diversity: Species with pathway data: 27690 Median pathways/species: 80 Median present ratio: 0.00 Biogeography (AlphaEarth): Species with embeddings (≥5 genomes): 2159 Coverage: 7.8% of species Median geographic range: 6626 km Integrated Dataset: Species with all three data types: 2159 Ready for comparative analysis: 7.8% ============================================================
In [ ]: