01 Data Extraction Revised
Jupyter notebook from the Pangenome Openness, Metabolic Pathways, and Biogeography project.
Data Extraction (REVISED): Pangenome, Pathways, and AlphaEarth Embeddings¶
This notebook extracts data from BERDL for analyzing relationships between:
- Pangenome openness (core vs. accessory genes)
- Metabolic pathway completeness (GapMind predictions)
- Ecological niche breadth (AlphaEarth embedding diversity)
Key Corrections:¶
- Pathways: GapMind has MULTIPLE rows per genome-pathway pair (different steps). We take the BEST score.
- Score categories:
complete,likely_complete,steps_missing_low/medium,not_present - AlphaEarth: Focus on embedding DISTANCE as niche breadth, not just geography
Run on BERDL JupyterHub for best performance.
In [1]:
# Initialize Spark session
import pandas as pd
import numpy as np
from pathlib import Path
spark = get_spark_session()
print(f"Spark version: {spark.version}")
Spark version: 4.0.1
1. Extract Pangenome Statistics¶
In [2]:
# Query pangenome statistics with species names
pangenome_query = """
SELECT
p.gtdb_species_clade_id,
s.GTDB_species,
s.GTDB_taxonomy,
p.no_genomes,
p.no_core,
p.no_aux_genome,
p.no_singleton_gene_clusters,
p.no_gene_clusters,
s.mean_intra_species_ANI,
s.ANI_circumscription_radius
FROM kbase_ke_pangenome.pangenome p
JOIN kbase_ke_pangenome.gtdb_species_clade s
ON p.gtdb_species_clade_id = s.gtdb_species_clade_id
ORDER BY p.no_genomes DESC
"""
print("Querying pangenome statistics...")
pangenome_df = spark.sql(pangenome_query).toPandas()
print(f"Retrieved {len(pangenome_df)} species")
# Calculate pangenome openness metrics
pangenome_df['accessory_core_ratio'] = pangenome_df['no_aux_genome'] / pangenome_df['no_core']
pangenome_df['singleton_ratio'] = pangenome_df['no_singleton_gene_clusters'] / pangenome_df['no_gene_clusters']
pangenome_df['core_fraction'] = pangenome_df['no_core'] / pangenome_df['no_gene_clusters']
pangenome_df['pangenome_size'] = pangenome_df['no_gene_clusters']
print("\nPangenome openness metrics:")
print(pangenome_df[['accessory_core_ratio', 'singleton_ratio', 'core_fraction']].describe())
pangenome_df.head()
Querying pangenome statistics...
Retrieved 27690 species
Pangenome openness metrics:
accessory_core_ratio singleton_ratio core_fraction
count 27690.000000 27690.000000 27690.000000
mean 1.437290 0.353522 0.534242
std 2.937666 0.171431 0.223095
min 0.000000 0.000000 0.003307
25% 0.448855 0.236438 0.363399
50% 0.928612 0.362810 0.518508
75% 1.751796 0.477794 0.690200
max 301.357143 0.972192 1.000000
Out[2]:
| gtdb_species_clade_id | GTDB_species | GTDB_taxonomy | no_genomes | no_core | no_aux_genome | no_singleton_gene_clusters | no_gene_clusters | mean_intra_species_ANI | ANI_circumscription_radius | accessory_core_ratio | singleton_ratio | core_fraction | pangenome_size | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s__Staphylococcus_aureus--RS_GCF_001027105.1 | s__Staphylococcus_aureus | d__Bacteria;p__Bacillota;c__Bacilli;o__Staphyl... | 14526 | 2083 | 145831 | 86127.0 | 147914 | 98.76 | 95.0000 | 70.010082 | 0.582278 | 0.014083 | 147914 |
| 1 | s__Klebsiella_pneumoniae--RS_GCF_000742135.1 | s__Klebsiella_pneumoniae | d__Bacteria;p__Pseudomonadota;c__Gammaproteoba... | 14240 | 4199 | 438925 | 276743.0 | 443124 | 98.97 | 95.2390 | 104.530841 | 0.624527 | 0.009476 | 443124 |
| 2 | s__Salmonella_enterica--RS_GCF_000006945.2 | s__Salmonella_enterica | d__Bacteria;p__Pseudomonadota;c__Gammaproteoba... | 11402 | 3639 | 262732 | 149757.0 | 266371 | 98.82 | 95.0604 | 72.198956 | 0.562212 | 0.013661 | 266371 |
| 3 | s__Streptococcus_pneumoniae--RS_GCF_001457635.1 | s__Streptococcus_pneumoniae | d__Bacteria;p__Bacillota;c__Bacilli;o__Lactoba... | 8434 | 1475 | 115370 | 67191.0 | 116845 | 98.62 | 95.0000 | 78.216949 | 0.575044 | 0.012624 | 116845 |
| 4 | s__Mycobacterium_tuberculosis--RS_GCF_000195955.2 | s__Mycobacterium_tuberculosis | d__Bacteria;p__Actinomycetota;c__Actinomycetia... | 6903 | 3741 | 139929 | 97549.0 | 143670 | 99.92 | 95.0000 | 37.404170 | 0.678980 | 0.026039 | 143670 |
2. Extract Pathway Completeness (CORRECTED)¶
Key insights:
- GapMind has exactly 80 pathways total
- Each genome-pathway pair has MULTIPLE rows (different steps/components)
- We need to take the BEST score for each genome-pathway pair
- Score hierarchy:
complete>likely_complete>steps_missing_low>steps_missing_medium>not_present
In [3]:
# Aggregate pathway data: take BEST score for each genome-pathway pair
# Then aggregate to species level
pathway_query = """
WITH pathway_scores AS (
-- Assign numeric scores for ranking
SELECT
clade_name,
genome_id,
pathway,
score_category,
CASE score_category
WHEN 'complete' THEN 5
WHEN 'likely_complete' THEN 4
WHEN 'steps_missing_low' THEN 3
WHEN 'steps_missing_medium' THEN 2
WHEN 'not_present' THEN 1
ELSE 0
END as score_value
FROM kbase_ke_pangenome.gapmind_pathways
),
best_scores AS (
-- Get best score for each genome-pathway pair (multiple rows per pair)
SELECT
clade_name,
genome_id,
pathway,
MAX(score_value) as best_score
FROM pathway_scores
GROUP BY clade_name, genome_id, pathway
),
genome_pathway_stats AS (
-- Calculate per-genome pathway statistics
SELECT
clade_name,
genome_id,
COUNT(*) as total_pathways,
SUM(CASE WHEN best_score >= 5 THEN 1 ELSE 0 END) as complete_pathways,
SUM(CASE WHEN best_score >= 4 THEN 1 ELSE 0 END) as likely_complete_or_better,
SUM(CASE WHEN best_score >= 3 THEN 1 ELSE 0 END) as partial_or_better,
SUM(CASE WHEN best_score = 1 THEN 1 ELSE 0 END) as absent_pathways
FROM best_scores
GROUP BY clade_name, genome_id
)
-- Aggregate to species level
SELECT
clade_name as gtdb_species_clade_id,
COUNT(DISTINCT genome_id) as genomes_with_pathways,
AVG(complete_pathways) as mean_complete_pathways,
STDDEV(complete_pathways) as std_complete_pathways,
AVG(likely_complete_or_better) as mean_likely_complete,
STDDEV(likely_complete_or_better) as std_likely_complete,
AVG(partial_or_better) as mean_partial,
STDDEV(partial_or_better) as std_partial,
AVG(absent_pathways) as mean_absent,
MIN(complete_pathways) as min_complete,
MAX(complete_pathways) as max_complete
FROM genome_pathway_stats
GROUP BY clade_name
ORDER BY mean_complete_pathways DESC
"""
print("Aggregating pathway data to species level...")
print("(This may take 10-15 minutes for 305M rows with aggregation)")
pathway_df = spark.sql(pathway_query).toPandas()
print(f"Retrieved pathway data for {len(pathway_df)} species")
print("\nPathway completeness metrics:")
print(pathway_df[['mean_complete_pathways', 'mean_likely_complete', 'std_complete_pathways']].describe())
pathway_df.head(10)
Aggregating pathway data to species level...
(This may take 10-15 minutes for 305M rows with aggregation)
Retrieved pathway data for 27690 species
Pathway completeness metrics:
mean_complete_pathways mean_likely_complete std_complete_pathways
count 27690.000000 27690.000000 27673.000000
mean 32.352991 38.861716 2.307621
std 18.580366 18.867506 2.754440
min 0.000000 0.000000 0.000000
25% 19.000000 26.000000 0.000000
50% 28.666667 37.000000 1.414214
75% 45.000000 52.000000 3.529243
max 79.750000 80.000000 26.870058
Out[3]:
| gtdb_species_clade_id | genomes_with_pathways | mean_complete_pathways | std_complete_pathways | mean_likely_complete | std_likely_complete | mean_partial | std_partial | mean_absent | min_complete | max_complete | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s__Klebsiella_spallanzanii--RS_GCF_902158555.1 | 4 | 79.75000 | 0.500000 | 80.000000 | 0.000000 | 80.000000 | 0.000 | 0.0 | 79 | 80 |
| 1 | s__Klebsiella_terrigena--RS_GCF_006539725.1 | 64 | 79.28125 | 0.844567 | 79.328125 | 0.643766 | 79.984375 | 0.125 | 0.0 | 74 | 80 |
| 2 | s__Pseudomonas_E_sp000242655--RS_GCF_012935715.1 | 3 | 79.00000 | 0.000000 | 79.000000 | 0.000000 | 79.000000 | 0.000 | 0.0 | 79 | 79 |
| 3 | s__Klebsiella_planticola--RS_GCF_000735435.1 | 50 | 79.00000 | 0.000000 | 79.120000 | 0.328261 | 80.000000 | 0.000 | 0.0 | 79 | 79 |
| 4 | s__Enterobacter_B_lignolyticus--RS_GCF_0001648... | 2 | 79.00000 | 0.000000 | 79.000000 | 0.000000 | 80.000000 | 0.000 | 0.0 | 79 | 79 |
| 5 | s__Pseudomonas_E_sp013386825--RS_GCF_013386825.1 | 2 | 79.00000 | 0.000000 | 79.000000 | 0.000000 | 79.000000 | 0.000 | 0.0 | 79 | 79 |
| 6 | s__Klebsiella_sp003752615--RS_GCF_014298075.1 | 3 | 79.00000 | 0.000000 | 79.000000 | 0.000000 | 80.000000 | 0.000 | 0.0 | 79 | 79 |
| 7 | s__Pseudomonas_E_yamanorum_A--RS_GCF_013386765.1 | 7 | 79.00000 | 0.000000 | 79.000000 | 0.000000 | 79.000000 | 0.000 | 0.0 | 79 | 79 |
| 8 | s__Pseudomonas_E_pergaminensis--RS_GCF_0241123... | 7 | 79.00000 | 0.000000 | 79.000000 | 0.000000 | 79.000000 | 0.000 | 0.0 | 79 | 79 |
| 9 | s__Pseudomonas_E_sp001952855--RS_GCF_001952855.1 | 2 | 79.00000 | 0.000000 | 79.000000 | 0.000000 | 79.000000 | 0.000 | 0.0 | 79 | 79 |
3. Extract AlphaEarth Embeddings¶
Focus on embedding diversity as ecological niche breadth.
In [5]:
# Get AlphaEarth embeddings with species linkage
alphaearth_query = """
SELECT
a.genome_id,
g.gtdb_species_clade_id,
a.cleaned_lat,
a.cleaned_lon,
a.cleaned_year,
a.A00, a.A01, a.A02, a.A03, a.A04, a.A05, a.A06, a.A07, a.A08, a.A09,
a.A10, a.A11, a.A12, a.A13, a.A14, a.A15, a.A16, a.A17, a.A18, a.A19,
a.A20, a.A21, a.A22, a.A23, a.A24, a.A25, a.A26, a.A27, a.A28, a.A29,
a.A30, a.A31, a.A32, a.A33, a.A34, a.A35, a.A36, a.A37, a.A38, a.A39,
a.A40, a.A41, a.A42, a.A43, a.A44, a.A45, a.A46, a.A47, a.A48, a.A49,
a.A50, a.A51, a.A52, a.A53, a.A54, a.A55, a.A56, a.A57, a.A58, a.A59,
a.A60, a.A61, a.A62, a.A63
FROM kbase_ke_pangenome.alphaearth_embeddings_all_years a
JOIN kbase_ke_pangenome.genome g
ON a.genome_id = g.genome_id
WHERE a.cleaned_lat IS NOT NULL
AND a.cleaned_lon IS NOT NULL
"""
print("Querying AlphaEarth embeddings...")
alphaearth_df = spark.sql(alphaearth_query).toPandas()
print(f"Retrieved {len(alphaearth_df)} genomes with embeddings")
alphaearth_df.head()
Querying AlphaEarth embeddings... Retrieved 83286 genomes with embeddings
Out[5]:
| genome_id | gtdb_species_clade_id | cleaned_lat | cleaned_lon | cleaned_year | A00 | A01 | A02 | A03 | A04 | ... | A54 | A55 | A56 | A57 | A58 | A59 | A60 | A61 | A62 | A63 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RS_GCF_024106275.1 | s__Staphylococcus_lugdunensis--RS_GCF_002901705.1 | 29.637200 | -82.352000 | 2019 | -0.038447 | -0.186082 | 0.192910 | 0.098424 | -0.093564 | ... | 0.228897 | -0.276140 | 0.007443 | 0.038447 | 0.228897 | -0.192910 | 0.038447 | -0.059116 | 0.048228 | -0.088827 |
| 1 | GB_GCA_022483435.1 | s__Pediococcus_pentosaceus--RS_GCF_001437285.1 | 43.073847 | -89.413972 | 2020 | 0.051734 | -0.284444 | 0.066990 | -0.108512 | 0.027128 | ... | 0.221453 | -0.022207 | 0.084214 | 0.015748 | 0.022207 | -0.113741 | 0.035433 | -0.113741 | -0.166336 | 0.113741 |
| 2 | GB_GCA_022483705.1 | s__Pediococcus_pentosaceus--RS_GCF_001437285.1 | 43.073847 | -89.413972 | 2019 | 0.055363 | -0.259900 | 0.075356 | -0.075356 | 0.024606 | ... | 0.192910 | -0.048228 | 0.093564 | -0.003014 | 0.000984 | -0.103406 | 0.017778 | -0.098424 | -0.206936 | 0.108512 |
| 3 | GB_GCA_022484085.1 | s__Pediococcus_pentosaceus--RS_GCF_001437285.1 | 43.073847 | -89.413972 | 2020 | 0.051734 | -0.284444 | 0.066990 | -0.108512 | 0.027128 | ... | 0.221453 | -0.022207 | 0.084214 | 0.015748 | 0.022207 | -0.113741 | 0.035433 | -0.113741 | -0.166336 | 0.113741 |
| 4 | RS_GCF_001411765.2 | s__Pediococcus_pentosaceus--RS_GCF_001437285.1 | 35.100000 | 126.860000 | 2015 | -0.119093 | -0.051734 | 0.179377 | -0.124567 | -0.113741 | ... | 0.079723 | -0.214133 | -0.093564 | 0.062991 | -0.141730 | -0.066990 | -0.199862 | -0.318893 | -0.006151 | -0.071111 |
5 rows × 69 columns
In [6]:
# Calculate species-level embedding diversity metrics
from scipy.spatial.distance import pdist, squareform
from geopy.distance import geodesic
def calculate_species_niche_metrics(species_df):
"""Calculate niche breadth metrics using AlphaEarth embeddings."""
n_genomes = len(species_df)
if n_genomes < 2:
return {
'n_genomes': n_genomes,
'mean_lat': species_df['cleaned_lat'].iloc[0] if n_genomes > 0 else None,
'mean_lon': species_df['cleaned_lon'].iloc[0] if n_genomes > 0 else None,
'lat_range': 0,
'lon_range': 0,
'max_geodesic_dist_km': 0,
'mean_geodesic_dist_km': 0,
'max_embedding_dist': 0,
'mean_embedding_dist': 0,
'std_embedding_dist': 0,
'embedding_variance': 0,
'niche_breadth_score': 0 # Our primary metric
}
# Geographic coordinates
coords = species_df[['cleaned_lat', 'cleaned_lon']].values
# Calculate geodesic distances (km)
geodesic_dists = []
for i in range(len(coords)):
for j in range(i+1, len(coords)):
dist = geodesic(coords[i], coords[j]).kilometers
geodesic_dists.append(dist)
# Extract embedding vectors (64-dimensional)
embedding_cols = [f'A{i:02d}' for i in range(64)]
embeddings = species_df[embedding_cols].values
# Calculate embedding distances (Euclidean)
embedding_dists = pdist(embeddings, metric='euclidean')
# Calculate variance of embeddings across all dimensions
# Higher variance = more ecological diversity
embedding_variance = np.mean(embeddings.var(axis=0))
# Niche breadth score: combines mean distance and variance
niche_breadth = embedding_dists.mean() * (1 + embedding_variance)
return {
'n_genomes': n_genomes,
'mean_lat': species_df['cleaned_lat'].mean(),
'mean_lon': species_df['cleaned_lon'].mean(),
'lat_range': species_df['cleaned_lat'].max() - species_df['cleaned_lat'].min(),
'lon_range': species_df['cleaned_lon'].max() - species_df['cleaned_lon'].min(),
'max_geodesic_dist_km': max(geodesic_dists) if geodesic_dists else 0,
'mean_geodesic_dist_km': np.mean(geodesic_dists) if geodesic_dists else 0,
'max_embedding_dist': embedding_dists.max(),
'mean_embedding_dist': embedding_dists.mean(),
'std_embedding_dist': embedding_dists.std(),
'embedding_variance': embedding_variance,
'niche_breadth_score': niche_breadth
}
print("Calculating species-level niche breadth metrics...")
niche_metrics = []
for species_id, group in alphaearth_df.groupby('gtdb_species_clade_id'):
metrics = calculate_species_niche_metrics(group)
metrics['gtdb_species_clade_id'] = species_id
niche_metrics.append(metrics)
niche_df = pd.DataFrame(niche_metrics)
print(f"Calculated metrics for {len(niche_df)} species")
# Filter for species with sufficient sampling (≥5 genomes)
niche_filtered = niche_df[niche_df['n_genomes'] >= 5]
print(f"Species with ≥5 genomes: {len(niche_filtered)}")
print("\nNiche breadth metrics:")
print(niche_filtered[['n_genomes', 'mean_embedding_dist', 'embedding_variance', 'niche_breadth_score']].describe())
niche_filtered.head(10)
Calculating species-level niche breadth metrics...
Calculated metrics for 15046 species
Species with ≥5 genomes: 2159
Niche breadth metrics:
n_genomes mean_embedding_dist embedding_variance \
count 2159.000000 1872.000000 1872.000000
mean 27.255211 0.527124 0.003984
std 188.329826 0.425666 0.003697
min 5.000000 0.000000 0.000000
25% 5.500000 0.046734 0.000049
50% 7.000000 0.544867 0.003349
75% 13.000000 0.911480 0.007132
max 4948.000000 1.367756 0.012762
niche_breadth_score
count 1872.000000
mean 0.530757
std 0.429544
min 0.000000
25% 0.046738
50% 0.546645
75% 0.918731
max 1.383805
Out[6]:
| n_genomes | mean_lat | mean_lon | lat_range | lon_range | max_geodesic_dist_km | mean_geodesic_dist_km | max_embedding_dist | mean_embedding_dist | std_embedding_dist | embedding_variance | niche_breadth_score | gtdb_species_clade_id | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 7 | 38.614043 | -110.704886 | 0.378300 | 0.66580 | 71.508967 | 20.431134 | 0.654792 | 0.187083 | 0.295805 | 8.203179e-04 | 0.187237 | s__0-14-3-00-41-53_sp002780895--GB_GCA_0027808... |
| 7 | 6 | 39.536900 | -107.782800 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.450090e-34 | 0.000000 | s__01-FULL-54-110_sp001789045--GB_GCA_001789045.1 |
| 20 | 6 | 33.623133 | 103.576367 | 27.246400 | 2.04410 | 3029.488762 | 1615.727340 | 0.828147 | 0.455582 | 0.400736 | 2.396768e-03 | 0.456674 | s__12-FULL-67-14b_sp002737365--GB_GCA_002737365.1 |
| 22 | 19 | 39.740000 | -123.630000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.483967e-34 | 0.000000 | s__13-2-20CM-66-19_sp001914695--GB_GCA_0019146... |
| 26 | 6 | 39.740000 | -123.630000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.489586e-34 | 0.000000 | s__13-2-20CM-66-19_sp005878835--GB_GCA_0058788... |
| 40 | 10 | 47.109748 | -14.113086 | 16.382503 | 253.79092 | 10881.508941 | 4752.898844 | 0.981485 | 0.584350 | 0.440156 | 3.763134e-03 | 0.586548 | s__14-2_sp011960065--GB_GCA_910575555.1 |
| 43 | 5 | 48.851472 | -16.640864 | 16.139840 | 84.12767 | 6584.884145 | 2633.953658 | 0.981485 | 0.392594 | 0.480827 | 2.408281e-03 | 0.393539 | s__14-2_sp910575075--GB_GCA_910575075.1 |
| 59 | 5 | 48.851472 | -16.640864 | 16.139840 | 84.12767 | 6584.884145 | 2633.953658 | 0.981485 | 0.392594 | 0.480827 | 2.408281e-03 | 0.393539 | s__1XD8-76_sp910573755--GB_GCA_910573755.1 |
| 61 | 5 | 39.536900 | -107.782800 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 5.115751e-35 | 0.000000 | s__2-01-FULL-37-20_sp001784115--GB_GCA_0017841... |
| 65 | 6 | 39.536900 | -107.782800 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.450090e-34 | 0.000000 | s__2-01-FULL-39-21_sp001792535--GB_GCA_0017925... |
4. Integrate Datasets¶
In [7]:
# Merge all datasets
print("Merging datasets...")
# Start with pangenome data (all species)
integrated_df = pangenome_df.copy()
# Add pathway data
integrated_df = integrated_df.merge(
pathway_df,
on='gtdb_species_clade_id',
how='left',
suffixes=('', '_pathway')
)
# Add niche breadth data
integrated_df = integrated_df.merge(
niche_filtered,
on='gtdb_species_clade_id',
how='left',
suffixes=('', '_niche')
)
print(f"\nIntegrated dataset: {len(integrated_df)} species")
print(f"Species with pathway data: {integrated_df['mean_complete_pathways'].notna().sum()}")
print(f"Species with niche data: {integrated_df['niche_breadth_score'].notna().sum()}")
print(f"Species with all three data types: {integrated_df[['mean_complete_pathways', 'niche_breadth_score']].notna().all(axis=1).sum()}")
integrated_df.head()
Merging datasets... Integrated dataset: 27690 species Species with pathway data: 27690 Species with niche data: 1872 Species with all three data types: 1872
Out[7]:
| gtdb_species_clade_id | GTDB_species | GTDB_taxonomy | no_genomes | no_core | no_aux_genome | no_singleton_gene_clusters | no_gene_clusters | mean_intra_species_ANI | ANI_circumscription_radius | ... | mean_lon | lat_range | lon_range | max_geodesic_dist_km | mean_geodesic_dist_km | max_embedding_dist | mean_embedding_dist | std_embedding_dist | embedding_variance | niche_breadth_score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s__Staphylococcus_aureus--RS_GCF_001027105.1 | s__Staphylococcus_aureus | d__Bacteria;p__Bacillota;c__Bacilli;o__Staphyl... | 14526 | 2083 | 145831 | 86127.0 | 147914 | 98.76 | 95.0000 | ... | 21.737536 | 100.746850 | 311.1964 | 19950.223137 | 8517.202820 | NaN | NaN | NaN | NaN | NaN |
| 1 | s__Klebsiella_pneumoniae--RS_GCF_000742135.1 | s__Klebsiella_pneumoniae | d__Bacteria;p__Pseudomonadota;c__Gammaproteoba... | 14240 | 4199 | 438925 | 276743.0 | 443124 | 98.97 | 95.2390 | ... | 44.043047 | 103.479176 | 310.9133 | 19941.671766 | 7954.427146 | NaN | NaN | NaN | NaN | NaN |
| 2 | s__Salmonella_enterica--RS_GCF_000006945.2 | s__Salmonella_enterica | d__Bacteria;p__Pseudomonadota;c__Gammaproteoba... | 11402 | 3639 | 262732 | 149757.0 | 266371 | 98.82 | 95.0604 | ... | 27.242457 | 124.952380 | 276.1451 | 19953.606023 | 8222.034417 | NaN | NaN | NaN | NaN | NaN |
| 3 | s__Streptococcus_pneumoniae--RS_GCF_001457635.1 | s__Streptococcus_pneumoniae | d__Bacteria;p__Bacillota;c__Bacilli;o__Lactoba... | 8434 | 1475 | 115370 | 67191.0 | 116845 | 98.62 | 95.0000 | ... | 18.353585 | 94.674421 | 235.8000 | 18784.952385 | 7397.429695 | 1.591833 | 1.081545 | 0.339592 | 0.010021 | 1.092382 |
| 4 | s__Mycobacterium_tuberculosis--RS_GCF_000195955.2 | s__Mycobacterium_tuberculosis | d__Bacteria;p__Actinomycetota;c__Actinomycetia... | 6903 | 3741 | 139929 | 97549.0 | 143670 | 99.92 | 95.0000 | ... | -37.978282 | 101.275879 | 302.9693 | 19910.410057 | 6327.529730 | NaN | NaN | NaN | NaN | NaN |
5 rows × 36 columns
5. Save Extracted Data¶
In [ ]:
# Create data directory
data_dir = Path('../data')
data_dir.mkdir(exist_ok=True)
# Save datasets
print("Saving datasets...")
pangenome_df.to_csv(data_dir / 'pangenome_metrics.csv', index=False)
print(f"Saved: pangenome_metrics.csv ({len(pangenome_df)} species)")
pathway_df.to_csv(data_dir / 'pathway_completeness.csv', index=False)
print(f"Saved: pathway_completeness.csv ({len(pathway_df)} species)")
niche_df.to_csv(data_dir / 'niche_breadth_metrics.csv', index=False)
print(f"Saved: niche_breadth_metrics.csv ({len(niche_df)} species)")
integrated_df.to_csv(data_dir / 'integrated_dataset.csv', index=False)
print(f"Saved: integrated_dataset.csv ({len(integrated_df)} species)")
# Save genome-level AlphaEarth data for detailed analysis
alphaearth_df.to_csv(data_dir / 'alphaearth_genome_embeddings.csv', index=False)
print(f"Saved: alphaearth_genome_embeddings.csv ({len(alphaearth_df)} genomes)")
print("\nAll data extracted successfully!")
Saving datasets... Saved: pangenome_metrics.csv (27690 species) Saved: pathway_completeness.csv (27690 species) Saved: niche_breadth_metrics.csv (15046 species) Saved: integrated_dataset.csv (27690 species)
Summary Statistics¶
In [ ]:
print("=" * 70)
print("DATA EXTRACTION SUMMARY")
print("=" * 70)
print(f"\nPangenome Statistics:")
print(f" Total species: {len(pangenome_df)}")
print(f" Median accessory/core ratio: {pangenome_df['accessory_core_ratio'].median():.2f}")
print(f" Median core fraction: {pangenome_df['core_fraction'].median():.2f}")
print(f"\nPathway Completeness:")
print(f" Species with pathway data: {len(pathway_df)}")
print(f" Median complete pathways: {pathway_df['mean_complete_pathways'].median():.1f} / 80")
print(f" Median likely complete: {pathway_df['mean_likely_complete'].median():.1f} / 80")
print(f" Range of pathway completeness: {pathway_df['mean_complete_pathways'].min():.1f} - {pathway_df['mean_complete_pathways'].max():.1f}")
print(f"\nEcological Niche Breadth (AlphaEarth):")
print(f" Species with ≥5 genomes with embeddings: {len(niche_filtered)}")
print(f" Coverage: {len(niche_filtered) / len(pangenome_df) * 100:.1f}% of species")
print(f" Median embedding distance: {niche_filtered['mean_embedding_dist'].median():.3f}")
print(f" Median niche breadth score: {niche_filtered['niche_breadth_score'].median():.3f}")
print(f"\nIntegrated Dataset:")
complete = integrated_df[['mean_complete_pathways', 'niche_breadth_score']].notna().all(axis=1)
print(f" Species with all three data types: {complete.sum()}")
print(f" Ready for comparative analysis: {complete.sum() / len(integrated_df) * 100:.1f}%")
print("\n" + "=" * 70)
In [ ]: