01 Data Extraction
Jupyter notebook from the Carbon Source Utilization Predicts Ecology and Lifestyle in Pseudomonas project.
01 - Data Extraction: Pseudomonas Carbon Pathways & Metadata¶
Requires BERDL JupyterHub (Spark SQL access)
Extracts all Pseudomonas data needed for downstream analysis:
- Species list with pangenome stats and GTDB subgenus classification
- Genome-level GapMind carbon pathway best scores
- Isolation source metadata from ncbi_env and gtdb_metadata
Outputs saved to ../data/ for use by subsequent notebooks.
In [ ]:
# get_spark_session() is auto-injected by the BERDL JupyterHub kernel
import os
import pandas as pd
spark = get_spark_session()
DATA_DIR = os.path.join(os.path.dirname(os.getcwd()), 'data') if os.path.basename(os.getcwd()) == 'notebooks' else 'data'
os.makedirs(DATA_DIR, exist_ok=True)
print(f'Data directory: {DATA_DIR}')
1. Pseudomonas Species List with Pangenome Stats¶
In [2]:
species_df = spark.sql("""
SELECT
p.gtdb_species_clade_id,
sc.GTDB_species,
sc.GTDB_taxonomy,
CAST(p.no_genomes AS INT) as no_genomes,
CAST(p.no_core AS INT) as no_core,
CAST(p.no_aux_genome AS INT) as no_aux,
CAST(p.no_singleton_gene_clusters AS INT) as no_singleton,
CAST(p.no_gene_clusters AS INT) as no_gene_clusters,
CAST(sc.mean_intra_species_ANI AS FLOAT) as mean_ani,
CASE
WHEN p.gtdb_species_clade_id LIKE 's__Pseudomonas_E_%' THEN 'Pseudomonas_E'
WHEN p.gtdb_species_clade_id LIKE 's__Pseudomonas_B_%' THEN 'Pseudomonas_B'
WHEN p.gtdb_species_clade_id LIKE 's__Pseudomonas_A_%' THEN 'Pseudomonas_A'
WHEN p.gtdb_species_clade_id LIKE 's__Pseudomonas_F_%' THEN 'Pseudomonas_F'
WHEN p.gtdb_species_clade_id LIKE 's__Pseudomonas_H_%' THEN 'Pseudomonas_H'
ELSE 'Pseudomonas'
END as gtdb_subgenus
FROM kbase_ke_pangenome.pangenome p
JOIN kbase_ke_pangenome.gtdb_species_clade sc
ON p.gtdb_species_clade_id = sc.gtdb_species_clade_id
WHERE p.gtdb_species_clade_id LIKE 's__Pseudomonas_%'
ORDER BY no_genomes DESC
""").toPandas()
species_df['core_fraction'] = species_df['no_core'] / species_df['no_gene_clusters']
species_df['aux_fraction'] = species_df['no_aux'] / species_df['no_gene_clusters']
print(f'Total Pseudomonas species: {len(species_df)}')
print(f'Total genomes: {species_df.no_genomes.sum()}')
print(f'\nSubgenus distribution:')
print(species_df.groupby('gtdb_subgenus').agg(
n_species=('gtdb_species_clade_id', 'count'),
total_genomes=('no_genomes', 'sum')
).sort_values('total_genomes', ascending=False))
species_df.to_csv(os.path.join(DATA_DIR, 'pseudomonas_species.csv'), index=False)
print(f'\nSaved to {DATA_DIR}/pseudomonas_species.csv')
species_df.head(20)
Total Pseudomonas species: 433
Total genomes: 12727
Note: The species-table sum (12,727) is 5 fewer than the genome table (12,732).
This reflects genomes present in the genome table but not mapped to a species in the pangenome table.
Subgenus distribution:
n_species total_genomes
gtdb_subgenus
Pseudomonas 19 6905
Pseudomonas_E 398 5687
Pseudomonas_B 7 82
Pseudomonas_F 7 44
Pseudomonas_H 2 9
Saved to /home/mamillerpa/BERIL-research-observatory/projects/pseudomonas_carbon_ecology/data/pseudomonas_species.csv
Out[2]:
| gtdb_species_clade_id | GTDB_species | GTDB_taxonomy | no_genomes | no_core | no_aux | no_singleton | no_gene_clusters | mean_ani | gtdb_subgenus | core_fraction | aux_fraction | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s__Pseudomonas_aeruginosa--RS_GCF_001457615.1 | s__Pseudomonas_aeruginosa | d__Bacteria;p__Pseudomonadota;c__Gammaproteoba... | 6760 | 5199 | 250894 | 120728 | 256093 | 99.120003 | Pseudomonas | 0.020301 | 0.979699 |
| 1 | s__Pseudomonas_E_viridiflava--RS_GCF_001642795.1 | s__Pseudomonas_E_viridiflava | d__Bacteria;p__Pseudomonadota;c__Gammaproteoba... | 1167 | 4524 | 58638 | 37856 | 63162 | 97.349998 | Pseudomonas_E | 0.071625 | 0.928375 |
| 2 | s__Pseudomonas_E_avellanae--RS_GCF_000444135.1 | s__Pseudomonas_E_avellanae | d__Bacteria;p__Pseudomonadota;c__Gammaproteoba... | 319 | 4006 | 59148 | 31471 | 63154 | 96.760002 | Pseudomonas_E | 0.063432 | 0.936568 |
| 3 | s__Pseudomonas_E_amygdali--RS_GCF_002699855.1 | s__Pseudomonas_E_amygdali | d__Bacteria;p__Pseudomonadota;c__Gammaproteoba... | 239 | 4000 | 54575 | 25328 | 58575 | 98.059998 | Pseudomonas_E | 0.068289 | 0.931711 |
| 4 | s__Pseudomonas_E_syringae--RS_GCF_000507185.2 | s__Pseudomonas_E_syringae | d__Bacteria;p__Pseudomonadota;c__Gammaproteoba... | 156 | 4204 | 31848 | 18906 | 36052 | 97.790001 | Pseudomonas_E | 0.116609 | 0.883391 |
| 5 | s__Pseudomonas_E_alloputida--RS_GCF_021282585.1 | s__Pseudomonas_E_alloputida | d__Bacteria;p__Pseudomonadota;c__Gammaproteoba... | 128 | 4147 | 38600 | 22481 | 42747 | 97.489998 | Pseudomonas_E | 0.097013 | 0.902987 |
| 6 | s__Pseudomonas_E_syringae_M--RS_GCF_009176725.1 | s__Pseudomonas_E_syringae_M | d__Bacteria;p__Pseudomonadota;c__Gammaproteoba... | 126 | 4412 | 25505 | 15642 | 29917 | 98.699997 | Pseudomonas_E | 0.147475 | 0.852525 |
| 7 | s__Pseudomonas_E_fulva--RS_GCF_000730565.1 | s__Pseudomonas_E_fulva | d__Bacteria;p__Pseudomonadota;c__Gammaproteoba... | 94 | 3865 | 15181 | 9323 | 19046 | 99.019997 | Pseudomonas_E | 0.202930 | 0.797070 |
| 8 | s__Pseudomonas_E_paracarnis--RS_GCF_904063055.1 | s__Pseudomonas_E_paracarnis | d__Bacteria;p__Pseudomonadota;c__Gammaproteoba... | 88 | 4390 | 27364 | 16782 | 31754 | 97.580002 | Pseudomonas_E | 0.138250 | 0.861750 |
| 9 | s__Pseudomonas_E_tremae--RS_GCF_001401155.1 | s__Pseudomonas_E_tremae | d__Bacteria;p__Pseudomonadota;c__Gammaproteoba... | 85 | 4335 | 14713 | 7499 | 19048 | 99.019997 | Pseudomonas_E | 0.227583 | 0.772417 |
| 10 | s__Pseudomonas_E_asiatica--RS_GCF_009932335.1 | s__Pseudomonas_E_asiatica | d__Bacteria;p__Pseudomonadota;c__Gammaproteoba... | 82 | 4270 | 25481 | 14016 | 29751 | 98.330002 | Pseudomonas_E | 0.143525 | 0.856475 |
| 11 | s__Pseudomonas_E_chlororaphis_F--RS_GCF_003850... | s__Pseudomonas_E_chlororaphis_F | d__Bacteria;p__Pseudomonadota;c__Gammaproteoba... | 82 | 4971 | 16735 | 9528 | 21706 | 96.160004 | Pseudomonas_E | 0.229015 | 0.770985 |
| 12 | s__Pseudomonas_E_protegens--RS_GCF_000397205.1 | s__Pseudomonas_E_protegens | d__Bacteria;p__Pseudomonadota;c__Gammaproteoba... | 78 | 5532 | 11941 | 7290 | 17473 | 98.779999 | Pseudomonas_E | 0.316603 | 0.683397 |
| 13 | s__Pseudomonas_E_monteilii--RS_GCF_000730605.1 | s__Pseudomonas_E_monteilii | d__Bacteria;p__Pseudomonadota;c__Gammaproteoba... | 73 | 4063 | 19726 | 9620 | 23789 | 97.870003 | Pseudomonas_E | 0.170793 | 0.829207 |
| 14 | s__Pseudomonas_E_cerasi--RS_GCF_900074915.1 | s__Pseudomonas_E_cerasi | d__Bacteria;p__Pseudomonadota;c__Gammaproteoba... | 70 | 4452 | 14905 | 8229 | 19357 | 98.510002 | Pseudomonas_E | 0.229994 | 0.770006 |
| 15 | s__Pseudomonas_E_mandelii--RS_GCF_900106065.1 | s__Pseudomonas_E_mandelii | d__Bacteria;p__Pseudomonadota;c__Gammaproteoba... | 65 | 5626 | 10307 | 7003 | 15933 | 98.919998 | Pseudomonas_E | 0.353104 | 0.646896 |
| 16 | s__Pseudomonas_E_juntendi--RS_GCF_009932375.1 | s__Pseudomonas_E_juntendi | d__Bacteria;p__Pseudomonadota;c__Gammaproteoba... | 53 | 3930 | 19173 | 10158 | 23103 | 98.010002 | Pseudomonas_E | 0.170108 | 0.829892 |
| 17 | s__Pseudomonas_E_sp002874965--RS_GCF_002874965.1 | s__Pseudomonas_E_sp002874965 | d__Bacteria;p__Pseudomonadota;c__Gammaproteoba... | 52 | 5553 | 13483 | 9895 | 19036 | 98.779999 | Pseudomonas_E | 0.291710 | 0.708290 |
| 18 | s__Pseudomonas_aeruginosa_A--RS_GCF_000017205.1 | s__Pseudomonas_aeruginosa_A | d__Bacteria;p__Pseudomonadota;c__Gammaproteoba... | 43 | 5150 | 10182 | 4023 | 15332 | 99.029999 | Pseudomonas | 0.335899 | 0.664101 |
| 19 | s__Pseudomonas_B_oryzihabitans--RS_GCF_0007306... | s__Pseudomonas_B_oryzihabitans | d__Bacteria;p__Pseudomonadota;c__Gammaproteoba... | 43 | 3899 | 14636 | 10002 | 18535 | 98.160004 | Pseudomonas_B | 0.210359 | 0.789641 |
2. Genome-Level GapMind Carbon Pathway Scores¶
GapMind has multiple rows per genome-pathway pair (one per step). We take the MAX score per pair.
Score hierarchy: complete (5) > likely_complete (4) > steps_missing_low (3) > steps_missing_medium (2) > not_present (1)
In [3]:
# Extract genome-level best scores for all Pseudomonas carbon pathways
# This query aggregates ~10M rows down to ~800K (genomes x pathways)
pathway_scores = spark.sql("""
WITH scored AS (
SELECT
clade_name,
genome_id,
pathway,
CASE score_category
WHEN 'complete' THEN 5
WHEN 'likely_complete' THEN 4
WHEN 'steps_missing_low' THEN 3
WHEN 'steps_missing_medium' THEN 2
WHEN 'not_present' THEN 1
ELSE 0
END as score_value
FROM kbase_ke_pangenome.gapmind_pathways
WHERE clade_name LIKE 's__Pseudomonas_%'
AND metabolic_category = 'carbon'
)
SELECT clade_name, genome_id, pathway, MAX(score_value) as best_score
FROM scored
GROUP BY clade_name, genome_id, pathway
""")
pathway_df = pathway_scores.toPandas()
print(f'Genome-pathway pairs: {len(pathway_df):,}')
print(f'Unique genomes: {pathway_df.genome_id.nunique():,}')
print(f'Unique pathways: {pathway_df.pathway.nunique()}')
print(f'\nScore distribution:')
print(pathway_df.best_score.value_counts().sort_index())
pathway_df.to_csv(os.path.join(DATA_DIR, 'carbon_pathway_scores.csv'), index=False)
print(f'\nSaved to {DATA_DIR}/carbon_pathway_scores.csv')
Genome-pathway pairs: 789,012 Unique genomes: 12,726 Unique pathways: 62 Score distribution: best_score 1 15062 2 12285 3 75045 4 62019 5 624601 Name: count, dtype: int64
Saved to /home/mamillerpa/BERIL-research-observatory/projects/pseudomonas_carbon_ecology/data/carbon_pathway_scores.csv
In [4]:
# Quick sanity check: pathway completeness for P. aeruginosa vs P. putida
check = pathway_df.copy()
check['is_complete'] = check['best_score'] >= 4 # complete or likely_complete
for species_prefix in ['s__Pseudomonas_aeruginosa--', 's__Pseudomonas_E_putida--']:
sub = check[check['clade_name'].str.startswith(species_prefix)]
if len(sub) > 0:
completeness = sub.groupby('pathway')['is_complete'].mean().sort_values(ascending=False)
name = species_prefix.split('--')[0].replace('s__', '')
print(f'\n{name}: {sub.genome_id.nunique()} genomes')
print(f' Pathways complete in >90% genomes: {(completeness > 0.9).sum()}')
print(f' Pathways complete in <10% genomes: {(completeness < 0.1).sum()}')
print(f' Top 10 pathways: {completeness.head(10).index.tolist()}')
Pseudomonas_aeruginosa: 6760 genomes Pathways complete in >90% genomes: 52 Pathways complete in <10% genomes: 8 Top 10 pathways: ['ethanol', 'threonine', 'deoxyribose', 'acetate', 'pyruvate', '4-hydroxybenzoate', 'tryptophan', 'putrescine', 'deoxyinosine', 'thymidine'] Pseudomonas_E_putida: 33 genomes Pathways complete in >90% genomes: 52 Pathways complete in <10% genomes: 8 Top 10 pathways: ['2-oxoglutarate', '4-hydroxybenzoate', 'D-alanine', 'D-lactate', 'D-serine', 'L-lactate', 'L-malate', 'acetate', 'alanine', 'arginine']
3. Isolation Source Metadata¶
In [5]:
# Extract isolation sources from both ncbi_env and gtdb_metadata
isolation_df = spark.sql("""
SELECT
g.genome_id,
g.gtdb_species_clade_id,
ne.content as ncbi_env_isolation_source,
m.ncbi_isolation_source as gtdb_isolation_source,
host.content as host,
env_broad.content as env_broad_scale,
env_local.content as env_local_scale,
env_medium.content as env_medium
FROM kbase_ke_pangenome.genome g
JOIN kbase_ke_pangenome.sample s ON g.genome_id = s.genome_id
JOIN kbase_ke_pangenome.gtdb_metadata m ON g.genome_id = m.accession
LEFT JOIN kbase_ke_pangenome.ncbi_env ne
ON s.ncbi_biosample_accession_id = ne.accession
AND ne.harmonized_name = 'isolation_source'
LEFT JOIN kbase_ke_pangenome.ncbi_env host
ON s.ncbi_biosample_accession_id = host.accession
AND host.harmonized_name = 'host'
LEFT JOIN kbase_ke_pangenome.ncbi_env env_broad
ON s.ncbi_biosample_accession_id = env_broad.accession
AND env_broad.harmonized_name = 'env_broad_scale'
LEFT JOIN kbase_ke_pangenome.ncbi_env env_local
ON s.ncbi_biosample_accession_id = env_local.accession
AND env_local.harmonized_name = 'env_local_scale'
LEFT JOIN kbase_ke_pangenome.ncbi_env env_medium
ON s.ncbi_biosample_accession_id = env_medium.accession
AND env_medium.harmonized_name = 'env_medium'
WHERE g.gtdb_species_clade_id LIKE 's__Pseudomonas_%'
""").toPandas()
print(f'Total Pseudomonas genomes: {len(isolation_df):,}')
print(f'With ncbi_env isolation_source: {isolation_df.ncbi_env_isolation_source.notna().sum():,}')
print(f'With gtdb isolation_source: {(isolation_df.gtdb_isolation_source.notna() & (isolation_df.gtdb_isolation_source != "")).sum():,}')
print(f'With host field: {isolation_df.host.notna().sum():,}')
print(f'With env_broad_scale: {isolation_df.env_broad_scale.notna().sum():,}')
isolation_df.to_csv(os.path.join(DATA_DIR, 'isolation_sources.csv'), index=False)
print(f'\nSaved to {DATA_DIR}/isolation_sources.csv')
Total Pseudomonas genomes: 12,732 With ncbi_env isolation_source: 8,486 With gtdb isolation_source: 12,731 With host field: 7,073 With env_broad_scale: 2,225 Saved to /home/mamillerpa/BERIL-research-observatory/projects/pseudomonas_carbon_ecology/data/isolation_sources.csv
In [6]:
# Preview isolation sources
print('Top 30 ncbi_env isolation sources:')
print(isolation_df['ncbi_env_isolation_source'].value_counts().head(30).to_string())
print('\nTop 20 host values:')
print(isolation_df['host'].value_counts().head(20).to_string())
print('\nTop 20 env_broad_scale values:')
print(isolation_df['env_broad_scale'].value_counts().head(20).to_string())
Top 30 ncbi_env isolation sources: ncbi_env_isolation_source sputum 846 missing 614 blood 443 hospital 290 not collected 276 groundwater 272 urine 232 soil 212 itra-abdominal tract infection 129 respiratory tract infection 122 lung 101 wound 100 urinary tract infection 98 derived from the strain Pseudomonas aeruginosa ATCC 27853 91 cystic fibrosis patient 81 Symptomatic mushroom tissue 78 raw milk 77 throat 73 leaf 70 River 69 feces 54 International Space Station during MT-2 48 Ward 47 agricultural soil 45 bronchoalveolar lavage 42 bulk soil from sugarcane field 40 rhizosphere 40 respiratory 40 Portion of Sputum 40 Seed 38 Top 20 host values: host Homo sapiens 4476 not applicable 368 Prunus avium 219 Agaricus bisporus 89 missing 68 tomato 67 Phaseolus vulgaris 66 not collected 60 Coffea arabica 54 Arabidopsis thaliana 52 Gallus gallus 43 dog 42 Actinidia 35 wheat 35 Actinidia chinensis 35 Unknown 34 Actinidia deliciosa 32 Solanum lycopersicum 27 Glycine max 27 not available 25 Top 20 env_broad_scale values: env_broad_scale missing 404 ENVO:00000114 182 medical 147 not applicable 144 freshwater environment [ENVO:01000306] 128 agricultural field 116 veterinary 64 indoor biome 48 NA 45 Alicante 40 human 31 plant associated biome 27 environmental 25 Human sputum 24 Human-associated habitat 24 Agricultural 22 aquatic biome 20 built environment 19 ENVO:00000428 19 boreal forest 19
Summary¶
Data files saved to ../data/:
pseudomonas_species.csv— 433 species with pangenome stats and subgenus classificationcarbon_pathway_scores.csv— genome x pathway best scores (62 pathways)isolation_sources.csv— isolation source metadata for environment harmonization
In [7]:
print('=== Data Extraction Complete ===')
for f in os.listdir(DATA_DIR):
fpath = os.path.join(DATA_DIR, f)
if os.path.isfile(fpath):
size_mb = os.path.getsize(fpath) / 1e6
print(f' {f}: {size_mb:.1f} MB')
=== Data Extraction Complete === pseudomonas_species.csv: 0.1 MB carbon_pathway_scores.csv: 59.0 MB isolation_sources.csv: 1.4 MB