01 Openness Stratification
Jupyter notebook from the Openness vs Functional Composition project.
Notebook 01: Openness Stratification¶
Project: Openness vs Functional Composition
Goal: Extract pangenome statistics, compute openness metrics, stratify species into quartiles, and select ~10 species per quartile balanced by phylum.
Data Sources:
kbase_ke_pangenome.pangenome— Per-species pangenome stats (27K rows)kbase_ke_pangenome.gtdb_species_clade— Taxonomy (27K rows)
Output: ../data/species_openness_quartiles.csv
In [1]:
spark = get_spark_session()
1. Extract Pangenome Stats¶
Get all species with >= 50 genomes (robust pangenome estimates).
In [2]:
pangenome_df = spark.sql("""
SELECT
p.gtdb_species_clade_id,
CAST(p.no_genomes AS INT) as no_genomes,
CAST(p.no_gene_clusters AS INT) as no_gene_clusters,
CAST(p.no_core AS INT) as no_core,
CAST(p.no_aux_genome AS INT) as no_aux_genome,
CAST(p.no_singleton_gene_clusters AS INT) as no_singleton,
sc.GTDB_taxonomy,
sc.GTDB_species
FROM kbase_ke_pangenome.pangenome p
JOIN kbase_ke_pangenome.gtdb_species_clade sc
ON p.gtdb_species_clade_id = sc.gtdb_species_clade_id
WHERE CAST(p.no_genomes AS INT) >= 50
""")
print(f"Species with >= 50 genomes: {pangenome_df.count():,}")
pangenome_df.show(5, truncate=60)
Species with >= 50 genomes: 457 +-------------------------------------------------+----------+----------------+-------+-------------+------------+------------------------------------------------------------+-----------------------------+ | gtdb_species_clade_id|no_genomes|no_gene_clusters|no_core|no_aux_genome|no_singleton| GTDB_taxonomy| GTDB_species| +-------------------------------------------------+----------+----------------+-------+-------------+------------+------------------------------------------------------------+-----------------------------+ | s__Klebsiella_pneumoniae--RS_GCF_000742135.1| 14240| 443124| 4199| 438925| 276743|d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__E...| s__Klebsiella_pneumoniae| | s__Staphylococcus_aureus--RS_GCF_001027105.1| 14526| 147914| 2083| 145831| 86127|d__Bacteria;p__Bacillota;c__Bacilli;o__Staphylococcales;f...| s__Staphylococcus_aureus| | s__Salmonella_enterica--RS_GCF_000006945.2| 11402| 266371| 3639| 262732| 149757|d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__E...| s__Salmonella_enterica| | s__Streptococcus_pneumoniae--RS_GCF_001457635.1| 8434| 116845| 1475| 115370| 67191|d__Bacteria;p__Bacillota;c__Bacilli;o__Lactobacillales;f_...| s__Streptococcus_pneumoniae| |s__Mycobacterium_tuberculosis--RS_GCF_000195955.2| 6903| 143670| 3741| 139929| 97549|d__Bacteria;p__Actinomycetota;c__Actinomycetia;o__Mycobac...|s__Mycobacterium_tuberculosis| +-------------------------------------------------+----------+----------------+-------+-------------+------------+------------------------------------------------------------+-----------------------------+ only showing top 5 rows
2. Compute Openness Metrics¶
In [3]:
from pyspark.sql import functions as F
# Compute openness metrics
openness_df = pangenome_df.withColumn(
'core_fraction', F.col('no_core') / F.col('no_gene_clusters')
).withColumn(
'singleton_fraction', F.col('no_singleton') / F.col('no_gene_clusters')
).withColumn(
'accessory_fraction', F.col('no_aux_genome') / F.col('no_gene_clusters')
).withColumn(
'openness', 1 - (F.col('no_core') / F.col('no_gene_clusters'))
).withColumn(
'phylum', F.split(F.col('GTDB_taxonomy'), ';').getItem(1)
)
# Summary statistics
openness_pdf = openness_df.toPandas()
print(f"Species count: {len(openness_pdf)}")
print(f"\nOpenness distribution:")
print(openness_pdf['openness'].describe())
print(f"\nPhylum distribution:")
print(openness_pdf['phylum'].value_counts().head(15))
Species count: 457 Openness distribution: count 457.000000 mean 0.845351 std 0.114875 min 0.094603 25% 0.810411 50% 0.873389 75% 0.914878 max 0.990524 Name: openness, dtype: float64 Phylum distribution: phylum p__Pseudomonadota 193 p__Bacillota 112 p__Bacillota_A 49 p__Bacteroidota 41 p__Actinomycetota 32 p__Campylobacterota 10 p__Spirochaetota 6 p__Chlamydiota 4 p__Verrucomicrobiota 3 p__Bacillota_C 2 p__Cyanobacteriota 2 p__Halobacteriota 1 p__Thermoproteota 1 p__Acidobacteriota 1 Name: count, dtype: int64
3. Stratify into Quartiles¶
In [4]:
import pandas as pd
# Assign quartiles by openness
openness_pdf['openness_quartile'] = pd.qcut(
openness_pdf['openness'], q=4, labels=['Q1_closed', 'Q2', 'Q3', 'Q4_open']
)
print("Species per quartile:")
print(openness_pdf['openness_quartile'].value_counts().sort_index())
print("\nOpenness range per quartile:")
print(openness_pdf.groupby('openness_quartile')['openness'].agg(['min', 'max', 'median']))
Species per quartile:
openness_quartile
Q1_closed 115
Q2 114
Q3 114
Q4_open 114
Name: count, dtype: int64
Openness range per quartile:
min max median
openness_quartile
Q1_closed 0.094603 0.810411 0.759087
Q2 0.810522 0.873389 0.846982
Q3 0.873793 0.914878 0.895281
Q4_open 0.915059 0.990524 0.935623
In [5]:
# Phylum distribution per quartile
print("Phylum distribution by quartile:")
print(pd.crosstab(openness_pdf['openness_quartile'], openness_pdf['phylum']))
Phylum distribution by quartile: phylum p__Acidobacteriota p__Actinomycetota p__Bacillota \ openness_quartile Q1_closed 1 10 27 Q2 0 8 26 Q3 0 4 24 Q4_open 0 10 35 phylum p__Bacillota_A p__Bacillota_C p__Bacteroidota \ openness_quartile Q1_closed 4 0 6 Q2 5 2 7 Q3 22 0 15 Q4_open 18 0 13 phylum p__Campylobacterota p__Chlamydiota p__Cyanobacteriota \ openness_quartile Q1_closed 2 4 0 Q2 1 0 0 Q3 1 0 1 Q4_open 6 0 1 phylum p__Halobacteriota p__Pseudomonadota p__Spirochaetota \ openness_quartile Q1_closed 1 56 2 Q2 0 63 2 Q3 0 44 1 Q4_open 0 30 1 phylum p__Thermoproteota p__Verrucomicrobiota openness_quartile Q1_closed 1 1 Q2 0 0 Q3 0 2 Q4_open 0 0
4. Select Target Species¶
Select ~10 species per quartile, balanced by phylum where possible. Prefer species with more genomes (better pangenome estimates).
In [6]:
# Select top species per quartile, balancing phylum representation
target_species = []
for q in ['Q1_closed', 'Q2', 'Q3', 'Q4_open']:
q_df = openness_pdf[openness_pdf['openness_quartile'] == q].copy()
# Get top phyla in this quartile
top_phyla = q_df['phylum'].value_counts().head(5).index.tolist()
selected = []
# Take 2 species per top phylum (up to 10 total)
for phylum in top_phyla:
phylum_species = q_df[q_df['phylum'] == phylum].nlargest(2, 'no_genomes')
selected.append(phylum_species)
if sum(len(s) for s in selected) >= 10:
break
q_selected = pd.concat(selected).head(10)
target_species.append(q_selected)
print(f"\n{q}: {len(q_selected)} species")
print(q_selected[['GTDB_species', 'phylum', 'no_genomes', 'openness']].to_string(index=False))
target_df = pd.concat(target_species)
print(f"\nTotal target species: {len(target_df)}")
Q1_closed: 10 species
GTDB_species phylum no_genomes openness
s__Erwinia_amylovora p__Pseudomonadota 231 0.773291
s__Rhodophyticola_sp013179285 p__Pseudomonadota 212 0.094603
s__Staphylococcus_argenteus p__Bacillota 218 0.732505
s__Mycoplasmoides_pneumoniae p__Bacillota 184 0.764535
s__Corynebacterium_pseudotuberculosis p__Actinomycetota 123 0.533730
s__Bifidobacterium_animalis p__Actinomycetota 115 0.776117
s__Riemerella_anatipestifer p__Bacteroidota 106 0.795809
s__Tenacibaculum_finnmarkense_A p__Bacteroidota 90 0.673787
s__Chlamydia_trachomatis p__Chlamydiota 167 0.501459
s__Chlamydophila_psittaci p__Chlamydiota 64 0.611493
Q2: 10 species
GTDB_species phylum no_genomes openness
s__Francisella_tularensis p__Pseudomonadota 852 0.860075
s__Burkholderia_cenocepacia p__Pseudomonadota 338 0.862164
s__Streptococcus_mutans p__Bacillota 261 0.817463
s__Bacillus_licheniformis p__Bacillota 226 0.816783
s__Cutibacterium_acnes p__Actinomycetota 361 0.868839
s__Mycobacterium_intracellulare p__Actinomycetota 104 0.834842
s__Flavobacterium_psychrophilum p__Bacteroidota 250 0.815604
s__Elizabethkingia_anophelis p__Bacteroidota 135 0.823687
s__Clostridium_beijerinckii p__Bacillota_A 240 0.840849
s__Paraclostridium_sordellii p__Bacillota_A 64 0.817343
Q3: 10 species
GTDB_species phylum no_genomes openness
s__Bordetella_pertussis p__Pseudomonadota 1019 0.881769
s__Neisseria_gonorrhoeae p__Pseudomonadota 954 0.910731
s__Listeria_monocytogenes_B p__Bacillota 1923 0.908038
s__Bacillus_velezensis p__Bacillota 647 0.912425
s__Clostridium_F_botulinum p__Bacillota_A 265 0.896587
s__Gemmiger_qucibialis p__Bacillota_A 88 0.909645
s__Bacteroides_fragilis_A p__Bacteroidota 117 0.895125
s__Prevotella_copri_B p__Bacteroidota 112 0.908153
s__Corynebacterium_diphtheriae p__Actinomycetota 389 0.882616
s__Bifidobacterium_infantis p__Actinomycetota 118 0.902046
Q4_open: 10 species
GTDB_species phylum no_genomes openness
s__Staphylococcus_aureus p__Bacillota 14526 0.985917
s__Streptococcus_pneumoniae p__Bacillota 8434 0.987376
s__Klebsiella_pneumoniae p__Pseudomonadota 14240 0.990524
s__Salmonella_enterica p__Pseudomonadota 11402 0.986339
s__Clostridioides_difficile p__Bacillota_A 2604 0.968889
s__Sarcina_perfringens p__Bacillota_A 403 0.930882
s__Phocaeicola_vulgatus p__Bacteroidota 449 0.964657
s__Bacteroides_uniformis p__Bacteroidota 401 0.968264
s__Mycobacterium_tuberculosis p__Actinomycetota 6903 0.973961
s__Mycobacterium_abscessus p__Actinomycetota 1825 0.948742
Total target species: 40
5. Save Results¶
In [7]:
# Save all species with openness metrics
openness_pdf.to_csv('../data/all_species_openness.csv', index=False)
print(f"Saved {len(openness_pdf)} species to ../data/all_species_openness.csv")
# Save target species for notebook 02
target_df.to_csv('../data/species_openness_quartiles.csv', index=False)
print(f"Saved {len(target_df)} target species to ../data/species_openness_quartiles.csv")
Saved 457 species to ../data/all_species_openness.csv Saved 40 target species to ../data/species_openness_quartiles.csv
Findings¶
Record after running:
- How many species have >= 50 genomes? ___
- What is the openness range? ___
- Are quartiles phylogenetically balanced? ___
- Any concerns about species selection? ___