01 Data Exploration
Jupyter notebook from the Pangenome Openness, Metabolic Pathways, and Phylogenetic Distances project.
Phase 1: Data Exploration & Characterization
This notebook explores the available data for analyzing relationships between:
- Pangenome openness/closedness
- Metabolic pathway completeness (GapMind)
- Phylogenetic/structural distances (AlphaEarth, phylogenetic tree)
In [ ]:
# Import libraries
# Note: Spark session is pre-initialized in JupyterHub kernel
# Do NOT use: from get_spark_session import get_spark_session
# See docs/pitfalls.md for details on JupyterHub environment
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
print("Libraries imported. Spark session is pre-initialized in the JupyterHub kernel.")
1. Pangenome Statistics Overview
In [ ]:
# Get basic statistics on pangenome table
pangenome_stats = spark.sql("""
SELECT
COUNT(*) as total_species,
AVG(no_genomes) as avg_genomes_per_species,
MIN(no_genomes) as min_genomes,
MAX(no_genomes) as max_genomes,
AVG(no_core) as avg_core_genes,
AVG(no_aux_genome) as avg_aux_genes,
AVG(no_singleton_gene_clusters) as avg_singletons,
AVG(no_gene_clusters) as avg_total_clusters
FROM kbase_ke_pangenome.pangenome
""").toPandas()
print("Pangenome Statistics:")
print(pangenome_stats.to_string())
2. Calculate Pangenome Openness Index
In [ ]:
# Calculate openness metric for all species
pangenome_openness = spark.sql("""
SELECT
p.gtdb_species_clade_id,
s.GTDB_species,
p.no_genomes,
p.no_core,
p.no_aux_genome,
p.no_singleton_gene_clusters,
p.no_gene_clusters,
(p.no_aux_genome + p.no_singleton_gene_clusters) / p.no_gene_clusters AS openness_score,
1.0 - ((p.no_aux_genome + p.no_singleton_gene_clusters) / p.no_gene_clusters) AS closedness_score,
p.no_core / p.no_gene_clusters AS core_fraction
FROM kbase_ke_pangenome.pangenome p
JOIN kbase_ke_pangenome.gtdb_species_clade s
ON p.gtdb_species_clade_id = s.gtdb_species_clade_id
ORDER BY openness_score DESC
""").toPandas()
print(f"Calculated openness scores for {len(pangenome_openness)} species")
print("\nOpenness Score Summary:")
print(pangenome_openness[['openness_score', 'closedness_score', 'core_fraction']].describe())
print("\nTop 10 Most Open Pangenomes:")
print(pangenome_openness[['GTDB_species', 'no_genomes', 'openness_score', 'core_fraction']].head(10).to_string())
print("\nTop 10 Most Closed Pangenomes:")
print(pangenome_openness[['GTDB_species', 'no_genomes', 'openness_score', 'core_fraction']].tail(10).to_string())
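As a quick sanity check on the formula in the query above, the openness and closedness scores can be reproduced on toy cluster counts (illustrative numbers, not values from the table; this assumes `no_gene_clusters` decomposes into core + accessory + singleton clusters):

```python
def openness(no_core, no_aux, no_singletons):
    """Openness = (accessory + singleton clusters) / total clusters."""
    total = no_core + no_aux + no_singletons
    return (no_aux + no_singletons) / total

# Toy species: 2000 core, 1500 accessory, 500 singleton clusters
score = openness(2000, 1500, 500)
print(round(score, 3))        # 0.5  (openness)
print(round(1.0 - score, 3))  # 0.5  (closedness)
```

A score near 1 means most clusters are accessory or singleton (open pangenome); near 0 means the genome content is dominated by core genes (closed pangenome).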
3. GapMind Pathways Coverage
In [ ]:
# Check GapMind pathways availability and structure
pathways_sample = spark.sql("""
SELECT
COUNT(*) as total_records,
COUNT(DISTINCT genome_id) as genomes_with_pathways,
COUNT(DISTINCT pathway) as unique_pathways
FROM kbase_ke_pangenome.gapmind_pathways
""").toPandas()
print("GapMind Pathways Overview:")
print(pathways_sample.to_string())
# Sample of pathways
sample_pathways = spark.sql("""
SELECT DISTINCT pathway
FROM kbase_ke_pangenome.gapmind_pathways
LIMIT 20
""").toPandas()
print("\nSample Pathways:")
for p in sample_pathways['pathway'].values:
print(f" - {p}")
4. Pathway Score Distribution
In [ ]:
# Pathway score distribution
score_dist = spark.sql("""
SELECT
score_simplified,
COUNT(*) as count,
ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) as percent
FROM kbase_ke_pangenome.gapmind_pathways
GROUP BY score_simplified
ORDER BY score_simplified
""").toPandas()
print("\nPathway Score Distribution:")
print(score_dist.to_string())
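Downstream, these categories can be collapsed into a per-genome pathway completeness fraction. A minimal pandas sketch on made-up rows (the genome IDs, pathway names, and category labels here are placeholders, not necessarily the table's actual values):

```python
import pandas as pd

# Hypothetical gapmind_pathways-style rows (all values are placeholders)
df = pd.DataFrame({
    "genome_id":        ["g1", "g1", "g1", "g2", "g2"],
    "pathway":          ["his", "trp", "lys", "his", "trp"],
    "score_simplified": ["high", "high", "low", "high", "low"],
})

# Fraction of pathways scored 'high' per genome
completeness = (
    df.assign(is_high=df["score_simplified"].eq("high"))
      .groupby("genome_id")["is_high"]
      .mean()
)
print(completeness)
```

The same aggregation could be done in Spark SQL with `AVG(CASE WHEN ...)`; pandas is used here only because the toy data is tiny.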
5. AlphaEarth Embeddings Availability
In [ ]:
# Check AlphaEarth embeddings coverage
# Note: The table contains embedding vectors (A00, A01, A02, ...) not separate year columns
embeddings_coverage = spark.sql("""
SELECT
COUNT(DISTINCT genome_id) as genomes_with_embeddings
FROM kbase_ke_pangenome.alphaearth_embeddings_all_years
""").toPandas()
total_genomes = spark.sql("""
SELECT COUNT(*) as total_genomes FROM kbase_ke_pangenome.genome
""").toPandas()['total_genomes'].values[0]
coverage_pct = 100.0 * embeddings_coverage['genomes_with_embeddings'].values[0] / total_genomes
print("AlphaEarth Embeddings Coverage:")
print(f"  Total genomes: {total_genomes:,}")
print(f"  Genomes with embeddings: {embeddings_coverage['genomes_with_embeddings'].values[0]:,}")
print(f"  Coverage: {coverage_pct:.1f}%")
print("\nNote: Table contains AlphaEarth embedding vector components (A00, A01, A02, ...)")
print("      representing structural similarity features across all years combined.")
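Once the embedding components (A00, A01, ...) are pulled into a matrix, pairwise structural distances reduce to standard vector math. A small NumPy sketch with random vectors standing in for real embeddings (4 genomes, 8 components; the real table has more of both):

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))  # 4 genomes x 8 embedding components

# Pairwise Euclidean distance matrix via broadcasting
diff = emb[:, None, :] - emb[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))
print(dist.shape)  # (4, 4)
```

The resulting matrix is symmetric with a zero diagonal, and can be compared directly against phylogenetic distance matrices in later phases.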
6. Select Target Species for Analysis
In [ ]:
# Select species for detailed analysis
# Criteria: sufficient genomes for meaningful analysis
# Note: We'll filter by data availability in downstream analysis
target_species = spark.sql("""
SELECT
p.gtdb_species_clade_id,
s.GTDB_species,
p.no_genomes,
(p.no_aux_genome + p.no_singleton_gene_clusters) / p.no_gene_clusters AS openness_score
FROM kbase_ke_pangenome.pangenome p
JOIN kbase_ke_pangenome.gtdb_species_clade s
ON p.gtdb_species_clade_id = s.gtdb_species_clade_id
WHERE p.no_genomes >= 5 -- At least 5 genomes for meaningful analysis
ORDER BY openness_score DESC
""").toPandas()
# Now separately check pathway and embedding coverage
genomes_with_pathways = spark.sql("""
SELECT DISTINCT g.gtdb_species_clade_id
FROM kbase_ke_pangenome.genome g
WHERE g.genome_id IN (SELECT DISTINCT genome_id FROM kbase_ke_pangenome.gapmind_pathways)
""").toPandas()
genomes_with_embeddings = spark.sql("""
SELECT DISTINCT g.gtdb_species_clade_id
FROM kbase_ke_pangenome.genome g
WHERE g.genome_id IN (SELECT DISTINCT genome_id FROM kbase_ke_pangenome.alphaearth_embeddings_all_years)
""").toPandas()
# Mark which species have data
pathway_species = set(genomes_with_pathways['gtdb_species_clade_id'])
embedding_species = set(genomes_with_embeddings['gtdb_species_clade_id'])
target_species['has_pathways'] = target_species['gtdb_species_clade_id'].isin(pathway_species)
target_species['has_embeddings'] = target_species['gtdb_species_clade_id'].isin(embedding_species)
print(f"\nSelected {len(target_species)} target species for analysis")
print(f"Species with pathway data: {target_species['has_pathways'].sum()}")
print(f"Species with embedding data: {target_species['has_embeddings'].sum()}")
print("\nTarget Species Summary:")
print(target_species[['GTDB_species', 'no_genomes', 'openness_score', 'has_pathways', 'has_embeddings']].head(20).to_string())
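Species suitable for the full three-way comparison are those where both flags are true. A pandas sketch on a toy frame mirroring the `target_species` columns (species names are placeholders):

```python
import pandas as pd

toy = pd.DataFrame({
    "GTDB_species":   ["s1", "s2", "s3"],
    "has_pathways":   [True, True, False],
    "has_embeddings": [True, False, True],
})

# Keep only species with both pathway and embedding coverage
both = toy[toy["has_pathways"] & toy["has_embeddings"]]
print(len(both))  # 1
```

The same boolean filter applied to `target_species` yields the candidate set for the correlation analyses in later notebooks.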
7. Export Initial Data Summaries
In [ ]:
# Save pangenome openness scores for all species
pangenome_openness.to_csv(
'../data/species_pangenome_openness.csv',
index=False
)
print("Saved: species_pangenome_openness.csv")
# Save target species list
target_species.to_csv(
'../data/target_species_for_analysis.csv',
index=False
)
print("Saved: target_species_for_analysis.csv")
8. Visualize Pangenome Openness Distribution
In [ ]:
# Create visualizations
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Histogram of openness scores
axes[0, 0].hist(pangenome_openness['openness_score'], bins=50, edgecolor='black', alpha=0.7)
axes[0, 0].set_xlabel('Pangenome Openness Score')
axes[0, 0].set_ylabel('Number of Species')
axes[0, 0].set_title('Distribution of Pangenome Openness Across Species')
axes[0, 0].grid(alpha=0.3)
# Scatter: Genomes vs Openness
axes[0, 1].scatter(pangenome_openness['no_genomes'], pangenome_openness['openness_score'], alpha=0.5)
axes[0, 1].set_xlabel('Number of Genomes')
axes[0, 1].set_ylabel('Openness Score')
axes[0, 1].set_title('Pangenome Openness vs. Genome Count')
axes[0, 1].set_xscale('log')
axes[0, 1].grid(alpha=0.3)
# Scatter: Core Fraction vs Openness
axes[1, 0].scatter(pangenome_openness['core_fraction'], pangenome_openness['openness_score'], alpha=0.5)
axes[1, 0].set_xlabel('Core Gene Fraction')
axes[1, 0].set_ylabel('Openness Score')
axes[1, 0].set_title('Core Gene Content vs. Openness')
axes[1, 0].grid(alpha=0.3)
# Box plot by openness quartile
pangenome_openness['openness_quartile'] = pd.qcut(pangenome_openness['openness_score'], q=4, labels=['Q1_Closed', 'Q2', 'Q3', 'Q4_Open'])
quartile_data = [pangenome_openness[pangenome_openness['openness_quartile'] == q]['no_genomes'] for q in ['Q1_Closed', 'Q2', 'Q3', 'Q4_Open']]
axes[1, 1].boxplot(quartile_data, labels=['Q1_Closed', 'Q2', 'Q3', 'Q4_Open'])
axes[1, 1].set_ylabel('Number of Genomes')
axes[1, 1].set_title('Genome Count by Openness Quartile')
axes[1, 1].set_yscale('log')
axes[1, 1].grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.savefig('../figures/01_pangenome_openness_overview.png', dpi=300, bbox_inches='tight')
print("Saved: 01_pangenome_openness_overview.png")
plt.show()
Summary of Data Exploration
Key findings from this exploration:
- Pangenome Openness: Varies widely across species (0-1 range)
- GapMind Pathways: Available for most genomes with score categories
- AlphaEarth Embeddings: ~28% coverage of all genomes (as expected)
- Target Species: Selected species with good coverage across all data types
Next steps: proceed to Phase 2 (02_pangenome_openness.ipynb) for detailed correlation analysis.