Pangenome Openness, Metabolic Pathways, and Phylogenetic Distances

William J. Riehl

Research Question

How do pangenome characteristics (open vs. closed) correlate with metabolic pathway completeness, phylogenetic distances, and species ecology?

Research Plan

Hypothesis

Open pangenomes (high auxiliary/singleton gene ratios) may indicate:
Generalist species with broader niche adaptation
Greater horizontal gene transfer within ecological niches
Increased metabolic pathway diversity/flexibility
Closed pangenomes (low auxiliary/singleton ratios) may indicate:
Specialist species with narrow niche adaptation
Reduced gene acquisition rates
Core metabolic pathway conservation

Revision History

v1 (2026-02): Migrated from README.md

Overview

This project investigates whether pangenome openness correlates with GapMind metabolic pathway completeness and phylogenetic/structural distances from AlphaEarth embeddings. It tests whether open pangenomes indicate generalist species with broader niche adaptation and greater metabolic pathway diversity, while closed pangenomes indicate specialists with conserved core pathways. The analysis controls for phylogenetic signal using independent contrasts or PGLS.

Data Collections

🧬

Pangenome Collection

kbase_ke_pangenome

KBase, DOE

Review

Summary

This project presents a well-designed, ambitious research framework investigating the relationship between pangenome characteristics (open vs. closed), metabolic pathway completeness, and phylogenetic/structural distances across 27,690 microbial species. The project documentation is exceptionally thorough, with clear research questions, testable hypotheses, and a well-structured five-phase analysis plan. However, the project is currently in early stages with only Phase 1 notebook implemented. While the methodology is sound and the code quality in existing notebooks is good, the analysis is incomplete—no data files or figures have been generated, and critical analytical phases (Phases 3-5) remain unimplemented. The project shows strong potential but requires substantial work to deliver on its ambitious scope.

Methodology

Strengths:
- Clear research question: The hypothesis linking pangenome openness to metabolic pathway diversity and ecological specialization is well-articulated and biologically meaningful
- Appropriate data sources: Uses relevant BERDL tables (pangenome, gapmind_pathways, alphaearth_embeddings_all_years, phylogenetic_tree_distance_pairs)
- Sound openness metric: The formula (auxiliary_genes + singleton_genes) / total_genes is a standard and appropriate measure of pangenome openness
- Reproducible approach: Notebooks contain detailed SQL queries and clear processing steps that could be reproduced
- Awareness of data limitations: Documentation explicitly acknowledges AlphaEarth's ~28% coverage and data sparsity issues

Areas for improvement:
1. Missing execution: Neither notebook has been executed—no outputs exist in data/ or figures/ directories. This makes it impossible to verify that the queries work correctly or that the analysis produces valid results
2. Sampling strategy unclear: With 27,690 species and 293M genomes, the project lacks clear guidance on how to handle computational constraints. Will all species be analyzed? Should there be a representative subset?
3. Statistical power not addressed: No discussion of minimum sample sizes needed for correlations or how to handle species with few genomes
4. Phylogenetic correction method: Phase 4 mentions PGLS/PIC but doesn't specify which implementation or how phylogenetic distances will be converted to a tree structure

Code Quality

Strengths:
- Proper Spark usage: Correctly uses spark.sql() with the understanding that get_spark_session() is pre-initialized (correctly documented in comments)
- Appropriate SQL patterns: Uses exact equality (WHERE id = 'value') rather than LIKE patterns for performance, as recommended in docs/pitfalls.md
- Good aggregation logic: Notebook 02 correctly aggregates pathway scores using appropriate grouping and summary statistics
- Pitfall awareness: Comments reference docs/pitfalls.md and avoid common mistakes like string-typed numeric comparisons

Issues identified:
1. Untested code: Since notebooks haven't been executed, there may be hidden bugs. For example:
- Cell 5 in notebook 01 creates columns with division (/ p.no_gene_clusters) but doesn't handle potential division by zero
- Cell 11 checks AlphaEarth embeddings but the comment mentions "embedding vectors (A00, A01, A02, ...)" without verifying these column names actually exist

Incomplete data validation: Notebook 01 checks for pathway and embedding availability by species (cells 13-14) but doesn't validate that the has_pathways and has_embeddings flags are accurate before downstream use
Hardcoded path assumptions: Cell 15 (notebook 01) saves to '../data/' which assumes the notebook is run from the notebooks/ subdirectory. This should be validated or use absolute paths
Missing error handling: No try/except blocks around Spark queries that might timeout or fail on large tables
Categorical encoding issue: In notebook 01, cell 5, the SQL query computes openness metrics but doesn't filter for valid species (e.g., WHERE p.no_gene_clusters > 0)

Findings Assessment

Current state: No findings can be assessed because the analysis has not been executed. The data/ and figures/ directories are empty, and no output cells exist in the notebooks.

Expected findings (based on methodology):
- The analysis plan appropriately sets up four testable hypotheses (H1-H4) with clear predictions
- The correlation analysis in notebook 02 (cells 11-13) would produce meaningful results if executed
- Visualization code appears appropriate for the data types (scatter plots with trend lines, histograms, box plots)

Concerns:
1. Analysis incomplete: The README states Phase 1 is "✓ COMPLETE" but no outputs exist to confirm this
2. Phases 3-5 missing: Three of five planned analysis phases have no corresponding notebooks
3. No preliminary results: Cannot assess whether the hypotheses are supported or refuted
4. Incomplete validation: No discussion of what results would constitute success vs. failure for each hypothesis

Suggestions

Critical Issues (Must Address)

Execute Phase 1 notebook: Run 01_data_exploration.ipynb on BERDL JupyterHub to generate baseline data files and verify queries work correctly. This will reveal any bugs and produce the foundation for subsequent phases.
Verify SQL query correctness: Before scaling up, test queries on a small species subset. Specifically:
Add WHERE p.no_gene_clusters > 0 to prevent division by zero in openness calculations
Validate that AlphaEarth embedding column names match expectations
Test that join keys are correct (especially for the EAV-format ncbi_env table in Phase 5)
Address missing implementation: Create notebooks for Phases 3, 4, and 5, or remove these phases from the project scope if they're not feasible within your timeline. Currently, 60% of the planned work is missing.
Handle data sparsity explicitly: AlphaEarth coverage is only 28%. The analysis should:
Filter to species with sufficient embedding coverage before Phase 3
Report how many species are excluded due to missing embeddings
Consider whether 28% coverage introduces bias (e.g., clinical isolates likely underrepresented)

Important Improvements

Add statistical rigor: Include multiple testing correction (Bonferroni or FDR) when testing multiple correlations in Phase 2. The current code calculates p-values but doesn't adjust for multiple comparisons.
Specify computational approach: For large-scale analysis:
Document whether to use REST API vs. direct Spark (likely need Spark for billion-row tables)
Add query timeout handling for large species
Consider caching intermediate results to avoid re-querying on notebook restarts
Validate pathway aggregation logic: Cell 7 in notebook 02 calculates pathway diversity as unique_categories / total_pathways. Verify this is the intended metric—typically diversity would use Shannon entropy or Simpson's index.
Document expected runtime: Add estimates for how long each notebook should take to run, especially for queries on large tables like gapmind_pathways (305M rows).

Nice-to-Have Enhancements

Add unit tests for calculations: Create a small test dataset with known openness scores to verify the calculation logic produces expected results.
Include negative controls: Test whether random permutations of openness scores still show correlations (they shouldn't), to validate that observed patterns are real.
Link to related projects: The documentation mentions related projects (pangenome_openness/, ecotype_analysis/, cog_analysis/) but doesn't explain how this work differs or builds upon them. Clarify the unique contribution of this analysis.
Add project status tracking: Include a changelog or status log showing when each phase was completed and what outputs were generated.

This review was generated by an AI system. It should be treated as advisory input, not a definitive assessment.

Notebooks

📄

01_data_exploration.ipynb

01 Data Exploration

View notebook →

📄

02_pathway_analysis.ipynb

02 Pathway Analysis

View notebook →