Paramvir S. Dehal ORCID

Research Question

Do genes with correlated fitness profiles (co-fit) tend to co-occur in the same genomes across a species' pangenome? Does functional coupling constrain which genes are gained and lost together?

Overview

This project tests whether lab-measured functional coupling (co-fitness from RB-TnSeq) predicts genome-level co-inheritance (co-occurrence in pangenomes) across 11 bacterial species. It builds on module_conservation (modules are 86% core) and fitness_modules (48 accessory modules exist). Phi coefficients (binary Pearson correlation) measure co-occurrence across genomes, compared against prevalence-matched random pairs. See RESEARCH_PLAN.md for the detailed hypothesis and analysis plan.

Key results: Pairwise co-fitness weakly but consistently predicts co-occurrence (7/9 organisms positive), though the effect size is small. Multi-gene ICA modules show a substantially stronger co-inheritance signal, especially the 48 accessory modules -- functionally coherent gene groups that travel together through the pangenome (73% significant). See REPORT.md for full findings and interpretation.

Key Findings

Pairwise Co-fitness Weakly Predicts Co-occurrence

Across 9 organisms with co-fitness data (2.25M cofit pairs vs 22.5M prevalence-matched random pairs), co-fit gene pairs show a weak but consistent positive co-occurrence signal. The mean delta phi (cofit - random) is +0.011 across organisms, with 7 of 9 organisms showing a positive effect and 8 of 9 significant at p<0.05 (Mann-Whitney two-sided). However, the overall effect size is small (aggregate delta = +0.003, Mann-Whitney p=1.66e-29) and the Wilcoxon signed-rank test across organisms is not significant (W=9, p=0.13), reflecting high inter-organism variance.

Organism n cofit Mean phi (cofit) Mean phi (random) Delta p-value (two-sided)
Ddia6719 16,957 0.182 0.089 +0.093 <1e-300
pseudo3_N2E3 31,959 0.458 0.433 +0.026 6.6e-16
Phaeo 14,728 0.231 0.222 +0.009 6.5e-4
SyringaeB728a 129,385 0.049 0.043 +0.006 2.9e-4
Koxy 162,160 0.041 0.038 +0.003 3.3e-2
Smeli 230,516 0.029 0.026 +0.002 <1e-17
Btheta 242,676 0.067 0.067 +0.001 <1e-6
Putida 205,323 0.171 0.171 -0.000 0.41
Korea 7,994 0.447 0.489 -0.042 <1e-6

The signal is strongest in Ddia6719 (delta=+0.093), which despite being near-clonal (ANI 99.47%) has sufficient accessory gene variation to detect co-inheritance. Korea shows a negative delta (-0.042) because 95.2% of its cofit pairs have NaN phi (both genes at 100% prevalence across 72 genomes, producing zero-variance vectors). Only ~8,000 of 166,601 pairs are computable, and the negative delta reflects statistical noise in this small effective sample rather than a biological signal.

Co-fitness vs Co-occurrence

(Notebook: 02_cooccurrence.ipynb, 04_cross_organism.ipynb)

Operons Are Not a Confound

Only 0.7% of cofit pairs are genomically adjacent (within 5 genes). Excluding them does not change the result pattern.

Operon Control

(Notebook: 02_cooccurrence.ipynb)

ICA Modules Show Co-inheritance, Especially Accessory Modules

At the module level, 195 ICA modules across 6 organisms show within-module co-occurrence (mean phi=0.229) significantly exceeding prevalence-matched null expectations (mean phi=0.177, 1000 permutations), with delta=+0.053 and 51/195 modules (26%) significant at p<0.05. After Benjamini-Hochberg FDR correction, 21/195 modules (11%) remain significant at q<0.05.

Accessory modules show the strongest co-inheritance signal:

Module type n Mean delta phi Significant (p<0.05) Significant (FDR q<0.05)
Accessory (<50% core) 11 +0.108 8/11 (73%) 4/11 (36%)
Core (>90% core) 120 +0.059 29/120 (24%) 13/120 (11%)
Mixed (50-90%) 64 +0.031 14/64 (22%) 4/64 (6%)

The accessory vs core difference trends toward significance (Mann-Whitney p=0.051).

Module Co-inheritance

(Notebook: 03_module_coinheritance.ipynb)

Co-fitness Strength Weakly Anti-correlates with Co-occurrence

Stronger co-fitness scores weakly predict lower phi (Spearman rho=-0.109, p<1e-300 across 1.04M pairs). This likely reflects that the strongest co-fitness pairs are core genes with near-universal prevalence, leaving little variance for co-occurrence detection -- a prevalence ceiling effect.

Co-fitness Strength vs Phi

(Notebook: 02_cooccurrence.ipynb)

Phylogenetic Distance Stratification

Phylogenetic Control

Cofit pair phi is higher among near genomes (mean=0.102) than medium genomes (mean=0.067), as expected from shared ancestry. Most species lack genomes in the "far" stratum (>0.05 branch distance), limiting the ability to fully disentangle functional coupling from phylogenetic signal.

(Notebook: 02_cooccurrence.ipynb)

Results

Data Extraction

11 target species analyzed. Presence matrices extracted via Spark from gene_genecluster_junction joined with gene (two billion-row tables). BROADCAST hints on small filter tables reduced query time to ~210s per organism.

Organism Genomes Clusters Cofit pairs Phylo tree
Koxy 399 4,942 423,936 Yes
Btheta 287 4,649 328,455 Yes
Smeli 241 6,004 528,699 Yes
RalstoniaUW163 141 4,413 0 Yes
Putida 128 5,409 458,688 Yes
SyringaeB728a 126 4,999 371,004 Yes
Korea 72 4,075 230,724 Yes
RalstoniaGMI1000 70 4,723 0 Yes
Phaeo 43 3,790 192,138 No
Ddia6719 66 4,694 250,488 Yes
pseudo3_N2E3 40 5,513 507,828 No

Ralstonia UW163 and GMI1000 have zero co-fitness data in the Fitness Browser despite being in the organism table, excluding them from the primary analysis.

(Notebook: 01_data_extraction.ipynb)

Pairwise Analysis

For each organism, cofit pairs were mapped to pangenome cluster pairs via fb_pangenome_link.tsv, deduplicated, and phi coefficients computed from binary genome x cluster presence vectors. 10 prevalence-matched random pairs were generated per cofit pair, matching each cluster's prevalence independently (tolerance +/-5%). Adjacency was determined from FB gene coordinates (within 5 loci on the same scaffold).

Aggregate statistics:
- 9 organisms analyzed
- 2,253,491 cofit pairs, 22,534,910 random pairs
- Overall mean phi: cofit=0.092, random=0.089
- Overall delta: +0.003 (Mann-Whitney two-sided p=1.66e-29)
- Organisms with positive delta: 7/9
- Organisms with p<0.05: 8/9
- Wilcoxon signed-rank (delta > 0): W=9, p=0.13

Module Analysis

For 6 organisms with ICA module data (Koxy, Btheta, Putida, Korea, Phaeo, pseudo3_N2E3), mean pairwise phi was computed within each module and compared to 1000 prevalence-matched random gene sets. P-values were corrected for multiple testing using Benjamini-Hochberg FDR.

Per-organism module results:

Organism Modules Significant Mean phi Null mean
Koxy 44 14 (32%) 0.079 0.040
Btheta 36 22 (61%) 0.192 0.086
Putida 38 6 (16%) 0.266 0.202
Korea 29 0 (0%) 0.357 0.308
Phaeo 37 3 (8%) 0.108 0.095
pseudo3_N2E3 40 4 (10%) 0.444 0.409

Btheta shows the strongest module-level signal (61% significant), while Korea (despite being the best pairwise organism) shows 0% significant modules -- all Korea modules are >90% core with prevalence near 1.0.

Functional Enrichment

Among high-phi AND high-cofit pairs (top quartile of both), the most common functional categories (SEED annotations) are:

Category Gene count
Metabolism 15,936
Transport 11,940
Regulation 6,339
Motility 2,929
Mobile elements 2,716
DNA metabolism 2,490

Functional Enrichment

(Notebook: 04_cross_organism.ipynb)

Interpretation

Pairwise co-fitness weakly but consistently predicts co-inheritance: 7 of 9 organisms show a positive delta phi, and the aggregate signal is highly significant (p=1.66e-29). However, the effect size is small (delta=+0.003 overall, +0.011 mean across organisms), indicating that lab-measured functional coupling explains only a tiny fraction of co-inheritance patterns.

Multi-gene modules (ICA-derived co-regulated groups) show a substantially stronger co-inheritance signal (delta=+0.053, 26% significant at p<0.05, 11% after FDR correction), particularly accessory modules (73% significant, 36% after FDR). This suggests that coordinated regulation across multiple genes -- not just pairwise functional similarity -- is what most strongly constrains co-inheritance. The 48 accessory modules identified in the module_conservation project represent functionally coherent units that travel together through the pangenome.

The pairwise signal is attenuated by a prevalence ceiling: most FB genes map to core clusters (>95% prevalence), where phi approaches 0 for both cofit and random pairs because there is almost no variance to correlate. Species with more gene content diversity (Ddia6719, pseudo3_N2E3) show the largest positive deltas.

Literature Context

  • Hall et al. (2021) found that E. coli accessory genes co-occur by function and via mobile genetic elements, consistent with our module-level result showing functional coherence predicts co-inheritance (PMID: 34499026).
  • Whelan et al. (2020) developed Coinfinder to detect gene associations in pangenomes while controlling for phylogeny. Our analysis extends this by asking whether fitness-measured functional coupling predicts the associations Coinfinder would detect (PMID: 32100706).
  • Choudhury et al. (2025) showed P. aeruginosa phylogroups have distinct co-occurrence networks, reinforcing that accessory genome structure is lineage-specific (PMID: 40304385).
  • Price et al. (2018) established the Fitness Browser co-fitness data used here. Co-fitness identifies genes in the same pathway or complex; our results show this lab-measured coupling weakly but significantly predicts co-inheritance in natural populations, though the effect size is small (PMID: 29769716).

Limitations

  • Prevalence ceiling: Most FB genes map to core clusters (>95% prevalence). At high prevalence, phi approaches 0 for both cofit and random pairs because there is almost no variance to correlate. The analysis is most informative for species with substantial auxiliary gene content.
  • Ralstonia excluded: The two Ralstonia organisms (most phylogenetically diverse, lowest ANI) have zero co-fitness data in the Fitness Browser, excluding the species that would likely be most informative.
  • Phylogenetic control limited to two strata: Phylogenetic distance stratification was computed for 7 of 9 organisms, but most species lack genomes in the "far" stratum (>0.05 branch distance). Cofit pair phi is higher among near genomes (mean=0.102) than medium genomes (mean=0.067), as expected from shared ancestry, but the lack of a far stratum limits the ability to fully disentangle functional coupling from phylogenetic signal.
  • Pairwise vs multi-gene: Co-fitness captures pairwise gene relationships. ICA modules capture multi-gene coordinated regulation, which may better reflect the selective units that constrain co-inheritance.
  • Near-clonal species behave differently: Ddia6719 (ANI 99.47%) shows the strongest pairwise signal (delta=+0.093), while pseudo3_N2E3 (ANI 99.66%) also shows positive delta (+0.026). Despite low overall genome diversity, these species have sufficient accessory gene variation to detect co-inheritance. However, the high baseline phi in these species means the absolute phi values are less interpretable.

Future Directions

  1. Restrict to auxiliary-only pairs: Re-analyze using only gene pairs where both clusters are auxiliary (<95% prevalence) to maximize presence/absence variance.
  2. Compute co-fitness directly: For Ralstonia and other organisms lacking pre-computed co-fitness, calculate pairwise Pearson correlations from raw genefitness data.
  3. Fix phylogenetic control: Resolve the reference genome mapping to enable phylogenetic distance stratification.
  4. Module co-transfer networks: Build networks of which modules co-occur across species and test whether co-fitness predicts cross-module co-inheritance.
  5. Expand to more diverse species: Target additional organisms with >30% auxiliary genes and existing co-fitness data.

Data

Sources

Dataset Description Source
Fitness Browser co-fitness Top-20 co-fitness partners per gene cofit table via Spark
KBase pangenome Genome x cluster presence matrices gene_genecluster_junction + gene tables via Spark
Phylogenetic distances Pairwise genome distances phylogenetic_tree_distance_pairs table via Spark
FB-pangenome link table Gene-to-cluster mapping (177K links) conservation_vs_fitness/data/fb_pangenome_link.tsv
ICA fitness modules Co-regulated gene modules fitness_modules/data/modules/
Module conservation Module conservation stats module_conservation/data/module_conservation.tsv
SEED annotations Functional category assignments conservation_vs_fitness/data/seed_annotations.tsv

Generated Data

File Description
data/genome_cluster_matrices/{org}_presence.tsv Binary genome x cluster presence matrices (11 species)
data/cofit/{org}_cofit.tsv Co-fitness pairs per organism from Fitness Browser
data/gene_coords/{org}_coords.tsv Gene coordinates for adjacency detection
data/phylo_distances/{org}_phylo_distances.tsv Pairwise phylogenetic distances (9 species)
data/{org}_phi_results.tsv Per-organism phi results for cofit and random pairs
data/all_phi_results.tsv Combined phi results across all organisms
data/organism_summary.tsv Per-organism effect sizes and p-values
data/module_coinheritance.tsv Module-level co-inheritance scores
data/phylo_stratified.tsv Cofit pair phi stratified by phylogenetic distance

References

  • Price MN et al. (2018). "Mutant phenotypes for thousands of bacterial genes of unknown function." Nature 557:503-509. PMID: 29769716
  • Parks DH et al. (2022). "GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy." Nucleic Acids Res 50:D199-D207. PMID: 34520557
  • Hall JPJ et al. (2021). "Gene-gene relationships in an Escherichia coli accessory genome are linked to function and mobility." Microb Genom 7:000650. PMID: 34499026
  • Whelan FJ et al. (2020). "Coinfinder: detecting significant associations and dissociations in pangenomes." Microb Genom 6:e000338. PMID: 32100706
  • Choudhury M et al. (2025). "Phylogroup-specific co-occurrence networks in Pseudomonas aeruginosa." PMID: 40304385

Data Collections

Derived Data

This project builds on processed data from other projects.

Review

AI Review BERIL Automated Review 2026-02-15 Needs Re-review

Summary

This is a well-executed and thoroughly documented project that tests whether lab-measured functional coupling (co-fitness from RB-TnSeq) predicts genome-level co-inheritance (co-occurrence in pangenomes) across 11 bacterial species. The analysis covers 2.25M co-fit pairs vs 22.5M prevalence-matched random controls, with a secondary module-level analysis across 6 organisms with ICA data. The main pairwise result is a weak but statistically significant positive signal (delta phi = +0.003, p = 1.66e-29, 7/9 organisms positive), while the module-level analysis reveals a substantially stronger effect (delta = +0.053, 73% of accessory modules significant at p<0.05). The project excels in its three well-designed controls (operon adjacency, prevalence matching, phylogenetic stratification), comprehensive three-file documentation (README, RESEARCH_PLAN, REPORT), clear Spark/local separation with caching for reproducibility, complete notebook outputs, and 8 saved figures covering all 6 planned analyses. The honest reporting of negative and null results (Korea, Putida) alongside positive ones strengthens the credibility of the findings. The main areas for improvement are minor: requirements.txt is missing statsmodels and lacks version pins, a small discrepancy exists in the Putida module significance count between notebook and REPORT, and the phylogenetic control's 2000-pair subsampling is undocumented. Overall, this is among the strongest projects in the observatory for methodological rigor, statistical care, and intellectual honesty.

Methodology

Research question: Clearly stated and testable. The RESEARCH_PLAN.md articulates specific predictions (signal strongest at intermediate prevalences, collapsing at >99%) and pre-registers four verification checks (e.g., "At prevalence >99%, phi should be ~0 for both groups"), which is excellent scientific practice.

Approach: The decision to use continuous prevalence rather than binary core/auxiliary classification (documented in RESEARCH_PLAN "Key Design Decisions") is methodologically sound and maximizes statistical power. The phi coefficient is the correct metric for binary co-occurrence. The prevalence-matched random pair generation independently matches each cluster's prevalence (NB02 generate_random_pairs), avoiding the bias of mean-prevalence matching. The two-tier analysis (pairwise co-fitness primary, ICA modules secondary) provides complementary evidence at different scales.

Controls: Three well-implemented controls:
1. Operon control: Only 0.7% of cofit pairs are adjacent (within 5 genes); excluding them preserves the signal (NB04 Fig 2).
2. Prevalence matching: 10 random pairs per cofit pair with +/-5% tolerance on individual cluster prevalences.
3. Phylogenetic stratification: Correctly handles the GTDB prefix mismatch (documented as a discovered pitfall). Limited by lack of "far" stratum genomes in most species, which is acknowledged.

Data sources: All sources clearly identified in README with a table mapping database, table, and use. Dependencies on three prior projects are explicitly documented in the Dependencies section.

Reproducibility: The README includes a two-step reproduction guide cleanly separating Spark-dependent extraction from local analysis. A standalone src/extract_data.py script duplicates NB01 for batch execution. All notebooks implement caching (skip if output file exists). However, estimated runtimes are not provided (NB01 takes ~210s per organism, ~40 min total based on outputs).

Code Quality

SQL correctness: Spark SQL queries are correct and efficient. BROADCAST hints are properly applied on small filter tables joining against billion-row gene_genecluster_junction and gene tables (NB01, src/extract_data.py). All Fitness Browser numeric columns are properly CAST (CAST(rank AS INT), CAST(cofit AS FLOAT), CAST(begin AS INT)), consistent with docs/pitfalls.md. The Dyella79 locus tag mismatch is handled by filtering it from the link table in all notebooks.

Statistical methods: Appropriate throughout:
- Mann-Whitney U (two-sided) for non-normal phi distributions
- Wilcoxon signed-rank test for cross-organism meta-analysis (correctly reported as non-significant at p=0.13 with N=9)
- Permutation testing (1000 permutations) for module co-inheritance with Benjamini-Hochberg FDR correction via statsmodels.stats.multitest.multipletests
- Spearman correlation for co-fitness strength vs phi

Notebook organization: Clean four-notebook pipeline (extraction → pairwise analysis → module analysis → synthesis/figures). Each notebook has markdown headers, runtime requirements stated upfront, inline outputs preserved, and summary cells at the end. All notebooks produce saved outputs (text, tables, and figures).

Pitfall awareness: Two pitfalls discovered by this project are documented in docs/pitfalls.md: (1) GTDB prefix mismatch in phylogenetic distance table IDs, and (2) BROADCAST hints for billion-row table joins. The project also correctly follows the flat data/ directory pattern and three-file documentation structure per prior pitfall guidance.

Issues identified:

  1. Missing statsmodels in requirements.txt: NB03 imports from statsmodels.stats.multitest import multipletests but the package is absent from requirements.txt. Reproduction would fail at NB03 on a fresh install. No version pins for any package.

  2. Putida module significance discrepancy: NB03 output prints "5 (13%)" significant modules for Putida, but the REPORT table says "6 (16%)". One of these needs correction.

  3. Phylogenetic control subsamples to 2000 pairs: NB04 Fig 3 cell silently subsamples cofit pairs (if len(cofit_unique) > 2000: cofit_unique = cofit_unique.sample(2000, random_state=42)), which is not documented in the REPORT. This affects most organisms and reduces statistical power for the phylogenetic analysis.

  4. compute_phi float comparison: Using vec.std() == 0 for exact float comparison is safe for binary integer vectors but fragile in general. A tolerance-based check (< 1e-10) would be more robust.

Findings Assessment

Conclusions supported by data: Yes, all central claims are well-supported:
- Pairwise delta phi of +0.003 aggregate, 7/9 organisms positive, 8/9 significant. The non-significant Wilcoxon (p=0.13) is correctly disclosed.
- Module-level delta of +0.053 with 51/195 significant (26%) at p<0.05, 21/195 (11%) after FDR. Accessory modules show 73% significance (8/11, 36% after FDR) -- a biologically meaningful contrast.
- The prevalence ceiling interpretation is well-supported by the Ddia6719 case (strongest signal, most accessory variation among near-clonal species).

Negative results handled honestly: Korea's negative delta (-0.042) and Putida's null result (delta=-0.000, p=0.41) are reported without suppression or spin, with plausible explanations provided.

Limitations acknowledged: Five specific limitations documented in the REPORT covering prevalence ceiling, Ralstonia exclusion, limited phylogenetic strata, pairwise vs multi-gene scale, and near-clonal species behavior.

Literature context: Four references with PMIDs cited meaningfully in the REPORT interpretation, connecting to Coinfinder (Whelan et al. 2020), E. coli accessory gene co-occurrence (Hall et al. 2021), P. aeruginosa phylogroup co-occurrence networks (Choudhury et al. 2025), and the Fitness Browser source data (Price et al. 2018). The references.md includes 11 total references with full bibliographic details.

Visualizations: All 6 planned figures from RESEARCH_PLAN are produced and saved in figures/, plus 2 supplementary figures. Figures have proper axis labels, titles, legends, and panel letters. Color choices are reasonable (viridis colormap in Fig 4 is colorblind-accessible).

Counter-intuitive result explained: The weak anti-correlation between cofit strength and phi (Spearman rho=-0.109) is correctly interpreted as a prevalence ceiling effect, though a partial correlation controlling for prevalence would strengthen this interpretation.

Suggestions

  1. Add statsmodels to requirements.txt and pin versions (Critical for reproducibility): NB03 imports multipletests from statsmodels but the package is absent from requirements.txt. Add it and consider pinning at least major versions for reproducibility (e.g., pandas>=2.0, scipy>=1.10, statsmodels>=0.14).

  2. Resolve Putida module significance discrepancy (Minor): NB03 prints "5 (13%)" for Putida but the REPORT table says "6 (16%)". Check the source data and correct whichever is wrong.

  3. Document the 2000-pair subsampling in phylogenetic control (Minor): NB04 silently subsamples cofit pairs to 2000 for the phylogenetic stratification analysis. Add a note in the REPORT's phylogenetic control limitation section (e.g., "phylogenetic stratification was computed on a random subsample of 2000 cofit pairs per organism for computational efficiency").

  4. Add expected runtimes to reproduction instructions (Nice-to-have): The README's Reproduction section would benefit from approximate runtimes: "Step 1: ~45 min on BERDL JupyterHub (Spark); Steps 2-4: ~10 min each locally."

  5. Investigate Korea's random pair distribution (Nice-to-have): Korea's random pair median phi (0.8597) is dramatically higher than its cofit pair median (0.2584), which is unusual. A diagnostic prevalence histogram for Korea would clarify whether the prevalence matching is selecting near-universal clusters and help explain the negative delta.

  6. Partial correlation for cofit strength vs phi (Nice-to-have): Computing a partial Spearman correlation controlling for mean prevalence would directly test whether the anti-correlation (rho=-0.109) is indeed a prevalence ceiling artifact, strengthening the interpretation in the REPORT.

This review was generated by an AI system. It should be treated as advisory input, not a definitive assessment.

Visualizations

Fig1 Cofit Cooccurrence

Fig1 Cofit Cooccurrence

Fig2 Operon Control

Fig2 Operon Control

Fig3 Phylo Control

Fig3 Phylo Control

Fig4 Cofit Strength

Fig4 Cofit Strength

Fig5 Module Coinheritance

Fig5 Module Coinheritance

Fig6 Functional

Fig6 Functional

Module Coinheritance

Module Coinheritance

Quick Cooccurrence

Quick Cooccurrence

Notebooks