Co-fitness Predicts Co-inheritance in Bacterial Pangenomes
CompletedResearch Question
Do genes with correlated fitness profiles (co-fit) tend to co-occur in the same genomes across a species' pangenome? Does functional coupling constrain which genes are gained and lost together?
Overview
This project tests whether lab-measured functional coupling (co-fitness from RB-TnSeq) predicts genome-level co-inheritance (co-occurrence in pangenomes) across 11 bacterial species. It builds on module_conservation (modules are 86% core) and fitness_modules (48 accessory modules exist). Phi coefficients (binary Pearson correlation) measure co-occurrence across genomes, compared against prevalence-matched random pairs. See RESEARCH_PLAN.md for the detailed hypothesis and analysis plan.
Key results: Pairwise co-fitness weakly but consistently predicts co-occurrence (7/9 organisms positive), though the effect size is small. Multi-gene ICA modules show a substantially stronger co-inheritance signal, especially the 48 accessory modules -- functionally coherent gene groups that travel together through the pangenome (73% significant). See REPORT.md for full findings and interpretation.
Key Findings
Pairwise Co-fitness Weakly Predicts Co-occurrence
Across 9 organisms with co-fitness data (2.25M cofit pairs vs 22.5M prevalence-matched random pairs), co-fit gene pairs show a weak but consistent positive co-occurrence signal. The mean delta phi (cofit - random) is +0.011 across organisms, with 7 of 9 organisms showing a positive effect and 8 of 9 significant at p<0.05 (Mann-Whitney two-sided). However, the overall effect size is small (aggregate delta = +0.003, Mann-Whitney p=1.66e-29) and the Wilcoxon signed-rank test across organisms is not significant (W=9, p=0.13), reflecting high inter-organism variance.
| Organism | n cofit | Mean phi (cofit) | Mean phi (random) | Delta | p-value (two-sided) |
|---|---|---|---|---|---|
| Ddia6719 | 16,957 | 0.182 | 0.089 | +0.093 | <1e-300 |
| pseudo3_N2E3 | 31,959 | 0.458 | 0.433 | +0.026 | 6.6e-16 |
| Phaeo | 14,728 | 0.231 | 0.222 | +0.009 | 6.5e-4 |
| SyringaeB728a | 129,385 | 0.049 | 0.043 | +0.006 | 2.9e-4 |
| Koxy | 162,160 | 0.041 | 0.038 | +0.003 | 3.3e-2 |
| Smeli | 230,516 | 0.029 | 0.026 | +0.002 | <1e-17 |
| Btheta | 242,676 | 0.067 | 0.067 | +0.001 | <1e-6 |
| Putida | 205,323 | 0.171 | 0.171 | -0.000 | 0.41 |
| Korea | 7,994 | 0.447 | 0.489 | -0.042 | <1e-6 |
The signal is strongest in Ddia6719 (delta=+0.093), which despite being near-clonal (ANI 99.47%) has sufficient accessory gene variation to detect co-inheritance. Korea shows a negative delta (-0.042) because 95.2% of its cofit pairs have NaN phi (both genes at 100% prevalence across 72 genomes, producing zero-variance vectors). Only ~8,000 of 166,601 pairs are computable, and the negative delta reflects statistical noise in this small effective sample rather than a biological signal.

(Notebook: 02_cooccurrence.ipynb, 04_cross_organism.ipynb)
Operons Are Not a Confound
Only 0.7% of cofit pairs are genomically adjacent (within 5 genes). Excluding them does not change the result pattern.

(Notebook: 02_cooccurrence.ipynb)
ICA Modules Show Co-inheritance, Especially Accessory Modules
At the module level, 195 ICA modules across 6 organisms show within-module co-occurrence (mean phi=0.229) significantly exceeding prevalence-matched null expectations (mean phi=0.177, 1000 permutations), with delta=+0.053 and 51/195 modules (26%) significant at p<0.05. After Benjamini-Hochberg FDR correction, 21/195 modules (11%) remain significant at q<0.05.
Accessory modules show the strongest co-inheritance signal:
| Module type | n | Mean delta phi | Significant (p<0.05) | Significant (FDR q<0.05) |
|---|---|---|---|---|
| Accessory (<50% core) | 11 | +0.108 | 8/11 (73%) | 4/11 (36%) |
| Core (>90% core) | 120 | +0.059 | 29/120 (24%) | 13/120 (11%) |
| Mixed (50-90%) | 64 | +0.031 | 14/64 (22%) | 4/64 (6%) |
The accessory vs core difference trends toward significance (Mann-Whitney p=0.051).

(Notebook: 03_module_coinheritance.ipynb)
Co-fitness Strength Weakly Anti-correlates with Co-occurrence
Stronger co-fitness scores weakly predict lower phi (Spearman rho=-0.109, p<1e-300 across 1.04M pairs). This likely reflects that the strongest co-fitness pairs are core genes with near-universal prevalence, leaving little variance for co-occurrence detection -- a prevalence ceiling effect.

(Notebook: 02_cooccurrence.ipynb)
Phylogenetic Distance Stratification

Cofit pair phi is higher among near genomes (mean=0.102) than medium genomes (mean=0.067), as expected from shared ancestry. Most species lack genomes in the "far" stratum (>0.05 branch distance), limiting the ability to fully disentangle functional coupling from phylogenetic signal.
(Notebook: 02_cooccurrence.ipynb)
Results
Data Extraction
11 target species analyzed. Presence matrices extracted via Spark from gene_genecluster_junction joined with gene (two billion-row tables). BROADCAST hints on small filter tables reduced query time to ~210s per organism.
| Organism | Genomes | Clusters | Cofit pairs | Phylo tree |
|---|---|---|---|---|
| Koxy | 399 | 4,942 | 423,936 | Yes |
| Btheta | 287 | 4,649 | 328,455 | Yes |
| Smeli | 241 | 6,004 | 528,699 | Yes |
| RalstoniaUW163 | 141 | 4,413 | 0 | Yes |
| Putida | 128 | 5,409 | 458,688 | Yes |
| SyringaeB728a | 126 | 4,999 | 371,004 | Yes |
| Korea | 72 | 4,075 | 230,724 | Yes |
| RalstoniaGMI1000 | 70 | 4,723 | 0 | Yes |
| Phaeo | 43 | 3,790 | 192,138 | No |
| Ddia6719 | 66 | 4,694 | 250,488 | Yes |
| pseudo3_N2E3 | 40 | 5,513 | 507,828 | No |
Ralstonia UW163 and GMI1000 have zero co-fitness data in the Fitness Browser despite being in the organism table, excluding them from the primary analysis.
(Notebook: 01_data_extraction.ipynb)
Pairwise Analysis
For each organism, cofit pairs were mapped to pangenome cluster pairs via fb_pangenome_link.tsv, deduplicated, and phi coefficients computed from binary genome x cluster presence vectors. 10 prevalence-matched random pairs were generated per cofit pair, matching each cluster's prevalence independently (tolerance +/-5%). Adjacency was determined from FB gene coordinates (within 5 loci on the same scaffold).
Aggregate statistics:
- 9 organisms analyzed
- 2,253,491 cofit pairs, 22,534,910 random pairs
- Overall mean phi: cofit=0.092, random=0.089
- Overall delta: +0.003 (Mann-Whitney two-sided p=1.66e-29)
- Organisms with positive delta: 7/9
- Organisms with p<0.05: 8/9
- Wilcoxon signed-rank (delta > 0): W=9, p=0.13
Module Analysis
For 6 organisms with ICA module data (Koxy, Btheta, Putida, Korea, Phaeo, pseudo3_N2E3), mean pairwise phi was computed within each module and compared to 1000 prevalence-matched random gene sets. P-values were corrected for multiple testing using Benjamini-Hochberg FDR.
Per-organism module results:
| Organism | Modules | Significant | Mean phi | Null mean |
|---|---|---|---|---|
| Koxy | 44 | 14 (32%) | 0.079 | 0.040 |
| Btheta | 36 | 22 (61%) | 0.192 | 0.086 |
| Putida | 38 | 6 (16%) | 0.266 | 0.202 |
| Korea | 29 | 0 (0%) | 0.357 | 0.308 |
| Phaeo | 37 | 3 (8%) | 0.108 | 0.095 |
| pseudo3_N2E3 | 40 | 4 (10%) | 0.444 | 0.409 |
Btheta shows the strongest module-level signal (61% significant), while Korea (despite being the best pairwise organism) shows 0% significant modules -- all Korea modules are >90% core with prevalence near 1.0.
Functional Enrichment
Among high-phi AND high-cofit pairs (top quartile of both), the most common functional categories (SEED annotations) are:
| Category | Gene count |
|---|---|
| Metabolism | 15,936 |
| Transport | 11,940 |
| Regulation | 6,339 |
| Motility | 2,929 |
| Mobile elements | 2,716 |
| DNA metabolism | 2,490 |

(Notebook: 04_cross_organism.ipynb)
Interpretation
Pairwise co-fitness weakly but consistently predicts co-inheritance: 7 of 9 organisms show a positive delta phi, and the aggregate signal is highly significant (p=1.66e-29). However, the effect size is small (delta=+0.003 overall, +0.011 mean across organisms), indicating that lab-measured functional coupling explains only a tiny fraction of co-inheritance patterns.
Multi-gene modules (ICA-derived co-regulated groups) show a substantially stronger co-inheritance signal (delta=+0.053, 26% significant at p<0.05, 11% after FDR correction), particularly accessory modules (73% significant, 36% after FDR). This suggests that coordinated regulation across multiple genes -- not just pairwise functional similarity -- is what most strongly constrains co-inheritance. The 48 accessory modules identified in the module_conservation project represent functionally coherent units that travel together through the pangenome.
The pairwise signal is attenuated by a prevalence ceiling: most FB genes map to core clusters (>95% prevalence), where phi approaches 0 for both cofit and random pairs because there is almost no variance to correlate. Species with more gene content diversity (Ddia6719, pseudo3_N2E3) show the largest positive deltas.
Literature Context
- Hall et al. (2021) found that E. coli accessory genes co-occur by function and via mobile genetic elements, consistent with our module-level result showing functional coherence predicts co-inheritance (PMID: 34499026).
- Whelan et al. (2020) developed Coinfinder to detect gene associations in pangenomes while controlling for phylogeny. Our analysis extends this by asking whether fitness-measured functional coupling predicts the associations Coinfinder would detect (PMID: 32100706).
- Choudhury et al. (2025) showed P. aeruginosa phylogroups have distinct co-occurrence networks, reinforcing that accessory genome structure is lineage-specific (PMID: 40304385).
- Price et al. (2018) established the Fitness Browser co-fitness data used here. Co-fitness identifies genes in the same pathway or complex; our results show this lab-measured coupling weakly but significantly predicts co-inheritance in natural populations, though the effect size is small (PMID: 29769716).
Limitations
- Prevalence ceiling: Most FB genes map to core clusters (>95% prevalence). At high prevalence, phi approaches 0 for both cofit and random pairs because there is almost no variance to correlate. The analysis is most informative for species with substantial auxiliary gene content.
- Ralstonia excluded: The two Ralstonia organisms (most phylogenetically diverse, lowest ANI) have zero co-fitness data in the Fitness Browser, excluding the species that would likely be most informative.
- Phylogenetic control limited to two strata: Phylogenetic distance stratification was computed for 7 of 9 organisms, but most species lack genomes in the "far" stratum (>0.05 branch distance). Cofit pair phi is higher among near genomes (mean=0.102) than medium genomes (mean=0.067), as expected from shared ancestry, but the lack of a far stratum limits the ability to fully disentangle functional coupling from phylogenetic signal.
- Pairwise vs multi-gene: Co-fitness captures pairwise gene relationships. ICA modules capture multi-gene coordinated regulation, which may better reflect the selective units that constrain co-inheritance.
- Near-clonal species behave differently: Ddia6719 (ANI 99.47%) shows the strongest pairwise signal (delta=+0.093), while pseudo3_N2E3 (ANI 99.66%) also shows positive delta (+0.026). Despite low overall genome diversity, these species have sufficient accessory gene variation to detect co-inheritance. However, the high baseline phi in these species means the absolute phi values are less interpretable.
Future Directions
- Restrict to auxiliary-only pairs: Re-analyze using only gene pairs where both clusters are auxiliary (<95% prevalence) to maximize presence/absence variance.
- Compute co-fitness directly: For Ralstonia and other organisms lacking pre-computed co-fitness, calculate pairwise Pearson correlations from raw
genefitnessdata. - Fix phylogenetic control: Resolve the reference genome mapping to enable phylogenetic distance stratification.
- Module co-transfer networks: Build networks of which modules co-occur across species and test whether co-fitness predicts cross-module co-inheritance.
- Expand to more diverse species: Target additional organisms with >30% auxiliary genes and existing co-fitness data.
Data
Sources
| Dataset | Description | Source |
|---|---|---|
| Fitness Browser co-fitness | Top-20 co-fitness partners per gene | cofit table via Spark |
| KBase pangenome | Genome x cluster presence matrices | gene_genecluster_junction + gene tables via Spark |
| Phylogenetic distances | Pairwise genome distances | phylogenetic_tree_distance_pairs table via Spark |
| FB-pangenome link table | Gene-to-cluster mapping (177K links) | conservation_vs_fitness/data/fb_pangenome_link.tsv |
| ICA fitness modules | Co-regulated gene modules | fitness_modules/data/modules/ |
| Module conservation | Module conservation stats | module_conservation/data/module_conservation.tsv |
| SEED annotations | Functional category assignments | conservation_vs_fitness/data/seed_annotations.tsv |
Generated Data
| File | Description |
|---|---|
data/genome_cluster_matrices/{org}_presence.tsv |
Binary genome x cluster presence matrices (11 species) |
data/cofit/{org}_cofit.tsv |
Co-fitness pairs per organism from Fitness Browser |
data/gene_coords/{org}_coords.tsv |
Gene coordinates for adjacency detection |
data/phylo_distances/{org}_phylo_distances.tsv |
Pairwise phylogenetic distances (9 species) |
data/{org}_phi_results.tsv |
Per-organism phi results for cofit and random pairs |
data/all_phi_results.tsv |
Combined phi results across all organisms |
data/organism_summary.tsv |
Per-organism effect sizes and p-values |
data/module_coinheritance.tsv |
Module-level co-inheritance scores |
data/phylo_stratified.tsv |
Cofit pair phi stratified by phylogenetic distance |
References
- Price MN et al. (2018). "Mutant phenotypes for thousands of bacterial genes of unknown function." Nature 557:503-509. PMID: 29769716
- Parks DH et al. (2022). "GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy." Nucleic Acids Res 50:D199-D207. PMID: 34520557
- Hall JPJ et al. (2021). "Gene-gene relationships in an Escherichia coli accessory genome are linked to function and mobility." Microb Genom 7:000650. PMID: 34499026
- Whelan FJ et al. (2020). "Coinfinder: detecting significant associations and dissociations in pangenomes." Microb Genom 6:e000338. PMID: 32100706
- Choudhury M et al. (2025). "Phylogroup-specific co-occurrence networks in Pseudomonas aeruginosa." PMID: 40304385
Data Collections
Derived Data
This project builds on processed data from other projects.
Review
Summary
This is a well-executed and thoroughly documented project that tests whether lab-measured functional coupling (co-fitness from RB-TnSeq) predicts genome-level co-inheritance (co-occurrence in pangenomes) across 11 bacterial species. The analysis covers 2.25M co-fit pairs vs 22.5M prevalence-matched random controls, with a secondary module-level analysis across 6 organisms with ICA data. The main pairwise result is a weak but statistically significant positive signal (delta phi = +0.003, p = 1.66e-29, 7/9 organisms positive), while the module-level analysis reveals a substantially stronger effect (delta = +0.053, 73% of accessory modules significant at p<0.05). The project excels in its three well-designed controls (operon adjacency, prevalence matching, phylogenetic stratification), comprehensive three-file documentation (README, RESEARCH_PLAN, REPORT), clear Spark/local separation with caching for reproducibility, complete notebook outputs, and 8 saved figures covering all 6 planned analyses. The honest reporting of negative and null results (Korea, Putida) alongside positive ones strengthens the credibility of the findings. The main areas for improvement are minor: requirements.txt is missing statsmodels and lacks version pins, a small discrepancy exists in the Putida module significance count between notebook and REPORT, and the phylogenetic control's 2000-pair subsampling is undocumented. Overall, this is among the strongest projects in the observatory for methodological rigor, statistical care, and intellectual honesty.
Methodology
Research question: Clearly stated and testable. The RESEARCH_PLAN.md articulates specific predictions (signal strongest at intermediate prevalences, collapsing at >99%) and pre-registers four verification checks (e.g., "At prevalence >99%, phi should be ~0 for both groups"), which is excellent scientific practice.
Approach: The decision to use continuous prevalence rather than binary core/auxiliary classification (documented in RESEARCH_PLAN "Key Design Decisions") is methodologically sound and maximizes statistical power. The phi coefficient is the correct metric for binary co-occurrence. The prevalence-matched random pair generation independently matches each cluster's prevalence (NB02 generate_random_pairs), avoiding the bias of mean-prevalence matching. The two-tier analysis (pairwise co-fitness primary, ICA modules secondary) provides complementary evidence at different scales.
Controls: Three well-implemented controls:
1. Operon control: Only 0.7% of cofit pairs are adjacent (within 5 genes); excluding them preserves the signal (NB04 Fig 2).
2. Prevalence matching: 10 random pairs per cofit pair with +/-5% tolerance on individual cluster prevalences.
3. Phylogenetic stratification: Correctly handles the GTDB prefix mismatch (documented as a discovered pitfall). Limited by lack of "far" stratum genomes in most species, which is acknowledged.
Data sources: All sources clearly identified in README with a table mapping database, table, and use. Dependencies on three prior projects are explicitly documented in the Dependencies section.
Reproducibility: The README includes a two-step reproduction guide cleanly separating Spark-dependent extraction from local analysis. A standalone src/extract_data.py script duplicates NB01 for batch execution. All notebooks implement caching (skip if output file exists). However, estimated runtimes are not provided (NB01 takes ~210s per organism, ~40 min total based on outputs).
Code Quality
SQL correctness: Spark SQL queries are correct and efficient. BROADCAST hints are properly applied on small filter tables joining against billion-row gene_genecluster_junction and gene tables (NB01, src/extract_data.py). All Fitness Browser numeric columns are properly CAST (CAST(rank AS INT), CAST(cofit AS FLOAT), CAST(begin AS INT)), consistent with docs/pitfalls.md. The Dyella79 locus tag mismatch is handled by filtering it from the link table in all notebooks.
Statistical methods: Appropriate throughout:
- Mann-Whitney U (two-sided) for non-normal phi distributions
- Wilcoxon signed-rank test for cross-organism meta-analysis (correctly reported as non-significant at p=0.13 with N=9)
- Permutation testing (1000 permutations) for module co-inheritance with Benjamini-Hochberg FDR correction via statsmodels.stats.multitest.multipletests
- Spearman correlation for co-fitness strength vs phi
Notebook organization: Clean four-notebook pipeline (extraction → pairwise analysis → module analysis → synthesis/figures). Each notebook has markdown headers, runtime requirements stated upfront, inline outputs preserved, and summary cells at the end. All notebooks produce saved outputs (text, tables, and figures).
Pitfall awareness: Two pitfalls discovered by this project are documented in docs/pitfalls.md: (1) GTDB prefix mismatch in phylogenetic distance table IDs, and (2) BROADCAST hints for billion-row table joins. The project also correctly follows the flat data/ directory pattern and three-file documentation structure per prior pitfall guidance.
Issues identified:
-
Missing
statsmodelsinrequirements.txt: NB03 importsfrom statsmodels.stats.multitest import multipletestsbut the package is absent fromrequirements.txt. Reproduction would fail at NB03 on a fresh install. No version pins for any package. -
Putida module significance discrepancy: NB03 output prints "5 (13%)" significant modules for Putida, but the REPORT table says "6 (16%)". One of these needs correction.
-
Phylogenetic control subsamples to 2000 pairs: NB04 Fig 3 cell silently subsamples cofit pairs (
if len(cofit_unique) > 2000: cofit_unique = cofit_unique.sample(2000, random_state=42)), which is not documented in the REPORT. This affects most organisms and reduces statistical power for the phylogenetic analysis. -
compute_phifloat comparison: Usingvec.std() == 0for exact float comparison is safe for binary integer vectors but fragile in general. A tolerance-based check (< 1e-10) would be more robust.
Findings Assessment
Conclusions supported by data: Yes, all central claims are well-supported:
- Pairwise delta phi of +0.003 aggregate, 7/9 organisms positive, 8/9 significant. The non-significant Wilcoxon (p=0.13) is correctly disclosed.
- Module-level delta of +0.053 with 51/195 significant (26%) at p<0.05, 21/195 (11%) after FDR. Accessory modules show 73% significance (8/11, 36% after FDR) -- a biologically meaningful contrast.
- The prevalence ceiling interpretation is well-supported by the Ddia6719 case (strongest signal, most accessory variation among near-clonal species).
Negative results handled honestly: Korea's negative delta (-0.042) and Putida's null result (delta=-0.000, p=0.41) are reported without suppression or spin, with plausible explanations provided.
Limitations acknowledged: Five specific limitations documented in the REPORT covering prevalence ceiling, Ralstonia exclusion, limited phylogenetic strata, pairwise vs multi-gene scale, and near-clonal species behavior.
Literature context: Four references with PMIDs cited meaningfully in the REPORT interpretation, connecting to Coinfinder (Whelan et al. 2020), E. coli accessory gene co-occurrence (Hall et al. 2021), P. aeruginosa phylogroup co-occurrence networks (Choudhury et al. 2025), and the Fitness Browser source data (Price et al. 2018). The references.md includes 11 total references with full bibliographic details.
Visualizations: All 6 planned figures from RESEARCH_PLAN are produced and saved in figures/, plus 2 supplementary figures. Figures have proper axis labels, titles, legends, and panel letters. Color choices are reasonable (viridis colormap in Fig 4 is colorblind-accessible).
Counter-intuitive result explained: The weak anti-correlation between cofit strength and phi (Spearman rho=-0.109) is correctly interpreted as a prevalence ceiling effect, though a partial correlation controlling for prevalence would strengthen this interpretation.
Suggestions
-
Add
statsmodelstorequirements.txtand pin versions (Critical for reproducibility): NB03 importsmultipletestsfromstatsmodelsbut the package is absent fromrequirements.txt. Add it and consider pinning at least major versions for reproducibility (e.g.,pandas>=2.0,scipy>=1.10,statsmodels>=0.14). -
Resolve Putida module significance discrepancy (Minor): NB03 prints "5 (13%)" for Putida but the REPORT table says "6 (16%)". Check the source data and correct whichever is wrong.
-
Document the 2000-pair subsampling in phylogenetic control (Minor): NB04 silently subsamples cofit pairs to 2000 for the phylogenetic stratification analysis. Add a note in the REPORT's phylogenetic control limitation section (e.g., "phylogenetic stratification was computed on a random subsample of 2000 cofit pairs per organism for computational efficiency").
-
Add expected runtimes to reproduction instructions (Nice-to-have): The README's Reproduction section would benefit from approximate runtimes: "Step 1: ~45 min on BERDL JupyterHub (Spark); Steps 2-4: ~10 min each locally."
-
Investigate Korea's random pair distribution (Nice-to-have): Korea's random pair median phi (0.8597) is dramatically higher than its cofit pair median (0.2584), which is unusual. A diagnostic prevalence histogram for Korea would clarify whether the prevalence matching is selecting near-universal clusters and help explain the negative delta.
-
Partial correlation for cofit strength vs phi (Nice-to-have): Computing a partial Spearman correlation controlling for mean prevalence would directly test whether the anti-correlation (rho=-0.109) is indeed a prevalence ceiling artifact, strengthening the interpretation in the REPORT.
This review was generated by an AI system. It should be treated as advisory input, not a definitive assessment.
Visualizations
Fig1 Cofit Cooccurrence
Fig2 Operon Control
Fig3 Phylo Control
Fig4 Cofit Strength
Fig5 Module Coinheritance
Fig6 Functional
Module Coinheritance
Quick Cooccurrence