Co-fitness Predicts Co-inheritance in Bacterial Pangenomes

Paramvir S. Dehal

Research Question

Do genes with correlated fitness profiles (co-fit) tend to co-occur in the same genomes across a species' pangenome? Does functional coupling constrain which genes are gained and lost together?

Overview

This project tests whether lab-measured functional coupling (co-fitness from RB-TnSeq) predicts genome-level co-inheritance (co-occurrence in pangenomes) across 11 bacterial species. It builds on module_conservation (modules are 86% core) and fitness_modules (48 accessory modules exist). Phi coefficients (binary Pearson correlation) measure co-occurrence across genomes, compared against prevalence-matched random pairs. See RESEARCH_PLAN.md for the detailed hypothesis and analysis plan.

Key results: Pairwise co-fitness weakly but consistently predicts co-occurrence (7/9 organisms positive), though the effect size is small. Multi-gene ICA modules show a substantially stronger co-inheritance signal, especially the 48 accessory modules -- functionally coherent gene groups that travel together through the pangenome (73% significant). See REPORT.md for full findings and interpretation.

Key Findings

Pairwise Co-fitness Weakly Predicts Co-occurrence

Across 9 organisms with co-fitness data (2.25M cofit pairs vs 22.5M prevalence-matched random pairs), co-fit gene pairs show a weak but consistent positive co-occurrence signal. The mean delta phi (cofit - random) is +0.011 across organisms, with 7 of 9 organisms showing a positive effect and 8 of 9 significant at p<0.05 (Mann-Whitney two-sided). However, the overall effect size is small (aggregate delta = +0.003, Mann-Whitney p=1.66e-29) and the Wilcoxon signed-rank test across organisms is not significant (W=9, p=0.13), reflecting high inter-organism variance.

Organism	n cofit	Mean phi (cofit)	Mean phi (random)	Delta	p-value (two-sided)
Ddia6719	16,957	0.182	0.089	+0.093	<1e-300
pseudo3_N2E3	31,959	0.458	0.433	+0.026	6.6e-16
Phaeo	14,728	0.231	0.222	+0.009	6.5e-4
SyringaeB728a	129,385	0.049	0.043	+0.006	2.9e-4
Koxy	162,160	0.041	0.038	+0.003	3.3e-2
Smeli	230,516	0.029	0.026	+0.002	<1e-17
Btheta	242,676	0.067	0.067	+0.001	<1e-6
Putida	205,323	0.171	0.171	-0.000	0.41
Korea	7,994	0.447	0.489	-0.042	<1e-6

The signal is strongest in Ddia6719 (delta=+0.093), which despite being near-clonal (ANI 99.47%) has sufficient accessory gene variation to detect co-inheritance. Korea shows a negative delta (-0.042) because 95.2% of its cofit pairs have NaN phi (both genes at 100% prevalence across 72 genomes, producing zero-variance vectors). Only ~8,000 of 166,601 pairs are computable, and the negative delta reflects statistical noise in this small effective sample rather than a biological signal.

Co-fitness vs Co-occurrence

(Notebook: 02_cooccurrence.ipynb, 04_cross_organism.ipynb)

Operons Are Not a Confound

Only 0.7% of cofit pairs are genomically adjacent (within 5 genes). Excluding them does not change the result pattern.

Operon Control

(Notebook: 02_cooccurrence.ipynb)

ICA Modules Show Co-inheritance, Especially Accessory Modules

At the module level, 195 ICA modules across 6 organisms show within-module co-occurrence (mean phi=0.229) significantly exceeding prevalence-matched null expectations (mean phi=0.177, 1000 permutations), with delta=+0.053 and 51/195 modules (26%) significant at p<0.05. After Benjamini-Hochberg FDR correction, 21/195 modules (11%) remain significant at q<0.05.

Accessory modules show the strongest co-inheritance signal:

Module type	n	Mean delta phi	Significant (p<0.05)	Significant (FDR q<0.05)
Accessory (<50% core)	11	+0.108	8/11 (73%)	4/11 (36%)
Core (>90% core)	120	+0.059	29/120 (24%)	13/120 (11%)
Mixed (50-90%)	64	+0.031	14/64 (22%)	4/64 (6%)

The accessory vs core difference trends toward significance (Mann-Whitney p=0.051).

Module Co-inheritance

(Notebook: 03_module_coinheritance.ipynb)

Co-fitness Strength Weakly Anti-correlates with Co-occurrence

Stronger co-fitness scores weakly predict lower phi (Spearman rho=-0.109, p<1e-300 across 1.04M pairs). This likely reflects that the strongest co-fitness pairs are core genes with near-universal prevalence, leaving little variance for co-occurrence detection -- a prevalence ceiling effect.

Co-fitness Strength vs Phi

(Notebook: 02_cooccurrence.ipynb)

Phylogenetic Distance Stratification

Phylogenetic Control

Cofit pair phi is higher among near genomes (mean=0.102) than medium genomes (mean=0.067), as expected from shared ancestry. Most species lack genomes in the "far" stratum (>0.05 branch distance), limiting the ability to fully disentangle functional coupling from phylogenetic signal.

(Notebook: 02_cooccurrence.ipynb)

Results

Data Extraction

11 target species analyzed. Presence matrices extracted via Spark from gene_genecluster_junction joined with gene (two billion-row tables). BROADCAST hints on small filter tables reduced query time to ~210s per organism.

Organism	Genomes	Clusters	Cofit pairs	Phylo tree
Koxy	399	4,942	423,936	Yes
Btheta	287	4,649	328,455	Yes
Smeli	241	6,004	528,699	Yes
RalstoniaUW163	141	4,413	0	Yes
Putida	128	5,409	458,688	Yes
SyringaeB728a	126	4,999	371,004	Yes
Korea	72	4,075	230,724	Yes
RalstoniaGMI1000	70	4,723	0	Yes
Phaeo	43	3,790	192,138	No
Ddia6719	66	4,694	250,488	Yes
pseudo3_N2E3	40	5,513	507,828	No

Ralstonia UW163 and GMI1000 have zero co-fitness data in the Fitness Browser despite being in the organism table, excluding them from the primary analysis.

(Notebook: 01_data_extraction.ipynb)

Pairwise Analysis

For each organism, cofit pairs were mapped to pangenome cluster pairs via fb_pangenome_link.tsv, deduplicated, and phi coefficients computed from binary genome x cluster presence vectors. 10 prevalence-matched random pairs were generated per cofit pair, matching each cluster's prevalence independently (tolerance +/-5%). Adjacency was determined from FB gene coordinates (within 5 loci on the same scaffold).

Aggregate statistics:
- 9 organisms analyzed
- 2,253,491 cofit pairs, 22,534,910 random pairs
- Overall mean phi: cofit=0.092, random=0.089
- Overall delta: +0.003 (Mann-Whitney two-sided p=1.66e-29)
- Organisms with positive delta: 7/9
- Organisms with p<0.05: 8/9
- Wilcoxon signed-rank (delta > 0): W=9, p=0.13

Module Analysis

For 6 organisms with ICA module data (Koxy, Btheta, Putida, Korea, Phaeo, pseudo3_N2E3), mean pairwise phi was computed within each module and compared to 1000 prevalence-matched random gene sets. P-values were corrected for multiple testing using Benjamini-Hochberg FDR.

Per-organism module results:

Organism	Modules	Significant	Mean phi	Null mean
Koxy	44	14 (32%)	0.079	0.040
Btheta	36	22 (61%)	0.192	0.086
Putida	38	6 (16%)	0.266	0.202
Korea	29	0 (0%)	0.357	0.308
Phaeo	37	3 (8%)	0.108	0.095
pseudo3_N2E3	40	4 (10%)	0.444	0.409

Btheta shows the strongest module-level signal (61% significant), while Korea (despite being the best pairwise organism) shows 0% significant modules -- all Korea modules are >90% core with prevalence near 1.0.

Functional Enrichment

Among high-phi AND high-cofit pairs (top quartile of both), the most common functional categories (SEED annotations) are:

Category	Gene count
Metabolism	15,936
Transport	11,940
Regulation	6,339
Motility	2,929
Mobile elements	2,716
DNA metabolism	2,490

Functional Enrichment

(Notebook: 04_cross_organism.ipynb)

Interpretation

Pairwise co-fitness weakly but consistently predicts co-inheritance: 7 of 9 organisms show a positive delta phi, and the aggregate signal is highly significant (p=1.66e-29). However, the effect size is small (delta=+0.003 overall, +0.011 mean across organisms), indicating that lab-measured functional coupling explains only a tiny fraction of co-inheritance patterns.

Multi-gene modules (ICA-derived co-regulated groups) show a substantially stronger co-inheritance signal (delta=+0.053, 26% significant at p<0.05, 11% after FDR correction), particularly accessory modules (73% significant, 36% after FDR). This suggests that coordinated regulation across multiple genes -- not just pairwise functional similarity -- is what most strongly constrains co-inheritance. The 48 accessory modules identified in the module_conservation project represent functionally coherent units that travel together through the pangenome.

The pairwise signal is attenuated by a prevalence ceiling: most FB genes map to core clusters (>95% prevalence), where phi approaches 0 for both cofit and random pairs because there is almost no variance to correlate. Species with more gene content diversity (Ddia6719, pseudo3_N2E3) show the largest positive deltas.

Literature Context

Hall et al. (2021) found that E. coli accessory genes co-occur by function and via mobile genetic elements, consistent with our module-level result showing functional coherence predicts co-inheritance (PMID: 34499026).
Whelan et al. (2020) developed Coinfinder to detect gene associations in pangenomes while controlling for phylogeny. Our analysis extends this by asking whether fitness-measured functional coupling predicts the associations Coinfinder would detect (PMID: 32100706).
Choudhury et al. (2025) showed P. aeruginosa phylogroups have distinct co-occurrence networks, reinforcing that accessory genome structure is lineage-specific (PMID: 40304385).
Price et al. (2018) established the Fitness Browser co-fitness data used here. Co-fitness identifies genes in the same pathway or complex; our results show this lab-measured coupling weakly but significantly predicts co-inheritance in natural populations, though the effect size is small (PMID: 29769716).

Limitations

Prevalence ceiling: Most FB genes map to core clusters (>95% prevalence). At high prevalence, phi approaches 0 for both cofit and random pairs because there is almost no variance to correlate. The analysis is most informative for species with substantial auxiliary gene content.
Ralstonia excluded: The two Ralstonia organisms (most phylogenetically diverse, lowest ANI) have zero co-fitness data in the Fitness Browser, excluding the species that would likely be most informative.
Phylogenetic control limited to two strata: Phylogenetic distance stratification was computed for 7 of 9 organisms, but most species lack genomes in the "far" stratum (>0.05 branch distance). Cofit pair phi is higher among near genomes (mean=0.102) than medium genomes (mean=0.067), as expected from shared ancestry, but the lack of a far stratum limits the ability to fully disentangle functional coupling from phylogenetic signal.
Pairwise vs multi-gene: Co-fitness captures pairwise gene relationships. ICA modules capture multi-gene coordinated regulation, which may better reflect the selective units that constrain co-inheritance.
Near-clonal species behave differently: Ddia6719 (ANI 99.47%) shows the strongest pairwise signal (delta=+0.093), while pseudo3_N2E3 (ANI 99.66%) also shows positive delta (+0.026). Despite low overall genome diversity, these species have sufficient accessory gene variation to detect co-inheritance. However, the high baseline phi in these species means the absolute phi values are less interpretable.

Future Directions

Restrict to auxiliary-only pairs: Re-analyze using only gene pairs where both clusters are auxiliary (<95% prevalence) to maximize presence/absence variance.
Compute co-fitness directly: For Ralstonia and other organisms lacking pre-computed co-fitness, calculate pairwise Pearson correlations from raw genefitness data.
Fix phylogenetic control: Resolve the reference genome mapping to enable phylogenetic distance stratification.
Module co-transfer networks: Build networks of which modules co-occur across species and test whether co-fitness predicts cross-module co-inheritance.
Expand to more diverse species: Target additional organisms with >30% auxiliary genes and existing co-fitness data.

Data

Sources

Dataset	Description	Source
Fitness Browser co-fitness	Top-20 co-fitness partners per gene	`cofit` table via Spark
KBase pangenome	Genome x cluster presence matrices	`gene_genecluster_junction` + `gene` tables via Spark
Phylogenetic distances	Pairwise genome distances	`phylogenetic_tree_distance_pairs` table via Spark
FB-pangenome link table	Gene-to-cluster mapping (177K links)	`conservation_vs_fitness/data/fb_pangenome_link.tsv`
ICA fitness modules	Co-regulated gene modules	`fitness_modules/data/modules/`
Module conservation	Module conservation stats	`module_conservation/data/module_conservation.tsv`
SEED annotations	Functional category assignments	`conservation_vs_fitness/data/seed_annotations.tsv`

Generated Data

File	Description
`data/genome_cluster_matrices/{org}_presence.tsv`	Binary genome x cluster presence matrices (11 species)
`data/cofit/{org}_cofit.tsv`	Co-fitness pairs per organism from Fitness Browser
`data/gene_coords/{org}_coords.tsv`	Gene coordinates for adjacency detection
`data/phylo_distances/{org}_phylo_distances.tsv`	Pairwise phylogenetic distances (9 species)
`data/{org}_phi_results.tsv`	Per-organism phi results for cofit and random pairs
`data/all_phi_results.tsv`	Combined phi results across all organisms
`data/organism_summary.tsv`	Per-organism effect sizes and p-values
`data/module_coinheritance.tsv`	Module-level co-inheritance scores
`data/phylo_stratified.tsv`	Cofit pair phi stratified by phylogenetic distance

References

Price MN et al. (2018). "Mutant phenotypes for thousands of bacterial genes of unknown function." Nature 557:503-509. PMID: 29769716
Parks DH et al. (2022). "GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy." Nucleic Acids Res 50:D199-D207. PMID: 34520557
Hall JPJ et al. (2021). "Gene-gene relationships in an Escherichia coli accessory genome are linked to function and mobility." Microb Genom 7:000650. PMID: 34499026
Whelan FJ et al. (2020). "Coinfinder: detecting significant associations and dissociations in pangenomes." Microb Genom 6:e000338. PMID: 32100706
Choudhury M et al. (2025). "Phylogroup-specific co-occurrence networks in Pseudomonas aeruginosa." PMID: 40304385

Data Collections

🧬

Pangenome Collection

kbase_ke_pangenome

KBase, DOE

📊

Fitness Browser

kescience_fitnessbrowser

Price Lab, LBNL

Derived Data

This project builds on processed data from other projects.

Conservation Vs Fitness

View source project →

Fitness Modules

modules

View source project →

Module Conservation

View source project →

Review

Summary

This is a well-executed and thoroughly documented project that tests whether lab-measured functional coupling (co-fitness from RB-TnSeq) predicts genome-level co-inheritance (co-occurrence in pangenomes) across 11 bacterial species. The analysis covers 2.25M co-fit pairs vs 22.5M prevalence-matched random controls, with a secondary module-level analysis across 6 organisms with ICA data. The main pairwise result is a weak but statistically significant positive signal (delta phi = +0.003, p = 1.66e-29, 7/9 organisms positive), while the module-level analysis reveals a substantially stronger effect (delta = +0.053, 73% of accessory modules significant at p<0.05). The project excels in its three well-designed controls (operon adjacency, prevalence matching, phylogenetic stratification), comprehensive three-file documentation (README, RESEARCH_PLAN, REPORT), clear Spark/local separation with caching for reproducibility, complete notebook outputs, and 8 saved figures covering all 6 planned analyses. The honest reporting of negative and null results (Korea, Putida) alongside positive ones strengthens the credibility of the findings. The main areas for improvement are minor: requirements.txt is missing statsmodels and lacks version pins, a small discrepancy exists in the Putida module significance count between notebook and REPORT, and the phylogenetic control's 2000-pair subsampling is undocumented. Overall, this is among the strongest projects in the observatory for methodological rigor, statistical care, and intellectual honesty.

Methodology

Research question: Clearly stated and testable. The RESEARCH_PLAN.md articulates specific predictions (signal strongest at intermediate prevalences, collapsing at >99%) and pre-registers four verification checks (e.g., "At prevalence >99%, phi should be ~0 for both groups"), which is excellent scientific practice.

Approach: The decision to use continuous prevalence rather than binary core/auxiliary classification (documented in RESEARCH_PLAN "Key Design Decisions") is methodologically sound and maximizes statistical power. The phi coefficient is the correct metric for binary co-occurrence. The prevalence-matched random pair generation independently matches each cluster's prevalence (NB02 generate_random_pairs), avoiding the bias of mean-prevalence matching. The two-tier analysis (pairwise co-fitness primary, ICA modules secondary) provides complementary evidence at different scales.

Controls: Three well-implemented controls:
1. Operon control: Only 0.7% of cofit pairs are adjacent (within 5 genes); excluding them preserves the signal (NB04 Fig 2).
2. Prevalence matching: 10 random pairs per cofit pair with +/-5% tolerance on individual cluster prevalences.
3. Phylogenetic stratification: Correctly handles the GTDB prefix mismatch (documented as a discovered pitfall). Limited by lack of "far" stratum genomes in most species, which is acknowledged.

Data sources: All sources clearly identified in README with a table mapping database, table, and use. Dependencies on three prior projects are explicitly documented in the Dependencies section.

Reproducibility: The README includes a two-step reproduction guide cleanly separating Spark-dependent extraction from local analysis. A standalone src/extract_data.py script duplicates NB01 for batch execution. All notebooks implement caching (skip if output file exists). However, estimated runtimes are not provided (NB01 takes ~210s per organism, ~40 min total based on outputs).

Code Quality

SQL correctness: Spark SQL queries are correct and efficient. BROADCAST hints are properly applied on small filter tables joining against billion-row gene_genecluster_junction and gene tables (NB01, src/extract_data.py). All Fitness Browser numeric columns are properly CAST (CAST(rank AS INT), CAST(cofit AS FLOAT), CAST(begin AS INT)), consistent with docs/pitfalls.md. The Dyella79 locus tag mismatch is handled by filtering it from the link table in all notebooks.

Statistical methods: Appropriate throughout:
- Mann-Whitney U (two-sided) for non-normal phi distributions
- Wilcoxon signed-rank test for cross-organism meta-analysis (correctly reported as non-significant at p=0.13 with N=9)
- Permutation testing (1000 permutations) for module co-inheritance with Benjamini-Hochberg FDR correction via statsmodels.stats.multitest.multipletests
- Spearman correlation for co-fitness strength vs phi

Notebook organization: Clean four-notebook pipeline (extraction → pairwise analysis → module analysis → synthesis/figures). Each notebook has markdown headers, runtime requirements stated upfront, inline outputs preserved, and summary cells at the end. All notebooks produce saved outputs (text, tables, and figures).

Pitfall awareness: Two pitfalls discovered by this project are documented in docs/pitfalls.md: (1) GTDB prefix mismatch in phylogenetic distance table IDs, and (2) BROADCAST hints for billion-row table joins. The project also correctly follows the flat data/ directory pattern and three-file documentation structure per prior pitfall guidance.

Issues identified:

Missing statsmodels in requirements.txt: NB03 imports from statsmodels.stats.multitest import multipletests but the package is absent from requirements.txt. Reproduction would fail at NB03 on a fresh install. No version pins for any package.
Putida module significance discrepancy: NB03 output prints "5 (13%)" significant modules for Putida, but the REPORT table says "6 (16%)". One of these needs correction.
Phylogenetic control subsamples to 2000 pairs: NB04 Fig 3 cell silently subsamples cofit pairs (if len(cofit_unique) > 2000: cofit_unique = cofit_unique.sample(2000, random_state=42)), which is not documented in the REPORT. This affects most organisms and reduces statistical power for the phylogenetic analysis.
compute_phi float comparison: Using vec.std() == 0 for exact float comparison is safe for binary integer vectors but fragile in general. A tolerance-based check (< 1e-10) would be more robust.

Findings Assessment

Conclusions supported by data: Yes, all central claims are well-supported:
- Pairwise delta phi of +0.003 aggregate, 7/9 organisms positive, 8/9 significant. The non-significant Wilcoxon (p=0.13) is correctly disclosed.
- Module-level delta of +0.053 with 51/195 significant (26%) at p<0.05, 21/195 (11%) after FDR. Accessory modules show 73% significance (8/11, 36% after FDR) -- a biologically meaningful contrast.
- The prevalence ceiling interpretation is well-supported by the Ddia6719 case (strongest signal, most accessory variation among near-clonal species).

Negative results handled honestly: Korea's negative delta (-0.042) and Putida's null result (delta=-0.000, p=0.41) are reported without suppression or spin, with plausible explanations provided.

Limitations acknowledged: Five specific limitations documented in the REPORT covering prevalence ceiling, Ralstonia exclusion, limited phylogenetic strata, pairwise vs multi-gene scale, and near-clonal species behavior.

Literature context: Four references with PMIDs cited meaningfully in the REPORT interpretation, connecting to Coinfinder (Whelan et al. 2020), E. coli accessory gene co-occurrence (Hall et al. 2021), P. aeruginosa phylogroup co-occurrence networks (Choudhury et al. 2025), and the Fitness Browser source data (Price et al. 2018). The references.md includes 11 total references with full bibliographic details.

Visualizations: All 6 planned figures from RESEARCH_PLAN are produced and saved in figures/, plus 2 supplementary figures. Figures have proper axis labels, titles, legends, and panel letters. Color choices are reasonable (viridis colormap in Fig 4 is colorblind-accessible).

Counter-intuitive result explained: The weak anti-correlation between cofit strength and phi (Spearman rho=-0.109) is correctly interpreted as a prevalence ceiling effect, though a partial correlation controlling for prevalence would strengthen this interpretation.

Suggestions

Add statsmodels to requirements.txt and pin versions (Critical for reproducibility): NB03 imports multipletests from statsmodels but the package is absent from requirements.txt. Add it and consider pinning at least major versions for reproducibility (e.g., pandas>=2.0, scipy>=1.10, statsmodels>=0.14).
Resolve Putida module significance discrepancy (Minor): NB03 prints "5 (13%)" for Putida but the REPORT table says "6 (16%)". Check the source data and correct whichever is wrong.
Document the 2000-pair subsampling in phylogenetic control (Minor): NB04 silently subsamples cofit pairs to 2000 for the phylogenetic stratification analysis. Add a note in the REPORT's phylogenetic control limitation section (e.g., "phylogenetic stratification was computed on a random subsample of 2000 cofit pairs per organism for computational efficiency").
Add expected runtimes to reproduction instructions (Nice-to-have): The README's Reproduction section would benefit from approximate runtimes: "Step 1: ~45 min on BERDL JupyterHub (Spark); Steps 2-4: ~10 min each locally."
Investigate Korea's random pair distribution (Nice-to-have): Korea's random pair median phi (0.8597) is dramatically higher than its cofit pair median (0.2584), which is unusual. A diagnostic prevalence histogram for Korea would clarify whether the prevalence matching is selecting near-universal clusters and help explain the negative delta.
Partial correlation for cofit strength vs phi (Nice-to-have): Computing a partial Spearman correlation controlling for mean prevalence would directly test whether the anti-correlation (rho=-0.109) is indeed a prevalence ceiling artifact, strengthening the interpretation in the REPORT.