Field vs Lab Gene Importance in *Desulfovibrio vulgaris* Hildenborough

Paramvir S. Dehal

Research Question

Which genes matter for survival under environmentally-realistic conditions but appear dispensable in the lab, and vice versa? Do field-relevant fitness effects predict pangenome conservation better than lab-only effects?

Overview

DvH has 757 RB-TnSeq experiments spanning environmentally-relevant stresses (uranium, mercury, nitrate, sulfate reduction) and standard lab conditions (antibiotics, rich media). This project classifies these experiments by ecological relevance and tests whether genes important under field-relevant conditions show different pangenome conservation patterns than genes important under lab-only conditions. It builds on conservation_vs_fitness (FB-pangenome link, essential genes) and fitness_modules (ICA modules, condition activity scores).

Key results: field-stress and field-core genes are significantly enriched in the core genome (OR=1.58 and 1.46, FDR q<0.03), but fitness magnitude matters more than condition type for overall prediction, and fitness itself is a weak predictor (10-fold CV-AUC 0.52-0.55 for field/lab fitness vs 0.65 with gene length). Genes important under antibiotic and heavy-metal conditions trend below baseline (73% and 71% core vs 76%), though not significantly. Module-level analysis identifies 21 "ecological" modules (0.98 core fraction) and 9 "lab" modules (0.52 core fraction). The ENIGMA CORAL database contains no DvH data. See REPORT.md for full findings and interpretation.

Key Findings

ENIGMA CORAL Contains No DvH Fitness Data (NB01)

The ENIGMA CORAL database (47 tables, enigma_coral on BERDL) was surveyed for complementary data. Key finding: DvH is completely absent from the database. The single TnSeq library is for FW300-N2E2 (Pseudomonas), and the DubSeq libraries cover E. coli, P. putida, and B. thetaiotaomicron. All 6,705 genomes and 15,015 genes come from environmental isolates from Oak Ridge, not DvH. The database does contain 4,346 field samples with geochemistry data (uranium, metals) across 596 Oak Ridge locations, and 213,044 ASVs for community composition -- potentially useful for future analyses, but not for gene-level fitness analysis.

(Notebook: 01_enigma_discovery.ipynb)

Condition Classification (NB02)

757 DvH experiments classified into 6 categories:

Category	Experiments	%	Rationale
Lab-nutrient	237	31.3%	Amino acids, organic acids, carbon sources
Field-core	204	26.9%	DvH primary metabolism (sulfate, lactate, formate, pyruvate, H2)
Lab-other	140	18.5%	Lab reagents (DMSO, PEG, osmotic stress, iron chelators)
Field-stress	78	10.3%	FRC contaminants (uranium, mercury, nitrate, nitrite, oxygen, NO)
Heavy-metals	55	7.3%	Metal stress (cobalt, nickel, zinc, copper, manganese, selenium)
Lab-antibiotic	43	5.7%	Antibiotics and respiratory inhibitors

Broad split: 337 field (44.5%) vs 420 lab (55.5%).
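The broad split follows directly from the six category counts. A minimal sketch of the roll-up, assuming the category-to-broad mapping described in this report (heavy-metals counted as field); the dictionary names are illustrative and not taken from NB02:

```python
# Experiment counts per category, copied from the table above.
category_counts = {
    "Lab-nutrient": 237,
    "Field-core": 204,
    "Lab-other": 140,
    "Field-stress": 78,
    "Heavy-metals": 55,
    "Lab-antibiotic": 43,
}

# Assumed mapping to the broad split; heavy-metals is treated as field.
broad_category = {
    "Lab-nutrient": "lab",
    "Field-core": "field",
    "Lab-other": "lab",
    "Field-stress": "field",
    "Heavy-metals": "field",
    "Lab-antibiotic": "lab",
}

total = sum(category_counts.values())
broad_counts = {"field": 0, "lab": 0}
for category, n in category_counts.items():
    broad_counts[broad_category[category]] += n

for name, n in broad_counts.items():
    print(f"{name}: {n} ({100 * n / total:.1f}%)")
```

Moving heavy-metals to the lab side would shift 55 experiments, so the broad split is sensitive to that one assignment.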

Heatmap of gene importance (fitness < -2) by condition class and conservation status

(Notebook: 02_condition_classification.ipynb)

Genes Important for Field Conditions Are Significantly More Conserved (NB03)

Of 2,725 non-essential genes with both fitness data and pangenome links, 76.3% are core overall. An additional 678 essential genes (80.1% core) are excluded because they lack fitness data (no transposon mutants recovered). Genes important (fitness < -2) under different condition classes show varying conservation:

Condition class	Important genes	Core %	OR vs baseline	FDR q
Field-stress	298	83.6%	1.58	0.026
Field-core	376	82.4%	1.46	0.026
Lab-other	292	81.5%	1.37	0.073
Lab-nutrient	452	81.4%	1.36	0.037
Lab-antibiotic	109	73.4%	0.86	0.49
Heavy-metals	198	71.2%	0.77	0.14
Essential (no fitness)	678	80.1%	--	--
All genes (baseline)	2,725	76.3%	--	--

Field-stress (q=0.026), field-core (q=0.026), and lab-nutrient (q=0.037) genes are significantly enriched in the core genome after BH-FDR correction. Heavy-metals and lab-antibiotic genes trend below baseline but are not significant.
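The reported odds ratios can be reproduced to within rounding by dividing each class's odds of being core by the baseline odds (76.3% core). This reading is inferred from the table values, not from the notebook code; a class-vs-rest contingency table would give slightly different numbers. A minimal sketch:

```python
def odds(p):
    """Convert a proportion into odds."""
    return p / (1.0 - p)

BASELINE_CORE = 0.763  # core fraction of all 2,725 genes, from the table

# (core fraction, reported OR) per condition class, copied from the table.
classes = {
    "Field-stress": (0.836, 1.58),
    "Field-core": (0.824, 1.46),
    "Lab-other": (0.815, 1.37),
    "Lab-nutrient": (0.814, 1.36),
    "Lab-antibiotic": (0.734, 0.86),
    "Heavy-metals": (0.712, 0.77),
}

def odds_ratio(p_class, p_baseline=BASELINE_CORE):
    """Odds of being core in a class relative to the baseline odds."""
    return odds(p_class) / odds(p_baseline)

recomputed = {name: odds_ratio(p) for name, (p, _) in classes.items()}
for name, (p, reported) in classes.items():
    print(f"{name}: recomputed OR={recomputed[name]:.2f}, reported OR={reported:.2f}")
```

Small differences in the second decimal are expected because the inputs here are rounded percentages, not the underlying gene counts.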

Core genome percentage for genes important under each condition class, with baseline

(Notebook: 03_fitness_conservation.ipynb)

Specificity Analysis: Lab-Specific Genes Are Surprisingly More Core

Specificity	Genes	Core %
Lab-specific (sick in lab only)	50	96.0%
Field-specific (sick in field only)	52	88.5%
Field-biased	89	83.1%
Universal (sick in both)	352	79.8%
Neutral (no strong effects)	2,083	74.5%

Counter to H1, genes with fitness defects only under lab conditions are 96.0% core (n=50), higher than field-specific genes at 88.5% (n=52). Both are well above the 74.5% core fraction of neutral genes, suggesting that any fitness importance -- regardless of ecological context -- predicts conservation. The field-specific vs lab-specific comparison is not statistically significant (Fisher exact OR=0.32, p=0.27). The universal vs neutral comparison is significant (OR=1.35, p=0.033).
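The field-specific vs lab-specific test can be checked from the table alone. The sketch below is a pure-Python two-sided Fisher exact test; the 2x2 counts are back-calculated from the reported percentages (88.5% of 52 is about 46 core; 96.0% of 50 is 48 core) and are therefore an assumption, not values read from the notebook:

```python
from fractions import Fraction
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the probabilities of all tables with the same margins that are
    no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x.
        return Fraction(comb(row1, x) * comb(row2, col1 - x), denom)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    return float(sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs))

# Rows: field-specific, lab-specific. Columns: core, non-core.
field_core, field_noncore = 46, 6   # assumed from 88.5% of n=52
lab_core, lab_noncore = 48, 2       # assumed from 96.0% of n=50

sample_or = (field_core * lab_noncore) / (field_noncore * lab_core)
p_value = fisher_exact_two_sided(field_core, field_noncore, lab_core, lab_noncore)
print(f"OR={sample_or:.2f}, p={p_value:.2f}")
```

With only 8 non-core genes across both groups, the test has little power, which is why a difference of 7.5 percentage points is not significant.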

Scatter plot of mean field fitness vs mean lab fitness per gene, colored by core/auxiliary status

(Notebook: 03_fitness_conservation.ipynb)

Fitness Effects Are Weak Predictors of Core Status

Logistic regression with 10-fold cross-validated AUC:

Model	CV-AUC	Std
Field fitness only	0.517	0.052
Lab fitness only	0.531	0.052
Field + Lab	0.548	0.053
Full (+ gene length)	0.645	0.068

Gene length is a much stronger predictor of core status than fitness effects from either field or lab conditions. Neither fitness dimension alone is informative (CV-AUC near 0.5). Cross-validation confirms the in-sample estimates are not inflated.
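For readers less familiar with the metric: AUC is the probability that a randomly chosen core gene receives a higher model score than a randomly chosen non-core gene, so 0.5 means the ranking is no better than chance. A minimal pure-Python sketch of that definition (the toy labels and scores are hypothetical; the notebooks compute AUC with scikit-learn):

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical toy example: 1 = core, 0 = non-core.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)

# A model that gives every gene the same score cannot rank at all.
chance_auc = roc_auc(toy_labels, [0.5, 0.5, 0.5, 0.5])
print(toy_auc, chance_auc)
```

On this scale, the reported 0.517 and 0.531 mean the fitness-only models rank a core gene above a non-core gene only slightly more often than a coin flip would.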

ROC curves comparing field-only, lab-only, combined, and full models for predicting core genome status

(Notebook: 03_fitness_conservation.ipynb)

Threshold Sensitivity Analysis

The pattern is robust across fitness thresholds (-1 to -3):

Condition class	-3.0	-2.0	-1.5	-1.0
Field-stress	89.4%	83.6%	84.0%	82.1%
Field-core	84.4%	82.4%	81.9%	79.0%
Lab-nutrient	83.5%	81.4%	81.0%	79.5%
Lab-other	85.2%	81.5%	80.1%	77.5%
Lab-antibiotic	82.2%	73.4%	77.3%	76.2%
Heavy-metals	70.6%	71.2%	76.7%	76.7%

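Field-stress has the highest conservation at every threshold. Heavy-metals is the lowest at -3.0, -2.0, and -1.5; at -1.0, lab-antibiotic (76.2%) is marginally lower than heavy-metals (76.7%). The lab-antibiotic dip at -2 should be read cautiously given the small gene sets involved (n=109 at -2, n=45 at -3).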
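A short check over the table values makes the ranking at each threshold explicit. The numbers are copied from the table above; the variable names are illustrative:

```python
thresholds = [-3.0, -2.0, -1.5, -1.0]

# Core % per condition class at each fitness threshold (same order as above).
core_pct = {
    "Field-stress": [89.4, 83.6, 84.0, 82.1],
    "Field-core": [84.4, 82.4, 81.9, 79.0],
    "Lab-nutrient": [83.5, 81.4, 81.0, 79.5],
    "Lab-other": [85.2, 81.5, 80.1, 77.5],
    "Lab-antibiotic": [82.2, 73.4, 77.3, 76.2],
    "Heavy-metals": [70.6, 71.2, 76.7, 76.7],
}

extremes = {}
for i, t in enumerate(thresholds):
    column = {name: values[i] for name, values in core_pct.items()}
    highest = max(column, key=column.get)
    lowest = min(column, key=column.get)
    extremes[t] = (highest, lowest)
    print(f"threshold {t}: highest={highest}, lowest={lowest}")
```

On these values, field-stress ranks first at every threshold; the lowest class is heavy-metals at -3.0, -2.0, and -1.5, and lab-antibiotic at -1.0.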

(Notebook: 03_fitness_conservation.ipynb)

Module-Level Conservation Shows No Field-Lab Difference (NB04)

Of 52 ICA fitness modules, the mean core fraction is 0.886 and the median is 1.000. No significant correlation exists between field condition activity and module conservation (Spearman rho=0.071, p=0.62). Using the mean core fraction (0.886) as the classification threshold, modules partition into:

Module type	Count	Mean core fraction
Ecological (field-active + conserved)	21	0.980
Conserved-quiet (low field activity + conserved)	17	0.983
Field-variable (field-active + less conserved)	5	0.829
Lab (low field + less conserved)	9	0.516

The 9 "lab" modules (mean core fraction 0.516) are notably less conserved: on average, roughly half of their member genes fall in the accessory genome. The 21 ecological modules contain 239 member genes, of which 52 are unannotated -- candidates for novel environmental adaptation functions.
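The four module types amount to a two-by-two split on core fraction and field activity. The sketch below assumes the 0.886 core-fraction threshold from this report; the field-activity cutoff, the `>=` comparisons, and the function name are hypothetical, since the report does not state them. The second half checks that the per-type means are consistent with the overall mean:

```python
CORE_THRESHOLD = 0.886     # mean core fraction across modules (from the report)
ACTIVITY_THRESHOLD = 0.5   # hypothetical cutoff for "field-active"

def classify_module(core_fraction, field_activity,
                    core_threshold=CORE_THRESHOLD,
                    activity_threshold=ACTIVITY_THRESHOLD):
    """Assign a module to one of the four types used in the table."""
    conserved = core_fraction >= core_threshold
    field_active = field_activity >= activity_threshold
    if field_active and conserved:
        return "Ecological"
    if conserved:
        return "Conserved-quiet"
    if field_active:
        return "Field-variable"
    return "Lab"

# (count, mean core fraction) per module type, copied from the table.
module_types = {
    "Ecological": (21, 0.980),
    "Conserved-quiet": (17, 0.983),
    "Field-variable": (5, 0.829),
    "Lab": (9, 0.516),
}

n_modules = sum(n for n, _ in module_types.values())
overall_mean = sum(n * m for n, m in module_types.values()) / n_modules
print(n_modules, round(overall_mean, 3))
```

The count-weighted mean of the four groups recovers the reported overall mean core fraction, so the table is internally consistent.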

Module core fraction vs field and lab condition activity (two-panel scatter)

Module conservation and activity by module type classification (box + scatter)

(Notebook: 04_module_analysis.ipynb)

Interpretation

Hypothesis Outcomes

H1 partially supported: Genes important for field-stress conditions are significantly more conserved than baseline (83.6% core, OR=1.58, FDR q=0.026), as are field-core genes (82.4%, OR=1.46, q=0.026). However, lab-nutrient genes are also significantly enriched (81.4%, q=0.037), and the field-specific vs lab-specific gene comparison is not significant (Fisher exact OR=0.32, p=0.27, n=50-52 per group). The strongest signal is that any fitness importance predicts conservation: universally important genes are significantly more core than neutral genes (OR=1.35, p=0.033). Threshold sensitivity analysis confirms field-stress genes are consistently the most conserved across thresholds from -1 to -3.
H2 not supported: Module-level conservation does not correlate with field condition activity (Spearman rho=0.071, p=0.62). However, the revised module classification (using mean rather than median core fraction) reveals 9 "lab" modules with strikingly low conservation (mean core fraction 0.516), suggesting that accessory-genome modules do exist and tend to respond to lab-type conditions.
H3 partially supported: 21 "ecological" modules (field-active + conserved, mean core fraction 0.980) contain 239 genes, of which 52 are unannotated candidates for environmental adaptation. These are distinct from the 9 "lab" modules, supporting the concept of a functional core relevant to environmental survival.

Key Biological Insight

The most striking pattern is that genes important under lab-antibiotic and heavy-metal conditions are the least conserved (73.4% and 71.2% core), below the 76.3% baseline, although neither difference is significant after FDR correction (q=0.49 and q=0.14). If the trend holds, it suggests that resistance functions sit disproportionately in the accessory genome, consistent with these being recently acquired adaptive traits carried on mobile genetic elements. Conversely, genes for core metabolism (sulfate reduction, lactate/formate/pyruvate utilization) and FRC-relevant stress responses (uranium, mercury, nitrate) are more conserved than baseline (82.4% and 83.6% core), reflecting their importance for DvH's ecological niche.

Note on heavy-metals classification: we classified heavy-metals (cobalt, nickel, zinc, copper, manganese, selenium, molybdate, tungstate, aluminum) as "field" in the broad category because DvH encounters these metals at Oak Ridge FRC sites. However, the low conservation of heavy-metal-important genes (71.2%) contrasts with the high conservation of uranium/mercury-important genes (in field-stress at 83.6%). This may reflect that specific metal resistance mechanisms (efflux pumps, metal-binding proteins) are accessory traits, while the uranium and mercury response involves more fundamental stress pathways (DNA repair, sulfate reduction itself) that are part of the core genome.

Literature Context

Trotter et al. (2023) generated the comprehensive barcoded transposon library for DvH used in this analysis, identifying essential genes and condition-specific phenotypes across hundreds of conditions. Our work adds a pangenome conservation dimension to their fitness measurements.
Rosconi et al. (2022) showed that gene essentiality in S. pneumoniae is strain-dependent and influenced by accessory genome composition. This parallels our finding that fitness importance is a better predictor of conservation than the specific ecological context of the condition -- the magnitude of fitness effect matters more than which condition produces it.
Lee et al. (2015) identified 352 general and 199 condition-specific essential genes in P. aeruginosa across six growth media, establishing the concept of condition-dependent essentiality. Our analysis extends this framework by testing whether the ecological relevance of conditions predicts evolutionary conservation.
Akusobi et al. (2025) found 259 core essential genes and 425 differentially required genes across 21 M. abscessus clinical isolates, demonstrating that genetic diversity drives significant functional differences. This supports our finding that fitness effects vary by condition class, though conservation patterns do not cleanly separate field from lab conditions.
Shi et al. (2021) showed that pre-existing genetic variation in DvH influences chromate tolerance, with the chromate transporter DVU0426 identified as crucial for Cr(VI) resistance -- a specific example of accessory-genome-encoded environmental adaptation.
Huang et al. (2022) found that genomic islands and prophages are "major drivers for evolution and environmental adaptation" in S. algae, consistent with our observation that genes important under metal and antibiotic stress trend toward the accessory genome.
Price et al. (2018) generated the Fitness Browser data underpinning this analysis. Our work tests a prediction implicit in their dataset: whether the ecological relevance of experimental conditions modulates the fitness-conservation relationship.

Novel Contribution

This is the first analysis to stratify RB-TnSeq fitness effects by ecological relevance and test whether field-relevant conditions predict pangenome conservation differently than lab conditions. The finding that condition type matters less than fitness magnitude for conservation prediction was not anticipated and suggests that the core genome reflects general functional importance rather than niche-specific selection. The below-baseline conservation of genes important under antibiotic and heavy-metal conditions, although not statistically significant, points to a functional partition that aligns with ecological expectations.

Limitations

678 essential genes (80.1% core) are excluded from the fitness analysis because they lack transposon mutants. Including them would raise the overall baseline slightly but would not change condition-class comparisons (which are among non-essential genes only)
Single organism (DvH) limits generalizability -- results may differ in organisms with larger accessory genomes
The Nitratidesulfovibrio vulgaris pangenome (the GTDB species that contains DvH) has relatively few genomes, creating a coarse core/auxiliary classification with a high baseline core fraction (76.3%), which compresses effect sizes
Condition classification relies on manual mapping of condition_1 labels; edge cases (e.g., zinc sulfate as metal vs sulfate source) required subjective judgment
The fitness < -2 threshold is the primary cutoff, but sensitivity analysis across -1 to -3 confirms pattern robustness
Gene length is confounded with both fitness measurement quality (short genes get fewer transposon insertions) and core status (core genes tend to be longer)
The field-specific and lab-specific gene sets are small (n=50-52), limiting statistical power for the key comparison
Field and lab fitness effects are correlated (r ~ 0.7 from the scatter plot), meaning most genes that are sick in one context are sick in the other

Future Directions

Multi-organism extension: Apply the field vs lab classification to other ENIGMA organisms (e.g., Pseudomonas FW300 strains) that have both environmental relevance and Fitness Browser data, to test whether the pattern generalizes
Finer-grained conservation metric: Replace binary core/auxiliary with quantitative conservation (fraction of genomes carrying the gene cluster) to increase statistical power
ENIGMA community data integration: Use the 213K ASVs and 4,346 field samples in ENIGMA CORAL to ask whether Desulfovibrio abundance at Oak Ridge sites correlates with the geochemistry conditions tested in the Fitness Browser
Accessory resistance gene characterization: Deeper analysis of the antibiotic and heavy-metal resistance genes in the accessory genome -- are they on mobile elements? Recently acquired? Shared with co-occurring species at Oak Ridge?
Quantitative fitness scores: Replace binary important/not-important with continuous fitness scores (e.g., mean or minimum fitness per condition class) as predictors in logistic regression
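The two "quantitative" directions above can be sketched together: a continuous conservation score per gene cluster and continuous fitness summaries per condition class. Everything below is a hypothetical illustration; the cluster IDs, genome IDs, fitness values, and function names are invented and do not reflect the pangenome schema:

```python
from statistics import mean

def conservation_fraction(genomes_with_cluster, n_genomes):
    """Fraction of pangenome genomes that carry a gene cluster."""
    if n_genomes <= 0:
        raise ValueError("n_genomes must be positive")
    return len(set(genomes_with_cluster)) / n_genomes

def summarize_fitness(scores):
    """Continuous fitness summaries for one gene within one condition class."""
    return {"mean": mean(scores), "min": min(scores)}

# Hypothetical toy inputs.
N_GENOMES = 4
cluster_presence = {
    "cluster_1": ["genome_a", "genome_b", "genome_c", "genome_d"],
    "cluster_2": ["genome_a", "genome_c"],
}
conservation = {
    cluster: conservation_fraction(genomes, N_GENOMES)
    for cluster, genomes in cluster_presence.items()
}

field_stress_scores = [-2.5, -0.5, 0.0]  # one gene, three field-stress experiments
fitness_summary = summarize_fitness(field_stress_scores)
print(conservation, fitness_summary)
```

Under a binary scheme both clusters would receive a single core/auxiliary label; the fraction keeps the distinction between a gene present in half of the genomes and one present in nearly all of them.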

Data

Sources

Collection	Description
`kescience_fitnessbrowser`	RB-TnSeq fitness data for DvH (757 experiments, 2,741 genes)
`kbase_ke_pangenome`	Pangenome gene clusters with core/auxiliary/singleton classification

Generated Data

File	Description
`experiment_classification.csv`	757 experiments with category and broad_category assignments
`gene_fitness_conservation.csv`	2,725 genes with per-category fitness stats, conservation, and specificity class
`module_characterization.csv`	52 ICA modules with conservation, field/lab activity, type, and top annotation

References

Price MN, Wetmore KM, Waters RJ, Calef M, Tber J, Engel A, Chol J, Arkin AP, Deutschbauer AM. (2018). "Mutant phenotypes for thousands of bacterial genes of unknown function." Nature 557:503-509. DOI: 10.1038/s41586-018-0124-0. PMID: 29769716
Trotter VV, Mirasol R, Yung M, Chandran A, Kuehl JV, Katsnelson B, De Leon KB, Wall JD, Deutschbauer AM. (2023). "Large-scale genetic characterization of the model sulfate-reducing bacterium, Desulfovibrio vulgaris Hildenborough." Front Microbiol 14:1095132. PMID: 37065130
Rosconi F, Rudmann E, Li J, Surujon D, Anthony J, Frank M, Jones DS, Rock C, Bhatt AS, Roditi I, van Opijnen T. (2022). "A bacterial pan-genome makes gene essentiality strain-dependent and evolvable." Nat Microbiol 7:1580-1592. DOI: 10.1038/s41564-022-01208-7. PMID: 36097170
Lee SA, Gallagher LA, Thongdee M, Stodghill BJ, Hachey SJ, Sanowar S, Cooley RB, Leung KP. (2015). "General and condition-specific essential functions of Pseudomonas aeruginosa." Proc Natl Acad Sci USA 112:5189-5194. PMID: 25848053
Akusobi C, Benoit BM, Cavallo K, Williams EP, Bhatt A, Rubin EJ. (2025). "Transposon-sequencing across multiple Mycobacterium abscessus isolates reveals significant functional genomic diversity among strains." mBio 16:e02488-24. PMID: 39745363
Shi W, Zhan T, Guo Y, Zhou A, He Z, Deng Y. (2021). "Genetic basis of chromate adaptation and the role of pre-existing genetic divergence during an experimental evolution study with Desulfovibrio vulgaris populations." mSystems 6:e00351-21. PMID: 34061571
Huang Z, Yu K, Fu S, Xiao Y, Wei Q, Wang D. (2022). "Genomic analysis reveals high intra-species diversity of Shewanella algae." Microb Genom 8:000786. PMID: 35143386
Borchert AJ, Ernst DC, Downs DM. (2019). "Modular fitness landscapes reveal parallels between independent biological systems." Nat Ecol Evol 3:1233-1242. DOI: 10.1038/s41559-019-0938-7

Data Collections

Collection	ID	Source
Pangenome Collection	kbase_ke_pangenome	KBase, DOE
Fitness Browser	kescience_fitnessbrowser	Price Lab, LBNL
ENIGMA CORAL	enigma_coral	ENIGMA SFA, LBNL

Derived Data

This project builds on processed data from other projects.

Source project	Files
Conservation Vs Fitness	essential_genes.tsv, fb_pangenome_link.tsv, seed_annotations.tsv
Fitness Modules	annotations/DvH_experiments.csv, matrices/DvH_fitness_matrix.csv, modules/DvH_gene_membership.csv +2 more

Review

Summary

This is a well-designed and honestly reported project that asks whether genes important under environmentally-realistic conditions (uranium, mercury, sulfate reduction) show different pangenome conservation patterns than genes important under lab-only conditions (antibiotics, rich media). The analysis pipeline is clean and logically structured across four notebooks, the condition classification scheme is thoughtfully grounded in DvH ecology at the Oak Ridge FRC, and the statistical analyses are mostly appropriate. The project arrives at a nuanced answer: field-stress genes are modestly more conserved (83.6% core, OR=1.58, FDR q=0.026), but fitness magnitude matters more than condition type for predicting conservation (CV-AUC 0.52-0.55 for condition-type models vs 0.65 with gene length). Documentation is exemplary -- the README, RESEARCH_PLAN, and REPORT form a coherent three-file structure with clear data provenance, thorough limitations, and honest reporting of largely null results. The main areas for improvement are: (1) the save step for gene_fitness_conservation.csv is missing, (2) statsmodels is an undeclared dependency, and (3) the module-level correlation analysis in NB04 is inherently limited by the ceiling effect (median core fraction = 1.0 across 52 modules).

Methodology

Research question: Clearly stated and testable. Three hypotheses (H1-H3) are explicitly articulated alongside a null hypothesis (H0), which is a strong practice. The two-level approach (gene-level in NB03 and module-level in NB04) provides complementary perspectives.

Approach: Sound overall. The six-category condition classification (NB02) is well-designed with 52 pattern-matching rules and sensible fallback logic using expGroup. The classification is validated with a cross-tabulation (NB02 cell 8) showing that field-core aligns with respiratory growth and stress, while lab categories align with nutrient and nitrogen source. The specificity analysis (field-specific vs lab-specific vs universal genes) is a valuable addition that reveals the counter-intuitive finding that lab-specific genes are 96% core.

Data sources: All upstream data is clearly identified and properly attributed to prior projects (conservation_vs_fitness, fitness_modules). The ENIGMA CORAL discovery (NB01) is a responsible first step that documents the absence of DvH data in that database, avoiding incorrect assumptions.

Reproducibility concerns:
- The condition classification rules are embedded in code (NB02 cell 3) rather than an external configuration file. This is acceptable for a one-off analysis but makes it harder to version or modify the classification independently.
- Edge cases in classification (e.g., "zinc sulfate" as metal vs sulfate) are handled by rule ordering but documented only implicitly through the code structure.

Essential genes handling: NB03 correctly acknowledges that 678 essential genes are excluded from fitness analyses (cell 8 prints a detailed note explaining why and reports their 80.1% core fraction). They are also included as a separate bar in Figure 1 and as a row in the condition importance table (cell 10). This is a significant strength -- the analysis is transparent about what it cannot measure.

Code Quality

Notebook organization: All four notebooks follow a clean setup → data loading → analysis → visualization → summary flow. Markdown headers and summary cells at the end of each notebook consolidate key findings. NB03 is particularly well-structured with 8 clearly delineated analysis sections.

Statistical methods:
- Fisher exact tests with BH-FDR correction (NB03 cell 10) are appropriate for the 6 condition-class comparisons. The multiple testing correction is a strength.
- Logistic regression with 10-fold cross-validated AUC (NB03 cell 18) is properly implemented using sklearn.model_selection.cross_val_score. Both in-sample and CV-AUC are reported, and the CV results confirm the weak signal (0.517-0.548) is not an artifact of overfitting.
- Spearman correlation (NB04 cell 10) is appropriate for the module-level analysis given the non-normal distribution of core fractions.
- The threshold sensitivity analysis (NB03 cell following cell 23) strengthens the results by showing the pattern holds across thresholds from -1 to -3.
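For reference, the BH-FDR step can be written in a few lines. This is a pure-Python sketch of the Benjamini-Hochberg step-up adjustment, intended to match what statsmodels' multipletests does with method="fdr_bh"; the input p-values are hypothetical because the REPORT lists only the adjusted q-values:

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted q-values in the same order as the input p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is the minimum of
    # its own scaled p-value and those of all larger p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values

# Hypothetical raw p-values for four tests.
raw_p = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw_p)
print([round(q, 4) for q in adjusted])
```

The running minimum is what produces tied q-values for neighbouring tests, which is consistent with field-stress and field-core sharing q=0.026 in the REPORT.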

Condition classification (NB02): The rule-based system is carefully ordered (e.g., "persulfate" and "zinc sulfate" before "sulfate") with comments explaining why. The "heavy-metals" → "field" broad classification is debatable for some metals (aluminum, rubidium) but is discussed in the REPORT interpretation section.
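The ordering issue is easy to see in a small first-match-wins sketch. The rules below are a hypothetical subset, not the 52 rules in NB02: the category assigned to "persulfate" and the default fallback are assumptions (the notebook falls back to expGroup instead):

```python
# Ordered (substring, category) rules; the first match wins, so specific
# patterns must come before the generic "sulfate" rule.
RULES = [
    ("zinc sulfate", "Heavy-metals"),
    ("persulfate", "Lab-other"),    # hypothetical category for illustration
    ("uranium", "Field-stress"),
    ("sulfate", "Field-core"),
]

def classify_condition(label, rules=RULES, default="Lab-other"):
    """Return the category of the first rule whose pattern occurs in the label."""
    text = label.lower()
    for pattern, category in rules:
        if pattern in text:
            return category
    return default  # stand-in for the notebook's expGroup-based fallback

zinc = classify_condition("Zinc sulfate")
plain = classify_condition("Sodium sulfate")

# With the generic rule first, zinc sulfate is captured by "sulfate".
misordered = classify_condition("Zinc sulfate", rules=list(reversed(RULES)))
print(zinc, plain, misordered)
```

Keeping the rules in an ordered list also makes it straightforward to move them into an external configuration file later.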

Pitfall awareness:
- fillna(False).astype(bool) pattern correctly used in NB03 cell 4 and NB04 cell 5, per docs/pitfalls.md.
- String-to-boolean conversion for TSV columns is explicitly handled in NB03 cell 4.
- Essential gene exclusion from genefitness analyses is acknowledged, per the pitfall "Essential Genes Are Invisible in genefitness-Only Analyses."
- The .apply() on 757 rows in NB02 cell 5 is fine — the performance pitfall about row-wise apply only matters for large DataFrames.

Module classification (NB04): The classification in cell 13 uses the mean core fraction (0.886) rather than the median (1.000) as the threshold, which avoids the ceiling effect that would make the "ecological" category empty. This produces a reasonable distribution: 21 ecological, 17 conserved-quiet, 5 field-variable, and 9 lab modules. However, the underlying issue remains that 52 modules with a median core fraction of 1.0 leaves limited dynamic range for correlation analysis (Spearman rho=0.071, p=0.62).

Minor issues:
- NB03 cell 8 reports Auxiliary: 645 while the gene_info table has 583 auxiliary genes. The discrepancy (62 genes) is because genes without pangenome links (from the outer merge in cell 4) are counted as non-core in the gene_data table. The report summary correctly uses the 645 figure, but a clarifying comment would help.
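The mechanism behind the discrepancy can be shown with plain dictionaries standing in for the pandas merge. The gene names and counts are hypothetical; `.get(gene, False)` plays the role of the outer merge followed by fillna(False):

```python
# Hypothetical pangenome link table: gene -> is_core.
pangenome_is_core = {"gene_a": True, "gene_b": False, "gene_c": True}

# Hypothetical genes with fitness data; two have no pangenome link.
fitness_genes = ["gene_a", "gene_b", "gene_c", "gene_d", "gene_e"]

# Keep every gene and treat a missing link as non-core.
merged = {gene: pangenome_is_core.get(gene, False) for gene in fitness_genes}

non_core_after_merge = sum(1 for is_core in merged.values() if not is_core)
auxiliary_in_link_table = sum(1 for is_core in pangenome_is_core.values() if not is_core)
unlinked = [gene for gene in fitness_genes if gene not in pangenome_is_core]

print(non_core_after_merge, auxiliary_in_link_table, len(unlinked))
```

The gap between the two non-core counts equals the number of unlinked genes, mirroring the 645 vs 583 difference (62 genes) described above.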

Findings Assessment

Conclusions supported by data: Yes, thoroughly. The REPORT accurately represents the statistical results. H1 is honestly reported as "partially supported": field-stress enrichment is significant (q=0.026) but the field-specific vs lab-specific comparison is not (p=0.27, n=50-52 per group). The key insight that fitness magnitude matters more than condition type is supported by the logistic regression results (CV-AUC=0.645 for the full model including gene length vs CV-AUC=0.517-0.548 for the field/lab fitness models).

Limitations acknowledged: Exceptionally thorough. Eight specific limitations are listed in the REPORT, including the essential gene exclusion, single-organism caveat, coarse core/auxiliary classification, subjective condition mapping, and the correlation between field and lab fitness effects (r~0.7). The gene length confound is explicitly noted.

Counter-intuitive results honestly reported: The finding that lab-specific genes are 96% core (vs 88.5% for field-specific) is counter to H1 but is reported prominently. The interpretation that "any fitness importance predicts conservation" is well-reasoned, though the small sample sizes (n=50-52) warrant the caveat about limited statistical power.

Incomplete analysis: None, apart from one reproducibility gap. gene_fitness_conservation.csv is listed in the REPORT as a data file and exists on disk, but no notebook cell writes it. All other analyses are complete.

Visualizations: Six figures are well-labeled with axes, titles, and legends. The bar chart (fig 1) clearly shows conservation gradients with sample sizes. The ROC curves (fig 3) effectively demonstrate weak predictive power. The module scatter plots (figs 5-6) honestly display the null result. Figures are saved at 150 DPI — adequate for screen use.

Literature context: Strong. Eight references are cited in the REPORT with specific connections to findings (e.g., Rosconi et al. 2022 linked to the magnitude-over-context conclusion; Huang et al. 2022 linked to accessory genome metal resistance). The references.md file lists 12 references with full bibliographic details and PubMed search queries, enabling others to reproduce the literature search.

Suggestions

Add statsmodels to requirements.txt (Critical): NB03 imports from statsmodels.stats.multitest import multipletests but statsmodels is not listed in requirements.txt. Add statsmodels>=0.13 to avoid a ModuleNotFoundError on fresh installs.
Add the missing save step for gene_fitness_conservation.csv (Important): The file exists on disk and is listed in the REPORT, but no notebook cell writes it. Add a cell at the end of NB03's section 6 or 7 to save gene_data to data/gene_fitness_conservation.csv. This ensures full reproducibility via nbconvert --execute.
Add NB01 code cell outputs or a summary data file (Moderate): NB01 code cells lack saved outputs since it requires Spark, which is expected. The markdown header has an execution summary, but saving a small summary file (e.g., data/enigma_summary.json with table names and row counts) would let downstream notebooks or readers verify the discovery results without Spark access.
Consider a quantitative conservation metric (Moderate): The binary core/auxiliary classification creates a high baseline (76.3% core) and a ceiling effect at the module level (median 1.0). Using the fraction of genomes carrying each gene cluster as a continuous variable would increase statistical power, as noted in the REPORT's Future Directions. This is the single change most likely to strengthen the results if the pangenome data supports it.
Clarify the auxiliary gene count discrepancy (Minor): NB03 cell 8 reports 645 auxiliary genes, while cell 4 shows 583 in the gene_info table. A brief comment explaining that the difference comes from genes without pangenome links being counted as non-core after the outer merge in cell 4 would prevent confusion.
Consider showing ROC curves with CV-AUC values (Minor): NB03 cell 19 plots ROC curves using in-sample probabilities, and the legend labels show the in-sample AUC values (e.g., "Field-only (AUC=0.516)"). Since 10-fold CV-AUC values are computed in cell 18, updating the legend to show CV-AUC (e.g., "Field-only (CV-AUC=0.517)") would be more rigorous, though the difference is negligible here.
Document the heavy-metals broad classification rationale earlier (Minor): The README and RESEARCH_PLAN classify heavy-metals as "field" without explanation. The REPORT discusses this in the Interpretation section, but adding a one-line note in NB02's markdown header or classification scheme table would make the rationale visible at the point of decision.
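As an illustration of the NB01 suggestion, a summary file could be as small as the sketch below. The counts are those quoted in the REPORT; the key names, the helper function, and the file layout are hypothetical. The demo writes to a temporary directory, whereas the notebook would write to data/enigma_summary.json:

```python
import json
import tempfile
from pathlib import Path

# Counts quoted in the REPORT; key names are illustrative only.
enigma_summary = {
    "database": "enigma_coral",
    "n_tables": 47,
    "dvh_fitness_data_present": False,
    "field_samples": 4346,
    "sampling_locations": 596,
    "asvs": 213044,
    "genomes": 6705,
    "genes": 15015,
}

def write_summary(summary, path):
    """Write the summary as JSON, creating the parent directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(summary, indent=2, sort_keys=True))
    return path

# Round-trip through a temporary directory to confirm the file is readable.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = write_summary(enigma_summary, Path(tmp_dir) / "data" / "enigma_summary.json")
    reloaded = json.loads(out_path.read_text())

print(reloaded["n_tables"], reloaded["dvh_fitness_data_present"])
```

A downstream notebook could then load this file and assert that DvH is absent without needing Spark access.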

This review was generated by an AI system. It should be treated as advisory input, not a definitive assessment.

Visualizations

Fig Condition Importance Heatmap

Fig Conservation By Condition Class

Fig Ecological Vs Lab Modules

Fig Field Vs Lab Specificity

Fig Module Conservation Vs Activity

Fig Roc Conservation Prediction

Notebooks

Notebook	Title
01_enigma_discovery.ipynb	01 Enigma Discovery
02_condition_classification.ipynb	02 Condition Classification
03_fitness_conservation.ipynb	03 Fitness Conservation
04_module_analysis.ipynb	04 Module Analysis
