BacDive Phenotype Signatures of Metal Tolerance
Completed
Research Question
Can BacDive-measured bacterial phenotypes (Gram stain, oxygen tolerance, metabolite utilization, enzyme activities) predict metal tolerance as measured by Fitness Browser experiments and the Metal Fitness Atlas?
Research Plan
Hypothesis
- H0: BacDive phenotypic traits do not predict metal tolerance beyond what is expected from phylogenetic relatedness alone.
- H1: Specific BacDive phenotypes — particularly oxygen tolerance, Gram stain, metabolite utilization breadth, and redox enzyme activities — are significant predictors of metal tolerance, even after controlling for phylogeny.
Sub-hypotheses
- H1a: Gram-negative bacteria show higher metal tolerance scores than Gram-positive bacteria (outer membrane acts as permeability barrier; Giovanella et al. 2017).
- H1b: Anaerobes and facultative anaerobes show higher tolerance to redox-active metals (Cu, Cr, U, Fe) due to dissimilatory metal reduction (Lovley 1993; Wang et al. 2018). Test against metal-specific genes (from counter_ion_effects corrected scores) to avoid shared-stress confounding.
- H1c: Metabolite utilization breadth (number of positive BacDive metabolite tests) positively correlates with metal tolerance score (metabolic versatility co-occurs with metal resistance gene clusters; Martin-Moldes et al. 2015).
- H1d: Catalase-positive organisms show higher tolerance to redox-cycling metals (Cu, Cr, Fe) via ROS detoxification. Test against metal-specific genes to separate from general stress response.
- H1e: Urease-positive organisms show higher nickel tolerance (urease is a nickel-dependent enzyme; urease-positive organisms have nickel import/handling machinery).
- H1f: H₂S-producing organisms (sulfate/thiosulfate reducers) show higher tolerance to chalcophilic metals (Zn, Cu, Cd, Pb) via metal sulfide precipitation.
Expected effect sizes
The bacdive_metal_validation project found Cohen's d = +1.00 for isolation source (heavy metal contamination → metal tolerance score). BacDive phenotypes are less directly related to metal tolerance than isolation from metal-contaminated environments, so we expect moderate effect sizes:
- Gram stain, oxygen tolerance: d = 0.2–0.5 (small to moderate)
- Urease/H₂S (mechanistically specific): d = 0.3–0.7 (moderate)
- Metabolite breadth (indirect proxy): r = 0.1–0.3 (weak to moderate)
Approach
Two-scale design
- Direct validation (n=12): 12 Fitness Browser organisms match BacDive strains by NCBI taxonomy ID (species-level match, not strain-level — see Caveats). Present as descriptive case studies, not formal hypothesis tests, given the low n.
- Pangenome-scale prediction (n≈TBD): Link BacDive strains to pangenome species via genome accessions (27,502 GCA accessions). Test phenotype → metal tolerance score (from Metal Fitness Atlas) associations. Actual post-matching sample size must be computed in NB01 before proceeding — estimate 3,000-5,000 species but this is unverified.
Phenotype feature matrix
Value encoding rules (addresses four-value utilization and non-binary enzyme activity):
- metabolite_utilization.utilization: Map + → positive, - → negative, produced → positive, +/- → exclude (ambiguous). Filter to explicit +/- before computing utilization percentages.
- enzyme.activity: Map + → positive, - → negative, +/- → exclude (ambiguous). Other values (e.g., "variable") also excluded.
- physiology fields: Use measured values only (gram_stain, oxygen_tolerance, motility). Do NOT fill gaps with AI-predicted columns (predicted_gram, predicted_motility, predicted_oxygen). Predicted values could introduce model-dependent bias, especially if the prediction models used genomic features correlated with metal tolerance.
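The encoding rules above can be sketched as a small lookup. This is a minimal illustration, not the notebook's actual code: the names `UTILIZATION_MAP`, `ENZYME_MAP`, `encode_call`, and `metabolite_breadth` are hypothetical, and the breadth count assumes that only an explicit "+" counts as a positive test, as stated in the feature table.

```python
from typing import Optional

# Raw BacDive values mapped to a binary call; any value absent here is excluded.
UTILIZATION_MAP = {"+": True, "-": False, "produced": True}
ENZYME_MAP = {"+": True, "-": False}


def encode_call(raw: Optional[str], mapping: dict) -> Optional[bool]:
    """Return True/False for an explicit call, or None to exclude the test."""
    if raw is None:
        return None
    return mapping.get(raw.strip())


def metabolite_breadth(raw_values: list) -> int:
    """Count explicit "+" metabolite tests for one strain (assumed definition)."""
    return sum(1 for v in raw_values if v is not None and v.strip() == "+")


print(encode_call("+/-", UTILIZATION_MAP))        # ambiguous, so excluded (None)
print(encode_call("variable", ENZYME_MAP))        # not a recognized call (None)
print(metabolite_breadth(["+", "-", "+/-", "+"]))
```

Returning `None` rather than `False` for ambiguous values keeps excluded tests out of both the numerator and the denominator of any later fraction.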
| Feature | Source | Strain Coverage | Type | Encoding |
|---|---|---|---|---|
| Gram stain | physiology.gram_stain | 15,296 strains | Binary | negative/positive (exclude "variable") |
| Oxygen tolerance | physiology.oxygen_tolerance | 23,252 strains | Categorical (5 levels) | aerobe/anaerobe/microaerophile/facultative/obligate |
| Cell shape | physiology.cell_shape | 14,990 strains | Categorical | Top 5 shapes; others → "other" |
| Motility | physiology.motility | 13,759 strains | Binary | yes/no |
| Catalase activity | enzyme (catalase) | 16,907 tests | Binary | + or - only |
| Oxidase activity | enzyme (oxidase) | 17,723 tests | Binary | + or - only |
| Urease activity | enzyme (urease) | 30,875 tests | Binary | + or - only |
| Nitrate reduction | metabolite_utilization (nitrate) | 27,388 tests | Binary | + or - only (exclude ambiguous +/-) |
| H₂S production | metabolite_utilization (H₂S) | 5,322 tests | Binary | "produced" → positive; else negative |
| Acetate utilization | metabolite_utilization (acetate) | 2,199 tests | Binary | + or - only |
| Metabolite breadth | metabolite_utilization | 988K tests | Continuous | Count of explicit "+" results per strain |
| Enzyme breadth | enzyme | 643K tests | Continuous | Count of explicit "+" results per strain |
| Isolation source | isolation.cat1 | 57,935 strains | Categorical | #Host/#Environmental/#Engineered |
Species-level aggregation
- Categorical features (Gram, oxygen): Majority vote with agreement score (fraction of strains with majority call). Species with agreement < 0.6 flagged as "variable" and analyzed separately.
- Binary features (catalase, urease, etc.): Fraction positive across strains. Binarize at species level using 0.5 threshold, but retain continuous fraction for sensitivity analysis.
- Continuous features (breadth): Mean across strains within species.
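A minimal sketch of the three aggregation rules, using only the thresholds stated above (0.6 agreement, 0.5 binarization). The function names are hypothetical, and treating a fraction of exactly 0.5 as positive is an assumption, since the plan does not say how ties are resolved.

```python
from collections import Counter
from statistics import mean

AGREEMENT_MIN = 0.6   # below this, a categorical call is flagged "variable"
BINARY_CUTOFF = 0.5   # species-level binarization threshold


def majority_vote(calls):
    """Return (majority label or "variable", agreement score) for one species."""
    label, n_top = Counter(calls).most_common(1)[0]
    agreement = n_top / len(calls)
    return (label if agreement >= AGREEMENT_MIN else "variable"), agreement


def fraction_positive(calls):
    """Return (fraction of positive strains, binarized species-level call)."""
    frac = sum(calls) / len(calls)
    return frac, frac >= BINARY_CUTOFF   # tie at 0.5 counted as positive (assumption)


print(majority_vote(["negative", "negative", "positive"]))
print(fraction_positive([True, False, False, False]))
print(mean([3, 5]))   # continuous features: mean across strains
```

Keeping the continuous fraction alongside the binary call is what allows the sensitivity analysis to be rerun at other cutoffs.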
Metal tolerance metrics
- Composite: Metal Fitness Atlas metal_score_norm (27,702 pangenome species)
- Per-metal corrected: Use metal-specific gene counts from counter_ion_effects/data/corrected_metal_conservation.csv where available, to test H1b/H1d against metal-specific (not shared-stress) signal
- Per-metal direct: Fitness scores from 14 metals across 12 FB-BacDive matched organisms
Circular reasoning mitigation
The Metal Fitness Atlas metal_score_norm is derived from gene functional signatures (COG/KEGG/SEED/domain annotations in pangenome clusters). If BacDive phenotypes are phylogenetically correlated with the same functional signatures, associations could be indirect. Mitigation strategies:
1. Control for metal resistance gene count: Include n_metal_clusters from the atlas as a covariate. If phenotypes predict metal tolerance beyond what the gene count already explains, the association is biologically meaningful.
2. Phylogenetic blocking in CV: Ensure train/test splits don't share genera (prevents phylogenetic leakage).
3. Partial correlation: Report both raw and partial correlations (controlling for phylum + n_metal_clusters).
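Phylogenetic blocking (strategy 2) amounts to assigning whole genera to folds. The sketch below is one simple way to do that with a greedy size-balancing rule; `genus_blocked_folds` is a hypothetical helper, and the project's notebooks may use a different splitter (scikit-learn's `GroupKFold` provides the same guarantee).

```python
from collections import defaultdict


def genus_blocked_folds(genera, n_folds=5):
    """Assign each species index to a fold so that no genus spans two folds."""
    members = defaultdict(list)
    for idx, genus in enumerate(genera):
        members[genus].append(idx)

    folds = [[] for _ in range(n_folds)]
    # Place the largest genera first, each into the currently smallest fold.
    for genus in sorted(members, key=lambda g: (-len(members[g]), g)):
        smallest = min(range(n_folds), key=lambda k: len(folds[k]))
        folds[smallest].extend(members[genus])
    return folds


# Hypothetical genus labels for ten species.
genera = ["Pseudomonas"] * 4 + ["Bacillus"] * 3 + ["Streptomyces"] * 2 + ["Ralstonia"]
folds = genus_blocked_folds(genera, n_folds=3)
for k, fold in enumerate(folds):
    print(k, sorted({genera[i] for i in fold}))
```

Because a genus never appears in both the training and test portions, a model cannot score well simply by recognizing close relatives of test species.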
Revision History
- v1 (2026-03-10): Initial plan
- v2 (2026-03-10): Addressed plan review — added value encoding rules, coverage waterfall, agreement scores, class-level stratification, circular reasoning mitigation, descriptive framing for n=12 validation, expected effect sizes, cross-project data reuse conventions, go/no-go gate
Overview
The Metal Fitness Atlas scored 27,702 pangenome species for metal tolerance using gene functional signatures validated against RB-TnSeq fitness data. BacDive provides standardized phenotypic measurements (Gram stain, oxygen tolerance, metabolite utilization, enzyme activities) for 97K bacterial strains. This project tests whether these readily measurable phenotypes predict metal tolerance, connecting classical microbiology (culture-based phenotyping) to genomic metal tolerance predictions. A two-scale design validates associations both directly (12 FB organisms matching BacDive) and at pangenome scale (planned as ~3,000-5,000 species linked via genome accessions; the implemented bridge matched 5,647 species by species name).
Key Findings
1. Gram-Negative Bacteria Have Significantly Higher Metal Tolerance Scores (d=-0.61)

Gram-negative species have higher metal tolerance scores than Gram-positive species (Cohen's d = -0.61, p < 1e-60, n = 3,272 species; the negative sign reflects Gram-positive species scoring lower). This is the largest effect among the features that pass FDR correction. However, the effect cannot be tested within taxonomic classes, because Gram stain barely varies within a class: it is almost entirely a between-lineage signal (Gram-positive Actinomycetes vs Gram-negative Proteobacteria). The association is mechanistically plausible, since the Gram-negative outer membrane provides a permeability barrier restricting metal cation uptake (Biswas et al. 2021; Paulsen et al. 1997), but it is statistically confounded with phylogeny.
(Notebook: 03_univariate_associations.ipynb)
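For reference, Cohen's d with a pooled standard deviation can be computed as sketched below. The sign convention (first group minus second, so Gram-positive minus Gram-negative) is an assumption inferred from the reported negative d, and the scores are hypothetical toy values rather than project data.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d (pooled SD) for mean(group_a) - mean(group_b)."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical toy scores, not values from the project data.
gram_positive = [0.18, 0.20, 0.22]
gram_negative = [0.21, 0.23, 0.25]
d = cohens_d(gram_positive, gram_negative)
print(round(d, 2))   # negative: the first group scores lower
```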
2. Seven of Ten Phenotype Features Are Individually Significant After FDR Correction
Seven features pass FDR correction at q < 0.05: Gram stain (d = -0.61), oxidase (d = +0.53), motility (d = +0.35), urease (d = -0.18), enzyme breadth (rho = -0.06), nitrate reduction (d = +0.10), and catalase (d = +0.10). Three do not reach significance: H₂S production (d = -0.87, q = 0.073 — but this effect size estimate is unreliable with only 8 negative controls and likely inflated by small-sample bias), metabolite breadth (rho = -0.01, ns), and acetate utilization (d = 0.005, ns).
(Notebook: 03_univariate_associations.ipynb)
3. Phenotype Features Add Nothing Beyond Taxonomy (Delta R² = -0.009)

The central finding: taxonomy alone (phylum/class/order) explains 35.4% of metal tolerance variance (R² = 0.354), and phenotype features alone explain 16.3% (R² = 0.163), but combining taxonomy + phenotype yields R² = 0.345, slightly worse than taxonomy alone. The phenotype signal is entirely captured by phylogenetic structure. Adding the number of metal resistance gene clusters (n_metal_clusters) boosts the full model to R² = 0.63, even though gene count alone reaches only R² = 0.063, indicating that the genome-encoded metal resistance repertoire, in combination with taxonomy, is the strongest predictor.
(Notebook: 04_multivariate_model.ipynb)
4. Urease-Positive Organisms Have Lower Metal Tolerance (H1e Reversed)
Urease-positive species have significantly lower metal tolerance scores (d = -0.18, p < 1e-5), opposite to the prediction that urease positivity (requiring nickel import machinery) would confer nickel tolerance. Class-stratified analysis reveals this is driven by Actinomycetes (d = -0.59, p < 1e-16), where urease-positive species form a distinct low-metal-score subgroup. Within Gammaproteobacteria (d = +0.08, ns) and Bacilli (d = +0.06, ns), the effect disappears — further evidence of phylogenetic confounding.
(Notebook: 03_univariate_associations.ipynb)
5. Anaerobe vs Aerobe Difference Is Negligible (H1b Not Supported)

Despite 3,751 species with oxygen tolerance data, the anaerobe-aerobe difference in metal tolerance is negligible (d = -0.016, p = 0.55). Facultative anaerobes have the highest mean score (0.221) versus aerobes (0.216) and anaerobes (0.215). The Kruskal-Wallis test across all three groups is marginally significant (H = 8.53, p = 0.014), but the effect is biologically trivial.
(Notebook: 03_univariate_associations.ipynb)
6. SHAP Analysis Confirms Taxonomy and Gene Count Dominate

SHAP feature importance from the full XGBoost model shows that taxonomic class/order codes and n_metal_clusters are the top predictors. Phenotype features contribute minimally to individual predictions once taxonomy is included. This confirms that classical microbiology phenotypes are phylogenetic proxies, not independent predictors of metal tolerance.
(Notebook: 04_multivariate_model.ipynb)
Results
Scale of the Analysis
| Metric | Value |
|---|---|
| BacDive strains in bridge | 97,334 |
| Matched to pangenome + metal score | 37,368 (38.4%) |
| Unique GTDB species | 5,647 |
| Species with ≥5 phenotype features | 3,994 |
| Phenotype features tested | 10 (8 binary, 2 continuous) |
| Significant after FDR | 7 / 10 |
| Taxonomic classes with ≥50 species | 9 |
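
The 38.4% match rate in the table is a strain-level figure. A quick check of the arithmetic, using only counts reported in this document, shows how it differs from species-level coverage of the 27,702 scored pangenome species:

```python
strains_total = 97_334     # BacDive strains in bridge
strains_matched = 37_368   # matched to pangenome + metal score
species_matched = 5_647    # unique GTDB species among the matches
species_scored = 27_702    # pangenome species with a metal score

strain_match_rate = strains_matched / strains_total
species_coverage = species_matched / species_scored

print(f"strain match rate: {strain_match_rate:.1%}")   # 38.4%
print(f"species coverage:  {species_coverage:.1%}")    # 20.4%
```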
Coverage Waterfall

| Feature | BacDive Strains | Matched Strains | Species with Metal Score |
|---|---|---|---|
| Isolation source | 43,378 | 22,581 | 4,531 |
| Metabolite breadth | 29,784 | 16,454 | 3,930 |
| Enzyme breadth | 28,836 | 16,005 | 3,746 |
| Oxygen tolerance | 23,252 | 11,708 | 3,751 |
| Gram stain | 15,194 | 7,411 | 3,272 |
| Motility | 13,759 | 6,474 | 3,138 |
| Nitrate reduction | 20,726 | 11,357 | 3,088 |
| Urease | 24,438 | 14,343 | 3,035 |
| Catalase | 15,295 | 7,876 | 2,930 |
| Oxidase | 7,813 | 5,014 | 1,799 |
| H₂S production | 5,254 | 2,611 | 880 |
| Acetate utilization | 1,980 | 475 | 422 |
Univariate Association Results
| Feature | Hypothesis | Effect | n | p-value | FDR q | Sig |
|---|---|---|---|---|---|---|
| Gram stain (+) | H1a | d = -0.610 | 3,272 | 4.0e-61 | 4.0e-60 | Yes |
| Oxidase (+) | — | d = +0.530 | 1,799 | 2.7e-25 | 1.3e-24 | Yes |
| Motility | — | d = +0.345 | 3,138 | 2.2e-23 | 7.2e-23 | Yes |
| Urease (+) | H1e | d = -0.175 | 3,035 | 3.7e-06 | 9.1e-06 | Yes |
| Enzyme breadth | — | rho = -0.058 | 3,746 | 4.1e-04 | 8.2e-04 | Yes |
| Nitrate reduction | — | d = +0.100 | 3,088 | 4.4e-03 | 7.4e-03 | Yes |
| Catalase (+) | H1d | d = +0.104 | 2,930 | 2.8e-02 | 4.1e-02 | Yes |
| H₂S production | H1f | d = -0.867 | 880 | 5.8e-02 | 7.3e-02 | No |
| Metabolite breadth | H1c | rho = -0.013 | 3,930 | 4.2e-01 | 4.7e-01 | No |
| Acetate utilization | — | d = +0.005 | 422 | 7.9e-01 | 7.9e-01 | No |
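
The document does not name the FDR procedure. Assuming the Benjamini-Hochberg step-up method, the q-values in the table can be approximately reproduced from the listed p-values (small differences come from rounding of the printed p-values). The function below is a plain-Python sketch, not the notebook's code.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted q-values in the same order as the input p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# p-values copied from the table above, in row order.
p = [4.0e-61, 2.7e-25, 2.2e-23, 3.7e-06, 4.1e-04, 4.4e-03, 2.8e-02, 5.8e-02, 4.2e-01, 7.9e-01]
q = benjamini_hochberg(p)
n_significant = sum(qi < 0.05 for qi in q)
print(n_significant)   # 7 features with q < 0.05
```

Under this assumption, H₂S production lands at q ≈ 0.073, matching the table's non-significant call.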
Model Comparison (5-Fold Phylogenetic-Blocked CV)
| Model | R² | RMSE | n | Features |
|---|---|---|---|---|
| Gene count only | 0.063 | 0.045 | 3,994 | 1 |
| Phenotype only | 0.163 | 0.043 | 3,994 | 13 |
| Taxonomy only | 0.354 | 0.038 | 3,994 | 3 |
| Taxonomy + Phenotype | 0.345 | 0.038 | 3,994 | 16 |
| Full (all combined) | 0.633 | 0.028 | 3,994 | 17 |
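
The delta R² values quoted in the text follow directly from this table. The short calculation below makes the comparisons explicit, using only the cross-validated R² values listed above; the dictionary keys are illustrative labels.

```python
r2 = {
    "gene_count": 0.063,
    "phenotype": 0.163,
    "taxonomy": 0.354,
    "taxonomy+phenotype": 0.345,
    "full": 0.633,
}

delta_phenotype = r2["taxonomy+phenotype"] - r2["taxonomy"]   # phenotype added to taxonomy
delta_gene_count = r2["full"] - r2["taxonomy+phenotype"]      # gene count added last
delta_full_vs_taxonomy = r2["full"] - r2["taxonomy"]          # full model vs taxonomy alone

print(round(delta_phenotype, 3), round(delta_gene_count, 3), round(delta_full_vs_taxonomy, 3))
```

Note that gene count on its own (R² = 0.063) is the weakest single-block model; its value shows up only when it is combined with taxonomy.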
Hypothesis Outcomes
| Hypothesis | Prediction | Result |
|---|---|---|
| H1a (Gram-neg → metal tolerant) | Gram-neg higher metal scores | Supported univariately (d = -0.61) but phylogenetically confounded |
| H1b (Anaerobe → redox metals) | Anaerobes tolerate redox metals better | Not supported (d = -0.02, ns) |
| H1c (Metabolite breadth → tolerance) | Broader metabolism → higher scores | Not supported (rho = -0.01, ns) |
| H1d (Catalase → redox metals) | Catalase+ tolerate Cu, Cr, Fe | Marginally supported (d = +0.10, q = 0.04) |
| H1e (Urease → Ni tolerance) | Urease+ handle nickel better | Reversed (d = -0.18); driven by Actinomycetes |
| H1f (H₂S → chalcophilic metals) | H₂S producers tolerate Zn, Cu, Cd | Unreliable (d = -0.87 but only 8 negatives — effect size likely inflated by small-sample bias) |
Direct FB-BacDive Validation (n = 12)

Twelve Fitness Browser organisms match BacDive by taxonomy ID (6 unique species: Cupriavidus basilensis, Methanococcus maripaludis, Ralstonia solanacearum, Pseudomonas simiae, Azospirillum brasilense, Pseudomonas fluorescens). All Gram-typed organisms are Gram-negative, precluding within-set testing of H1a. All urease-typed organisms are urease-negative yet are routinely tested against nickel, consistent with the pangenome-scale finding that urease status does not predict metal tolerance. The single anaerobe (Methanococcus) has only 1 metal tested versus 4–5 for aerobes, but n = 1 is not interpretable.
(Notebook: 05_fb_bacdive_case_studies.ipynb)
Interpretation
The Phylogenetic Confounding Wall
The central result is a cautionary tale for phenotype-based prediction in microbial ecology: classical microbiology phenotypes (Gram stain, oxygen tolerance, enzyme activities) capture real biological differences in metal tolerance, but these differences are entirely explained by phylogenetic structure. Adding phenotype features to a taxonomy-based model actually decreases predictive power slightly (delta R² = -0.009), indicating that phenotypes are noisier proxies for taxonomy, not independent predictors.
This is consistent with Goberna & Verdú (2016), who showed through meta-analysis that bacterial trait-environment associations routinely reflect shared ancestry rather than direct ecological adaptation. Van Assche & Álvarez-Pérez (2017) found phylogenetic signal values up to λ = 0.98 for chemical sensitivity traits in Acinetobacter, directly analogous to our findings.
Why Phenotypes Fail as Metal Tolerance Predictors
- Gram stain is taxonomy, not a trait: The Gram-negative advantage in metal tolerance is real (outer membrane barrier; Paulsen et al. 1997) but indistinguishable from "being a Proteobacterium" in a predictive model. There are no Gram-positive Proteobacteria and no Gram-negative Actinomycetes in our data.
- Urease reversal reflects lineage composition: The surprising finding that urease-negative species have higher metal tolerance is driven by Actinomycetes, where urease-positive species are a phylogenetically distinct subgroup with low metal scores. Within Gammaproteobacteria and Bacilli, the urease effect is near zero and not significant.
- Catalase exhibits Simpson's paradox: The overall catalase effect is d = +0.10 (catalase-positive species have slightly higher metal scores), marginally supporting H1d. But class-stratified analysis reveals the opposite within major classes: catalase-negative species score higher within Actinomycetes (d = -0.62, p < 1e-5), Gammaproteobacteria (d = -0.49, p = 0.004), and Betaproteobacteria (d = -0.51, p = 0.006). The overall positive effect is an artifact of between-class composition: lineages rich in catalase-positive species (Proteobacteria) happen to have higher metal scores than lineages rich in catalase-negative species. Like the urease result, this shows a pooled effect being shaped by lineage composition, reinforcing the phylogenetic confounding narrative.
- Metabolic breadth is not a metal tolerance proxy: The hypothesis that metabolically versatile organisms harbor more resistance genes (Martin-Moldes et al. 2015) was not supported. Metabolite utilization breadth shows no correlation with metal score (rho = -0.01), likely because metabolic versatility operates in different genomic neighborhoods than metal resistance.
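The sign flip described for catalase can be reproduced with a toy example. The scores below are hypothetical and chosen only to show how a pooled difference can be positive while the difference within each class is negative; they are not drawn from the project data.

```python
from statistics import mean

# Hypothetical toy scores, chosen only to illustrate the pattern.
scores = {
    ("class_A", "catalase+"): [0.30, 0.31, 0.32],   # high-scoring class, mostly catalase+
    ("class_A", "catalase-"): [0.34],
    ("class_B", "catalase+"): [0.10],               # low-scoring class, mostly catalase-
    ("class_B", "catalase-"): [0.12, 0.13, 0.14],
}


def mean_difference(classes):
    """Mean(catalase+) minus mean(catalase-) over the given classes."""
    pos = [s for (c, status), v in scores.items() if c in classes and status == "catalase+" for s in v]
    neg = [s for (c, status), v in scores.items() if c in classes and status == "catalase-" for s in v]
    return mean(pos) - mean(neg)


pooled = mean_difference({"class_A", "class_B"})
within_a = mean_difference({"class_A"})
within_b = mean_difference({"class_B"})
print(round(pooled, 3), round(within_a, 3), round(within_b, 3))
```

The pooled difference is positive only because catalase-positive species are concentrated in the higher-scoring class, which is the between-class composition effect described above.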
What Does Predict Metal Tolerance
The genome's metal resistance gene content (n_metal_clusters from the Metal Fitness Atlas) is what lifts predictive performance beyond taxonomy. On its own, gene count explains little variance (R² = 0.063, below phenotype-only at 0.163 and taxonomy-only at 0.354), but the full model (taxonomy + phenotype + gene count) achieves R² = 0.63. That is an improvement of delta R² = +0.28 over taxonomy alone, attributable to gene count because phenotype features add nothing beyond taxonomy. This aligns with Schwan et al. (2023), who found strong genotype-phenotype concordance for metal resistance gene presence in Salmonella and E. coli.
The Urease-Nickel Paradox
Urease-positive bacteria require nickel import and handling machinery (Maier & Benoit 2019), so they should tolerate nickel — yet they show lower overall metal tolerance. This is not a true paradox: urease positivity correlates with Actinomycetes and other lineages that have fewer metal resistance genes overall. The nickel-handling machinery associated with urease is narrowly specific to nickel and does not confer general metal tolerance. Furthermore, the tight coupling between urease activity and nickel efflux (Stähler et al. 2006) means that nickel homeostasis in urease producers is a balancing act, not a broad tolerance mechanism.
Novel Contribution
- First large-scale test of BacDive phenotypes as metal tolerance predictors — 5,647 species, 10 phenotype features, cross-referenced against genome-based metal tolerance scores from the Metal Fitness Atlas.
- Definitive demonstration of phylogenetic confounding — phenotype features capture real signal (R² = 0.16) but add nothing beyond what taxonomy already provides (delta R² = -0.009).
- Urease reversal — urease-positive bacteria have lower metal tolerance, driven by lineage composition, not nickel biology.
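- Gene count as the key added predictor: adding the number of metal resistance gene clusters to taxonomy + phenotype raises R² from 0.345 to 0.633, although gene count alone (R² = 0.063) does not outperform the phenotype-only model (R² = 0.163).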
Limitations
- Metal Fitness Atlas scores are genome-based predictions, not direct metal tolerance measurements. Circular reasoning is mitigated by controlling for n_metal_clusters in partial correlations, but the phenotype-score associations are ultimately phenotype-to-genome correlations, not phenotype-to-phenotype.
- BacDive species name matching links only 38.4% of strains (37,368/97,334), covering 5,647 of the 27,702 scored GTDB species (20.4%). GCA accession matching could recover additional links but was not implemented.
- 12-organism direct validation is underpowered — all Gram-typed organisms are Gram-negative, preventing within-set testing of the strongest association.
- BacDive testing bias: well-studied organisms (Pseudomonas, E. coli) have many phenotype tests; poorly-studied species have sparse data.
- H₂S production is underpowered: only 8 H₂S-negative species in the matched set, making the large effect size (d = -0.87) unreliable.
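To make the H₂S caveat concrete, an approximate confidence interval for Cohen's d can be computed from the group sizes alone, using the standard large-sample standard error. The 872/8 split is an assumption (8 H₂S-negative species out of n = 880, with the remainder taken as positive), and the normal approximation is itself rough when one group has only 8 members.

```python
from math import sqrt


def cohens_d_ci(d, n_1, n_2, z=1.96):
    """Approximate 95% CI for Cohen's d using the large-sample standard error."""
    se = sqrt((n_1 + n_2) / (n_1 * n_2) + d ** 2 / (2 * (n_1 + n_2)))
    return d - z * se, d + z * se


# d and n = 880 come from the univariate table; the 872/8 split is an assumption.
low, high = cohens_d_ci(-0.867, n_1=872, n_2=8)
print(f"d = -0.867, approx. 95% CI [{low:.2f}, {high:.2f}]")
```

Under these assumptions the interval is roughly 1.4 units wide, so the data are compatible with anything from a small effect to a very large one.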
Future Directions
- Per-metal phenotype analysis: Test whether specific phenotypes predict tolerance to specific metals (e.g., catalase → copper, urease → nickel) rather than composite metal scores. This would require per-metal scores for the 27K species.
- GCA accession matching: The current bridge uses species name matching (38.4% of strains matched). Adding GCA accession matching could recover 10–30% more species.
- Phylogenetic comparative methods: Use PGLS or phylogenetic PCA to formally remove phylogenetic signal before testing phenotype-metal associations.
- BacDive AI-predicted phenotypes: The current analysis uses only measured values. Including BacDive's ML-predicted Gram stain, motility, and oxygen tolerance could increase coverage substantially while introducing model-dependent bias.
- Experimental validation: The H₂S → chalcophilic metal hypothesis (d = -0.87, underpowered) could be tested experimentally with sulfate-reducing bacteria and zinc/copper challenge.
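For the GCA accession matching direction, one possible approach is to normalize accessions before joining. The sketch below assumes GTDB-style genome IDs with `GB_`/`RS_` prefixes (the `GB_GCA_*`/`RS_GCF_*` forms mentioned in the review suggestions) and assumes that paired GenBank (GCA) and RefSeq (GCF) records share the same 9-digit assembly number; version suffixes are ignored. The function names are hypothetical.

```python
import re

# Optional GTDB prefix, then GCA/GCF, a 9-digit assembly number, optional version.
_ACCESSION_RE = re.compile(r"^(?:GB_|RS_)?(GC[AF])_(\d{9})(?:\.(\d+))?$")


def assembly_key(accession):
    """Return the 9-digit assembly number, or None if the string is not an accession."""
    match = _ACCESSION_RE.match(accession.strip())
    return match.group(2) if match else None


def match_accessions(bacdive_gca, pangenome_ids):
    """Map each BacDive accession to pangenome genome IDs with the same assembly number."""
    by_key = {}
    for gid in pangenome_ids:
        key = assembly_key(gid)
        if key is not None:
            by_key.setdefault(key, []).append(gid)
    return {acc: by_key.get(assembly_key(acc), []) for acc in bacdive_gca}


print(match_accessions(["GCA_000005845.2"], ["RS_GCF_000005845.2", "GB_GCA_000008085.1"]))
```

Matching on the assembly number alone trades some precision (different versions of an assembly are treated as equal) for recall, which seems reasonable for a sensitivity analysis.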
Data
Sources
| Collection | Tables Used | Purpose |
|---|---|---|
| kescience_bacdive | strain, physiology, metabolite_utilization, enzyme, isolation, taxonomy, sequence_info | Phenotype features for 97K strains |
| kbase_ke_pangenome | (via Metal Fitness Atlas scores) | Metal tolerance genome-based predictions for 27,702 species |
| kescience_fitnessbrowser | (via organism mapping + metal experiments) | 12 direct FB-BacDive organism matches, metal experiment metadata |
Generated Data
| File | Rows | Description |
|---|---|---|
| data/bacdive_pangenome_bridge.csv | 97,334 | Full BacDive-to-pangenome bridge (all strains, matched or not) |
| data/matched_strains.csv | 37,368 | Matched BacDive strains with GTDB species and metal scores |
| data/coverage_waterfall.csv | 13 | Per-feature coverage: strains → matched → species |
| data/species_phenotype_matrix.csv | 5,647 | Species-level phenotype feature matrix (37 columns) |
| data/univariate_results.csv | 10 | FDR-corrected univariate association results |
| data/stratified_results.csv | 17 | Class-stratified binary feature tests |
| data/model_comparison.csv | 5 | Cross-validated R² for 5 model configurations |
| data/fb_bacdive_combined.csv | 12 | Combined phenotype + metal data for 12 FB organisms |
References
- Arkin AP et al. (2018). "KBase: The United States Department of Energy Systems Biology Knowledgebase." Nat Biotechnol 36:566-569. PMID: 29979655
- Biswas R et al. (2021). "Overview on the role of heavy metals tolerance on developing antibiotic resistance in both Gram-negative and Gram-positive bacteria." Arch Microbiol 203:2275. DOI: 10.1007/s00203-021-02275-w
- Goberna M, Verdú M (2016). "Predicting microbial traits with phylogenies." ISME J 10:959-967. DOI: 10.1038/ismej.2015.171
- Koblitz J et al. (2025). "Predicting bacterial phenotypic traits through improved machine learning using high-quality, curated datasets." Commun Biol. DOI: 10.1038/s42003-025-08313-3
- Maier RJ, Benoit SL (2019). "Role of Nickel in Microbial Pathogenesis." Inorganics 7:80. DOI: 10.3390/inorganics7070080
- Martin-Moldes Z et al. (2015). "Whole-genome analysis of Azoarcus sp. strain CIB provides genetic insights to its different lifestyles." Syst Appl Microbiol 38:462-471.
- Oleńska E et al. (2025). "Molecular mechanisms of bacterial survival strategies under heavy metal stress." Int J Mol Sci 26:5716. DOI: 10.3390/ijms26125716
- Paulsen IT et al. (1997). "The SMR family: a novel family of multidrug efflux proteins." FEMS Microbiol Lett 156:1-8.
- Price MN et al. (2018). "Mutant phenotypes for thousands of bacterial genes of unknown function." Nature 557:503-509. PMID: 29769716
- Reimer LC et al. (2022). "BacDive in 2022: the knowledge base for standardized bacterial and archaeal data." Nucleic Acids Res 50:D741-D746.
- Schwan CL et al. (2023). "Genotypic and phenotypic characterization of antimicrobial resistance and metal tolerance in Salmonella and E. coli." J Food Prot 86:100113. PMID: 37290750
- Stähler FN et al. (2006). "The CznABC metal efflux pump is required for cadmium, zinc, and nickel resistance, urease modulation, and gastric colonization." Infect Immun 74:3845-3852.
- Van Assche A, Álvarez-Pérez S (2017). "Phylogenetic signal in carbon source assimilation profiles and tolerance to chemical agents." Appl Microbiol Biotechnol 101:3049-3061. DOI: 10.1007/s00253-016-7866-0
Suggested Experiments
Testing whether urease+ organisms are specifically nickel-tolerant (rather than generally metal-tolerant) requires per-metal fitness profiling of urease+ vs urease- bacteria from the same taxonomic class, ideally Gammaproteobacteria (where the pangenome-scale urease effect is near zero). RB-TnSeq with matched urease+/- strains under nickel vs other metals would directly test whether urease enables nickel-specific tolerance.
Review
Suggestions
- Implement GCA accession matching as a sensitivity analysis: Even a brief check (e.g., "using GCA matching recovers X additional species; re-running the model comparison yields delta R² = Y, consistent with the species-name-only result") would substantially strengthen the robustness claim. The pitfalls doc already provides the prefix-handling recipe (GB_GCA_*/RS_GCF_*).
- Add confidence intervals to Cohen's d values: Especially important for H₂S (n_neg = 8), where the 95% CI on d = -0.87 likely spans from near zero to well beyond -1.5. This would visually reinforce the "unreliable" characterization in the forest plot.
- Test H1b and H1d against per-metal scores as planned: Even a brief analysis using counter_ion_effects/data/corrected_metal_conservation.csv for copper/chromium-specific scores would either strengthen or definitively close these hypotheses.
- Add a brief note in NB04 explaining why the analysis sample size (3,994) differs from NB02's "species with >= 5 features" count (3,437). One sentence referencing the oxygen dummy variable expansion would suffice.
- Explain omission of cell_shape and isolation_source from the multivariate models. If they were dropped for methodological reasons (e.g., high cardinality, redundancy with taxonomy), state this explicitly. If it was an oversight, consider adding them.
- Consider a Mantel test or PGLS as a complementary approach to the XGBoost model comparison. These are standard in phylogenetic comparative biology and would provide a more formal test of phylogenetic confounding than the delta R² approach alone.
This review was generated by an AI system. It should be treated as advisory input, not a definitive assessment.
Visualizations
- Coverage Waterfall
- FB-BacDive Phenotype Table
- Feature Completeness
- Model Comparison
- SHAP Summary
- Univariate Effect Sizes
Notebooks
- 01_bacdive_pangenome_bridge.ipynb
- 02_phenotype_feature_engineering.ipynb
- 03_univariate_associations.ipynb
- 04_multivariate_model.ipynb
- 05_fb_bacdive_case_studies.ipynb