BacDive Phenotype Signatures of Metal Tolerance

Paramvir S. Dehal

Research Question

Can BacDive-measured bacterial phenotypes (Gram stain, oxygen tolerance, metabolite utilization, enzyme activities) predict metal tolerance as measured by Fitness Browser experiments and the Metal Fitness Atlas?

Research Plan

Hypothesis

H0: BacDive phenotypic traits do not predict metal tolerance beyond what is expected from phylogenetic relatedness alone.
H1: Specific BacDive phenotypes — particularly oxygen tolerance, Gram stain, metabolite utilization breadth, and redox enzyme activities — are significant predictors of metal tolerance, even after controlling for phylogeny.

Sub-hypotheses

H1a: Gram-negative bacteria show higher metal tolerance scores than Gram-positive bacteria (outer membrane acts as permeability barrier; Giovanella et al. 2017).
H1b: Anaerobes and facultative anaerobes show higher tolerance to redox-active metals (Cu, Cr, U, Fe) due to dissimilatory metal reduction (Lovley 1993; Wang et al. 2018). Test against metal-specific genes (from counter_ion_effects corrected scores) to avoid shared-stress confounding.
H1c: Metabolite utilization breadth (number of positive BacDive metabolite tests) positively correlates with metal tolerance score (metabolic versatility co-occurs with metal resistance gene clusters; Martin-Moldes et al. 2015).
H1d: Catalase-positive organisms show higher tolerance to redox-cycling metals (Cu, Cr, Fe) via ROS detoxification. Test against metal-specific genes to separate from general stress response.
H1e: Urease-positive organisms show higher nickel tolerance (urease is a nickel-dependent enzyme; urease-positive organisms have nickel import/handling machinery).
H1f: H₂S-producing organisms (sulfate/thiosulfate reducers) show higher tolerance to chalcophilic metals (Zn, Cu, Cd, Pb) via metal sulfide precipitation.

Expected effect sizes

The bacdive_metal_validation project found Cohen's d = +1.00 for isolation source (heavy metal contamination → metal tolerance score). BacDive phenotypes are less directly related to metal tolerance than isolation from metal-contaminated environments, so we expect moderate effect sizes:
- Gram stain, oxygen tolerance: d = 0.2–0.5 (small to moderate)
- Urease/H₂S (mechanistically specific): d = 0.3–0.7 (moderate)
- Metabolite breadth (indirect proxy): r = 0.1–0.3 (weak to moderate)

Approach

Two-scale design

Direct validation (n=12): 12 Fitness Browser organisms match BacDive strains by NCBI taxonomy ID (species-level match, not strain-level — see Caveats). Present as descriptive case studies, not formal hypothesis tests, given the low n.
Pangenome-scale prediction (n≈TBD): Link BacDive strains to pangenome species via genome accessions (27,502 GCA accessions). Test phenotype → metal tolerance score (from Metal Fitness Atlas) associations. Actual post-matching sample size must be computed in NB01 before proceeding — estimate 3,000-5,000 species but this is unverified.

Phenotype feature matrix

Value encoding rules (addresses four-value utilization and non-binary enzyme activity):
- metabolite_utilization.utilization: Map + → positive, - → negative, produced → positive, +/- → exclude (ambiguous). Drop excluded values before computing utilization percentages, so that only explicit results remain.
- enzyme.activity: Map + → positive, - → negative, +/- → exclude (ambiguous). Other values (e.g., "variable") also excluded.
- physiology fields: Use measured values only (gram_stain, oxygen_tolerance, motility). Do NOT fill gaps with AI-predicted columns (predicted_gram, predicted_motility, predicted_oxygen). Predicted values could introduce model-dependent bias, especially if the prediction models used genomic features correlated with metal tolerance.
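
The encoding rules above can be sketched as a pair of lookup tables. This is a minimal illustration, not the project's notebook code: the names `UTILIZATION_MAP`, `ENZYME_MAP`, and `encode` are hypothetical, and any value missing from a table (including "+/-" and "variable") is treated as excluded.

```python
from typing import Dict, Optional

# Hypothetical lookup tables mirroring the value encoding rules above.
UTILIZATION_MAP: Dict[str, bool] = {"+": True, "-": False, "produced": True}
ENZYME_MAP: Dict[str, bool] = {"+": True, "-": False}


def encode(value: str, mapping: Dict[str, bool]) -> Optional[bool]:
    """Return True/False for an explicit result, or None if the value is excluded."""
    return mapping.get(value.strip())


raw_utilization = ["+", "-", "+/-", "produced"]
encoded_utilization = [encode(v, UTILIZATION_MAP) for v in raw_utilization]

raw_enzyme = ["+", "-", "+/-", "variable"]
encoded_enzyme = [encode(v, ENZYME_MAP) for v in raw_enzyme]

# Keep only explicit calls before computing percentages or counts.
explicit_utilization = [v for v in encoded_utilization if v is not None]
```

Returning `None` for excluded values keeps "ambiguous" distinct from "negative", so downstream fractions are computed over explicit calls only.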

Feature	Source	Strain Coverage	Type	Encoding
Gram stain	`physiology.gram_stain`	15,296 strains	Binary	negative/positive (exclude "variable")
Oxygen tolerance	`physiology.oxygen_tolerance`	23,252 strains	Categorical (5 levels)	aerobe/anaerobe/microaerophile/facultative/obligate
Cell shape	`physiology.cell_shape`	14,990 strains	Categorical	Top 5 shapes; others → "other"
Motility	`physiology.motility`	13,759 strains	Binary	yes/no
Catalase activity	`enzyme` (catalase)	16,907 tests	Binary	+ or - only
Oxidase activity	`enzyme` (oxidase)	17,723 tests	Binary	+ or - only
Urease activity	`enzyme` (urease)	30,875 tests	Binary	+ or - only
Nitrate reduction	`metabolite_utilization` (nitrate)	27,388 tests	Binary	+ or - only (exclude "+/-")
H₂S production	`metabolite_utilization` (H₂S)	5,322 tests	Binary	"produced" → positive; else negative
Acetate utilization	`metabolite_utilization` (acetate)	2,199 tests	Binary	+ or - only
Metabolite breadth	`metabolite_utilization`	988K tests	Continuous	Count of explicit "+" results per strain
Enzyme breadth	`enzyme`	643K tests	Continuous	Count of explicit "+" results per strain
Isolation source	`isolation.cat1`	57,935 strains	Categorical	#Host/#Environmental/#Engineered

Species-level aggregation

Categorical features (Gram, oxygen): Majority vote with agreement score (fraction of strains with majority call). Species with agreement < 0.6 flagged as "variable" and analyzed separately.
Binary features (catalase, urease, etc.): Fraction positive across strains. Binarize at species level using 0.5 threshold, but retain continuous fraction for sensitivity analysis.
Continuous features (breadth): Mean across strains within species.
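
A minimal sketch of the three aggregation rules, using only the standard library. The function names are hypothetical, and treating a fraction exactly equal to 0.5 as positive is an assumption, since the plan does not say which side the threshold falls on.

```python
from collections import Counter
from statistics import mean


def majority_vote(calls, min_agreement=0.6):
    """Return (species-level call, agreement score) from strain-level calls."""
    label, count = Counter(calls).most_common(1)[0]
    agreement = count / len(calls)
    # Species below the agreement cutoff are flagged rather than forced to a call.
    return (label if agreement >= min_agreement else "variable"), agreement


def fraction_positive(results, threshold=0.5):
    """Return (fraction of positive strains, binarized species-level call)."""
    fraction = sum(results) / len(results)
    # Assumption: a fraction exactly at the threshold counts as positive.
    return fraction, fraction >= threshold


gram_call = majority_vote(["negative", "negative", "positive"])
oxygen_call = majority_vote(["aerobe", "anaerobe"])
urease_call = fraction_positive([True, False, False, True, True])
breadth = mean([12, 15, 9])  # continuous features: mean across strains
```

Here the two-strain oxygen example has agreement 0.5, below the 0.6 cutoff, so it is flagged as "variable" instead of receiving a majority call.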

Metal tolerance metrics

Composite: Metal Fitness Atlas metal_score_norm (27,702 pangenome species)
Per-metal corrected: Use metal-specific gene counts from counter_ion_effects/data/corrected_metal_conservation.csv where available, to test H1b/H1d against metal-specific (not shared-stress) signal
Per-metal direct: Fitness scores from 14 metals across 12 FB-BacDive matched organisms

Circular reasoning mitigation

The Metal Fitness Atlas metal_score_norm is derived from gene functional signatures (COG/KEGG/SEED/domain annotations in pangenome clusters). If BacDive phenotypes are phylogenetically correlated with the same functional signatures, associations could be indirect. Mitigation strategies:
1. Control for metal resistance gene count: Include n_metal_clusters from the atlas as a covariate. If phenotypes predict metal tolerance beyond what the gene count already explains, the association is biologically meaningful.
2. Phylogenetic blocking in CV: Ensure train/test splits don't share genera (prevents phylogenetic leakage).
3. Partial correlation: Report both raw and partial correlations (controlling for phylum + n_metal_clusters).
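
Phylogenetic blocking (strategy 2) can be illustrated with a small standard-library sketch. The function `genus_blocked_folds` and the greedy balancing rule are assumptions for illustration, not the notebook's implementation; scikit-learn's `GroupKFold` offers the same guarantee (no group split across folds) if that library is available.

```python
from collections import defaultdict


def genus_blocked_folds(genera, n_folds=5):
    """Assign sample indices to folds so that no genus appears in two folds."""
    by_genus = defaultdict(list)
    for index, genus in enumerate(genera):
        by_genus[genus].append(index)

    folds = [[] for _ in range(n_folds)]
    # Greedy balance: place the largest genera first, each into the smallest fold.
    for genus in sorted(by_genus, key=lambda g: (-len(by_genus[g]), g)):
        smallest = min(range(n_folds), key=lambda k: len(folds[k]))
        folds[smallest].extend(by_genus[genus])
    return folds


# Toy genus labels, one per species row.
genera = ["Pseudomonas"] * 3 + ["Bacillus"] * 2 + ["Streptomyces"] * 2 + ["Ralstonia"]
folds = genus_blocked_folds(genera, n_folds=2)

# For every genus, collect the set of folds it landed in (should be exactly one).
folds_per_genus = {
    g: {k for k, fold in enumerate(folds) for i in fold if genera[i] == g}
    for g in set(genera)
}
```

Each fold is then used once as the test set while the remaining folds form the training set, so a model never sees a test genus during training.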

Revision History

v1 (2026-03-10): Initial plan
v2 (2026-03-10): Addressed plan review — added value encoding rules, coverage waterfall, agreement scores, class-level stratification, circular reasoning mitigation, descriptive framing for n=12 validation, expected effect sizes, cross-project data reuse conventions, go/no-go gate

Overview

The Metal Fitness Atlas scored 27,702 pangenome species for metal tolerance using gene functional signatures validated against RB-TnSeq fitness data. BacDive provides standardized phenotypic measurements (Gram stain, oxygen tolerance, metabolite utilization, enzyme activities) for 97K bacterial strains. This project tests whether these readily measurable phenotypes predict metal tolerance, connecting classical microbiology (culture-based phenotyping) to genomic metal tolerance predictions. A two-scale design examines associations both directly (12 FB organisms matching BacDive) and at pangenome scale (planned as ~3,000-5,000 species linked via genome accessions; the implemented bridge uses species-name matching and yields 5,647 species, of which 3,994 have ≥5 phenotype features).

Key Findings

1. Gram-Negative Bacteria Have Significantly Higher Metal Tolerance Scores (d=-0.61)

Figure: Univariate effect sizes

Gram-negative species have higher metal tolerance scores than Gram-positive species (Cohen's d = -0.61 for Gram-positive relative to Gram-negative, p < 1e-60, n = 3,272 species). This is the largest effect among the features that pass FDR correction. However, the effect cannot be tested within taxonomic classes because Gram status essentially does not vary within a class: it is almost entirely a between-lineage signal (Gram-positive Actinomycetes vs Gram-negative Proteobacteria). The association is mechanistically plausible, since the Gram-negative outer membrane provides a permeability barrier restricting metal cation uptake (Biswas et al. 2021; Paulsen et al. 1997), but it is statistically confounded with phylogeny.

(Notebook: 03_univariate_associations.ipynb)
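
For readers unfamiliar with the effect size used throughout, the following is a minimal sketch of Cohen's d with a pooled standard deviation. The toy scores are invented, and the sign convention (trait-positive group first, so a negative d means trait-positive species score lower) is an assumption inferred from the "Gram stain (+)" row of the results table.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference (group_a minus group_b) with pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Invented toy scores: trait-positive species first, trait-negative second.
d = cohens_d([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

In this toy case the two groups have equal spread and means one pooled standard deviation apart, so d is -1.0.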

2. Seven of Ten Phenotype Features Are Individually Significant After FDR Correction

Seven features pass FDR correction at q < 0.05: Gram stain (d = -0.61), oxidase (d = +0.53), motility (d = +0.35), urease (d = -0.18), enzyme breadth (rho = -0.06), nitrate reduction (d = +0.10), and catalase (d = +0.10). Three do not reach significance: H₂S production (d = -0.87, q = 0.073 — but this effect size estimate is unreliable with only 8 negative controls and likely inflated by small-sample bias), metabolite breadth (rho = -0.01, ns), and acetate utilization (d = 0.005, ns).

(Notebook: 03_univariate_associations.ipynb)
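
The q-values reported in the univariate results table are numerically consistent with a Benjamini-Hochberg adjustment of the ten p-values, although the exact routine used in the notebook is not stated here. A minimal sketch under that assumption (the function name is hypothetical):

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted q-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


# p-values copied from the univariate results table, in table order.
p_table = [4.0e-61, 2.7e-25, 2.2e-23, 3.7e-06, 4.1e-04,
           4.4e-03, 2.8e-02, 5.8e-02, 4.2e-01, 7.9e-01]
q_table = benjamini_hochberg(p_table)
n_significant = sum(q < 0.05 for q in q_table)
```

With these inputs, seven features fall below q = 0.05 and H₂S production (p = 5.8e-02) lands at q ≈ 0.073, matching the table.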

3. Phenotype Features Add Nothing Beyond Taxonomy (Delta R² = -0.009)

Figure: Model comparison

The central finding: taxonomy alone (phylum/class/order) explains 35.4% of metal tolerance variance (R² = 0.354) and phenotype features alone explain 16.3% (R² = 0.163), but combining taxonomy + phenotype yields R² = 0.345, slightly worse than taxonomy alone. The phenotype signal is therefore captured by phylogenetic structure. Gene count alone (n_metal_clusters) explains only R² = 0.063, yet adding it to taxonomy and phenotype raises the full model to R² = 0.633, indicating that the genome-encoded metal resistance repertoire carries information that neither taxonomy nor phenotype provides.

(Notebook: 04_multivariate_model.ipynb)

4. Urease-Positive Organisms Have Lower Metal Tolerance (H1e Reversed)

Urease-positive species have significantly lower metal tolerance scores (d = -0.18, p < 1e-5), opposite to the prediction that urease positivity (requiring nickel import machinery) would confer nickel tolerance. Class-stratified analysis reveals this is driven by Actinomycetes (d = -0.59, p < 1e-16), where urease-positive species form a distinct low-metal-score subgroup. Within Gammaproteobacteria (d = +0.08, ns) and Bacilli (d = +0.06, ns), the effect disappears — further evidence of phylogenetic confounding.

(Notebook: 03_univariate_associations.ipynb)

5. Anaerobe vs Aerobe Difference Is Negligible (H1b Not Supported)

Figure: Feature completeness

Despite 3,751 species with oxygen tolerance data, the anaerobe-aerobe difference in metal tolerance is negligible (d = -0.016, p = 0.55). Facultative anaerobes have the highest mean score (0.221) versus aerobes (0.216) and anaerobes (0.215). The Kruskal-Wallis test across all three groups is marginally significant (H = 8.53, p = 0.014), but the effect is biologically trivial.

(Notebook: 03_univariate_associations.ipynb)

6. SHAP Analysis Confirms Taxonomy and Gene Count Dominate

Figure: SHAP summary

SHAP feature importance from the full XGBoost model shows that taxonomic class/order codes and n_metal_clusters are the top predictors. Phenotype features contribute minimally to individual predictions once taxonomy is included. This confirms that classical microbiology phenotypes are phylogenetic proxies, not independent predictors of metal tolerance.

(Notebook: 04_multivariate_model.ipynb)

Results

Scale of the Analysis

Metric	Value
BacDive strains in bridge	97,334
Matched to pangenome + metal score	37,368 (38.4%)
Unique GTDB species	5,647
Species with ≥5 phenotype features	3,994
Phenotype features tested	10 (8 binary, 2 continuous)
Significant after FDR	7 / 10
Taxonomic classes with ≥50 species	9

Coverage Waterfall

Figure: Coverage waterfall

Feature	BacDive Strains	Matched Strains	Species with Metal Score
Isolation source	43,378	22,581	4,531
Metabolite breadth	29,784	16,454	3,930
Enzyme breadth	28,836	16,005	3,746
Oxygen tolerance	23,252	11,708	3,751
Gram stain	15,194	7,411	3,272
Motility	13,759	6,474	3,138
Nitrate reduction	20,726	11,357	3,088
Urease	24,438	14,343	3,035
Catalase	15,295	7,876	2,930
Oxidase	7,813	5,014	1,799
H₂S production	5,254	2,611	880
Acetate utilization	1,980	475	422

Univariate Association Results

Feature	Hypothesis	Effect	n	p-value	FDR q	Sig
Gram stain (+)	H1a	d = -0.610	3,272	4.0e-61	4.0e-60	Yes
Oxidase (+)	—	d = +0.530	1,799	2.7e-25	1.3e-24	Yes
Motility	—	d = +0.345	3,138	2.2e-23	7.2e-23	Yes
Urease (+)	H1e	d = -0.175	3,035	3.7e-06	9.1e-06	Yes
Enzyme breadth	—	rho = -0.058	3,746	4.1e-04	8.2e-04	Yes
Nitrate reduction	—	d = +0.100	3,088	4.4e-03	7.4e-03	Yes
Catalase (+)	H1d	d = +0.104	2,930	2.8e-02	4.1e-02	Yes
H₂S production	H1f	d = -0.867	880	5.8e-02	7.3e-02	No
Metabolite breadth	H1c	rho = -0.013	3,930	4.2e-01	4.7e-01	No
Acetate utilization	—	d = +0.005	422	7.9e-01	7.9e-01	No

Model Comparison (5-Fold Phylogenetic-Blocked CV)

Model	R²	RMSE	n	Features
Gene count only	0.063	0.045	3,994	1
Phenotype only	0.163	0.043	3,994	13
Taxonomy only	0.354	0.038	3,994	3
Taxonomy + Phenotype	0.345	0.038	3,994	16
Full (all combined)	0.633	0.028	3,994	17
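
The delta R² values quoted in the text follow from this table by subtraction. A short sketch of that arithmetic, with the R² values copied from the rows above (the dictionary keys are illustrative labels):

```python
# Cross-validated R² values from the model comparison table.
r2 = {
    "gene_count": 0.063,
    "phenotype": 0.163,
    "taxonomy": 0.354,
    "taxonomy_phenotype": 0.345,
    "full": 0.633,
}

# Gain from adding phenotype features to a taxonomy-only model.
delta_phenotype = r2["taxonomy_phenotype"] - r2["taxonomy"]

# Gain of the full model over taxonomy alone and over taxonomy + phenotype.
delta_full_vs_taxonomy = r2["full"] - r2["taxonomy"]
delta_full_vs_tax_pheno = r2["full"] - r2["taxonomy_phenotype"]

print(round(delta_phenotype, 3), round(delta_full_vs_taxonomy, 3))
```

Note that the table has no taxonomy + gene count model, so the gain attributed to gene count is inferred from the full model rather than measured in isolation.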

Hypothesis Outcomes

Hypothesis	Prediction	Result
H1a (Gram-neg → metal tolerant)	Gram-neg higher metal scores	Supported univariately (d = -0.61) but phylogenetically confounded
H1b (Anaerobe → redox metals)	Anaerobes tolerate redox metals better	Not supported (d = -0.02, ns)
H1c (Metabolite breadth → tolerance)	Broader metabolism → higher scores	Not supported (rho = -0.01, ns)
H1d (Catalase → redox metals)	Catalase+ tolerate Cu, Cr, Fe	Marginally supported (d = +0.10, q = 0.04)
H1e (Urease → Ni tolerance)	Urease+ handle nickel better	Reversed (d = -0.18); driven by Actinomycetes
H1f (H₂S → chalcophilic metals)	H₂S producers tolerate Zn, Cu, Cd	Unreliable (d = -0.87 but only 8 negatives — effect size likely inflated by small-sample bias)

Direct FB-BacDive Validation (n = 12)

Figure: FB-BacDive phenotype table

Twelve Fitness Browser organisms match BacDive by taxonomy ID (6 unique species: Cupriavidus basilensis, Methanococcus maripaludis, Ralstonia solanacearum, Pseudomonas simiae, Azospirillum brasilense, Pseudomonas fluorescens). All Gram-typed organisms are Gram-negative, precluding within-set testing of H1a. All urease-typed organisms are urease-negative yet are routinely tested against nickel, consistent with the pangenome-scale finding that urease status does not predict metal tolerance. The single anaerobe (Methanococcus) has only 1 metal tested versus 4–5 for aerobes, but n = 1 is not interpretable.

(Notebook: 05_fb_bacdive_case_studies.ipynb)

Interpretation

The Phylogenetic Confounding Wall

The central result is a cautionary tale for phenotype-based prediction in microbial ecology: classical microbiology phenotypes (Gram stain, oxygen tolerance, enzyme activities) capture real biological differences in metal tolerance, but these differences are entirely explained by phylogenetic structure. Adding phenotype features to a taxonomy-based model actually decreases predictive power slightly (delta R² = -0.009), indicating that phenotypes are noisier proxies for taxonomy, not independent predictors.

This is consistent with Goberna & Verdú (2016), who showed through meta-analysis that bacterial trait-environment associations routinely reflect shared ancestry rather than direct ecological adaptation. Van Assche & Álvarez-Pérez (2017) found phylogenetic signal values up to λ = 0.98 for chemical sensitivity traits in Acinetobacter, directly analogous to our findings.

Why Phenotypes Fail as Metal Tolerance Predictors

Gram stain is taxonomy, not a trait: The Gram-negative advantage in metal tolerance is real (outer membrane barrier; Paulsen et al. 1997) but indistinguishable from "being a Proteobacterium" in a predictive model. There are no Gram-positive Proteobacteria and no Gram-negative Actinomycetes in our data.
Urease reversal reflects lineage composition: The surprising association between urease negativity and higher metal tolerance is driven by Actinomycetes (d = -0.59), where urease-positive species are a phylogenetically distinct subgroup with low metal scores. Within Gammaproteobacteria (d = +0.08) and Bacilli (d = +0.06), the urease effect is small and not significant.
Catalase exhibits Simpson's paradox: The overall catalase effect is d = +0.10 (catalase-positive species have slightly higher metal scores), marginally supporting H1d. But class-stratified analysis reveals the opposite direction within the major classes reported here: catalase-negative species score higher within Actinomycetes (d = -0.62, p < 1e-5), Gammaproteobacteria (d = -0.49, p = 0.004), and Betaproteobacteria (d = -0.51, p = 0.006). The overall positive effect is an artifact of between-class composition: predominantly catalase-positive lineages (Proteobacteria) happen to have higher metal scores than predominantly catalase-negative lineages. As with urease, the pooled effect reflects lineage composition rather than within-lineage biology, reinforcing the phylogenetic confounding narrative.
Metabolic breadth is not a metal tolerance proxy: The hypothesis that metabolically versatile organisms harbor more resistance genes (Martin-Moldes et al. 2015) was not supported. Metabolite utilization breadth shows no correlation with metal score (rho = -0.01), likely because metabolic versatility operates in different genomic neighborhoods than metal resistance.
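
The catalase pattern described above is an instance of Simpson's paradox, which a toy example makes concrete. The scores below are synthetic and chosen only to reproduce the sign pattern (negative within each class, positive when pooled); they are not project data, and the class labels are placeholders.

```python
from statistics import mean

# Synthetic metal scores: (catalase-positive scores, catalase-negative scores).
toy = {
    "high_score_class": ([0.30, 0.31, 0.32, 0.33], [0.36, 0.37]),
    "low_score_class": ([0.10, 0.11], [0.14, 0.15, 0.16, 0.17]),
}

# Within each class: mean(positive) minus mean(negative).
within_class_diff = {
    name: mean(pos) - mean(neg) for name, (pos, neg) in toy.items()
}

# Pooled across classes, ignoring class membership.
pooled_pos = [s for pos, _ in toy.values() for s in pos]
pooled_neg = [s for _, neg in toy.values() for s in neg]
pooled_diff = mean(pooled_pos) - mean(pooled_neg)
```

Catalase-positive species score lower inside both toy classes, yet higher overall, because most positives sit in the high-scoring class. Stratifying by class is what exposes the reversal.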

What Does Predict Metal Tolerance

The genome's metal resistance gene content (n_metal_clusters from the Metal Fitness Atlas) is the strongest contributor once taxonomy is accounted for. On its own, gene count explains little variance (R² = 0.063 in the model comparison), less than phenotype (0.163) or taxonomy (0.354) alone. Combined with taxonomy and phenotype, however, the full model achieves R² = 0.63, a gain of +0.28 over taxonomy alone; because phenotype adds nothing to taxonomy (delta R² = -0.009), this gain is attributable to gene count. This aligns with Schwan et al. (2023), who found strong genotype-phenotype concordance for metal resistance gene presence in Salmonella and E. coli.

The Urease-Nickel Paradox

Urease-positive bacteria require nickel import and handling machinery (Maier & Benoit 2019), so they should tolerate nickel — yet they show lower overall metal tolerance. This is not a true paradox: urease positivity correlates with Actinomycetes and other lineages that have fewer metal resistance genes overall. The nickel-handling machinery associated with urease is narrowly specific to nickel and does not confer general metal tolerance. Furthermore, the tight coupling between urease activity and nickel efflux (Stähler et al. 2006) means that nickel homeostasis in urease producers is a balancing act, not a broad tolerance mechanism.

Novel Contribution

First large-scale test of BacDive phenotypes as metal tolerance predictors — 5,647 species, 10 phenotype features, cross-referenced against genome-based metal tolerance scores from the Metal Fitness Atlas.
Clear demonstration of phylogenetic confounding: phenotype features capture real signal (R² = 0.16) but add nothing beyond what taxonomy already provides (delta R² = -0.009).
Urease reversal — urease-positive bacteria have lower metal tolerance, driven by lineage composition, not nickel biology.
Gene count as the key added predictor: the number of metal resistance gene clusters adds +0.28 R² on top of taxonomy, whereas all phenotype features combined add none.

Limitations

Metal Fitness Atlas scores are genome-based predictions, not direct metal tolerance measurements. Circular reasoning is mitigated by controlling for n_metal_clusters in partial correlations, but the phenotype-score associations are ultimately phenotype-to-genome correlations, not phenotype-to-phenotype.
BacDive species name matching links only 38.4% of BacDive strains (37,368/97,334), covering 5,647 of 27,702 GTDB species (20.4%). GCA accession matching could recover additional links but was not implemented.
12-organism direct validation is underpowered — all Gram-typed organisms are Gram-negative, preventing within-set testing of the strongest association.
BacDive testing bias: well-studied organisms (Pseudomonas, E. coli) have many phenotype tests; poorly-studied species have sparse data.
H₂S production is underpowered: only 8 H₂S-negative species in the matched set, making the large effect size (d = -0.87) unreliable.
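
To make the H₂S limitation concrete, an approximate confidence interval for Cohen's d can be computed from the group sizes. This sketch uses the common large-sample standard error for d; the 872/8 split assumes that all 880 tested species other than the 8 negatives are H₂S-positive, and with only 8 species in one group the normal approximation is itself rough (a bootstrap would be a more careful choice).

```python
from math import sqrt


def cohens_d_ci(d, n1, n2, z=1.96):
    """Approximate CI for Cohen's d from the large-sample standard error."""
    se = sqrt((n1 + n2) / (n1 * n2) + d * d / (2 * (n1 + n2)))
    return d - z * se, d + z * se


# Assumed split: 880 species with H2S data, 8 of them negative (so 872 positive).
low, high = cohens_d_ci(-0.867, 872, 8)
print(f"d = -0.867, approx. 95% CI [{low:.2f}, {high:.2f}]")
```

Under these assumptions the interval runs from roughly -1.56 to -0.17, wide enough that the point estimate says little about the true effect size.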

Future Directions

Per-metal phenotype analysis: Test whether specific phenotypes predict tolerance to specific metals (e.g., catalase → copper, urease → nickel) rather than composite metal scores. This would require per-metal scores for the 27K species.
GCA accession matching: The current bridge uses species name matching (38.4% of BacDive strains matched). Adding GCA accession matching could recover 10–30% more species.
Phylogenetic comparative methods: Use PGLS or phylogenetic PCA to formally remove phylogenetic signal before testing phenotype-metal associations.
BacDive AI-predicted phenotypes: The current analysis uses only measured values. Including BacDive's ML-predicted Gram stain, motility, and oxygen tolerance could increase coverage substantially while introducing model-dependent bias.
Experimental validation: The H₂S → chalcophilic metal hypothesis (d = -0.87, underpowered) could be tested experimentally with sulfate-reducing bacteria and zinc/copper challenge.

Data

Sources

Collection	Tables Used	Purpose
`kescience_bacdive`	`strain`, `physiology`, `metabolite_utilization`, `enzyme`, `isolation`, `taxonomy`, `sequence_info`	Phenotype features for 97K strains
`kbase_ke_pangenome`	(via Metal Fitness Atlas scores)	Metal tolerance genome-based predictions for 27,702 species
`kescience_fitnessbrowser`	(via organism mapping + metal experiments)	12 direct FB-BacDive organism matches, metal experiment metadata

Generated Data

File	Rows	Description
`data/bacdive_pangenome_bridge.csv`	97,334	Full BacDive-to-pangenome bridge (all strains, matched or not)
`data/matched_strains.csv`	37,368	Matched BacDive strains with GTDB species and metal scores
`data/coverage_waterfall.csv`	13	Per-feature coverage: strains → matched → species
`data/species_phenotype_matrix.csv`	5,647	Species-level phenotype feature matrix (37 columns)
`data/univariate_results.csv`	10	FDR-corrected univariate association results
`data/stratified_results.csv`	17	Class-stratified binary feature tests
`data/model_comparison.csv`	5	Cross-validated R² for 5 model configurations
`data/fb_bacdive_combined.csv`	12	Combined phenotype + metal data for 12 FB organisms

References

Arkin AP et al. (2018). "KBase: The United States Department of Energy Systems Biology Knowledgebase." Nat Biotechnol 36:566-569. PMID: 29979655
Biswas R et al. (2021). "Overview on the role of heavy metals tolerance on developing antibiotic resistance in both Gram-negative and Gram-positive bacteria." Arch Microbiol 203:2275. DOI: 10.1007/s00203-021-02275-w
Goberna M, Verdú M (2016). "Predicting microbial traits with phylogenies." ISME J 10:959-967. DOI: 10.1038/ismej.2015.171
Koblitz J et al. (2025). "Predicting bacterial phenotypic traits through improved machine learning using high-quality, curated datasets." Commun Biol. DOI: 10.1038/s42003-025-08313-3
Maier RJ, Benoit SL (2019). "Role of Nickel in Microbial Pathogenesis." Inorganics 7:80. DOI: 10.3390/inorganics7070080
Martin-Moldes Z et al. (2015). "Whole-genome analysis of Azoarcus sp. strain CIB provides genetic insights to its different lifestyles." Syst Appl Microbiol 38:462-471.
Oleńska E et al. (2025). "Molecular mechanisms of bacterial survival strategies under heavy metal stress." Int J Mol Sci 26:5716. DOI: 10.3390/ijms26125716
Paulsen IT et al. (1997). "The SMR family: a novel family of multidrug efflux proteins." FEMS Microbiol Lett 156:1-8.
Price MN et al. (2018). "Mutant phenotypes for thousands of bacterial genes of unknown function." Nature 557:503-509. PMID: 29769716
Reimer LC et al. (2022). "BacDive in 2022: the knowledge base for standardized bacterial and archaeal data." Nucleic Acids Res 50:D741-D746.
Schwan CL et al. (2023). "Genotypic and phenotypic characterization of antimicrobial resistance and metal tolerance in Salmonella and E. coli." J Food Prot 86:100113. PMID: 37290750
Stähler FN et al. (2006). "The CznABC metal efflux pump is required for cadmium, zinc, and nickel resistance, urease modulation, and gastric colonization." Infect Immun 74:3845-3852.
Van Assche A, Álvarez-Pérez S (2017). "Phylogenetic signal in carbon source assimilation profiles and tolerance to chemical agents." Appl Microbiol Biotechnol 101:3049-3061. DOI: 10.1007/s00253-016-7866-0

Suggested Experiments

Testing whether urease+ organisms are specifically nickel-tolerant (rather than generally metal-tolerant) requires per-metal fitness profiling of urease+ vs urease- bacteria from the same taxonomic class, ideally Gammaproteobacteria (where the pangenome-scale urease effect is near zero). RB-TnSeq with matched urease+/- strains under nickel vs other metals would directly test whether urease enables nickel-specific tolerance.

Data Collections

Pangenome Collection: kbase_ke_pangenome (KBase, DOE)
Fitness Browser: kescience_fitnessbrowser (Price Lab, LBNL)

Review

Suggestions

Implement GCA accession matching as a sensitivity analysis: Even a brief check (e.g., "using GCA matching recovers X additional species; re-running the model comparison yields delta R² = Y, consistent with the species-name-only result") would substantially strengthen the robustness claim. The pitfalls doc already provides the prefix-handling recipe (GB_GCA_* / RS_GCF_*).
Add confidence intervals to Cohen's d values: Especially important for H₂S (n_neg = 8), where the 95% CI on d = -0.87 likely spans from near zero to well beyond -1.5. This would visually reinforce the "unreliable" characterization in the forest plot.
Test H1b and H1d against per-metal scores as planned: Even a brief analysis using counter_ion_effects/data/corrected_metal_conservation.csv for copper/chromium-specific scores would either strengthen or definitively close these hypotheses.
Add a brief note in NB04 explaining why the analysis sample size (3,994) differs from NB02's "species with >= 5 features" count (3,437). One sentence referencing the oxygen dummy variable expansion would suffice.
Explain omission of cell_shape and isolation_source from the multivariate models. If they were dropped for methodological reasons (e.g., high cardinality, redundancy with taxonomy), state this explicitly. If it was an oversight, consider adding them.
Consider a Mantel test or PGLS as a complementary approach to the XGBoost model comparison. These are standard in phylogenetic comparative biology and would provide a more formal test of phylogenetic confounding than the delta R² approach alone.
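
For the GCA accession matching suggested in the first item, one possible normalization is sketched below. It is not the recipe from the pitfalls doc, which is not reproduced here. The function name is hypothetical, and the sketch assumes that GenBank (GCA) and RefSeq (GCF) copies of an assembly share the same nine-digit identifier, and that matching across assembly versions is acceptable.

```python
import re
from typing import Optional

# Optional GTDB-style prefix (GB_ or RS_), then GCA/GCF, nine digits, optional version.
_ACCESSION = re.compile(r"^(?:GB_|RS_)?GC[AF]_(\d{9})(?:\.\d+)?$")


def assembly_key(accession: str) -> Optional[str]:
    """Return the nine-digit assembly identifier, or None if the string is malformed."""
    match = _ACCESSION.match(accession.strip())
    return match.group(1) if match else None


# Hypothetical join: a BacDive GCA accession against a prefixed GTDB genome ID.
bacdive_key = assembly_key("GCA_000006765.1")
gtdb_key = assembly_key("RS_GCF_000006765.1")
same_assembly = bacdive_key is not None and bacdive_key == gtdb_key
```

Joining on this key rather than on the raw string avoids silent misses caused by the `GB_`/`RS_` prefixes; whether version differences should count as a match is a judgment call for the sensitivity analysis.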