Microbial Discovery Forge

Amr Environmental Resistome March 2026

Resistance mechanism composition is strongly environment-dependent across 14,723 species

Efflux pumps constitute 21% of AMR in human gut species but only 1% in aquatic (η²=0.127). Metal resistance constitutes 45% in soil/aquatic but 6% in human gut (η²=0.107). Clinical species have 68% accessory (acquired) AMR vs soil 43%. Within 823 multi-environment species, clinical strain fraction predicts total AMR (rho=0.465, p=2.2e-45). K. pneumoniae has 1,115 AMR clusters (only 7 core) across 13,637 genomes. All findings survive family-level phylogenetic control (20/141 families significant). This is the largest genomic AMR-environment analysis to date, extending Gibson et al. (2015, 6K genomes) by 50×.

Amr Cofitness Networks March 2026

AMR support networks enriched for flagellar motility and amino acid biosynthesis

Cofitness analysis of 801 AMR genes across 28 organisms reveals that AMR cofitness neighborhoods are enriched for flagellar motility (GO:0071973, 5 organisms FDR<0.05), flagellum assembly (GO:0044780, 5 orgs), histidine biosynthesis (GO:0000105, 3 orgs), and tryptophan biosynthesis (GO:0000162, 3 orgs). This enrichment is undetectable with old FB SEED annotations (0/280 significant) — only InterProScan GO (68% coverage) reveals it. Support networks are organism-specific (cross-mechanism Jaccard 0.375 >> within-mechanism 0.207, MWU p=4.3e-13). AMR genes are in larger-than-average ICA modules (median 46 vs 27, p=1.7e-8). The co-regulation with motility and biosynthesis suggests AMR costs reflect competition for the proton motive force.

Amr Fitness Cost March 2026

AMR genes impose a universal +0.086 fitness cost across 25 bacteria

Random-effects meta-analysis of 801 AMR genes across 25 organisms shows a pooled fitness shift of +0.086 [+0.074, +0.098] when AMR genes are knocked out — all 25/25 organisms positive. This cost is mechanism-independent (KW p=0.89), conservation-independent (core = accessory, p=0.33), and tier-independent (bakta_amr = keyword, p=0.26). The uniformity suggests compensatory evolution has equalized costs to an irreducible floor. However, mechanism strongly predicts conservation status (metal resistance 44% accessory vs efflux 13%, χ²=69.3, p=1.4e-13) — acquisition history, not cost, determines whether AMR genes are core or accessory. Efflux genes (broad-spectrum) show a stronger antibiotic-dependent fitness flip than enzymatic inactivation genes (narrow-spectrum): +0.094 vs −0.001, MWU p=0.007.

Amr Strain Variation March 2026

Acquired AMR genes track phylogeny more strongly than intrinsic

Mantel tests across 1,261 species show non-core (acquired) AMR genes have stronger phylogenetic signal (median r=0.222) than core (intrinsic) genes (median r=0.117), paired t-test p=7.0e-16. This contradicts the standard model that acquired resistance is phylogenetically random via HGT. Instead, once a lineage acquires resistance elements, they are stably maintained and vertically inherited, creating clonal AMR lineages. Note: core genes have near-zero Jaccard variance by definition (>=95% prevalence), which partly suppresses their distance-based signal.

Amr Strain Variation March 2026

Over half of within-species AMR genes are rare (<5% prevalence)

Across 1,305 species and 180,025 genomes, 51.3% of AMR gene-species occurrences are rare (<=5% prevalence), 41.3% variable (5-95%), and only 7.5% fixed (>=95%). Median pairwise Jaccard distance between strains = 0.435 — strains within the same species share less than 60% of their AMR repertoire. Atlas Core genes are 77% fixed within species; Singletons are 79% rare — validating the cross-species conservation classification at strain resolution.

Amr Strain Variation March 2026

Resistance islands are widespread — 54% of species, mean 6.2 genes

1,517 resistance islands detected across 705/1,305 species, with mean phi coefficient = 0.827 (very tight co-occurrence). 88% contain multiple resistance mechanisms (efflux + enzymatic inactivation most common). Maximum island = 43 genes. These multi-mechanism islands provide coordinated defense against multiple drug classes.

Amr Strain Variation March 2026

AMR variability weakly anti-correlates with pangenome openness

Spearman rho = -0.193, p = 2.2e-12. Species with more open pangenomes have slightly lower AMR variability — unexpected. Likely because open-pangenome species accumulate more rare/singleton AMR genes below the 5% threshold, deflating the variability index (which measures the 5-95% zone).

Truly Dark Genes March 2026

Only 16.3% of "dark matter" resists modern annotation

Of 39,532 Fitness Browser dark genes with pangenome links, bakta v1.12.0 reclassifies 33,105 (83.7%). Just 6,427 are "truly dark" — both FB and bakta agree: hypothetical protein. Truly dark genes are structurally distinct: shorter (121 vs 194 aa), less conserved (43% vs 73% core), fewer orthologs (29% vs 64%), higher GC deviation (d=0.247). These properties are consistent with recent HGT outpacing annotation databases.

Truly Dark Genes March 2026

Truly dark genes cluster in genomic "dark islands"

41% of neighbors of truly dark genes are also hypothetical, 12% are within 2 genes of mobile elements (transposases, integrases, phage proteins). This suggests dark genes concentrate in recently acquired genomic islands rather than being randomly distributed.

Truly Dark Genes March 2026

Stress enrichment hypothesis rejected — nutrient/community enrichment instead

Contrary to expectation, truly dark genes with strong fitness phenotypes are depleted in stress conditions (OR=0.53, p<0.001) and enriched in nutrient, mixed community, and iron conditions. This suggests novel metabolic or inter-species interaction functions rather than stress responses.

Truly Dark Genes March 2026

96% of truly dark genes have partial annotation clues

Only 246/6,427 (3.8%) truly dark genes have zero annotation clues. 84.7% have database cross-references, 43.5% have eggNOG hits (though 55% of COG assignments are "S"/unknown), 29.3% have cross-organism orthologs, 9.5% are in ICA fitness modules.

Bakta Reannotation March 2026

Bakta and eggNOG are complementary annotation sources

Comprehensive comparison of bakta v1.12.0 vs eggNOG-mapper across 132.5M gene clusters:
- EggNOG wins on enrichment-style annotations: COG (51% vs 8.2%), KEGG (38.5% vs 17.3%), Pfam (63% vs 7.7%)
- Bakta wins on GO terms (15% vs 7.4%), product descriptions (71.2% vs 70.4%), and UniRef50 links (79.2%, unique)
- Union raises coverage from ~70% to 77.3% for "any functional annotation"
- Bakta rescues 11.2M clusters (of 39.2M) that eggNOG misses entirely

Bakta Reannotation March 2026

Bakta's lower COG/KEGG/Pfam coverage is by design, not a bug

Confirmed by bakta developer Oliver Schwengers across GitHub issues #350, #385, #391, #393:
- PSC pre-computed annotations are sparse: only ~14% of PSC entries have COG, ~19% have KEGG. PSC DB is optimized for product descriptions, not functional categories.
- COG 2024 broke direct mapping: COG 2024 no longer provides representative sequences, forcing lossy indirect WP accession→IPS→PSC mapping.
- Pfam skipped by design: HMMER only runs on hypotheticals (~7.7%). The 92% with PSC matches never get Pfam searched.
- DIAMOND fast < HMM sensitivity: bakta's PSC uses DIAMOND fast mode; eggNOG uses HMM-based ortholog assignment.
- Developer planned to integrate eggNOG into PSC representatives (issue #325) — may have partially happened in DB v6.0 but limited impact.
- The Arche benchmark paper (Alonso-Reyes & Albarracin 2025, PMID 40811208) independently confirmed bakta's conservative annotation transfer vs. other tools.

Bakta Reannotation March 2026

UniRef50 bridge has limited value with current BERDL UniProt

Only 33.3% of bakta's 17.6M distinct UniRef50 IDs exist in kbase_uniprot.uniprot_identifier
The bridge reaches RefSeq (71%), OrthoDB (46%), STRING (40%), KEGG (36%) of matched IDs
Missing from bridge: GO, EC, InterPro, Pfam — importing full UniProt would unlock these
For now, eggNOG remains the better source for COG/KEGG/Pfam enrichment analysis

Bakta Reannotation March 2026

Spark broadcast joins fail on uniprot_identifier (2.5B rows)

Any join that touches kbase_uniprot.uniprot_identifier can exceed spark.driver.maxResultSize (1GB) via broadcast exchange. Fix: SET spark.sql.autoBroadcastJoinThreshold = -1 to force sort-merge joins. See data/bakta_reannotation/scripts/compare_bakta_eggnog.py for the pattern.

Ecotype Analysis January 2026

Environment vs Phylogeny: Phylogeny usually dominates

Analysis of 172 species with sufficient data showed:
- Median partial correlation for environment effect: 0.0025
- Median partial correlation for phylogeny effect: 0.0143
- Phylogeny dominates in 60.5% of species
- Environment dominates in 39.5% of species

No significant difference between "environmental" bacteria (where lat/lon is meaningful) and host-associated bacteria (p=0.66).

Ecotype Analysis January 2026

ANI extraction requires per-species iteration

Attempting to query ANI for all 13K+ genomes in a single IN clause fails/times out. Must iterate over species and query ANI for each species separately. See docs/performance.md for pattern.

Pangenome Openness January 2026

Pangenome table has pre-computed stats

The pangenome table already contains no_core, no_aux_genome, no_singleton_gene_clusters, no_gene_clusters - no need to compute from gene_cluster table.

Pangenome Openness January 2026

No correlation between openness and env/phylo effects

Tested whether open pangenomes (low core fraction) show different patterns. Results:
- rho=-0.05, p=0.54 for environment effect
- rho=0.03, p=0.73 for phylogeny effect
- No significant relationship found

Integrity Checks January 2026

12 orphan pangenomes are symbionts

The 12 pangenomes without matching species clades are mostly obligate symbionts:
- Portiera aleyrodidarum (whitefly symbiont)
- Profftella armatura (psyllid symbiont)
- Various uncultivated lineages (UBA, TMED, SCGC prefixes)

These are valid pangenomes but filtered from species metadata due to single-genome status.

Cog Analysis January 2026

Universal functional partitioning in bacterial pangenomes

Analysis of 32 species across 9 phyla (357,623 genes) reveals a remarkably consistent "two-speed genome":

Novel/singleton genes consistently enriched in:
- L (Mobile elements): +10.88% enrichment, 100% consistency across species - STRONGEST SIGNAL
- V (Defense mechanisms): +2.83% enrichment, 100% consistency
- S (Unknown function): +1.64% enrichment, 69% consistency

Core genes consistently enriched in:
- J (Translation): -4.65% enrichment, 97% consistency - STRONGEST DEPLETION
- F (Nucleotide metabolism): -2.09% enrichment, 100% consistency
- H (Coenzyme metabolism): -2.06% enrichment, 97% consistency
- E (Amino acid metabolism): -1.81% enrichment, 81% consistency
- C (Energy production): -1.75% enrichment, 88% consistency

Biological implications:
- Core genes = ancient, conserved "metabolic engine" (translation, energy, biosynthesis)
- Novel genes = recent acquisitions for ecological adaptation (mobile elements, defense, niche-specific)
- Horizontal gene transfer (HGT) is the primary innovation mechanism, not vertical inheritance
- The massive L enrichment (+10.88%) suggests most genomic novelty comes from mobile elements
- Patterns hold universally across bacterial phyla, suggesting deep evolutionary constraint

Hypothesis validation: All 8 predictions from initial N. gonorrhoeae analysis confirmed across 32 species.

This represents a fundamental organizing principle of bacterial pangenome structure.

Cog Analysis January 2026

Composite COG categories are biologically meaningful

Multi-function genes with composite COG assignments (e.g., "LV" = mobile+defense, "EGP" = amino acid+carb+inorganic ion) are not annotation artifacts:

LV (mobile+defense): +0.34% enrichment, 76% consistency
Suggests functional modules like "mobile defense islands"
Should not be filtered out as noise - they represent genuine multi-functional genes

Respiratory Chain Wiring February 2026

ADP1 uses qualitatively different respiratory configurations per carbon source

Each of ADP1's 8 carbon sources requires a distinct set of respiratory chain components. Quinate requires only Complex I (all other components dispensable). Acetate requires Complex I + cytochrome bo3 + ACIAD3522 + more (the most demanding). Glucose requires no specific component (full redundancy). This is not a quantitative gradient — it's qualitatively different wiring per substrate class.

Respiratory Chain Wiring February 2026

NADH flux rate, not total yield, determines Complex I essentiality

Quinate produces fewer NADH per carbon (0.57) than glucose (1.50), yet Complex I is more essential on quinate. The resolution: β-ketoadipate pathway products (succinyl-CoA + acetyl-CoA) enter the TCA cycle simultaneously, creating a concentrated NADH burst that exceeds NDH-2's reoxidation capacity. Glucose distributes NADH across Entner-Doudoroff + TCA steps where NDH-2 suffices.

Respiratory Chain Wiring February 2026

Respiratory chain wiring is metabolic, not transcriptional

Proteomics shows all three NADH dehydrogenases are expressed at similar levels under standard conditions: Complex I mean 27.6 (66th percentile), NDH-2 27.0 (59th), ACIAD3522 26.2 (48th) — all within 1.4 units of each other around the genome median (26.4). The cell doesn't switch respiratory configurations by regulating dehydrogenase expression. Instead, all three run simultaneously and the one that becomes limiting depends on the NADH flux rate from the carbon source. This is a passive, flux-based wiring system rather than an active regulatory switch.

Respiratory Chain Wiring February 2026

ADP1 has three NADH dehydrogenases with non-overlapping condition requirements

Complex I (13 subunits, proton-pumping): quinate-essential, glucose-dispensable. NDH-2 (1 subunit, non-pumping): no growth data but predicted backup for glucose. ACIAD3522 (NADH-FMN oxidoreductase): acetate-lethal (0.013), quinate-fine (1.39). Each handles a different metabolic regime — division of labor in electron transport.

Aromatic Catabolism Network February 2026

Aromatic catabolism requires 7× more support genes than pathway genes

The β-ketoadipate pathway in ADP1 has 8 core genes, but 51 genes total show quinate-specific growth defects — a 7:1 support-to-pathway ratio. The support genes organize into 3 biochemically rational subsystems: Complex I (21 genes, NADH reoxidation), iron acquisition (7 genes, Fe²⁺ for ring-cleavage dioxygenase), and PQQ biosynthesis (2 genes, cofactor for quinate dehydrogenase). This quantifies a general principle: the metabolic infrastructure required to run a pathway dwarfs the pathway itself.

Aromatic Catabolism Network February 2026

FBA models miss 59% of condition-specific essential genes

Of the 51 quinate-specific genes, 30 (59%) have no FBA reaction mappings — PQQ biosynthesis, iron acquisition, transcriptional regulators, and respiratory chain components are invisible to the metabolic model. For Complex I specifically, FBA predicts 1.76× higher flux on aromatics but 0% essentiality, because FBA's linear programming cannot represent the threshold effect of losing a single subunit from a multi-subunit complex.

Aromatic Catabolism Network February 2026

Complex I dependency is on NADH flux, not aromatic chemistry

Cross-species ortholog-transferred fitness data shows Complex I defects are largest on acetate (-1.55 vs background) and succinate (-1.39), not aromatics. ADP1's quinate-specificity likely reflects an alternative NADH dehydrogenase (NDH-2) that compensates on lower-NADH-flux substrates. The "aromatic support network" is really a "high-NADH-flux bottleneck network."

Aromatic Catabolism Network February 2026

Two DUF proteins are candidate Complex I accessory factors

ACIAD3137 (UPF0234/YitK) and ACIAD2176 (DUF2280) correlate at r > 0.98 with Complex I genes across 8 growth conditions. Both lack FBA reaction mappings and have no assigned metabolic function. Their near-perfect co-fitness with the nuo operon suggests physical or regulatory association with Complex I.

Respiratory Chain Wiring February 2026

ACIAD3522 is the most condition-specific gene in the ADP1 genome

ACIAD3522 (NADH-FMN oxidoreductase) has a growth ratio of 0.013 on acetate — essentially lethal — while showing no defect on quinate (1.39) or glucose (1.39). This is the largest condition-specific effect of any single gene in the 2,034-gene growth matrix. FBA predicts zero flux through ACIAD3522 on all conditions. Its biological role is unknown but the extreme acetate specificity suggests it catalyzes an NADH-linked reaction that is uniquely required for acetate catabolism.

Respiratory Chain Wiring February 2026

NDH-2 compensation hypothesis not supported cross-species

After correcting NDH-2 false positives (filtering text-matched candidates to organisms with ≤2 hits per genome), the cross-species compensation test reverses: organisms WITH validated NDH-2 show LARGER Complex I aromatic deficits (mean -0.297) than those without (-0.156, p=0.52). The initial analysis (10/14 organisms with NDH-2, apparent support for compensation) was driven by misannotated Complex I subunits leaking through the text filter in 5 organisms. This demonstrates the importance of validating text-based gene identification — a finding that looked supportive was actually an annotation artifact.

Adp1 Deletion Phenotypes February 2026

625 condition-specific genes map precisely to expected metabolic pathways

31% of dispensable genes have condition-specificity scores ≥ 1.0. Top condition-specific genes for each carbon source are the enzymes biochemically predicted to be required: urease subunits for urea, protocatechuate 3,4-dioxygenase for quinate, Entner-Doudoroff enzymes for glucose, glyoxylate shunt for acetate, lactate-responsive regulator for lactate. This validates the growth matrix as a high-quality functional genomics dataset.

Adp1 Deletion Phenotypes February 2026

ADP1 phenotype landscape is a continuum, not discrete modules

Hierarchical clustering of 2,034 genes by their 8-condition growth profiles produces an optimal K=3 with silhouette=0.24 — no discrete functional modules. The phenotype landscape is a gradient, with one exception: 24 genes form a tight quinate-specific module (the aromatic degradation pathway). Gene essentiality varies continuously across conditions, supporting the Guzman et al. (2018) "adaptive flexibility" framework.

Adp1 Deletion Phenotypes February 2026

Missing dispensable genes are shorter, less annotated, and less conserved

Of 2,593 TnSeq-dispensable genes, 272 (10.5%) lack growth data from the deletion collection. These are systematically different: shorter (813 vs 981 bp), less annotated (91% vs 100% RAST), and less conserved in the pangenome (76.5% core vs 93.3%, p=1.4e-20). Hypothetical proteins are massively enriched (q=2.4e-25). The deletion collection has a bias toward well-characterized, conserved genes.

Fitness Modules February 2026

ICA reliably decomposes fitness data into biologically coherent modules

Robust ICA (30-50 FastICA runs + DBSCAN clustering) consistently finds 17-52 stable modules per organism across 32 bacteria. 94.2% of modules show significantly elevated within-module cofitness (Mann-Whitney U, p < 0.05; mean |r| = 0.34 vs background 0.12). Genomic adjacency enrichment averages 22.7× across organisms, confirming modules capture operon-like co-regulated gene groups.

Fitness Modules February 2026

Membership thresholding is the critical parameter, not ICA itself

The initial D'Agostino K² normality-based thresholding gave 100-280 genes per module — biologically meaningless. The ICA components themselves were fine; the problem was deciding which genes "belong" to each module. Switching to an absolute weight threshold (|Pearson r| ≥ 0.3 with module profile, max 50 genes) reduced modules to 5-50 genes and dramatically improved all validation metrics:
- Cofitness enrichment: 59% → 94.2%
- Within-module |r|: 0.047 → 0.34 (mean across 32 organisms)
- Modules became biologically coherent (22.7× genomic adjacency enrichment)

Fitness Modules February 2026

Organisms with <200 experiments still produce valid modules

Caulo (198 experiments) showed the weakest signal in the pilot (2.9x correlation enrichment vs 31x for DvH), but organisms down to 104 experiments (Ponti) still produced stable modules. The key is capping components at 40% of experiments — higher ratios cause FastICA convergence failures and extreme slowness.

Fitness Modules February 2026

Fitness Browser schema documentation has inaccuracies

Several table schemas differ from the documented schema:
- keggmember uses keggOrg/keggId (not orgId/locusId) — must join through besthitkegg
- kgroupec uses ecnum (not ec)
- seedclass has orgId, locusId, type, num (not subsystem/category hierarchy)
- fitbyexp_* tables are long format (not pre-pivoted as documented)

Fitness Modules February 2026

Spark is accessible from CLI via berdl_notebook_utils

from berdl_notebook_utils.setup_spark_session import get_spark_session works from regular Python scripts on JupyterHub — not just notebook kernels. This enables running full analysis pipelines from the command line without jupyter nbconvert. The auto-import in /configs/ipython_startup/00-notebookutils.py only affects notebook kernels.

Fitness Modules February 2026

Pan-bacterial module families exist across diverse phyla

Cross-organism module alignment using BBH ortholog fingerprints revealed 156 module families spanning 2+ organisms. The largest family spans 21 of 32 organisms across Proteobacteria, Bacteroidetes, Firmicutes, and Archaea — evidence of deeply conserved fitness regulons. 28 families span 5+ organisms.

This required using orthologs from ALL organisms in the analysis — an initial run using only 5-organism orthologs found just 27 families with no family spanning more than 4 organisms. The ortholog graph density is critical.

Fitness Modules February 2026

Ortholog scope dramatically affects cross-organism analysis

Using BBH pairs from 5 organisms (10K pairs, 1,861 OGs) vs all 32 organisms (1.15M pairs, 13,402 OGs) produced radically different results:
- Module families: 27 → 156 (6x)
- Families spanning 5+ orgs: 0 → 28
- Family-backed predictions: 31 (4%) → 493 (56%)

Lesson: always extract orthologs for ALL organisms in the analysis, not just a pilot subset. The ortholog graph is not additive — adding organisms creates new transitive connections.

Fitness Modules February 2026

PFam domains are essential for module enrichment — KEGG KOs are too fine-grained

The initial enrichment pipeline (KEGG + SEED + TIGRFam, min_annotated=3) annotated only 8.2% of modules (92/1,116). Root cause: KEGG KO groups average ~1.2 genes per term, so modules with 5-50 members almost never have 3+ genes sharing the same KO. Adding PFam domains (which have 814 terms with 2+ genes vs TIGRFam's 88) and lowering the overlap threshold to 2 increased annotation to 79.7% (890/1,116). PFam is the dominant annotation source — it provides domain-level functional labels that match module granularity.

Impact on downstream analysis:
- Predictions: 878 → 6,691 (7.6×), now covering all 32 organisms
- Annotated families: 32 → 145 out of 156 (93%)
- Three organisms (Caulo, Methanococcus_S2, Korea) went from 0 enrichments to 24-29 enriched modules

Fitness Modules February 2026

Module-ICA is complementary to sequence-based methods, not competitive

Held-out KO prediction benchmark across 32 organisms showed ortholog transfer dominates at gene-level function prediction (95.8% precision, 91.2% coverage), while Module-ICA has <1% precision at the KO level. This is expected and informative, not a failure: modules capture process-level co-regulation (validated by 94.2% cofitness enrichment and 22.7× adjacency enrichment), not specific molecular function. An ABC transporter module correctly groups binding, permease, and ATPase subunits — but each has a different KO. The right framing for module-based predictions is "involved in [biological process]" not "has function [specific KO]."

Fitness Modules February 2026

Enrichment min_annotated threshold must match annotation granularity

The min_annotated parameter (minimum overlapping genes to test for enrichment) must be calibrated to the annotation database's granularity. For gene-specific annotations (KEGG KOs: ~1 gene/term), min_annotated=3 eliminates nearly all tests. For domain-level annotations (PFam: many genes share domains), min_annotated=2 works well with FDR correction. Fisher's exact test handles small counts correctly — the statistical validity comes from FDR correction across all tests, not from requiring large overlaps per test.

Conservation Vs Fitness February 2026

Essential genes are modestly enriched in core pangenome clusters

Across 33 diverse bacteria, putative essential genes (no transposon insertions in RB-TnSeq) are 86.1% core vs 81.2% for non-essential genes (median OR=1.56, 18/33 significant after BH-FDR). The enrichment is real but modest — most genes in well-characterized bacteria are core regardless of essentiality. The signal is strongest in organisms with larger clades (more genomes = more reliable core classification).

Conservation Vs Fitness February 2026

Essential-core genes are functionally distinct from essential-auxiliary

Essential genes that map to core clusters are 41.9% enzymes and only 13.0% hypothetical — they are the well-characterized metabolic backbone (ribosomes, DNA replication, cell wall, cofactor biosynthesis). Essential-auxiliary genes (essential but not in all strains) are only 13.4% enzymes and 38.2% hypothetical. Essential-unmapped genes (strain-specific, no pangenome match) are 44.7% hypothetical — prime targets for functional discovery.

Conservation Vs Fitness February 2026

SEED functional categories distinguish essential from non-essential

Essential-core genes are enriched in Protein Metabolism (+13.7 pp vs non-essential), Cofactors/Vitamins (+6.2 pp), Cell Wall (+3.9 pp). They are depleted in Carbohydrates (-7.9 pp), Amino Acids (-5.6 pp), Membrane Transport (-4.0 pp) — functions that are conditionally important rather than universally essential. This aligns with Rosconi et al. (2022) who found pan-genome composition influences essentiality in S. pneumoniae.

Conservation Vs Fitness February 2026

FB aaseqs download uses different locus tags than the gene table for some organisms

The Fitness Browser aaseqs file (fit.genomics.lbl.gov/cgi_data/aaseqs) uses RefSeq-style locus tags (e.g., ABZR86_RS) for some organisms, while the FB gene table uses the original annotation locus tags (e.g., N515DRAFT_). This caused a complete join failure for Dyella79 (0% merge rate). Only 1 of 34 organisms was affected, but any pipeline joining aaseqs-derived data with gene table data should verify locus tag consistency.

Fitness Effects Conservation February 2026

Fitness importance and pangenome conservation form a continuous gradient

Across 194,216 protein-coding genes in 43 bacteria, there is a clear 16-percentage-point gradient from essential genes (82% core) to always-neutral genes (66% core). The same pattern holds when binning by strongest negative fitness effect (min_fit < -3 → 78% core vs min_fit near 0 → 66%) and by fitness breadth (important in 20+ conditions → 79% core vs 0 conditions → 66%). This establishes that the essentiality-conservation link from conservation_vs_fitness is not binary but quantitative.

Fitness Effects Conservation February 2026

Core genes are MORE likely to be burdens, not less

Counter to the expectation that accessory genes impose a carrying cost, core genes are more likely to show positive fitness effects when deleted (24.4% ever beneficial vs 19.9% for auxiliary; OR=0.77 for auxiliary vs core, p=5.5e-48). Core genes participate in more pathways and trade-off situations — they help in some conditions but cost in others. This challenges the "streamlining" model where accessory genes are metabolic burdens.

Fitness Effects Conservation February 2026

Condition-specific fitness genes are more core, not more accessory

Genes with strong condition-specific phenotypes (from the FB specificphenotype table) are 77.3% core vs 70.3% for genes without specific phenotypes (OR=1.78, p=1.8e-97). This contradicts the intuition that condition-specific fitness = niche-specific genes = accessory genome. Instead, core genes are more likely to have detectable condition-specific effects because they are embedded in well-characterized, essential pathways.

Module Conservation February 2026

Fitness modules are enriched in core genome genes

ICA fitness modules (co-regulated gene groups) are 86.0% core vs 81.5% baseline across 29 organisms (Fisher OR=1.46, p=1.6e-87; per-organism paired Wilcoxon p=1.0e-03, 22/29 organisms show enrichment). 59% of modules are >90% core genes. Co-regulated fitness response units are preferentially embedded in the conserved genome — the core genome is not just structurally conserved but functionally coherent at the module level.

Module Conservation February 2026

Module family breadth does NOT predict conservation

Surprisingly, module families spanning more organisms do not have higher core fractions (Spearman rho=-0.01, p=0.914). The baseline core rate (~82%) is so high that there is no room for a gradient — families are nearly all core regardless of breadth. This is a ceiling effect, not evidence against the conservation-function relationship.

Module Conservation February 2026

Essential genes are absent from ICA modules

0 essential genes appear in any of the 1,116 fitness modules across 32 organisms. ICA decomposes fitness variation, so genes with no fitness data (essential = no transposon insertions) are invisible to it. This means fitness modules capture only the non-essential portion of the genome's functional architecture.

Core Gene Tradeoffs February 2026

Trade-off genes are enriched in the core genome

25,271 genes (17.8%) are true trade-offs — important (fit < -1) in some conditions, burdensome (fit > 1) in others. These are 1.29x more likely to be core (OR=1.29, p=1.2e-44). Core genes have more trade-offs because they participate in more pathways with condition-dependent costs and benefits. This explains why core genes are simultaneously more burdensome AND more essential than accessory genes.

Core Gene Tradeoffs February 2026

The burden paradox is function-specific, not universal

The core-burden paradox is driven by specific functional categories: RNA Metabolism (+12.9pp), Motility/Chemotaxis (+7.8pp), Protein Metabolism (+6.2pp) all show core genes as more burdensome. But Cell Wall reverses: non-core cell wall genes are MORE burdensome (-14.1pp). The paradox is not a uniform property of the core genome but reflects the trade-off architecture of specific functional systems.

Core Gene Tradeoffs February 2026

28,017 "costly + conserved" genes = natural selection signature

Genes that are both burdensome in the lab AND core in the pangenome represent the strongest evidence for purifying selection in natural environments. They're costly to maintain, yet every strain keeps them — nature requires them in conditions not captured by the lab. By contrast, 5,526 genes are costly + dispensable (candidates for ongoing gene loss), and 21,886 are neutral + dispensable (niche-specific).

Essential Genome February 2026

15 gene families are essential in ALL 48 bacteria

Cross-organism essential gene analysis across 48 diverse bacteria (221,005 genes, 17,222 ortholog families) reveals 15 families that are essential in every organism: ribosomal proteins (rpsC, rplW, rplK, rplB, rplA, rplF, rps11, rpsJ, rpsI, rpsM), chaperonin (groEL), CTP synthase (pyrG), translation elongation factor G (fusA), valyl-tRNA synthetase (valS), and geranyltranstransferase (SelGGPS). These represent the irreducible essential core of bacterial life.

Essential Genome February 2026

Only 5% of ortholog families are universally essential

Of 17,222 ortholog families across 48 bacteria, only 859 (5.0%) are universally essential — essential in every organism where the family has members. 4,799 families (27.9%) are variably essential (essential in some organisms, non-essential in others), and 11,564 (67.1%) are never essential. Variable essentiality (the majority of essential families) demonstrates that genomic context profoundly shapes which genes an organism requires for viability.

Essential Genome February 2026

Orphan essential genes are 58.7% hypothetical — top functional discovery targets

7,084 essential genes have no orthologs in any other FB organism. Of these, 58.7% are annotated as hypothetical proteins — the least characterized yet most functionally important genes in each organism. By contrast, universally essential genes are only 8.2% hypothetical. The orphan essentials are prime candidates for novel biology.

Essential Genome February 2026

Universally essential families are overwhelmingly core in the pangenome

Universally essential genes are 91.7% core vs 80.7% for non-essential genes. 71% of universally essential families are 100% core across all genomes in their species. Orphan essentials (no orthologs) are only 49.5% core — they are strain-specific essential functions that haven't been conserved across species.

Essential Genome February 2026

Module transfer predicts function for 1,382 hypothetical essential genes

By finding non-essential orthologs that participate in ICA fitness modules, we generated 1,382 function predictions for hypothetical essential genes across 48 organisms. All predictions are backed by cross-organism module family conservation. This demonstrates that module context from non-essential orthologs can illuminate the function of essential genes that are invisible to fitness-based methods.

Ecotype Env Reanalysis February 2026

Clinical sampling bias does NOT explain weak environment-gene content signal

The env_embedding_explorer project showed that human-associated samples dampen AlphaEarth geographic signal (2.0x vs 3.4x for environmental). We hypothesized this bias explained the weak environment effect in the ecotype analysis (median partial correlation 0.003). However, stratifying 213 species by dominant environment type (47% human-associated, 21% environmental) shows the opposite: human-associated species have higher partial correlations (median 0.084) than environmental species (0.051). Mann-Whitney p=0.83. The original conclusion — phylogeny dominates — holds regardless of sample environment. The geographic signal in embeddings and the environment-gene content relationship are distinct phenomena.

Ecotype Env Reanalysis February 2026

47% of ecotype species are human-associated by genome-level classification

Using isolation_source harmonization (12 categories from 5,774 values), 106/224 species in the ecotype analysis are majority human-associated (gut + clinical). Only 47/224 (21%) are majority environmental (Soil, Marine, Freshwater, Extreme, Plant). This is a more systematic classification than the original manual species-level categorization, and confirms the clinical bias but shows it doesn't confound the ecotype results.

Env Embedding Explorer February 2026

AlphaEarth embeddings encode geographic signal, but strength depends on sample type

Environmental samples (Soil, Marine, Freshwater, Extreme, Plant) show a 3.4x ratio in embedding cosine distance between nearby (<100 km, mean=0.27) and far (>10K km, mean=0.90) genome pairs. Human-associated samples (gut, clinical) show only a 2.0x ratio (0.37 vs 0.75). Hospitals worldwide have similar satellite imagery (urban built environment), so human-associated genomes have more homogeneous embeddings regardless of geography. The pooled "All samples" curve (2.0x ratio) is dominated by the 38% human-associated fraction and substantially underestimates the true geographic signal in the embeddings.

Implication: Any analysis using AlphaEarth embeddings as environment proxies should stratify by sample type, or at minimum exclude human-associated samples. The weak environment signal found in ecotype_analysis (median partial correlation 0.0025) may be partially explained by this clinical bias.

Env Embedding Explorer February 2026

38% of AlphaEarth genomes are human-associated — strong clinical sampling bias

Of 83,287 genomes with AlphaEarth embeddings, 38% are human-associated (clinical 20%, gut 16%, other 2%). Environmental categories are much smaller (soil 7%, marine 7%, freshwater 7%). This reflects NCBI's bias toward pathogen sequencing — clinical isolates have good geographic metadata from epidemiological tracking, which is why they disproportionately have embeddings.

Env Embedding Explorer February 2026

36% of AlphaEarth coordinates may be institutional addresses, but some are legitimate field sites

30,469 genomes (36.6%) cluster at coordinates with >50 genomes of >10 species. However, several flagged locations are legitimate DOE/research field sites: Rifle, CO (1,883 genomes, DOE IFRC), Saanich Inlet (1,529, oceanographic time series), Siberian soda lakes (812, extremophile campaigns). The heuristic needs refinement — checking isolation_source homogeneity at each location would better distinguish field sites from institutional addresses.

Env Embedding Explorer February 2026

4.6% of AlphaEarth genomes have NaN embedding dimensions

3,838 of 83,287 genomes have NaN in at least one of the 64 embedding dimensions. The cause is unclear — possibly missing satellite imagery at those coordinates, or coordinates that fall outside satellite coverage (polar regions, ocean). These must be filtered before any embedding-based analysis.

Enigma Contamination Functional Potential February 2026

Primary contamination-functional tests are null, but defense signal appears in coverage-aware sensitivity models

In the ENIGMA contamination-functional project (108 overlap samples), primary univariate associations were non-significant across defense, stress, metabolism, and mobilome outcomes in both mapping modes. However, after adding mapped-coverage-aware sensitivity models, contamination-defense association became significant (coverage-adjusted OLS contamination p = 3.98e-4 relaxed, 3.54e-3 strict; high-coverage subset Spearman p = 0.0207 relaxed, 0.0098 strict). This indicates the strongest signal is conditional on mapping coverage/model specification rather than broad across all outcomes.

Enigma Contamination Functional Potential February 2026

Independent strict vs relaxed feature construction materially changes mode-specific outputs

When strict and relaxed mapping modes were computed independently in NB02 (instead of reusing strict outputs), 1,140 of 1,590 genus-feature pairs differed between modes (~71.7%), and site stress scores differed for all 108 samples. Strict-vs-relaxed sensitivity is therefore analytically meaningful and should not be approximated by copied feature tables.

Enigma Contamination Functional Potential February 2026

Species-proxy bridge is currently coverage-limited under genus-only ENIGMA taxonomy

enigma_coral.ddt_brick0000454 currently provides taxonomy through Genus (no species/strain rows). A species-proxy mode (species_proxy_unique_genus) that keeps only unique genus->single GTDB clade mappings retained 150 genera, but mapped abundance collapsed to mean 0.031 (vs 0.343 in strict/relaxed genus modes). In this low-coverage regime, defense trend was positive (rho=0.169) but non-significant (p=0.081), indicating that true species-level gains likely require higher-resolution ENIGMA taxonomy or metagenomic mapping.

Enigma Contamination Functional Potential February 2026

Confirmatory defense tests remain null after global FDR; strongest signal remains exploratory and coverage-dependent

After adding explicit confirmatory vs exploratory labels in NB03 and applying global BH-FDR across all reported p-values, confirmatory defense Spearman tests in genus-level modes remained null (p=0.546/0.483; q=0.862/0.849). The strongest surviving signal was exploratory: relaxed coverage-adjusted defense model retained q=0.046, while strict and other sensitivity signals attenuated above q<0.1. Interpretation should therefore prioritize null confirmatory evidence and frame positive results as exploratory.

Enigma Contamination Functional Potential February 2026

Confirmatory null is robust to contamination-index choice

Re-testing confirmatory defense endpoint under four index constructions (all-metals composite, uranium-only, top-3 variance metals, PCA-PC1) did not produce significant confirmatory associations after FDR correction (all q=0.546 across 8 confirmatory-variant tests). This indicates the null confirmatory result is not an artifact of one specific contamination-index formulation.

Nmdc Community Metabolic Ecology February 2026

Black Queen Hypothesis is detectable at community scale in environmental metabolomics

Spearman correlation of community-weighted GapMind amino acid biosynthesis completeness against
ambient amino acid metabolomics across 131 NMDC soil samples:

11/13 tested aa pathways trend in BQH-predicted direction (negative r = lower community
completeness co-occurs with higher ambient metabolite concentration); binomial sign test
p = 0.011
Two pathways reach FDR significance (BH-corrected q < 0.05): leucine (r = −0.390, q = 0.022,
n = 62) and arginine (r = −0.297, q = 0.049, n = 80)
Both FDR-significant pathways are among the most energetically expensive to synthesize (leucine:
37 ATP equivalents; arginine: 26 ATP equivalents) — consistent with BQH predicting loss is
most favored when the benefit of gene retention is lowest
Methionine shows the largest effect (r = −0.496) but is underpowered (n = 18, q = 0.117)
leu signal is robust to soil-only stratification (q = 0.022 unchanged) and to study-level
blocking (dataset is 95% single-study; see below)
Tyrosine is an anti-BQH outlier (r = +0.42, ns): likely explained by phenylalanine hydroxylation
providing an alternative tyrosine source that decouples biosynthesis completeness from pool size

First demonstration that community-weighted pangenome completeness scores correlate with measured
metabolite pools at landscape scale using public multi-omics data.

Nmdc Community Metabolic Ecology February 2026

Carbon utilization pathways, not amino acid pathways, are the primary axis of ecosystem metabolic differentiation

PCA of 220-sample × 80-pathway GapMind completeness matrix:

PC1 = 49.4% of variance captures near-complete Soil vs Freshwater separation (Mann-Whitney
p < 0.0001; median PC1 Soil = +3.86, Freshwater = −6.28)
PC1 loadings are dominated by carbon utilization pathways (glucuronate, fumarate, succinate,
cellobiose, galactose) — not amino acid pathways
Amino acid pathways load more strongly on PC2 (16.6%), separating sub-clusters within Soil
17/18 aa pathways are significantly differentiated across ecosystem types (q < 0.05 KW),
but the primary driver is carbon substrate availability, not biosynthesis capacity

Implication: when community metabolic potential is used as a feature space, carbon substrate range
is the dominant ecological axis and will confound aa pathway analyses unless controlled for.
Use PC1-residualized or ecosystem-stratified analyses for BQH-type tests.

Nmdc Community Metabolic Ecology February 2026

NMDC metabolomics overlap is dominated by a single study

Study-level analysis of the 131-sample H1 dataset revealed that 95% of samples (125/131) come
from one NMDC study (nmdc:sty-11-r2h77870). The remaining 6 samples are from a second study.
This means:
- Cross-study LC-MS protocol heterogeneity is effectively absent as a confounder (contrary to
the naive expectation for NMDC metabolomics)
- The leucine BQH signal is internally consistent within one analytical protocol
- Future multi-study NMDC metabolomics analyses should count study-level sample representation
before treating the aggregate as cross-study data

Nmdc Community Metabolic Ecology February 2026

NMDC metabolomics KEGG annotation rate is ~2%; string matching required

Only ~2% of compounds in nmdc_arkin.metabolomics_gold have KEGG compound IDs populated.
Pathway matching must rely on compound name substring patterns rather than KEGG-based ontologies.
Key caveats:
- Substring matching introduces collision risk (e.g., 'leucine' matches inside 'isoleucine') —
use first-match-wins ordering with longer/more specific patterns before shorter ones
- Chorismate itself is rarely detected; shikimic acid and 3-dehydroshikimic acid serve as upstream
proxies but reflect precursor availability, not chorismate pool size
- Three aa pathways remain untestable (cys, his, lys) due to absent compound detections in the
175-sample overlap cohort

See also docs/pitfalls.md for the isoleucine/leucine substring collision fix.

Pangenome Pathway Geography February 2026

Ecological niche breadth strongly predicts metabolic pathway diversity

Analysis of 1,872 bacterial species with complete pangenome, GapMind pathway, and AlphaEarth embedding data reveals:

H1: Pangenome openness → pathway completeness (r=0.107, p=3.6e-6)
- Weak but significant positive correlation
- Open pangenomes (high accessory/core ratio) have slightly more complete pathways
- Stronger relationship with pathway VARIABILITY (std dev): r=0.066, p=0.004
- Suggests accessory genes enable metabolic heterogeneity within species

H2: Niche breadth → pathway completeness (r=0.392, p=7.1e-70) — STRONGEST SIGNAL
- Moderate positive correlation using AlphaEarth embedding diversity
- Embedding variance (ecological diversity) has even stronger effect: r=0.412, p=1.8e-77
- Geographic distance alone: r=0.360, p=1.8e-58 (weaker than embedding-based metrics)
- Species with broader ecological niches require more complete metabolic toolkits

H3: Pangenome openness → niche breadth (r=0.324, p=5.6e-47)
- Moderate positive correlation
- Open pangenomes correlate with broader ecological niches
- Core fraction shows strong NEGATIVE correlation with niche breadth: r=-0.445, p=1.4e-91
- Pangenome flexibility enables adaptation to diverse environments

Key insight: AlphaEarth embedding diversity is a better predictor of metabolic completeness than geography alone. The embeddings capture ecological context (soil vs marine vs clinical), not just geographic distance. This explains why embedding variance (r=0.412) outperforms geographic range (r=0.360) in predicting pathway completeness.

Data coverage limitation: Only 6.8% of species (1,872/27,690) have sufficient AlphaEarth coverage (≥5 genomes with embeddings). This is due to missing lat/lon metadata for most NCBI genomes, especially clinical isolates.