Browse all analysis projects exploring BERDL data collections. Each project addresses a specific research question using the KBase Data Lakehouse.
How do pangenome characteristics (open vs. closed) correlate with metabolic pathway completeness, phylogenetic distances, and species ecology?
Do pangenome characteristics (open vs. closed) correlate with metabolic pathway diversity and biogeographic distribution patterns?
Which microbial species and ecological environments show the highest concentration of antibiotic resistance genes, and can we predict resistance accumulation from phylogenetic and ecological features?
Do species with open pangenomes show different COG functional enrichment patterns than species with closed pangenomes?
How are polyhydroxybutyrate (PHB) granule-forming pathways distributed across bacterial clades and environments, and does this distribution support the hypothesis that carbon storage granules are most beneficial in temporally variable feast/famine environments?
Across 27,690 GTDB species, 21.9% carry phaC (PHA synthase, the committed step for PHB biosynthesis) and 21.7% have a complete PHB pathway (phaC + phaA/phaB). The near-identical prevalence of phaC-only a...
How are prophage gene modules and terminase-defined prophage lineages distributed across bacterial phylogeny and environmental gradients, and which modules/lineages show environmental enrichment exceeding phylogenetic expectation?
All 27,702 species in the BERDL pangenome carry prophage-associated gene clusters, with 4,005,537 total prophage gene clusters identified via eggNOG annotations. Three modules are...
Do the GapMind-predicted pathway completeness profiles of community resident taxa predict or correlate with observed metabolomics profiles in NMDC environmental samples across diverse habitat types?
Across 13 testable amino acid biosynthesis pathways, 11 of 13 (85%) showed negative Spearman correlations between community pathway completeness and ambient amino acid metabolite intensity, the direction predicted by...
How prevalent are SNIPE (Surface-associated Nuclease Inhibiting Phage Entry) homologues across the 293K-genome BERDL pangenome, and does their taxonomic distribution, environmental context, or pangenome status (core vs. accessory) reveal ecological patterns of phage defense?
Saxton et al. (2026) showed that SNIPE constitutively localizes to the inner membrane and cleaves phage DNA as it passes through the ManYZ mannose transporter pore. Published knockout and coevolution studies (not from...
Which genes of unknown function across 48 bacteria have strong fitness phenotypes, and can biogeographic patterns, pathway gap analysis, and cross-organism fitness concordance — combined with existing function predictions and conservation data — prioritize them for experimental follow-up?
Across 48 Fitness Browser organisms (228,709 genes), 57,011 (24.9%) lack functional annotation ("hypothetical protein," DUF, or "uncharacterized"). Of these, 7,787 show strong...
Among the ~6,400 Fitness Browser genes that remain functionally unannotated even after bakta v1.12.0 reannotation, what distinguishes them from "annotation-lag" dark matter, and can their fitness phenotypes, genomic context, and sparse annotations prioritize them for experimental characterization?
Of 39,532 Fitness Browser dark genes with pangenome links, bakta v1.12.0 reannotation reclassifies 33,105 (83.7%) — leaving just 6,427 "truly dark" genes where both the original pipeline and bakta agree: these are hypothetical...
Do antimicrobial resistance (AMR) genes impose a fitness cost in the absence of antibiotic selection pressure? Using genome-wide RB-TnSeq fitness data from 28 bacteria, we test whether transposon knockouts of AMR genes show systematically positive fitness (mutant grows better than wildtype) under standard growth conditions, indicating the intact AMR gene is a metabolic burden.
AMR gene knockouts show systematically higher fitness than non-AMR gene knockouts under non-antibiotic conditions, confirming that resistance genes impose a metabolic burden. A DerSimonian-Laird random-effects...
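The DerSimonian-Laird estimator named in this summary can be sketched as follows. This is a minimal illustration of the standard textbook estimator, not the project's own code; the function name `dersimonian_laird` and the input layout (one effect size and one sampling variance per organism) are assumptions made for the example.

```python
import math

def dersimonian_laird(effects, variances):
    """Pool per-study effects with a DerSimonian-Laird random-effects model.

    Returns (pooled_effect, standard_error, tau_squared).
    Inputs are hypothetical: one effect and one variance per study.
    """
    if len(effects) != len(variances) or not effects:
        raise ValueError("need equal-length, non-empty inputs")
    k = len(effects)
    # Fixed-effect (inverse-variance) weights and pooled mean.
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Cochran's Q measures heterogeneity around the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    # Between-study variance, truncated at zero; undefined for a single study.
    tau2 = max(0.0, (q - (k - 1)) / c) if k > 1 and c > 0 else 0.0
    # Random-effects weights add tau2 to every study's variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    sw_star = sum(w_star)
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sw_star
    return pooled, math.sqrt(1.0 / sw_star), tau2
```

When the studies agree closely, `tau2` is truncated to zero and the result reduces to the ordinary inverse-variance fixed-effect estimate.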
Within a species, how does the AMR repertoire vary between strains, and what drives that variation?
Across 1,305 species and 180,025 genomes, 51.3% of AMR gene-species occurrences are rare (present in <=5% of strains), 41.3% are variable (5-95%), and only 7.5% are fixed (>=95%). The median variabilit...
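The rare/variable/fixed split quoted above can be written as a small classifier. This is a sketch under the thresholds stated in the summary; the function name `classify_prevalence` is hypothetical, and assigning the exact 5% and 95% boundaries to "rare" and "fixed" follows the "<=" and ">=" signs in the text rather than any confirmed project code.

```python
def classify_prevalence(fraction: float) -> str:
    """Label an AMR gene by the fraction of a species' strains that carry it."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if fraction <= 0.05:   # present in at most 5% of strains
        return "rare"
    if fraction >= 0.95:   # present in at least 95% of strains
        return "fixed"
    return "variable"      # everything strictly between the two cutoffs


# Example: a gene found in 12 of 400 genomes of one species.
print(classify_prevalence(12 / 400))  # rare
```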
What does the kescience_paperblast collection contain, how current is it, and what are its coverage patterns across organisms, domains of life, and functional databases?
Homo sapiens alone accounts for 46.7% of all gene-paper records in PaperBLAST. The top 5 organisms (H. sapiens, M. musculus, R. norvegicus, A. thaliana, D. melanogaster) capture 72.8%. Of 20,723 organisms...
For Pseudomonas fluorescens FW300-N2E3 (ENIGMA groundwater isolate), how consistent are exometabolomic outputs (Web of Microbes), genome-wide gene fitness (Fitness Browser), species-level utilization phenotypes (BacDive), and computational pathway predictions (GapMind)?
Of the 58 metabolites produced or increased by FW300-N2E3 (Web of Microbes), 21 could be cross-referenced against at least one other database. Among these testable metabolites, 17/21 (81%) were fully concordant across all...
Are ICA fitness modules enriched in core pangenome genes, and do cross-organism module families map to the core genome?
(Notebook: 01_module_conservation.ipynb)
What characterizes genes that are simultaneously burdensome (fitness improves when deleted) and not conserved in the pangenome? Are they mobile elements, recent acquisitions, degraded pathways, or something else?
The 5,526 costly+dispensable genes are overwhelmingly associated with mobile genetic elements. They are 7.45x more likely to contain mobile element keywords in their descriptions (transposase, integrase, phage, IS element,...
Among genes that TnSeq says are dispensable in Acinetobacter baylyi ADP1, does FBA correctly predict which ones have growth defects? Can direct mutant growth rate measurements serve as an independent axis to evaluate where computational (FBA) and genetic (TnSeq) methods agree or disagree?
Does the environment effect on gene content become stronger when analysis is restricted to genuinely environmental samples, excluding human-associated genomes whose AlphaEarth embeddings reflect hospital satellite imagery rather than ecological habitat?
Environmental species (n=37, median partial correlation 0.051) do NOT show stronger environment–gene content correlations than human-associated species (n=93, median 0.084). The Mann-Whitney U test is far from...
Which essential genes are conserved across bacteria, which are context-dependent, and can we predict function for uncharacterized essential genes using module context from non-essential orthologs?
The absolute core of bacterial life: ribosomal proteins (rpsC, rplW, rplK, rplB, rplA, rplF, rps11, rpsJ, rpsI, rpsM), chaperonin (groEL), CTP synthase (pyrG), translation elongation factor G (fusA), valyl-tRNA synthetase (valS), and...
Which genes matter for survival under environmentally-realistic conditions but appear dispensable in the lab, and vice versa? Do field-relevant fitness effects predict pangenome conservation better than lab-only effects?
The ENIGMA CORAL database (47 tables, enigma_coral on BERDL) was surveyed for complementary data. Key finding: DvH is completely absent from the database. The single TnSeq library is for FW300-N2E2 (Pseudomonas), DubSeq libraries...
When bacteria are exposed to metal salts (CoCl₂, NiCl₂, CuCl₂), how much of the observed fitness effect is caused by the metal cation versus the counter anion (chloride)? Does correcting for chloride confounding change the conclusions of the Pan-Bacterial Metal Fitness Atlas?
Across 19 organisms and 14 metals (86 organism × metal pairs), 4,304 of 10,821 metal-important gene records (39.8%) are also important under NaCl stress. This substantial overlap exists for every metal tested — from 9.2% for...
Among the 12,838 metal-important genes identified by the Metal Fitness Atlas, which are specifically required for metal tolerance vs general stress survival — and do the metal-specific genes show the expected accessory-genome enrichment?
Of the 7,609 metal-important gene records with fitness matrix data across 24 organisms, 4,177 (54.9%) are metal-specific — they show significant fitness defects under metal stress but a <5% sick rate across 5,945 non-metal experiments....
Which biochemical reactions are universally essential across bacteria, and what does the essential metabolome reveal about the minimal core metabolism required for microbial life?
17 of 18 amino acid biosynthesis pathways are present in all 7 organisms analyzed (100% within this sample):
- Complete pathways: arg, asn, chorismate, cys, gln, gly, his, ile, leu, lys, met, phe, pro, thr, trp, tyr, val
**One...
Do genes with correlated fitness profiles (co-fit) tend to co-occur in the same genomes across a species' pangenome? Does functional coupling constrain which genes are gained and lost together?
Across 9 organisms with co-fitness data (2.25M cofit pairs vs 22.5M prevalence-matched random pairs), co-fit gene pairs show a weak but consistent positive co-occurrence signal. The mean delta phi (cofit - random) is +0.011 across organisms,...
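For readers unfamiliar with the phi statistic used here, the sketch below computes the phi coefficient for one gene pair from a 2x2 presence/absence table over genomes. It shows the standard definition only; the function name and argument layout are assumptions, and the project's prevalence-matching step is not reproduced.

```python
import math

def phi_coefficient(n11: int, n10: int, n01: int, n00: int) -> float:
    """Phi coefficient for two binary presence/absence vectors.

    n11: genomes with both genes, n10: only gene A,
    n01: only gene B, n00: neither gene.
    """
    row1, row0 = n11 + n10, n01 + n00   # gene A present / absent
    col1, col0 = n11 + n01, n10 + n00   # gene B present / absent
    denom = math.sqrt(row1 * row0 * col1 * col0)
    if denom == 0:
        # A gene present in all or none of the genomes carries no signal.
        return float("nan")
    return (n11 * n00 - n10 * n01) / denom
```

A "delta phi" in the sense of the summary would then be the mean phi over co-fit pairs minus the mean phi over the matched random pairs.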
What do AlphaEarth environmental embeddings capture, and how do they relate to geographic coordinates and NCBI environment labels?
AlphaEarth embeddings encode geographic/environmental signal, but the strength depends on the sample source. For environmental samples (Soil, Marine, Freshwater, Extreme, Plant), nearby genomes...
How does a gene's importance for bacterial survival relate to its evolutionary conservation, and what does the conserved genome actually look like?
There is a clear, quantitative gradient from essential genes (82% core) to always-neutral genes (66% core). More important genes are more conserved, but the effect is modest. Even genes with no detectable fitness effect in any experiment are 66% core. The gradient spans...
How is Acinetobacter baylyi ADP1's branched respiratory chain wired across carbon sources — which NADH dehydrogenases and terminal oxidases are required for which substrates?
ADP1's branched respiratory chain (62 genes across 8 subsystems) is wired in a condition-dependent manner. Quinate requires only Complex I; acetate requires Complex I, cytochrome bo3, ACIAD3522, and more; glucose requires...
What does the kescience_webofmicrobes exometabolomics collection contain, which organisms overlap with the Fitness Browser, and how well do metabolite uptake/release profiles connect to pangenome-predicted metabolic capabilities?
The WoM database encodes metabolite observations with a 4-action system that differs between control and organism entries:
| Actor | Action | Meaning | Count |
|---|---|---|---|
| Control ("The Environment")... |
What is the condition-dependent structure of gene essentiality in Acinetobacter baylyi ADP1, as revealed by the de Berardinis single-gene deletion collection grown on 8 carbon sources?
The 8 carbon sources partition into demanding, moderate, and robust tiers based on the fraction of genes showing growth defects. Urea is the most demanding (97.9% of genes show severe defects at ratio < 0.5), while quinate is the...
How does core genome composition change over sampling time, and do genes transition in and out of core status?
Can we decompose RB-TnSeq fitness compendia into latent functional modules via robust ICA, align them across organisms using orthology, and use module context to predict gene function?
Do high-contamination Oak Ridge groundwater communities show enrichment for taxa with higher inferred stress-related functional potential compared with low-contamination communities?
Model-family sample counts from data/model_family_sample_counts.tsv frame how much data each analysis used:
| Mode | Base Spearman n | Adj+Cov n | Adj+Fraction n | High-coverage subset n |
|---|---|---|---|---|
| ... |
Are essential genes preferentially conserved in the core genome, and what functional categories distinguish essential-core from essential-auxiliary genes?
Is there a continuous gradient from essential genes (core) to dispensable genes (accessory) across the full fitness spectrum, and what does the fitness landscape of novel genes look like?
A clear gradient from essential to neutral genes:
| Fitness category | n genes | % Core |
|---|---|---|
| Essential (no viable mutants) | 27,693 | 82% |
| Often sick (>10% experiments) | 15,989 | 78% |
| Mixed... |
Can BacDive-measured bacterial phenotypes (Gram stain, oxygen tolerance, metabolite utilization, enzyme activities) predict metal tolerance as measured by Fitness Browser experiments and the Metal Fitness Atlas?
Gram-negative species have higher metal tolerance scores than Gram-positive species (Cohen's d = -0.61, p < 1e-60, n = 3,272 species). This is the largest effect among all phenotype features tested....
What is the scope and structure of a comprehensive ADP1 database, and how do its annotations, metabolic models, and phenotype data intersect with BERDL collections (pangenome, biochemistry, fitness, PhageFoundry)?
The user-provided SQLite database contains 15 tables with 461,522 total rows and 135 MB of data for Acinetobacter baylyi ADP1 and 13 related genomes. The central genome_features table has 5,852 genes with 51 annotation columns spanning...
When a bacterium's genome encodes a complete metabolic pathway (metabolic capability), does the organism actually depend on it? Can we distinguish genomic capability from functional dependency using experimental fitness data?
Across 1,695 pathway-organism pairs from 48 organisms, 15.8% of genomically complete pathways were classified as latent capabilities — pathways the genome encodes but that show no detectable fitness...
Do open pangenomes show different patterns of environmental vs phylogenetic effects compared to closed pangenomes?
Analysis of pangenome openness vs environment/phylogeny effects revealed no significant relationship:
| Metric | Spearman rho | p-value |
|---|---|---|
| Openness vs Environment effect | -0.05 | 0.54 |
| Openness vs Phylogeny effect | 0.03 | ... |
Why does aromatic catabolism in Acinetobacter baylyi ADP1 require Complex I (NADH dehydrogenase), iron acquisition, and PQQ biosynthesis when growth on other carbon sources does not?
The 51 quinate-specific genes in ADP1 organize into a coherent metabolic dependency network around the β-ketoadipate pathway. Co-fitness analysis assigns 44/51 genes (86%) to four functional...
Do bacteria isolated from metal-contaminated environments have higher predicted metal tolerance scores than bacteria from uncontaminated environments?
Organisms isolated from heavy metal contamination sites have metal tolerance scores a full standard deviation above the environmental baseline (Cohen's d = +1.00, Mann-Whitney p=0.006, n=10)....
How do COG functional category distributions differ across core, auxiliary, and novel genes in bacterial pangenomes?
Analysis of 32 species across 9 phyla (357,623 genes) reveals a remarkably consistent "two-speed genome":
Novel/singleton genes consistently enriched in:
- L (Mobile elements): +10.88% enrichment, 100% consistency across species...
When a bacterium's genome encodes a complete biosynthetic or catabolic pathway, does the organism actually depend on it? Can we use fitness data to distinguish active dependencies from latent capabilities — and predict which pathways are candidates for evolutionary gene loss?
Of 161 classified organism-pathway pairs (7 Fitness Browser organisms, 23 GapMind pathways), only 35.4% (57/161) are Active Dependencies where a complete pathway contains fitness-important genes. The largest...
What is the distribution, conservation, phylogenetic structure, functional context, and environmental association of antimicrobial resistance (AMR) genes across 27,000 bacterial species pangenomes?
AMR genes are significantly less conserved than the pangenome average: only 30.3% are core vs 46.8% baseline (OR=0.49, chi-squared=23,117, p≈0). The auxiliary genome is 2.2x enriched for AMR (33.6% vs 15.3%). This depletion...
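The reported odds ratio can be roughly reproduced from the two rounded percentages alone. The sketch below assumes the OR compares the odds of being core for AMR genes against the odds for the baseline set; the project most likely worked from raw gene counts, so small differences from rounding are expected. The function name `odds_ratio` is illustrative.

```python
def odds_ratio(p_group: float, p_baseline: float) -> float:
    """Odds ratio computed from two proportions, each strictly between 0 and 1."""
    for p in (p_group, p_baseline):
        if not 0.0 < p < 1.0:
            raise ValueError("proportions must be strictly between 0 and 1")
    odds_group = p_group / (1.0 - p_group)
    odds_baseline = p_baseline / (1.0 - p_baseline)
    return odds_group / odds_baseline


# 30.3% of AMR genes are core vs 46.8% of the baseline.
print(round(odds_ratio(0.303, 0.468), 2))  # 0.49
```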
Do lab-measured fitness effects under contaminant stress predict the field abundance of Fitness Browser organisms across Oak Ridge groundwater sites with varying geochemistry?
Of 26 unique genera represented in the Fitness Browser, 14 are detected in Oak Ridge groundwater communities via 16S amplicon sequencing. The most prevalent are Sphingomonas (93% of 108 sites), Pseudomonas (91%), and Caulobacter...
Why are core genome genes MORE likely to show positive fitness effects when deleted, and what functions and conditions drive this burden paradox?
Not all functional categories show the paradox. Core genes are disproportionately burdensome in Protein Metabolism (+6.2pp), Motility (+7.8pp), and RNA Metabolism (+12.9pp). But Cell Wall reverses: non-core cell wall genes are MORE burdensome...
Across diverse bacteria subjected to genome-wide fitness profiling under metal stress, what is the genetic architecture of metal tolerance — is it encoded in the core or accessory genome, is it conserved across species, and can fitness-validated metal tolerance genes predict capabilities across the broader pangenome?
Across 22 organisms and 14 metals, genes with significant fitness defects under metal stress are 87.4% core vs 76.9% baseline (OR=2.08, p=4.3e-162). This is the opposite of the initial hypothesis (H1a), which predicted accessory...
What drives gene content similarity between bacterial genomes: environmental similarity or phylogenetic relatedness?
Analysis of 172 species with sufficient environmental and phylogenetic data reveals:
What genes are co-regulated with antimicrobial resistance (AMR) genes across growth conditions, and do these "support networks" explain the uniform fitness cost of resistance? Using cofitness data and ICA fitness modules from 25 bacteria, we identify the functional context in which AMR genes operate.
Only 24% of AMR genes (192/801) are assigned to ICA fitness modules, but the modules they inhabit are significantly larger than non-AMR modules: median 46 vs 27 genes (MWU p = 1.7×10⁻⁸). This indicates that when AMR...
Do antimicrobial resistance gene profiles differ between ecological niches across 27,000 bacterial species? Using 83K AMR gene clusters mapped to 293K genomes with environmental metadata, we test whether the resistome is structured by ecology — and whether intrinsic (core) and acquired (accessory) resistance show different environmental signatures.
Species from clinical sources have a median of 5 AMR gene clusters, compared to 2 for soil, aquatic, and host-associated species (Kruskal-Wallis H = 781.9, p = 9.4×10⁻¹⁶⁷, η² = 0.056)....
Does environmental selection shape the distribution of plant growth-promoting (PGP) bacterial genes across the BERDL pangenome (293K genomes, 27K species), and are those genes core or accessory within their carrier species?
Across 11,272 species with at least one PGP gene, 8 of 10 focal-gene pairs were significantly associated after BH-FDR correction. Five pairs showed positive co-occurrence and three showed...
Can we build a multi-criterion framework that explains measured P. aeruginosa PA14 inhibition from metabolic competition, growth kinetics, and patient ecology data, and use it to design optimal 1–5 organism commensal formulations for competitive exclusion in CF lungs?
Among free-living Pseudomonas clades, does the carbon source utilization profile predict the soil ecosystem type from which strains were isolated — and do clades that have transitioned to host-associated lifestyles show predictable losses of specific carbon pathways?
Pseudomonas sensu stricto (the P. aeruginosa group) shows near-complete loss of plant-derived sugar catabolism compared to Pseudomonas_E (the P. fluorescens/putida group). Of the 62 GapMind...