Data Explorer
Discover how BERDL collections connect through shared organisms, join keys, and cross-collection research projects.
Collection Network
Click a collection to see its connections. Edges come from explicit links (schema relationships) and projects that use multiple collections (project co-usage).
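The two edge sources described above can be sketched as a small graph-building routine. This is a minimal illustration, not the explorer's actual implementation: the function name `build_edges`, the input shapes, and the project names `example_project_a` and `example_project_b` are all hypothetical, and the schema links are simply two pairs from this page used as sample input.

```python
from itertools import combinations

# Sample explicit schema links (illustrative input, not an authoritative list).
explicit_links = [
    ("kescience_fitnessbrowser", "kbase_phenotype"),
    ("nmdc_ncbi_biosamples", "nmdc_arkin"),
]

# Hypothetical projects and the collections each one uses.
project_collections = {
    "example_project_a": ["kbase_ke_pangenome", "kescience_fitnessbrowser"],
    "example_project_b": ["enigma_coral", "kbase_ke_pangenome", "kescience_fitnessbrowser"],
}


def build_edges(explicit_links, project_collections):
    """Merge both edge sources into one undirected edge map.

    Returns {frozenset({a, b}): {"explicit": bool, "projects": [project names]}}.
    """
    edges = {}

    def edge(a, b):
        return edges.setdefault(frozenset((a, b)), {"explicit": False, "projects": []})

    # Edge source 1: explicit links (schema relationships).
    for a, b in explicit_links:
        edge(a, b)["explicit"] = True

    # Edge source 2: project co-usage, one edge per pair of collections in a project.
    for project, collections in project_collections.items():
        for a, b in combinations(sorted(set(collections)), 2):
            edge(a, b)["projects"].append(project)

    return edges


edges = build_edges(explicit_links, project_collections)
```

An edge can carry both kinds of evidence at once, which is why each entry keeps an `explicit` flag alongside its list of projects rather than a single edge type.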
Cross-Collection Join Paths
Documented ways to connect data across collections. Each path shows the relationship and linking strategy between two collections.
- kbase_ke_pangenome → kescience_fitnessbrowser (bridging projects)
- kbase_ke_pangenome → kbase_msd_biochemistry
- kbase_ke_pangenome → kbase_genomes (bridging projects)
- kescience_fitnessbrowser → kbase_phenotype: Schema-linked collections. See individual collection pages for join examples.
- kbase_msd_biochemistry → kescience_fitnessbrowser
- enigma_coral → kbase_ke_pangenome (bridging projects)
- enigma_coral → kescience_fitnessbrowser (bridging projects)
- nmdc_arkin → kbase_ke_pangenome
- nmdc_ncbi_biosamples → nmdc_arkin: Schema-linked collections. See individual collection pages for join examples.
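Because each documented path links only two collections, reaching a collection that is not directly paired with your starting point means chaining several paths. The sketch below runs a breadth-first search over the source/target pairs listed above. It assumes every path can be traversed in either direction, which this page does not state, and the names `JOIN_PATHS` and `shortest_join_chain` are hypothetical.

```python
from collections import deque

# Source/target pairs as listed on this page.
JOIN_PATHS = [
    ("kbase_ke_pangenome", "kescience_fitnessbrowser"),
    ("kbase_ke_pangenome", "kbase_msd_biochemistry"),
    ("kbase_ke_pangenome", "kbase_genomes"),
    ("kescience_fitnessbrowser", "kbase_phenotype"),
    ("kbase_msd_biochemistry", "kescience_fitnessbrowser"),
    ("enigma_coral", "kbase_ke_pangenome"),
    ("enigma_coral", "kescience_fitnessbrowser"),
    ("nmdc_arkin", "kbase_ke_pangenome"),
    ("nmdc_ncbi_biosamples", "nmdc_arkin"),
]


def shortest_join_chain(start, goal, pairs=JOIN_PATHS):
    """Return the shortest list of collections from start to goal, or None."""
    # Assumption: paths are usable in both directions.
    neighbors = {}
    for a, b in pairs:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    if start not in neighbors or goal not in neighbors:
        return None

    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk parent pointers back to the start, then reverse.
            chain = []
            while node is not None:
                chain.append(node)
                node = parent[node]
            return chain[::-1]
        for nxt in neighbors[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


chain = shortest_join_chain("nmdc_ncbi_biosamples", "kescience_fitnessbrowser")
```

Here `chain` passes through `nmdc_arkin` and `kbase_ke_pangenome`. A chain only says which collections to visit; the actual join keys for each hop are documented on the individual collection pages.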
Explorer Project Highlights
Deep-dive explorations of BERDL collections, characterizing their content, cross-collection links, and research potential.
What do AlphaEarth environmental embeddings capture, and how do they relate to geographic coordinates and NCBI environment labels?
- 1. Environmental samples show 3.4x stronger geographic signal than human-associated samples
- 2. AlphaEarth embeddings encode real geographic signal — not noise
- 3. Strong clinical/human sampling bias in the AlphaEarth subset
- 4. 36% of coordinates flagged as potential institutional addresses
- 5. UMAP reveals fine-grained embedding structure with environment-correlated clusters
- 6. Embedding space also shows taxonomic structure
PaperBLAST Data Explorer
Completed
What does the `kescience_paperblast` collection contain, how current is it, and what are its coverage patterns across organisms, domains of life, and functional databases?
- Finding 1: One organism dominates nearly half of all literature
- Finding 2: 65.6% of genes have exactly one paper
- Finding 3: Literature inequality is extreme — Lorenz curves
- Finding 4: Bacterial research is concentrated on pathogens
- Finding 5: 345K protein families from 816K sequences
- Finding 6: 55% of protein families are dark or dim
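Finding 3 refers to Lorenz curves. For readers unfamiliar with them, here is a minimal sketch of how a Lorenz curve and the associated Gini coefficient could be computed from per-gene paper counts. The function names are hypothetical, the counts are assumed to be non-negative, and the toy input is invented for illustration; it is not PaperBLAST data and this is not necessarily how the project computed its curves.

```python
def lorenz_curve(counts):
    """Return Lorenz curve points as (cumulative share of items, cumulative share of total).

    Items are sorted from smallest to largest, so the curve bends below the
    diagonal as inequality increases. Assumes non-negative counts.
    """
    values = sorted(counts)
    total = sum(values)
    if not values or total <= 0:
        raise ValueError("counts must be non-empty with a positive sum")
    points = [(0.0, 0.0)]
    running = 0
    for i, value in enumerate(values, start=1):
        running += value
        points.append((i / len(values), running / total))
    return points


def gini(counts):
    """Gini coefficient: 1 minus twice the area under the Lorenz curve (trapezoid rule)."""
    pts = lorenz_curve(counts)
    area = sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
    return 1 - 2 * area


# Toy, made-up paper counts per gene: most genes have one paper, one has many.
toy_counts = [1, 1, 1, 1, 1, 2, 3, 40]
toy_gini = gini(toy_counts)
```

A Gini value near 0 means papers are spread evenly across genes, while values approaching 1 mean a few genes account for most of the literature.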
Web of Microbes Data Explorer
Completed
What does the `kescience_webofmicrobes` exometabolomics collection contain, which organisms overlap with the Fitness Browser, and how well do metabolite uptake/release profiles connect to pangenome-pr...
- 1. WoM Action Encoding Uses Four Distinct Semantics, Not Three
- 2. Two Direct Fitness Browser Strain Matches Plus Two Genus-Level Matches
- 3. 19 WoM-Produced Metabolites Are Tested as FB Carbon/Nitrogen Sources
- 4. 26.8% of WoM Metabolites Have Definitive ModelSEED Links (68.5% with Ambiguous Formula Matches)
- 5. ENIGMA Isolates Show Distinct "Metabolic Novelty Rates"
- 6. All WoM Genera Have Pangenome Species Clades
Acinetobacter baylyi ADP1 Data Explorer
Completed
What is the scope and structure of a comprehensive ADP1 database, and how do its annotations, metabolic models, and phenotype data intersect with BERDL collections (pangenome, biochemistry, fitness, P...
- 1. Rich Multi-Omics Database with 6 Data Modalities
- 2. Strong BERDL Connectivity: 4 of 5 Connection Types at >90% Match
- 3. Pangenome Cluster ID Bridge: 100% Mapping via Gene Junction Table
- 4. FBA and TnSeq Essentiality Agree 74% of the Time
- 5. Condition-Specific Fitness: Urea and Quinate Stand Apart
- 6. Essential Genes Are 6x More Likely to Have COG Annotations
- 7. Highly Conserved Core Metabolism Across 14 Genomes
- 8. 87% of Growth Predictions Depend on Gapfilled Reactions