14 Collections
22 Connections
30 Cross-Collection Projects
4 Explorer Projects

Collection Network

Click a collection to see its connections. Edges come from explicit links (schema relationships) and projects that use multiple collections (project co-usage).

🧬 Pangenome Collection 8 connections
📊 Fitness Browser 9 connections
💬 KBase Genomes 2 connections
ModelSEED Biochemistry 5 connections
🌱 Phenotype Collection 1 connection
🦠 PhageFoundry Browsers 5 connections
🪰 ENIGMA CORAL 2 connections
🔬 NMDC Multi-omics 3 connections
🔬 NMDC BioSamples 1 connection
🌊 PlanetMicrobe 0 connections
🦠 PROTECT Pathogen Browser 4 connections
📄 UniRef Clusters 4 connections
📄 UniProt Annotations 0 connections
📄 Ontologies 0 connections
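
The two edge sources described above (explicit schema links and project co-usage) can be sketched as a small undirected graph. This is a minimal illustration, not the site's actual implementation: the collection and project names in `schema_links` and `project_collections` are hypothetical placeholders, and the function names are invented for this sketch. The only values taken from the page are the listed connection counts, which sum to 44, twice the 22 connections reported, as expected when each connection is counted once at each end.

```python
from collections import Counter
from itertools import combinations

# Hypothetical inputs: these names are placeholders, not data from the page.
schema_links = [("collection_a", "collection_b")]
project_collections = {
    "project_1": ["collection_a", "collection_c"],
    "project_2": ["collection_a", "collection_b", "collection_c"],
}


def build_edges(schema_links, project_collections):
    """Return the set of undirected edges from both edge sources."""
    edges = {frozenset(pair) for pair in schema_links}
    for used in project_collections.values():
        # Every pair of collections used by one project is a co-usage edge.
        for a, b in combinations(sorted(set(used)), 2):
            edges.add(frozenset((a, b)))
    return edges


def connection_counts(edges):
    """Count how many edges touch each collection."""
    counts = Counter()
    for edge in edges:
        for name in edge:
            counts[name] += 1
    return counts


edges = build_edges(schema_links, project_collections)
counts = connection_counts(edges)

# Connection counts as listed on the page, in display order.
listed = [8, 9, 2, 5, 1, 5, 2, 3, 1, 0, 4, 4, 0, 0]
# Each edge contributes to two collections, so the sum is 2 x 22.
assert sum(listed) == 2 * 22
```

Storing each edge as a `frozenset` means a pair that is both schema-linked and co-used by a project is counted once, which is one plausible reading of how the connection totals are formed.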

Cross-Collection Join Paths

Documented ways to connect data across collections. Each path shows the relationship and linking strategy between two collections.

Source: kbase_ke_pangenome

Target: kbase_genomes

Bridging projects:

Source: kescience_fitnessbrowser

Target: kbase_phenotype

Schema-linked collections. See individual collection pages for join examples.

Source: kbase_msd_biochemistry

Target: kescience_fitnessbrowser

Bridging projects:

Source: enigma_coral

Target: kbase_ke_pangenome

Bridging projects:

Source: enigma_coral

Target: kescience_fitnessbrowser

Bridging projects:

Source: nmdc_ncbi_biosamples

Target: nmdc_arkin

Schema-linked collections. See individual collection pages for join examples.
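
The six documented paths above can be held in a small lookup table. The sketch below is only an illustration: the `JoinPath` type, the `find_path` helper, and the strategy labels `"schema"` and `"bridging_projects"` are shorthand invented here, and treating a path as usable in either direction is an assumption rather than something the page states.

```python
from typing import NamedTuple, Optional


class JoinPath(NamedTuple):
    source: str
    target: str
    strategy: str  # "schema" or "bridging_projects" (labels invented here)


# Source/target pairs exactly as listed in the join-path section.
JOIN_PATHS = [
    JoinPath("kbase_ke_pangenome", "kbase_genomes", "bridging_projects"),
    JoinPath("kescience_fitnessbrowser", "kbase_phenotype", "schema"),
    JoinPath("kbase_msd_biochemistry", "kescience_fitnessbrowser", "bridging_projects"),
    JoinPath("enigma_coral", "kbase_ke_pangenome", "bridging_projects"),
    JoinPath("enigma_coral", "kescience_fitnessbrowser", "bridging_projects"),
    JoinPath("nmdc_ncbi_biosamples", "nmdc_arkin", "schema"),
]


def find_path(a: str, b: str) -> Optional[JoinPath]:
    """Return the documented path between two collections, or None.

    Assumption: a path can be followed in either direction.
    """
    for path in JOIN_PATHS:
        if {path.source, path.target} == {a, b}:
            return path
    return None
```

For example, `find_path("kbase_genomes", "kbase_ke_pangenome")` returns the first entry, while a pair with no documented path returns `None`.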

Explorer Project Highlights

Deep-dive explorations of BERDL collections, characterizing their content, cross-collection links, and research potential.

AlphaEarth Embeddings, Geography & Environment Explorer

Completed

What do AlphaEarth environmental embeddings capture, and how do they relate to geographic coordinates and NCBI environment labels?

Pangenome Collection
  • 1. Environmental samples show 3.4x stronger geographic signal than human-associated samples
  • 2. AlphaEarth embeddings encode real geographic signal — not noise
  • 3. Strong clinical/human sampling bias in the AlphaEarth subset
  • 4. 36% of coordinates flagged as potential institutional addresses
  • 5. UMAP reveals fine-grained embedding structure with environment-correlated clusters
  • 6. Embedding space also shows taxonomic structure

PaperBLAST Data Explorer

Completed

What does the `kescience_paperblast` collection contain, how current is it, and what are its coverage patterns across organisms, domains of life, and functional databases?

Fitness Browser
  • Finding 1: One organism dominates nearly half of all literature
  • Finding 2: 65.6% of genes have exactly one paper
  • Finding 3: Literature inequality is extreme — Lorenz curves
  • Finding 4: Bacterial research is concentrated on pathogens
  • Finding 5: 345K protein families from 816K sequences
  • Finding 6: 55% of protein families are dark or dim
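
Finding 3 refers to Lorenz curves as a measure of literature inequality. As a rough sketch of what such a curve computes, the code below builds Lorenz points and a Gini coefficient from a list of per-gene paper counts. The `papers_per_gene` values are hypothetical and are not drawn from `kescience_paperblast`; the function names are invented for this example, and the project itself may compute these quantities differently.

```python
def lorenz_curve(values):
    """Return (population share, cumulative value share) points from (0, 0)."""
    ordered = sorted(values)
    total = sum(ordered)
    n = len(ordered)
    points = [(0.0, 0.0)]
    running = 0
    for i, v in enumerate(ordered, start=1):
        running += v
        points.append((i / n, running / total))
    return points


def gini(values):
    """Gini coefficient: 1 minus twice the trapezoid area under the Lorenz curve."""
    pts = lorenz_curve(values)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return 1 - 2 * area


# Hypothetical papers-per-gene counts, not data from the collection.
papers_per_gene = [1, 1, 1, 1, 1, 1, 2, 3, 5, 40]
inequality = gini(papers_per_gene)
```

A perfectly even distribution gives a Gini coefficient of 0, and values closer to 1 indicate that a few genes account for most of the papers.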

Web of Microbes Data Explorer

Completed

What does the `kescience_webofmicrobes` exometabolomics collection contain, which organisms overlap with the Fitness Browser, and how well do metabolite uptake/release profiles connect to pangenome-pr...

Pangenome Collection, Fitness Browser, ModelSEED Biochemistry
  • 1. WoM Action Encoding Uses Four Distinct Semantics, Not Three
  • 2. Two Direct Fitness Browser Strain Matches Plus Two Genus-Level Matches
  • 3. 19 WoM-Produced Metabolites Are Tested as FB Carbon/Nitrogen Sources
  • 4. 26.8% of WoM Metabolites Have Definitive ModelSEED Links (68.5% with Ambiguous Formula Matches)
  • 5. ENIGMA Isolates Show Distinct "Metabolic Novelty Rates"
  • 6. All WoM Genera Have Pangenome Species Clades

Acinetobacter baylyi ADP1 Data Explorer

Completed

What is the scope and structure of a comprehensive ADP1 database, and how do its annotations, metabolic models, and phenotype data intersect with BERDL collections (pangenome, biochemistry, fitness, P...

Pangenome Collection, Fitness Browser, ModelSEED Biochemistry, PhageFoundry Browsers, UniRef Clusters
  • 1. Rich Multi-Omics Database with 6 Data Modalities
  • 2. Strong BERDL Connectivity: 4 of 5 Connection Types at >90% Match
  • 3. Pangenome Cluster ID Bridge: 100% Mapping via Gene Junction Table
  • 4. FBA and TnSeq Essentiality Agree 74% of the Time
  • 5. Condition-Specific Fitness: Urea and Quinate Stand Apart
  • 6. Essential Genes Are 6x More Likely to Have COG Annotations
  • 7. Highly Conserved Core Metabolism Across 14 Genomes
  • 8. 87% of Growth Predictions Depend on Gapfilled Reactions