🧬

Pangenome Collection

kbase_ke_pangenome

Primary

Philosophy

Enable comparative genomics at scale. Understand core vs accessory genome content, functional distributions, and evolutionary patterns across bacterial and archaeal species. Answer questions about what makes species unique and what they share.

Data sources: GTDB r214, eggNOG v6, GapMind, AlphaEarth

Citation & Attribution

Provider: KBase, DOE

Cite as: Arkin AP et al. (2018) KBase: The United States Department of Energy Systems Biology Knowledgebase. Nat Biotechnol 36:566-569

DOI: 10.1038/nbt.4163

Website: https://www.kbase.us/

Scale

293,059 genomes

27,690 species

1B+ genes

14 tables

Schema Browser

genome (293,059 rows)

Central genome table linking to species clades and taxonomy.

| Column | Type | Description |
| --- | --- | --- |
| genome_id | string | **Primary Key**. Format: `RS_GCF_XXXXXXXXX.X` or `GB_GCA_XXXXXXXXX.X` |
| gtdb_species_clade_id | string | FK → `gtdb_species_clade` |
| gtdb_taxonomy_id | string | FK → `gtdb_taxonomy_r214v1` |
| ncbi_biosample_id | string | NCBI BioSample accession |
| fna_file_path_nersc | string | Path to nucleotide FASTA at NERSC |
| faa_file_path_nersc | string | Path to protein FASTA at NERSC |
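
The `genome_id` format listed above can be validated and split in a few lines of Python. This is a minimal sketch, not part of the collection's tooling: the helper name `split_genome_id` is hypothetical, and the pattern assumes nine digits followed by a numeric version of one or more digits, as `XXXXXXXXX.X` suggests.

```python
import re

# Pattern derived from the documented format: RS_GCF_XXXXXXXXX.X or GB_GCA_XXXXXXXXX.X.
# Assumption: nine digits, then a numeric version of one or more digits.
_GENOME_ID_RE = re.compile(r"(RS_GCF|GB_GCA)_(\d{9})\.(\d+)")


def split_genome_id(genome_id: str) -> tuple[str, str]:
    """Return (prefix, accession), e.g. ("RS", "GCF_000005845.2")."""
    match = _GENOME_ID_RE.fullmatch(genome_id)
    if match is None:
        raise ValueError(f"unexpected genome_id format: {genome_id!r}")
    head, digits, version = match.groups()
    prefix, assembly = head.split("_")  # "RS_GCF" -> ("RS", "GCF")
    return prefix, f"{assembly}_{digits}.{version}"


print(split_genome_id("RS_GCF_000005845.2"))
```

In GTDB accessions, the `RS_` prefix marks a RefSeq assembly and `GB_` a GenBank assembly; the remainder is the NCBI assembly accession.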

Sample Queries

Get species with most genomes

SELECT
  s.GTDB_species,
  p.no_genomes,
  p.no_core,
  p.no_aux_genome,
  s.mean_intra_species_ANI
FROM kbase_ke_pangenome.pangenome p
JOIN kbase_ke_pangenome.gtdb_species_clade s
  ON p.gtdb_species_clade_id = s.gtdb_species_clade_id
ORDER BY p.no_genomes DESC
LIMIT 20

Get functional annotations for a species

SELECT
  e.COG_category,
  e.Description,
  COUNT(*) AS gene_count
FROM kbase_ke_pangenome.gene g
JOIN kbase_ke_pangenome.eggnog_mapper_annotations e
  ON g.gene_id = e.gene_id
WHERE g.gtdb_species_clade_id = 's__Escherichia_coli--RS_GCF_000005845.2'
GROUP BY e.COG_category, e.Description
ORDER BY gene_count DESC
LIMIT 20

Get genomes for a species with quality metrics

SELECT
  g.genome_id,
  g.ncbi_biosample_id,
  m.checkm_completeness,
  m.checkm_contamination,
  m.genome_size,
  m.gc_percentage
FROM kbase_ke_pangenome.genome g
LEFT JOIN kbase_ke_pangenome.gtdb_metadata m
  ON g.genome_id = m.accession
WHERE g.gtdb_species_clade_id = 's__Escherichia_coli--RS_GCF_000005845.2'
LIMIT 100
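
The quality-metrics query above uses a `LEFT JOIN`, so a genome with no matching `gtdb_metadata` row still appears, with empty metric columns. The sketch below illustrates that behavior and shows the species clade ID passed as a bound parameter instead of a literal. It is an assumption-laden stand-in: the source does not describe BERDL's query engine or client API, so Python's built-in `sqlite3` is used only to make the example runnable, the table definitions include only the columns the query touches, and every row is a made-up placeholder.

```python
import sqlite3

# Stand-in database: SQLite is used only so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS kbase_ke_pangenome")
conn.execute(
    "CREATE TABLE kbase_ke_pangenome.genome ("
    "genome_id TEXT PRIMARY KEY, gtdb_species_clade_id TEXT, ncbi_biosample_id TEXT)"
)
conn.execute(
    "CREATE TABLE kbase_ke_pangenome.gtdb_metadata ("
    "accession TEXT PRIMARY KEY, checkm_completeness REAL, "
    "checkm_contamination REAL, genome_size INTEGER, gc_percentage REAL)"
)

CLADE = "s__Escherichia_coli--RS_GCF_000005845.2"

# Placeholder rows: every ID and value below is made up for illustration.
conn.executemany(
    "INSERT INTO kbase_ke_pangenome.genome VALUES (?, ?, ?)",
    [
        ("RS_GCF_000000001.1", CLADE, "BIOSAMPLE_PLACEHOLDER_1"),
        ("GB_GCA_000000002.1", CLADE, "BIOSAMPLE_PLACEHOLDER_2"),
    ],
)
# Only the first genome gets a metadata row.
conn.execute(
    "INSERT INTO kbase_ke_pangenome.gtdb_metadata VALUES (?, ?, ?, ?, ?)",
    ("RS_GCF_000000001.1", 99.0, 0.5, 4600000, 50.8),
)

# Same query as above, with a bound parameter and a deterministic order.
QUERY = """
SELECT
  g.genome_id,
  g.ncbi_biosample_id,
  m.checkm_completeness,
  m.checkm_contamination,
  m.genome_size,
  m.gc_percentage
FROM kbase_ke_pangenome.genome g
LEFT JOIN kbase_ke_pangenome.gtdb_metadata m
  ON g.genome_id = m.accession
WHERE g.gtdb_species_clade_id = ?
ORDER BY g.genome_id
LIMIT 100
"""

rows = conn.execute(QUERY, (CLADE,)).fetchall()
for row in rows:
    print(row)
```

Both placeholder genomes are returned; the one without a metadata row carries `None` in the four metric columns. Switching to an inner `JOIN` would drop it instead.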

Projects Using This Collection

Carbon Source Utilization Predicts Ecology and Lifestyle in Pseudomonas

Among free-living *Pseudomonas* clades, does the carbon source utilization profile predict the soil ecosystem type from ...

CF Protective Microbiome Formulation Design

Can we build a multi-criterion framework that explains measured *P. aeruginosa* PA14 inhibition from metabolic competiti...

PGP Gene Distribution Across Environments & Pangenomes

Does environmental selection shape the distribution of plant growth-promoting (PGP) bacterial genes across the BERDL pan...

Environmental Resistome at Pangenome Scale

Do antimicrobial resistance gene profiles differ between ecological niches across 27,000 bacterial species? Using 83K AM...

AMR Co-Fitness Support Networks

What genes are co-regulated with antimicrobial resistance (AMR) genes across growth conditions, and do these "support ne...

Ecotype Correlation Analysis

What drives gene content similarity between bacterial genomes: environmental similarity or phylogenetic relatedness?

Pan-Bacterial Metal Fitness Atlas

Across diverse bacteria subjected to genome-wide fitness profiling under metal stress, what is the genetic architecture ...

Pan-Bacterial AMR Gene Landscape

What is the distribution, conservation, phylogenetic structure, functional context, and environmental association of ant...

Metabolic Capability vs Dependency

When a bacterium's genome encodes a complete biosynthetic or catabolic pathway, does the organism actually depend on it?...

COG Functional Category Analysis

How do COG functional category distributions differ across core, auxiliary, and novel genes in bacterial pangenomes?

BacDive Isolation Environment × Metal Tolerance Prediction

Do bacteria isolated from metal-contaminated environments have higher predicted metal tolerance scores than bacteria fro...

Aromatic Catabolism Support Network in ADP1

Why does aromatic catabolism in *Acinetobacter baylyi* ADP1 require Complex I (NADH dehydrogenase), iron acquisition, an...

Pangenome Openness Analysis

Do open pangenomes show different patterns of environmental vs phylogenetic effects compared to closed pangenomes?

Metabolic Capability vs Metabolic Dependency

Just because a bacterium's genome encodes a complete metabolic pathway (metabolic *capability*), does the organism actua...

Acinetobacter baylyi ADP1 Data Explorer

What is the scope and structure of a comprehensive ADP1 database, and how do its annotations, metabolic models, and phen...

BacDive Phenotype Signatures of Metal Tolerance

Can BacDive-measured bacterial phenotypes (Gram stain, oxygen tolerance, metabolite utilization, enzyme activities) pred...

Conservation vs Fitness -- Linking FB Genes to Pangenome Clusters

Are essential genes preferentially conserved in the core genome, and what functional categories distinguish essential-co...

Contamination Gradient vs Functional Potential in ENIGMA Communities

Do high-contamination Oak Ridge groundwater communities show enrichment for taxa with higher inferred stress-related fun...

Pan-bacterial Fitness Modules via Independent Component Analysis

Can we decompose RB-TnSeq fitness compendia into latent functional modules via robust ICA, align them across organisms u...

Temporal Core Genome Dynamics

How does core genome composition change over sampling time, and do genes transition in and out of core status?

ADP1 Deletion Collection Phenotype Analysis

What is the condition-dependent structure of gene essentiality in *Acinetobacter baylyi* ADP1, as revealed by the de Ber...

Web of Microbes Data Explorer

What does the `kescience_webofmicrobes` exometabolomics collection contain, which organisms overlap with the Fitness Bro...

Condition-Specific Respiratory Chain Wiring in ADP1

How is *Acinetobacter baylyi* ADP1's branched respiratory chain wired across carbon sources — which NADH dehydrogenases ...

AlphaEarth Embeddings, Geography & Environment Explorer

What do AlphaEarth environmental embeddings capture, and how do they relate to geographic coordinates and NCBI environme...

Co-fitness Predicts Co-inheritance in Bacterial Pangenomes

Do genes with correlated fitness profiles (co-fit) tend to co-occur in the same genomes across a species' pangenome? Doe...

The Pan-Bacterial Essential Metabolome

Which biochemical reactions are universally essential across bacteria, and what does the essential metabolome reveal abo...

Counter Ion Effects on Metal Fitness Measurements

When bacteria are exposed to metal salts (CoCl₂, NiCl₂, CuCl₂), how much of the observed fitness effect is caused by the...

Field vs Lab Gene Importance in *Desulfovibrio vulgaris* Hildenborough

Which genes matter for survival under environmentally-realistic conditions but appear dispensable in the lab, and vice v...

Ecotype Reanalysis: Environmental-Only Samples

Does the environment effect on gene content become stronger when analysis is restricted to genuinely environmental sampl...

The 5,526 Costly + Dispensable Genes

What characterizes genes that are simultaneously burdensome (fitness improves when deleted) and not conserved in the pan...

Metabolic Consistency of Pseudomonas FW300-N2E3

For *Pseudomonas fluorescens* FW300-N2E3 (ENIGMA groundwater isolate), how consistent are exometabolomic outputs (Web of...

Within-Species AMR Strain Variation

Within a species, how does the AMR repertoire vary between strains, and what drives that variation?

Fitness Cost of Antimicrobial Resistance Genes

Do antimicrobial resistance (AMR) genes impose a fitness cost in the absence of antibiotic selection pressure? Using gen...

Truly Dark Genes — What Remains Unknown After Modern Annotation?

Among the ~6,400 Fitness Browser genes that remain functionally unannotated even after bakta v1.12.0 reannotation, what ...

Functional Dark Matter — Experimentally Prioritized Novel Genetic Systems

Which genes of unknown function across 48 bacteria have strong fitness phenotypes, and can biogeographic patterns, pathw...

SNIPE Defense System: Prevalence and Taxonomic Distribution in the BERDL Pangenome

How prevalent are SNIPE (Surface-associated Nuclease Inhibiting Phage Entry) homologues across the 293K-genome BERDL pan...

Community Metabolic Ecology via NMDC × Pangenome Integration

Do the GapMind-predicted pathway completeness profiles of community resident taxa predict or correlate with observed met...

Prophage Gene Modules and Terminase-Defined Lineages Across Bacterial Phylogeny and Environmental Gradients

How are prophage gene modules and terminase-defined prophage lineages distributed across bacterial phylogeny and environ...

Polyhydroxybutyrate Granule Formation Pathways: Distribution Across Clades and Environmental Selection

How are polyhydroxybutyrate (PHB) granule-forming pathways distributed across bacterial clades and environments, and doe...

Openness vs Functional Composition

Do species with open pangenomes show different COG functional enrichment patterns than species with closed pangenomes?

Antibiotic Resistance Hotspots in Microbial Pangenomes

Which microbial species and ecological environments show the highest concentration of antibiotic resistance genes, and c...

Pangenome Openness, Metabolic Pathways, and Biogeography

Do pangenome characteristics (open vs. closed) correlate with metabolic pathway diversity and biogeographic distribution...

Pangenome Openness, Metabolic Pathways, and Phylogenetic Distances

How do pangenome characteristics (open vs. closed) correlate with metabolic pathway completeness, phylogenetic distances...

Start Exploring

Access the full Pangenome Collection data through BERDL JupyterHub.
