Pangenome Collection
kbase_ke_pangenome
Philosophy
Enable comparative genomics at scale. Understand core vs accessory genome content, functional distributions, and evolutionary patterns across bacterial and archaeal species. Answer questions about what makes species unique and what they share.
Citation & Attribution
Provider: KBase, DOE
Cite as: Arkin AP et al. (2018) KBase: The United States Department of Energy Systems Biology Knowledgebase. Nat Biotechnol 36:566-569
DOI: 10.1038/nbt.4163
Website: https://www.kbase.us/
Scale
Schema Browser
Tables (23)
| Table | Rows |
|---|---|
| genome | 293,059 |
| gtdb_species_clade | 27,690 |
| pangenome | 27,702 |
| gtdb_metadata | 293,059 |
| gtdb_taxonomy_r214v1 | 293,059 |
| gene | 1,011,650,903 |
| gene_cluster | 132,531,501 |
| gene_genecluster_junction | 1,011,650,762 |
| eggnog_mapper_annotations | 93,558,330 |
| genome_ani | 421,218,641 |
| sample | 293,059 |
| ncbi_env | 4,124,801 |
| alphaearth_embeddings_all_years | 83,287 |
| gapmind_pathways | 305,471,280 |
| phylogenetic_tree | 330 |
| bakta_annotations | 132,538,155 |
| bakta_db_xrefs | 572,376,477 |
| bakta_pfam_domains | 18,807,208 |
| bakta_amr | 83,008 |
| interproscan_domains | 833,303,130 |
| interproscan_go | 266,317,724 |
| interproscan_pathways | 287,228,475 |
| phylogenetic_tree_distance_pairs | 22,570,755 |
gtdb_species_clade
27,690 rows
Species-level taxonomy and ANI statistics.
| Column | Type | Description |
|---|---|---|
| gtdb_species_clade_id | string | **Primary Key**. Format: `s__Genus_species--RS_GCF_XXXXXXXXX.X` |
| representative_genome_id | string | Type strain / representative genome |
| GTDB_species | string | Species name (e.g., "s__Escherichia coli") |
| GTDB_taxonomy | string | Full taxonomy string (d__;p__;c__;o__;f__;g__;s__) |
| ANI_circumscription_radius | float | ANI threshold for species boundary |
| mean_intra_species_ANI | float | Average ANI within species |
| min_intra_species_ANI | float | Minimum ANI within species |
| mean_intra_species_AF | float | Mean alignment fraction |
| min_intra_species_AF | float | Minimum alignment fraction |
| no_clustered_genomes_unfiltered | int | Raw genome count before QC |
| no_clustered_genomes_filtered | int | Genome count after QC filtering |
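The `gtdb_species_clade_id` and `GTDB_taxonomy` formats described in the table can be unpacked with plain string handling. The following is a minimal Python sketch, not part of the collection's own tooling: the function names `split_clade_id` and `parse_taxonomy` are hypothetical, and the taxonomy string in the usage lines is an illustrative input rather than a row copied from the table.

```python
from typing import Dict, Tuple

# Rank prefixes as they appear in the taxonomy string (d__;p__;c__;o__;f__;g__;s__).
RANK_PREFIXES = {
    "d": "domain", "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}


def split_clade_id(clade_id: str) -> Tuple[str, str]:
    """Split 's__Genus_species--RS_GCF_XXXXXXXXX.X' into (species part, genome accession)."""
    # Split on the last '--' so any hyphens inside the species part are preserved.
    species_part, sep, accession = clade_id.rpartition("--")
    if not sep or not species_part.startswith("s__") or not accession:
        raise ValueError(f"unexpected clade id format: {clade_id!r}")
    return species_part, accession


def parse_taxonomy(taxonomy: str) -> Dict[str, str]:
    """Map each rank name to its value; fields with an unknown prefix are skipped."""
    ranks: Dict[str, str] = {}
    for field in taxonomy.split(";"):
        prefix, sep, name = field.strip().partition("__")
        if sep and prefix in RANK_PREFIXES:
            ranks[RANK_PREFIXES[prefix]] = name
    return ranks


# Usage with the clade ID from the sample queries and an illustrative taxonomy string.
species, accession = split_clade_id("s__Escherichia_coli--RS_GCF_000005845.2")
lineage = parse_taxonomy("d__Bacteria;g__Escherichia;s__Escherichia coli")
```

Note that the ID uses an underscore between genus and species (`s__Escherichia_coli`) while the `GTDB_species` column uses a space (`s__Escherichia coli`), so the two are not directly interchangeable in a `WHERE` clause.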
Sample Queries
Get species with most genomes
SELECT
  s.GTDB_species,
  p.no_genomes,
  p.no_core,
  p.no_aux_genome,
  s.mean_intra_species_ANI
FROM kbase_ke_pangenome.pangenome p
JOIN kbase_ke_pangenome.gtdb_species_clade s
  ON p.gtdb_species_clade_id = s.gtdb_species_clade_id
ORDER BY p.no_genomes DESC
LIMIT 20
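To see the shape of the result this join produces without access to the collection, the query can be run against a small stand-in database. The sketch below uses Python's built-in `sqlite3` module with an attached in-memory database named `kbase_ke_pangenome`; this is an assumption made for illustration, not the engine that serves the collection. The table definitions include only the columns the query touches, all rows are made-up toy values, and `build_toy_db` and `top_species` are hypothetical helper names.

```python
import sqlite3

TOP_SPECIES_SQL = """
SELECT
  s.GTDB_species,
  p.no_genomes,
  p.no_core,
  p.no_aux_genome,
  s.mean_intra_species_ANI
FROM kbase_ke_pangenome.pangenome p
JOIN kbase_ke_pangenome.gtdb_species_clade s
  ON p.gtdb_species_clade_id = s.gtdb_species_clade_id
ORDER BY p.no_genomes DESC
LIMIT ?
"""


def build_toy_db() -> sqlite3.Connection:
    """Create an in-memory stand-in holding only the columns used by the query."""
    conn = sqlite3.connect(":memory:")
    # Attach a second in-memory database so schema-qualified names resolve.
    conn.execute("ATTACH DATABASE ':memory:' AS kbase_ke_pangenome")
    conn.executescript("""
        CREATE TABLE kbase_ke_pangenome.gtdb_species_clade (
            gtdb_species_clade_id TEXT PRIMARY KEY,
            GTDB_species TEXT,
            mean_intra_species_ANI REAL
        );
        CREATE TABLE kbase_ke_pangenome.pangenome (
            gtdb_species_clade_id TEXT,
            no_genomes INTEGER,
            no_core INTEGER,
            no_aux_genome INTEGER
        );
    """)
    # Toy rows: names, IDs, and counts are invented for this example.
    conn.executemany(
        "INSERT INTO kbase_ke_pangenome.gtdb_species_clade VALUES (?, ?, ?)",
        [
            ("s__Toy_alpha--RS_GCF_000000001.1", "s__Toy alpha", 98.5),
            ("s__Toy_beta--RS_GCF_000000002.1", "s__Toy beta", 97.1),
            ("s__Toy_gamma--RS_GCF_000000003.1", "s__Toy gamma", 99.2),
        ],
    )
    conn.executemany(
        "INSERT INTO kbase_ke_pangenome.pangenome VALUES (?, ?, ?, ?)",
        [
            ("s__Toy_alpha--RS_GCF_000000001.1", 12, 3000, 1500),
            ("s__Toy_beta--RS_GCF_000000002.1", 40, 2800, 9000),
            ("s__Toy_gamma--RS_GCF_000000003.1", 5, 2500, 400),
        ],
    )
    return conn


def top_species(conn: sqlite3.Connection, limit: int = 20) -> list:
    """Return (species, no_genomes, no_core, no_aux_genome, mean ANI) rows, largest first."""
    return conn.execute(TOP_SPECIES_SQL, (limit,)).fetchall()


conn = build_toy_db()
for row in top_species(conn, limit=2):
    print(row)
```

Passing the limit as a bound parameter keeps the query text fixed while the row count varies.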
Get functional annotations for a species
SELECT
  e.COG_category,
  e.Description,
  COUNT(*) AS gene_count
FROM kbase_ke_pangenome.gene g
JOIN kbase_ke_pangenome.eggnog_mapper_annotations e
  ON g.gene_id = e.gene_id
WHERE g.gtdb_species_clade_id = 's__Escherichia_coli--RS_GCF_000005845.2'
GROUP BY e.COG_category, e.Description
ORDER BY gene_count DESC
LIMIT 20
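The rows from this query are grouped by the raw `COG_category` value. eggNOG-mapper output can assign several COG letters to one entry (for example `KT`) and uses `-` when no category is assigned; whether this table stores categories in exactly that form is an assumption here. Under that assumption, a per-letter tally can be sketched in Python as follows, where `tally_cog_letters` is a hypothetical helper and the input rows are toy values.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

# One result row of the query: (COG_category, Description, gene_count).
CogRow = Tuple[Optional[str], Optional[str], int]


def tally_cog_letters(rows: Iterable[CogRow]) -> Counter:
    """Sum gene counts per single COG letter.

    A multi-letter category such as 'KT' contributes its full count to both
    'K' and 'T', so the totals can exceed the number of genes.
    """
    totals: Counter = Counter()
    for category, _description, gene_count in rows:
        if not category or category == "-":
            continue  # NULL or unassigned category
        for letter in category:
            if letter.isalpha():
                totals[letter] += gene_count
    return totals


# Toy rows shaped like the query result; the values are invented.
example_rows = [
    ("K", "toy transcription factor", 3),
    ("KT", "toy two-component regulator", 2),
    ("-", "toy unassigned protein", 5),
]
letter_totals = tally_cog_letters(example_rows)
```

Because the query ends with `LIMIT 20`, a tally over its output only covers the 20 largest groups; dropping the limit would be needed for totals over the whole species.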
Get genomes for a species with quality metrics
SELECT
  g.genome_id,
  g.ncbi_biosample_id,
  m.checkm_completeness,
  m.checkm_contamination,
  m.genome_size,
  m.gc_percentage
FROM kbase_ke_pangenome.genome g
LEFT JOIN kbase_ke_pangenome.gtdb_metadata m
  ON g.genome_id = m.accession
WHERE g.gtdb_species_clade_id = 's__Escherichia_coli--RS_GCF_000005845.2'
LIMIT 100
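Because this query uses a `LEFT JOIN`, a genome with no matching `gtdb_metadata` row still appears in the result, with NULL quality metrics. A downstream filter therefore has to decide what to do with missing values. The Python sketch below drops them and keeps genomes that pass completeness and contamination cutoffs. The helper name `filter_by_quality` is hypothetical, the default cutoffs (at least 90% complete, at most 5% contaminated) are a commonly used convention rather than anything the collection prescribes, and the `float()` coercion assumes the metrics may arrive as strings.

```python
from typing import Iterable, List, Sequence


def filter_by_quality(
    rows: Iterable[Sequence],
    min_completeness: float = 90.0,
    max_contamination: float = 5.0,
) -> List[Sequence]:
    """Keep rows whose CheckM metrics pass the cutoffs.

    Each row follows the query's column order:
    (genome_id, ncbi_biosample_id, checkm_completeness,
     checkm_contamination, genome_size, gc_percentage).
    """
    kept = []
    for row in rows:
        completeness, contamination = row[2], row[3]
        if completeness is None or contamination is None:
            continue  # no metadata match: the LEFT JOIN produced NULLs
        if (float(completeness) >= min_completeness
                and float(contamination) <= max_contamination):
            kept.append(row)
    return kept


# Toy rows shaped like the query result; IDs and values are invented.
example_rows = [
    ("toy_genome_1", "toy_biosample_1", 99.5, 0.4, 4600000, 50.8),
    ("toy_genome_2", "toy_biosample_2", 85.0, 1.0, 4100000, 50.5),
    ("toy_genome_3", None, None, None, None, None),
    ("toy_genome_4", "toy_biosample_4", "97.2", "6.3", 5000000, 50.6),
]
high_quality = filter_by_quality(example_rows)
```

The same cutoffs could instead be pushed into the SQL `WHERE` clause, at the cost of silently excluding the genomes that lack metadata.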
Related Collections
Fitness Browser
Gene fitness data from transposon mutant experiments across 40+ bacterial organi...
ModelSEED Biochemistry
Biochemistry reference data for metabolic modeling. Reactions, compounds, and pa...
KBase Genomes
Structural genomics data including contigs, features, and protein sequences from...
Projects Using This Collection
Carbon Source Utilization Predicts Ecology and Lifestyle in Pseudomonas
Among free-living *Pseudomonas* clades, does the carbon source utilization profile predict the soil ecosystem type from ...
CF Protective Microbiome Formulation Design
Can we build a multi-criterion framework that explains measured *P. aeruginosa* PA14 inhibition from metabolic competiti...
PGP Gene Distribution Across Environments & Pangenomes
Does environmental selection shape the distribution of plant growth-promoting (PGP) bacterial genes across the BERDL pan...
Environmental Resistome at Pangenome Scale
Do antimicrobial resistance gene profiles differ between ecological niches across 27,000 bacterial species? Using 83K AM...
AMR Co-Fitness Support Networks
What genes are co-regulated with antimicrobial resistance (AMR) genes across growth conditions, and do these "support ne...
Ecotype Correlation Analysis
What drives gene content similarity between bacterial genomes: environmental similarity or phylogenetic relatedness?
Pan-Bacterial Metal Fitness Atlas
Across diverse bacteria subjected to genome-wide fitness profiling under metal stress, what is the genetic architecture ...
Pan-Bacterial AMR Gene Landscape
What is the distribution, conservation, phylogenetic structure, functional context, and environmental association of ant...
Metabolic Capability vs Dependency
When a bacterium's genome encodes a complete biosynthetic or catabolic pathway, does the organism actually depend on it?...
COG Functional Category Analysis
How do COG functional category distributions differ across core, auxiliary, and novel genes in bacterial pangenomes?
BacDive Isolation Environment × Metal Tolerance Prediction
Do bacteria isolated from metal-contaminated environments have higher predicted metal tolerance scores than bacteria fro...
Aromatic Catabolism Support Network in ADP1
Why does aromatic catabolism in *Acinetobacter baylyi* ADP1 require Complex I (NADH dehydrogenase), iron acquisition, an...
Pangenome Openness Analysis
Do open pangenomes show different patterns of environmental vs phylogenetic effects compared to closed pangenomes?
Metabolic Capability vs Metabolic Dependency
Just because a bacterium's genome encodes a complete metabolic pathway (metabolic *capability*), does the organism actua...
Acinetobacter baylyi ADP1 Data Explorer
What is the scope and structure of a comprehensive ADP1 database, and how do its annotations, metabolic models, and phen...
BacDive Phenotype Signatures of Metal Tolerance
Can BacDive-measured bacterial phenotypes (Gram stain, oxygen tolerance, metabolite utilization, enzyme activities) pred...
Conservation vs Fitness -- Linking FB Genes to Pangenome Clusters
Are essential genes preferentially conserved in the core genome, and what functional categories distinguish essential-co...
Contamination Gradient vs Functional Potential in ENIGMA Communities
Do high-contamination Oak Ridge groundwater communities show enrichment for taxa with higher inferred stress-related fun...
Pan-bacterial Fitness Modules via Independent Component Analysis
Can we decompose RB-TnSeq fitness compendia into latent functional modules via robust ICA, align them across organisms u...
Temporal Core Genome Dynamics
How does core genome composition change over sampling time, and do genes transition in and out of core status?
ADP1 Deletion Collection Phenotype Analysis
What is the condition-dependent structure of gene essentiality in *Acinetobacter baylyi* ADP1, as revealed by the de Ber...
Web of Microbes Data Explorer
What does the `kescience_webofmicrobes` exometabolomics collection contain, which organisms overlap with the Fitness Bro...
Condition-Specific Respiratory Chain Wiring in ADP1
How is *Acinetobacter baylyi* ADP1's branched respiratory chain wired across carbon sources — which NADH dehydrogenases ...
AlphaEarth Embeddings, Geography & Environment Explorer
What do AlphaEarth environmental embeddings capture, and how do they relate to geographic coordinates and NCBI environme...
Co-fitness Predicts Co-inheritance in Bacterial Pangenomes
Do genes with correlated fitness profiles (co-fit) tend to co-occur in the same genomes across a species' pangenome? Doe...
The Pan-Bacterial Essential Metabolome
Which biochemical reactions are universally essential across bacteria, and what does the essential metabolome reveal abo...
Counter Ion Effects on Metal Fitness Measurements
When bacteria are exposed to metal salts (CoCl₂, NiCl₂, CuCl₂), how much of the observed fitness effect is caused by the...
Field vs Lab Gene Importance in *Desulfovibrio vulgaris* Hildenborough
Which genes matter for survival under environmentally-realistic conditions but appear dispensable in the lab, and vice v...
Ecotype Reanalysis: Environmental-Only Samples
Does the environment effect on gene content become stronger when analysis is restricted to genuinely environmental sampl...
The 5,526 Costly + Dispensable Genes
What characterizes genes that are simultaneously burdensome (fitness improves when deleted) and not conserved in the pan...
Metabolic Consistency of Pseudomonas FW300-N2E3
For *Pseudomonas fluorescens* FW300-N2E3 (ENIGMA groundwater isolate), how consistent are exometabolomic outputs (Web of...
Within-Species AMR Strain Variation
Within a species, how does the AMR repertoire vary between strains, and what drives that variation?
Fitness Cost of Antimicrobial Resistance Genes
Do antimicrobial resistance (AMR) genes impose a fitness cost in the absence of antibiotic selection pressure? Using gen...
Truly Dark Genes — What Remains Unknown After Modern Annotation?
Among the ~6,400 Fitness Browser genes that remain functionally unannotated even after bakta v1.12.0 reannotation, what ...
Functional Dark Matter — Experimentally Prioritized Novel Genetic Systems
Which genes of unknown function across 48 bacteria have strong fitness phenotypes, and can biogeographic patterns, pathw...
SNIPE Defense System: Prevalence and Taxonomic Distribution in the BERDL Pangenome
How prevalent are SNIPE (Surface-associated Nuclease Inhibiting Phage Entry) homologues across the 293K-genome BERDL pan...
Community Metabolic Ecology via NMDC × Pangenome Integration
Do the GapMind-predicted pathway completeness profiles of community resident taxa predict or correlate with observed met...
Prophage Gene Modules and Terminase-Defined Lineages Across Bacterial Phylogeny and Environmental Gradients
How are prophage gene modules and terminase-defined prophage lineages distributed across bacterial phylogeny and environ...
Polyhydroxybutyrate Granule Formation Pathways: Distribution Across Clades and Environmental Selection
How are polyhydroxybutyrate (PHB) granule-forming pathways distributed across bacterial clades and environments, and doe...
Openness vs Functional Composition
Do species with open pangenomes show different COG functional enrichment patterns than species with closed pangenomes?
Antibiotic Resistance Hotspots in Microbial Pangenomes
Which microbial species and ecological environments show the highest concentration of antibiotic resistance genes, and c...
Pangenome Openness, Metabolic Pathways, and Biogeography
Do pangenome characteristics (open vs. closed) correlate with metabolic pathway diversity and biogeographic distribution...
Pangenome Openness, Metabolic Pathways, and Phylogenetic Distances
How do pangenome characteristics (open vs. closed) correlate with metabolic pathway completeness, phylogenetic distances...