COG Functional Category Analysis
Completed
Research Question
How do COG functional category distributions differ across core, auxiliary, and novel genes in bacterial pangenomes?
Research Plan
Hypothesis
Novel genes may have different functional profiles compared to core genes:
- Core genes: Expected to be enriched in essential functions (translation, metabolism, cell processes)
- Auxiliary genes: Expected to show intermediate patterns
- Novel genes: May be enriched in mobile elements, defense mechanisms, or poorly characterized functions
Approach
- Query the `kbase_ke_pangenome` database for gene cluster information
- Extract COG functional category annotations from the `eggnog_mapper_annotations` table
- Classify genes based on their gene_cluster attributes:
  - Core: `is_core = 1` (present in most/all genomes)
  - Auxiliary: `is_auxiliary = 1` AND `is_singleton = 0` (present in some genomes)
  - Singleton/Novel: `is_singleton = 1` (present in only one genome)
- Compare COG category distributions across these three classes
- Calculate enrichment/depletion of each COG category in novel vs core genes
- Visualize results with heatmaps and bar plots
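The classification and enrichment steps in the approach can be sketched with pandas as follows. This is a minimal illustration on made-up rows, not the notebook's actual code: the `gene_cluster_id` and `cog_category` column names and the toy values are assumptions, and enrichment is assumed here to be the difference in percentage points between the singleton and core proportions of each category.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for gene_cluster joined with
# eggnog_mapper_annotations. Only the is_* flags come from the approach
# above; the other column names and all values are illustrative.
clusters = pd.DataFrame({
    "gene_cluster_id": ["c1", "c2", "c3", "c4", "c5", "c6"],
    "is_core":         [1, 1, 0, 0, 0, 0],
    "is_auxiliary":    [0, 0, 1, 1, 1, 1],
    "is_singleton":    [0, 0, 0, 0, 1, 1],
    "cog_category":    ["J", "C", "E", "L", "L", "V"],
})

# Assign each cluster to exactly one class using the rules listed above.
conditions = [
    clusters["is_core"] == 1,
    clusters["is_singleton"] == 1,
    (clusters["is_auxiliary"] == 1) & (clusters["is_singleton"] == 0),
]
clusters["gene_class"] = np.select(
    conditions, ["core", "singleton", "auxiliary"], default="unclassified"
)

# Percentage of each COG category within each gene class (rows sum to 100).
proportions = pd.crosstab(
    clusters["gene_class"], clusters["cog_category"], normalize="index"
) * 100

# Assumed enrichment definition: singleton percentage minus core percentage.
# Positive values mean the category is more common among singleton genes.
enrichment = proportions.loc["singleton"] - proportions.loc["core"]
print(enrichment.sort_values(ascending=False))
```

Under this definition a positive value marks a category over-represented among singleton genes and a negative value marks one over-represented among core genes.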
Revision History
- v1 (2026-02): Migrated from README.md
Overview
This project analyzes COG (Clusters of Orthologous Groups) functional category distributions across core, auxiliary, and singleton gene classes in bacterial pangenomes. Using annotations from the eggnog_mapper_annotations table and gene classifications from the gene_cluster table in BERDL, it finds a consistent "two-speed genome" pattern that holds across all 9 bacterial phyla sampled.
Key Findings
Universal Functional Partitioning in Bacterial Pangenomes
Analysis of 32 species across 9 phyla (357,623 genes) reveals a remarkably consistent "two-speed genome":
Novel/singleton genes consistently enriched in:
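- L (Replication, recombination and repair; treated here as the mobile-element category because it includes transposases and integrases): +10.88% enrichment, 100% consistency across species -- STRONGEST SIGNAL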
- V (Defense mechanisms): +2.83% enrichment, 100% consistency
- S (Unknown function): +1.64% enrichment, 69% consistency
Core genes consistently enriched in (negative values indicate depletion in novel/singleton genes relative to core genes):
- J (Translation): -4.65% enrichment, 97% consistency -- STRONGEST DEPLETION
- F (Nucleotide metabolism): -2.09% enrichment, 100% consistency
- H (Coenzyme metabolism): -2.06% enrichment, 97% consistency
- E (Amino acid metabolism): -1.81% enrichment, 81% consistency
- C (Energy production): -1.75% enrichment, 88% consistency
(Notebook: cog_analysis.ipynb)
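One plausible reading of the "consistency" percentages above is the share of species whose enrichment has the same sign as the cross-species mean. The sketch below illustrates that reading only; the definition is an assumption, not taken from the notebook, and the species names and values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical per-species enrichment values (percentage points,
# singleton minus core). Species names and numbers are made up.
per_species = pd.DataFrame(
    {"L": [12.0, 9.5, 11.0, 8.0], "J": [-5.0, -4.0, 1.0, -6.0]},
    index=["species_a", "species_b", "species_c", "species_d"],
)

# Mean enrichment of each COG category across species.
mean_enrichment = per_species.mean()

# Assumed definition: a species is "consistent" for a category when its
# enrichment has the same sign as the cross-species mean.
same_sign = np.sign(per_species).eq(np.sign(mean_enrichment), axis="columns")
consistency = same_sign.mean() * 100  # percent of species agreeing

summary = pd.DataFrame(
    {"mean_enrichment": mean_enrichment, "consistency_pct": consistency}
)
print(summary)
```

In this toy input, L is positive in every species while J is negative in three of four, so J has lower consistency despite a clearly negative mean.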
Composite COG Categories Are Biologically Meaningful
Multi-function genes with composite COG assignments (e.g., "LV" = mobile+defense, "EGP" = amino acid+carbohydrate+inorganic ion) are not annotation artifacts:
- LV (mobile+defense): +0.34% enrichment, 76% consistency
- Suggests functional modules like "mobile defense islands"
- Should not be filtered out as noise -- they represent genuine multi-functional genes
(Notebook: cog_analysis.ipynb)
Interpretation
- Core genes = ancient, conserved "metabolic engine" (translation, energy, biosynthesis)
- Novel genes = recent acquisitions for ecological adaptation (mobile elements, defense, niche-specific)
- Horizontal gene transfer (HGT), rather than change within vertically inherited genes, appears to be the primary source of gene novelty
- The massive L enrichment (+10.88%) suggests most genomic novelty comes from mobile elements
- Patterns hold across all 9 phyla sampled, suggesting deep evolutionary constraint
All 8 predictions from the initial N. gonorrhoeae analysis were confirmed across 32 species, which suggests that this functional partitioning is a general organizing principle of bacterial pangenome structure.
Literature Context
The finding that mobile elements (COG L) dominate novel genes connects to Koonin & Wolf (2008) "Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world" and Treangen & Rocha (2011) "Horizontal transfer, not duplication, drives the expansion of protein families in prokaryotes". Both showed HGT as the primary source of gene novelty in prokaryotes.
The core gene enrichment in translation (COG J) and energy production (COG C) is consistent with Charlesworth & Jensen (2022) and the broader "minimal genome" literature, which identifies these housekeeping functions as the most conserved and indispensable components of bacterial genomes.
Limitations
- COG annotations cover ~70% of genes; unassigned genes may skew distributions
- Composite COG categories (multi-letter) are counted once per gene, not split
- Analysis uses 32 species — a larger sample might reveal phylum-specific patterns
- eggNOG v6 annotations may differ from original COG assignments
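The effect of counting composite categories once per gene, rather than splitting them into single letters, can be illustrated with a small sketch. The annotation list is invented, and the use of "-" to mark a gene without a COG category is an assumption made for this example.

```python
from collections import Counter

# Hypothetical COG category strings, one per gene. "-" is assumed here
# to mark a gene with no COG category assignment.
annotations = ["L", "LV", "EGP", "J", "V", "-"]

def count_composite(categories):
    """Count each category string once per gene (the approach used here)."""
    return Counter(c for c in categories if c != "-")

def count_split(categories):
    """Alternative: split composite strings and count every letter."""
    counts = Counter()
    for c in categories:
        if c != "-":
            counts.update(c)  # iterating a string adds one count per letter
    return counts

as_is = count_composite(annotations)
split = count_split(annotations)

# Fraction of genes that carry any COG category at all.
coverage = sum(c != "-" for c in annotations) / len(annotations)

print(as_is)
print(split)
print(f"coverage: {coverage:.0%}")
```

Splitting raises the counts for L and V at the cost of losing the "LV" class itself, which is why the choice matters when composite categories are treated as meaningful.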
Future Directions
- Analyze additional species to test whether the patterns remain consistent
- Compare COG distributions across different taxonomic groups
- Investigate specific COG categories (e.g., V-Defense, L-Recombination) in detail
- Correlate with environmental metadata to see if novel gene functions vary by habitat
Data
Sources
`kbase_ke_pangenome` database (tables: `gene_cluster`, `gene_genecluster_junction`, `eggnog_mapper_annotations`)
Generated Data
| File | Description |
|---|---|
| data/cog_distributions.csv | COG category proportions by gene class for 32 species |
References
- Koonin EV, Wolf YI (2008). "Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world." Nucleic Acids Res 36:6688-6719. PMID: 18948295
- Treangen TJ, Rocha EP (2011). "Horizontal transfer, not duplication, drives the expansion of protein families in prokaryotes." PLoS Genet 7:e1001284. PMID: 21298028
- Parks DH et al. (2022). "GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy." Nucleic Acids Res 50:D199-D207. PMID: 34520557
Discoveries
Composite COG categories are biologically meaningful
January 2026
Data Collections
Visualizations
Cog Enrichment
Cog Enrichment By Phylum
Cog Heatmap
Cog Mean Enrichment
Genome Count Distribution
Multi Species Enrichment Heatmap
Pangenome Characteristics
Phylum Distribution 100 500
Notebooks
- cog_analysis.ipynb: Cog Analysis
- multi_species_cog_analysis.ipynb: Multi Species Cog Analysis
- multi_species_cog_analysis_optimized.ipynb: Multi Species Cog Analysis Optimized
- species_selection_exploration.ipynb: Species Selection Exploration
Data Files
| Filename | Size |
|---|---|
| cog_distribution_raw.json | 0.0 KB |
| query.json | 0.1 KB |