Paramvir S. Dehal ORCID

Research Question

How do COG functional category distributions differ across core, auxiliary, and novel genes in bacterial pangenomes?

Research Plan

Hypothesis

Novel genes may have different functional profiles compared to core genes:
- Core genes: Expected to be enriched in essential functions (translation, metabolism, cell processes)
- Auxiliary genes: Expected to show intermediate patterns
- Novel genes: May be enriched in mobile elements, defense mechanisms, or poorly characterized functions

Approach

  1. Query the kbase_ke_pangenome database for gene cluster information
  2. Extract COG functional category annotations from eggnog_mapper_annotations table
  3. Classify genes based on their gene_cluster attributes:
  4. Core: is_core = 1 (present in most/all genomes)
  5. Auxiliary: is_auxiliary = 1 AND is_singleton = 0 (present in some genomes)
  6. Singleton/Novel: is_singleton = 1 (present in only one genome)
  7. Compare COG category distributions across these three classes
  8. Calculate enrichment/depletion of each COG category in novel vs core genes
  9. Visualize results with heatmaps and bar plots

Revision History

  • v1 (2026-02): Migrated from README.md

Overview

This project analyzes COG (Clusters of Orthologous Groups) functional category distributions across core, auxiliary, and singleton gene classes in bacterial pangenomes. Using annotations from the eggnog_mapper_annotations table and gene classifications from the gene_cluster table in BERDL, it reveals a remarkably consistent "two-speed genome" pattern that holds universally across bacterial phyla.

Key Findings

Universal Functional Partitioning in Bacterial Pangenomes

Analysis of 32 species across 9 phyla (357,623 genes) reveals a remarkably consistent "two-speed genome":

Novel/singleton genes consistently enriched in:
- L (Mobile elements): +10.88% enrichment, 100% consistency across species -- STRONGEST SIGNAL
- V (Defense mechanisms): +2.83% enrichment, 100% consistency
- S (Unknown function): +1.64% enrichment, 69% consistency

Core genes consistently enriched in:
- J (Translation): -4.65% enrichment, 97% consistency -- STRONGEST DEPLETION
- F (Nucleotide metabolism): -2.09% enrichment, 100% consistency
- H (Coenzyme metabolism): -2.06% enrichment, 97% consistency
- E (Amino acid metabolism): -1.81% enrichment, 81% consistency
- C (Energy production): -1.75% enrichment, 88% consistency

(Notebook: cog_analysis.ipynb)

Composite COG Categories Are Biologically Meaningful

Multi-function genes with composite COG assignments (e.g., "LV" = mobile+defense, "EGP" = amino acid+carb+inorganic ion) are not annotation artifacts:

  • LV (mobile+defense): +0.34% enrichment, 76% consistency
  • Suggests functional modules like "mobile defense islands"
  • Should not be filtered out as noise -- they represent genuine multi-functional genes

(Notebook: cog_analysis.ipynb)

Interpretation

  • Core genes = ancient, conserved "metabolic engine" (translation, energy, biosynthesis)
  • Novel genes = recent acquisitions for ecological adaptation (mobile elements, defense, niche-specific)
  • Horizontal gene transfer (HGT) is the primary innovation mechanism, not vertical inheritance
  • The massive L enrichment (+10.88%) suggests most genomic novelty comes from mobile elements
  • Patterns hold universally across bacterial phyla, suggesting deep evolutionary constraint

All 8 predictions from initial N. gonorrhoeae analysis were confirmed across 32 species. This represents a fundamental organizing principle of bacterial pangenome structure.

Literature Context

The finding that mobile elements (COG L) dominate novel genes connects to Koonin & Wolf (2008) "Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world" and Treangen & Rocha (2011) "Horizontal transfer, not duplication, drives the expansion of protein families in prokaryotes" — both showed HGT as the primary source of gene novelty in prokaryotes.

The core gene enrichment in translation (COG J) and energy production (COG C) is consistent with Charlesworth & Jensen (2022) and the broader "minimal genome" literature, which identifies these housekeeping functions as the most conserved and indispensable components of bacterial genomes.

Limitations

  • COG annotations cover ~70% of genes; unassigned genes may skew distributions
  • Composite COG categories (multi-letter) are counted once per gene, not split
  • Analysis uses 32 species — a larger sample might reveal phylum-specific patterns
  • eggNOG v6 annotations may differ from original COG assignments

Future Directions

  • Analyze multiple species to identify consistent patterns
  • Compare COG distributions across different taxonomic groups
  • Investigate specific COG categories (e.g., V-Defense, L-Recombination) in detail
  • Correlate with environmental metadata to see if novel gene functions vary by habitat

Data

Sources

  • kbase_ke_pangenome database (tables: gene_cluster, gene_genecluster_junction, eggnog_mapper_annotations)

Generated Data

File Description
data/cog_distributions.csv COG category proportions by gene class for 32 species

References

  • Koonin EV, Wolf YI (2008). "Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world." Nucleic Acids Res 36:6688-6719. PMID: 18948295
  • Treangen TJ, Rocha EP (2011). "Horizontal transfer, not duplication, drives the expansion of protein families in prokaryotes." PLoS Genet 7:e1001284. PMID: 21298028
  • Parks DH et al. (2022). "GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy." Nucleic Acids Res 50:D199-D207. PMID: 34520557

Discoveries

Analysis of 32 species across 9 phyla (357,623 genes) reveals a remarkably consistent "two-speed genome":

Novel/singleton genes consistently enriched in:
- L (Mobile elements): +10.88% enrichment, 100% consistency across species - STRONGEST SIGNAL
- V (Defense mechanisms): +2.83% enrichment, 10

Read more →

Multi-function genes with composite COG assignments (e.g., "LV" = mobile+defense, "EGP" = amino acid+carb+inorganic ion) are not annotation artifacts:

  • LV (mobile+defense): +0.34% enrichment, 76% consistency
  • Suggests functional modules like "mobile defense islands"
  • Should not be filtered out a
Read more →

Data Collections

Visualizations

Cog Enrichment

Cog Enrichment

Cog Enrichment By Phylum

Cog Enrichment By Phylum

Cog Heatmap

Cog Heatmap

Cog Mean Enrichment

Cog Mean Enrichment

Genome Count Distribution

Genome Count Distribution

Multi Species Enrichment Heatmap

Multi Species Enrichment Heatmap

Pangenome Characteristics

Pangenome Characteristics

Phylum Distribution 100 500

Phylum Distribution 100 500

Notebooks

Data Files

Filename Size
cog_distribution_raw.json 0.0 KB
query.json 0.1 KB