03 — Pangenome Cluster ID Mapping¶
Jupyter notebook from the Acinetobacter baylyi ADP1 Data Explorer project.
The ADP1 database uses mmseqs2-style cluster IDs (NHSXFYEX_mmseqsCluster_NNNN), while
BERDL uses centroid gene IDs (NC_005966.1_1024); NB02 found a 0% direct match between the two schemes.
This notebook finds the bridge: the pan_genome_features table in the ADP1 database
contains genes from all 13 BERDL genomes with contig-based feature_id values that
match the gene IDs used in BERDL's gene_genecluster_junction table.
Mapping chain:
BERDL gene_cluster.gene_cluster_id (centroid gene ID)
→ gene_genecluster_junction.gene_id (all genes in cluster)
→ ADP1 pan_genome_features.feature_id (same gene IDs)
→ ADP1 pan_genome_features.cluster_id (NHSXFYEX format)
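The chain above can be sketched as two pandas merges over toy data (the gene ID CP012345.1_42 is made up for illustration; the cluster IDs come from the tables shown later in this notebook):

```python
import pandas as pd

# Toy illustration of the mapping chain. The BERDL junction lists every gene
# in a cluster, and the cluster ID itself is the centroid gene's ID; each
# ADP1 feature carries an mmseqs2-style cluster_id.
junction = pd.DataFrame({
    'gene_cluster_id': ['NC_005966.1_1', 'NC_005966.1_1'],
    'gene_id':         ['NC_005966.1_1', 'CP012345.1_42'],
})
adp1 = pd.DataFrame({
    'feature_id': ['NC_005966.1_1', 'CP012345.1_42'],
    'cluster_id': ['NHSXFYEX_mmseqsCluster_3739', 'NHSXFYEX_mmseqsCluster_3739'],
})
# BERDL cluster -> member genes -> ADP1 features -> ADP1 cluster
bridge = junction.merge(adp1, left_on='gene_id', right_on='feature_id')
pairs = bridge[['gene_cluster_id', 'cluster_id']].drop_duplicates()
print(pairs)
```

The real notebook below does exactly this join, just at full scale and in batches.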
import sqlite3
import pandas as pd
from pathlib import Path
from berdl_notebook_utils.setup_spark_session import get_spark_session
spark = get_spark_session()
print(f'Spark version: {spark.version}')
DB_PATH = Path('../user_data/berdl_tables.db')
conn = sqlite3.connect(DB_PATH)
CLADE = 's__Acinetobacter_baylyi--RS_GCF_000368685.1'
Spark version: 4.0.1
1. Fetch BERDL Cluster IDs¶
Get all gene clusters for A. baylyi from the BERDL pangenome.
berdl_clusters = spark.sql(f'''
SELECT gene_cluster_id, is_core, is_auxiliary, is_singleton
FROM kbase_ke_pangenome.gene_cluster
WHERE gtdb_species_clade_id = "{CLADE}"
''').toPandas()
print(f'BERDL clusters for A. baylyi: {len(berdl_clusters)}')
print(f' Core: {berdl_clusters.is_core.sum()}')
print(f' Auxiliary: {berdl_clusters.is_auxiliary.sum()}')
print(f' Singleton: {berdl_clusters.is_singleton.sum()}')
print(f'\nSample cluster IDs (these are centroid gene IDs):')
print(f' {berdl_clusters["gene_cluster_id"].head(5).tolist()}')
BERDL clusters for A. baylyi: 4891
  Core: 3207
  Auxiliary: 1684
  Singleton: 1521

Sample cluster IDs (these are centroid gene IDs):
  ['NC_005966.1_1', 'NC_005966.1_1024', 'NC_005966.1_1025', 'NC_005966.1_1026', 'NC_005966.1_1027']
2. Fetch Gene-Cluster Junction¶
For each BERDL cluster, get all member gene IDs via the junction table. This is a large table so we query in batches.
berdl_cids = berdl_clusters['gene_cluster_id'].tolist()
batch_size = 200
all_mappings = []
for i in range(0, len(berdl_cids), batch_size):
batch = berdl_cids[i:i+batch_size]
id_list = ','.join([repr(c) for c in batch])
batch_df = spark.sql(f'''
SELECT gene_id, gene_cluster_id
FROM kbase_ke_pangenome.gene_genecluster_junction
WHERE gene_cluster_id IN ({id_list})
''').toPandas()
all_mappings.append(batch_df)
junction_df = pd.concat(all_mappings, ignore_index=True)
print(f'Gene-cluster junction records: {len(junction_df):,}')
print(f'Unique clusters: {junction_df.gene_cluster_id.nunique()}')
print(f'Unique genes: {junction_df.gene_id.nunique()}')
Gene-cluster junction records: 43,754
Unique clusters: 4891
Unique genes: 43754
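Note that the record count equals the unique-gene count, implying each gene belongs to exactly one cluster. That invariant can be checked directly; a minimal sketch with toy data:

```python
import pandas as pd

# Sanity-check sketch: in a well-formed junction table each gene_id appears
# exactly once, so record count equals unique-gene count. Toy data here.
toy_junction = pd.DataFrame({
    'gene_id':         ['g1', 'g2', 'g3'],
    'gene_cluster_id': ['c1', 'c1', 'c2'],
})
dup_genes = int(toy_junction['gene_id'].duplicated().sum())
assert dup_genes == 0, f'{dup_genes} genes assigned to multiple clusters'
print('each gene belongs to exactly one cluster')
```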
3. Match to ADP1 Pan-Genome Features¶
The ADP1 database's pan_genome_features table contains genes from all 13 BERDL
genomes, with contig-based feature_id values identical to BERDL's gene_id format.
Each gene also has a cluster_id in the ADP1 mmseqs2 format.
# Load ADP1 pan_genome_features (feature_id -> cluster_id mapping)
adp1_pan = pd.read_sql('SELECT feature_id, cluster_id FROM pan_genome_features', conn)
print(f'ADP1 pan_genome_features: {len(adp1_pan):,} genes')
print(f'Unique ADP1 cluster IDs: {adp1_pan.cluster_id.nunique()}')
# Join: junction.gene_id = adp1_pan.feature_id
merged = junction_df.merge(adp1_pan, left_on='gene_id', right_on='feature_id', how='inner')
print(f'\nMatched gene-level records: {len(merged):,} / {len(junction_df):,} ({len(merged)/len(junction_df)*100:.1f}%)')
ADP1 pan_genome_features: 43,754 genes
Unique ADP1 cluster IDs: 4084

Matched gene-level records: 43,754 / 43,754 (100.0%)
4. Build Cluster-to-Cluster Mapping¶
For each BERDL cluster, determine which ADP1 cluster it maps to. Uses the mode (most common) ADP1 cluster_id among all matched genes.
cluster_map = merged.groupby('gene_cluster_id')['cluster_id'].agg(
lambda x: x.mode().iloc[0] if len(x.mode()) > 0 else x.iloc[0]
).reset_index()
cluster_map.columns = ['berdl_cluster_id', 'adp1_cluster_id']
# Add BERDL cluster metadata
cluster_map = cluster_map.merge(
berdl_clusters, left_on='berdl_cluster_id', right_on='gene_cluster_id', how='left'
).drop(columns=['gene_cluster_id'])
print(f'Cluster-to-cluster mapping:')
print(f' BERDL clusters: {len(berdl_clusters)}')
print(f' Mapped: {len(cluster_map)} ({len(cluster_map)/len(berdl_clusters)*100:.1f}%)')
print(f' Unique ADP1 clusters: {cluster_map.adp1_cluster_id.nunique()}')
print()
# Breakdown by core/accessory
print('Mapping by cluster type:')
for col in ['is_core', 'is_auxiliary', 'is_singleton']:
n = cluster_map[col].sum()
print(f' {col}: {n}')
print(f'\nSample mapping:')
display(cluster_map.head(10))
Cluster-to-cluster mapping:
  BERDL clusters: 4891
  Mapped: 4891 (100.0%)
  Unique ADP1 clusters: 4081

Mapping by cluster type:
  is_core: 3207
  is_auxiliary: 1684
  is_singleton: 1521

Sample mapping:
| | berdl_cluster_id | adp1_cluster_id | is_core | is_auxiliary | is_singleton |
|---|---|---|---|---|---|
| 0 | NC_005966.1_1 | NHSXFYEX_mmseqsCluster_3739 | True | False | False |
| 1 | NC_005966.1_1024 | NHSXFYEX_mmseqsCluster_1955 | True | False | False |
| 2 | NC_005966.1_1025 | NHSXFYEX_mmseqsCluster_3879 | True | False | False |
| 3 | NC_005966.1_1026 | NHSXFYEX_mmseqsCluster_2741 | True | False | False |
| 4 | NC_005966.1_1027 | NHSXFYEX_mmseqsCluster_3092 | True | False | False |
| 5 | NC_005966.1_1028 | NHSXFYEX_mmseqsCluster_0114 | True | False | False |
| 6 | NC_005966.1_1056 | NHSXFYEX_mmseqsCluster_3221 | True | False | False |
| 7 | NC_005966.1_1057 | NHSXFYEX_mmseqsCluster_1766 | True | False | False |
| 8 | NC_005966.1_1088 | NHSXFYEX_mmseqsCluster_2097 | True | False | False |
| 9 | NC_005966.1_1120 | NHSXFYEX_mmseqsCluster_2371 | True | False | False |
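Since 4891 BERDL clusters map to only 4081 unique ADP1 clusters, the mapping is many-to-one in places. A sketch for finding those collisions (toy stand-in for cluster_map; 'A' and 'B' are placeholder ADP1 cluster IDs):

```python
import pandas as pd

# Toy stand-in: two BERDL clusters landing on the same ADP1 cluster 'A'.
toy_map = pd.DataFrame({
    'berdl_cluster_id': ['NC_005966.1_1', 'NC_005966.1_2', 'NC_005966.1_3'],
    'adp1_cluster_id':  ['A', 'A', 'B'],
})
counts = toy_map['adp1_cluster_id'].value_counts()
collisions = counts[counts > 1]  # ADP1 clusters absorbing >1 BERDL cluster
print(f'{len(collisions)} ADP1 cluster(s) absorb multiple BERDL clusters')
```

Running the same value_counts over the real cluster_map would show which ADP1 clusters merge several BERDL clusters.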
5. Validate: Cross-Reference with ADP1 Genome Features¶
Verify that the mapped ADP1 cluster IDs appear in genome_features.pangenome_cluster_id
for the user-annotated ADP1 genome.
# Get ADP1 genome_features cluster IDs
adp1_gf_clusters = pd.read_sql(
'SELECT DISTINCT pangenome_cluster_id FROM genome_features WHERE pangenome_cluster_id IS NOT NULL',
conn
)
adp1_gf_set = set(adp1_gf_clusters['pangenome_cluster_id'].tolist())
mapped_set = set(cluster_map['adp1_cluster_id'].tolist())
overlap = adp1_gf_set & mapped_set
only_gf = adp1_gf_set - mapped_set
only_berdl = mapped_set - adp1_gf_set
print(f'ADP1 genome_features clusters: {len(adp1_gf_set)}')
print(f'BERDL-mapped clusters: {len(mapped_set)}')
print(f'Overlap: {len(overlap)} ({len(overlap)/len(adp1_gf_set)*100:.1f}% of genome_features)')
print(f'Only in genome_features: {len(only_gf)}')
print(f'Only in BERDL mapping: {len(only_berdl)}')
print()
print(f'The {len(only_berdl)} clusters only in the BERDL mapping correspond to accessory')
print(f'genes from the other 12 genomes that are absent from the ADP1 genome.')
ADP1 genome_features clusters: 2538
BERDL-mapped clusters: 4081
Overlap: 2538 (100.0% of genome_features)
Only in genome_features: 0
Only in BERDL mapping: 1543

The 1543 clusters only in the BERDL mapping correspond to accessory
genes from the other 12 genomes that are absent from the ADP1 genome.
# Save the mapping
out_path = '../data/cluster_id_mapping.csv'
cluster_map.to_csv(out_path, index=False)
print(f'Saved: {out_path} ({len(cluster_map)} rows)')
print()
print('Usage: join this table to link BERDL pangenome annotations')
print('(eggnog_mapper_annotations, functional data) to ADP1 genes')
print('via their shared cluster IDs.')
conn.close()
spark.stop()
print('\nConnections closed.')
Saved: ../data/cluster_id_mapping.csv (4891 rows)

Usage: join this table to link BERDL pangenome annotations
(eggnog_mapper_annotations, functional data) to ADP1 genes
via their shared cluster IDs.

Connections closed.
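As an example of the intended usage, a hedged sketch of the join. In practice cluster_map would be read from ../data/cluster_id_mapping.csv; the annotation table's gene_cluster_id and description columns are assumed names here, not a verified schema:

```python
import pandas as pd

# Sketch: link a BERDL-side annotation table to ADP1 cluster IDs via the
# saved mapping. Toy one-row frames stand in for the real tables.
toy_cluster_map = pd.DataFrame({
    'berdl_cluster_id': ['NC_005966.1_1'],
    'adp1_cluster_id':  ['NHSXFYEX_mmseqsCluster_3739'],
})
toy_annotations = pd.DataFrame({
    'gene_cluster_id': ['NC_005966.1_1'],
    'description':     ['hypothetical protein'],
})
linked = toy_annotations.merge(
    toy_cluster_map, left_on='gene_cluster_id', right_on='berdl_cluster_id'
)
print(linked[['adp1_cluster_id', 'description']])
```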