02 Spark Enrichment
Jupyter notebook from the Truly Dark Genes — What Remains Unknown After Modern Annotation? project.
NB02: Spark Data Enrichment for Truly Dark Genes¶
Goal: Query BERDL for Pfam domain hits, database cross-references, eggNOG annotations, and gene properties for the ~6,400 truly dark genes.
Environment: Requires BERDL JupyterHub with get_spark_session() available.
Input: data/truly_dark_genes.tsv (from NB01)
In [1]:
import pandas as pd
import os
from scipy import stats
try:
    spark = get_spark_session()
except NameError:
    from berdl_notebook_utils.setup_spark_session import get_spark_session
    spark = get_spark_session()
print(f'Spark version: {spark.version}')
OUT = '../data'
os.makedirs(OUT, exist_ok=True)
PAN = 'kbase_ke_pangenome'
FB = 'kescience_fitnessbrowser'
Spark version: 4.0.1
1. Load Data and Create Temp Views¶
In [2]:
td = pd.read_csv(f'{OUT}/truly_dark_genes.tsv', sep='\t')
al = pd.read_csv(f'{OUT}/annotation_lag_genes.tsv', sep='\t')
print(f'Truly dark: {len(td):,}, Annotation-lag: {len(al):,}')
td_clusters = td['gene_cluster_id'].dropna().unique().tolist()
print(f'Cluster IDs to query: {len(td_clusters):,}')
# Temp view for cluster-based JOINs
td_spark = spark.createDataFrame([(c,) for c in td_clusters], ['gene_cluster_id'])
td_spark.createOrReplaceTempView('truly_dark_ids')
print(f'Created truly_dark_ids temp view')
Truly dark: 6,427, Annotation-lag: 33,105
Cluster IDs to query: 5,870
Created truly_dark_ids temp view
2. Bakta Pfam Domain Hits¶
In [3]:
pfam_df = spark.sql(f"""
SELECT p.gene_cluster_id, p.pfam_id, p.pfam_name,
p.score, p.evalue, p.aa_cov, p.hmm_cov
FROM {PAN}.bakta_pfam_domains p
JOIN truly_dark_ids t ON p.gene_cluster_id = t.gene_cluster_id
""").toPandas()
print(f'Pfam hits: {len(pfam_df):,} for {pfam_df["gene_cluster_id"].nunique():,} clusters')
print(f'Coverage: {pfam_df["gene_cluster_id"].nunique() / len(td_clusters) * 100:.1f}%')
pfam_df['is_duf'] = pfam_df['pfam_name'].str.contains('DUF|UPF|Uncharacteri', case=False, na=False)
duf_only = set(pfam_df[pfam_df['is_duf']]['gene_cluster_id'].unique()) - set(pfam_df[~pfam_df['is_duf']]['gene_cluster_id'].unique())
has_nonduf = set(pfam_df[~pfam_df['is_duf']]['gene_cluster_id'].unique())
print(f'\nDUF-only: {len(duf_only)}, Non-DUF: {len(has_nonduf)}')
print(f'\nTop 20 Pfam domains:')
print(pfam_df['pfam_name'].value_counts().head(20).to_string())
pfam_df.to_csv(f'{OUT}/truly_dark_pfam.tsv', sep='\t', index=False)
print(f'\nSaved: truly_dark_pfam.tsv')
Pfam hits: 362 for 235 clusters
Coverage: 4.0%

DUF-only: 56, Non-DUF: 179

Top 20 Pfam domains:
pfam_name
TPR_19             9
TPR_17             7
TPR_14             6
TPR_2              6
TPR_8              6
TPR_16             5
TPR_12             4
TPR_6              4
Xre-like-HTH       3
Xre_MbcA_ParS_C    3
DUF7716            3
RHS_repeat         3
DUF7661            3
DUF7673            3
TPR_1              3
TPR_10             3
TPR_11             3
DUF6630            3
S_layer_C          3
DUF553             3

Saved: truly_dark_pfam.tsv
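The DUF-only/non-DUF split above is a set difference over cluster IDs: a cluster is "DUF-only" when every one of its Pfam hits is a domain of unknown function. A minimal sketch of the same logic on a toy table (cluster IDs and domain names are hypothetical):

```python
import pandas as pd

# Toy Pfam hit table: 'c1' has only a DUF hit, 'c2' mixes a DUF with a
# named domain, 'c3' has only a named domain (all IDs made up)
hits = pd.DataFrame({
    'gene_cluster_id': ['c1', 'c2', 'c2', 'c3'],
    'pfam_name': ['DUF1234', 'DUF5678', 'TPR_2', 'RHS_repeat'],
})
hits['is_duf'] = hits['pfam_name'].str.contains('DUF|UPF|Uncharacteri',
                                                case=False, na=False)

# DUF-only = clusters with a DUF/UPF hit minus clusters with any non-DUF hit
duf = set(hits.loc[hits['is_duf'], 'gene_cluster_id'])
nonduf = set(hits.loc[~hits['is_duf'], 'gene_cluster_id'])
duf_only = duf - nonduf

print(duf_only)         # {'c1'}
print(sorted(nonduf))   # ['c2', 'c3']
```

Note that 'c2' lands in the non-DUF set even though it also carries a DUF hit; one informative domain is enough to remove a cluster from the DUF-only bucket.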
3. Database Cross-References¶
In [4]:
xrefs_df = spark.sql(f"""
SELECT d.gene_cluster_id, d.db, d.accession
FROM {PAN}.bakta_db_xrefs d
JOIN truly_dark_ids t ON d.gene_cluster_id = t.gene_cluster_id
""").toPandas()
print(f'Cross-refs: {len(xrefs_df):,} for {xrefs_df["gene_cluster_id"].nunique():,} clusters')
print(f'Coverage: {xrefs_df["gene_cluster_id"].nunique() / len(td_clusters) * 100:.1f}%')
print(f'\nBy database:')
print(xrefs_df.groupby('db').agg(
    n_refs=('accession', 'count'),
    n_clusters=('gene_cluster_id', 'nunique')
).sort_values('n_clusters', ascending=False).to_string())
xrefs_df.to_csv(f'{OUT}/truly_dark_xrefs.tsv', sep='\t', index=False)
print(f'\nSaved: truly_dark_xrefs.tsv')
Cross-refs: 26,917 for 4,971 clusters
Coverage: 84.7%
By database:
n_refs n_clusters
db
SO 4811 4811
UniRef 13726 4811
UniParc 4212 4212
RefSeq 3799 3799
PFAM 362 235
KEGG 6 6
EC 1 1
Saved: truly_dark_xrefs.tsv
4. eggNOG Annotations¶
In [5]:
eggnog_df = spark.sql(f"""
SELECT e.query_name AS gene_cluster_id,
e.seed_ortholog, e.evalue, e.score,
e.eggnog_ogs, e.cog_category,
e.description, e.preferred_name,
e.gos, e.ec, e.kegg_ko, e.kegg_pathway, e.kegg_module,
e.pfams
FROM {PAN}.eggnog_mapper_annotations e
JOIN truly_dark_ids t ON e.query_name = t.gene_cluster_id
""").toPandas()
print(f'eggNOG annotations: {len(eggnog_df):,} for {eggnog_df["gene_cluster_id"].nunique():,} clusters')
print(f'Coverage: {eggnog_df["gene_cluster_id"].nunique() / len(td_clusters) * 100:.1f}%')
for col, label in [('cog_category', 'COG'), ('kegg_ko', 'KEGG'), ('description', 'Desc'),
                   ('pfams', 'Pfam'), ('gos', 'GO')]:
    has = eggnog_df[col].notna() & (eggnog_df[col] != '') & (eggnog_df[col] != '-')
    print(f'  {label}: {has.sum()} ({has.mean()*100:.1f}%)')
# COG category distribution
has_cog = eggnog_df['cog_category'].notna() & (eggnog_df['cog_category'] != '') & (eggnog_df['cog_category'] != '-')
if has_cog.sum() > 0:
    all_cogs = [c for cats in eggnog_df.loc[has_cog, 'cog_category'] for c in str(cats) if c.isalpha()]
    print(f'\nCOG categories:')
    print(pd.Series(all_cogs).value_counts().head(15).to_string())
eggnog_df.to_csv(f'{OUT}/truly_dark_eggnog.tsv', sep='\t', index=False)
print(f'\nSaved: truly_dark_eggnog.tsv')
eggNOG annotations: 2,551 for 2,551 clusters
Coverage: 43.5%
  COG: 433 (17.0%)
  KEGG: 118 (4.6%)
  Desc: 433 (17.0%)
  Pfam: 309 (12.1%)
  GO: 10 (0.4%)

COG categories:
S    240
L     34
M     22
K     18
T     16
C     16
J     16
O     15
U     14
N     13
P     13
E     11
G      9
F      7
H      6

Saved: truly_dark_eggnog.tsv
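The COG-flattening step above handles the fact that eggNOG reports multi-category assignments as letter strings (e.g. 'KT' means both K and T), so each string is split into single letters before counting. A self-contained sketch with made-up values:

```python
import pandas as pd

# Toy COG column: 'KT' is a multi-category assignment, '-' and None are
# eggNOG's "no annotation" placeholders
cog = pd.Series(['S', 'KT', 'S', '-', None])

# Keep only real assignments, then explode each string into single letters
valid = cog.notna() & (cog != '') & (cog != '-')
letters = [c for cats in cog[valid] for c in str(cats) if c.isalpha()]
counts = pd.Series(letters).value_counts()
print(counts.to_string())  # S: 2, then K and T with 1 each
```

This is why the category counts can sum to more than the number of annotated clusters: a gene with 'KT' contributes to both K and T.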
5. Gene Properties from Fitness Browser¶
In [6]:
# Build temp view using list of tuples (avoids pyarrow ChunkedArray issue)
all_dark = pd.concat([
    td[['orgId', 'locusId']].assign(gene_class='truly_dark'),
    al[['orgId', 'locusId']].assign(gene_class='annotation_lag')
])
rows = [(r.orgId, r.locusId, r.gene_class) for r in all_dark.itertuples()]
spark.createDataFrame(rows, ['orgId', 'locusId', 'gene_class']).createOrReplaceTempView('dark_gene_list')
gene_props = spark.sql(f"""
SELECT g.orgId, g.locusId, g.scaffoldId,
CAST(g.begin AS INT) AS gene_begin,
CAST(g.end AS INT) AS gene_end,
g.strand,
ABS(CAST(g.end AS INT) - CAST(g.begin AS INT)) AS gene_length_bp,
CAST(g.GC AS DOUBLE) AS GC,
d.gene_class
FROM {FB}.gene g
JOIN dark_gene_list d ON g.orgId = d.orgId AND g.locusId = d.locusId
""").toPandas()
print(f'Gene properties: {len(gene_props):,}')
print(f' Truly dark: {(gene_props["gene_class"]=="truly_dark").sum():,}')
print(f' Annotation-lag: {(gene_props["gene_class"]=="annotation_lag").sum():,}')
td_gc = gene_props.loc[gene_props['gene_class']=='truly_dark', 'GC'].dropna()
al_gc = gene_props.loc[gene_props['gene_class']=='annotation_lag', 'GC'].dropna()
stat, pval = stats.mannwhitneyu(td_gc, al_gc, alternative='two-sided')
d_gc = (td_gc.mean() - al_gc.mean()) / (((td_gc.std()**2 + al_gc.std()**2) / 2) ** 0.5)
print(f'\nGC: TD mean={td_gc.mean():.4f}, AL mean={al_gc.mean():.4f}, d={d_gc:.3f}, p={pval:.2e}')
td_bp = gene_props.loc[gene_props['gene_class']=='truly_dark', 'gene_length_bp'].dropna()
al_bp = gene_props.loc[gene_props['gene_class']=='annotation_lag', 'gene_length_bp'].dropna()
print(f'Length (bp): TD median={td_bp.median():.0f}, AL median={al_bp.median():.0f}')
gene_props.to_csv(f'{OUT}/gene_properties.tsv', sep='\t', index=False)
print(f'\nSaved: gene_properties.tsv')
Gene properties: 39,532
  Truly dark: 6,427
  Annotation-lag: 33,105

GC: TD mean=0.5424, AL mean=0.5843, d=-0.395, p=1.98e-115
Length (bp): TD median=356, AL median=575

Saved: gene_properties.tsv
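The effect size reported in this cell is Cohen's d with an unweighted pooled standard deviation, paired with a Mann-Whitney U test (rank-based, so no normality assumption). A minimal sketch of the same pair of statistics on synthetic GC values (illustrative numbers, not real data):

```python
import numpy as np
from scipy import stats

# Synthetic GC fractions for two gene classes (means chosen for illustration)
rng = np.random.default_rng(0)
td_gc = rng.normal(0.54, 0.05, 500)
al_gc = rng.normal(0.58, 0.05, 500)

# Rank-based two-sided test of whether the distributions differ
stat, pval = stats.mannwhitneyu(td_gc, al_gc, alternative='two-sided')

# Cohen's d with an unweighted pooled SD, matching the notebook's formula
d = (td_gc.mean() - al_gc.mean()) / np.sqrt(
    (td_gc.std(ddof=1)**2 + al_gc.std(ddof=1)**2) / 2)
print(f'd={d:.3f}, p={pval:.2e}')
```

Reporting both is deliberate: with ~40k genes, even a negligible difference yields a tiny p-value, so the effect size (d around -0.4 here, a small-to-medium effect) carries the interpretive weight.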
6. Orthologs for Truly Dark Genes¶
In [7]:
td_rows = [(r.orgId, r.locusId) for r in td[['orgId', 'locusId']].drop_duplicates().itertuples()]
spark.createDataFrame(td_rows, ['orgId', 'locusId']).createOrReplaceTempView('td_gene_ids')
# ortholog table has columns: orgId1, locusId1, orgId2, locusId2, ratio
orthologs = spark.sql(f"""
SELECT o.orgId1, o.locusId1, o.orgId2, o.locusId2, o.ratio
FROM {FB}.ortholog o
JOIN td_gene_ids t ON o.orgId1 = t.orgId AND o.locusId1 = t.locusId
""").toPandas()
print(f'Ortholog pairs: {len(orthologs):,}')
print(f'TD genes with orthologs: {orthologs[["orgId1","locusId1"]].drop_duplicates().shape[0]:,}')
print(f'Partner organisms: {orthologs["orgId2"].nunique()}')
orth_counts = orthologs.groupby(['orgId1', 'locusId1'])['orgId2'].nunique()
print(f'\nOrtholog breadth: mean={orth_counts.mean():.1f}, median={orth_counts.median():.1f}')
print(f' 1 org: {(orth_counts == 1).sum()}, 2-5: {((orth_counts >= 2) & (orth_counts <= 5)).sum()}, 6+: {(orth_counts >= 6).sum()}')
orthologs.to_csv(f'{OUT}/truly_dark_orthologs.tsv', sep='\t', index=False)
print(f'Saved: truly_dark_orthologs.tsv')
Ortholog pairs: 3,449
TD genes with orthologs: 1,885
Partner organisms: 48

Ortholog breadth: mean=1.8, median=1.0
  1 org: 1275, 2-5: 548, 6+: 62
Saved: truly_dark_orthologs.tsv
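The breadth metric above counts distinct partner organisms per truly-dark gene and then bins the counts. A toy sketch of the same groupby-and-bin logic (organism and locus IDs are hypothetical):

```python
import pandas as pd

# Toy ortholog pairs: gene g1 has an ortholog in 1 organism, g2 in 3
orth = pd.DataFrame({
    'orgId1':   ['A', 'A', 'A', 'A'],
    'locusId1': ['g1', 'g2', 'g2', 'g2'],
    'orgId2':   ['B', 'B', 'C', 'D'],
})

# Breadth = number of distinct partner organisms per (orgId1, locusId1) gene
breadth = orth.groupby(['orgId1', 'locusId1'])['orgId2'].nunique()
one = (breadth == 1).sum()
few = ((breadth >= 2) & (breadth <= 5)).sum()
many = (breadth >= 6).sum()
print(f'1 org: {one}, 2-5: {few}, 6+: {many}')  # 1 org: 1, 2-5: 1, 6+: 0
```

Using `nunique` on `orgId2` rather than counting rows matters when a gene has multiple ortholog pairs in the same partner organism; those collapse to one organism in the breadth count.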
7. Summary¶
In [8]:
print('=== NB02 Complete ===')
print(f'\nData enrichment for {len(td_clusters):,} truly dark gene clusters:')
print(f' Pfam: {len(pfam_df):,} hits, {pfam_df["gene_cluster_id"].nunique():,} clusters')
print(f' Xrefs: {len(xrefs_df):,} refs, {xrefs_df["gene_cluster_id"].nunique():,} clusters')
print(f' eggNOG: {len(eggnog_df):,} annots, {eggnog_df["gene_cluster_id"].nunique():,} clusters')
print(f' Gene props: {len(gene_props):,} (TD + AL)')
print(f' Orthologs: {len(orthologs):,} pairs')
for f in ['truly_dark_pfam.tsv', 'truly_dark_xrefs.tsv', 'truly_dark_eggnog.tsv',
          'gene_properties.tsv', 'truly_dark_orthologs.tsv']:
    path = f'{OUT}/{f}'
    if os.path.exists(path):
        print(f'  {f}: {os.path.getsize(path)/1024:.0f} KB')
=== NB02 Complete ===

Data enrichment for 5,870 truly dark gene clusters:
  Pfam: 362 hits, 235 clusters
  Xrefs: 26,917 refs, 4,971 clusters
  eggNOG: 2,551 annots, 2,551 clusters
  Gene props: 39,532 (TD + AL)
  Orthologs: 3,449 pairs
  truly_dark_pfam.tsv: 41 KB
  truly_dark_xrefs.tsv: 1098 KB
  truly_dark_eggnog.tsv: 452 KB
  gene_properties.tsv: 2875 KB
  truly_dark_orthologs.tsv: 225 KB