03B Enrichment Interproscan
Jupyter notebook from the AMR Co-Fitness Support Networks project.
NB03b: Support Network Enrichment with InterProScan Annotations
Re-run the H1 enrichment analysis using InterProScan GO terms (68% cluster coverage) in place of the older FB SEED annotations (variable, often sparse coverage). The broader coverage should give substantially more power to detect functional enrichment in AMR support networks.
Approach:
- Map FB genes → gene_cluster_id → InterProScan GO terms
- Use GO biological process terms for functional categorization
- Re-run Fisher enrichment and permutation tests
Inputs: data/amr_cofitness_partners.csv, data/fb_interproscan_go.csv, data/fb_interproscan_pfam.csv
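The mapping chain in the first bullet is just two dictionary lookups: `(orgId, locusId) → gene_cluster_id → GO term set`. A minimal sketch, using invented placeholder IDs rather than real FB or pangenome identifiers:

```python
# Toy sketch of the locusId -> gene_cluster_id -> GO-term chain (IDs are invented).
locus_to_cluster = {('OrgA', 'locus1'): 'cluster42'}
cluster_go = {'cluster42': {'GO:0071973', 'GO:0006935'}}

def go_terms_for(org_id, locus_id):
    """GO term set for one gene; empty set when either lookup misses."""
    cluster = locus_to_cluster.get((org_id, locus_id))
    return cluster_go.get(cluster, set())

print(go_terms_for('OrgA', 'locus1'))  # the two toy GO terms
print(go_terms_for('OrgA', 'locusX'))  # set() — unmapped locus
```

Genes that fail either lookup simply contribute an empty set, which is how the coverage percentages below are computed.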
In [1]:
import os
import warnings
from collections import Counter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
from statsmodels.stats.multitest import multipletests
warnings.filterwarnings('ignore', category=FutureWarning)
_here = os.path.abspath('')
if os.path.basename(_here) == 'notebooks':
    REPO = os.path.abspath(os.path.join(_here, '..', '..', '..'))
elif os.path.exists(os.path.join(_here, 'projects', 'amr_cofitness_networks')):
    REPO = _here
else:
    REPO = os.path.abspath(os.path.join(_here, '..', '..', '..'))
PROJECT = os.path.join(REPO, 'projects', 'amr_cofitness_networks')
DATA = os.path.join(PROJECT, 'data')
FIGURES = os.path.join(PROJECT, 'figures')
CVF_DATA = os.path.join(REPO, 'projects', 'conservation_vs_fitness', 'data')
AMR_DATA = os.path.join(REPO, 'projects', 'amr_fitness_cost', 'data')
# Load cofitness partners
partners = pd.read_csv(os.path.join(DATA, 'amr_cofitness_partners.csv'))
partners['amr_locusId'] = partners['amr_locusId'].astype(str)
partners['partner_locusId'] = partners['partner_locusId'].astype(str)
extra = partners[~partners['is_near_operon']].copy()
# Load FB-pangenome link for locusId → gene_cluster_id mapping
fb_link = pd.read_csv(os.path.join(CVF_DATA, 'fb_pangenome_link.tsv'), sep='\t')
fb_link['locusId'] = fb_link['locusId'].astype(str)
locus_to_cluster = dict(zip(
    zip(fb_link['orgId'], fb_link['locusId']),
    fb_link['gene_cluster_id']
))
# Load InterProScan GO
go_annot = pd.read_csv(os.path.join(DATA, 'fb_interproscan_go.csv'))
# Load Pfam
pfam_annot = pd.read_csv(os.path.join(DATA, 'fb_interproscan_pfam.csv'))
print(f'Extra-operon partners: {len(extra):,}')
print(f'GO annotations: {len(go_annot):,} ({go_annot["gene_cluster_id"].nunique():,} clusters)')
print(f'Pfam annotations: {len(pfam_annot):,} ({pfam_annot["gene_cluster_id"].nunique():,} clusters)')
Extra-operon partners: 179,375
GO annotations: 438,000 (111,253 clusters)
Pfam annotations: 228,672 (143,794 clusters)
In [2]:
# Data extraction from BERDL (requires Spark on JupyterHub)
# Guarded: only runs if cached files are missing
extraction_needed = not all(
    os.path.exists(os.path.join(DATA, f))
    for f in ['fb_interproscan_go.csv', 'fb_interproscan_pfam.csv', 'fb_bakta_kegg.csv']
)
if extraction_needed:
    print('Cached annotation files missing — extracting from BERDL via Spark...')
    from berdl_notebook_utils.setup_spark_session import get_spark_session
    spark = get_spark_session()
    fb_link_tmp = pd.read_csv(os.path.join(CVF_DATA, 'fb_pangenome_link.tsv'), sep='\t')
    cluster_ids = fb_link_tmp['gene_cluster_id'].unique().tolist()
    spark.createDataFrame([(c,) for c in cluster_ids], ['gene_cluster_id']).createOrReplaceTempView('fb_clusters')
    # InterProScan GO
    go = spark.sql('''
        SELECT g.gene_cluster_id, g.go_id, g.go_source, g.n_supporting_analyses
        FROM kbase_ke_pangenome.interproscan_go g
        JOIN fb_clusters fc ON g.gene_cluster_id = fc.gene_cluster_id
    ''').toPandas()
    go.to_csv(os.path.join(DATA, 'fb_interproscan_go.csv'), index=False)
    print(f'  GO: {len(go):,} rows')
    # InterProScan Pfam
    pfam = spark.sql('''
        SELECT d.gene_cluster_id, d.signature_acc, d.signature_desc, d.ipr_acc, d.ipr_desc
        FROM kbase_ke_pangenome.interproscan_domains d
        JOIN fb_clusters fc ON d.gene_cluster_id = fc.gene_cluster_id
        WHERE d.analysis = 'Pfam'
    ''').toPandas()
    pfam.to_csv(os.path.join(DATA, 'fb_interproscan_pfam.csv'), index=False)
    print(f'  Pfam: {len(pfam):,} rows')
    # Bakta KEGG
    kegg = spark.sql('''
        SELECT b.gene_cluster_id, b.kegg_orthology_id, b.cog_id, b.cog_category, b.ec
        FROM kbase_ke_pangenome.bakta_annotations b
        JOIN fb_clusters fc ON b.gene_cluster_id = fc.gene_cluster_id
        WHERE b.kegg_orthology_id IS NOT NULL AND b.kegg_orthology_id != ''
    ''').toPandas()
    kegg.to_csv(os.path.join(DATA, 'fb_bakta_kegg.csv'), index=False)
    print(f'  KEGG: {len(kegg):,} rows')
else:
    print('Using cached annotation files:')
    for f in ['fb_interproscan_go.csv', 'fb_interproscan_pfam.csv', 'fb_bakta_kegg.csv']:
        path = os.path.join(DATA, f)
        print(f'  {f}: {os.path.getsize(path)/1e6:.1f} MB')
Using cached annotation files:
  fb_interproscan_go.csv: 18.4 MB
  fb_interproscan_pfam.csv: 24.7 MB
  fb_bakta_kegg.csv: 1.7 MB
1. Map partner genes to GO terms via gene_cluster_id
In [3]:
# Map partner locusIds to gene_cluster_ids
extra['partner_cluster_id'] = extra.apply(
    lambda r: locus_to_cluster.get((r['orgId'], r['partner_locusId'])), axis=1)
print(f'Partners mapped to gene clusters: {extra["partner_cluster_id"].notna().sum():,} / {len(extra):,} '
      f'({extra["partner_cluster_id"].notna().mean()*100:.1f}%)')
# Build GO lookup: gene_cluster_id → set of GO terms
cluster_go = go_annot.groupby('gene_cluster_id')['go_id'].apply(set).to_dict()
# Map GO terms onto partners
extra['partner_go_terms'] = extra['partner_cluster_id'].map(
    lambda cid: cluster_go.get(cid, set()) if pd.notna(cid) else set())
has_go = extra['partner_go_terms'].apply(len) > 0
print(f'Partners with GO annotations: {has_go.sum():,} / {len(extra):,} ({has_go.mean()*100:.1f}%)')
# Build Pfam lookup similarly
cluster_pfam = pfam_annot.groupby('gene_cluster_id')['signature_acc'].apply(set).to_dict()
extra['partner_pfam'] = extra['partner_cluster_id'].map(
    lambda cid: cluster_pfam.get(cid, set()) if pd.notna(cid) else set())
has_pfam = extra['partner_pfam'].apply(len) > 0
print(f'Partners with Pfam annotations: {has_pfam.sum():,} / {len(extra):,} ({has_pfam.mean()*100:.1f}%)')
Partners mapped to gene clusters: 171,051 / 179,375 (95.4%)
Partners with GO annotations: 107,909 / 179,375 (60.2%)
Partners with Pfam annotations: 146,757 / 179,375 (81.8%)
2. GO term enrichment in AMR support networks
For each organism: which GO terms are overrepresented in AMR cofitness partners vs all genes in that organism?
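Each test below is a one-sided Fisher's exact test on a 2x2 table (support network vs. background, term vs. no term). For intuition, the `alternative='greater'` p-value that `scipy.stats.fisher_exact` returns is just the hypergeometric upper tail, which can be written out with the stdlib. The counts in this sketch are made up for illustration:

```python
from math import comb

def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact p-value for [[a, b], [c, d]],
    i.e. the hypergeometric tail P(X >= a) with fixed margins."""
    n1 = a + b          # support-network genes (row 1 total)
    K = a + c           # genes carrying the term (column 1 total)
    N = a + b + c + d   # all annotated genes
    denom = comb(N, n1)
    return sum(comb(K, k) * comb(N - K, n1 - k)
               for k in range(a, min(K, n1) + 1)) / denom

# Toy table: 8/50 support genes carry a term vs 30/2000 background genes.
p = fisher_greater(8, 42, 30, 1970)
print(f'p = {p:.3g}')  # strongly enriched: far more hits than the ~1 expected
```

With an expected count of about `50 * 38 / 2050 ≈ 0.9` term hits in the support set, observing 8 gives a very small tail probability, which is the signal the per-organism loop below looks for.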
In [4]:
# Build per-organism GO backgrounds
# For each organism, get GO terms for all genes (not just AMR partners)
all_locus_clusters = {}
for org in extra['orgId'].unique():
    org_link = fb_link[fb_link['orgId'] == org]
    all_locus_clusters[org] = dict(zip(org_link['locusId'].astype(str), org_link['gene_cluster_id']))
# Per-organism GO enrichment using Fisher's exact test
go_enrichment = []
for org in sorted(extra['orgId'].unique()):
    # All genes in this organism with GO annotations
    org_clusters = all_locus_clusters.get(org, {})
    bg_go_counts = Counter()
    n_bg_annotated = 0
    for cid in org_clusters.values():
        terms = cluster_go.get(cid, set())
        if terms:
            n_bg_annotated += 1
            for t in terms:
                bg_go_counts[t] += 1
    if n_bg_annotated < 100:
        continue
    # AMR support network genes with GO annotations
    org_partners = extra[extra['orgId'] == org]
    support_clusters = set(org_partners['partner_cluster_id'].dropna())
    sup_go_counts = Counter()
    n_sup_annotated = 0
    for cid in support_clusters:
        terms = cluster_go.get(cid, set())
        if terms:
            n_sup_annotated += 1
            for t in terms:
                sup_go_counts[t] += 1
    if n_sup_annotated < 10:
        continue
    # Test each GO term present in >=5 support genes
    for go_term in sup_go_counts:
        if sup_go_counts[go_term] < 5:
            continue
        # 2x2 table: support vs background, has_term vs not
        a = sup_go_counts[go_term]   # support with term
        b = n_sup_annotated - a      # support without term
        c = bg_go_counts[go_term]    # background with term
        d = n_bg_annotated - c       # background without term
        odds, p = stats.fisher_exact([[a, b], [c, d]], alternative='greater')
        go_enrichment.append({
            'orgId': org,
            'go_term': go_term,
            'support_count': a,
            'support_total': n_sup_annotated,
            'support_pct': a / n_sup_annotated * 100,
            'bg_count': c,
            'bg_total': n_bg_annotated,
            'bg_pct': c / n_bg_annotated * 100,
            'odds_ratio': odds,
            'p_value': p,
        })
go_enrich_df = pd.DataFrame(go_enrichment)
print(f'GO enrichment tests: {len(go_enrich_df):,}')
# BH-FDR within each organism
go_enrich_df['q_value'] = np.nan
for org in go_enrich_df['orgId'].unique():
    mask = go_enrich_df['orgId'] == org
    if mask.sum() > 0:
        _, q, _, _ = multipletests(go_enrich_df.loc[mask, 'p_value'], method='fdr_bh')
        go_enrich_df.loc[mask, 'q_value'] = q
go_enrich_df['significant'] = go_enrich_df['q_value'] < 0.05
print(f'Significant (q<0.05): {go_enrich_df["significant"].sum():,}')
print(f'Organisms tested: {go_enrich_df["orgId"].nunique()}')
GO enrichment tests: 3,193
Significant (q<0.05): 35
Organisms tested: 28
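The q-values above come from `multipletests(..., method='fdr_bh')`, applied separately within each organism. A hand-rolled sketch of the Benjamini-Hochberg step-up adjustment it performs (running minimum of p·m/rank, taken from the largest p downward):

```python
# Hand-rolled Benjamini-Hochberg, mirroring what
# statsmodels' multipletests(method='fdr_bh') computes for q-values.
def bh_qvalues(pvals):
    """Return BH-adjusted q-values in the original order of `pvals`."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    running_min = 1.0
    # Walk from the largest p down, taking the running minimum of p * m / rank
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        qvals[i] = running_min
    return qvals

print(bh_qvalues([0.001, 0.01, 0.02, 0.04, 0.5]))
# -> [0.005, 0.025, 0.0333..., 0.05, 0.5]
```

Because the correction is per organism, an organism contributing few tests pays a smaller multiplicity penalty than one contributing many.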
In [5]:
# Which GO terms are most consistently enriched across organisms?
sig = go_enrich_df[go_enrich_df['significant']]
n_orgs = go_enrich_df['orgId'].nunique()
go_consistency = sig.groupby('go_term').agg(
    n_orgs_sig=('orgId', 'nunique'),
    mean_odds=('odds_ratio', 'mean'),
    mean_support_pct=('support_pct', 'mean'),
    mean_bg_pct=('bg_pct', 'mean'),
).sort_values('n_orgs_sig', ascending=False)
print(f'GO terms significant in at least 1 organism: {len(go_consistency)}')
print('GO terms significant in >=3 organisms:')
multi_org = go_consistency[go_consistency['n_orgs_sig'] >= 3]
print(f'  {len(multi_org)} GO terms')
print()
# Show top 30 with descriptions where available
print(f'{"GO term":15s} {"Orgs":>5s} {"MeanOR":>7s} {"Sup%":>6s} {"Bg%":>6s}')
print('-' * 45)
for go_term, row in go_consistency.head(30).iterrows():
    print(f'{go_term:15s} {row["n_orgs_sig"]:5.0f} {row["mean_odds"]:7.2f} '
          f'{row["mean_support_pct"]:6.1f} {row["mean_bg_pct"]:6.1f}')
GO terms significant in at least 1 organism: 15
GO terms significant in >=3 organisms:
  6 GO terms

GO term          Orgs  MeanOR   Sup%    Bg%
---------------------------------------------
GO:0071973          5    4.69    3.0    0.7
GO:0044780          5    5.30    1.8    0.3
GO:0009288          4    4.87    2.2    0.5
GO:0071978          4    4.97    2.0    0.4
GO:0000105          3    5.31    1.5    0.3
GO:0000162          3    5.30    1.3    0.2
GO:0000160          2    1.72    6.2    3.7
GO:0007165          2    1.78    5.8    3.4
GO:0006935          1    3.12    3.4    1.1
GO:0000155          1    2.13    4.4    2.1
GO:0003774          1    5.98    1.7    0.3
GO:0016310          1    2.22    4.8    2.2
GO:0016020          1    1.42   24.1   18.3
GO:0015074          1    4.43    3.1    0.7
GO:0016772          1    2.10    4.7    2.3
In [6]:
# Enrichment by AMR mechanism: do different mechanisms enrich different GO terms?
mech_go_enrichment = []
for mech in ['efflux', 'enzymatic_inactivation', 'metal_resistance']:
    mech_extra = extra[extra['amr_mechanism'] == mech]
    for org in sorted(mech_extra['orgId'].unique()):
        org_clusters = all_locus_clusters.get(org, {})
        bg_go_counts = Counter()
        n_bg = 0
        for cid in org_clusters.values():
            terms = cluster_go.get(cid, set())
            if terms:
                n_bg += 1
                for t in terms:
                    bg_go_counts[t] += 1
        if n_bg < 100:
            continue
        # Mechanism-specific support network
        org_mech = mech_extra[mech_extra['orgId'] == org]
        sup_clusters = set(org_mech['partner_cluster_id'].dropna())
        sup_go = Counter()
        n_sup = 0
        for cid in sup_clusters:
            terms = cluster_go.get(cid, set())
            if terms:
                n_sup += 1
                for t in terms:
                    sup_go[t] += 1
        if n_sup < 10:
            continue
        for go_term in sup_go:
            if sup_go[go_term] < 3:
                continue
            a = sup_go[go_term]
            b = n_sup - a
            c = bg_go_counts.get(go_term, 0)
            d = n_bg - c
            odds, p = stats.fisher_exact([[a, b], [c, d]], alternative='greater')
            mech_go_enrichment.append({
                'mechanism': mech,
                'orgId': org,
                'go_term': go_term,
                'odds_ratio': odds,
                'p_value': p,
            })
mech_go_df = pd.DataFrame(mech_go_enrichment)
# FDR per mechanism-organism
mech_go_df['q_value'] = np.nan
for key, grp in mech_go_df.groupby(['mechanism', 'orgId']):
    if len(grp) > 0:
        _, q, _, _ = multipletests(grp['p_value'], method='fdr_bh')
        mech_go_df.loc[grp.index, 'q_value'] = q
mech_go_df['significant'] = mech_go_df['q_value'] < 0.05
print(f'Mechanism-specific GO enrichment tests: {len(mech_go_df):,}')
print(f'Significant: {mech_go_df["significant"].sum():,}')
print()
for mech in ['efflux', 'enzymatic_inactivation', 'metal_resistance']:
    sub = mech_go_df[mech_go_df['mechanism'] == mech]
    sig_sub = sub[sub['significant']]
    n_orgs = sub['orgId'].nunique()
    print(f'{mech:30s}: {len(sig_sub)} sig tests across {sig_sub["orgId"].nunique()}/{n_orgs} orgs')
    if len(sig_sub) > 0:
        top_go = sig_sub.groupby('go_term').size().sort_values(ascending=False).head(5)
        for go, cnt in top_go.items():
            print(f'  {go}: sig in {cnt} organisms')
Mechanism-specific GO enrichment tests: 9,244
Significant: 212
efflux : 72 sig tests across 12/28 orgs
GO:0000105: sig in 6 organisms
GO:0000162: sig in 5 organisms
GO:0071973: sig in 5 organisms
GO:0044780: sig in 4 organisms
GO:0009288: sig in 3 organisms
enzymatic_inactivation : 55 sig tests across 8/26 orgs
GO:0000162: sig in 4 organisms
GO:0071973: sig in 3 organisms
GO:0009288: sig in 3 organisms
GO:0044780: sig in 3 organisms
GO:0006935: sig in 2 organisms
metal_resistance : 85 sig tests across 9/22 orgs
GO:0071973: sig in 6 organisms
GO:0009288: sig in 5 organisms
GO:0071978: sig in 5 organisms
GO:0006935: sig in 4 organisms
GO:0044780: sig in 4 organisms
In [7]:
# Summary figure: top GO terms enriched in AMR support networks
GO_DESCRIPTIONS = {
    'GO:0071973': 'flagellum-dependent motility',
    'GO:0044780': 'flagellum assembly',
    'GO:0009288': 'bacterial-type flagellum',
    'GO:0071978': 'flagellum-dependent swarming',
    'GO:0000105': 'histidine biosynthesis',
    'GO:0000162': 'tryptophan biosynthesis',
    'GO:0000160': 'phosphorelay signal transduction',
    'GO:0007165': 'signal transduction',
    'GO:0006935': 'chemotaxis',
    'GO:0000155': 'phosphorelay sensor kinase',
    'GO:0003774': 'motor activity',
    'GO:0016310': 'phosphorylation',
    'GO:0016020': 'membrane',
    'GO:0015074': 'DNA integration',
    'GO:0016772': 'transferase (phosphorus)',
    'GO:0055085': 'transmembrane transport',
}
if len(go_consistency) > 0:
    fig, ax = plt.subplots(figsize=(10, 8))
    top_go = go_consistency.head(20)
    y_pos = range(len(top_go))
    # Use descriptions where available
    labels = [f"{idx} ({GO_DESCRIPTIONS.get(idx, '?')})" for idx in top_go.index]
    ax.barh(y_pos, top_go['n_orgs_sig'], color='tab:blue', alpha=0.7)
    ax.set_yticks(list(y_pos))
    ax.set_yticklabels(labels, fontsize=8)
    ax.set_xlabel('Number of Organisms (FDR < 0.05)')
    ax.set_title(f'Top GO Terms Enriched in AMR Support Networks\n'
                 f'(InterProScan annotations, {n_orgs} organisms tested)')
    ax.invert_yaxis()
    plt.tight_layout()
    plt.savefig(os.path.join(FIGURES, 'go_enrichment_interproscan.png'), dpi=150, bbox_inches='tight')
    plt.show()
    print('Saved to figures/go_enrichment_interproscan.png')
else:
    print('No significant GO enrichments to plot')
Saved to figures/go_enrichment_interproscan.png
In [8]:
# Sensitivity check: confirm enrichments at |r| > 0.4
extra_04 = partners[(~partners['is_near_operon']) & (partners['abs_cofitness'] >= 0.4)].copy()
extra_04['partner_cluster_id'] = extra_04.apply(
    lambda r: locus_to_cluster.get((r['orgId'], r['partner_locusId'])), axis=1)
print('=== Sensitivity: |r| > 0.4 (mean network ~110 genes) ===')
print(f'Partners: {len(extra_04):,}')
# Re-run Fisher for key GO terms at this threshold
for org in sorted(extra_04['orgId'].unique())[:5]:  # sample 5 orgs for speed
    org_clusters = all_locus_clusters.get(org, {})
    bg_go_counts = Counter()
    n_bg = 0
    for cid in org_clusters.values():
        terms = cluster_go.get(cid, set())
        if terms:
            n_bg += 1
            for t in terms:
                bg_go_counts[t] += 1
    if n_bg < 100:
        continue
    sup_clusters = set(extra_04[extra_04['orgId'] == org]['partner_cluster_id'].dropna())
    sup_go = Counter()
    n_sup = 0
    for cid in sup_clusters:
        terms = cluster_go.get(cid, set())
        if terms:
            n_sup += 1
            for t in terms:
                sup_go[t] += 1
    if n_sup < 5:
        continue
    # Test flagellar motility and histidine biosynthesis
    for go_id, go_name in [('GO:0071973', 'flagellum motility'), ('GO:0000105', 'histidine biosyn')]:
        a = sup_go.get(go_id, 0)
        if a < 2:
            continue
        b = n_sup - a
        c = bg_go_counts.get(go_id, 0)
        d = n_bg - c
        odds, p = stats.fisher_exact([[a, b], [c, d]], alternative='greater')
        print(f'  {org:15s} {go_name:20s}: sup={a}/{n_sup} bg={c}/{n_bg} OR={odds:.2f} p={p:.4g}')
# Full run across all organisms at |r|>0.4
go_04_results = []
for org in sorted(extra_04['orgId'].unique()):
    org_clusters = all_locus_clusters.get(org, {})
    bg_go_counts = Counter()
    n_bg = 0
    for cid in org_clusters.values():
        terms = cluster_go.get(cid, set())
        if terms:
            n_bg += 1
            for t in terms:
                bg_go_counts[t] += 1
    if n_bg < 100:
        continue
    sup_clusters = set(extra_04[extra_04['orgId'] == org]['partner_cluster_id'].dropna())
    sup_go = Counter()
    n_sup = 0
    for cid in sup_clusters:
        terms = cluster_go.get(cid, set())
        if terms:
            n_sup += 1
            for t in terms:
                sup_go[t] += 1
    if n_sup < 5:
        continue
    for go_term in sup_go:
        if sup_go[go_term] < 3:
            continue
        a = sup_go[go_term]
        b = n_sup - a
        c = bg_go_counts.get(go_term, 0)
        d = n_bg - c
        odds, p = stats.fisher_exact([[a, b], [c, d]], alternative='greater')
        go_04_results.append({'orgId': org, 'go_term': go_term, 'p_value': p, 'odds_ratio': odds})
go_04_df = pd.DataFrame(go_04_results)
go_04_df['q_value'] = np.nan
for org in go_04_df['orgId'].unique():
    mask = go_04_df['orgId'] == org
    if mask.sum() > 0:
        _, q, _, _ = multipletests(go_04_df.loc[mask, 'p_value'], method='fdr_bh')
        go_04_df.loc[mask, 'q_value'] = q
sig_04 = go_04_df[go_04_df['q_value'] < 0.05]
print(f'\n|r|>0.4 enrichment: {len(go_04_df)} tests, {len(sig_04)} significant (FDR<0.05)')
if len(sig_04) > 0:
    top_04 = sig_04.groupby('go_term').size().sort_values(ascending=False).head(10)
    print('Top GO terms at |r|>0.4:')
    for go_id, cnt in top_04.items():
        desc = GO_DESCRIPTIONS.get(go_id, go_id)
        print(f'  {go_id} ({desc}): {cnt} organisms')
else:
    print('No significant enrichments at |r|>0.4 (expected with smaller networks)')
=== Sensitivity: |r| > 0.4 (mean network ~110 genes) ===
Partners: 65,188
  ANA3            flagellum motility  : sup=12/1692 bg=20/2915 OR=1.03 p=0.5305
  ANA3            histidine biosyn    : sup=8/1692 bg=8/2915 OR=1.73 p=0.1978
  BFirm           flagellum motility  : sup=15/790 bg=21/4826 OR=4.43 p=4.552e-05
  BFirm           histidine biosyn    : sup=9/790 bg=12/4826 OR=4.62 p=0.001238
  Caulo           flagellum motility  : sup=12/1459 bg=13/2447 OR=1.55 p=0.1841
  Caulo           histidine biosyn    : sup=10/1459 bg=10/2447 OR=1.68 p=0.1729
  Cup4G11         flagellum motility  : sup=7/865 bg=19/4810 OR=2.06 p=0.08876
|r|>0.4 enrichment: 2763 tests, 110 significant (FDR<0.05)
Top GO terms at |r|>0.4:
  GO:0044780 (flagellum assembly): 8 organisms
  GO:0071973 (flagellum-dependent motility): 7 organisms
  GO:0009288 (bacterial-type flagellum): 6 organisms
  GO:0071978 (flagellum-dependent swarming): 6 organisms
  GO:0000162 (tryptophan biosynthesis): 6 organisms
  GO:0006935 (chemotaxis): 5 organisms
  GO:0000105 (histidine biosynthesis): 4 organisms
  GO:0006568 (GO:0006568): 4 organisms
  GO:0007165 (signal transduction): 4 organisms
  GO:0005198 (GO:0005198): 3 organisms
In [9]:
# Save results
go_enrich_df.to_csv(os.path.join(DATA, 'go_enrichment_interproscan.csv'), index=False)
mech_go_df.to_csv(os.path.join(DATA, 'mechanism_go_enrichment.csv'), index=False)
print('=== NB03b Summary ===')
print('Annotation source: InterProScan GO (68% cluster coverage vs ~15% old SEED)')
print(f'GO enrichment tests: {len(go_enrich_df):,}')
print(f'Significant (FDR<0.05): {go_enrich_df["significant"].sum():,}')
print(f'GO terms sig in >=3 organisms: {len(multi_org)}')
if len(go_consistency) > 0:
    print(f'Top enriched GO term: {go_consistency.index[0]} ({go_consistency.iloc[0]["n_orgs_sig"]:.0f} organisms)')
=== NB03b Summary ===
Annotation source: InterProScan GO (68% cluster coverage vs ~15% old SEED)
GO enrichment tests: 3,193
Significant (FDR<0.05): 35
GO terms sig in >=3 organisms: 6
Top enriched GO term: GO:0071973 (5 organisms)