03 Module Cooccurrence
Jupyter notebook from the Prophage Gene Modules and Terminase-Defined Lineages Across Bacterial Phylogeny and Environmental Gradients project.
NB03: Module Co-occurrence Validation
Project: Prophage Ecology Across Bacterial Phylogeny and Environmental Gradients
Goal: Test H1c — do genes within each predefined prophage module co-occur across genomes significantly more than random gene groupings of equal size? Also test contig co-localization.
Dependencies: NB01 outputs (data/prophage_gene_clusters.tsv, data/species_module_summary.tsv)
Environment: Requires BERDL JupyterHub (Spark SQL) for genome-level extraction
Outputs:
- data/module_cooccurrence_stats.tsv — co-occurrence statistics per module per species
- data/contig_colocation.tsv — contig co-localization statistics
- figures/module_cooccurrence_heatmap.png — co-occurrence summary figure
Design Decisions (v2 revision)
The original approach (v1) attempted 50 species with full genome sets, joining the
billion-row gene and gene_genecluster_junction tables 50 times and computing
pairwise Jaccard in pure Python. This was intractable — K. pneumoniae alone (14,240
genomes, 9,067 clusters) stalled the notebook.
Changes in v2:
| Parameter | v1 | v2 | Rationale |
|---|---|---|---|
| Species count | 50 | 15 | Validation across phyla, not exhaustive survey |
| Genome cap | None (up to 14K) | 300 | Jaccard stabilizes well below 300; avoids maxResultSize errors |
| Query strategy | Per-species join on 1B-row tables | BROADCAST both genome and cluster ID temp views | Lets Spark push filters before the expensive join |
| Jaccard computation | Python nested loop O(n²) | scipy.spatial.distance.pdist (vectorized C) | Orders of magnitude faster |
| Null permutations | 500 | 200 | Ample for z-score estimation at ~half the cost |
import sys
import os
import pandas as pd
import numpy as np
from scipy import stats
from scipy.spatial.distance import pdist
spark = get_spark_session()  # provided by the BERDL JupyterHub environment
sys.path.insert(0, '../src')
from prophage_utils import MODULES
os.makedirs('../data', exist_ok=True)
os.makedirs('../figures', exist_ok=True)
# Load NB01 outputs
prophage_clusters = pd.read_csv('../data/prophage_gene_clusters.tsv', sep='\t')
species_summary = pd.read_csv('../data/species_module_summary.tsv', sep='\t')
# Fix boolean columns from TSV read (may be string "True"/"False")
for module_id in MODULES.keys():
    col = f'has_{module_id}'
    if col in prophage_clusters.columns:
        prophage_clusters[col] = prophage_clusters[col].astype(str).str.lower() == 'true'
print(f'Prophage clusters: {len(prophage_clusters):,}')
print(f'Species with prophage: {len(species_summary):,}')
# Constants
MAX_GENOMES = 300 # Cap genomes per species to stabilize Jaccard & avoid maxResultSize
N_SPECIES = 15 # Enough for cross-phylum validation
N_NULL = 200 # Null permutations for z-score
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
Prophage clusters: 4,005,537
Species with prophage: 27,702
1. Select Representative Species
Choose ~15 species spanning phylogenetic diversity with >=20 genomes each and multiple prophage modules present. Stratified sampling across phyla ensures we validate co-occurrence broadly, not just in well-studied clades.
# Get taxonomy for species selection
# NOTE: Must join on genome_id, NOT gtdb_taxonomy_id.
# The genome table truncates gtdb_taxonomy_id at genus level while the
# taxonomy table includes species — so they never match via =.
taxonomy = spark.sql("""
    SELECT DISTINCT g.gtdb_species_clade_id, t.phylum, t.class, t.`order`, t.family
    FROM kbase_ke_pangenome.genome g
    JOIN kbase_ke_pangenome.gtdb_taxonomy_r214v1 t ON g.genome_id = t.genome_id
""").toPandas()
print(f'Taxonomy rows: {len(taxonomy):,}')
# Merge with species summary
candidates = species_summary.merge(taxonomy, on='gtdb_species_clade_id', how='left')
# Filter: >=20 genomes, >=5 modules present, >=10 prophage clusters
candidates = candidates[
    (candidates['no_genomes'] >= 20) &
    (candidates['n_modules_present'] >= 5) &
    (candidates['n_prophage_clusters'] >= 10)
]
print(f'Candidate species (>=20 genomes, >=5 modules, >=10 clusters): {len(candidates):,}')
print(f'Phyla represented in candidates: {candidates["phylum"].nunique()}')
# Stratified sampling: 1-2 species per phylum, preferring species with
# moderate genome counts (20-500) to avoid very large species that dominate runtime
candidates['genome_bin'] = pd.cut(candidates['no_genomes'],
                                  bins=[0, 100, 500, 1000, 100000],
                                  labels=['small', 'medium', 'large', 'very_large'])
# Prefer medium-sized species (100-500 genomes) for balanced representation
selected_rows = []
phyla = candidates['phylum'].dropna().unique()
for phylum in sorted(phyla):
    phylum_cands = candidates[candidates['phylum'] == phylum]
    # Prefer medium, then small, then large
    for preferred_bin in ['medium', 'small', 'large', 'very_large']:
        bin_cands = phylum_cands[phylum_cands['genome_bin'] == preferred_bin]
        if len(bin_cands) > 0:
            best = bin_cands.nlargest(1, ['n_modules_present', 'n_prophage_clusters'])
            selected_rows.append(best.iloc[0])
            break
    if len(selected_rows) >= N_SPECIES:
        break
selected = pd.DataFrame(selected_rows)
# If we have fewer than N_SPECIES, fill from remaining candidates
if len(selected) < N_SPECIES:
    already_selected = set(selected['gtdb_species_clade_id']) if len(selected) > 0 else set()
    remaining = candidates[~candidates['gtdb_species_clade_id'].isin(already_selected)]
    extra = remaining.nlargest(N_SPECIES - len(selected), 'n_prophage_clusters')
    selected = pd.concat([selected, extra])
selected = selected.head(N_SPECIES)
print(f'\nSelected {len(selected)} species across {selected["phylum"].nunique()} phyla:')
for _, row in selected.iterrows():
    print(f' {row["phylum"]}: {row["gtdb_species_clade_id"]} '
          f'({row["no_genomes"]} genomes, {row["n_prophage_clusters"]} prophage clusters, '
          f'{row["n_modules_present"]} modules)')
Taxonomy rows: 27,690
Candidate species (>=20 genomes, >=5 modules, >=10 clusters): 1,182
Phyla represented in candidates: 23

Selected 15 species across 15 phyla:
 p__Acidobacteriota: s__Geothrix_sp903913215--GB_GCA_903913215.1 (29 genomes, 255 prophage clusters, 7 modules)
 p__Actinomycetota: s__Streptomyces_albidoflavus--RS_GCF_000719955.1 (117 genomes, 1189 prophage clusters, 7 modules)
 p__Bacillota: s__Bacillus_A_wiedmannii--RS_GCF_001583695.1 (235 genomes, 2298 prophage clusters, 7 modules)
 p__Bacillota_A: s__Blautia_A_wexlerae--GB_GCA_025148125.1 (175 genomes, 2559 prophage clusters, 7 modules)
 p__Bacillota_C: s__Megamonas_funiformis--RS_GCF_010669225.1 (44 genomes, 671 prophage clusters, 7 modules)
 p__Bacteroidota: s__Bacteroides_uniformis--GB_GCA_025147485.1 (401 genomes, 2677 prophage clusters, 7 modules)
 p__Campylobacterota: s__Helicobacter_pylori_BU--RS_GCF_900120335.1 (370 genomes, 195 prophage clusters, 7 modules)
 p__Chlamydiota: s__Chlamydophila_psittaci--RS_GCF_000204255.1 (64 genomes, 60 prophage clusters, 5 modules)
 p__Chloroflexota: s__UBA12225_sp002347925--GB_GCA_002347925.1 (21 genomes, 155 prophage clusters, 7 modules)
 p__Cyanobacteriota: s__Microcystis_panniformis--RS_GCF_014698335.1 (46 genomes, 323 prophage clusters, 7 modules)
 p__Deinococcota: s__Thermus_scotoductus--RS_GCF_000381045.1 (47 genomes, 244 prophage clusters, 7 modules)
 p__Desulfobacterota: s__Bilophila_wadsworthia--RS_GCF_000701705.1 (32 genomes, 680 prophage clusters, 7 modules)
 p__Desulfobacterota_F: s__JACRCG01_sp903852495--GB_GCA_903852495.1 (26 genomes, 362 prophage clusters, 6 modules)
 p__Fusobacteriota: s__Fusobacterium_C_necrophorum--RS_GCF_004006635.1 (49 genomes, 411 prophage clusters, 7 modules)
 p__Halobacteriota: s__Methanothrix_soehngenii--RS_GCF_000204415.1 (21 genomes, 120 prophage clusters, 7 modules)
2. Extract Genome × Cluster Presence Matrices
Key optimization (v2b): Instead of 15 separate Spark queries (each scanning the
billion-row tables), we do ONE query with all genome IDs and all cluster IDs
across all species. This means one scan of gene and gene_genecluster_junction
instead of 15, reducing wall-clock time from ~60+ min to ~15-20 min.
We BROADCAST both temp views (up to 4,500 genome IDs and ~30,000 cluster IDs), which Spark can handle efficiently.
def mean_pairwise_jaccard(binary_matrix):
    """Compute mean pairwise Jaccard similarity between columns of a binary matrix.

    Uses scipy.spatial.distance.pdist for vectorized computation (C-level speed).
    pdist returns Jaccard *distance* (1 - similarity), so we convert.
    """
    # pdist expects rows=observations, cols=features.
    # We want pairwise between clusters (columns), so transpose.
    mat = binary_matrix.values.T.astype(bool)
    if mat.shape[0] < 2:
        return 0.0
    # pdist 'jaccard' = fraction of positions where vectors differ (among positions
    # where at least one is nonzero). This is 1 - Jaccard similarity.
    distances = pdist(mat, metric='jaccard')
    # Handle NaN from all-zero vectors
    distances = np.nan_to_num(distances, nan=1.0)
    similarities = 1.0 - distances
    return float(np.mean(similarities))
print('Helper function defined (vectorized Jaccard via scipy.spatial.distance.pdist).')
Helper function defined (vectorized Jaccard via scipy.spatial.distance.pdist).
import time
# === Phase 1: Collect genome IDs and cluster IDs for ALL species ===
all_genome_ids_by_species = {}
all_cluster_ids_by_species = {}
all_genomes_flat = []
all_clusters_flat = []
for _, species_row in selected.iterrows():
    species_id = species_row['gtdb_species_clade_id']
    # Get prophage cluster IDs
    sp_clusters = prophage_clusters[
        prophage_clusters['gtdb_species_clade_id'] == species_id
    ]
    if len(sp_clusters) < 5:
        continue
    cluster_ids = sp_clusters['gene_cluster_id'].tolist()
    all_cluster_ids_by_species[species_id] = cluster_ids
    all_clusters_flat.extend(cluster_ids)
    # Get genome IDs, cap at MAX_GENOMES
    genome_rows = spark.sql(f"""
        SELECT genome_id FROM kbase_ke_pangenome.genome
        WHERE gtdb_species_clade_id = '{species_id}'
    """).collect()
    gids = [r.genome_id for r in genome_rows]
    if len(gids) > MAX_GENOMES:
        gids = [str(g) for g in np.random.choice(gids, MAX_GENOMES, replace=False)]
        print(f'{species_id}: sampled {MAX_GENOMES}/{len(genome_rows)} genomes')
    all_genome_ids_by_species[species_id] = gids
    all_genomes_flat.extend(gids)
# Deduplicate (in case of overlap — unlikely but safe)
all_genomes_flat = list(set(all_genomes_flat))
all_clusters_flat = list(set(all_clusters_flat))
print(f'\nTotal: {len(all_genome_ids_by_species)} species, '
      f'{len(all_genomes_flat)} unique genomes, '
      f'{len(all_clusters_flat)} unique prophage clusters')
# === Phase 2: Single Spark query for ALL species ===
t0 = time.time()
spark.createDataFrame(
    [(str(c),) for c in all_clusters_flat], ['gene_cluster_id']
).createOrReplaceTempView('all_target_clusters')
spark.createDataFrame(
    [(str(g),) for g in all_genomes_flat], ['genome_id']
).createOrReplaceTempView('all_target_genomes')
print('Registered temp views. Running single join query...')
all_presence_df = spark.sql("""
    SELECT /*+ BROADCAST(tc), BROADCAST(tg) */
        g.genome_id, j.gene_cluster_id, g.gene_id
    FROM kbase_ke_pangenome.gene g
    JOIN all_target_genomes tg ON g.genome_id = tg.genome_id
    JOIN kbase_ke_pangenome.gene_genecluster_junction j ON g.gene_id = j.gene_id
    JOIN all_target_clusters tc ON j.gene_cluster_id = tc.gene_cluster_id
""").toPandas()
elapsed_query = time.time() - t0
print(f'Single query completed in {elapsed_query:.1f}s')
print(f'Total rows: {len(all_presence_df):,}')
print(f'Unique genomes: {all_presence_df["genome_id"].nunique()}, '
      f'Unique clusters: {all_presence_df["gene_cluster_id"].nunique()}')
s__Bacteroides_uniformis--GB_GCA_025147485.1: sampled 300/401 genomes
s__Helicobacter_pylori_BU--RS_GCF_900120335.1: sampled 300/370 genomes
Total: 15 species, 1506 unique genomes, 12199 unique prophage clusters
Registered temp views. Running single join query...
Single query completed in 319.8s
Total rows: 182,093
Unique genomes: 1506, Unique clusters: 11908
3. Per-Species Co-occurrence and Co-localization Tests
Now that we have all presence data in a single pandas DataFrame, we split by species and run the statistical tests locally. This step is fast because everything is vectorized numpy/scipy on small matrices (at most 300 genomes × a few thousand clusters).
# Build genome→species lookup
genome_to_species = {}
for species_id, gids in all_genome_ids_by_species.items():
    for g in gids:
        genome_to_species[g] = species_id
# Map each row to its species
all_presence_df['species'] = all_presence_df['genome_id'].map(genome_to_species)
all_cooccurrence_results = []
all_colocation_results = []
for species_id in all_genome_ids_by_species:
    t0 = time.time()
    sp_presence = all_presence_df[all_presence_df['species'] == species_id]
    if len(sp_presence) == 0:
        print(f'{species_id}: no data')
        continue
    sp_clusters_df = prophage_clusters[
        prophage_clusters['gtdb_species_clade_id'] == species_id
    ]
    prophage_cluster_ids = sp_clusters_df['gene_cluster_id'].tolist()
    n_genomes = sp_presence['genome_id'].nunique()
    n_clusters_found = sp_presence['gene_cluster_id'].nunique()
    print(f'{species_id}: {len(sp_presence):,} pairs, '
          f'{n_genomes} genomes, {n_clusters_found} clusters')
    # Build binary presence/absence matrix
    presence_binary = sp_presence.groupby(
        ['genome_id', 'gene_cluster_id']
    ).size().unstack(fill_value=0)
    presence_binary = (presence_binary > 0).astype(int)

    # === MODULE CO-OCCURRENCE TEST ===
    all_available = [c for c in prophage_cluster_ids if c in presence_binary.columns]
    for module_id in MODULES.keys():
        module_clusters = [
            c for c in sp_clusters_df.loc[sp_clusters_df[f'has_{module_id}'], 'gene_cluster_id']
            if c in presence_binary.columns
        ]
        if len(module_clusters) < 2:
            continue
        observed_jaccard = mean_pairwise_jaccard(presence_binary[module_clusters])
        if len(all_available) < len(module_clusters):
            continue
        null_jaccards = []
        for _ in range(N_NULL):
            random_set = list(np.random.choice(all_available, size=len(module_clusters), replace=False))
            null_jaccards.append(mean_pairwise_jaccard(presence_binary[random_set]))
        null_mean = np.mean(null_jaccards)
        null_std = np.std(null_jaccards)
        z_score = (observed_jaccard - null_mean) / null_std if null_std > 0 else 0
        p_value = stats.norm.sf(z_score)  # one-sided upper tail (= 1 - cdf, numerically stable)
        all_cooccurrence_results.append({
            'gtdb_species_clade_id': species_id,
            'module': module_id,
            'n_clusters': len(module_clusters),
            'n_genomes': n_genomes,
            'observed_jaccard': observed_jaccard,
            'null_mean': null_mean,
            'null_std': null_std,
            'z_score': z_score,
            'p_value': p_value,
        })

    # === CONTIG CO-LOCALIZATION TEST ===
    sp_presence_local = sp_presence.copy()
    # Derive contig ID by stripping the trailing gene index
    # (gene IDs follow the <contig>_<index> pattern)
    sp_presence_local['contig'] = sp_presence_local['gene_id'].apply(
        lambda x: x.rsplit('_', 1)[0] if '_' in x else x
    )
    for module_id in MODULES.keys():
        module_clusters = set(
            sp_clusters_df.loc[sp_clusters_df[f'has_{module_id}'], 'gene_cluster_id']
        )
        if len(module_clusters) < 2:
            continue
        module_genes = sp_presence_local[
            sp_presence_local['gene_cluster_id'].isin(module_clusters)
        ]
        n_genomes_with_module = 0
        n_colocalized = 0
        for genome_id, ggrp in module_genes.groupby('genome_id'):
            if ggrp['gene_cluster_id'].nunique() >= 2:
                n_genomes_with_module += 1
                contig_counts = ggrp.groupby('contig')['gene_cluster_id'].nunique()
                if contig_counts.max() >= 2:
                    n_colocalized += 1
        coloc_frac = n_colocalized / n_genomes_with_module if n_genomes_with_module > 0 else 0
        all_colocation_results.append({
            'gtdb_species_clade_id': species_id,
            'module': module_id,
            'n_genomes_with_module': n_genomes_with_module,
            'n_colocalized': n_colocalized,
            'colocation_fraction': coloc_frac,
        })

    elapsed = time.time() - t0
    n_tests = len([r for r in all_cooccurrence_results if r['gtdb_species_clade_id'] == species_id])
    print(f' -> {n_tests} module tests in {elapsed:.1f}s')
print(f'\nDone. Co-occurrence results: {len(all_cooccurrence_results)}, '
      f'Co-localization results: {len(all_colocation_results)}')
s__Geothrix_sp903913215--GB_GCA_903913215.1: 2,289 pairs, 29 genomes, 255 clusters
 -> 7 module tests in 1.3s
s__Streptomyces_albidoflavus--RS_GCF_000719955.1: 23,640 pairs, 117 genomes, 1189 clusters
 -> 7 module tests in 13.9s
s__Bacillus_A_wiedmannii--RS_GCF_001583695.1: 48,250 pairs, 235 genomes, 2298 clusters
 -> 7 module tests in 54.0s
s__Blautia_A_wexlerae--GB_GCA_025148125.1: 27,026 pairs, 175 genomes, 2559 clusters
 -> 7 module tests in 67.5s
s__Megamonas_funiformis--RS_GCF_010669225.1: 4,059 pairs, 44 genomes, 671 clusters
 -> 7 module tests in 2.5s
s__Bacteroides_uniformis--GB_GCA_025147485.1: 38,299 pairs, 300 genomes, 2402 clusters
 -> 7 module tests in 109.9s
s__Helicobacter_pylori_BU--RS_GCF_900120335.1: 14,044 pairs, 300 genomes, 179 clusters
 -> 6 module tests in 5.8s
s__Chlamydophila_psittaci--RS_GCF_000204255.1: 2,358 pairs, 64 genomes, 60 clusters
 -> 5 module tests in 1.4s
s__UBA12225_sp002347925--GB_GCA_002347925.1: 1,211 pairs, 21 genomes, 155 clusters
 -> 7 module tests in 1.1s
s__Microcystis_panniformis--RS_GCF_014698335.1: 5,043 pairs, 46 genomes, 323 clusters
 -> 5 module tests in 1.6s
s__Thermus_scotoductus--RS_GCF_000381045.1: 3,691 pairs, 47 genomes, 244 clusters
 -> 6 module tests in 1.5s
s__Bilophila_wadsworthia--RS_GCF_000701705.1: 4,521 pairs, 32 genomes, 680 clusters
 -> 7 module tests in 2.1s
s__JACRCG01_sp903852495--GB_GCA_903852495.1: 2,787 pairs, 26 genomes, 362 clusters
 -> 5 module tests in 1.2s
s__Fusobacterium_C_necrophorum--RS_GCF_004006635.1: 4,169 pairs, 49 genomes, 411 clusters
 -> 7 module tests in 2.1s
s__Methanothrix_soehngenii--RS_GCF_000204415.1: 706 pairs, 21 genomes, 120 clusters
 -> 5 module tests in 0.8s

Done. Co-occurrence results: 95, Co-localization results: 95
4. Save and Summarize Results
cooc_df = pd.DataFrame(all_cooccurrence_results)
coloc_df = pd.DataFrame(all_colocation_results)
# Save
cooc_df.to_csv('../data/module_cooccurrence_stats.tsv', sep='\t', index=False)
coloc_df.to_csv('../data/contig_colocation.tsv', sep='\t', index=False)
print(f'Saved data/module_cooccurrence_stats.tsv: {len(cooc_df):,} rows')
print(f'Saved data/contig_colocation.tsv: {len(coloc_df):,} rows')
# Summarize
print(f'\n--- Co-occurrence Summary by Module ---')
for module_id in sorted(MODULES.keys()):
    mod_data = cooc_df[cooc_df['module'] == module_id]
    if len(mod_data) == 0:
        continue
    n_sig = (mod_data['p_value'] < 0.05).sum()
    mean_z = mod_data['z_score'].mean()
    mean_obs = mod_data['observed_jaccard'].mean()
    mean_null = mod_data['null_mean'].mean()
    print(f' {module_id}: {n_sig}/{len(mod_data)} significant (p<0.05), '
          f'mean z={mean_z:.2f}, obs Jaccard={mean_obs:.3f} vs null={mean_null:.3f}')
print(f'\n--- Contig Co-localization Summary by Module ---')
for module_id in sorted(MODULES.keys()):
    mod_data = coloc_df[coloc_df['module'] == module_id]
    if len(mod_data) == 0:
        continue
    mean_frac = mod_data['colocation_fraction'].mean()
    print(f' {module_id}: mean co-localization fraction = {mean_frac:.3f} '
          f'({len(mod_data)} species tested)')
Saved data/module_cooccurrence_stats.tsv: 95 rows
Saved data/contig_colocation.tsv: 95 rows

--- Co-occurrence Summary by Module ---
 A_packaging: 11/15 significant (p<0.05), mean z=3.40, obs Jaccard=0.186 vs null=0.104
 B_head_morphogenesis: 1/12 significant (p<0.05), mean z=0.20, obs Jaccard=0.069 vs null=0.106
 C_tail: 5/12 significant (p<0.05), mean z=1.43, obs Jaccard=0.150 vs null=0.082
 D_lysis: 13/15 significant (p<0.05), mean z=4.45, obs Jaccard=0.192 vs null=0.104
 E_integration: 1/15 significant (p<0.05), mean z=-1.93, obs Jaccard=0.059 vs null=0.104
 F_lysogenic_regulation: 10/15 significant (p<0.05), mean z=2.16, obs Jaccard=0.132 vs null=0.104
 G_anti_defense: 3/11 significant (p<0.05), mean z=1.50, obs Jaccard=0.052 vs null=0.077

--- Contig Co-localization Summary by Module ---
 A_packaging: mean co-localization fraction = 0.986 (15 species tested)
 B_head_morphogenesis: mean co-localization fraction = 0.654 (12 species tested)
 C_tail: mean co-localization fraction = 0.752 (12 species tested)
 D_lysis: mean co-localization fraction = 0.901 (15 species tested)
 E_integration: mean co-localization fraction = 0.652 (15 species tested)
 F_lysogenic_regulation: mean co-localization fraction = 1.000 (15 species tested)
 G_anti_defense: mean co-localization fraction = 0.281 (11 species tested)
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import seaborn as sns
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
module_order = sorted(MODULES.keys())
# Panel 1: Z-scores by module (box plot)
plot_data = cooc_df[cooc_df['module'].isin(module_order)].copy()
plot_data['module_name'] = plot_data['module'].map(
    {m: MODULES[m]['full_name'] for m in module_order}
)
sns.boxplot(data=plot_data, y='module_name', x='z_score', ax=axes[0],
            orient='h', color='steelblue')
axes[0].axvline(x=1.96, color='red', linestyle='--', alpha=0.5, label='z=1.96 (p<0.05)')
axes[0].set_xlabel('Z-score (observed vs null co-occurrence)')
axes[0].set_ylabel('')
axes[0].set_title('Module Co-occurrence Significance')
axes[0].legend()
# Panel 2: Co-localization fraction by module
coloc_plot = coloc_df[coloc_df['module'].isin(module_order)].copy()
coloc_plot['module_name'] = coloc_plot['module'].map(
    {m: MODULES[m]['full_name'] for m in module_order}
)
sns.boxplot(data=coloc_plot, y='module_name', x='colocation_fraction', ax=axes[1],
            orient='h', color='darkorange')
axes[1].set_xlabel('Contig Co-localization Fraction')
axes[1].set_ylabel('')
axes[1].set_title('Module Genes on Same Contig')
plt.tight_layout()
plt.savefig('../figures/module_cooccurrence_heatmap.png', dpi=150, bbox_inches='tight')
plt.show()
print('Saved figures/module_cooccurrence_heatmap.png')
Saved figures/module_cooccurrence_heatmap.png
print('='*60)
print('NB03 SUMMARY')
print('='*60)
print(f'Species tested: {cooc_df["gtdb_species_clade_id"].nunique()}')
print(f'Module × species tests: {len(cooc_df)}')
n_sig = (cooc_df['p_value'] < 0.05).sum()
print(f'Significant co-occurrence (p<0.05): {n_sig}/{len(cooc_df)} ({n_sig/len(cooc_df)*100:.1f}%)')
print(f'Mean co-localization fraction: {coloc_df["colocation_fraction"].mean():.3f}')
print(f'\nDesign: {N_SPECIES} species, max {MAX_GENOMES} genomes/species, '
      f'{N_NULL} null permutations, seed={RANDOM_SEED}')
============================================================
NB03 SUMMARY
============================================================
Species tested: 15
Module × species tests: 95
Significant co-occurrence (p<0.05): 44/95 (46.3%)
Mean co-localization fraction: 0.769

Design: 15 species, max 300 genomes/species, 200 null permutations, seed=42