02 Terl Lineage Clustering
Jupyter notebook from the Prophage Gene Modules and Terminase-Defined Lineages Across Bacterial Phylogeny and Environmental Gradients project.
NB02: TerL Lineage Clustering¶
Project: Prophage Ecology Across Bacterial Phylogeny and Environmental Gradients
Goal: Cluster terminase large subunit (TerL) protein sequences into prophage lineages at ~70% amino acid identity using MMseqs2. Assign lineage IDs, build a rough phylogenetic overview.
Dependencies: NB01 outputs (data/terL_sequences.fasta)
Environment: JupyterHub or local (requires MMseqs2 installed)
Outputs:
data/terL_lineages.tsv— gene_cluster_id → lineage_id mappingdata/lineage_summary.tsv— per-lineage statisticsfigures/terL_lineage_tree.png— rough lineage overview
In [1]:
import os
import subprocess
import pandas as pd
import numpy as np
from collections import Counter
os.makedirs('../data', exist_ok=True)
os.makedirs('../figures', exist_ok=True)
FASTA_PATH = '../data/terL_sequences.fasta'
assert os.path.exists(FASTA_PATH), f'TerL FASTA not found at {FASTA_PATH}. Run NB01 first.'
# Count sequences
n_seqs = sum(1 for line in open(FASTA_PATH) if line.startswith('>'))
print(f'TerL sequences: {n_seqs:,}')
TerL sequences: 38,085
In [2]:
# Check MMseqs2 availability
MMSEQS_BIN = '/tmp/mmseqs/bin/mmseqs'
if not os.path.exists(MMSEQS_BIN):
# Fallback: check PATH
import shutil
path_mmseqs = shutil.which('mmseqs')
if path_mmseqs:
MMSEQS_BIN = path_mmseqs
else:
raise FileNotFoundError(
'MMseqs2 not found. Download static binary: '
'curl -sL https://mmseqs.com/latest/mmseqs-linux-avx2.tar.gz | tar xz -C /tmp/'
)
result = subprocess.run([MMSEQS_BIN, 'version'], capture_output=True, text=True)
print(f'MMseqs2 binary: {MMSEQS_BIN}')
print(f'MMseqs2 version: {result.stdout.strip()}')
MMseqs2 binary: /tmp/mmseqs/bin/mmseqs MMseqs2 version: 01683a607f83878e95436632d73e1d7d9ae30955
1. Cluster TerL Sequences at 70% AAI¶
In [3]:
# Run MMseqs2 easy-cluster at 70% identity
CLUSTER_PREFIX = '../data/terL_clusters_70'
TMP_DIR = '../data/tmp_mmseqs'
os.makedirs(TMP_DIR, exist_ok=True)
cmd = [
MMSEQS_BIN, 'easy-cluster',
FASTA_PATH,
CLUSTER_PREFIX,
TMP_DIR,
'--min-seq-id', '0.7',
'-c', '0.8', # 80% coverage
'--cov-mode', '0', # bidirectional coverage
'--threads', '4',
]
print(f'Running: {" ".join(cmd)}')
result = subprocess.run(cmd, capture_output=True, text=True)
print(f'Return code: {result.returncode}')
if result.returncode != 0:
print(f'STDERR: {result.stderr[:500]}')
else:
print('Clustering complete.')
Running: /tmp/mmseqs/bin/mmseqs easy-cluster ../data/terL_sequences.fasta ../data/terL_clusters_70 ../data/tmp_mmseqs --min-seq-id 0.7 -c 0.8 --cov-mode 0 --threads 4
Return code: 0 Clustering complete.
In [4]:
# Parse MMseqs2 cluster output (TSV with rep_id, member_id)
cluster_tsv = f'{CLUSTER_PREFIX}_cluster.tsv'
clusters = pd.read_csv(cluster_tsv, sep='\t', header=None, names=['rep_id', 'member_id'])
# Parse the gene_cluster_id and species from the FASTA header format: >gene_cluster_id|species_clade_id
clusters['member_cluster_id'] = clusters['member_id'].apply(lambda x: x.split('|')[0])
clusters['member_species'] = clusters['member_id'].apply(lambda x: x.split('|')[1] if '|' in x else '')
clusters['rep_cluster_id'] = clusters['rep_id'].apply(lambda x: x.split('|')[0])
# Assign lineage IDs (use representative cluster ID as lineage name)
clusters['lineage_id'] = 'L_' + clusters['rep_cluster_id']
n_lineages = clusters['lineage_id'].nunique()
print(f'Total TerL sequences clustered: {len(clusters):,}')
print(f'Lineages (clusters at 70% AAI): {n_lineages:,}')
print(f'Singletons: {(clusters.groupby("lineage_id").size() == 1).sum():,}')
# Lineage size distribution
lineage_sizes = clusters.groupby('lineage_id').size().rename('size')
print(f'\nLineage size distribution:')
print(lineage_sizes.describe())
Total TerL sequences clustered: 38,085 Lineages (clusters at 70% AAI): 10,991 Singletons: 6,921 Lineage size distribution: count 10991.000000 mean 3.465108 std 14.063215 min 1.000000 25% 1.000000 50% 1.000000 75% 2.000000 max 1094.000000 Name: size, dtype: float64
In [5]:
# Build lineage summary
lineage_summary = clusters.groupby('lineage_id').agg(
n_members=('member_cluster_id', 'count'),
n_species=('member_species', 'nunique'),
rep_cluster_id=('rep_cluster_id', 'first'),
).reset_index()
# Extract phylum-level diversity per lineage (parse species clade ID for genus at minimum)
lineage_summary = lineage_summary.sort_values('n_members', ascending=False)
print(f'Top 20 lineages by size:')
print(lineage_summary.head(20).to_string(index=False))
Top 20 lineages by size:
lineage_id n_members n_species rep_cluster_id
L_SFJM01000051.1_13 1094 869 SFJM01000051.1_13
L_NZ_JAEUXB010000001.1_577 273 193 NZ_JAEUXB010000001.1_577
L_CAJTVG010000054.1_11 250 210 CAJTVG010000054.1_11
L_NZ_CP038627.1_15 208 198 NZ_CP038627.1_15
L_NZ_JADPCR010000046.1_1 195 173 NZ_JADPCR010000046.1_1
L_NZ_JAEACP010000036.1_22 195 133 NZ_JAEACP010000036.1_22
L_NZ_JAHQXO010000007.1_1 185 140 NZ_JAHQXO010000007.1_1
L_NZ_MXMK01000013.1_27 179 122 NZ_MXMK01000013.1_27
L_NZ_UGVE01000001.1_301 168 137 NZ_UGVE01000001.1_301
L_NZ_CABGNB010000003.1_457 163 121 NZ_CABGNB010000003.1_457
L_CP025982.1_1977 157 131 CP025982.1_1977
L_NZ_CAJGBM010000009.1_16 155 151 NZ_CAJGBM010000009.1_16
L_NZ_JAHWJA010000001.1_3580 148 125 NZ_JAHWJA010000001.1_3580
L_JAEYAT010000216.1_4 139 135 JAEYAT010000216.1_4
L_NZ_JAEMGT010000123.1_10 127 106 NZ_JAEMGT010000123.1_10
L_NZ_NJAJ01000036.1_16 124 104 NZ_NJAJ01000036.1_16
L_NZ_BIGY01000001.1_646 114 98 NZ_BIGY01000001.1_646
L_NZ_WTEX01000003.1_65 108 105 NZ_WTEX01000003.1_65
L_NZ_JAFMZA010000060.1_10 107 95 NZ_JAFMZA010000060.1_10
L_NZ_JABS01000028.1_48 95 85 NZ_JABS01000028.1_48
2. Sensitivity Analysis: Multiple Clustering Thresholds¶
In [6]:
# Also cluster at 50%, 60%, 80% AAI for sensitivity analysis
thresholds = [0.5, 0.6, 0.8]
threshold_results = [{'threshold': 0.7, 'n_lineages': n_lineages}]
for thresh in thresholds:
prefix = f'../data/terL_clusters_{int(thresh*100)}'
cmd = [
MMSEQS_BIN, 'easy-cluster',
FASTA_PATH, prefix, TMP_DIR,
'--min-seq-id', str(thresh),
'-c', '0.8', '--cov-mode', '0', '--threads', '4',
]
os.makedirs(TMP_DIR, exist_ok=True)
result = subprocess.run(cmd, capture_output=True, text=True)
if result.returncode == 0:
cl = pd.read_csv(f'{prefix}_cluster.tsv', sep='\t', header=None)
n = cl[0].nunique()
threshold_results.append({'threshold': thresh, 'n_lineages': n})
print(f' {int(thresh*100)}% AAI: {n:,} lineages')
else:
print(f' {int(thresh*100)}% AAI: FAILED - {result.stderr[:200]}')
thresh_df = pd.DataFrame(threshold_results).sort_values('threshold')
print(f'\nLineage count by threshold:')
print(thresh_df.to_string(index=False))
50% AAI: 4,001 lineages
60% AAI: 7,052 lineages
80% AAI: 16,283 lineages
Lineage count by threshold:
threshold n_lineages
0.5 4001
0.6 7052
0.7 10991
0.8 16283
3. Save Outputs¶
In [7]:
# Save lineage assignments
lineage_out = clusters[['member_cluster_id', 'member_species', 'lineage_id', 'rep_cluster_id']].copy()
lineage_out.columns = ['gene_cluster_id', 'gtdb_species_clade_id', 'lineage_id', 'lineage_rep']
lineage_out.to_csv('../data/terL_lineages.tsv', sep='\t', index=False)
print(f'Saved data/terL_lineages.tsv: {len(lineage_out):,} rows')
# Save lineage summary
lineage_summary.to_csv('../data/lineage_summary.tsv', sep='\t', index=False)
print(f'Saved data/lineage_summary.tsv: {len(lineage_summary):,} rows')
# Save threshold sensitivity
thresh_df.to_csv('../data/terL_threshold_sensitivity.tsv', sep='\t', index=False)
Saved data/terL_lineages.tsv: 38,085 rows Saved data/lineage_summary.tsv: 10,991 rows
In [8]:
# Visualization: lineage size distribution + threshold sensitivity
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Panel 1: Lineage size distribution (log scale)
sizes = lineage_sizes.values
axes[0].hist(sizes, bins=np.logspace(0, np.log10(max(sizes)+1), 50), color='steelblue', edgecolor='black', linewidth=0.5)
axes[0].set_xscale('log')
axes[0].set_xlabel('Lineage Size (number of TerL members)')
axes[0].set_ylabel('Number of Lineages')
axes[0].set_title(f'TerL Lineage Size Distribution (70% AAI, n={n_lineages:,})')
# Panel 2: Threshold sensitivity
axes[1].plot(thresh_df['threshold']*100, thresh_df['n_lineages'], 'o-', color='darkorange', markersize=8)
axes[1].set_xlabel('Clustering Threshold (% AAI)')
axes[1].set_ylabel('Number of Lineages')
axes[1].set_title('Lineage Count vs Clustering Threshold')
for _, row in thresh_df.iterrows():
axes[1].annotate(f'{int(row["n_lineages"]):,}', (row['threshold']*100, row['n_lineages']),
textcoords='offset points', xytext=(0, 10), ha='center')
plt.tight_layout()
plt.savefig('../figures/terL_lineage_overview.png', dpi=150, bbox_inches='tight')
plt.show()
print('Saved figures/terL_lineage_overview.png')
Saved figures/terL_lineage_overview.png
In [9]:
# Clean up tmp directory
import shutil
if os.path.exists(TMP_DIR):
shutil.rmtree(TMP_DIR)
print('Cleaned up tmp directory')
# Summary
print('='*60)
print('NB02 SUMMARY')
print('='*60)
print(f'TerL sequences clustered: {n_seqs:,}')
print(f'Lineages at 70% AAI: {n_lineages:,}')
print(f'Largest lineage: {lineage_sizes.max():,} members')
print(f'Median lineage size: {lineage_sizes.median():.0f}')
print(f'Singleton lineages: {(lineage_sizes == 1).sum():,}')
Cleaned up tmp directory ============================================================ NB02 SUMMARY ============================================================ TerL sequences clustered: 38,085 Lineages at 70% AAI: 10,991 Largest lineage: 1,094 members Median lineage size: 1 Singleton lineages: 6,921