01 Define Quadrants
Jupyter notebook from the The 5,526 Costly + Dispensable Genes project.
NB 01: Define Selection Signature Quadrants¶
Reconstruct the 2×2 selection signature matrix from cached data:
costly/neutral (max_fit > 1) × conserved/dispensable (core/non-core).
Verify counts match the established 28,017 / 5,526 / 86,761 / 21,886
from the core_gene_tradeoffs project.
Run locally — all data pre-cached.
Output: data/gene_quadrants.tsv
In [1]:
import pandas as pd
import numpy as np
from pathlib import Path
DATA_DIR = Path('../data')
DATA_DIR.mkdir(exist_ok=True)
1. Load Data¶
In [2]:
# Fitness stats: per-gene fitness summary
fitness = pd.read_csv(
'../../fitness_effects_conservation/data/fitness_stats.tsv',
sep='\t'
)
print(f"Fitness stats: {len(fitness):,} genes")
# Pangenome link
link = pd.read_csv(
'../../conservation_vs_fitness/data/fb_pangenome_link.tsv',
sep='\t'
)
# Fix boolean columns
for col in ['is_core', 'is_auxiliary', 'is_singleton']:
if link[col].dtype == object:
link[col] = link[col].map({'True': True, 'False': False}).astype(bool)
else:
link[col] = link[col].fillna(False).astype(bool)
print(f"Pangenome link: {len(link):,} genes")
# Essential genes (for gene_length and descriptions)
essentials = pd.read_csv(
'../../conservation_vs_fitness/data/essential_genes.tsv',
sep='\t', dtype={'gene': str}
)
print(f"Essential genes: {len(essentials):,} genes")
Fitness stats: 166,523 genes
Pangenome link: 177,863 genes Essential genes: 153,143 genes
2. Merge Fitness with Conservation¶
In [3]:
# Merge fitness stats with pangenome link
merged = fitness.merge(
link[['orgId', 'locusId', 'is_core', 'is_auxiliary', 'is_singleton']],
on=['orgId', 'locusId'],
how='inner'
)
print(f"Merged (fitness + pangenome): {len(merged):,} genes")
print(f" Core: {merged['is_core'].sum():,}")
print(f" Auxiliary: {merged['is_auxiliary'].sum():,}")
print(f" Singleton: {merged['is_singleton'].sum():,}")
Merged (fitness + pangenome): 142,190 genes Core: 114,778 Auxiliary: 27,412 Singleton: 6,312
3. Define Quadrants¶
In [4]:
# Define burden: max fitness > 1 means deleting the gene improves fitness
merged['is_burden'] = merged['max_fit'] > 1
# Assign quadrants
def assign_quadrant(row):
if row['is_burden'] and row['is_core']:
return 'costly_conserved'
elif row['is_burden'] and not row['is_core']:
return 'costly_dispensable'
elif not row['is_burden'] and row['is_core']:
return 'neutral_conserved'
else:
return 'neutral_dispensable'
merged['quadrant'] = merged.apply(assign_quadrant, axis=1)
# Verify counts
quad_counts = merged['quadrant'].value_counts()
print("=== SELECTION SIGNATURE MATRIX ===")
print(f" Costly + Conserved: {quad_counts.get('costly_conserved', 0):>8,} (expected: 28,017)")
print(f" Costly + Dispensable: {quad_counts.get('costly_dispensable', 0):>8,} (expected: 5,526)")
print(f" Neutral + Conserved: {quad_counts.get('neutral_conserved', 0):>8,} (expected: 86,761)")
print(f" Neutral + Dispensable: {quad_counts.get('neutral_dispensable', 0):>8,} (expected: 21,886)")
print(f" Total: {len(merged):>8,} (expected: 142,190)")
=== SELECTION SIGNATURE MATRIX === Costly + Conserved: 28,017 (expected: 28,017) Costly + Dispensable: 5,526 (expected: 5,526) Neutral + Conserved: 86,761 (expected: 86,761) Neutral + Dispensable: 21,886 (expected: 21,886) Total: 142,190 (expected: 142,190)
4. Add Gene Metadata¶
In [5]:
# Add gene_length and description from essential_genes table
gene_meta = essentials[['orgId', 'locusId', 'gene_length', 'desc']].copy()
merged = merged.merge(gene_meta, on=['orgId', 'locusId'], how='left')
print(f"With metadata: {len(merged):,} genes")
print(f" With gene_length: {merged['gene_length'].notna().sum():,}")
print(f" With description: {merged['desc'].notna().sum():,}")
# Summary stats by quadrant
summary = merged.groupby('quadrant').agg(
n_genes=('locusId', 'count'),
mean_max_fit=('max_fit', 'mean'),
mean_min_fit=('min_fit', 'mean'),
mean_gene_length=('gene_length', 'mean'),
pct_singleton=('is_singleton', 'mean'),
).round(2)
summary['pct_singleton'] = (summary['pct_singleton'] * 100).round(1)
print()
summary
With metadata: 142,190 genes With gene_length: 117,916 With description: 117,833
Out[5]:
| n_genes | mean_max_fit | mean_min_fit | mean_gene_length | pct_singleton | |
|---|---|---|---|---|---|
| quadrant | |||||
| costly_conserved | 28017 | 1.98 | -2.31 | 887.98 | 0.0 |
| costly_dispensable | 5526 | 1.77 | -1.97 | 791.68 | 24.0 |
| neutral_conserved | 86761 | 0.51 | -1.02 | 1072.47 | 0.0 |
| neutral_dispensable | 21886 | 0.49 | -0.73 | 1112.44 | 23.0 |
5. Save¶
In [6]:
out_file = DATA_DIR / 'gene_quadrants.tsv'
merged.to_csv(out_file, sep='\t', index=False)
print(f"Saved: {out_file} ({len(merged):,} genes)")
print(f"Columns: {merged.columns.tolist()}")
Saved: ../data/gene_quadrants.tsv (142,190 genes) Columns: ['orgId', 'locusId', 'n_experiments', 'min_fit', 'max_fit', 'mean_fit', 'n_sick', 'n_very_sick', 'n_beneficial', 'n_very_beneficial', 'n_significant', 'is_core', 'is_auxiliary', 'is_singleton', 'is_burden', 'quadrant', 'gene_length', 'desc']
In [7]:
print("=" * 60)
print("NB01 SUMMARY")
print("=" * 60)
print(f"Total genes: {len(merged):,}")
print(f"Organisms: {merged['orgId'].nunique()}")
for q in ['costly_conserved', 'costly_dispensable', 'neutral_conserved', 'neutral_dispensable']:
n = quad_counts.get(q, 0)
pct = n / len(merged) * 100
print(f" {q:25s}: {n:>8,} ({pct:5.1f}%)")
print("=" * 60)
============================================================ NB01 SUMMARY ============================================================ Total genes: 142,190 Organisms: 43 costly_conserved : 28,017 ( 19.7%) costly_dispensable : 5,526 ( 3.9%) neutral_conserved : 86,761 ( 61.0%) neutral_dispensable : 21,886 ( 15.4%) ============================================================