04 Black Queen Test
Jupyter notebook from the Metabolic Capability vs Metabolic Dependency project.
NB04: Black Queen Hypothesis Test
Purpose: Test whether pathways classified as 'latent capabilities' show lower genomic conservation than 'active dependencies', consistent with ongoing streamlining under the Black Queen Hypothesis.
Black Queen Hypothesis (Morris et al. 2012): Costly functions tend to be lost from genomes when they can be sourced from other community members. Latent capabilities (complete but fitness-neutral pathways) are candidates for such gene loss.
Predictions (H2):
- Latent capability pathways show LOWER cross-genome conservation than active dependencies
- Organisms with more latent capabilities have MORE OPEN pangenomes
- Pangenome openness correlates with the fraction of complete-but-neutral pathways
Conservation proxy: % of species genomes with a complete pathway (from NB01 GapMind data). This avoids needing gene-cluster-level data: if a pathway is actively maintained, it should be consistently complete across related genomes.
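The proxy reduces to a groupby-mean over the NB01 table. A minimal sketch with toy, hypothetical values (not project data):

```python
import pandas as pd

# Toy GapMind-style table (hypothetical): one row per genome x pathway
toy = pd.DataFrame({
    'species':     ['A', 'A', 'A', 'B', 'B'],
    'pathway':     ['his', 'his', 'his', 'his', 'his'],
    'is_complete': [1, 1, 0, 1, 1],
})
# Conservation rate = fraction of a species' genomes with the pathway complete
rate = toy.groupby(['species', 'pathway'])['is_complete'].mean()
# species A: 2/3 genomes complete; species B: 2/2 genomes complete
```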
Inputs:
- data/pathway_classification.csv (from NB03) — local
- data/gapmind_genome_pathways.csv (from NB01) — local
- data/organism_mapping.tsv (from NB02) — local
- Spark: kbase_ke_pangenome.pangenome — pangenome openness metrics per species
Outputs:
- data/pathway_conservation.csv — Conservation rates per pathway class
- data/pangenome_openness.csv — Clade-level openness metrics
- figures/nb04_conservation_boxplot.png
- figures/nb04_openness_scatter.png
- figures/nb04_conservation_cdf.png
Runtime: ~5 min (one Spark query + local analysis)
import pandas as pd
import numpy as np
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import mannwhitneyu, spearmanr, kruskal
from pathlib import Path
import sys
import warnings
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['font.size'] = 11
PROJECT_ROOT = Path('/home/cjneely/repos/BERIL-research-observatory/projects/metabolic_capability_dependency')
DATA_DIR = PROJECT_ROOT / 'data'
FIG_DIR = PROJECT_ROOT / 'figures'
FIG_DIR.mkdir(exist_ok=True)
print(f'Data: {DATA_DIR}')
Data: /home/cjneely/repos/BERIL-research-observatory/projects/metabolic_capability_dependency/data
1. Load Classification and GapMind Data
# Pathway classifications from NB03
classif_path = DATA_DIR / 'pathway_classification.csv'
if not classif_path.exists():
raise FileNotFoundError(f'{classif_path} not found. Run NB03 first.')
classif = pd.read_csv(classif_path)
print(f'Classification records: {len(classif):,}')
print(f'Organisms: {classif["orgId"].nunique()}')
print(f'Class breakdown:\n{classif["dependency_class"].value_counts().to_string()}')
# Organism mapping to get GapMind clade_name
org_map_path = DATA_DIR / 'organism_mapping.tsv'
if org_map_path.exists():
org_map = pd.read_csv(org_map_path, sep='\t')
print(f'\nOrganism mapping: {len(org_map)} organisms')
if 'clade_name' not in classif.columns and 'clade_name' in org_map.columns:
org_id_col = next(
(c for c in org_map.columns if c.lower() in ('orgid', 'org_id')),
org_map.columns[0]
)
org_map_slim = org_map[[org_id_col, 'clade_name']].rename(columns={org_id_col: 'orgId'})
classif = classif.merge(org_map_slim, on='orgId', how='left')
print(f'Added clade_name for {classif["clade_name"].notna().sum()} records')
else:
org_map = None
print('WARNING: organism_mapping.tsv not found')
Classification records: 1,695
Organisms: 48
Class breakdown:
dependency_class
active_dependency    881
intermediate         547
latent_capability    267

Organism mapping: 48 organisms
2. Compute Cross-Genome Pathway Conservation
For each (organism, pathway) classification, compute the conservation rate = % of species genomes with a complete GapMind prediction for that pathway.
This uses NB01 data directly — no additional Spark queries needed.
# Load NB01 data and compute species-level pathway conservation
gapmind_path = DATA_DIR / 'gapmind_genome_pathways.csv'
print('Loading GapMind data...')
gapmind = pd.read_csv(gapmind_path, dtype={
'genome_id': str, 'species': str, 'pathway': str,
'best_score': np.int8, 'is_complete': np.int8
})
# Species-level conservation: % genomes with complete pathway
print('Computing conservation rates per species × pathway...')
species_conservation = (
gapmind.groupby(['species', 'pathway'])['is_complete']
.agg(['mean', 'count'])
.reset_index()
.rename(columns={
'species': 'clade_name',
'mean': 'conservation_rate',
'count': 'n_genomes_in_species',
})
)
print(f'Species × pathway conservation records: {len(species_conservation):,}')
# Join with classification data
if 'clade_name' in classif.columns:
conserv_data = classif.merge(
species_conservation,
on=['clade_name', 'pathway'],
how='left'
)
has_conserv = conserv_data['conservation_rate'].notna().sum()
print(f'Records with conservation data: {has_conserv:,} '
f'({100 * has_conserv / len(conserv_data):.1f}%)')
else:
print('WARNING: No clade_name column — using species_completion_rate from NB02 if available')
conserv_data = classif.copy()
if 'species_completion_rate' in conserv_data.columns:
conserv_data['conservation_rate'] = conserv_data['species_completion_rate']
else:
conserv_data['conservation_rate'] = np.nan
# Filter to classified records with conservation data
classif_conserv = conserv_data[
(conserv_data['dependency_class'].isin(['active_dependency', 'latent_capability', 'intermediate'])) &
(conserv_data['conservation_rate'].notna())
].copy()
print(f'\nClassified + conservation records: {len(classif_conserv):,}')
print('Conservation rate by class:')
print(classif_conserv.groupby('dependency_class')['conservation_rate'].describe())
Loading GapMind data...
Computing conservation rates per species × pathway...
Species × pathway conservation records: 2,215,200
Records with conservation data: 1,511 (89.1%)
Classified + conservation records: 1,511
Conservation rate by class:
count mean std min 25% 50% 75% max
dependency_class
active_dependency 755.0 0.829334 0.369223 0.0 0.992908 1.0 1.0 1.0
intermediate 508.0 0.906679 0.280862 0.0 1.000000 1.0 1.0 1.0
latent_capability 248.0 0.869374 0.326221 0.0 1.000000 1.0 1.0 1.0
3. Query Pangenome Openness Metrics (Spark)
Retrieve clade-level pangenome statistics: % core, % auxiliary, openness. Openness = % non-core genes (auxiliary + singleton), measuring ongoing gene turnover.
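As a sanity check on the definition, a toy computation (hypothetical counts, not real clade data) of openness from the three cluster counts used below:

```python
# Hypothetical cluster counts for one clade
n_core, n_singleton, total_clusters = 3000, 50000, 120000

# Auxiliary clusters = total - core - singleton
n_auxiliary = total_clusters - n_core - n_singleton  # 67,000

# Openness = % non-core clusters (auxiliary + singleton)
openness = 100.0 * (total_clusters - n_core) / total_clusters  # 97.5
```

Equivalently, openness = 100 − pct_core, which is how the query cell below computes it.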
SPARK_AVAILABLE = False
# On BERDL JupyterHub, get_spark_session is available as a kernel global.
# Try that first, then fall back to explicit module import.
try:
spark = get_spark_session()
SPARK_AVAILABLE = True
print('Spark session available (kernel global)')
except NameError:
try:
from get_spark_session import get_spark_session
spark = get_spark_session()
SPARK_AVAILABLE = True
print('Spark session available (module import)')
except Exception as e:
print(f'Spark not available ({e}) — pangenome openness analysis will be skipped')
except Exception as e:
print(f'Spark not available ({e}) — pangenome openness analysis will be skipped')
Spark session available (kernel global)
pangenome_df = None
openness_path = DATA_DIR / 'pangenome_openness.csv'
if SPARK_AVAILABLE:
# Inspect pangenome table schema
print('Pangenome table schema:')
spark.sql('DESCRIBE kbase_ke_pangenome.pangenome').show(40, truncate=False)
print('\nSample pangenome rows:')
spark.sql('SELECT * FROM kbase_ke_pangenome.pangenome LIMIT 5').show(5, truncate=False)
elif openness_path.exists():
pangenome_df = pd.read_csv(openness_path)
print(f'Loaded cached pangenome data: {len(pangenome_df):,} clades')
Pangenome table schema (kbase_ke_pangenome.pangenome):
gtdb_species_clade_id (string), protocol_id (string), number_of_iterations (int),
mean_initial_completeness (double), total_sum_of_loglikelihood_ratios (double),
corrected_mean_completness (double), estimate_mean_traits_per_genome_count (double),
no_aux_genome (int), no_core (int), no_singleton_gene_clusters (double),
likelies (int), no_CDSes (int), no_gene_clusters (int), no_genomes (int)

Sample rows (key columns):
s__Klebsiella_pneumoniae--RS_GCF_000742135.1: no_core=4199, no_singleton=276743, no_gene_clusters=443124, no_genomes=14240
s__Staphylococcus_aureus--RS_GCF_001027105.1: no_core=2083, no_singleton=86127, no_gene_clusters=147914, no_genomes=14526
s__Salmonella_enterica--RS_GCF_000006945.2: no_core=3639, no_singleton=149757, no_gene_clusters=266371, no_genomes=11402
s__Streptococcus_pneumoniae--RS_GCF_001457635.1: no_core=1475, no_singleton=67191, no_gene_clusters=116845, no_genomes=8434
s__Mycobacterium_tuberculosis--RS_GCF_000195955.2: no_core=3741, no_singleton=97549, no_gene_clusters=143670, no_genomes=6903
if SPARK_AVAILABLE and pangenome_df is None:
# kbase_ke_pangenome.pangenome known schema:
# gtdb_species_clade_id — species/clade identifier
# no_genomes — number of genomes in clade
# no_core — number of core gene clusters
# no_singleton_gene_clusters — number of singleton gene clusters
# no_gene_clusters — total gene clusters
# no_aux_genome — auxiliary genome count (not gene clusters)
pangenome_raw = spark.sql("""
SELECT
gtdb_species_clade_id AS clade_name,
no_genomes AS n_genomes,
no_core AS n_core,
no_singleton_gene_clusters AS n_singleton,
no_gene_clusters AS total_clusters
FROM kbase_ke_pangenome.pangenome
WHERE no_genomes >= 5
""").toPandas()
print(f'Pangenome records (≥5 genomes): {len(pangenome_raw):,}')
print(pangenome_raw.head(5).to_string())
# Derived metrics
# Auxiliary clusters = total - core - singleton (clip at 0 for safety)
pangenome_raw['n_auxiliary'] = (
pangenome_raw['total_clusters']
- pangenome_raw['n_core']
- pangenome_raw['n_singleton']
).clip(lower=0)
pangenome_raw['pct_core'] = (
100.0 * pangenome_raw['n_core'] / pangenome_raw['total_clusters']
)
# Openness = % non-core gene clusters
pangenome_raw['openness'] = 100.0 - pangenome_raw['pct_core']
pangenome_df = pangenome_raw
pangenome_df.to_csv(openness_path, index=False)
print(f'Saved: {openness_path}')
print(f'\nOpenness distribution:')
print(pangenome_df['openness'].describe())
Pangenome records (≥5 genomes): 7,334
clade_name n_genomes n_core n_singleton total_clusters
0 s__Klebsiella_pneumoniae--RS_GCF_000742135.1 14240 4199 276743.0 443124
1 s__Staphylococcus_aureus--RS_GCF_001027105.1 14526 2083 86127.0 147914
2 s__Salmonella_enterica--RS_GCF_000006945.2 11402 3639 149757.0 266371
3 s__Streptococcus_pneumoniae--RS_GCF_001457635.1 8434 1475 67191.0 116845
4 s__Mycobacterium_tuberculosis--RS_GCF_000195955.2 6903 3741 97549.0 143670
Saved: /home/cjneely/repos/BERIL-research-observatory/projects/metabolic_capability_dependency/data/pangenome_openness.csv
Openness distribution:
count 7334.000000
mean 62.177627
std 17.192262
min 0.779133
25% 51.643001
50% 64.353108
75% 74.730913
max 99.052410
Name: openness, dtype: float64
4. Hypothesis Tests (H2a and H2b)
# ── H2a: Conservation by dependency class ─────────────────────────────────────
active = classif_conserv.loc[
classif_conserv['dependency_class'] == 'active_dependency', 'conservation_rate'
].dropna()
latent = classif_conserv.loc[
classif_conserv['dependency_class'] == 'latent_capability', 'conservation_rate'
].dropna()
intermediate = classif_conserv.loc[
classif_conserv['dependency_class'] == 'intermediate', 'conservation_rate'
].dropna()
print('=== H2a: Conservation by Dependency Class ===')
print(f'Active n={len(active):>5}, median={active.median():.3f}, mean={active.mean():.3f}')
print(f'Latent n={len(latent):>5}, median={latent.median():.3f}, mean={latent.mean():.3f}')
print(f'Interm. n={len(intermediate):>5}, median={intermediate.median():.3f}, mean={intermediate.mean():.3f}')
# Mann-Whitney U test (active vs latent)
if len(active) > 0 and len(latent) > 0:
u_stat, u_p = mannwhitneyu(active, latent, alternative='greater') # active > latent
print(f'\nMann-Whitney U (active > latent conservation): U={u_stat:.0f}, p={u_p:.4e}')
if u_p < 0.05:
print('→ H2a supported: active dependencies are significantly more conserved than latent capabilities')
else:
print('→ H2a not supported at p<0.05')
# Kruskal-Wallis (all three classes)
if len(active) > 0 and len(latent) > 0 and len(intermediate) > 0:
kw_stat, kw_p = kruskal(active, latent, intermediate)
print(f'\nKruskal-Wallis (3 classes): H={kw_stat:.2f}, p={kw_p:.4e}')
# Effect size (rank-biserial correlation)
if len(active) > 0 and len(latent) > 0:
n1, n2 = len(active), len(latent)
r_biserial = 1 - 2 * u_stat / (n1 * n2)
print(f'Effect size (rank-biserial r): {r_biserial:.3f}')
=== H2a: Conservation by Dependency Class ===
Active  n=  755, median=1.000, mean=0.829
Latent  n=  248, median=1.000, mean=0.869
Interm. n=  508, median=1.000, mean=0.907

Mann-Whitney U (active > latent conservation): U=88778, p=9.4283e-01
→ H2a not supported at p<0.05

Kruskal-Wallis (3 classes): H=14.97, p=5.6136e-04
Effect size (rank-biserial r): 0.052
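The rank-biserial conversion used above (r = 1 − 2U/(n1·n2)) can be checked on a hand-countable case. A pure-Python sketch with toy values (not project data), counting U as the number of winning pairs; note the sign convention: with this formula, the first sample fully dominating gives r = −1, so the small positive r observed above reflects latent pathways being marginally *more* conserved:

```python
def u_statistic(x, y):
    """Mann-Whitney U for sample x: pairs where x_i beats y_j (ties count half)."""
    return sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)

active_toy = [1.0, 0.9, 0.8]   # hypothetical conservation rates
latent_toy = [0.4, 0.3]
u = u_statistic(active_toy, latent_toy)              # all 6 of 6 pairs won -> U = 6
r = 1 - 2 * u / (len(active_toy) * len(latent_toy))  # complete separation -> r = -1
```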
# ── H2b: Pangenome openness vs % latent capabilities ─────────────────────────
print('=== H2b: Pangenome Openness vs Latent Capability Rate ===')
if pangenome_df is not None and 'clade_name' in classif.columns:
# Compute per-organism latent capability rate
org_latent_rate = (
classif.groupby(['orgId', 'clade_name'])['dependency_class']
.apply(lambda x: (x == 'latent_capability').sum() / len(x))
.reset_index()
.rename(columns={'dependency_class': 'latent_rate'})
)
# Aggregate to one point per clade to ensure independence.
# Multiple FB organisms from the same species clade share a single pangenome
# openness value; treating them as independent inflates n and biases the test.
clade_latent_rate = (
org_latent_rate.groupby('clade_name')['latent_rate']
.mean()
.reset_index()
)
n_clades = len(clade_latent_rate)
n_organisms = len(org_latent_rate)
print(f'Aggregated {n_organisms} organisms → {n_clades} unique clades (one point per clade)')
# Join with pangenome openness
openness_analysis = clade_latent_rate.merge(
pangenome_df[['clade_name', 'openness', 'pct_core', 'n_genomes']],
on='clade_name',
how='inner'
)
print(f'Clades with openness data: {len(openness_analysis)}')
if len(openness_analysis) >= 5:
print(openness_analysis[['clade_name', 'latent_rate', 'openness']].to_string(index=False))
rho, p_rho = spearmanr(openness_analysis['latent_rate'], openness_analysis['openness'])
print(f'\nSpearman correlation (clade latent rate vs openness): ρ={rho:.3f}, p={p_rho:.4f}')
if p_rho < 0.05 and rho > 0:
print('→ H2b supported: clades with more latent capabilities have more open pangenomes')
elif p_rho < 0.05 and rho < 0:
print('→ H2b REVERSED: more closed pangenomes have more latent capabilities (unexpected)')
else:
print('→ H2b not supported at p<0.05')
else:
print('Too few matched clades for correlation analysis')
openness_analysis = None
else:
print('Pangenome data not available or no clade_name column — skipping H2b test')
openness_analysis = None
=== H2b: Pangenome Openness vs Latent Capability Rate ===
Aggregated 41 organisms → 30 unique clades (one point per clade)
Clades with openness data: 22
clade_name latent_rate openness
s__Acidovorax_sp018828125--GB_GCA_018828125.1 0.176471 47.470412
s__Azospirillum_brasilense--RS_GCF_001315015.1 0.193548 54.078947
s__Bacteroides_thetaiotaomicron--RS_GCF_000011065.1 0.171429 95.086388
s__Cupriavidus_basilensis--RS_GCF_008801925.2 0.219512 67.391907
s__Dickeya_dadantii--RS_GCF_000406145.1 0.088889 67.411776
s__Dickeya_dianthicola--RS_GCF_000365305.1 0.233202 61.180796
s__Herbaspirillum_seropedicae--RS_GCF_001040945.1 0.119048 40.297847
s__Klebsiella_michiganensis--RS_GCF_002925905.1 0.288462 92.553657
s__Marinobacter_adhaerens--RS_GCF_000166295.1 0.137931 60.826002
s__Methanococcus_maripaludis--RS_GCF_002945325.1 0.171429 67.958206
s__Nitratidesulfovibrio_vulgaris--RS_GCF_000195755.1 0.092033 51.143911
s__Phaeobacter_inhibens--RS_GCF_000473105.1 0.166667 68.642674
s__Pseudomonas_E_alloputida--RS_GCF_021282585.1 0.204545 90.298734
s__Pseudomonas_E_fluorescens_E--RS_GCF_001307155.1 0.152628 47.362121
s__Pseudomonas_E_simiae--RS_GCF_900111895.1 0.195652 54.979906
s__Pseudomonas_E_syringae--RS_GCF_000507185.2 0.289474 88.339066
s__Ralstonia_solanacearum--RS_GCF_002251695.1 0.125046 83.252923
s__Shewanella_loihica--RS_GCF_000016065.1 0.027027 38.618553
s__Shewanella_sp000203935--RS_GCF_000203935.1 0.052632 46.863469
s__Sinorhizobium_meliloti--RS_GCF_017876815.1 0.220000 91.487826
s__Sphingomonas_koreensis--RS_GCF_002797435.1 0.075000 40.272501
s__Synechococcus_elongatus--RS_GCF_000010065.1 0.000000 15.345593
Spearman correlation (clade latent rate vs openness): ρ=0.688, p=0.0004
→ H2b supported: clades with more latent capabilities have more open pangenomes
5. Figures
# ── Figure 1: Violin plot (with jitter) of conservation by dependency class ──
CLASS_COLORS = {
'active_dependency': '#e74c3c',
'intermediate': '#f39c12',
'latent_capability': '#3498db',
}
CLASS_ORDER = ['active_dependency', 'intermediate', 'latent_capability']
CLASS_LABELS = ['Active\nDependency', 'Intermediate', 'Latent\nCapability']
plot_data = classif_conserv[
classif_conserv['dependency_class'].isin(CLASS_ORDER)
].copy()
fig, ax = plt.subplots(figsize=(8, 6))
parts = ax.violinplot(
[plot_data.loc[plot_data['dependency_class'] == cls, 'conservation_rate'].dropna().values
for cls in CLASS_ORDER],
positions=range(len(CLASS_ORDER)),
showmedians=True,
showextrema=False,
)
for i, (pc, cls) in enumerate(zip(parts['bodies'], CLASS_ORDER)):
pc.set_facecolor(CLASS_COLORS[cls])
pc.set_alpha(0.7)
parts['cmedians'].set_color('black')
parts['cmedians'].set_linewidth(2)
# Add jitter
for i, cls in enumerate(CLASS_ORDER):
vals = plot_data.loc[plot_data['dependency_class'] == cls, 'conservation_rate'].dropna()
if len(vals) > 500:
vals = vals.sample(500, random_state=42)
ax.scatter(
np.random.normal(i, 0.06, len(vals)), vals,
c=CLASS_COLORS[cls], alpha=0.3, s=5, linewidths=0
)
ax.set_xticks(range(len(CLASS_ORDER)))
ax.set_xticklabels(CLASS_LABELS, fontsize=11)
ax.set_ylabel('Cross-Genome Conservation Rate\n(% species genomes with complete pathway)', fontsize=10)
ax.set_title('Pathway Conservation by Dependency Class\n(Black Queen Hypothesis Test)', fontsize=12)
ax.set_ylim(-0.05, 1.05)
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda y, _: f'{y:.0%}'))
# Significance annotation
if len(active) > 0 and len(latent) > 0:
stars = '***' if u_p < 0.001 else '**' if u_p < 0.01 else '*' if u_p < 0.05 else 'ns'
ax.annotate('', xy=(2, 0.95), xytext=(0, 0.95),
arrowprops=dict(arrowstyle='-', color='black', lw=1.5))
ax.text(1, 0.97, stars, ha='center', fontsize=14)
plt.tight_layout()
plt.savefig(FIG_DIR / 'nb04_conservation_boxplot.png', dpi=150, bbox_inches='tight')
plt.close()
print('Saved: figures/nb04_conservation_boxplot.png')
Saved: figures/nb04_conservation_boxplot.png
# ── Figure 2: Conservation CDF by dependency class ───────────────────────────
fig, ax = plt.subplots(figsize=(8, 6))
for cls in CLASS_ORDER:
vals = plot_data.loc[
plot_data['dependency_class'] == cls, 'conservation_rate'
].dropna().sort_values()
if len(vals) == 0:
continue
cdf = np.arange(1, len(vals) + 1) / len(vals)
ax.plot(vals.values, cdf, color=CLASS_COLORS[cls], lw=2,
label=f'{cls.replace("_", " ").title()} (n={len(vals):,})')
ax.set_xlabel('Conservation Rate (% genomes with complete pathway)', fontsize=11)
ax.set_ylabel('Cumulative Fraction', fontsize=11)
ax.set_title('CDF of Pathway Conservation by Dependency Class', fontsize=12)
ax.legend(fontsize=10)
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f'{x:.0%}'))
ax.set_xlim(0, 1.01)
ax.set_ylim(0, 1.01)
# Annotate with test result
if len(active) > 0 and len(latent) > 0:
ax.text(0.05, 0.95,
f'Mann-Whitney p={u_p:.2e}\nEffect r={r_biserial:.3f}',
transform=ax.transAxes, va='top', fontsize=9,
bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))
plt.tight_layout()
plt.savefig(FIG_DIR / 'nb04_conservation_cdf.png', dpi=150, bbox_inches='tight')
plt.close()
print('Saved: figures/nb04_conservation_cdf.png')
Saved: figures/nb04_conservation_cdf.png
# ── Figure 3: Pangenome openness vs latent capability rate (clade-level) ──────
if openness_analysis is not None and len(openness_analysis) >= 3:
fig, ax = plt.subplots(figsize=(8, 6))
sc = ax.scatter(
openness_analysis['latent_rate'] * 100,
openness_analysis['openness'],
s=openness_analysis['n_genomes'].clip(upper=5000) / 5,
c='#3498db', alpha=0.7, linewidths=0.5, edgecolors='navy'
)
# Trend line
x = openness_analysis['latent_rate'] * 100
y = openness_analysis['openness']
if len(x) >= 3:
z = np.polyfit(x, y, 1)
p_fit = np.poly1d(z)
x_line = np.linspace(x.min(), x.max(), 100)
ax.plot(x_line, p_fit(x_line), 'r--', alpha=0.7, lw=1.5)
# Label points by species name (stripped from clade_name)
for _, row in openness_analysis.iterrows():
label = str(row['clade_name']).split('--')[0].replace('s__', '').replace('_', ' ')[:20]
ax.annotate(
label,
(row['latent_rate'] * 100, row['openness']),
textcoords='offset points', xytext=(5, 5), fontsize=7, alpha=0.8
)
ax.set_xlabel('% Latent Capabilities (mean across organisms in clade)', fontsize=10)
ax.set_ylabel('Pangenome Openness (% non-core gene clusters)', fontsize=10)
ax.set_title('Pangenome Openness vs Latent Capability Rate\n(Black Queen Hypothesis — one point per clade)', fontsize=12)
if len(x) >= 5:
ax.text(0.05, 0.95,
f'Spearman ρ={rho:.3f}, p={p_rho:.3f}\nn={len(openness_analysis)} clades',
transform=ax.transAxes, va='top', fontsize=10,
bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))
plt.tight_layout()
plt.savefig(FIG_DIR / 'nb04_openness_scatter.png', dpi=150, bbox_inches='tight')
plt.close()
print('Saved: figures/nb04_openness_scatter.png')
else:
print('Insufficient data for openness scatter plot — skipping')
Saved: figures/nb04_openness_scatter.png
6. Save Outputs
# Save conservation data
out_cols = ['orgId', 'pathway', 'pathway_category', 'dependency_class',
            'mean_abs_t', 'pct_essential', 'conservation_rate']
if 'n_genomes_in_species' in classif_conserv.columns:
    out_cols.append('n_genomes_in_species')
conservation_out = classif_conserv[out_cols].copy()
conservation_out.to_csv(DATA_DIR / 'pathway_conservation.csv', index=False)
print(f'Saved: {DATA_DIR}/pathway_conservation.csv ({len(conservation_out):,} rows)')
if pangenome_df is not None:
pangenome_df.to_csv(DATA_DIR / 'pangenome_openness.csv', index=False)
print(f'Saved: {DATA_DIR}/pangenome_openness.csv ({len(pangenome_df):,} rows)')
print('\n=== NB04 Summary ===')
if len(active) > 0 and len(latent) > 0:
print(f'Active dependency conservation: {active.median():.1%} median')
print(f'Latent capability conservation: {latent.median():.1%} median')
print(f'Mann-Whitney p = {u_p:.2e}')
print(f'Effect size r = {r_biserial:.3f}')
if u_p < 0.05:
print('→ H2 SUPPORTED: Latent capabilities are less conserved than active dependencies')
else:
print('→ H2 not supported at p<0.05')
print('\nNext step: Run NB05 for metabolic ecotype analysis, then synthesize all findings in REPORT.md')
Saved: /home/cjneely/repos/BERIL-research-observatory/projects/metabolic_capability_dependency/data/pathway_conservation.csv (1,511 rows)
Saved: /home/cjneely/repos/BERIL-research-observatory/projects/metabolic_capability_dependency/data/pangenome_openness.csv (7,334 rows)

=== NB04 Summary ===
Active dependency conservation: 100.0% median
Latent capability conservation: 100.0% median
Mann-Whitney p = 9.43e-01
Effect size r = 0.052
→ H2 not supported at p<0.05

Next step: Run NB05 for metabolic ecotype analysis, then synthesize all findings in REPORT.md
Completion
Outputs generated:
- data/pathway_conservation.csv — Conservation rates per pathway class
- data/pangenome_openness.csv — Clade-level pangenome openness
- figures/nb04_conservation_boxplot.png — Violin/jitter plot
- figures/nb04_conservation_cdf.png — CDF comparison
- figures/nb04_openness_scatter.png — Openness vs latent capability rate
Interpretation:
- H2a (latent = low conservation): If supported, it means complete-but-neutral pathways are on an evolutionary trajectory of loss — consistent with the Black Queen Hypothesis
- H2b (more latent → more open pangenome): If supported, organisms with many latent capabilities have more dynamic genomes, suggesting ongoing streamlining
- H2 caveat: Conservation of pathway completeness is a coarser measure than gene-cluster conservation: a pathway registers as 'incomplete' as soon as the gene for a single step is lost.