01 Enigma Discovery
Jupyter notebook from the Field vs Lab Gene Importance in Desulfovibrio vulgaris Hildenborough project.
NB 01: ENIGMA CORAL Discovery¶
Discover what data exists in the undocumented ENIGMA CORAL database tables. Assess whether ENIGMA adds TnSeq, DubSeq, community, or field sampling data beyond what the Fitness Browser already provides for DvH.
Run via Spark Connect — get_spark_session() from berdl_notebook_utils.
Outputs: Summary of ENIGMA data availability, schema notes.
Execution Summary¶
Executed 2026-02-15 via Spark Connect. Key findings:
- 47 tables in
enigma_coral(17 sdt_, 23 ddt_brick, 5 sys_*, 2 other)- No DvH data: Zero DvH strains, genomes, or genes in the database. The single TnSeq library is for FW300-N2E2 (a Pseudomonas isolate), not DvH.
- All 596 locations are Oak Ridge Reservation sites
- 3 DubSeq libraries: E. coli BW25113, P. putida KT2440, B. thetaiotaomicron VPI-5482 — none for DvH
- 6 Desulfovibrio taxa in
sdt_taxonbut only at genus/family/order level — no DvH Hildenborough- 6,705 genomes, 15,015 genes — these are environmental isolates from ORR, not DvH
- 4,346 field samples with geochemistry data in brick tables (uranium, metals)
- 213,044 ASVs for community composition (16S amplicon data)
- Assessment: ENIGMA CORAL does NOT add TnSeq/fitness data for DvH beyond the Fitness Browser. It provides complementary environmental/community data (field samples, geochemistry, community composition) that could be used for future analyses linking DvH ecology to field conditions, but not for the current gene-level fitness analysis.
In [ ]:
import pandas as pd
import numpy as np
from pathlib import Path
spark = get_spark_session()
print(f"Spark version: {spark.version}")
DATA_DIR = Path('../data')
DATA_DIR.mkdir(exist_ok=True)
1. List All ENIGMA Tables¶
In [ ]:
# List all tables in enigma_coral
tables = spark.sql("SHOW TABLES IN enigma_coral").toPandas()
print(f"Total ENIGMA tables: {len(tables)}")
print()
# Categorize tables
sdt_tables = tables[tables['tableName'].str.startswith('sdt_')]
ddt_tables = tables[tables['tableName'].str.startswith('ddt_')]
sys_tables = tables[tables['tableName'].str.startswith('sys_')]
other_tables = tables[~tables['tableName'].str.startswith(('sdt_', 'ddt_', 'sys_'))]
print(f"Scientific data tables (sdt_*): {len(sdt_tables)}")
print(f"Data brick tables (ddt_*): {len(ddt_tables)}")
print(f"System tables (sys_*): {len(sys_tables)}")
print(f"Other tables: {len(other_tables)}")
print()
print("sdt_ tables:")
for t in sorted(sdt_tables['tableName'].tolist()):
print(f" {t}")
2. Row Counts for Key Tables¶
In [ ]:
# Get row counts for all sdt_ tables
row_counts = {}
for t in sorted(sdt_tables['tableName'].tolist()):
try:
n = spark.sql(f"SELECT COUNT(*) as n FROM enigma_coral.{t}").toPandas()['n'].iloc[0]
row_counts[t] = n
print(f" {t}: {n:,} rows")
except Exception as e:
print(f" {t}: ERROR - {e}")
row_counts[t] = -1
3. TnSeq Libraries¶
In [ ]:
# What TnSeq libraries exist?
tnseq = spark.sql("""
SELECT *
FROM enigma_coral.sdt_tnseq_library
LIMIT 50
""").toPandas()
print(f"TnSeq library rows (sample): {len(tnseq)}")
print(f"Columns: {tnseq.columns.tolist()}")
print()
tnseq
4. DubSeq Libraries¶
In [ ]:
# What DubSeq data is available?
dubseq = spark.sql("""
SELECT *
FROM enigma_coral.sdt_dubseq_library
LIMIT 50
""").toPandas()
print(f"DubSeq library rows (sample): {len(dubseq)}")
print(f"Columns: {dubseq.columns.tolist()}")
print()
dubseq
5. Strains¶
In [ ]:
# What strains are recorded?
strains = spark.sql("""
SELECT *
FROM enigma_coral.sdt_strain
LIMIT 100
""").toPandas()
print(f"Strain rows (sample): {len(strains)}")
print(f"Columns: {strains.columns.tolist()}")
print()
# Check for DvH strains
if 'sdt_strain_name' in strains.columns:
dvh_strains = strains[strains['sdt_strain_name'].str.contains('Desulfovibrio|DvH|vulgaris', case=False, na=False)]
print(f"DvH-related strains in sample: {len(dvh_strains)}")
if len(dvh_strains) > 0:
print(dvh_strains)
print()
strains.head(20)
6. Experimental Conditions¶
In [ ]:
# What experimental conditions were tested?
conditions = spark.sql("""
SELECT *
FROM enigma_coral.sdt_condition
LIMIT 100
""").toPandas()
print(f"Condition rows (sample): {len(conditions)}")
print(f"Columns: {conditions.columns.tolist()}")
print()
conditions.head(20)
7. Field Samples¶
In [ ]:
# What field samples exist?
samples = spark.sql("""
SELECT *
FROM enigma_coral.sdt_sample
LIMIT 50
""").toPandas()
print(f"Sample rows (sample): {len(samples)}")
print(f"Columns: {samples.columns.tolist()}")
print()
samples.head(20)
8. Community Composition¶
In [ ]:
# Community composition data
community = spark.sql("""
SELECT *
FROM enigma_coral.sdt_community
LIMIT 50
""").toPandas()
print(f"Community rows (sample): {len(community)}")
print(f"Columns: {community.columns.tolist()}")
print()
community.head(20)
9. Sampling Locations¶
In [ ]:
# Sampling locations
locations = spark.sql("""
SELECT *
FROM enigma_coral.sdt_location
LIMIT 50
""").toPandas()
print(f"Location rows (sample): {len(locations)}")
print(f"Columns: {locations.columns.tolist()}")
print()
locations.head(20)
10. Desulfovibrio Presence in Taxon Table¶
In [ ]:
# Check for Desulfovibrio in taxon table
dvh_taxa = spark.sql("""
SELECT *
FROM enigma_coral.sdt_taxon
WHERE sdt_taxon_name LIKE '%Desulfovibrio%'
OR sdt_taxon_name LIKE '%vulgaris%'
""").toPandas()
print(f"Desulfovibrio/vulgaris taxa: {len(dvh_taxa)}")
print(f"Columns: {dvh_taxa.columns.tolist()}")
print()
dvh_taxa
11. Data Brick Tables¶
In [ ]:
# Sample ddt_brick* tables to understand numerical data format
brick_tables = ddt_tables[ddt_tables['tableName'].str.startswith('ddt_brick')]
print(f"Brick tables found: {len(brick_tables)}")
for t in sorted(brick_tables['tableName'].tolist())[:5]:
print(f"\n--- {t} ---")
try:
schema = spark.sql(f"DESCRIBE enigma_coral.{t}").toPandas()
print(f"Columns: {schema['col_name'].tolist()}")
sample = spark.sql(f"SELECT * FROM enigma_coral.{t} LIMIT 3").toPandas()
print(f"Sample rows: {len(sample)}")
print(sample)
except Exception as e:
print(f"ERROR: {e}")
12. Summary & Assessment¶
In [ ]:
print("=" * 60)
print("ENIGMA CORAL DISCOVERY SUMMARY")
print("=" * 60)
print()
print("Table row counts:")
for t, n in sorted(row_counts.items()):
status = f"{n:,}" if n >= 0 else "ERROR"
print(f" {t}: {status}")
print()
print("Key questions:")
print(f" TnSeq libraries found: {len(tnseq)}")
print(f" DubSeq libraries found: {len(dubseq)}")
print(f" Desulfovibrio taxa: {len(dvh_taxa)}")
print(f" Sampling locations: {len(locations)}")
print()
print("Assessment:")
print(" [Fill in after running: does ENIGMA CORAL add data beyond")
print(" the Fitness Browser for this project's analyses?]")
print("=" * 60)