01 Explore Data
Jupyter notebook from the Antibiotic Resistance Hotspots in Microbial Pangenomes project.
Phase 1: Explore BERDL Data & Pangenome Structure¶
This notebook explores the BERDL pangenome collection to understand its structure and prepare for antibiotic resistance gene (ARG) detection.
Goals¶
- Connect to BERDL and verify access
- Explore pangenome collection tables and schemas
- Understand the relationship between genomes, genes, and orthogroups
- Document table structures and row counts
- Identify metadata fields for downstream analysis (environment, taxonomy, etc.)
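As context for the exploration below, the expected genome–gene–orthogroup relationship can be sketched with toy pandas data: each gene belongs to one genome and is assigned to one orthogroup, so orthogroups group homologous genes across genomes. The column names here (`genome_id`, `gene_id`, `orthogroup_id`) are assumptions; the actual BERDL schemas are inspected in the cells below.

```python
import pandas as pd

# Toy data illustrating the assumed genome -> gene -> orthogroup relationship.
# Column names are hypothetical; the real BERDL schemas may differ.
genes = pd.DataFrame({
    "gene_id": ["g1", "g2", "g3", "g4"],
    "genome_id": ["A", "A", "B", "B"],
    "orthogroup_id": ["OG1", "OG2", "OG1", "OG2"],
})
genomes = pd.DataFrame({"genome_id": ["A", "B"], "species": ["sp1", "sp2"]})

# Join genes to their genomes, then count distinct genomes per orthogroup
# (an orthogroup present in every genome would be "core").
merged = genes.merge(genomes, on="genome_id")
genomes_per_og = merged.groupby("orthogroup_id")["genome_id"].nunique()
print(genomes_per_og)
```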
In [ ]:
# Import required libraries
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession
import os
from dotenv import load_dotenv
# Load environment variables
load_dotenv()
print("Libraries imported successfully")
In [ ]:
# Initialize Spark session (if not already initialized)
# Note: In BERDL JupyterHub, spark session is usually pre-initialized
try:
    spark
    print("Spark session already initialized")
except NameError:
    spark = SparkSession.builder \
        .appName("resistance_hotspots") \
        .getOrCreate()
    print("Spark session created")
Explore Available Collections¶
In [ ]:
# List available databases/collections
databases = spark.sql("SHOW DATABASES").collect()
print(f"Available databases: {len(databases)}")
for db in databases:
    print(f"  - {db.namespace}")
In [ ]:
# Focus on the pangenome collection
pangenome_tables = spark.sql("SHOW TABLES IN kbase_ke_pangenome").collect()
print(f"Tables in kbase_ke_pangenome: {len(pangenome_tables)}")
for table in pangenome_tables:
    print(f"  - {table.tableName}")
Examine Key Tables¶
In [ ]:
# Check pangenome table
pangenome_df = spark.sql("SELECT * FROM kbase_ke_pangenome.pangenome LIMIT 5")
print("Pangenome Table Schema:")
pangenome_df.printSchema()
print(f"\nRow count: {spark.sql('SELECT COUNT(*) as count FROM kbase_ke_pangenome.pangenome').collect()[0]['count']}")
In [ ]:
# Check gene table
gene_df = spark.sql("SELECT * FROM kbase_ke_pangenome.gene LIMIT 5")
print("Gene Table Schema:")
gene_df.printSchema()
print(f"\nRow count: {spark.sql('SELECT COUNT(*) as count FROM kbase_ke_pangenome.gene').collect()[0]['count']}")
In [ ]:
# Check eggNOG annotations (functional annotations, including potential resistance genes)
eggnog_df = spark.sql("SELECT * FROM kbase_ke_pangenome.eggnog_mapper_annotations LIMIT 5")
print("\neggNOG Annotations Table Schema:")
eggnog_df.printSchema()
print(f"\nRow count: {spark.sql('SELECT COUNT(*) as count FROM kbase_ke_pangenome.eggnog_mapper_annotations').collect()[0]['count']}")
In [ ]:
# Check orthogroup table
orthogroup_df = spark.sql("SELECT * FROM kbase_ke_pangenome.orthogroup LIMIT 5")
print("Orthogroup Table Schema:")
orthogroup_df.printSchema()
print(f"\nRow count: {spark.sql('SELECT COUNT(*) as count FROM kbase_ke_pangenome.orthogroup').collect()[0]['count']}")
In [ ]:
# Check GTDB species taxonomy
gtdb_df = spark.sql("SELECT * FROM kbase_ke_pangenome.gtdb_species_clade LIMIT 5")
print("GTDB Species Clade Table Schema:")
gtdb_df.printSchema()
print(f"\nRow count: {spark.sql('SELECT COUNT(*) as count FROM kbase_ke_pangenome.gtdb_species_clade').collect()[0]['count']}")
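GTDB taxonomy is conventionally stored as a single semicolon-delimited string with rank prefixes (`d__`, `p__`, `c__`, `o__`, `f__`, `g__`, `s__`). If the clade table exposes such a field, it can be split into named ranks for downstream grouping. A minimal sketch on an example string; whether and where such a column exists in `gtdb_species_clade` is an assumption to verify against the schema printed above.

```python
import pandas as pd

# Example GTDB-style taxonomy string (standard rank-prefixed format).
tax = pd.Series([
    "d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;"
    "o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli"
])

ranks = ["domain", "phylum", "class", "order", "family", "genus", "species"]
# Split on ';' into one column per rank, then strip the 'x__' prefix.
parsed = tax.str.split(";", expand=True)
parsed.columns = ranks
parsed = parsed.apply(lambda col: col.str[3:])
print(parsed.iloc[0].to_dict())
```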
Basic Statistics¶
In [ ]:
# Get summary statistics
stats_query = """
SELECT
(SELECT COUNT(DISTINCT pangenome_id) FROM kbase_ke_pangenome.pangenome) as n_pangenomes,
(SELECT COUNT(DISTINCT genome_id) FROM kbase_ke_pangenome.genome) as n_genomes,
(SELECT COUNT(*) FROM kbase_ke_pangenome.gene) as n_genes,
(SELECT COUNT(*) FROM kbase_ke_pangenome.orthogroup) as n_orthogroups
"""
stats = spark.sql(stats_query).collect()[0]
print("BERDL Pangenome Collection Summary:")
print(f" Pangenomes: {stats.n_pangenomes}")
print(f" Genomes: {stats.n_genomes}")
print(f" Genes: {stats.n_genes}")
print(f" Orthogroups: {stats.n_orthogroups}")
Next Steps¶
- Identify which tables contain functional annotations or gene descriptions
- Look for any existing resistance-related annotations in the database
- Plan ARG detection strategy in notebook 02_identify_args.ipynb
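As a preview of the ARG detection planned for notebook 02, one simple first-pass strategy is keyword screening of functional annotation descriptions. A minimal pandas sketch on toy data; the description column name and the keyword list are illustrative assumptions, not the eggNOG schema.

```python
import pandas as pd

# Toy annotation table standing in for eggnog_mapper_annotations.
annotations = pd.DataFrame({
    "gene_id": ["g1", "g2", "g3", "g4"],
    "description": [
        "beta-lactamase class A",
        "ribosomal protein L3",
        "multidrug efflux pump",
        "DNA polymerase III",
    ],
})

# Case-insensitive keyword screen for resistance-related terms.
keywords = ["beta-lactamase", "efflux", "resistance", "antibiotic"]
pattern = "|".join(keywords)
hits = annotations[annotations["description"].str.contains(pattern, case=False, regex=True)]
print(hits["gene_id"].tolist())
```

Keyword screening is only a coarse filter (it misses genes annotated without these terms and can match false positives), so it would complement, not replace, a dedicated ARG database search.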