02 Identify ARGs
Jupyter notebook from the Antibiotic Resistance Hotspots in Microbial Pangenomes project.
Phase 2: Identify Antibiotic Resistance Genes (ARGs)
This notebook identifies antibiotic resistance genes in the pangenome collection by:
- Querying for functional annotations matching ARG databases
- Creating an ARG annotation dataset
- Linking genes to resistance mechanisms and drug classes
Data Sources for ARG Detection
- CARD (Comprehensive Antibiotic Resistance Database)
- ResFinder database
- PATRIC resistance annotations
- Keyword matching in gene descriptions/functions
In [ ]:
import pandas as pd
import numpy as np
from pyspark.sql import SparkSession
import re
# Spark session should be initialized from notebook 01
print("Phase 2: ARG Identification")
Step 1: Download and Parse ARG Reference Databases
TODO: Fetch and parse CARD, ResFinder, PATRIC databases
In [ ]:
# Placeholder: Create ARG reference dataset
# In practice, this would involve:
# 1. Downloading CARD database
# 2. Parsing ResFinder annotations
# 3. Creating BLASTP database for homology matching
arg_keywords = [
    'antibiotic', 'resistance', 'resistant', 'beta-lactamase', 'efflux',
    'ampicillin', 'penicillin', 'tetracycline', 'fluoroquinolone', 'macrolide',
    'vancomycin', 'methicillin', 'carbapenem', 'cephalosporin'
]
print(f"ARG keywords for detection: {len(arg_keywords)}")
print(arg_keywords)
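One way these keywords might be applied is to compile them into a single case-insensitive pattern and flag matching descriptions. The sketch below runs on a toy DataFrame rather than real pangenome data. The column name `gene_description` follows the query prepared in Step 2, while `build_keyword_pattern`, `toy_genes`, and `is_arg_candidate` are hypothetical names introduced for illustration.

```python
import re
import pandas as pd

arg_keywords = [
    'antibiotic', 'resistance', 'resistant', 'beta-lactamase', 'efflux',
    'ampicillin', 'penicillin', 'tetracycline', 'fluoroquinolone', 'macrolide',
    'vancomycin', 'methicillin', 'carbapenem', 'cephalosporin'
]

def build_keyword_pattern(keywords):
    """Compile the keywords into one case-insensitive alternation (hypothetical helper)."""
    # re.escape keeps characters such as '-' literal inside the pattern.
    return re.compile("|".join(re.escape(kw) for kw in keywords), flags=re.IGNORECASE)

pattern = build_keyword_pattern(arg_keywords)

# Toy data only: these rows are made up to exercise the matcher.
toy_genes = pd.DataFrame({
    "gene_id": ["g1", "g2", "g3"],
    "gene_description": ["Class A Beta-Lactamase", "DNA polymerase III subunit", None],
})

# Missing descriptions are treated as empty strings, so they never match.
toy_genes["is_arg_candidate"] = (
    toy_genes["gene_description"].fillna("").map(lambda d: pattern.search(d) is not None)
)
print(toy_genes)
```

A keyword hit only marks a candidate; broad terms such as 'resistance' will also flag genes unrelated to antibiotics, so the flag is best read as a first-pass filter.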
Step 2: Query BERDL for Gene Annotations
In [ ]:
# Query genes with functional annotations
# This assumes gene descriptions are available in the database
genes_query = """
SELECT
    gene_id,
    orthogroup_id,
    genome_id,
    gene_description,
    gene_function
FROM kbase_ke_pangenome.gene
WHERE gene_description IS NOT NULL
   OR gene_function IS NOT NULL
LIMIT 1000
"""
# This will be populated after exploring table structure in notebook 01
print("Query prepared for gene annotation retrieval")
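Once the table structure is confirmed, the prepared query could be executed along the following lines. This is a minimal sketch, assuming a live Spark session named `spark` from notebook 01 and assuming the column names in the query actually exist in the table. `fetch_gene_annotations` and `EXPECTED_COLUMNS` are hypothetical names; only `spark.sql`, the `columns` attribute, and `toPandas` are PySpark calls.

```python
# Columns the query above selects; these are assumptions until the schema is verified.
EXPECTED_COLUMNS = ["gene_id", "orthogroup_id", "genome_id", "gene_description", "gene_function"]

def fetch_gene_annotations(spark, query, expected_columns=EXPECTED_COLUMNS):
    """Run the query on an existing Spark session and return a pandas DataFrame."""
    sdf = spark.sql(query)
    # Fail early with a readable message if the result lacks an expected column.
    missing = [c for c in expected_columns if c not in sdf.columns]
    if missing:
        raise ValueError(f"Query result is missing expected columns: {missing}")
    # toPandas() collects everything to the driver; the LIMIT in the query keeps this small.
    return sdf.toPandas()

# Usage, once a session is available:
# genes_df = fetch_gene_annotations(spark, genes_query)
```

Collecting to pandas is a convenience for the keyword matching in Step 3; for the full gene table, keeping the data in Spark would likely be the safer choice.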
Step 3: Identify ARGs by Keyword Matching
In [ ]:
# Implement keyword matching logic with case-insensitive matching
# TODO: Extract and annotate ARGs from gene descriptions and annotations
def extract_drug_class(description):
    """Return the first drug class whose keyword occurs in the description (case-insensitive)."""
    if not description:
        return None
    # Keywords are matched as plain substrings, so short tokens such as 'tet',
    # 'erm', or 'cam' can also hit unrelated words (e.g. 'permease' contains 'erm').
    drug_mappings = {
        'beta-lactam': ['beta-lactamase', 'penicillin', 'ampicillin', 'cephalosporin', 'carbapenem', 'beta_lactam'],
        'tetracycline': ['tetracycline', 'tet', 'oxytetracycline'],
        'fluoroquinolone': ['fluoroquinolone', 'quinolone', 'gyra', 'gyrb', 'qnr'],
        'macrolide': ['macrolide', 'erythromycin', 'erm', 'mls'],
        'vancomycin': ['vancomycin', 'vana', 'vanb', 'vanc', 'vand'],
        'aminoglycoside': ['aminoglycoside', 'kanamycin', 'gentamicin', 'streptomycin', 'aac', 'aad', 'ags'],
        'sulfonamide': ['sulfonamide', 'sulfon', 'dhps'],
        'chloramphenicol': ['chloramphenicol', 'cam'],
    }
    desc_lower = description.lower()
    # Dict order decides ties: the first class with a matching keyword is returned.
    for drug_class, keywords in drug_mappings.items():
        if any(kw in desc_lower for kw in keywords):
            return drug_class
    return None
print("ARG detection functions prepared with case-insensitive matching")
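Because `extract_drug_class` matches keywords as plain substrings, short gene-symbol tokens can hit unrelated words: 'erm' occurs in 'permease' and 'tet' in 'tetrahydrofolate'. A stricter alternative for such tokens might anchor them at a word start and allow only a short gene-style suffix, as sketched below. The suffix rule (one letter or digit, or a parenthesised tag such as `(M)`) is an assumption about how gene symbols appear in descriptions, and `SHORT_TOKENS` and `match_gene_symbol` are hypothetical names covering only a subset of the mapping.

```python
import re

# Subset of short tokens from the drug-class mapping, for illustration only.
SHORT_TOKENS = {
    "tetracycline": ["tet"],
    "fluoroquinolone": ["qnr"],
    "macrolide": ["erm"],
}

# \b anchors the token at a word start; the optional group admits symbols
# such as tetA, tet(M), qnrS, ermB, and the final \b ends the word there.
_SYMBOL_PATTERNS = {
    drug_class: [
        re.compile(rf"\b{re.escape(tok)}(?:[a-z0-9]|\([a-z0-9]+\))?\b", re.IGNORECASE)
        for tok in tokens
    ]
    for drug_class, tokens in SHORT_TOKENS.items()
}

def match_gene_symbol(description):
    """Return the drug class whose gene-symbol token appears as its own word, else None."""
    if not description:
        return None
    for drug_class, patterns in _SYMBOL_PATTERNS.items():
        if any(p.search(description) for p in patterns):
            return drug_class
    return None

print(match_gene_symbol("Tetracycline resistance protein TetA"))
print(match_gene_symbol("ABC transporter permease"))
```

This still admits any word made of a token plus one character, so it reduces false positives rather than eliminating them; homology against a curated database remains the more reliable route.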
Step 4: Create ARG Annotation Dataset
In [ ]:
# TODO: Create comprehensive ARG annotation table
# Columns: gene_id, orthogroup_id, arg_name, drug_class, resistance_mechanism, source_db, confidence
print("ARG annotation dataset creation in progress...")
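The table described in the TODO above could be assembled roughly as follows. This is a sketch on toy data, not the project's pipeline: `build_arg_annotations` is a hypothetical helper, the `classify` argument stands in for `extract_drug_class`, and the values `"keyword_match"` and `"low"` for `source_db` and `confidence` are placeholder assumptions. `arg_name` and `resistance_mechanism` are left empty because keyword matching alone cannot supply them.

```python
import pandas as pd

ARG_COLUMNS = ["gene_id", "orthogroup_id", "arg_name", "drug_class",
               "resistance_mechanism", "source_db", "confidence"]

def build_arg_annotations(genes, classify):
    """Keep genes with a drug-class call and lay them out in the ARG table schema."""
    out = genes[["gene_id", "orthogroup_id"]].copy()
    # Only string descriptions are classified; missing values yield no call.
    out["drug_class"] = genes["gene_description"].map(
        lambda d: classify(d) if isinstance(d, str) else None
    )
    out = out[out["drug_class"].notna()].copy()
    out["arg_name"] = None               # not derivable from keywords alone
    out["resistance_mechanism"] = None   # to be filled from a reference database
    out["source_db"] = "keyword_match"   # placeholder label (assumption)
    out["confidence"] = "low"            # placeholder label (assumption)
    return out[ARG_COLUMNS].reset_index(drop=True)

# Toy input and a stand-in classifier, both made up for this example.
toy_genes = pd.DataFrame({
    "gene_id": ["g1", "g2", "g3"],
    "orthogroup_id": ["og1", "og2", "og3"],
    "gene_description": ["Class A beta-lactamase", "DNA polymerase III subunit", None],
})
toy_classify = lambda d: "beta-lactam" if "beta-lactamase" in d.lower() else None

arg_table = build_arg_annotations(toy_genes, toy_classify)
print(arg_table)
```

Passing the classifier in as an argument keeps the table logic independent of how drug classes are called, so a database-backed classifier could replace the keyword one later.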
Next Steps
- Save ARG annotation dataset to data/arg_annotations.csv
- Continue to notebook 03_distribution_analysis.ipynb for prevalence analysis
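The save step listed above might look like the following sketch. `save_arg_annotations` is a hypothetical helper; the default path is the one named in the list, and the function creates the `data/` directory if it does not exist yet.

```python
from pathlib import Path
import pandas as pd

def save_arg_annotations(arg_table, path="data/arg_annotations.csv"):
    """Write the ARG annotation table to CSV and return the path written."""
    path = Path(path)
    # Create the parent directory on first use instead of failing.
    path.parent.mkdir(parents=True, exist_ok=True)
    # index=False keeps the pandas row index out of the file.
    arg_table.to_csv(path, index=False)
    return path

# Usage, once the annotation table exists:
# save_arg_annotations(arg_table)
```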