AI-powered genomic resistance prediction for E. coli: Ciprofloxacin, Ceftriaxone & Amoxicillin
This project is a multi-task machine learning system that predicts antibiotic resistance in E. coli genomes. Given 40 genomic features extracted from whole-genome sequencing data (gene presence/absence markers, mutation counts, and proportion scores), the model simultaneously predicts resistance outcomes for three clinically important antibiotics:
| Antibiotic | Drug Family | Treats |
|---|---|---|
| Ciprofloxacin | Fluoroquinolone | UTIs, respiratory infections |
| Ceftriaxone | 3rd-gen Cephalosporin | Pneumonia, meningitis, hospital infections |
| Amoxicillin | Penicillin | Ear infections, strep throat, chest infections |
Each prediction returns one of three standardised AST labels:
- 🟢 S: Susceptible (the antibiotic kills the bacteria)
- 🟡 I: Intermediate (uncertain; depends on dose and context)
- 🔴 R: Resistant (the antibiotic fails and the bacteria survive)
| Antibiotic | Accuracy | MCC | F1 (weighted) | AUC |
|---|---|---|---|---|
| Ciprofloxacin | 78.3% | 0.553 | 0.717 | 0.934 |
| Ceftriaxone | 83.5% | 0.632 | 0.783 | 0.935 |
| Amoxicillin | 79.8% | 0.572 | 0.732 | 0.934 |
Algorithm: MultiOutputClassifier wrapping RandomForestClassifier (200 trees, max depth 12)
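
A minimal sketch of the architecture named above, assuming scikit-learn and using synthetic stand-in data rather than the project's dataset. The hyperparameters (200 trees, depth 12) come from this README; the random seed, array shapes, and integer label encoding are illustrative assumptions, and the actual `train_model.py` may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Synthetic stand-in data: 300 genomes x 40 features, 3 label columns.
# Integers 0/1/2 stand in for encoded S/I/R labels (assumption).
rng = np.random.default_rng(0)
X = rng.random((300, 40))
y = rng.choice([0, 1, 2], size=(300, 3), p=[0.65, 0.15, 0.20])

# MultiOutputClassifier fits one independent forest per antibiotic column.
base = RandomForestClassifier(n_estimators=200, max_depth=12, random_state=42)
model = MultiOutputClassifier(base)
model.fit(X, y)

pred = model.predict(X[:5])         # shape (5, 3): one label per antibiotic
proba = model.predict_proba(X[:5])  # list of 3 arrays, one per antibiotic
```

Because each antibiotic gets its own forest, the three predictions share the input features but not any learned parameters.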
Genomic data and Antimicrobial Susceptibility Testing (AST) results were sourced from the PATRIC / BV-BRC public database, a large, freely available bacterial genomics repository funded by the U.S. National Institute of Allergy and Infectious Diseases (NIAID).
- Organism: Escherichia coli (Gram-negative, clinically relevant)
- Genomes collected: 3,000 whole-genome sequencing (WGS) records
- AST labels: R / I / S for Ciprofloxacin, Ceftriaxone, and Amoxicillin
Raw data from PATRIC required significant cleaning before it was usable:
- Duplicate removal: genomes with identical PATRIC IDs or redundant AST entries were dropped
- Missing label handling: genomes missing AST results for any of the 3 antibiotics were excluded
- Outlier filtering: genomes with abnormal feature distributions (>3 SD from mean) were flagged and reviewed
- Label standardisation: raw MIC values and non-standard labels were mapped to S / I / R using EUCAST 2023 breakpoints
- Class balance check: final label distribution confirmed: S ≈ 65%, I ≈ 15%, R ≈ 20%
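
The cleaning steps above could be sketched with pandas roughly as follows. This is an illustration, not the project's actual script: the `genome_id` column name is hypothetical, the label column names are borrowed from the API response keys, and `mic_to_sir` takes its breakpoints as arguments so that no specific EUCAST values are assumed here.

```python
import pandas as pd

# Label column names assumed from the API response keys in this README.
LABEL_COLS = ["CIPRO", "CEFTRIAXONE", "AMOXICILLIN"]


def clean(df, id_col="genome_id", label_cols=LABEL_COLS, z_max=3.0):
    """Deduplicate, drop genomes with missing labels, and flag outliers."""
    df = df.drop_duplicates(subset=id_col).dropna(subset=label_cols).copy()
    feature_cols = [c for c in df.columns if c not in (id_col, *label_cols)]
    feats = df[feature_cols]
    # Replace zero std with 1 so constant columns do not divide by zero.
    std = feats.std(ddof=0).replace(0, 1.0)
    z = (feats - feats.mean()) / std
    # Flag (rather than drop) any genome with a feature beyond z_max SDs.
    df["outlier_flag"] = (z.abs() > z_max).any(axis=1)
    return df


def mic_to_sir(mic, s_max, r_min):
    """Map a MIC to S/I/R: S if mic <= s_max, R if mic > r_min, else I."""
    if mic <= s_max:
        return "S"
    if mic > r_min:
        return "R"
    return "I"
```

Flagging outliers instead of deleting them matches the "flagged and reviewed" wording above and leaves the final decision to a human.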
Features were extracted from cleaned genome assemblies using two bioinformatics tools:
a) AMRFinderPlus identifies resistance genes:

```bash
# The AMRFinderPlus executable is `amrfinder`; -n takes a nucleotide FASTA.
# --organism enables organism-specific point-mutation screening (e.g. gyrA S83L).
amrfinder -n genome.fasta --organism Escherichia -o amr_genes.tsv
```

| Feature Group | Value Type | What Was Extracted |
|---|---|---|
| F1–F15 | Binary (0/1) | Presence/absence of resistance genes & key mutations (e.g., gyrA S83L, bla_TEM, bla_CTX-M-15) |
| F16–F30 | Integer (0–10) | Counts of resistance elements, mobile genetic elements, efflux pump genes |
| F31–F40 | Real (0.00–1.00) | Proportional scores: fraction of genome with resistance markers, intact target sites |
b) bcftools identifies point mutations:

```bash
bcftools mpileup -f reference.fasta genome.bam | bcftools call -mv -o variants.vcf
```

Mutations in gyrA, parC, ompF, and beta-lactamase regions were recorded as binary features.
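
As a hedged illustration of the three feature groups in the table above, the following sketch checks that a feature mapping respects the documented ranges. The function name `validate_features` is hypothetical, and whether the real application performs any such validation is not stated in this README.

```python
def validate_features(features):
    """Return a list of problems found in an {"F1": value, ...} mapping."""
    problems = []
    for name, value in features.items():
        # Feature names are expected to look like F1 .. F40.
        if not (name.startswith("F") and name[1:].isdigit()):
            problems.append(f"{name}: unknown feature name")
            continue
        idx = int(name[1:])
        if 1 <= idx <= 15:        # binary presence/absence markers
            ok = value in (0, 1)
        elif 16 <= idx <= 30:     # integer counts, 0 to 10
            ok = isinstance(value, int) and 0 <= value <= 10
        elif 31 <= idx <= 40:     # real-valued proportions, 0.0 to 1.0
            ok = isinstance(value, (int, float)) and 0.0 <= value <= 1.0
        else:
            problems.append(f"{name}: unknown feature name")
            continue
        if not ok:
            problems.append(f"{name}: value {value!r} outside expected range")
    return problems
```

An empty list means every supplied feature falls inside its documented range.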
```bash
git clone https://github.com/Rishu-raj-02/AMR-Multi-Antibiotic-Resistance-Predictor.git
cd AMR-Multi-Antibiotic-Resistance-Predictor
pip install -r requirements.txt
```

```bash
python train_model.py
```

Reads data/amr_dataset.csv, trains the model, saves artifacts to models/.

```bash
python app.py
```

Open http://localhost:5000 in your browser.
```
AMR-Multi-Antibiotic-Resistance-Predictor/
│
├── data/
│   └── amr_dataset.csv              # Cleaned E. coli dataset (3,000 genomes × 44 cols)
│
├── models/                          # Auto-generated by train_model.py
│   ├── amr_model.pkl                # Trained MultiOutputClassifier
│   ├── encoders.pkl                 # LabelEncoders for S/I/R
│   ├── metrics.json                 # Per-antibiotic performance stats
│   ├── feature_importance.json      # Feature ranking per antibiotic
│   └── feature_cols.json            # Ordered feature column list
│
├── templates/
│   └── index.html                   # Dark biotech dashboard UI
│
├── app.py                           # Flask web server + REST API
├── train_model.py                   # Model training & evaluation script
├── requirements.txt
├── render.yaml                      # One-click Render.com deployment
├── Procfile                         # Gunicorn process config
└── README.md
```
| Method | Endpoint | Description |
|---|---|---|
| GET | `/` | Interactive web dashboard |
| POST | `/predict` | Run resistance prediction |
| GET | `/metrics` | Model performance stats |
| GET | `/random_sample` | Load a sample genome for demo |
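
The `/predict` endpoint could be called from Python along these lines. This is a sketch using only the standard library; it assumes the server is running at the default `http://localhost:5000` and that request and response bodies follow the JSON shapes documented in this README. The helper names `build_payload`, `predict`, and `summarise` are hypothetical.

```python
import json
from urllib import request

BASE_URL = "http://localhost:5000"  # default address from the run instructions


def build_payload(features):
    """Encode a feature mapping as the JSON body expected by POST /predict."""
    return json.dumps({"features": features}).encode("utf-8")


def predict(features, base_url=BASE_URL, timeout=10):
    """POST features to /predict and return the decoded JSON response."""
    req = request.Request(
        f"{base_url}/predict",
        data=build_payload(features),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def summarise(response):
    """Map each antibiotic key in the response to its predicted label."""
    return {drug: pred["label"] for drug, pred in response["predictions"].items()}
```

With the Flask app running locally, `summarise(predict({"F1": 1, "F3": 1}))` would return a dict of S/I/R labels keyed by antibiotic.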
`POST /predict` request:

```json
{
  "features": {
    "F1": 1, "F2": 0, "F3": 1,
    "F4": 1, "F11": 1,
    "F16": 7, "F17": 3,
    "F31": 0.85, "F32": 0.20, "F33": 0.70
  }
}
```

Response:

```json
{
  "predictions": {
    "CIPRO": {
      "label": "R",
      "full": "Resistant",
      "confidence": 87.3,
      "probabilities": { "S": 0.07, "I": 0.06, "R": 0.87 }
    },
    "CEFTRIAXONE": { "label": "S", "confidence": 72.1 },
    "AMOXICILLIN": { "label": "R", "confidence": 81.5 }
  }
}
```

Gene presence / mutation markers extracted via AMRFinderPlus:
| Feature | Biological Meaning | Linked Antibiotic |
|---|---|---|
| F1 | gyrA S83L mutation | Ciprofloxacin → R |
| F2 | gyrA D87G mutation | Ciprofloxacin → R |
| F3 | bla_TEM gene presence | Amoxicillin → R |
| F4 | Multi-drug resistance plasmid marker | All three → R |
| F7 | bla_SHV gene | Ceftriaxone → R |
| F11 | bla_CTX-M-15 (ESBL) gene | Ceftriaxone → R |
Count features (F16–F30): number of resistance mutations, mobile elements, and efflux pump genes detected per genome assembly.
Proportion scores (F31–F40): fraction of the genome containing resistance markers, proportion of intact drug-binding sites, etc.
| Rank | Feature | Importance | Biological Role |
|---|---|---|---|
| 1 | F3 | 0.1018 | bla_TEM → Amoxicillin resistance |
| 2 | F2 | 0.0719 | gyrA mutation → Cipro resistance |
| 3 | F4 | 0.0570 | Multi-drug plasmid marker |
| 4 | F7 | 0.0553 | bla_SHV → Ceftriaxone resistance |
| 5 | F1 | 0.0536 | gyrA S83L → Cipro resistance |
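
Rankings like the one above can be read from a fitted `MultiOutputClassifier` via each forest's impurity-based `feature_importances_`. The sketch below uses synthetic data in which each label depends on a single feature; the helper name `top_features` is hypothetical, and how the README's single table was aggregated across antibiotics is not stated, so this shows per-antibiotic rankings only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier


def top_features(model, feature_names, target_names, k=5):
    """Rank features per target by each forest's impurity-based importance."""
    ranking = {}
    for target, est in zip(target_names, model.estimators_):
        imp = est.feature_importances_
        order = np.argsort(imp)[::-1][:k]  # indices of the k largest values
        ranking[target] = [(feature_names[i], float(imp[i])) for i in order]
    return ranking


# Synthetic demo: label j is fully determined by feature column j.
rng = np.random.default_rng(0)
X = rng.random((200, 40))
y = np.column_stack([(X[:, j] > 0.5).astype(int) for j in (0, 1, 2)])
names = [f"F{i}" for i in range(1, 41)]

forest = RandomForestClassifier(n_estimators=50, max_depth=12, random_state=0)
model = MultiOutputClassifier(forest).fit(X, y)
ranking = top_features(model, names, ["CIPRO", "CEFTRIAXONE", "AMOXICILLIN"])
```

Impurity-based importances sum to 1 within each forest, so values are comparable across features for one antibiotic but not directly across antibiotics.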
The web app generates real-time clinical guidance:
- Flags Resistant predictions with evidence-based alternative drug suggestions
- Recommends confirmatory MIC lab testing for borderline (I) results
- Displays probability distributions across all three resistance classes
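
The guidance rules listed above might reduce to something like the following sketch. It is an assumption about the logic, not the app's actual code: the function name `guidance` is hypothetical, the input mirrors the `predictions` object of the `/predict` response, and specific alternative drugs are deliberately left out because this README does not list them.

```python
def guidance(predictions):
    """Turn per-antibiotic predictions into short guidance notes."""
    notes = []
    for drug, pred in predictions.items():
        label = pred["label"]
        if label == "R":
            # The real app suggests alternatives; none are named here.
            notes.append(f"{drug}: predicted Resistant; consider an alternative agent.")
        elif label == "I":
            notes.append(f"{drug}: Intermediate; confirm with MIC lab testing.")
        else:
            notes.append(f"{drug}: predicted Susceptible.")
    return notes
```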
⚠️ Disclaimer: This tool is intended for research and educational purposes. Clinical treatment decisions must always be confirmed with certified laboratory antimicrobial susceptibility testing.
- Data Collection: 3,000 E. coli WGS records with AST results from PATRIC/BV-BRC
- Cleaning: duplicate removal, missing label exclusion, EUCAST breakpoint standardisation
- Feature Extraction: AMRFinderPlus (resistance genes) + bcftools (point mutations)
- Feature Engineering: 40 columns (15 binary + 15 integer + 10 real-valued proportions)
- Modelling: MultiOutputClassifier (RandomForest, 200 estimators, depth 12)
- Evaluation: 80/20 stratified split; Accuracy, MCC, F1-weighted, AUC-ROC (OvR)
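
The four evaluation metrics named in the last step can be computed per antibiotic with scikit-learn, as in this sketch. The helper name `evaluate_target` and the toy arrays are illustrative assumptions; the probability columns must follow the sorted class order for the one-vs-rest AUC to be valid.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    matthews_corrcoef,
    roc_auc_score,
)


def evaluate_target(y_true, y_pred, y_proba):
    """Compute the README's four metrics for a single antibiotic."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "mcc": matthews_corrcoef(y_true, y_pred),
        "f1_weighted": f1_score(y_true, y_pred, average="weighted"),
        # One-vs-rest AUC needs class probabilities that sum to 1 per row.
        "auc_ovr": roc_auc_score(y_true, y_proba, multi_class="ovr"),
    }


# Toy example with three encoded classes (0, 1, 2) and six genomes.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_proba = np.array([
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6],
])
y_pred = y_proba.argmax(axis=1)  # predicted class = highest probability
metrics = evaluate_target(y_true, y_pred, y_proba)
```

MCC and AUC are reported alongside accuracy because, with roughly 65% of labels being S, accuracy alone can look high for a model that mostly predicts the majority class.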
Pull requests welcome. For major changes, please open an issue first.
MIT (see LICENSE)
Built for IIT Mandi Hackathon: E. coli · Ciprofloxacin · Ceftriaxone · Amoxicillin