Turning messy biological data into clean, ML-ready datasets for biomedical research and personalized medicine.
Biological data is inherently messy, high-dimensional, and noisy. It is often generated for narrow experimental purposes, which makes it difficult to reuse directly for predictive modeling. This project addresses that problem by repurposing raw biological datasets into clean, ML-ready formats for tasks such as drug response prediction, protein classification, and biomarker discovery.
Standardize raw biological data (RNA-seq, protein sequences, SMILES) into structured features.
Apply preprocessing, normalization, and splitting into train/valid/test sets.
Train baseline and advanced ML models with reproducible pipelines.
Improve interpretability with feature importance and explainability tools.
bio-ml-repurposing
├── data/ # raw and processed biological datasets
├── notebooks/ # exploration, preprocessing, training, explainability
├── src/ # preprocessing and model scripts
├── models/ # saved models and metrics
└── README.md
Python (pandas, numpy, scikit-learn)
Biopython / RDKit for biological feature extraction
XGBoost / PyTorch for modeling
SHAP, matplotlib for interpretability and visualization
Biopython / RDKit for biological feature extraction
XGBoost / PyTorch for modeling
SHAP, matplotlib for interpretability and visualization# bio-ml-repurposing