GitHub - davidhoushangi/bio-ml-repurposing: Pipelines to repurpose messy biological datasets into ML-ready formats for predictive modeling in genomics, proteomics, and drug discovery

Repurposing Biological Data for Machine Learning

Turning messy biological data into clean, ML-ready datasets for biomedical research and personalized medicine.

Biological data is inherently messy, high-dimensional, and noisy. It is often generated for narrow experimental purposes, which makes it difficult to reuse directly for predictive modeling. This project addresses that problem by repurposing raw biological datasets into clean, ML-ready formats for tasks such as drug response prediction, protein classification, and biomarker discovery.

Objectives

Standardize raw biological data (RNA-seq, protein sequences, SMILES) into structured features.

Preprocess and normalize the data, then split it into train/validation/test sets.

Train baseline and advanced ML models with reproducible pipelines.

Improve interpretability with feature importance and explainability tools.
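The first two objectives can be illustrated with a minimal sketch. It is not the code in src/; the function names (`composition_features`, `log_normalize`, `split_indices`) and the default split fractions are assumptions made for illustration, and it uses only the Python standard library so it runs without the project's dependencies.

```python
import math
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition_features(sequence):
    """Return the fraction of each standard amino acid in a protein sequence."""
    seq = sequence.strip().upper()
    # Non-standard characters (gaps, X, whitespace) are ignored.
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def log_normalize(counts):
    """Scale raw read counts to counts per million, then apply log1p."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [math.log1p(c / total * 1e6) for c in counts]


def split_indices(n, valid_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle indices 0..n-1 reproducibly and split into train/valid/test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # fixed seed keeps the split reproducible
    n_test = int(n * test_frac)
    n_valid = int(n * valid_frac)
    test = idx[:n_test]
    valid = idx[n_test:n_test + n_valid]
    train = idx[n_test + n_valid:]
    return train, valid, test


# Toy inputs, not taken from the repository's data/ folder.
sequences = ["MKTAYIAK", "GGGAAC", "mktw-x"]
features = [composition_features(s) for s in sequences]
train, valid, test = split_indices(10, valid_frac=0.2, test_frac=0.2, seed=1)
```

Each sequence becomes a fixed-length vector of 20 fractions regardless of its length, which is what lets sequences of different sizes share one feature table. In practice the same steps would more likely be written with pandas and scikit-learn's `train_test_split`.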

Project Structure

bio-ml-repurposing
├── data/         # raw and processed biological datasets
├── notebooks/    # exploration, preprocessing, training, explainability
├── src/          # preprocessing and model scripts
├── models/       # saved models and metrics
└── README.md

Tech Stack

Python (pandas, numpy, scikit-learn)

Biopython / RDKit for biological feature extraction

XGBoost / PyTorch for modeling

SHAP, matplotlib for interpretability and visualization
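To make the idea of feature importance concrete, here is a small dependency-free sketch of permutation importance. This is a simpler technique than SHAP and is swapped in only for illustration; the helper names, the toy data, and the threshold "model" are all hypothetical and do not come from the repository.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows where model(row) equals the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle only column j, leaving every other feature in place.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def model(row):
    """Hypothetical stand-in for a trained classifier: thresholds feature 0."""
    return int(row[0] > 0.5)


# Toy data: the label depends only on feature 0; feature 1 is noise.
X = [[0.1, 5.0], [0.9, 1.0], [0.2, 3.0], [0.8, 4.0], [0.3, 2.0], [0.7, 6.0]]
y = [0, 1, 0, 1, 0, 1]

scores = permutation_importance(model, X, y)
```

Shuffling the noise feature leaves accuracy unchanged, so its score is zero, while shuffling feature 0 breaks the predictions and produces a positive score. With a fitted scikit-learn or XGBoost model, `sklearn.inspection.permutation_importance` or the SHAP library would normally replace this hand-written loop.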


