Skip to content

davidhoushangi/bio-ml-repurposing

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Repurposing Biological Data for Machine Learning

Turning messy biological data into clean, ML-ready datasets for biomedical research and personalized medicine.

Biological data is inherently messy, high-dimensional, and noisy. It is often generated for narrow experimental purposes, which makes it difficult to reuse directly for predictive modeling. This project addresses that problem by repurposing raw biological datasets into clean, ML-ready formats for tasks such as drug response prediction, protein classification, and biomarker discovery.

Obsectives

Standardize raw biological data (RNA-seq, protein sequences, SMILES) into structured features.

Apply preprocessing, normalization, and splitting into train/valid/test sets.

Train baseline and advanced ML models with reproducible pipelines.

Improve interpretability with feature importance and explainability tools.

Project Structure

bio-ml-repurposing

├── data/ # raw and processed biological datasets

├── notebooks/ # exploration, preprocessing, training, explainability

├── src/ # preprocessing and model scripts

├── models/ # saved models and metrics

└── README.md

Tech Stack

Python (pandas, numpy, scikit-learn)

Biopython / RDKit for biological feature extraction

XGBoost / PyTorch for modeling

SHAP, matplotlib for interpretability and visualization

Biopython / RDKit for biological feature extraction

XGBoost / PyTorch for modeling

SHAP, matplotlib for interpretability and visualization# bio-ml-repurposing

About

Pipelines to repurpose messy biological datasets into ML-ready formats for predictive modeling in genomics, proteomics, and drug discovery

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors