Heart Disease Prediction Project

This project was built to demonstrate core machine learning workflows and model evaluation strategies on a healthcare-related classification problem.

Dataset Overview

The dataset contains information such as age, cholesterol, and max heart rate. The target variable indicates heart disease presence (1) or absence (0).

Models Used

Decision Tree: High interpretability, prone to overfitting
Random Forest: Best performance due to ensemble learning
Logistic Regression: Linear baseline model, lowest accuracy
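
As a rough illustration of how the three model families can be compared side by side, here is a minimal scikit-learn sketch. It uses synthetic data from make_classification as a stand-in for the heart disease table, so the sample count, feature count, and resulting scores are assumptions and will not match the numbers reported in this README.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real dataset (assumption: the actual
# features and row count differ).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Default, untuned versions of the three models.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Fit each model and record accuracy on the held-out split.
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)
    print(f"{name}: {test_accuracy[name]:.3f}")
```

Scoring on a held-out split rather than on the training rows is what makes the comparison meaningful.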

Key Results

Model	Accuracy
Decision Tree	91.7%
Random Forest	98.5%
Logistic Regression	79.5%

The table above reflects performance before model tuning, when overfitting and underfitting were not yet addressed.

After correcting for these issues through parameter tuning (e.g., adjusting max_depth and min_samples_leaf), the models produced more realistic and generalizable results:

Model	Accuracy
Decision Tree	80.0%
Random Forest	85.4%
Logistic Regression	79.5%

Why Model Tuning Matters

Initial models can achieve deceptively high accuracy by overfitting the training data, especially with flexible algorithms like decision trees and random forests. However, overfit models perform poorly on real-world data because they memorize patterns instead of learning general rules.
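
The gap between training and held-out accuracy is one simple way to spot this. The sketch below fits an unconstrained decision tree on synthetic, deliberately noisy data (an assumption standing in for the real dataset) and compares the two scores.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% label noise (flip_y), so a model that fits the
# training set perfectly must be memorizing some noise.
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=4, flip_y=0.1, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# No max_depth or min_samples_leaf: the tree grows until its leaves are pure.
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
print(f"gap:            {train_acc - test_acc:.3f}")
```

A large gap suggests the model has fit the training set too closely and that its training accuracy overstates how it will perform on new data.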

To ensure generalization and robustness, I tuned key hyperparameters (e.g., max_depth, min_samples_leaf) to balance bias and variance. This produced more realistic accuracy scores and reduced the risk of incorrect predictions, which is critical in medical applications such as disease detection. The original models appeared to achieve near-perfect accuracy, but those results reflected fit to the training data rather than true generalization. The tuned scores better represent how the models would perform on new, unseen data.
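
One common way to tune these two hyperparameters is a cross-validated grid search. The README does not say which search procedure was used, so the GridSearchCV call, the grid values, and the synthetic data below are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real dataset (assumption).
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=4, flip_y=0.1, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Hypothetical grid: the values actually tried in the notebook may differ.
param_grid = {
    "max_depth": [4, 6, 8, 10],
    "min_samples_leaf": [1, 5, 10],
}

# 5-fold cross-validation on the training split only, so the test split
# stays untouched until the final evaluation.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_params = search.best_params_
tuned_test_acc = search.score(X_test, y_test)
print("best parameters:", best_params)
print(f"cross-validated accuracy: {search.best_score_:.3f}")
print(f"held-out test accuracy:   {tuned_test_acc:.3f}")
```

Limiting depth and requiring more samples per leaf both constrain the tree, trading a little training accuracy for lower variance.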

The random_state value is arbitrary, and any integer can be used. I used 42 so that results are reproducible; it is a common seed value in programming culture.
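
To show what the seed actually controls, this small sketch splits the same synthetic data three times. The data and the second seed (7) are placeholders chosen only for the demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Small synthetic dataset (assumption: stands in for the real data).
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Same seed twice, then a different seed.
split_a = train_test_split(X, y, test_size=0.2, random_state=42)
split_b = train_test_split(X, y, test_size=0.2, random_state=42)
split_c = train_test_split(X, y, test_size=0.2, random_state=7)

# Index 1 of each result is the test-feature array (X_test).
same_seed_matches = np.array_equal(split_a[1], split_b[1])
different_seed_matches = np.array_equal(split_a[1], split_c[1])

print("same seed gives the same test rows:", same_seed_matches)
print("different seed gives the same test rows:", different_seed_matches)
```

The specific number does not matter; fixing it does, because it makes the split and any seeded model repeatable from run to run.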

Insights

Random Forest outperformed other models due to its ability to reduce variance and generalize well.
Tuning max_depth (values from 4 to 15 were tested) had a significant impact on performance.
Recall was emphasized because, in a medical setting, false negatives are missed diagnoses.
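
The depth sweep and the focus on recall can be combined in one loop. This is a sketch under assumptions: it uses a decision tree, 5-fold cross-validation, and synthetic data, none of which are confirmed as the setup used in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real dataset (assumption).
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=42
)

# Mean cross-validated recall for each max_depth from 4 through 15.
recall_by_depth = {}
for depth in range(4, 16):
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    scores = cross_val_score(model, X, y, cv=5, scoring="recall")
    recall_by_depth[depth] = scores.mean()
    print(f"max_depth={depth:2d}  mean recall={recall_by_depth[depth]:.3f}")

# Depth with the highest mean recall (ties resolve to the shallowest).
best_depth = max(recall_by_depth, key=recall_by_depth.get)
print("best max_depth by recall:", best_depth)
```

Recall is the share of true positive cases the model catches, so scoring on it directly penalizes missed diagnoses rather than overall error.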

Next Steps

Incorporate ROC/AUC analysis to better assess how well each model separates the two classes
Deploy the model using Streamlit to enable interactive use
Explore feature selection or engineering to improve performance
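
For the ROC/AUC step, a starting point might look like the sketch below. It is not part of the current project: the logistic regression model and the synthetic data are assumptions used only to show the scikit-learn calls involved.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (assumption).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# ROC analysis needs scores, not hard labels: use the predicted
# probability of the positive class (column 1).
positive_scores = model.predict_proba(X_test)[:, 1]

auc = roc_auc_score(y_test, positive_scores)
fpr, tpr, thresholds = roc_curve(y_test, positive_scores)

print(f"ROC AUC: {auc:.3f}")
print(f"points on the ROC curve: {len(fpr)}")
```

The fpr and tpr arrays can be passed straight to matplotlib to draw the curve, and the thresholds array shows which cutoff produces each point.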

Technologies Used

Python (pandas, numpy, scikit-learn, matplotlib)
Jupyter Notebook
Git/GitHub
