Heart Disease Prediction: Binary Classification (Multi-Dataset Study)

Project Overview

This project compares the performance of several traditional ML models on binary classification of cardiovascular disease (CVD) using four distinct public datasets. The primary goal is to explore multiple heart disease datasets and to assess how these models perform on each of them.

Classification Goal

Binary Target: 0 (No CVD / Healthy) or 1 (At Risk / Presence of CVD).
Original Data Adjustment: Except for the Hungarian dataset, the datasets (Cleveland, Switzerland, and Long Beach VA) originally encoded disease severity as multi-class risk levels (1-4). These levels were merged into a single positive class (1) to focus solely on the presence or absence of the disease.
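
As a minimal sketch of the merge described above, the snippet below collapses the severity levels into one positive class. It assumes the target column is named `num` (as in the attribute table of this README); the `target` column name and the toy rows are illustrative only.

```python
import pandas as pd

# Toy frame: `num` is the original target (0 = no disease, 1-4 = severity levels).
df = pd.DataFrame({"age": [63, 67, 41, 56, 57], "num": [0, 2, 1, 0, 4]})

# Collapse every non-zero severity level into the single positive class.
df["target"] = (df["num"] > 0).astype(int)

print(df["target"].tolist())
```

Rows that are already binary, as in the Hungarian data, pass through this step unchanged.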

Experiment Tracking (MLflow)

This project uses MLflow Tracking to log:

- parameters (dataset, model type, hyperparameters)
- metrics (accuracy, f1, precision, recall, AUROC, specificity)
- artifacts (predictions, confusion matrix, explainability plots)
- models (sklearn + XGBoost)

How to Run

Prerequisites

Python 3.8+
Jupyter Notebook or JupyterLab
The datasets are expected in a data/downloaded/ folder next to this repository (the data folder and the repository folder sit at the same directory level).

Clone the repository:

git clone https://github.com/darshanz/ML-for-Cardiovascular-Disease-Diagnosis.git
cd ML-for-Cardiovascular-Disease-Diagnosis

Install dependencies:
```
pip install -r requirements.txt
```
Run Experiments:
1. Data Exploration and Missing Value Imputation (Notebook)
2. Data Preparation for Training
3. Experiment scripts
To run all the experiments:
```
    cd src
    python main.py
```
Or run using MLflow:
```
  mlflow run .
```

MLflow UI

mlflow server --backend-store-uri sqlite:///cardiovascular.db --port 5000

Datasets Used

This project utilizes four well-known heart disease datasets, which were combined for comprehensive evaluation:

| Dataset Name | Source Location | Total Rows | Target Class Balance (Approx.) | Notes |
|---|---|---|---|---|
| Cleveland | Cleveland Clinic Foundation | 303 | ~54% Healthy / 46% Risk | Commonly used for benchmarking. |
| Hungarian | Hungarian Institute of Cardiology | 294 | ~62% Healthy / 38% Risk | Contains a high number of missing values. |
| Switzerland | University Hospital, Zurich | 123 | ~7% Healthy / 93% Risk | Smaller, unique patient group; heavily imbalanced towards the disease class. |
| Long Beach | V.A. Medical Center, Long Beach | 200 | ~26% Healthy / 74% Risk | Used to evaluate external validity. |

Datasets downloaded from: https://archive.ics.uci.edu/dataset/45/heart%2Bdisease?

Data Source Information:

(a) Creators:
1. Hungarian Institute of Cardiology, Budapest: Andras Janosi, M.D.
2. University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
3. University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.
(b) Donor: David W. Aha (aha@ics.uci.edu) (714) 856-8779
(c) Date: July, 1988

Methodology and Key Steps

1. Data Preprocessing & Cleaning

Data Cleaning

The data is available in both raw and processed form. The processed version contains the 13 features (plus the target) that are common to all four datasets, and this 13-feature processed version is the one used here.

Attribute Information

| Index | Feature Name | Description | Value Interpretation |
|---|---|---|---|
| 1 | `age` | Age in years | - |
| 2 | `sex` | Sex | `1` = Male; `0` = Female |
| 3 | `cp` | Chest pain type | `1` = Typical angina, `2` = Atypical angina, `3` = Non-anginal pain, `4` = Asymptomatic |
| 4 | `trestbps` | Resting blood pressure (mm Hg on admission) | - |
| 5 | `chol` | Serum cholesterol (mg/dl) | - |
| 6 | `fbs` | Fasting blood sugar > 120 mg/dl | `1` = True; `0` = False |
| 7 | `restecg` | Resting electrocardiographic results | `0` = Normal, `1` = ST-T wave abnormality, `2` = Probable or definite left ventricular hypertrophy |
| 8 | `thalach` | Maximum heart rate achieved | - |
| 9 | `exang` | Exercise induced angina | `1` = Yes; `0` = No |
| 10 | `oldpeak` | ST depression induced by exercise relative to rest | - |
| 11 | `slope` | The slope of the peak exercise ST segment | `1` = Upsloping, `2` = Flat, `3` = Downsloping |
| 12 | `ca` | Number of major vessels (0-3) colored by fluoroscopy | - |
| 13 | `thal` | Thalassemia | `3` = Normal; `6` = Fixed defect; `7` = Reversible defect |
| 14 | `num` (Target) | Diagnosis of heart disease (Angiographic status) | `0` = < 50% diameter narrowing (No disease); `1` = > 50% diameter narrowing (Disease) |

The raw data files, which lacked headers, were loaded and assigned clear, descriptive column names based on the UCI dataset dictionary.
The non-standard missing value placeholder (?) was converted to the standard NaN to ensure correct handling by pandas.
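
The loading step described above can be sketched roughly as follows. The column names come from the attribute table; the comma-separated layout and the two sample rows are assumptions made for illustration, and an in-memory buffer stands in for the real data file.

```python
import io
import pandas as pd

COLUMNS = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]

# Two made-up rows in a headerless, comma-separated layout; "?" marks a missing value.
raw = io.StringIO(
    "63,1,1,145,233,1,2,150,0,2.3,3,0,6,0\n"
    "48,0,2,120,?,0,0,?,0,0.0,?,?,?,1\n"
)

# header=None: the file has no header row; names=: assign descriptive column names;
# na_values="?": read the placeholder as NaN so pandas treats it as missing.
df = pd.read_csv(raw, header=None, names=COLUMNS, na_values="?")

print(df.isna().sum())
```

Reading `?` as NaN at load time also lets pandas infer numeric dtypes for the affected columns instead of falling back to strings.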

Label Distribution in Four Datasets

Handling Missing Values (Two-Pronged Approach)

Missing Values in four datasets

Missing values were categorized into two types: Random Missing Values (RMV) and Systematic Missing Values (SMV).

Systematic Missing Values (SMV):
- For the Hungarian, Switzerland, and Long Beach VA datasets (which showed high systematic missingness), the Attribute Deletion for Missing Value Handling (ADMVH) technique was applied.
- Any column missing more than 50% of its data was removed from the analysis to preserve data quality.
Random Missing Values (RMV):
- For the Cleveland dataset and the remaining attributes of the other three datasets, the Most Common Missing Value Imputation (MCMVI) method was used.
- Remaining NaN values were imputed using the mode (most frequent value) of their respective column.
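
A minimal sketch of the two steps above, assuming the data is held in a pandas DataFrame. The function name `drop_and_impute` and the toy columns are hypothetical; only the 50% threshold and the mode imputation come from the description.

```python
import numpy as np
import pandas as pd

def drop_and_impute(df, threshold=0.5):
    """Drop columns whose missing fraction exceeds `threshold`, then mode-impute the rest."""
    missing_fraction = df.isna().mean()
    kept = df.loc[:, missing_fraction <= threshold].copy()
    for col in kept.columns:
        if kept[col].isna().any():
            # mode() ignores NaN; if several values tie, the first one is used.
            kept[col] = kept[col].fillna(kept[col].mode().iloc[0])
    return kept

toy = pd.DataFrame({
    "chol": [233.0, np.nan, 250.0, 233.0],  # 25% missing: imputed with the mode (233)
    "ca":   [np.nan, np.nan, np.nan, 0.0],  # 75% missing: dropped
    "sex":  [1, 0, 1, 1],                   # complete: untouched
})
clean = drop_and_impute(toy)
print(clean)
```

In a train/test setting, the mode would normally be computed on the training split only and then reused for the test split, to avoid leaking test information into the imputation.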

2. Exploratory Data Analysis (EDA)

Missingness Visualization: Used the missingno library (matrix, bar, and heatmaps) to visualize and compare missing data patterns across the four datasets.
Interactive Controls: Implemented ipywidgets to dynamically switch between dataset views for comparison.

3. Model Training and Evaluation

Several established machine learning algorithms were used for classification: ensemble methods (XGBoost, Random Forest, AdaBoost), a neural network (MLP), and classical models such as Logistic Regression, Decision Tree, K-Nearest Neighbors, Gaussian Naive Bayes, and Support Vector Classifier (SVC).
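
The comparison loop can be sketched as below. This is not the project's `main.py`: the synthetic data, the 80/20 stratified split, the feature scaling, and the hyperparameters are all assumptions chosen for illustration, and only three of the listed models are shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one dataset: 13 features and a binary target.
X, y = make_classification(n_samples=300, n_features=13, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "LogisticRegression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "RandomForest": RandomForestClassifier(n_estimators=200, random_state=0),
    "GaussianNB": GaussianNB(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # AUROC is computed from the predicted probability of the positive class.
    proba = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "accuracy": accuracy_score(y_test, model.predict(X_test)),
        "auroc": roc_auc_score(y_test, proba),
    }
    print(name, results[name])
```

Keeping the models in a dictionary makes it straightforward to log one tracking run per model and dataset pair.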

4. Results

Cleveland

| model | accuracy | auroc | f1 | precision | recall | specificity |
|---|---|---|---|---|---|---|
| XGBoost | 0.87 | 0.96 | 0.87 | 0.88 | 0.87 | 0.93 |
| MLP | 0.90 | 0.94 | 0.90 | 0.90 | 0.90 | 0.93 |
| LogisticRegression | 0.90 | 0.94 | 0.90 | 0.90 | 0.90 | 0.90 |
| DecisionTree | 0.72 | 0.72 | 0.72 | 0.72 | 0.72 | 0.76 |
| KNeighbors | 0.69 | 0.74 | 0.69 | 0.70 | 0.69 | 0.76 |
| GaussianNB | 0.84 | 0.92 | 0.84 | 0.84 | 0.84 | 0.90 |
| RandomForest | 0.90 | 0.94 | 0.90 | 0.91 | 0.90 | 0.97 |
| AdaBoost | 0.90 | 0.94 | 0.90 | 0.90 | 0.90 | 0.93 |
| SVC | 0.48 | 0.50 | 0.31 | 0.23 | 0.48 | 1.00 |

Hungarian

| model | accuracy | auroc | f1 | precision | recall | specificity |
|---|---|---|---|---|---|---|
| XGBoost | 0.83 | 0.92 | 0.83 | 0.83 | 0.83 | 0.92 |
| MLP | 0.83 | 0.90 | 0.83 | 0.84 | 0.83 | 0.84 |
| LogisticRegression | 0.83 | 0.92 | 0.83 | 0.83 | 0.83 | 0.87 |
| DecisionTree | 0.78 | 0.81 | 0.78 | 0.79 | 0.78 | 0.79 |
| KNeighbors | 0.75 | 0.75 | 0.74 | 0.74 | 0.75 | 0.84 |
| GaussianNB | 0.83 | 0.92 | 0.83 | 0.84 | 0.83 | 0.82 |
| RandomForest | 0.83 | 0.91 | 0.83 | 0.83 | 0.83 | 0.87 |
| AdaBoost | 0.81 | 0.90 | 0.81 | 0.82 | 0.81 | 0.84 |
| SVC | 0.64 | 0.53 | 0.50 | 0.41 | 0.64 | 1.00 |

Switzerland

| model | accuracy | auroc | f1 | precision | recall | specificity |
|---|---|---|---|---|---|---|
| XGBoost | 0.92 | 0.50 | 0.88 | 0.85 | 0.92 | 0.00 |
| MLP | 0.92 | 0.43 | 0.88 | 0.85 | 0.92 | 0.00 |
| LogisticRegression | 0.88 | 0.39 | 0.86 | 0.84 | 0.88 | 0.00 |
| DecisionTree | 0.84 | 0.46 | 0.84 | 0.84 | 0.84 | 0.00 |
| KNeighbors | 0.92 | 0.28 | 0.88 | 0.85 | 0.92 | 0.00 |
| GaussianNB | 0.52 | 0.48 | 0.62 | 0.86 | 0.52 | 0.50 |
| RandomForest | 0.92 | 0.57 | 0.88 | 0.85 | 0.92 | 0.00 |
| AdaBoost | 0.88 | 0.52 | 0.86 | 0.84 | 0.88 | 0.00 |
| SVC | 0.92 | 0.72 | 0.88 | 0.85 | 0.92 | 0.00 |

Long Beach VA

| model | accuracy | auroc | f1 | precision | recall | specificity |
|---|---|---|---|---|---|---|
| XGBoost | 0.80 | 0.55 | 0.71 | 0.64 | 0.80 | 0.00 |
| MLP | 0.72 | 0.65 | 0.67 | 0.63 | 0.72 | 0.00 |
| LogisticRegression | 0.78 | 0.72 | 0.73 | 0.72 | 0.78 | 0.12 |
| DecisionTree | 0.68 | 0.58 | 0.69 | 0.72 | 0.68 | 0.38 |
| KNeighbors | 0.75 | 0.61 | 0.72 | 0.69 | 0.75 | 0.12 |
| GaussianNB | 0.75 | 0.75 | 0.76 | 0.77 | 0.75 | 0.50 |
| RandomForest | 0.72 | 0.61 | 0.70 | 0.68 | 0.72 | 0.12 |
| AdaBoost | 0.70 | 0.47 | 0.68 | 0.67 | 0.70 | 0.12 |
| SVC | 0.80 | 0.44 | 0.71 | 0.64 | 0.80 | 0.00 |

Overall, the strongest results were obtained on the Cleveland and Hungarian datasets. On Cleveland, RandomForest, MLP, LogisticRegression, and AdaBoost each reached an accuracy of 0.90 with an AUROC of 0.94, while XGBoost had the highest AUROC (0.96) at an accuracy of 0.87. On Hungarian, XGBoost, MLP, LogisticRegression, GaussianNB, and RandomForest all reached an accuracy of 0.83 with AUROC values between 0.90 and 0.92.

Performance varied significantly across datasets. On the Switzerland dataset, most models, including XGBoost and MLP, reported high accuracy (0.84 to 0.92) but a specificity of 0.00, meaning that none of the healthy (minority-class) test cases were identified correctly; AUROC values were mostly at or below 0.5. GaussianNB was the only model with non-zero specificity (0.50), at the cost of an accuracy of 0.52. The Long Beach VA dataset saw lower results overall, with GaussianNB and LogisticRegression offering the most balanced performance (AUROC of 0.75 and 0.72, respectively).

SVC was the least reliable model. It had the lowest accuracy and F1 score on Cleveland (0.48 and 0.31) and on Hungarian (0.64 and 0.50), and its specificity of 1.00 with an AUROC near 0.50 on both datasets is consistent with predicting a single class. On Switzerland and Long Beach VA it matched the top accuracy (0.92 and 0.80) but with a specificity of 0.00.
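
The gap between accuracy and specificity can be reproduced with a few lines of arithmetic. The confusion-matrix counts below are hypothetical (the actual test split is not stated in this README); they were chosen so that a model predicting "disease" for every patient lands near the Switzerland numbers. Note that `recall` here is the positive-class recall, which may differ from the averaged recall reported in the tables.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, recall, specificity and balanced accuracy from confusion-matrix counts."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "recall": recall,
        "specificity": specificity,
        "balanced_accuracy": (recall + specificity) / 2,
    }

# Hypothetical test split: 23 diseased and 2 healthy patients, all predicted as diseased.
m = binary_metrics(tp=23, fp=2, tn=0, fn=0)
print({k: round(v, 2) for k, v in m.items()})
# {'accuracy': 0.92, 'recall': 1.0, 'specificity': 0.0, 'balanced_accuracy': 0.5}
```

A balanced accuracy of 0.5 shows that such a model is no better than chance at separating the two classes, even though its plain accuracy is 0.92.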

Limitations:
- The amount of missing data may explain the low performance on some datasets; simple imputation methods were not sufficient.
- The hyperparameters could be better optimized; the selected values were not backed by sufficient empirical study.
- Next step: use more sophisticated missing-data imputation techniques and apply hyperparameter optimization.

F1 Score

Accuracy

SHAP Summary Plot (Cleveland - MLP model)

The SHAP summary plot illustrates feature importance and impact on the MLP model's output for the Cleveland dataset, indicating that `ca` and `thal` have the strongest influence on the prediction. For instance, high values of `ca` and certain values of `thal` (likely the 'fixed defect' or 'reversible defect' encodings) primarily drive the model towards a positive prediction (presence of heart disease).
