Heart Disease Prediction using Machine Learning & Deep Neural Network Models
- Introduction
Cardiac disorders, commonly known as “heart diseases”, are the leading cause of death worldwide. Heart disease has become a growing concern in this country, with patients suffering from several kinds of related illnesses. Some of these conditions can be fatal if they are diagnosed too late.
In this project, we build a predictive model of heart disease intended for early detection. Our focus is on finding the pre-processing techniques best suited to specific models, improving existing models, and creating combined predictions from two or more datasets. We also aim to implement models from previous work and improve on their accuracy.
- Methodology
Fig: Proposed Methodology for this work
- Model Selection
Based on our review of prior papers, we selected Artificial Neural Network (ANN), Convolutional Neural Network (CNN), and Long Short-Term Memory (LSTM) models for training on our data.
Model 4 (ANN)
Input Data Shape: 13. Input data is scaled. No. of hidden layers: 1. Neurons: 300
Activation:
Hidden Layer: ReLU. Output Layer: Sigmoid
Loss: Binary Cross Entropy with Logits Loss
Model 5 (ANN)
Input Data Shape: 13. Input data is scaled. No. of hidden layers: 1. Neurons: 300
Activation:
Hidden Layer: ReLU. Output Layer: Sigmoid
Loss: Binary Cross Entropy with Logits Loss
Output Shape: 1
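The Model 5 specification can be sketched as a forward pass in plain Python. This is only an illustrative sketch, not the training code used in this work: the weight initialisation, the random seed, and the helper names (`init_model`, `forward`, `bce_with_logits`) are assumptions. It also illustrates one detail of a "with logits" binary cross-entropy loss: the sigmoid is folded into the loss, so the loss is computed from the raw output of the last layer, and the sigmoid is applied separately only when a probability is needed.

```python
import math
import random

N_INPUTS = 13   # scaled input features
N_HIDDEN = 300  # neurons in the single hidden layer


def init_model(seed=0):
    """Randomly initialise weights (illustrative; not the trained weights)."""
    rng = random.Random(seed)
    s1 = 1.0 / math.sqrt(N_INPUTS)
    w1 = [[rng.uniform(-s1, s1) for _ in range(N_INPUTS)]
          for _ in range(N_HIDDEN)]
    b1 = [0.0] * N_HIDDEN
    s2 = 1.0 / math.sqrt(N_HIDDEN)
    w2 = [rng.uniform(-s2, s2) for _ in range(N_HIDDEN)]
    b2 = 0.0
    return w1, b1, w2, b2


def forward(x, model):
    """Return the raw output (logit) for one scaled input vector x."""
    w1, b1, w2, b2 = model
    # Hidden layer: weighted sum followed by ReLU.
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    # Output layer: a single unit, output shape 1.
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_with_logits(logit, target):
    """Numerically stable binary cross-entropy computed from a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


model = init_model()
x = [0.0] * N_INPUTS          # a dummy, already-scaled patient record
logit = forward(x, model)
prob = sigmoid(logit)         # predicted probability of target = 1
loss = bce_with_logits(logit, 1.0)
```

With the all-zero dummy input and zero biases, the logit is 0, so the predicted probability is 0.5 and the loss equals ln 2, which is a quick sanity check before training.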
- Experiments
This dataset was taken from the UCI Machine Learning Repository. The heart disease dataset is made up of 75 raw features, of which 13 were published. These features are important in the diagnosis of heart disease. The 13 features considered in this work are listed below:
- Dataset Collection
UCI Dataset:
| SI | Attributes | Description |
|---|---|---|
| 1. | Age | age in years |
| 2. | Sex | 1 = male; 0 = female |
| 3. | cp | chest pain type (4 values) |
| 4. | trestbps | resting blood pressure (in mm Hg on admission to the hospital) |
| 5. | chol | serum cholesterol in mg/dl |
| 6. | fbs | (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false) |
| 7. | restecg | resting electrocardiographic results |
| 8. | Thalach | maximum heart rate achieved |
| 9. | Exang | exercise induced angina (1 = yes; 0 = no) |
| 10. | Oldpeak | ST depression induced by exercise relative to rest |
| 11. | Slope | slope of the peak exercise ST segment |
| 12. | Ca | Count of major vessels (value 0-3) coloured by fluoroscopy. |
| 13. | Thal | 3 = normal; 6 = fixed defect; 7 = reversible defect |
The dataset has 13 features and 303 rows. No NULL or duplicate values were found. It contains 164 (54.3%) heart disease patients (target = 1) and 138 (45.7%) non heart disease patients (target = 0); Fig.1 shows that the two classes are roughly balanced. Among the patients, 31.79% are female and 68.21% are male. The average age is 53. Fig.2 shows the patients affected by cardiac disease across different age ranges. Fig.3 shows that the affected rate is higher among male patients than among female patients.
Fig.1: Heart disease(1) and Non heart disease(0)
Fig.2: Age vs Cardio Disease.
Fig.3: Affected patient based on sex.
Cardiovascular_Dataset:
| SI | Attributes | Description |
|---|---|---|
| 1. | age | Objective Feature. int (days) |
| 2. | height | Objective Feature. int (cm) |
| 3. | weight | Objective Feature. float (kg) |
| 4. | Gender | Objective Feature. Categorical code(1 - women, 2 - men) |
| 5. | ap_hi | Systolic blood pressure. Examination Feature. int |
| 6. | ap_lo | Diastolic blood pressure. Examination Feature. int |
| 7. | cholesterol | Cholesterol. Examination Feature(1: normal, 2: above normal, 3: well above normal) |
| 8. | gluc | Glucose. Examination Feature ( 1: normal, 2: above normal, 3: well above normal) |
| 9. | smoke | Smoking. Subjective Feature. binary |
| 10. | alco | Alcohol intake. Subjective Feature. binary |
| 11. | active | Physical activity. Subjective Feature. binary |
The data set had 11 features and 70000 rows. There are 3 types of input features:
Objective: factual information.
Examination: results of medical examination.
Subjective: information given by the patient.
No NULL or duplicate values were found in the dataset. It contains 34979 (49.97%) heart disease patients (cardio = 1) and 35021 (50.03%) non heart disease patients (cardio = 0); Fig.5 shows that the dataset is balanced. Among the patients, 34.96% are female and 65.04% are male. The average age is 55. Fig.6 shows the patients affected by cardiac disease across different age ranges. Fig.7 shows that the affected rate is higher among male patients than among female patients.
Fig.5: Heart disease(1) and Non heart disease(0)
Fig.6: Age vs Cardio Disease.
Fig.7: Affected patient based on sex.
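Because the `age` attribute of this dataset is stored in days, it has to be converted before it can be read as an age in years. The sketch below shows that conversion and the class-share arithmetic behind the percentages quoted above. The function names, the 365.25-day year, and the sample age of 20089 days are assumptions made for illustration.

```python
DAYS_PER_YEAR = 365.25  # assumption: average year length, leap years included


def age_in_years(age_days):
    """Convert the dataset's age column (days) to whole years."""
    return int(age_days // DAYS_PER_YEAR)


def class_shares(n_positive, n_negative):
    """Return the percentage of positive and negative records."""
    total = n_positive + n_negative
    return (round(100 * n_positive / total, 2),
            round(100 * n_negative / total, 2))


print(age_in_years(20089))          # a hypothetical patient record
print(class_shares(34979, 35021))   # class counts reported for this dataset
```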
- Data Pre-Processing
Pre-processing the data improves the performance and stability of the models. The SelectKBest method keeps the k features with the highest scores. Using the f_classif score function, chol (2.002%), fbs (0.2160%), and trestbps (6.55%) received the lowest scores, so these three features were dropped. The resulting dataset has 10 features and 210 rows.
Fig.4: Feature Score (Least Important Selected)
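The scoring behind this kind of selection can be sketched as follows. In scikit-learn, `f_classif` scores each feature with a one-way ANOVA F-statistic; the plain-Python version below reimplements that idea only to make the ranking visible and is not the code used in this work. The function names and the toy values are hypothetical.

```python
from statistics import mean


def anova_f_score(values, labels):
    """One-way ANOVA F-statistic of one feature across the class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    grand_mean = mean(values)
    # Variation between the class means and variation inside each class.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2
                     for g in groups.values())
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def select_k_best(columns, labels, k):
    """Return the names of the k highest-scoring features and all scores."""
    scores = {name: anova_f_score(vals, labels)
              for name, vals in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical toy data: "thalach" separates the classes, "fbs" does not.
labels = [1, 1, 1, 0, 0, 0]
columns = {
    "thalach": [170, 165, 172, 120, 125, 118],
    "fbs": [0, 1, 0, 1, 0, 0],
}
kept, scores = select_k_best(columns, labels, k=1)
```

A feature whose class means barely differ gets a score near zero and is the first candidate to drop. The sketch does not handle a feature that is constant within every class, where the within-class variation is zero.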
- Performance Metrics
We use accuracy, precision, recall, and F1 score to evaluate the models.
The first paper we chose to improve on was "Neural network diagnosis of heart disease" (2015). Our expected result was 85% accuracy after implementing the structure described in that paper. After improving on that model and exceeding its accuracy, we selected two more papers with better performance metrics. Accuracy alone is not a sufficient performance metric for heart disease prediction, which is why precision, recall, and F1 score are also considered.
Our Result:
Logistic Regression (Best Result): Accuracy 91.20%
Decision Tree: Accuracy 84.62%
KNN: Accuracy 79.00%
SVM: Accuracy 86.81%
Random Forest: Accuracy 86.81%
Perceptron: Accuracy 83.54%
Gradient Boosting: Accuracy 86.81%
Confusion Matrix
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | 35 | 0 |
| Actual Negative | 3 | 23 |
Result Comparison from Previous Studies
| Paper | Model | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| Olaniyi, E. O., Oyedotun, O. K., Helwan, A., & Adnan, K. (2015). Neural network diagnosis of heart disease. | Decision Tree, Naive Bayes | 45.67%, 84.35%, 82.31% | | | |
| Tasnim, F., & Habiba, S. U. (2021). A Comparative Study on Heart Disease Prediction Using Data Mining Techniques and Feature Selection. 2021 2nd International Conference on Robotics, Electrical and Signal Processing Techniques | KNN | 88% | 87% | 86% | |
| | SVM | 82% | 83% | 81% | |
| | Logistic Regression | 80% | 82% | 81% | |
| | Gradient Boosting | 83% | 82% | 80% | |
| | Random Forest | 91.17% | 91% | 90% | |
| Terrada, O., Cherradi, B., Hamida, S., Raihani, A., Moujahid, H., & Bouattane, O. (2020). Prediction of Patients with Heart Disease using Artificial Neural Network and Adaptive Boosting techniques. 2020 | AdaBoost | 72.22% | 69.57% | 66.67% | 68.09% |
Table: Previous Result
UCI Data Set Result
Accuracy by preprocessing method:
| Model Name | All Features | Scaling | FClassif | Chi-square | One Hot Encoding |
|---|---|---|---|---|---|
| Logistic Regression | 87.91% | 87.91% | 90.11% | 85.71% | 90.11% |
| Decision Tree | 71.43% | 75.82% | 79.12% | 80.22% | 70.33% |
| KNN | 68.13% | 84.62% | 68.13% | 68.13% | 84.62% |
| SVM | 86.81% | 87.91% | 87.91% | 86.81% | 86.81% |
| Random Forest | 86.81% | 83.52% | 84.62% | 83.52% | 87.91% |
| Perceptron | 67.03% | 76.92% | 57.14% | 68.13% | 83.52% |
| Gradient Boosting | 86.81% | 86.81% | 85.71% | 82.42% | 84.62% |
Cardiovascular Data Set Result
Accuracy by preprocessing method:
| Model Name | All Features | Scaling | FClassif | One Hot Encoding |
|---|---|---|---|---|
| Logistic Regression | 69.42% | 49.89% | 71.86% | 49.89% |
| Decision Tree | 63.31% | 63.19% | 63.50% | 63.22% |
| KNN | 63.79% | 50.51% | 68.93% | 50.38% |
| SVM | 71.66% | 64.40% | 72.30% | 63.70% |
| Random Forest | 71.96% | 71.98% | 70.48% | 71.74% |
Table: Our Result
Fig. UCI Dataset Result Comparison
Fig. Cardio_vascular Dataset Result Comparison
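Two of the preprocessing options compared in the result tables, scaling and one-hot encoding, can be sketched in a few lines. The report does not state which scaler was used, so standardisation to zero mean and unit variance is assumed here. The function names and sample values are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Scale a numeric column to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Encode a categorical column as one 0/1 indicator list per category."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}


# Hypothetical sample values for two UCI attributes.
trestbps = [120, 130, 140, 150]   # resting blood pressure (mm Hg)
cp = [0, 2, 1, 2]                 # chest pain type (categorical code)

scaled = standardize(trestbps)
encoded = one_hot(cp)
```

Scaling matters most for distance-based and gradient-based models such as KNN and logistic regression, because it stops features with large numeric ranges from dominating the others.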
- Performance Metrics
Our Result:
ANN (Paper Structure): Accuracy 85.71%
ANN (Model 4): Accuracy 91.08%
ANN (Best Result): Accuracy 95.08%
Confusion Matrix
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | 35 | 0 |
| Actual Negative | 3 | 23 |
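The four evaluation metrics follow directly from the counts in the confusion matrix above. The sketch below is a minimal illustration (the function name `classification_metrics` is ours); with these counts the accuracy works out to 58/61, or 95.08%, which matches the figure reported for ANN (Best Result).

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Counts read from the confusion matrix above.
accuracy, precision, recall, f1 = classification_metrics(tp=35, fn=0, fp=3, tn=23)
print(f"accuracy={accuracy:.2%} precision={precision:.2%} "
      f"recall={recall:.2%} f1={f1:.2%}")
```

Here recall is 100% because no heart disease patient was missed, while the three false positives pull precision down to about 92%. Accuracy alone would hide that difference.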
Result Comparison
| Paper | Model | Accuracy | Precision | Recall | F1-Score | No. of Hidden Layers |
|---|---|---|---|---|---|---|
| Olaniyi, E. O., Oyedotun, O. K., Helwan, A., & Adnan, K. (2015). Neural network diagnosis of heart disease. | ANN | 85% | | | | 6 |
| Lin, C.-H., Yang, P.-K., Lin, Y.-C., & Fu, P.-K. (2020). On Machine Learning Models for Heart Disease Diagnosis. 2020 IEEE 2nd Eurasia | ANN | 91.26% | | | | 1 |
| | CNN | 83.50% | | | | 3 |
| Terrada, O., Cherradi, B., Hamida, S., Raihani, A., Moujahid, H., & Bouattane, O. (2020). Prediction of Patients with Heart Disease using Artificial Neural Network and Adaptive Boosting techniques. 2020 | ANN | 91.41% | 79.67% | 70.36% | 75.98% | 3 |
Table: Previous Result
- Evaluation
| Model Name | Preprocessing | Input Size | No. of Hidden Layers | Neurons/Filters | Activation Function & Loss Function | Optimizer & Learning Rate | Epochs | Accuracy |
|---|---|---|---|---|---|---|---|---|
| ANN (Paper Selected) | Scaling | 13 | 6 | 5 | Sigmoid | Adam, 0.0032 | 2000 | 85.71% |
| ANN (Proposed Model-1) | Scaling | 13 | 3 | 12 | ReLU, BCE | Adam, 0.01 | 200 | 87.91% |
| LSTM (Model-2) | Scaling | 13 | 4 | 100 | ReLU, BCE | Adam, 0.001 | 90 | 77.04% |
| 1D CNN (Proposed Model-3) | Scaling | 13 | 2 | 128 | ReLU, BCE | Adam, 0.01 | 15 | 86.81% |
| ANN (Proposed Model-4) | Feature Selection with Scaling | 10 | 3 | 100 | ReLU, BCE | Adam, 0.01 | 125 | 91.21% |
| ANN (Proposed Model-5), Best Result | Scaling | 13 | 1 | 300 | ReLU, BCE with Logits Loss | SGD, 0.01 | 80 | 96.72% |
Table: Our Result
- Discussion
Our experiments yielded good results, and we obtained better results than the papers mentioned above. We found that scaling and encoding generally perform well for KNN and Logistic Regression. In future work, further hyperparameter tuning may improve our results, and other boosting algorithms could be used to increase the accuracy of the models.