An end-to-end logistic regression project predicting the 10-year risk of coronary heart disease (CHD).
- ➡️ View the Jupyter Notebook on GitHub: chd-risk-prediction-logit.ipynb
- Dataset: Cardiovascular Study Dataset on Kaggle
Problem: CHD is a leading cause of death, and early identification of high-risk patients can save lives and reduce healthcare costs.
Solution: Built a logistic regression model with class-weight balancing and threshold optimized for F2 score. Preprocessed data, log-transformed skewed features, and removed multicollinear variables. Achieved F2 = 0.525 and ~80% recall on the test set, enabling early identification of at-risk patients for targeted interventions.
- Predict 10-year risk of CHD using logistic regression.
- Handle class imbalance using stratified splits and class weighting.
- Optimize model threshold for F2-Score to prioritize recall.
- Perform robust preprocessing and feature engineering based on Exploratory Data Analysis (EDA).
- Data Cleaning & Preparation
- Exploratory Data Analysis (EDA)
- Feature Selection & Transformation (log-transform skewed features, remove multicollinear features)
- Missing Value Imputation
- Train/Test Split (stratified)
- Logistic Regression with
class_weight='balanced' - Threshold Optimization based on F2-Score
- Model Evaluation (Recall, F2, ROC-AUC, Confusion Matrix)
Because missing a high-risk patient is more costly than a false alarm, the decision threshold was optimized using the F2-score, which prioritizes recall over precision. The optimal threshold was selected based on training-set probabilities.
Lowering the threshold from 0.50 to 0.40 increased the model’s ability to identify at-risk patients while maintaining acceptable precision.
The confusion matrix below illustrates model performance on the test set using the optimized threshold. The model favors recall, resulting in fewer false negatives at the expense of increased false positives — an appropriate trade-off for a medical screening task.
The ROC curve (Receiver Operating Characteristic curve) visualizes the model's ability to distinguish between patients who will develop CHD and those who will not across all possible thresholds.
- ROC-AUC (Area Under the ROC Curve): 0.694
This metric summarizes the model’s overall discriminatory power.- 1.0 = perfect separation of classes
- 0.5 = no better than random guessing
A ROC-AUC of 0.694 indicates the model reasonably separates at-risk and not-at-risk patients, showing it performs better than chance at predicting 10-year CHD risk.
-
F2 Score: The initial model had an F2 score of 0.487 on the test set. After threshold optimization, the F2 score increased to 0.525, an improvement of approximately 7.8%, reflecting better detection of at-risk patients while maintaining a balance between recall and precision.
-
Recall: With the optimized threshold, the model achieved a recall of ~80%, meaning it correctly identifies 8 out of 10 patients who will develop CHD within 10 years. This high recall is crucial for a screening tool where missing at-risk patients carries significant consequences.
-
ROC-AUC: The model achieved a ROC-AUC of 0.694, indicating it can reasonably distinguish between patients who will and will not develop CHD over 10 years.
- Use pipelines for preprocessing and modeling.
- Explore alternative models (e.g., XGBoost, LightGBM).
- Perform cost-sensitive threshold tuning based on false positive vs. false negative impact.
- Conduct in-depth analysis on misclassified samples.


