Multi-label ECG classification using machine learning on the PTB-XL and PTB-XL+ datasets, focusing on handling missing data, feature selection, and class imbalance.
PTB-XL+ paper/
├── src/
│ ├── preprocessing/ # Data preprocessing scripts (run in order)
│ │ ├── 01_merge_and_label.py
│ │ ├── 02_clean_columns.py
│ │ ├── 03_split_data.py
│ │ └── 04_impute_and_flag.py
│ │
│ ├── analysis/ # Analysis and feature selection
│ │ ├── 05_feature_selection_rf.py # Feature selection (Top-50, Top-100, Top-200)
│ │ ├── missing_data_analysis.py
│ │ ├── outlier_analysis_per_label.py
│ │ └── create_class_weight_visualization.py
│ │
│ └── modeling/ # Model training scripts
│ └── 06_train_models.py # Main model training (RF Class Weight & Ensemble Undersampling)
│
├── data/
│ └── processed/ # Preprocessed data (generated by preprocessing scripts)
│
├── reports/
│ ├── feature_selection/ # Feature selection plots and lists
│ ├── missing_data_analysis/ # Missing data visualizations
│ └── model_comparison/ # Model performance plots (ROC, Precision/Recall)
│
├── results/ # CSV results of model experiments
│
├── requirements.txt # Python dependencies
└── README.md # This file
Install dependencies:
pip install -r requirements.txt
Download datasets:
- PTB-XL: https://physionet.org/content/ptb-xl/1.0.3/
- PTB-XL+: https://physionet.org/content/ptb-xl-plus/1.0.1/
Note: Processed CSV files are not included in the repository. You must run the preprocessing pipeline to generate them.
The project follows a sequential pipeline. Scripts are located in src/.
Run preprocessing scripts in order:
# Step 1: Merge datasets and process labels
python src/preprocessing/01_merge_and_label.py
# Step 2: Clean columns
python src/preprocessing/02_clean_columns.py
# Step 3: Split data into train/val/test
python src/preprocessing/03_split_data.py
# Step 4: Impute missing values
python src/preprocessing/04_impute_and_flag.py

This pipeline performs:
- Merging PTB-XL metadata with PTB-XL+ features
- Processing labels and creating multi-label columns (NORM, MI, STTC, CD, HYP)
- Cleaning columns (dropping sparse/metadata columns)
- Splitting data into train/validation/test sets using stratified folds
- Imputing missing values and creating missing flags
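The merge-and-label step can be sketched as follows. This is a minimal illustration using toy DataFrames in place of the real PTB-XL metadata and PTB-XL+ feature CSVs; the column names (`ecg_id`, `scp_codes`, `qrs_duration`) and the partial superclass mapping are assumptions standing in for the full `scp_statements.csv` mapping:

```python
import ast

import pandas as pd

# Toy stand-in for ptbxl_database.csv: scp_codes maps SCP statements to likelihoods.
meta = pd.DataFrame({
    "ecg_id": [1, 2, 3],
    "scp_codes": ["{'NORM': 100.0}", "{'IMI': 80.0, 'STTC': 50.0}", "{'LVH': 100.0}"],
})
# Toy stand-in for a PTB-XL+ feature table keyed by ecg_id.
features = pd.DataFrame({"ecg_id": [1, 2, 3], "qrs_duration": [92.0, 110.0, 128.0]})

# Map SCP statements to the five diagnostic superclasses (subset shown here;
# the full mapping lives in scp_statements.csv).
SUPERCLASS = {"NORM": "NORM", "IMI": "MI", "STTC": "STTC", "LVH": "HYP"}

df = meta.merge(features, on="ecg_id", how="inner")
codes = df["scp_codes"].apply(ast.literal_eval)  # parse the dict-as-string column
for label in ["NORM", "MI", "STTC", "CD", "HYP"]:
    # One binary column per superclass: 1 if any SCP code maps to that class.
    df[label] = codes.apply(lambda d: int(any(SUPERCLASS.get(c) == label for c in d)))
```

Because a record can carry several SCP codes, an ECG may be positive for more than one superclass, which is why the labels are separate binary columns rather than a single class column.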
Feature selection is performed in two stages:
- Variance Threshold: removes low-variance features (variance < 0.05).
- Random Forest Importance: selects the Top-50, Top-100, and Top-200 features by importance score.
Select top features using Random Forest importance:
python src/analysis/05_feature_selection_rf.py

This generates:
- Top-50, Top-100, and Top-200 feature lists
- Feature importance plots
- Feature selection reports in
reports/feature_selection/
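The two-stage selection can be sketched with scikit-learn on synthetic data (the real script operates on the processed feature CSVs; the data here is a toy stand-in with one near-constant column and two informative columns):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
X[:, 0] = 0.01  # a constant column that the variance threshold should drop
y = (X[:, 1] + X[:, 2] > 0).astype(int)  # label driven by features 1 and 2

# Stage 1: drop low-variance features (threshold 0.05, as in the pipeline).
vt = VarianceThreshold(threshold=0.05)
X_vt = vt.fit_transform(X)

# Stage 2: rank the surviving features by Random Forest importance, keep top-k.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_vt, y)
top_k = 5
top_idx = np.argsort(rf.feature_importances_)[::-1][:top_k]
```

In the real pipeline `top_k` would be 50, 100, or 200, and the retained column names would be written out as the feature lists under `reports/feature_selection/`.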
Train and evaluate models:
python src/modeling/06_train_models.py

This script:
- Trains Random Forest models with Class Weighting (Top-50, 100, 200)
- Trains Random Forest with Ensemble Undersampling (Top-200)
- Generates performance metrics and visualizations
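The class-weighted variant amounts to the following sketch (toy single-label data standing in for one of the five binary superclass targets; the real script trains on the selected Top-k feature sets):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 10))
# Imbalanced binary target (~12% positives), standing in for a minority class like HYP.
y = (X[:, 0] > 1.2).astype(int)

# class_weight="balanced" reweights each class inversely to its frequency,
# so errors on the rare class cost more during tree construction.
rf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0)
rf.fit(X, y)
```

For the multi-label setting, one such classifier per superclass (or a multi-output forest) is a natural way to apply per-label weights.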
# Missing data analysis
python src/analysis/missing_data_analysis.py
# Outlier analysis per label
python src/analysis/outlier_analysis_per_label.py
# Class weight visualization
python src/analysis/create_class_weight_visualization.py

- Analysis: Identified that missingness in P-wave features correlates with the MI and STTC classes (i.e., Missing Not At Random, MNAR).
- Strategy: Instead of dropping rows, missing values were imputed with 0 (where the wave is absent) or the median, and binary is_missing_* flags were added to preserve the missingness information.
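The impute-and-flag strategy reduces to a few lines of pandas. A minimal sketch with assumed column names (`p_duration`, `qt_interval`):

```python
import numpy as np
import pandas as pd

# Toy frame with missing values: a P-wave feature (absence is meaningful, MNAR)
# and an interval feature (missing for other reasons).
df = pd.DataFrame({
    "p_duration": [80.0, np.nan, 76.0],
    "qt_interval": [400.0, 390.0, np.nan],
})

# Add a binary flag per column BEFORE imputing, so "missingness" survives.
for col in ["p_duration", "qt_interval"]:
    df[f"is_missing_{col}"] = df[col].isna().astype(int)

# Impute: 0 where the absence of the wave itself is the signal,
# median for features that are missing for technical reasons.
df["p_duration"] = df["p_duration"].fillna(0.0)
df["qt_interval"] = df["qt_interval"].fillna(df["qt_interval"].median())
```

The flags let the model use "P-wave missing" as a feature in its own right, which matters precisely because missingness correlates with the MI and STTC labels.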
- VarianceThreshold: Removed 186 features with variance < 0.05.
- Random Forest Importance: Selected the top 200 features from the remaining set. PCA was avoided to maintain interpretability.
- Class Weighting: Assigning higher weights to minority classes (HYP, CD).
- Ensemble Undersampling: Training multiple models on balanced subsets to reduce bias towards the majority class (NORM).
- Balanced Random Forest: Automatically balances bootstrap samples.
- Random Forest with Class Weighting (Top-50, Top-100, Top-200 features)
- Random Forest with Ensemble Undersampling (Top-200 features)
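The ensemble undersampling strategy can be sketched as follows (synthetic imbalanced data; the exact number of ensemble members in the real script is not specified here, so 5 is an illustrative choice):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = (X[:, 0] + 0.5 * X[:, 1] > 1.5).astype(int)  # rare positive class

pos = np.where(y == 1)[0]
neg = np.where(y == 0)[0]

# Train several forests, each on ALL minority samples plus an equal-sized
# random draw of majority samples, then average their probabilities.
models = []
for seed in range(5):
    sub = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])
    m = RandomForestClassifier(n_estimators=50, random_state=seed)
    m.fit(X[sub], y[sub])
    models.append(m)

proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
pred = (proba >= 0.5).astype(int)
```

Each member sees a balanced subset, so no single model is dominated by the majority class (NORM), while averaging across members recovers most of the information in the discarded majority samples.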
The experiments showed that handling class imbalance is crucial. The Random Forest with Top-200 features and Class Weighting provided the best balance between Precision and Recall.
| Model | Accuracy | Precision | Recall | F1 Score | ROC-AUC |
|---|---|---|---|---|---|
| RF (Top-50, CW) | 53.20% | 73.58% | 66.13% | 69.53% | 0.904 |
| RF (Top-100, CW) | 54.82% | 75.40% | 67.64% | 71.17% | 0.909 |
| RF (Top-200, CW) | 55.24% | 75.91% | 67.87% | 71.50% | 0.915 |
| RF (Top-200, Undersampling) | 57.09% | 69.23% | 68.12% | 69.23% | 0.915 |
Note: Metrics are macro-averaged. Detailed per-class metrics are available in the report.
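Macro averaging computes each metric per label and then takes the unweighted mean, so rare labels (e.g. HYP) count as much as NORM. A small sketch with toy multi-label predictions for the five superclasses:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Rows = ECGs, columns = (NORM, MI, STTC, CD, HYP); toy ground truth and predictions.
y_true = np.array([[1, 0, 0, 0, 0], [0, 1, 1, 0, 0], [0, 0, 0, 1, 1], [1, 0, 0, 0, 0]])
y_pred = np.array([[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 0, 1, 1], [1, 0, 1, 0, 0]])

# average="macro": per-label metric, then unweighted mean over the 5 labels.
prec = precision_score(y_true, y_pred, average="macro", zero_division=0)
rec = recall_score(y_true, y_pred, average="macro", zero_division=0)
f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
```

Here four of the five labels are predicted perfectly and STTC is entirely wrong, so each macro metric is 4/5 = 0.8, illustrating how a single poorly handled label drags the macro average down.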
All experiments can be reproduced by running the scripts in order:
1. src/preprocessing/01_merge_and_label.py → 04_impute_and_flag.py
2. src/analysis/05_feature_selection_rf.py
3. src/modeling/06_train_models.py
Results are saved in:
- results/ - CSV files with performance metrics
- reports/ - Visualizations and analysis reports
The presentation is prepared in:
- Sunum/ - PDF and TeX files
- Python: 3.11+
- Key Libraries:
- pandas, numpy - Data processing
- scikit-learn - Modeling and evaluation
- matplotlib, seaborn - Visualization
- Wagner, P., et al. (2020). PTB-XL, a large publicly available electrocardiography dataset. Scientific Data.
- Strodthoff, N., et al. (2021). PTB-XL+, a comprehensive electrocardiographic feature dataset. Scientific Data.
This project is for educational purposes at Istanbul University.