Multi-label ECG classification using machine learning on the PTB-XL and PTB-XL+ datasets, focusing on handling missing data, feature selection, and class imbalance.
PTB-XL+ paper/
├── src/
│ ├── preprocessing/ # Data preprocessing scripts (run in order)
│ │ ├── 01_merge_and_label.py
│ │ ├── 02_clean_columns.py
│ │ ├── 03_split_data.py
│ │ └── 04_impute_and_flag.py
│ │
│ ├── analysis/ # Analysis and feature selection
│ │ ├── 05_feature_selection_rf.py # Feature selection (Top-50, Top-100, Top-200)
│ │ ├── missing_data_analysis.py
│ │ ├── outlier_analysis_per_label.py
│ │ └── create_class_weight_visualization.py
│ │
│ └── modeling/ # Model training scripts
│ └── 06_train_models.py # Main model training (RF Class Weight & Ensemble Undersampling)
│
├── data/
│ └── processed/ # Preprocessed data (generated by preprocessing scripts)
│
├── reports/
│ ├── feature_selection/ # Feature selection plots and lists
│ ├── missing_data_analysis/ # Missing data visualizations
│ └── model_comparison/ # Model performance plots (ROC, Precision/Recall)
│
├── results/ # CSV results of model experiments
│
├── requirements.txt # Python dependencies
└── README.md # This file
Install dependencies:
pip install -r requirements.txt
Download datasets:
- PTB-XL: https://physionet.org/content/ptb-xl/1.0.3/
- PTB-XL+: https://physionet.org/content/ptb-xl-plus/1.0.1/
Note: Processed CSV files are not included in the repository. You must run the preprocessing pipeline to generate them.
The project follows a sequential pipeline. Scripts are located in src/.
Run preprocessing scripts in order:
# Step 1: Merge datasets and process labels
python src/preprocessing/01_merge_and_label.py
# Step 2: Clean columns
python src/preprocessing/02_clean_columns.py
# Step 3: Split data into train/val/test
python src/preprocessing/03_split_data.py
# Step 4: Impute missing values
python src/preprocessing/04_impute_and_flag.py

This pipeline performs:
- Merging PTB-XL metadata with PTB-XL+ features
- Processing labels and creating multi-label columns (NORM, MI, STTC, CD, HYP)
- Cleaning columns (dropping sparse/metadata columns)
- Splitting data into train/validation/test sets using stratified folds
- Imputing missing values and creating missing flags
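The merge-and-label step can be sketched as follows. This is a minimal illustration using toy DataFrames in place of the real PTB-XL metadata and PTB-XL+ feature CSVs; the column names (`ecg_id`, `scp_codes`, `qrs_duration`) and the partial superclass mapping are assumptions standing in for the full `scp_statements.csv` mapping:

```python
import ast

import pandas as pd

# Toy stand-in for ptbxl_database.csv: scp_codes maps SCP statements to likelihoods.
meta = pd.DataFrame({
    "ecg_id": [1, 2, 3],
    "scp_codes": ["{'NORM': 100.0}", "{'IMI': 80.0, 'STTC': 50.0}", "{'LVH': 100.0}"],
})
# Toy stand-in for a PTB-XL+ feature table keyed by ecg_id.
features = pd.DataFrame({"ecg_id": [1, 2, 3], "qrs_duration": [92.0, 110.0, 128.0]})

# Map SCP statements to the five diagnostic superclasses (subset shown here;
# the full mapping lives in scp_statements.csv).
SUPERCLASS = {"NORM": "NORM", "IMI": "MI", "STTC": "STTC", "LVH": "HYP"}

df = meta.merge(features, on="ecg_id", how="inner")
codes = df["scp_codes"].apply(ast.literal_eval)  # parse the dict-as-string column
for label in ["NORM", "MI", "STTC", "CD", "HYP"]:
    # One binary column per superclass: 1 if any SCP code maps to that class.
    df[label] = codes.apply(lambda d: int(any(SUPERCLASS.get(c) == label for c in d)))
```

Because a record can carry several SCP codes, an ECG may be positive for more than one superclass, which is why the labels are separate binary columns rather than a single class column.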
Feature selection is performed in two stages:
- Variance Threshold: removes low-variance features (variance < 0.05).
- Random Forest Importance: selects the Top-50, Top-100, and Top-200 features by importance score.
Select top features using Random Forest importance:
python src/analysis/05_feature_selection_rf.py

This generates:
- Top-50, Top-100, and Top-200 feature lists
- Feature importance plots
- Feature selection reports in
reports/feature_selection/
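The two-stage selection can be sketched with scikit-learn on synthetic data (the real script operates on the processed feature CSVs; the data here is a toy stand-in with one near-constant column and two informative columns):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
X[:, 0] = 0.01  # a constant column that the variance threshold should drop
y = (X[:, 1] + X[:, 2] > 0).astype(int)  # label driven by features 1 and 2

# Stage 1: drop low-variance features (threshold 0.05, as in the pipeline).
vt = VarianceThreshold(threshold=0.05)
X_vt = vt.fit_transform(X)

# Stage 2: rank the surviving features by Random Forest importance, keep top-k.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_vt, y)
top_k = 5
top_idx = np.argsort(rf.feature_importances_)[::-1][:top_k]
```

In the real pipeline `top_k` would be 50, 100, or 200, and the retained column names would be written out as the feature lists under `reports/feature_selection/`.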
Train and evaluate models:
python src/modeling/06_train_models.py

This script:
- Trains Random Forest models with Class Weighting (Top-50, 100, 200)
- Trains Random Forest with Ensemble Undersampling (Top-200)
- Generates performance metrics and visualizations
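The class-weighted variant amounts to the following sketch (toy single-label data standing in for one of the five binary superclass targets; the real script trains on the selected Top-k feature sets):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 10))
# Imbalanced binary target (~12% positives), standing in for a minority class like HYP.
y = (X[:, 0] > 1.2).astype(int)

# class_weight="balanced" reweights each class inversely to its frequency,
# so errors on the rare class cost more during tree construction.
rf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0)
rf.fit(X, y)
```

For the multi-label setting, one such classifier per superclass (or a multi-output forest) is a natural way to apply per-label weights.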
# Missing data analysis
python src/analysis/missing_data_analysis.py
# Outlier analysis per label
python src/analysis/outlier_analysis_per_label.py
# Class weight visualization
python src/analysis/create_class_weight_visualization.py

- Analysis: Identified that missingness in P-wave features correlates with the MI and STTC classes (i.e., Missing Not At Random, MNAR).
- Strategy: Instead of dropping rows, missing values were imputed with 0 (where the wave is absent) or the median, and binary is_missing_* flags were added to preserve the missingness information.
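The impute-and-flag strategy reduces to a few lines of pandas. A minimal sketch with assumed column names (`p_duration`, `qt_interval`):

```python
import numpy as np
import pandas as pd

# Toy frame with missing values: a P-wave feature (absence is meaningful, MNAR)
# and an interval feature (missing for other reasons).
df = pd.DataFrame({
    "p_duration": [80.0, np.nan, 76.0],
    "qt_interval": [400.0, 390.0, np.nan],
})

# Add a binary flag per column BEFORE imputing, so "missingness" survives.
for col in ["p_duration", "qt_interval"]:
    df[f"is_missing_{col}"] = df[col].isna().astype(int)

# Impute: 0 where the absence of the wave itself is the signal,
# median for features that are missing for technical reasons.
df["p_duration"] = df["p_duration"].fillna(0.0)
df["qt_interval"] = df["qt_interval"].fillna(df["qt_interval"].median())
```

The flags let the model use "P-wave missing" as a feature in its own right, which matters precisely because missingness correlates with the MI and STTC labels.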
- VarianceThreshold: Removed 186 features with variance < 0.05.
- Random Forest Importance: Selected the top 200 features from the remaining set. PCA was avoided to maintain interpretability.
- Class Weighting: Assigning higher weights to minority classes (HYP, CD).
- Ensemble Undersampling: Training multiple models on balanced subsets to reduce bias towards the majority class (NORM).
- Balanced Random Forest: Automatically balances bootstrap samples.
- Random Forest with Class Weighting (Top-50, Top-100, Top-200 features)
- Random Forest with Ensemble Undersampling (Top-200 features)
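The ensemble undersampling strategy can be sketched as follows (synthetic imbalanced data; the exact number of ensemble members in the real script is not specified here, so 5 is an illustrative choice):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = (X[:, 0] + 0.5 * X[:, 1] > 1.5).astype(int)  # rare positive class

pos = np.where(y == 1)[0]
neg = np.where(y == 0)[0]

# Train several forests, each on ALL minority samples plus an equal-sized
# random draw of majority samples, then average their probabilities.
models = []
for seed in range(5):
    sub = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])
    m = RandomForestClassifier(n_estimators=50, random_state=seed)
    m.fit(X[sub], y[sub])
    models.append(m)

proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
pred = (proba >= 0.5).astype(int)
```

Each member sees a balanced subset, so no single model is dominated by the majority class (NORM), while averaging across members recovers most of the information in the discarded majority samples.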
The experiments showed that handling class imbalance is crucial. The Random Forest with Top-200 features and Class Weighting provided the best balance between Precision and Recall.
| Model | Accuracy | Precision | Recall | F1 Score | ROC-AUC |
|---|---|---|---|---|---|
| RF (Top-50, CW) | 53.20% | 73.58% | 66.13% | 69.53% | 0.904 |
| RF (Top-100, CW) | 54.82% | 75.40% | 67.64% | 71.17% | 0.909 |
| RF (Top-200, CW) | 55.24% | 75.91% | 67.87% | 71.50% | 0.915 |
| RF (Top-200, Undersampling) | 57.09% | 69.23% | 68.12% | 69.23% | 0.915 |
Note: Metrics are macro-averaged. Detailed per-class metrics are available in the report.
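Macro averaging computes each metric per label and then takes the unweighted mean, so rare labels (e.g. HYP) count as much as NORM. A small sketch with toy multi-label predictions for the five superclasses:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Rows = ECGs, columns = (NORM, MI, STTC, CD, HYP); toy ground truth and predictions.
y_true = np.array([[1, 0, 0, 0, 0], [0, 1, 1, 0, 0], [0, 0, 0, 1, 1], [1, 0, 0, 0, 0]])
y_pred = np.array([[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 0, 1, 1], [1, 0, 1, 0, 0]])

# average="macro": per-label metric, then unweighted mean over the 5 labels.
prec = precision_score(y_true, y_pred, average="macro", zero_division=0)
rec = recall_score(y_true, y_pred, average="macro", zero_division=0)
f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
```

Here four of the five labels are predicted perfectly and STTC is entirely wrong, so each macro metric is 4/5 = 0.8, illustrating how a single poorly handled label drags the macro average down.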
All experiments can be reproduced by running the scripts in order:
1. src/preprocessing/01_merge_and_label.py → 04_impute_and_flag.py
2. src/analysis/05_feature_selection_rf.py
3. src/modeling/06_train_models.py
Results are saved in:
- results/ - CSV files with performance metrics
- reports/ - Visualizations and analysis reports
The presentation is prepared in:
- Sunum/ - PDF and TeX files
- Python: 3.11+
- Key Libraries:
- pandas, numpy - Data processing
- scikit-learn - Modeling and evaluation
- matplotlib, seaborn - Visualization
- Wagner, P., et al. (2020). PTB-XL, a large publicly available electrocardiography dataset. Scientific Data.
- Strodthoff, N., et al. (2021). PTB-XL+, a comprehensive electrocardiographic feature dataset. Scientific Data.
This project is for educational purposes at Istanbul University.