ibeuler/PTB-XL-Heart-Anomaly-Diagnosis-Project

PTB-XL ECG Classification Project

Multi-class ECG classification using machine learning on the PTB-XL dataset, focusing on handling missing data, feature selection, and class imbalance.

Project Structure

PTB-XL+ paper/
├── src/
│   ├── preprocessing/       # Data preprocessing scripts (run in order)
│   │   ├── 01_merge_and_label.py
│   │   ├── 02_clean_columns.py
│   │   ├── 03_split_data.py
│   │   └── 04_impute_and_flag.py
│   │
│   ├── analysis/           # Analysis and feature selection
│   │   ├── 05_feature_selection_rf.py  # Feature selection (Top-50, Top-100, Top-200)
│   │   ├── missing_data_analysis.py
│   │   ├── outlier_analysis_per_label.py
│   │   └── create_class_weight_visualization.py
│   │
│   └── modeling/           # Model training scripts
│       └── 06_train_models.py  # Main model training (RF Class Weight & Ensemble Undersampling)
│
├── data/
│   └── processed/          # Preprocessed data (generated by preprocessing scripts)
│
├── reports/
│   ├── feature_selection/       # Feature selection plots and lists
│   ├── missing_data_analysis/   # Missing data visualizations
│   └── model_comparison/        # Model performance plots (ROC, Precision/Recall)
│
├── results/                # CSV results of model experiments
│
├── requirements.txt        # Python dependencies
└── README.md              # This file

Setup

  1. Install dependencies:

    pip install -r requirements.txt
  2. Download datasets:

    Note: Processed CSV files are not included in the repository. You must run the preprocessing pipeline to generate them.

Pipeline & Usage

The project follows a sequential pipeline. Scripts are located in src/.

1. Preprocessing

Run preprocessing scripts in order:

# Step 1: Merge datasets and process labels
python src/preprocessing/01_merge_and_label.py

# Step 2: Clean columns
python src/preprocessing/02_clean_columns.py

# Step 3: Split data into train/val/test
python src/preprocessing/03_split_data.py

# Step 4: Impute missing values
python src/preprocessing/04_impute_and_flag.py

This pipeline performs:

  • Merging PTB-XL metadata with PTB-XL+ features
  • Processing labels and creating multi-label columns (NORM, MI, STTC, CD, HYP)
  • Cleaning columns (dropping sparse/metadata columns)
  • Splitting data into train/validation/test sets using stratified folds
  • Imputing missing values and creating missing flags
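
The README does not say which folds go to which split. As a minimal sketch, assuming the split relies on the `strat_fold` column (values 1 to 10) that ships with the PTB-XL metadata, and on the commonly recommended assignment of folds 1-8 to training, 9 to validation, and 10 to test, the splitting step might look like this. The function name and the toy data are illustrative only; `03_split_data.py` may do it differently.

```python
import pandas as pd


def split_by_fold(df, fold_col="strat_fold", val_fold=9, test_fold=10):
    """Split a frame into train/val/test using a precomputed fold column."""
    train = df[~df[fold_col].isin([val_fold, test_fold])]
    val = df[df[fold_col] == val_fold]
    test = df[df[fold_col] == test_fold]
    return train, val, test


# Toy frame standing in for the merged table (values are made up).
toy = pd.DataFrame({"ecg_id": range(1, 21), "strat_fold": list(range(1, 11)) * 2})
train, val, test = split_by_fold(toy)
print(len(train), len(val), len(test))
```

Using the precomputed folds rather than a fresh random split keeps the label distribution similar across the three sets.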

2. Feature Selection

  • Variance Threshold: Removes low-variance features (variance < 0.05).
  • Random Forest Importance: Ranks the remaining features by importance score and selects the Top-50, Top-100, and Top-200.

Run the feature selection script:

python src/analysis/05_feature_selection_rf.py

This generates:

  • Top-50, Top-100, and Top-200 feature lists
  • Feature importance plots
  • Feature selection reports in reports/feature_selection/

3. Model Training

Train and evaluate models:

python src/modeling/06_train_models.py

This script:

  • Trains Random Forest models with Class Weighting (Top-50, 100, 200)
  • Trains Random Forest with Ensemble Undersampling (Top-200)
  • Generates performance metrics and visualizations
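
To illustrate what class weighting does, here is a small sketch using scikit-learn's "balanced" heuristic, which weights each class by `n_samples / (n_classes * class_count)`. Whether `06_train_models.py` uses `class_weight="balanced"` or a custom weight dictionary is not stated here, so treat this as an assumption. The toy labels and features are made up.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Toy binary target: 8 majority samples (0) and 2 minority samples (1).
y = np.array([0] * 8 + [1] * 2)
classes = np.array([0, 1])

# "balanced" gives n_samples / (n_classes * count):
# majority -> 10 / (2 * 8) = 0.625, minority -> 10 / (2 * 2) = 2.5
weights = compute_class_weight("balanced", classes=classes, y=y)
print(dict(zip(classes.tolist(), weights.tolist())))

# The same heuristic can be applied inside the forest directly.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
clf = RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=0)
clf.fit(X, y)
```

With these weights, an error on a minority sample costs four times as much as an error on a majority sample, which pushes the trees toward better minority recall.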

4. Additional Analysis (Optional)

# Missing data analysis
python src/analysis/missing_data_analysis.py

# Outlier analysis per label
python src/analysis/outlier_analysis_per_label.py

# Class weight visualization
python src/analysis/create_class_weight_visualization.py

Methodology

Missing Data Handling

  • Analysis: Missingness in P-wave features is correlated with the MI and STTC classes, which suggests the values are missing not at random (MNAR).
  • Strategy: Instead of dropping rows, missing values were imputed with 0 (where the signal is absent) or with the median, and binary flags (is_missing_*) were added so that the missingness information itself is preserved.
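
The strategy above can be sketched as follows. This is not the code from `04_impute_and_flag.py`: the column names are hypothetical, and computing medians on the training split only is an assumption made here to avoid leaking information from the validation and test sets.

```python
import numpy as np
import pandas as pd


def impute_and_flag(train, other, zero_cols, median_cols):
    """Add is_missing_* flags, then fill NaNs with 0 or the training median."""
    train, other = train.copy(), other.copy()
    # Flags must be created before filling, otherwise the information is lost.
    for col in zero_cols + median_cols:
        for frame in (train, other):
            frame[f"is_missing_{col}"] = frame[col].isna().astype(int)
    medians = train[median_cols].median()  # assumption: statistics from train only
    for frame in (train, other):
        frame[zero_cols] = frame[zero_cols].fillna(0)
        frame[median_cols] = frame[median_cols].fillna(medians)
    return train, other


# Hypothetical feature names; the real PTB-XL+ columns differ.
train = pd.DataFrame({"P_amp": [0.1, np.nan, 0.3], "QRS_dur": [90.0, 100.0, np.nan]})
test = pd.DataFrame({"P_amp": [np.nan], "QRS_dur": [np.nan]})
tr, te = impute_and_flag(train, test, zero_cols=["P_amp"], median_cols=["QRS_dur"])
print(te)
```

Because the flag columns survive imputation, a tree-based model can still split on "this value was missing", which matters when missingness correlates with the label.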

Feature Selection

  • VarianceThreshold: Removed 186 features with variance < 0.05.
  • Random Forest Importance: Selected the top 200 features from the remaining set. PCA was avoided to maintain interpretability.
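
A minimal sketch of this two-stage selection, assuming scikit-learn's `VarianceThreshold` followed by impurity-based `feature_importances_` from a `RandomForestClassifier`. The helper name `select_top_k`, the forest hyperparameters, and the toy columns are illustrative and not taken from `05_feature_selection_rf.py`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold


def select_top_k(X, y, k, var_threshold=0.05, seed=0):
    """Drop low-variance columns, then keep the k most important ones."""
    vt = VarianceThreshold(threshold=var_threshold)
    vt.fit(X)
    kept = X.columns[vt.get_support()]  # columns that pass the variance filter
    rf = RandomForestClassifier(n_estimators=100, random_state=seed)
    rf.fit(X[kept], y)
    importances = pd.Series(rf.feature_importances_, index=kept)
    return importances.sort_values(ascending=False).head(k).index.tolist()


# Toy data: one informative column, one noise column, one constant column.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
    "near_constant": np.full(n, 1.0),
})
y = (X["signal"] > 0).astype(int)
top = select_top_k(X, y, k=1)
print(top)
```

Unlike PCA, this keeps the original feature names, so the selected list can be read and discussed directly.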

Class Imbalance Handling

  • Class Weighting: Assigning higher weights to minority classes (HYP, CD).
  • Ensemble Undersampling: Training multiple models on balanced subsets to reduce bias towards the majority class (NORM).
  • Balanced Random Forest: Automatically balances bootstrap samples.
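
One way to picture ensemble undersampling is the sketch below. It handles a single binary target (for example, one of the five label columns) and assumes NumPy array inputs. How the project combines this across labels is not described here, and the function names and hyperparameters are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def fit_undersampled_ensemble(X, y, n_models=5, seed=0):
    """Train several forests, each on all minority rows plus an
    equally sized random draw from the majority rows."""
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    minority, majority = sorted((pos_idx, neg_idx), key=len)
    models = []
    for i in range(n_models):
        sampled = rng.choice(majority, size=len(minority), replace=False)
        idx = np.concatenate([minority, sampled])
        model = RandomForestClassifier(n_estimators=50, random_state=seed + i)
        models.append(model.fit(X[idx], y[idx]))
    return models


def predict_proba_ensemble(models, X):
    """Average the positive-class probability over all ensemble members."""
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)


# Toy imbalanced problem: roughly one sample in six is positive.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 1.0).astype(int)
models = fit_undersampled_ensemble(X, y)
proba = predict_proba_ensemble(models, X)
print(proba[:5])
```

Each member sees a different slice of the majority class, so averaging their probabilities recovers most of the information a single undersampled model would discard.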

Models Evaluated

  1. Random Forest with Class Weighting (Top-50, Top-100, Top-200 features)
  2. Random Forest with Ensemble Undersampling (Top-200 features)

Results

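The experiments showed that the way class imbalance is handled has a visible effect on the Precision/Recall trade-off. The Random Forest with Top-200 features and Class Weighting achieved the highest F1 Score (71.50%) and Precision (75.91%), giving the best balance between Precision and Recall. Ensemble Undersampling reached slightly higher Accuracy and Recall at the cost of lower Precision.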

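| Model                       | Accuracy | Precision | Recall | F1 Score | ROC-AUC |
|-----------------------------|----------|-----------|--------|----------|---------|
| RF (Top-50, CW)             | 53.20%   | 73.58%    | 66.13% | 69.53%   | 0.904   |
| RF (Top-100, CW)            | 54.82%   | 75.40%    | 67.64% | 71.17%   | 0.909   |
| RF (Top-200, CW)            | 55.24%   | 75.91%    | 67.87% | 71.50%   | 0.915   |
| RF (Top-200, Undersampling) | 57.09%   | 69.23%    | 68.12% | 69.23%   | 0.915   |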

Note: Metrics are macro-averaged. Detailed per-class metrics are available in the report.
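
For readers unfamiliar with macro averaging on multi-label targets, the sketch below computes the same kinds of metrics with scikit-learn on a tiny made-up example with two labels. The 0.5 decision threshold and the helper name are assumptions. How the project computes Accuracy is not stated; note that `accuracy_score` on a multi-label indicator matrix is subset accuracy (every label in a row must match).

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)


def macro_metrics(y_true, y_score, threshold=0.5):
    """Macro-averaged metrics for a multi-label indicator matrix."""
    y_pred = (y_score >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),  # subset accuracy
        "precision": precision_score(y_true, y_pred, average="macro", zero_division=0),
        "recall": recall_score(y_true, y_pred, average="macro", zero_division=0),
        "f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
        "roc_auc": roc_auc_score(y_true, y_score, average="macro"),
    }


# Made-up example: 4 records, 2 labels.
y_true = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y_score = np.array([[0.9, 0.2], [0.4, 0.8], [0.7, 0.3], [0.1, 0.6]])
metrics = macro_metrics(y_true, y_score)
print(metrics)
# accuracy 0.5, precision 0.75, recall 0.75, f1 0.75, roc_auc 0.875
```

In this toy case the per-label scores are high while subset accuracy is only 0.5, because a row counts as correct only if all of its labels are right. If the project's Accuracy column is computed the same way, that would be one possible reason it sits well below the F1 Score.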

Report

All experiments can be reproduced by running the scripts in order:

  1. src/preprocessing/01_merge_and_label.py through 04_impute_and_flag.py
  2. src/analysis/05_feature_selection_rf.py
  3. src/modeling/06_train_models.py
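
If running six commands by hand is inconvenient, a small driver along these lines could chain them. This helper is not part of the repository. It assumes the scripts are launched from the repository root and take no arguments, and it defaults to a dry run that only prints the commands.

```python
import subprocess
import sys
from pathlib import Path

# Script order taken from the Project Structure section.
PIPELINE = [
    "src/preprocessing/01_merge_and_label.py",
    "src/preprocessing/02_clean_columns.py",
    "src/preprocessing/03_split_data.py",
    "src/preprocessing/04_impute_and_flag.py",
    "src/analysis/05_feature_selection_rf.py",
    "src/modeling/06_train_models.py",
]


def run_pipeline(root=".", dry_run=True):
    """Run (or just print) each pipeline step with the current interpreter."""
    commands = [[sys.executable, str(Path(root) / script)] for script in PIPELINE]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # raises on the first failing step
    return commands


commands = run_pipeline(dry_run=True)
```

Calling `run_pipeline(dry_run=False)` would execute the steps in sequence and stop at the first one that exits with an error.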

Results are saved in:

  • results/ - CSV files with performance metrics
  • reports/ - Visualizations and analysis reports

The presentation is available in:

  • Sunum/ - PDF and TeX files

Code Environment

  • Python: 3.11+
  • Key Libraries:
    • pandas, numpy - Data processing
    • scikit-learn - Modeling and evaluation
    • matplotlib, seaborn - Visualization

References

  1. Wagner, P., et al. (2020). PTB-XL, a large publicly available electrocardiography dataset. Scientific Data.
  2. Strodthoff, N., et al. (2023). PTB-XL+, a comprehensive electrocardiographic feature dataset. Scientific Data.

License

This project is for educational purposes at Istanbul University.
