Machine learning models for classification of First-Episode Psychosis (FEP) vs Healthy Controls using structural MRI radiomic features. This document describes the complete execution flow from data preprocessing to results visualization.
- Overview
- Prerequisites
- Step 1: Radiomics Feature Extraction
- Step 2: Data Preprocessing
- Step 3: Classification and Model Training
- Step 4: Results Visualization
- Complete Pipeline Execution
The pipeline consists of four main steps:
- Radiomics Feature Extraction: Extract radiomic features from MRI images using PyRadiomics
- Data Preprocessing: Clean, normalize, and prepare features for machine learning
- Classification: Train and evaluate machine learning models with explainability techniques
- Results Visualization: Analyze and visualize model performance and interpretability
- Python 3.11.9
- Conda or Miniconda
- Access to MRI data (FreeSurfer processed images)
```
First-episode_Psychosis_Clasification/
├── src/
│   ├── 1_radiomics/          # Feature extraction
│   ├── 2_preprocess_data/    # Data preprocessing
│   ├── 3_classification/     # Model training
│   └── 4_results_viewer/     # Results visualization
├── README.md
└── README_all.md
```
This step extracts radiomic features from structural MRI images using PyRadiomics.
Navigate to the radiomics directory:
```bash
cd src/1_radiomics
```

Create and activate the conda environment:
```bash
conda env create -f enviroment.yml
conda activate radiomics
```

Inputs and outputs:

- Input Directory: FreeSurfer-processed MRI images with segmentation masks
- Configuration: `Params.yaml` file with PyRadiomics parameters
- Output: CSV/TSV file with extracted radiomic features
Run the extraction locally:

```bash
./local_run_pyradiomics.sh
```

This script will:
- Generate a CSV with subject IDs and mask paths from FreeSurfer data
- Extract radiomic features for each subject and ROI
- Save features to `../2_preprocess_data/data/df_processed_example.tsv`
- Generate logs in `logs/logfile.txt`
To run on an HPC cluster instead, submit the SLURM job:

```bash
sbatch run_radiomics.sh
```

Key files:

- `generate_csv_freesurfer.py`: Creates the subject list from the FreeSurfer directory
- `calculate.py`: Main feature extraction script using PyRadiomics
- `Params.yaml`: PyRadiomics configuration parameters
The output TSV file contains:
- Subject identifiers
- ROI labels
- First-order statistics (mean, median, variance, etc.)
- Texture features (GLCM, GLRLM, GLSZM, etc.)
- Shape features (volume, surface area, sphericity, etc.)
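For orientation, here is a minimal sketch of extracting these features for a single subject with the PyRadiomics API. The paths and the ROI label are illustrative, and `calculate.py` may differ in detail:

```python
import pandas as pd
from radiomics import featureextractor

# Build an extractor from the same parameter file used by the pipeline
extractor = featureextractor.RadiomicsFeatureExtractor("Params.yaml")

# Hypothetical paths; in the pipeline these come from the FreeSurfer-derived CSV
image_path = "sub-001/mri/T1.nii.gz"
mask_path = "sub-001/mri/aparc+aseg.nii.gz"

# Extract features for one ROI label (17 = left hippocampus in FreeSurfer, as an example)
features = extractor.execute(image_path, mask_path, label=17)

# Keep only feature values (drop the 'diagnostics_*' metadata entries)
row = {k: v for k, v in features.items() if not k.startswith("diagnostics_")}
pd.DataFrame([row]).to_csv("features_sub-001.tsv", sep="\t", index=False)
```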
This step cleans and prepares the radiomic features for machine learning.
Navigate to the preprocessing directory:
```bash
cd ../2_preprocess_data
```

Create and activate the conda environment:
```bash
conda env create -f environment.yml
conda activate jupyter_venv
```

Open the Jupyter notebook:
```bash
jupyter notebook preprocess.ipynb
```

Or use JupyterLab:

```bash
jupyter lab preprocess.ipynb
```

The notebook performs the following steps (a sketch of these operations follows the list):
- Data Loading: Load radiomic features from Step 1, or use the original dataset from Zenodo (automatically downloaded by `preprocess.ipynb`)
- Quality Control: Remove features with missing values or low variance
- Feature Selection: Select relevant features based on statistical tests
- Normalization: Standardize or normalize feature values
- Data Splitting: Create train/test splits
- Export: Save processed data for classification
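A minimal sketch of these operations, assuming pandas and scikit-learn and an illustrative `group` label column; the notebook itself is the authoritative implementation:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the radiomic features produced in Step 1
df = pd.read_csv("data/df_processed_example.tsv", sep="\t")

# 'group' as the FEP-vs-control label column is an assumption for this sketch
y = df["group"]
X = df.drop(columns=["group"]).select_dtypes("number")

# Quality control: drop features with missing values or (near-)zero variance
X = X.dropna(axis=1)
vt = VarianceThreshold(threshold=1e-8).fit(X)
X = X.loc[:, vt.get_support()]

# Train/test split (40% test mirrors --ratios 40), then standardize using training statistics only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
```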
Processed features are saved to `data/features_outcome_df_processed_true.tsv`.
This file is used as input for the classification step.
This step trains machine learning models and evaluates their performance with explainability techniques.
Navigate to the classification directory:
```bash
cd ../3_classification
```

Create and activate the conda environment:
```bash
conda env create -f environment.yml
conda activate pyclassification
```

Run the pipeline locally:

```bash
./local_run_train_and_evaluate.sh
```

This executes the complete classification pipeline with the following components (a cross-validation sketch follows the list):
- Multiple ML algorithms (Random Forest, SVM, Logistic Regression, etc.)
- Cross-validation
- Hyperparameter tuning
- Model evaluation
- Explainability analysis (SHAP, LIME)
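As a rough illustration of the cross-validation and hyperparameter tuning listed above (not the exact code in `1_train_and_evaluate.py`), assuming scikit-learn and an illustrative `group` label column:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Input produced by Step 2; drop the label (and any ID columns) before modelling
df = pd.read_csv("features_outcome_df_processed_true.tsv", sep="\t")
y = df["group"]
X = df.drop(columns=["group"]).select_dtypes("number")

# Hyperparameter tuning with an inner cross-validation loop
param_grid = {"n_estimators": [200, 500], "max_depth": [None, 5, 10]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)

# Outer cross-validation estimates generalization performance (nested CV)
scores = cross_val_score(search, X, y, cv=5, scoring="roc_auc")
print(f"Nested CV ROC-AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```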
To run on an HPC cluster, submit the SLURM job instead:

```bash
sbatch sbatch_run_train_and_evaluate.sh
```

The classification pipeline consists of three main scripts:
```bash
python 1_train_and_evaluate.py \
    --csv features_outcome_df_processed_true.tsv \
    --calculate_differences \
    --fine_tune_best_model \
    --results_base ../4_results_viewer/data \
    --ratios 40 \
    -v
```

Parameters:

- `--csv`: Input features file
- `--calculate_differences`: Compute statistical differences between groups
- `--fine_tune_best_model`: Perform hyperparameter optimization
- `--results_base`: Output directory for results
- `--ratios`: Train/test split ratio (e.g., 40 = 40% test)
- `-v`: Verbose output
Output:
- Model performance metrics (accuracy, precision, recall, F1-score, AUC)
- Cross-validation results
- Feature importance rankings
- Confusion matrices
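These metrics can be computed with scikit-learn as in the sketch below; `evaluate` is a hypothetical helper, and the actual script may compute them differently:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# y_test, y_pred and y_prob (probability of the FEP class) are assumed to come
# from a fitted model, e.g. y_prob = model.predict_proba(X_test)[:, 1]
def evaluate(y_test, y_pred, y_prob):
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_prob),
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }
```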
The second script analyzes statistical differences between model performances.
The third script retrains the best-performing model with optimized hyperparameters and generates:
- Final model predictions
- SHAP values for feature importance
- LIME explanations for individual predictions
- Calibration plots
- ROC curves
The pipeline trains and compares the following models:

- Logistic Regression
- Support Vector Machine (SVM)
- Random Forest
- Gradient Boosting
- XGBoost
- Neural Networks
Explainability is provided through the following techniques (a SHAP sketch follows the list):

- SHAP (SHapley Additive exPlanations): Global and local feature importance
- LIME (Local Interpretable Model-agnostic Explanations): Instance-level explanations
- Feature Importance: Model-specific feature rankings
- Partial Dependence Plots: Feature effect visualization
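For example, a minimal SHAP sketch using one of the listed tree-based models; `X_train`, `y_train`, and `X_test` are assumed to come from the Step 2 split, and this is not the pipeline's exact code:

```python
import shap
from sklearn.ensemble import GradientBoostingClassifier

# Fit one of the listed models on the training split from Step 2
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# TreeExplainer gives exact SHAP values for tree ensembles (log-odds space here)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Global importance: summary (beeswarm) plot over all held-out subjects;
# passing a DataFrame with feature names yields labeled radiomic features
shap.summary_plot(shap_values, X_test)
```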
Results are saved to `../4_results_viewer/data/`:

- `model_results.pkl`: Trained models and predictions
- `feature_importance.csv`: Feature importance scores
- `shap_values.pkl`: SHAP analysis results
- `performance_metrics.csv`: Model evaluation metrics
- `confusion_matrix.png`: Confusion matrix visualization
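These artifacts can be inspected directly, for example as below; the internal structure of the pickled objects is not documented here:

```python
import pickle
import pandas as pd

# Metrics and feature importances are plain tabular files
metrics = pd.read_csv("../4_results_viewer/data/performance_metrics.csv")
importance = pd.read_csv("../4_results_viewer/data/feature_importance.csv")

# Trained models/predictions and SHAP results are pickled Python objects
with open("../4_results_viewer/data/model_results.pkl", "rb") as f:
    model_results = pickle.load(f)
with open("../4_results_viewer/data/shap_values.pkl", "rb") as f:
    shap_values = pickle.load(f)

print(metrics.head())
```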
This step provides interactive visualization and analysis of model results.
Navigate to the results viewer directory:
```bash
cd ../4_results_viewer
```

Create and activate the conda environment:
```bash
conda env create -f enviroment.yml
conda activate jupyter_venv
```

Open the results notebook:
```bash
jupyter notebook results.ipynb
```

Or use JupyterLab:

```bash
jupyter lab results.ipynb
```

The notebook includes the following analyses (a small plotting sketch follows the list):
- Performance Metrics
  - Accuracy, precision, recall, F1-score
  - ROC curves and AUC scores
  - Precision-recall curves
  - Calibration plots
- Model Comparison
  - Side-by-side model performance
  - Statistical significance tests
  - Cross-validation stability
- Feature Analysis
  - Feature importance rankings
  - SHAP summary plots
  - SHAP dependence plots
  - Feature correlation heatmaps
- Explainability Visualizations
  - SHAP waterfall plots for individual predictions
  - LIME explanations
  - Decision boundary visualizations
  - Feature contribution plots
- Clinical Interpretability
  - Top discriminative features
  - ROI-specific analysis
  - Biomarker identification
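As one example of what the notebook renders, a minimal ROC-curve sketch; `y_test` and `y_prob` are assumed to be available from the saved predictions:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

# y_test: true FEP/control labels, y_prob: predicted probability of FEP
RocCurveDisplay.from_predictions(y_test, y_prob, name="Best model")
plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance level
plt.title("FEP vs. healthy controls: ROC curve")
plt.show()
```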
The notebook provides interactive widgets for:
- Selecting different models
- Filtering features by importance
- Exploring individual predictions
- Comparing feature effects across groups
Execute each step in order:
```bash
# Step 1: Feature Extraction
cd src/1_radiomics
conda env create -f enviroment.yml
conda activate radiomics
./local_run_pyradiomics.sh

# Step 2: Preprocessing
cd ../2_preprocess_data
conda env create -f environment.yml
conda activate jupyter_venv
jupyter notebook preprocess.ipynb
# Run all cells in the notebook, then close

# Step 3: Classification
cd ../3_classification
conda env create -f environment.yml
conda activate pyclassification
./local_run_train_and_evaluate.sh

# Step 4: Results Visualization
cd ../4_results_viewer
conda env create -f enviroment.yml
conda activate jupyter_venv
jupyter notebook results.ipynb
```

Once the environments are created, you can run the pipeline with:
```bash
# From project root
cd src/1_radiomics && conda activate radiomics && ./local_run_pyradiomics.sh && \
cd ../3_classification && conda activate pyclassification && ./local_run_train_and_evaluate.sh
```

Then manually run the preprocessing and results notebooks.
For high-performance computing:
```bash
# Step 1: Submit radiomics job
cd src/1_radiomics
sbatch run_radiomics.sh

# Wait for completion, then run preprocessing notebook

# Step 3: Submit classification job
cd ../3_classification
sbatch sbatch_run_train_and_evaluate.sh

# Wait for completion, then run results notebook
```

List the available conda environments:

```bash
conda env list
```

Activate the environment for each step:

```bash
conda activate radiomics          # For feature extraction
conda activate jupyter_venv       # For preprocessing and visualization
conda activate pyclassification   # For classification
```

Remove the environments when no longer needed:

```bash
conda env remove -n radiomics
conda env remove -n jupyter_venv
conda env remove -n pyclassification
```

Update an environment after its YAML file changes:

```bash
conda env update -f environment.yml --prune
```

Common issues and fixes:

- Environment Activation Fails

  ```bash
  eval "$(conda shell.bash hook)"
  conda activate <env_name>
  ```
- Missing Dependencies

  ```bash
  conda env update -f environment.yml
  ```

- Memory Issues During Classification
  - Reduce the number of features in preprocessing
  - Use feature selection techniques
  - Increase available RAM or use HPC

- SHAP/LIME Computation Slow
  - Reduce the number of samples for explanation
  - Use approximate SHAP methods
  - Run on HPC with more resources
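As a concrete illustration of the first two tips above, a hedged sketch that subsamples both the background data and the explained instances for a model-agnostic SHAP explainer (`model`, `X_train`, and `X_test` are assumed from earlier steps):

```python
import shap

# Summarize the training data to a small background set; KernelExplainer cost
# grows with background size, and 50 samples is an illustrative choice
background = shap.sample(X_train, 50)
explainer = shap.KernelExplainer(model.predict_proba, background)

# Explain only a subset of test subjects, with a reduced perturbation budget
shap_values = explainer.shap_values(X_test.iloc[:20], nsamples=200)
```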
Check log files for debugging:
- `src/1_radiomics/logs/logfile.txt`: Feature extraction logs
- `src/3_classification/log/`: Classification logs
If you use this pipeline in your research, please cite:
```bibtex
@software{fep_classification,
  title={First-Episode Psychosis Classification using MRI Radiomics},
  author={Your Name},
  year={2025},
  url={https://github.com/yourusername/First-episode_Psychosis_Clasification}
}
```

See the LICENSE file for details.
For questions or issues, please open an issue on GitHub or contact the maintainers.