This repository contains the analysis workflow for developing and validating a machine learning model to predict 18-month early distant recurrence (EDR-18) in stage III colon cancer.
The workflow covers:
- Feature selection (LASSO and XGBoost)
- Model development with nested cross-validation
- Risk stratification and Kaplan–Meier analysis
- External validation in an independent cohort
⚠️ Note: Raw clinical datasets are not included in this repository, to protect patient privacy. Users should provide their own datasets under the ./data/ directory.
/
├── notebooks
│ ├── 1.LASSO_XGBoost_Feature_Selection.ipynb
│ ├── 2.XGBoost_StageIII.ipynb
│ └── 3.External_Validation.ipynb
├── data/ # An example data schema is included; user-provided data also go here
├── model/ # Directory for saving model artifacts
├── README.md
└── requirements.txt
Notebook 1: 1.LASSO_XGBoost_Feature_Selection.ipynb
Purpose: Perform feature selection using LASSO logistic regression and XGBoost feature importance.
Key Steps:
- Load the derivation cohort from ./data/.
- Fit LASSO models to identify sparse sets of predictors.
- Train XGBoost models and summarize feature importance across folds.
- Derive a parsimonious, clinically interpretable set of variables for downstream modeling.
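The fold-wise summary step above can be sketched as follows. This is a minimal illustration, assuming each fold yields a mapping from feature name to importance (for LASSO, the absolute coefficient could play the same role). The function name `summarize_importance`, the `min_folds` cutoff, and the placeholder feature names are hypothetical and are not taken from the notebook.

```python
from collections import defaultdict


def summarize_importance(fold_importances, min_folds):
    """Aggregate per-fold feature importances.

    fold_importances: list of dicts, one per fold, mapping feature -> importance.
    A feature missing from a fold's dict counts as importance 0 in that fold.
    Returns (rows, stable): rows sorted by selection frequency and then mean
    importance, and the features with non-zero importance in >= min_folds folds.
    """
    n_folds = len(fold_importances)
    scores_by_feature = defaultdict(list)
    for fold in fold_importances:
        for name, score in fold.items():
            scores_by_feature[name].append(score)

    rows = []
    for name, scores in scores_by_feature.items():
        rows.append({
            "feature": name,
            "n_selected": sum(1 for s in scores if s > 0),
            # Divide by the number of folds so missing folds contribute zero.
            "mean_importance": sum(scores) / n_folds,
        })
    rows.sort(key=lambda r: (-r["n_selected"], -r["mean_importance"]))
    stable = [r["feature"] for r in rows if r["n_selected"] >= min_folds]
    return rows, stable


# Placeholder values for illustration only (not real results).
example_folds = [
    {"feature_a": 0.5, "feature_b": 0.2},
    {"feature_a": 0.3, "feature_c": 0.1},
    {"feature_a": 0.4, "feature_b": 0.0},
]
rows, stable = summarize_importance(example_folds, min_folds=2)
for row in rows:
    print(row)
print("stable features:", stable)
```

Ranking by selection frequency before mean importance favors predictors that recur across folds, which is one reasonable way to arrive at a parsimonious set; the notebook may use a different rule.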
Notebook 2: 2.XGBoost_StageIII.ipynb
Purpose: Develop the final four-variable model for EDR-18 prediction in stage III colon cancer.
Key Steps:
- Load the derivation cohort from ./data/.
- Implement nested cross-validation to:
  - Tune XGBoost hyperparameters.
  - Generate out-of-fold (OOF) predictions.
- Calibrate predicted probabilities (e.g., isotonic regression).
- Define a fixed decision threshold based on the OOF Youden index.
- Construct Kaplan–Meier curves for:
  - All stage III patients.
  - Subgroups of interest (e.g., AJCC stage IIIB).
- Export model artifacts and OOF predictions for external use.
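The thresholding step above can be sketched in plain Python. The Youden index is J = sensitivity + specificity - 1, and the fixed cutoff is the candidate threshold that maximizes J on the OOF predictions. The function name `youden_threshold`, the convention that a probability at or above the threshold means high risk, and the toy values are assumptions for illustration; the notebook's own implementation may differ.

```python
def youden_threshold(y_true, y_prob):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    y_true: 0/1 outcomes (1 = event). y_prob: predicted probabilities.
    A case is classified as high risk when y_prob >= threshold.
    Ties in J keep the lowest threshold.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both outcome classes are required")

    best_threshold, best_j = None, -1.0
    for t in sorted(set(y_prob)):
        true_pos = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        true_neg = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < t)
        j = true_pos / n_pos + true_neg / n_neg - 1.0
        if j > best_j:
            best_threshold, best_j = t, j
    return best_threshold, best_j


# Toy OOF predictions for illustration only (not real data).
oof_outcome = [0, 0, 0, 1, 0, 1, 1, 1]
oof_prob = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90]
threshold, j = youden_threshold(oof_outcome, oof_prob)

# The threshold is then frozen and applied unchanged to new predictions.
new_prob = [0.12, 0.45]
high_risk = [p >= threshold for p in new_prob]
print(threshold, j, high_risk)
```

Deriving the cutoff only from OOF predictions keeps it independent of any data later used for validation, which is why it is described as fixed.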
Notebook 3: 3.External_Validation.ipynb
Purpose: Apply the finalized four-variable model to an independent external cohort.
Key Steps:
- Load the external validation cohort from ./data/.
- Apply the saved preprocessing and model pipeline without retraining.
- Evaluate performance metrics:
  - ROC-AUC
  - Brier score
  - Calibration performance
- Perform Cox regression for high- vs low-risk groups.
- Generate Kaplan–Meier curves in the external cohort (overall and subgroups).
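For reference, the discrimination and calibration metrics listed above have simple definitions that can be written without any library. The sketch below is illustrative only: the function names and toy values are assumptions, and the notebook most likely relies on library implementations instead. `mean_prediction_gap` is just one simple calibration summary (mean predicted risk minus observed event rate) and is not necessarily the calibration measure the notebook reports.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def roc_auc(y_true, y_prob):
    """Probability that a random event case is scored above a random
    non-event case, counting ties as one half."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes are required")
    wins = 0.0
    for p_event in pos:
        for p_non_event in neg:
            if p_event > p_non_event:
                wins += 1.0
            elif p_event == p_non_event:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_prediction_gap(y_true, y_prob):
    """Mean predicted risk minus observed event rate.

    Positive values indicate overestimated risk on average.
    """
    n = len(y_true)
    return sum(y_prob) / n - sum(y_true) / n


# Toy values for illustration only (not real data).
outcome = [0, 0, 1, 1]
prob = [0.1, 0.4, 0.35, 0.8]
print("ROC-AUC:", roc_auc(outcome, prob))
print("Brier score:", brier_score(outcome, prob))
print("Mean prediction gap:", mean_prediction_gap(outcome, prob))
```

The pairwise AUC loop is quadratic in cohort size, which is fine for a sketch but slower than a rank-based implementation on large cohorts.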
The notebooks assume the following (or similar) structure for input data files:
./data/
    stageIII_derivation.xlsx # Derivation cohort (not included)
    stageIII_external.xlsx # External validation cohort (not included)
You may adjust filenames in the notebooks as needed. Since real clinical data cannot be shared, users must replace these with their own datasets containing equivalent variables.
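Before running the notebooks, it can help to confirm that the expected input files are in place. The snippet below is a small sketch that checks for the two filenames listed above; the helper name `missing_inputs` is hypothetical, and the filename tuple should be edited if you renamed your datasets.

```python
from pathlib import Path

# Filenames as listed in this README; adjust if you use different names.
EXPECTED_FILES = ("stageIII_derivation.xlsx", "stageIII_external.xlsx")


def missing_inputs(data_dir="./data", expected=EXPECTED_FILES):
    """Return the expected filenames that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in expected if not (data_dir / name).is_file()]


missing = missing_inputs()
if missing:
    print("Missing input files under ./data/:", ", ".join(missing))
else:
    print("All expected input files found.")
```

This only checks that the files exist; it does not verify that they contain the variables the notebooks require.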
The analysis was developed using Python 3.11.
python -m venv .venv
source .venv/bin/activate # On macOS/Linux
# .venv\Scripts\activate # On Windows
pip install -r requirements.txt
- Place your derivation and external validation datasets under ./data/.
- Open the notebooks in JupyterLab, Jupyter Notebook, or VS Code.
- Run the notebooks in the following order:
  - 1.LASSO_XGBoost_Feature_Selection.ipynb
  - 2.XGBoost_StageIII.ipynb
  - 3.External_Validation.ipynb
- Review generated figures, metrics, and model outputs for reproduction or extension.
- All paths in the notebooks are relative (e.g., ./data/...) and do not include any hospital-specific directories.
- Raw clinical data are not included and must not be committed to this repository.
- Results may vary slightly if:
  - Random seeds are changed.
  - Library versions differ from those in requirements.txt.
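To check whether installed library versions match those listed in requirements.txt, something like the following could be run inside the virtual environment. It is a sketch that assumes requirements.txt uses simple `name==version` pins (the file's actual contents are not shown here); lines with other specifiers are skipped, and the helper names `parse_pins` and `find_mismatches` are hypothetical.

```python
from importlib import metadata
from pathlib import Path


def parse_pins(lines):
    """Extract {package: version} from 'name==version' requirement lines.

    Comments, blank lines, environment markers, and non-'==' specifiers
    are ignored.
    """
    pins = {}
    for raw in lines:
        line = raw.split("#", 1)[0]          # drop trailing comments
        line = line.split(";", 1)[0].strip()  # drop environment markers
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        name = name.split("[", 1)[0].strip()  # drop extras such as pkg[extra]
        pins[name] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {package: (pinned, installed)} where the versions differ.

    installed is None when the package is not installed.
    """
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


requirements = Path("requirements.txt")
if requirements.is_file():
    pins = parse_pins(requirements.read_text().splitlines())
    for name, (wanted, installed) in find_mismatches(pins).items():
        print(f"{name}: pinned {wanted}, installed {installed}")
```

Any mismatch printed here is a candidate explanation for small differences in results.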
If you use or adapt this workflow in your own research, please cite the corresponding manuscript (once published) and this repository.
Huang SF, et al. Ruling Out Early Distant Recurrence in Stage III Colon Cancer: A Parsimonious Machine Learning Model with External Validation [Manuscript in preparation]