This repository contains a simulation-based benchmark comparing regularised regression and machine-learning methods under different missing data mechanisms and data handling strategies.
The project is intended as a public, reproducible research-software example. It uses simulated data only. Restricted-use, institutional or confidential data are not included.
The benchmark evaluates how predictive models behave when covariate data are incomplete. The cleaned public workflow compares model performance for:
- full data before deletion,
- complete-case analysis after missingness,
- native handling of missing values where supported,
- multiple imputation followed by predictive modelling.
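The first three strategies differ only in which rows and values reach the model. The sketch below illustrates that on toy data using base R. The data frame and its column names (`y`, `x1`, `x2`) are assumptions made for illustration, not the repository's objects. Imputation is left out because it depends on the imputation package used.

```r
# Toy data: a binary outcome y and two covariates (illustrative names only).
set.seed(1)
full <- data.frame(
  y  = rbinom(200, 1, 0.5),
  x1 = rnorm(200),
  x2 = rnorm(200)
)

# Delete 60 of the 200 values of x2 to create an incomplete copy.
incomplete <- full
incomplete$x2[sample(200, 60)] <- NA

# Complete-case analysis: drop every row with at least one missing value.
complete_cases <- incomplete[complete.cases(incomplete), ]

# Native handling: keep the NAs and hand the matrix to a learner that accepts them.
x_native <- as.matrix(incomplete[, c("x1", "x2")])

nrow(full)            # 200 rows before deletion
nrow(complete_cases)  # 140 rows survive complete-case deletion
sum(is.na(x_native))  # 60 missing cells remain for native handling
```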
The current workflow contains reusable wrappers for:
- LASSO regression (glmnet, alpha = 1),
- Ridge regression (glmnet, alpha = 0),
- Elastic Net (glmnet, configurable alpha),
- XGBoost (xgboost, native missing-value handling),
- BART via bartMachine as an optional model.
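A wrapper for the three glmnet models could look roughly like the sketch below, where a single `alpha` argument switches between LASSO, Ridge and Elastic Net. The function name `fit_penalised` and its arguments are hypothetical; the repository's own wrappers may be organised differently. The sketch assumes a binary outcome and selects the penalty with `cv.glmnet`.

```r
# Hypothetical wrapper: the name and signature are assumptions for illustration.
# alpha = 1 gives LASSO, alpha = 0 gives Ridge, values in between give Elastic Net.
fit_penalised <- function(x_train, y_train, x_test, alpha = 1) {
  if (!requireNamespace("glmnet", quietly = TRUE)) {
    stop("Package 'glmnet' is required for this sketch.")
  }
  # Choose the penalty strength lambda by internal cross-validation.
  fit <- glmnet::cv.glmnet(x_train, y_train, family = "binomial", alpha = alpha)
  # Return predicted probabilities for the test rows.
  as.numeric(predict(fit, newx = x_test, s = "lambda.min", type = "response"))
}

# Usage on simulated data (runs only if glmnet is installed).
if (requireNamespace("glmnet", quietly = TRUE)) {
  set.seed(1)
  x <- matrix(rnorm(200 * 5), ncol = 5)
  y <- rbinom(200, 1, plogis(x[, 1] - x[, 2]))
  p_lasso <- fit_penalised(x[1:150, ], y[1:150], x[151:200, ], alpha = 1)
  p_ridge <- fit_penalised(x[1:150, ], y[1:150], x[151:200, ], alpha = 0)
}
```

Note that glmnet does not accept missing values in the predictor matrix, so a wrapper of this kind would be applied to full, complete-case or imputed data.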
The benchmark reports:
- AUC,
- MSE,
- RMSE,
- precision,
- recall,
- F1 score,
- accuracy.
By default, MSE and RMSE are computed from predicted probabilities, so the MSE corresponds to the Brier score. For direct comparison with older chapter scripts, the helper function also supports class-based MSE/RMSE, computed from the predicted class labels.
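To make the two MSE variants concrete, here is a minimal base-R sketch of the listed metrics. The function `compute_metrics`, its `mse_on` argument and the 0.5 threshold are assumptions for illustration and do not reproduce the repository's helper. AUC is computed with the rank-based (Mann-Whitney) formula.

```r
# Hypothetical helper: name, arguments and defaults are illustrative assumptions.
compute_metrics <- function(y, p, threshold = 0.5,
                            mse_on = c("probability", "class")) {
  mse_on <- match.arg(mse_on)
  pred_class <- as.integer(p >= threshold)

  # Confusion-matrix counts for the positive class.
  tp <- sum(pred_class == 1 & y == 1)
  fp <- sum(pred_class == 1 & y == 0)
  fn <- sum(pred_class == 0 & y == 1)

  precision <- if (tp + fp > 0) tp / (tp + fp) else NA_real_
  recall    <- if (tp + fn > 0) tp / (tp + fn) else NA_real_
  f1 <- if (is.na(precision) || is.na(recall) || precision + recall == 0) {
    NA_real_
  } else {
    2 * precision * recall / (precision + recall)
  }

  # Rank-based AUC: probability that a random positive outranks a random negative.
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  auc <- (sum(rank(p)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

  # MSE on probabilities (default) or on hard class labels.
  target <- if (mse_on == "probability") p else pred_class
  mse <- mean((y - target)^2)

  list(auc = auc, mse = mse, rmse = sqrt(mse), precision = precision,
       recall = recall, f1 = f1, accuracy = mean(pred_class == y))
}

y <- c(1, 0, 1, 1, 0, 0)
p <- c(0.9, 0.2, 0.6, 0.4, 0.7, 0.1)
m_prob  <- compute_metrics(y, p)                    # probability-based MSE
m_class <- compute_metrics(y, p, mse_on = "class")  # class-based MSE
```

With class-based MSE, each prediction contributes either 0 or 1, so the MSE equals the misclassification rate (one minus accuracy).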
ml-benchmarks-missing-data/
├── R/ # Reusable functions
├── scripts/ # Reproducible workflow scripts
├── config/ # Project configuration
Run the scripts from the repository root in this order:
source("scripts/01_generate_data.R")
source("scripts/02_run_benchmark.R")
source("scripts/03_summarise_results.R")
The default configuration runs only a small number of simulated datasets, which is enough for a quick demonstration. For a dissertation-style simulation, increase n_datasets in config/default.R.
The workflow is organised around small functions rather than one-off scripts:
- simulate complete data from a static probit data-generating process,
- impose MCAR or MAR missingness,
- optionally impute incomplete predictors,
- run k-fold cross-validation,
- compute predictive performance metrics,
- aggregate results across simulation replications.
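The steps above can be sketched end to end in base R. Everything in this block is an illustrative assumption: the function names, the coefficients, the 30% missingness rate and the use of plain logistic regression (`glm`) in place of the penalised and tree-based models. Imputation is omitted, so incomplete data are handled by complete-case deletion.

```r
# Probit data-generating process: y = 1 when a latent normal index is positive.
simulate_probit <- function(n, beta = c(0.5, 1, -1)) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  latent <- beta[1] + beta[2] * x1 + beta[3] * x2 + rnorm(n)
  data.frame(y = as.integer(latent > 0), x1 = x1, x2 = x2)
}

# MCAR: x2 is deleted with a constant probability.
# MAR: the deletion probability depends on the fully observed x1.
impose_missing <- function(dat, mechanism = c("MCAR", "MAR"), rate = 0.3) {
  mechanism <- match.arg(mechanism)
  prob <- if (mechanism == "MCAR") {
    rep(rate, nrow(dat))
  } else {
    plogis(qlogis(rate) + dat$x1)  # higher x1 means x2 is more likely missing
  }
  dat$x2[runif(nrow(dat)) < prob] <- NA
  dat
}

# k-fold cross-validated MSE on predicted probabilities, complete cases only.
cv_mse <- function(dat, k = 5) {
  dat <- dat[complete.cases(dat), ]
  fold <- sample(rep(seq_len(k), length.out = nrow(dat)))
  scores <- vapply(seq_len(k), function(i) {
    fit <- glm(y ~ x1 + x2, family = binomial(), data = dat[fold != i, ])
    p <- predict(fit, newdata = dat[fold == i, ], type = "response")
    mean((dat$y[fold == i] - p)^2)
  }, numeric(1))
  mean(scores)
}

# Three replications, then aggregation across them.
set.seed(42)
results <- do.call(rbind, lapply(1:3, function(r) {
  full <- simulate_probit(300)
  data.frame(
    replication = r,
    mse_full = cv_mse(full),
    mse_mcar = cv_mse(impose_missing(full, "MCAR")),
    mse_mar  = cv_mse(impose_missing(full, "MAR"))
  )
}))
summary_means <- colMeans(results[, -1])
```

Keeping each step in its own function means the missingness mechanism, the model and the number of replications can be varied independently.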
All paths are relative to the repository root. No private Windows, OneDrive or institutional paths are used.
MIT License. See LICENSE.