Machine Learning Benchmarks under Missing Data

This repository contains a simulation-based benchmark comparing regularised regression and machine-learning methods under different missing data mechanisms and data handling strategies.

The project is intended as a public, reproducible research-software example. It uses simulated data only. Restricted-use, institutional or confidential data are not included.

Aim

The benchmark evaluates how predictive models behave when covariate data are incomplete. The cleaned public workflow compares model performance for:

full data before deletion,
complete-case analysis after missingness,
native handling of missing values where supported,
multiple imputation followed by predictive modelling.
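
As a rough illustration of how the first three strategies differ in the rows they pass to a model, the base R sketch below builds a toy data frame and derives the complete-case and native-handling inputs. The data frame `d` and the object names are hypothetical and are not taken from the repository. Multiple imputation is left out because it depends on the imputation package the workflow uses.

```r
# Toy data (illustrative only): x2 has two missing values.
d <- data.frame(
  y  = c(1, 0, 1, 0, 1),
  x1 = c(0.2, -1.1, 0.5, 1.3, -0.4),
  x2 = c(NA, 0.7, NA, -0.2, 1.0)
)

# Complete-case analysis: drop every row with at least one missing value.
d_cc <- d[complete.cases(d), ]

# Native handling: keep all rows and leave NA in the predictor matrix,
# to be passed to a model that accepts missing values.
x_native <- as.matrix(d[, c("x1", "x2")])

# Share of rows that survive complete-case deletion.
retained <- nrow(d_cc) / nrow(d)
```

Here complete-case analysis keeps three of five rows, while the native-handling matrix keeps all five and carries the two `NA` entries forward.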

Models

The current workflow contains reusable wrappers for:

LASSO regression (glmnet, alpha = 1),
Ridge regression (glmnet, alpha = 0),
Elastic Net (glmnet, configurable alpha),
XGBoost (xgboost, native missing-value handling),
BART via bartMachine as an optional model.
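
The three penalised regressions differ only in the glmnet mixing parameter, so a single wrapper can cover them. The sketch below is a minimal example under stated assumptions: `fit_penalised`, the simulated inputs and the choice of `nfolds = 5` are hypothetical and may not match the wrappers in `R/`. Note that glmnet expects a complete predictor matrix, so such a fit would follow complete-case deletion or imputation.

```r
library(glmnet)

# Hypothetical wrapper: alpha = 1 is LASSO, alpha = 0 is Ridge,
# and values in between give an Elastic Net.
fit_penalised <- function(x, y, alpha = 1, nfolds = 5) {
  cv.glmnet(x, y, family = "binomial", alpha = alpha, nfolds = nfolds)
}

# Illustrative complete data with a binary outcome.
set.seed(1)
x <- matrix(rnorm(200 * 5), nrow = 200)
y <- rbinom(200, size = 1, prob = pnorm(x[, 1] - 0.5 * x[, 2]))

# Fit all three variants with the same wrapper.
alphas <- c(lasso = 1, ridge = 0, elastic_net = 0.5)
fits <- lapply(alphas, function(a) fit_penalised(x, y, alpha = a))

# Predicted probabilities at the cross-validated lambda.
p_lasso <- predict(fits$lasso, newx = x, s = "lambda.min", type = "response")
```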

Evaluation metrics

The benchmark reports:

AUC,
MSE,
RMSE,
precision,
recall,
F1 score,
accuracy.

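By default, MSE and RMSE are computed from predicted probabilities against the observed 0/1 outcome, so the MSE is the Brier score. For direct comparison with older chapter scripts, the helper function also supports class-based MSE/RMSE, computed from the predicted class labels.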
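
To make the two MSE variants concrete, here is a minimal base R sketch of the listed metrics for a binary outcome. The function name `compute_metrics`, its arguments and the 0.5 threshold are assumptions for illustration and are not necessarily those of the repository's helper.

```r
# Hypothetical helper: metrics for a 0/1 outcome y and predicted probabilities p.
compute_metrics <- function(y, p, threshold = 0.5,
                            mse_on = c("probability", "class")) {
  mse_on <- match.arg(mse_on)
  pred_class <- as.integer(p >= threshold)

  # AUC via the rank-sum (Mann-Whitney) formula; ties receive average ranks.
  n_pos <- sum(y == 1)
  n_neg <- sum(y == 0)
  auc <- (sum(rank(p)[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

  tp <- sum(pred_class == 1 & y == 1)
  fp <- sum(pred_class == 1 & y == 0)
  fn <- sum(pred_class == 0 & y == 1)
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)

  # On probabilities the MSE is the Brier score;
  # on class labels it equals 1 - accuracy.
  target <- if (mse_on == "probability") p else pred_class
  mse <- mean((y - target)^2)

  c(auc = auc, mse = mse, rmse = sqrt(mse),
    precision = precision, recall = recall,
    f1 = 2 * precision * recall / (precision + recall),
    accuracy = mean(pred_class == y))
}

y <- c(1, 0, 1, 1, 0, 0)
p <- c(0.9, 0.2, 0.6, 0.4, 0.7, 0.1)
m_prob  <- compute_metrics(y, p)                    # probability-based MSE
m_class <- compute_metrics(y, p, mse_on = "class")  # class-based MSE
```

Only `mse` and `rmse` change between the two calls; the remaining metrics depend on the ranking or on the thresholded classes. Precision and F1 are undefined (`NaN`) when no observation is predicted positive, which a production helper would need to handle.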

Repository structure

ml-benchmarks-missing-data/
├── R/                  # Reusable functions
├── scripts/            # Reproducible workflow scripts
└── config/             # Project configuration

Quick start

Run the scripts from the repository root in this order:

source("scripts/01_generate_data.R")
source("scripts/02_run_benchmark.R")
source("scripts/03_summarise_results.R")

The default configuration uses only a small number of simulated datasets, which keeps a demonstration run short. For a dissertation-style simulation, increase n_datasets in config/default.R.

Reproducibility

The workflow is organised around small functions rather than one-off scripts:

simulate complete data from a static probit data-generating process,
impose MCAR or MAR missingness,
optionally impute incomplete predictors,
run k-fold cross-validation,
compute predictive performance metrics,
aggregate results across simulation replications.
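
The first steps of this pipeline can be sketched in base R as follows. This is a minimal illustration, not the repository's implementation: the sample size, coefficients, missingness rates and object names are all assumed values chosen for the example.

```r
set.seed(1)
n <- 500

# Complete data: two covariates and a binary outcome from a probit model.
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = pnorm(-0.5 + 1.0 * x1 + 0.5 * x2))
complete <- data.frame(y = y, x1 = x1, x2 = x2)

# MCAR: every x2 value is deleted with the same probability (0.3 here).
mcar <- complete
mcar$x2[runif(n) < 0.3] <- NA

# MAR: the probability of deleting x2 depends only on the fully observed x1.
mar <- complete
mar$x2[runif(n) < pnorm(-0.5 + x1)] <- NA

# k-fold cross-validation: assign each row to one of k folds at random.
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# One train/test split; a full run would loop over all k folds.
train <- mar[fold != 1, ]
test  <- mar[fold == 1, ]
```

Under MCAR the missingness indicator is independent of all variables, whereas under the MAR mechanism shown here rows with larger `x1` are more likely to lose `x2`, which is what makes complete-case analysis potentially biased.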

All paths are relative to the repository root. No private Windows, OneDrive or institutional paths are used.

License

MIT License. See LICENSE.
