myberg/Machine-Learning-Benchmarks-under-Missing

Machine Learning Benchmarks under Missing Data

This repository contains a simulation-based benchmark comparing regularised regression and machine-learning methods under different missing data mechanisms and data handling strategies.

The project is intended as a public, reproducible research-software example. It uses simulated data only. Restricted-use, institutional or confidential data are not included.

Aim

The benchmark evaluates how predictive models behave when covariate data are incomplete. The cleaned public workflow compares model performance for:

  • full data before deletion,
  • complete-case analysis after missingness,
  • native handling of missing values where supported,
  • multiple imputation followed by predictive modelling.
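As a rough illustration of what complete-case analysis costs in sample size, the sketch below applies listwise deletion to a toy data frame using base R only. The data frame `toy` and its columns are made up for this example and are not part of the repository.

```r
# Toy data with missing covariate values (illustrative only, not repository data).
toy <- data.frame(
  y  = c(1, 0, 1, 0, 1, 0),
  x1 = c(0.5, NA, 1.2, -0.3, 0.8, NA),
  x2 = c(1.1, 0.4, NA, -0.7, 0.2, 0.9)
)

# Complete-case analysis keeps only the rows with no missing values.
complete_rows <- complete.cases(toy)
toy_cc <- toy[complete_rows, , drop = FALSE]

# Share of rows that survive listwise deletion.
retained <- nrow(toy_cc) / nrow(toy)
```

Here three of the six rows contain at least one missing value, so half of the sample is discarded before any model is fitted.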

Models

The current workflow contains reusable wrappers for:

  • LASSO regression (glmnet, alpha = 1),
  • Ridge regression (glmnet, alpha = 0),
  • Elastic Net (glmnet, configurable alpha),
  • XGBoost (xgboost, native missing-value handling),
  • BART via bartMachine as an optional model.
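The three glmnet-based models differ only in the mixing parameter alpha. As a minimal sketch, the function below evaluates the elastic-net penalty in the form documented for glmnet, lambda * (alpha * ||beta||_1 + (1 - alpha) / 2 * ||beta||_2^2). The function name `elastic_net_penalty` is hypothetical and is not one of the repository's wrappers.

```r
# Elastic-net penalty as documented for glmnet (hypothetical helper, base R only).
# alpha = 1 gives the LASSO (L1) penalty, alpha = 0 the ridge (L2) penalty.
elastic_net_penalty <- function(beta, lambda, alpha) {
  stopifnot(alpha >= 0, alpha <= 1, lambda >= 0)
  lambda * (alpha * sum(abs(beta)) + (1 - alpha) / 2 * sum(beta^2))
}

beta <- c(2, -1, 0)
pen_lasso <- elastic_net_penalty(beta, lambda = 1, alpha = 1)    # pure L1
pen_ridge <- elastic_net_penalty(beta, lambda = 1, alpha = 0)    # pure L2
pen_enet  <- elastic_net_penalty(beta, lambda = 1, alpha = 0.5)  # equal mix
```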

Evaluation metrics

The benchmark reports:

  • AUC,
  • MSE,
  • RMSE,
  • precision,
  • recall,
  • F1 score,
  • accuracy.

By default, MSE and RMSE are computed from predicted probabilities, so the MSE is the Brier score. For direct comparison with older chapter scripts, the helper function also supports class-based MSE/RMSE, computed from the predicted class labels instead.
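The metrics listed above could be computed along the following lines. This is a base-R sketch, not the repository's helper: the function name `binary_metrics`, the 0.5 classification threshold and the choice to return both MSE variants side by side are assumptions made for illustration. AUC is obtained from the rank-sum (Mann-Whitney) identity.

```r
# Hypothetical metric helper for a binary outcome y (0/1) and predicted probabilities p.
binary_metrics <- function(y, p, threshold = 0.5) {
  stopifnot(length(y) == length(p), all(y %in% c(0, 1)))
  pred <- as.integer(p >= threshold)  # predicted class labels

  tp <- sum(pred == 1 & y == 1)
  fp <- sum(pred == 1 & y == 0)
  fn <- sum(pred == 0 & y == 1)
  tn <- sum(pred == 0 & y == 0)

  precision <- if (tp + fp > 0) tp / (tp + fp) else NA_real_
  recall    <- if (tp + fn > 0) tp / (tp + fn) else NA_real_
  f1 <- if (is.na(precision) || is.na(recall) || precision + recall == 0) {
    NA_real_
  } else {
    2 * precision * recall / (precision + recall)
  }

  # AUC via the rank-sum identity; ties receive average ranks.
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  auc <- if (n1 > 0 && n0 > 0) {
    (sum(rank(p)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
  } else {
    NA_real_
  }

  mse_prob  <- mean((y - p)^2)     # probability-based MSE (Brier score)
  mse_class <- mean((y - pred)^2)  # class-based MSE

  c(auc = auc, mse_prob = mse_prob, rmse_prob = sqrt(mse_prob),
    mse_class = mse_class, rmse_class = sqrt(mse_class),
    precision = precision, recall = recall, f1 = f1,
    accuracy = (tp + tn) / length(y))
}

m <- binary_metrics(y = c(1, 0, 1, 0), p = c(0.9, 0.2, 0.4, 0.6))
```

For a 0/1 outcome the class-based MSE equals the misclassification rate, which is why it is usually larger and coarser than the probability-based version.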

Repository structure

ml-benchmarks-missing-data/
├── R/                  # Reusable functions
├── scripts/            # Reproducible workflow scripts
├── config/             # Project configuration

Quick start

Run the scripts from the repository root in this order:

source("scripts/01_generate_data.R")
source("scripts/02_run_benchmark.R")
source("scripts/03_summarise_results.R")

The default configuration uses only a small number of simulated datasets, which keeps a demonstration run short. For a dissertation-scale simulation, increase n_datasets in config/default.R.

Reproducibility

The workflow is organised around small functions rather than one-off scripts:

  1. simulate complete data from a static probit data-generating process,
  2. impose MCAR or MAR missingness,
  3. optionally impute incomplete predictors,
  4. run k-fold cross-validation,
  5. compute predictive performance metrics,
  6. aggregate results across simulation replications.
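The first steps of this workflow can be sketched in base R as follows. Everything here is an assumption made for illustration: the sample size, the two covariates, the probit coefficients, the missingness rates, and the helper names `impose_mcar` and `impose_mar` are not taken from the repository's functions or configuration.

```r
set.seed(1)
n <- 200

# Step 1: complete data from a static probit model, P(y = 1 | x) = pnorm(eta).
x1 <- rnorm(n)
x2 <- rnorm(n)
eta <- 0.5 + 1.0 * x1 - 0.8 * x2   # arbitrary illustrative coefficients
y <- rbinom(n, size = 1, prob = pnorm(eta))
dat <- data.frame(y = y, x1 = x1, x2 = x2)

# Step 2a: MCAR, every value has the same probability of being deleted.
impose_mcar <- function(x, prop) {
  x[runif(length(x)) < prop] <- NA
  x
}

# Step 2b: MAR, the deletion probability depends on a fully observed driver.
impose_mar <- function(x, driver, intercept = -1, slope = 1) {
  p_miss <- plogis(intercept + slope * driver)
  x[runif(length(x)) < p_miss] <- NA
  x
}

dat_mcar <- transform(dat, x1 = impose_mcar(x1, prop = 0.3))
dat_mar  <- transform(dat, x1 = impose_mar(x1, driver = x2))

# Step 4: balanced random fold assignment for k-fold cross-validation.
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))
```

Under MAR, only x1 is deleted and its missingness is driven by x2, which stays fully observed, so the mechanism depends on observed data alone.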

All paths are relative to the repository root. No private Windows, OneDrive or institutional paths are used.

License

MIT License. See LICENSE.

About

Simulation study comparing Elastic Net, LASSO, Ridge Regression, BART and XGBoost under different missing data mechanisms.
