An asynchronous checkpointing engine for large language model post-training.
DeltaCheck is motivated by an empirical observation about post-training dynamics: a large fraction of training states exhibit a previously underexplored stable-changing behavior. These states keep changing, but usually only within bounded ranges and with limited impact on convergence and final model quality. Based on this observation, DeltaCheck implements a stability-aware incremental checkpointing system that significantly reduces checkpoint storage and latency overhead while preserving model accuracy and recovery correctness.
DeltaCheck operates in two phases:
In the offline phase, DeltaCheck profiles post-training states to identify:
- the stable start step
- the stable-changing state set
It then determines which states must be persisted and which ones can be reused or reconstructed incrementally in later checkpoints.
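The offline profiling pass described above could be sketched as follows. This is an illustrative reconstruction, not DeltaCheck's actual interface: the function name `profile_stability`, the snapshot format (a list of per-step `{state_name: array}` dicts), and the relative-change threshold are all assumptions.

```python
import numpy as np

def profile_stability(snapshots, rel_threshold=1e-3):
    """Hypothetical offline profiling pass (illustrative, not DeltaCheck's API).

    snapshots: list of {state_name: np.ndarray} captured at consecutive steps.
    Returns a candidate stable start step and the stable-changing state set:
    states whose late-training step-to-step changes stay within rel_threshold.
    """
    names = snapshots[0].keys()
    # Maximum relative step-to-step change per state, per step.
    rel_change = {
        name: [
            np.abs(snapshots[t][name] - snapshots[t - 1][name]).max()
            / (np.abs(snapshots[t - 1][name]).max() + 1e-12)
            for t in range(1, len(snapshots))
        ]
        for name in names
    }
    # Stable-changing set: changes stay bounded over the second half of profiling.
    stable_set = {
        name for name, c in rel_change.items()
        if max(c[len(c) // 2:]) < rel_threshold
    }
    # Stable start step: first step after which every stable state stays bounded.
    stable_start = 0
    for name in stable_set:
        for t, c in enumerate(rel_change[name]):
            if c >= rel_threshold:
                stable_start = max(stable_start, t + 1)
    return stable_start, stable_set
```

Once the stable set and start step are known, only states outside the set (or checkpoints taken before the start step) would need to be persisted in full.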
In the online phase, DeltaCheck applies a stage-aware checkpointing policy to reduce both:
- intra-state redundancy
- inter-state redundancy across checkpoints
This allows DeltaCheck to persist only the information that is truly necessary during training, greatly reducing checkpoint size and checkpoint latency.
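One way to picture incremental persistence is the sketch below: unstable states are written in full, while stable-changing states are stored as sparse deltas against the previous checkpoint. The function names, the `(kind, ...)` record layout, and the absolute threshold are illustrative assumptions, not DeltaCheck's actual on-disk format.

```python
import numpy as np

def incremental_checkpoint(prev, curr, stable_set, abs_threshold=1e-6):
    """Hypothetical online pass: full tensors for unstable states, sparse
    deltas (changed flat indices + new values) for stable-changing ones."""
    record = {}
    for name, tensor in curr.items():
        if name not in stable_set or name not in prev:
            record[name] = ("full", tensor.copy())
        else:
            changed = np.flatnonzero(np.abs(tensor - prev[name]) > abs_threshold)
            record[name] = ("delta", changed, tensor.flat[changed].copy())
    return record

def restore(prev, record):
    """Rebuild a full state dict from a base state plus one incremental record."""
    state = {}
    for name, entry in record.items():
        if entry[0] == "full":
            state[name] = entry[1].copy()
        else:
            _, idx, vals = entry
            t = prev[name].copy()
            t.flat[idx] = vals
            state[name] = t
    return state
```

When most elements of a stable-changing state move by less than the bound between checkpoints, the delta record is far smaller than the full tensor, which is where the size and latency savings come from.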
This repository contains the core implementation and experimental workflow for DeltaCheck, including:
- stability analysis methods for post-training state dynamics
- the design and implementation of DeltaCheck checkpointing policies
- adaptations and integration around `checkpointing.py` (newly added) and `supervised_finetuning`
- analysis, experiments, and workflows - evaluation workflows for checkpoint size, checkpoint latency, recovery correctness, and model quality
Key features:
- stability-aware checkpointing for LLM post-training
- incremental checkpoint persistence
- reduction of both intra-state and inter-state redundancy
- preserved model accuracy and recovery correctness
- compatibility with DeepSpeed / Megatron-style training workflows
Across multiple post-training workloads, DeltaCheck achieves:
- up to 99.25% reduction in checkpoint size
- up to 99.92% reduction in checkpoint latency
while preserving model accuracy and recovery correctness.
```bash
git clone https://gitee.com/iscas-research/deltacheck.git
cd datastates/
pip install . -v
```

```bash
# Simple PyTorch test (DeepSpeed not required)
python datastates/tests/test_ckpt_engine.py

# DeepSpeed-style workflow/config test
python datastates/tests/test_deltacheck_llm.py
```

To integrate the current asynchronous checkpointing engine, a few lines need to be modified in the DeepSpeed repository. For now, please use the following DataStates DeepSpeed branch:
https://github.com/DataStates/DeepSpeed/tree/dev
The DeltaCheck implementation in this repository is intended to work with that branch.