An asynchronous checkpointing engine for large language model post-training.
DeltaCheck is motivated by an empirical observation about post-training dynamics: a large fraction of training states exhibit a previously underexplored stable-changing behavior. These states keep changing, but usually only within bounded ranges and with limited impact on convergence and final model quality. Based on this observation, DeltaCheck implements a stability-aware incremental checkpointing system that significantly reduces checkpoint storage and latency overhead while preserving model accuracy and recovery correctness.
DeltaCheck operates in two phases:
In the offline phase, DeltaCheck profiles post-training states to identify:
- the stable start step
- the stable-changing state set
It then determines which states must be persisted and which ones can be reused or reconstructed incrementally in later checkpoints.
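The offline profiling pass described above could be sketched as follows. This is an illustrative reconstruction, not DeltaCheck's actual interface: the function name `profile_stability`, the snapshot format (a list of per-step `{state_name: array}` dicts), and the relative-change threshold are all assumptions.

```python
import numpy as np

def profile_stability(snapshots, rel_threshold=1e-3):
    """Hypothetical offline profiling pass (illustrative, not DeltaCheck's API).

    snapshots: list of {state_name: np.ndarray} captured at consecutive steps.
    Returns a candidate stable start step and the stable-changing state set:
    states whose late-training step-to-step changes stay within rel_threshold.
    """
    names = snapshots[0].keys()
    # Maximum relative step-to-step change per state, per step.
    rel_change = {
        name: [
            np.abs(snapshots[t][name] - snapshots[t - 1][name]).max()
            / (np.abs(snapshots[t - 1][name]).max() + 1e-12)
            for t in range(1, len(snapshots))
        ]
        for name in names
    }
    # Stable-changing set: changes stay bounded over the second half of profiling.
    stable_set = {
        name for name, c in rel_change.items()
        if max(c[len(c) // 2:]) < rel_threshold
    }
    # Stable start step: first step after which every stable state stays bounded.
    stable_start = 0
    for name in stable_set:
        for t, c in enumerate(rel_change[name]):
            if c >= rel_threshold:
                stable_start = max(stable_start, t + 1)
    return stable_start, stable_set
```

Once the stable set and start step are known, only states outside the set (or checkpoints taken before the start step) would need to be persisted in full.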
In the online phase, DeltaCheck applies a stage-aware checkpointing policy to reduce both:
- intra-state redundancy
- inter-state redundancy across checkpoints
This allows DeltaCheck to persist only the information that is truly necessary during training, greatly reducing checkpoint size and checkpoint latency.
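One way to picture incremental persistence is the sketch below: unstable states are written in full, while stable-changing states are stored as sparse deltas against the previous checkpoint. The function names, the `(kind, ...)` record layout, and the absolute threshold are illustrative assumptions, not DeltaCheck's actual on-disk format.

```python
import numpy as np

def incremental_checkpoint(prev, curr, stable_set, abs_threshold=1e-6):
    """Hypothetical online pass: full tensors for unstable states, sparse
    deltas (changed flat indices + new values) for stable-changing ones."""
    record = {}
    for name, tensor in curr.items():
        if name not in stable_set or name not in prev:
            record[name] = ("full", tensor.copy())
        else:
            changed = np.flatnonzero(np.abs(tensor - prev[name]) > abs_threshold)
            record[name] = ("delta", changed, tensor.flat[changed].copy())
    return record

def restore(prev, record):
    """Rebuild a full state dict from a base state plus one incremental record."""
    state = {}
    for name, entry in record.items():
        if entry[0] == "full":
            state[name] = entry[1].copy()
        else:
            _, idx, vals = entry
            t = prev[name].copy()
            t.flat[idx] = vals
            state[name] = t
    return state
```

When most elements of a stable-changing state move by less than the bound between checkpoints, the delta record is far smaller than the full tensor, which is where the size and latency savings come from.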
This repository contains the core implementation and experimental workflow for DeltaCheck, including:
- stability analysis methods for post-training state dynamics
- the design and implementation of DeltaCheck checkpointing policies
- adaptations and integration around `checkpointing.py` (newly added) and `supervised_finetuning`
- analysis, experiments, and workflows - evaluation workflows for checkpoint size, checkpoint latency, recovery correctness, and model quality
Key features:
- stability-aware checkpointing for LLM post-training
- incremental checkpoint persistence
- reduction of both intra-state and inter-state redundancy
- preserved model accuracy and recovery correctness
- compatibility with DeepSpeed / Megatron-style training workflows
Across multiple post-training workloads, DeltaCheck achieves:
- up to 99.25% reduction in checkpoint size
- up to 99.92% reduction in checkpoint latency
while preserving model accuracy and recovery correctness.
```bash
git clone https://gitee.com/iscas-research/deltacheck.git
cd datastates/
pip install . -v
```

```bash
# Simple PyTorch test (DeepSpeed not required)
python datastates/tests/test_ckpt_engine.py

# DeepSpeed-style workflow/config test
python datastates/tests/test_deltacheck_llm.py
```

To integrate the current asynchronous checkpointing engine, a few lines need to be modified in the DeepSpeed repository. For now, please use the following DataStates DeepSpeed branch:
https://github.com/DataStates/DeepSpeed/tree/dev
The DeltaCheck implementation in this repository is intended to work with that branch.