iscas-system/deltacheck
DeltaCheck

An asynchronous checkpointing engine for large language model post-training.

DeltaCheck is motivated by an empirical observation about post-training dynamics: a large fraction of training states exhibit a previously underexplored stable-changing behavior. These states keep changing, but usually only within bounded ranges and with limited impact on convergence and final model quality. Building on this observation, DeltaCheck provides a stability-aware incremental checkpointing system that significantly reduces checkpoint storage and latency overhead while preserving model accuracy and recovery correctness.

Core idea

DeltaCheck operates in two phases:

1. Offline analysis phase

In the offline phase, DeltaCheck profiles post-training states to identify:

  • the stable start step
  • the stable-changing state set

It then determines which states must be persisted and which ones can be reused or reconstructed incrementally in later checkpoints.
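The README does not spell out how the profiler classifies states. As a rough illustration only, assuming per-state snapshots are available as NumPy arrays, a stability analysis in this spirit could mark a state as stable-changing once its relative step-to-step change stays small for several consecutive steps (the function names, threshold, and window below are hypothetical, not the actual DeltaCheck API):

```python
import numpy as np

def relative_change(prev, curr, eps=1e-12):
    # L2-norm relative change between two snapshots of a state tensor.
    return np.linalg.norm(curr - prev) / (np.linalg.norm(prev) + eps)

def analyze_stability(snapshots, threshold=0.05, window=3):
    """Classify a state as stable-changing once its relative change stays
    below `threshold` for `window` consecutive profiled steps.

    snapshots: dict mapping state name -> list of np.ndarray (one per step).
    Returns (global_stable_start_step, stable_changing_state_set).
    """
    stable_start = {}
    for name, series in snapshots.items():
        run = 0
        for step in range(1, len(series)):
            if relative_change(series[step - 1], series[step]) < threshold:
                run += 1
                if run >= window:
                    stable_start[name] = step - window + 1
                    break
            else:
                run = 0  # a large jump resets the streak
    stable_set = set(stable_start)
    # The global stable start step is reached once every stable state settles.
    global_start = max(stable_start.values()) if stable_start else None
    return global_start, stable_set
```

On this sketch, any state that never sustains a small-change streak (e.g. one oscillating between distant values) is left out of the stable-changing set and would always be persisted in full.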

2. Online checkpointing phase

In the online phase, DeltaCheck applies a stage-aware checkpointing policy to reduce both:

  • intra-state redundancy
  • inter-state redundancy across checkpoints

This allows DeltaCheck to persist only the information that is truly necessary during training, greatly reducing checkpoint size and checkpoint latency.
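The exact on-disk delta format is not described here. As a minimal sketch of the general idea, assuming NumPy-style state tensors and a previously persisted baseline, an incremental policy could save unstable states in full and stable-changing states as sparse deltas, dropping elements whose change falls within a tolerance (every name and parameter below is illustrative):

```python
import numpy as np

def incremental_checkpoint(state, baseline, stable_set, tol=1e-3):
    """Build a payload that persists only what is necessary.

    Unstable states are saved in full; stable-changing states are saved as
    a sparse delta against the last persisted baseline, omitting elements
    whose change is within `tol` (those are reused from the baseline).
    """
    payload = {"full": {}, "delta": {}}
    for name, tensor in state.items():
        if name in stable_set and name in baseline:
            diff = tensor - baseline[name]
            idx = np.nonzero(np.abs(diff) > tol)
            payload["delta"][name] = (idx, diff[idx])
        else:
            payload["full"][name] = tensor.copy()
    return payload

def recover(payload, baseline):
    # Rebuild the full state from the baseline plus an incremental payload.
    state = {k: v.copy() for k, v in payload["full"].items()}
    for name, (idx, vals) in payload["delta"].items():
        t = baseline[name].copy()
        t[idx] = baseline[name][idx] + vals
        state[name] = t
    return state
```

Recovery then reconstructs each stable-changing state to within `tol` of its true value, which matches the paper's premise that bounded drift in these states has limited impact on convergence and final quality.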

What this repository contains

This repository contains the core implementation and experimental workflow for DeltaCheck, including:

  • stability analysis methods for post-training state dynamics
  • the design and implementation of DeltaCheck checkpointing policies
  • adaptations and integration around checkpointing.py
  • newly added supervised_finetuning analysis, experiments, and workflows
  • evaluation workflows for checkpoint size, checkpoint latency, recovery correctness, and model quality

Key features

  • stability-aware checkpointing for LLM post-training
  • incremental checkpoint persistence
  • reduction of both intra-state and inter-state redundancy
  • preserved model accuracy and recovery correctness
  • compatibility with DeepSpeed / Megatron-style training workflows

Results

Across multiple post-training workloads, DeltaCheck achieves:

  • up to 99.25% reduction in checkpoint size
  • up to 99.92% reduction in checkpoint latency

while preserving model accuracy and recovery correctness.

Install and test

git clone https://gitee.com/iscas-research/deltacheck.git
cd deltacheck/
pip install . -v

# Simple PyTorch test (DeepSpeed not required)
python datastates/tests/test_ckpt_engine.py

# DeepSpeed-style workflow/config test
python datastates/tests/test_deltacheck_llm.py

DeepSpeed integration

Integrating the current asynchronous checkpointing engine requires modifying a few lines in the DeepSpeed repository. For now, please use the following DataStates DeepSpeed branch:

https://github.com/DataStates/DeepSpeed/tree/dev

The DeltaCheck implementation in this repository is intended to work with that branch.

About

[SoCC '26, under review] Stability-Aware Incremental Checkpointing for LLM Post-Training
