[Draft] Add ZeRO-3 elastic checkpoint save/load support #8031

Draft
nathon-lee wants to merge 28 commits into deepspeedai:master from nathon-lee:feat/zero3-elastic-checkpoint

Conversation

@nathon-lee
Contributor

Summary

This PR adds initial support for elastic checkpoints in ZeRO Stage 3.

Today, ZeRO-3 checkpoints are tied to the data-parallel world size used at save time. If the checkpoint is later loaded with a different DP world size, loading optimizer states fails. This PR extends ZeRO-3 checkpoint save/load so that checkpoints can be saved with one DP world size and resumed with another.

At a high level, the implementation:

  • adds an elastic ZeRO-3 checkpoint format for optimizer state,
  • auto-detects elastic vs. rigid checkpoint format on load,
  • re-merges and re-partitions optimizer state for the current DP world size,
  • preserves fp32 flat groups so existing weight reconstruction paths continue to work,
  • adds unit tests and documentation for the new behavior.

Motivation

Elastic checkpointing is useful when:

  • resuming training after preemption on a different number of GPUs,
  • scaling training up or down between runs,
  • restoring a checkpoint on smaller hardware for evaluation/debugging.

ZeRO Stage 1/2 already support elastic checkpointing through elastic_checkpoint; this PR aims to extend similar functionality to ZeRO Stage 3.

Design

Save path

Previously, ZeRO-3 rejected elastic_checkpoint=True with NotImplementedError.

This PR changes the save path to emit an elastic checkpoint format for ZeRO-3:

  • FP32_FLAT_GROUPS are still saved in the padded flat layout,
  • optimizer state is saved as lean per-parameter shards with padding stripped,
  • the elastic format is identified by the presence of BASE_OPTIMIZER_STATE.

The intent is:

  • optimizer state can be reconstructed independent of the original DP world size,
  • existing fp32 weight recovery workflows remain compatible.
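The padding-stripping idea on the save path can be sketched as follows. This is a simplified illustration in plain Python lists, not the code in this PR: it assumes each parameter is split evenly across DP ranks with zero padding at the tail, and the helper names (`partition_size`, `padded_shard`, `lean_shard`) are hypothetical.

```python
import math


def partition_size(numel, world_size):
    """Uniform per-rank partition length for one parameter (assumed layout)."""
    return math.ceil(numel / world_size)


def padded_shard(values, rank, world_size):
    """Slice of a parameter owned by `rank`, zero-padded to the uniform length."""
    part = partition_size(len(values), world_size)
    chunk = values[rank * part:(rank + 1) * part]
    return chunk + [0.0] * (part - len(chunk))


def lean_shard(padded, numel, rank, world_size):
    """Drop trailing padding so only real elements of the parameter remain."""
    part = partition_size(numel, world_size)
    valid = max(0, min(part, numel - rank * part))
    return padded[:valid]


# A 5-element optimizer-state tensor saved from 4 DP ranks.
state = [1.0, 2.0, 3.0, 4.0, 5.0]
padded = [padded_shard(state, r, 4) for r in range(4)]
lean = [lean_shard(p, len(state), r, 4) for r, p in enumerate(padded)]
```

In this toy layout, the last two ranks hold padding that the lean shards drop, which is what makes the saved optimizer state independent of the save-time world size.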

Load path

On load, the code now detects whether the checkpoint is rigid or elastic.

For elastic checkpoints, it:

  1. reads the per-rank optimizer shards,
  2. merges lean shards back into full per-parameter tensors,
  3. re-partitions them for the current rank under the current DP world size,
  4. restores optimizer state and fp32 master weights,
  5. syncs fp16 partitions from the restored fp32 state.

This allows loading a ZeRO-3 elastic checkpoint with a different DP world size than the one used at save time.
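Steps 2 and 3 above (merge, then re-partition) can be illustrated with a minimal sketch. It uses plain Python lists instead of tensors and assumes an even split with zero padding at the tail; `merge_lean_shards` and `repartition` are hypothetical names, not functions from this PR.

```python
import math


def merge_lean_shards(shards):
    """Concatenate per-rank lean shards (in rank order) into the full tensor."""
    full = []
    for shard in shards:
        full.extend(shard)
    return full


def repartition(full, rank, world_size):
    """Slice the full tensor for `rank` under the new world size, padding the tail."""
    part = math.ceil(len(full) / world_size)
    chunk = full[rank * part:(rank + 1) * part]
    return chunk + [0.0] * (part - len(chunk))


# Lean shards of one 5-element state tensor, saved from 4 DP ranks.
saved_shards = [[1.0, 2.0], [3.0, 4.0], [5.0], []]

# Resume on 2 DP ranks: merge first, then slice for each new rank.
full_state = merge_lean_shards(saved_shards)
new_partitions = [repartition(full_state, r, 2) for r in range(2)]
```

Because the merge step recovers the full per-parameter tensor, the number of saved shards and the number of ranks at load time do not need to match.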

User-facing behavior

Enable with:

{
  "zero_optimization": {
    "stage": 3,
    "elastic_checkpoint": true
  }
}

Then save_checkpoint() / load_checkpoint() work as usual.

The load path also auto-detects elastic checkpoint contents via BASE_OPTIMIZER_STATE, so an elastic checkpoint can still be loaded even if the runtime config does not explicitly set elastic_checkpoint=true.
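The key-presence check behind this auto-detection might look roughly like the sketch below. The function names and the dictionary layout are hypothetical, and the string value assigned to `BASE_OPTIMIZER_STATE` here is an assumption for illustration; the real code would use the existing DeepSpeed constant.

```python
# Assumed key string, for illustration only.
BASE_OPTIMIZER_STATE = "base_optimizer_state"


def is_elastic_zero3_checkpoint(zero_state_dict):
    """Treat a checkpoint as elastic if it carries the elastic-format marker key."""
    return BASE_OPTIMIZER_STATE in zero_state_dict


def choose_load_path(zero_state_dict):
    """Pick the load path from checkpoint contents, not from the runtime config."""
    return "elastic" if is_elastic_zero3_checkpoint(zero_state_dict) else "rigid"


# Hypothetical checkpoint contents for the two formats.
elastic_ckpt = {BASE_OPTIMIZER_STATE: {"state": {}}, "fp32_flat_groups": []}
rigid_ckpt = {"optimizer_state_dict": {}, "fp32_flat_groups": []}
```

Keying the decision on checkpoint contents means a checkpoint written with elastic_checkpoint enabled remains loadable even when the flag is absent from the resuming job's config.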

Tests

This PR adds coverage for:

  • ZeRO-3 elastic checkpoint round-trip with the same world size,
  • loading without optimizer states,
  • elastic vs. rigid state dict format checks,
  • auto-detection of elastic checkpoint format on load,
  • cross-world-size restore (save on 4 GPUs, load on 2 GPUs),
  • load_from_fp32_weights=True path.

Documentation

Adds a ZeRO-3 elastic checkpoint section to the ZeRO tutorial covering:

  • what the feature does,
  • how to enable it,
  • how the elastic save/load flow works,
  • an example of changing world size between save and resume,
  • current limitations.

Notes / limitations

This is being opened as a Draft PR to get feedback on the format and load-path design before calling it merge-ready.

A few caveats / follow-ups I want to highlight explicitly:

  • the implementation is primarily validated on the Adam-based ZeRO-3 optimizer path,
  • cross-world-size testing currently covers only a 4 -> 2 restore (save on 4 GPUs, load on 2 GPUs); other combinations, including scale-up, are not yet tested,
  • the swap_optimizer / NVMe offload path is supported in code but has not been deeply stress-tested across world-size changes,
  • I would especially appreciate feedback on the checkpoint format choice and load-path compatibility expectations.

Files of interest

  • deepspeed/runtime/zero/stage3.py
  • deepspeed/runtime/engine.py
  • tests/unit/checkpoint/test_zero_optimizer.py
  • docs/_tutorials/zero.md

Request for feedback

I’d appreciate feedback on:

  1. whether the elastic checkpoint format is acceptable for ZeRO-3,
  2. whether the load-time auto-detection approach is the right compatibility model,
  3. whether there are additional checkpoint / optimizer edge cases that should be covered before this is considered ready.

Copilot AI and others added 28 commits February 27, 2026 06:30
This reverts commit ff88670.

Co-authored-by: nathon-lee <248585198+nathon-lee@users.noreply.github.com>
Revert "fix: update 1 file reformatted." (ff88670)
Revert accidental Muon optimizer code re-introduction from copilot PRs
Signed-off-by: nathon-lee <leejianwoo@gmail.com>

fix: fix some format errs by tool

fix: fix some format errs by tool

…test

feat(zero): implement elastic checkpoint support for ZeRO-3
Signed-off-by: nathon-lee <leejianwoo@gmail.com>
…test

docs(zero): document ZeRO-3 elastic checkpoint support