[Draft] Add ZeRO-3 elastic checkpoint save/load support #8031
Draft
nathon-lee wants to merge 28 commits into
This reverts commit ff88670. Co-authored-by: nathon-lee <248585198+nathon-lee@users.noreply.github.com>
Revert "fix: update 1 file reformatted." (ff88670)
This reverts commit b90aee5.
Revert accidental Muon optimizer code re-introduction from copilot PRs
Signed-off-by: nathon-lee <leejianwoo@gmail.com> fix: fix some format errs by tool Signed-off-by: nathon-lee <leejianwoo@gmail.com>
…test feat(zero): implement elastic checkpoint support for ZeRO-3
…test docs(zero): document ZeRO-3 elastic checkpoint support
Summary
This PR adds initial support for elastic checkpoints in ZeRO Stage 3.
Today, ZeRO-3 checkpoints are tied to the data-parallel world size used at save time. If the checkpoint is later loaded with a different DP world size, loading optimizer states fails. This PR extends ZeRO-3 checkpoint save/load so that checkpoints can be saved with one DP world size and resumed with another.
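To make the world-size dependence concrete, here is a minimal, illustrative sketch in plain Python. It is not DeepSpeed's actual code: it assumes a flat state buffer is split into equal, zero-padded shards, one per data-parallel rank, and the helper `partition_flat` and the toy sizes are hypothetical.

```python
import math


def partition_flat(flat, world_size):
    """Split a flat list into `world_size` equal shards, zero-padding the tail.

    Hypothetical helper for illustration only; not a DeepSpeed API.
    """
    shard_len = math.ceil(len(flat) / world_size)
    padded = flat + [0.0] * (shard_len * world_size - len(flat))
    return [padded[r * shard_len:(r + 1) * shard_len] for r in range(world_size)]


# Stand-in for one flattened optimizer-state buffer with 10 elements.
flat_state = [float(i) for i in range(10)]

shards_dp4 = partition_flat(flat_state, 4)  # layout a 4-rank job would write
shards_dp3 = partition_flat(flat_state, 3)  # layout a 3-rank job would expect

print([len(s) for s in shards_dp4])  # [3, 3, 3, 3]
print([len(s) for s in shards_dp3])  # [4, 4, 4]
```

Under these assumptions, both the shard length and the amount of tail padding change with the world size, so shards written by one job cannot be handed one-to-one to ranks of a differently sized job.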
At a high level, the implementation:
Motivation
Elastic checkpointing is useful when:
ZeRO Stage 1/2 already support elastic checkpointing through elastic_checkpoint; this PR aims to extend similar functionality to ZeRO Stage 3.
Design
Save path
Previously, ZeRO-3 rejected elastic_checkpoint=True with NotImplementedError.
This PR changes the save path to emit an elastic checkpoint format for ZeRO-3:
FP32_FLAT_GROUPS are still saved in the padded flat layout,
BASE_OPTIMIZER_STATE.
The intent is:
Load path
On load, the code now detects whether the checkpoint is rigid or elastic.
For elastic checkpoints, it:
This allows loading a ZeRO-3 elastic checkpoint with a different DP world size than the one used at save time.
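As a rough illustration of what such a load path has to do, the sketch below detects an elastic checkpoint by the presence of a key and then re-partitions saved shards for a new world size. It is plain Python under stated assumptions, not the PR's implementation: the string value of `BASE_OPTIMIZER_STATE`, the nested `exp_avg` layout, and the helpers `is_elastic` and `reshard` are all hypothetical.

```python
import math

# Assumption: the PR refers to a constant named BASE_OPTIMIZER_STATE; the
# string value used here is a guess made for illustration.
BASE_OPTIMIZER_STATE = "base_optimizer_state"


def is_elastic(optimizer_sd):
    """Treat a checkpoint as elastic if it carries the base optimizer state."""
    return BASE_OPTIMIZER_STATE in optimizer_sd


def reshard(saved_shards, numel, new_world_size):
    """Merge per-rank shards, drop padding, and re-split for a new world size.

    Hypothetical helper; `numel` is the unpadded element count.
    """
    # 1. Concatenate shards in rank order to rebuild the padded flat buffer.
    merged = [x for shard in saved_shards for x in shard]
    # 2. Strip the padding that was added for the old world size.
    merged = merged[:numel]
    # 3. Re-pad and split evenly for the new world size.
    shard_len = math.ceil(numel / new_world_size)
    merged += [0.0] * (shard_len * new_world_size - numel)
    return [merged[r * shard_len:(r + 1) * shard_len] for r in range(new_world_size)]


# Toy checkpoint: 10 real elements saved by 4 ranks (last shard holds 2 pads).
saved = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0], [9.0, 0.0, 0.0]]
checkpoint = {BASE_OPTIMIZER_STATE: {"exp_avg": saved}}

new_shards = None
if is_elastic(checkpoint):
    new_shards = reshard(checkpoint[BASE_OPTIMIZER_STATE]["exp_avg"],
                         numel=10, new_world_size=2)

print(new_shards)
```

The sketch needs the unpadded element count to be known at load time. Without it, padding written for the old world size cannot be told apart from real state values.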
User-facing behavior
Enable with:
{
  "zero_optimization": {
    "stage": 3,
    "elastic_checkpoint": true
  }
}
Then save_checkpoint() / load_checkpoint() work as usual.
The load path also auto-detects elastic checkpoint contents via BASE_OPTIMIZER_STATE, so an elastic checkpoint can still be loaded even if the runtime config does not explicitly set elastic_checkpoint=true.
Tests
This PR adds coverage for:
load_from_fp32_weights=True path.
Documentation
Adds a ZeRO-3 elastic checkpoint section to the ZeRO tutorial covering:
Notes / limitations
This is being opened as a Draft PR to get feedback on the format and load-path design before calling it merge-ready.
A few caveats / follow-ups I want to highlight explicitly:
swap_optimizer / NVMe offload path is supported in code but has not been deeply stress-tested across world-size changes,
Files of interest
deepspeed/runtime/zero/stage3.py
deepspeed/runtime/engine.py
tests/unit/checkpoint/test_zero_optimizer.py
docs/_tutorials/zero.md
Request for feedback
I’d appreciate feedback on: