Skip to content

Fix DeepCompile ZeRO-1 grad target lifetime#8036

Draft
tohtana wants to merge 3 commits into
deepspeedai:masterfrom
tohtana:tohtana/fix/deepcompile-zero1-flat-release
Draft

Fix DeepCompile ZeRO-1 grad target lifetime#8036
tohtana wants to merge 3 commits into
deepspeedai:masterfrom
tohtana:tohtana/fix/deepcompile-zero1-flat-release

Conversation

@tohtana
Copy link
Copy Markdown
Collaborator

@tohtana tohtana commented May 29, 2026

DeepCompile ZeRO-1 kept compile-time reduce target buffers alive into the optimizer step, causing backward gradient storage to overlap with optimizer temporaries. This PR fixes the issue by making DeepCompile ZeRO-1 reduce targets follow the normal step-local ZeRO partition gradient-buffer lifetime, instead of preserving cloned target storage from compile setup.

The actual code changes are:

  • During compile initialization, register empty DeepCompile ZeRO-1 gradient targets, then bind them to the step-local flat ZeRO partition gradient buffer and per-parameter views when gradients are ready to synchronize.
  • After ZeRO-1 builds the optimizer-facing fp32 gradient partition, release the DeepCompile registry references and clear reduce bucket storage after backward synchronization.
  • Keep this branch limited to the flat-buffer lifetime fix, without the experimental direct-reduce path or allocator fallback behavior.
Step 10-30 avg sec Peak alloc GiB
Without this PR 0.858 43.594
Without this PR 0.859 39.366

Fine-tuning style training (8xH100, Qwen3-8B random weights, bs/GPU=1, seq=4096, GAS=1) showed only finite value losses for 1000 steps.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
@tohtana tohtana marked this pull request as ready for review May 29, 2026 22:08
@tohtana tohtana requested review from loadams and tjruwase as code owners May 29, 2026 22:08
@tohtana tohtana requested a review from eternalNight May 29, 2026 22:08
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1947a25a80

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread csrc/includes/deepcompile.h Outdated
@tohtana tohtana marked this pull request as draft May 29, 2026 23:40
tohtana added 2 commits May 29, 2026 23:28
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant