Add AutoEP#5
Correctly handle `ds_grad_is_ready` in ZeRO2 --------- Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
The current code has the following issues:
- `use_default_specs: false` doesn't work
- Injection by the traditional pattern runs even when custom patterns are set
- `mpu` needs to be passed to `deepspeed.initialize` (HF integration doesn't pass mpu)

This PR fixes AutoTP setup to respect `use_default_specs: false` and disable the traditional injection path when custom patterns are enabled. Also, when `mpu` is not passed, we create a TP group in the initialization process.

With these changes, the [related tests](https://github.com/deepspeedai/DeepSpeed/tree/master/tests/unit/model_parallelism) pass and [all AutoTP examples](https://github.com/tohtana/DeepSpeedExamples/tree/tohtana/custom_auto_tp/training/tensor_parallel) in DeepSpeedExamples work now ([PR](deepspeedai/DeepSpeedExamples#998)).

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fd07c93a5e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cabfebcdca
    if self.return_router_logits:
        logits = self._cached_router_logits
        self._cached_router_logits = None
Populate router logits when returning tuple output
When _detect_forward_contract sets return_router_logits=True for legacy MoE blocks (router_logits_capture_target == "moe_block"), _register_logit_hook is not installed and _cached_router_logits is never set. The forward path then returns (output, None) here, which breaks callers that expect actual router logits (e.g., OutputRecorder/z-loss paths that rely on the second return value). This only shows up for models using the MoE-block tuple contract, but in that case the logits are silently missing.
The current metaclasses for layers and parameters access annotations in a way that is incompatible with Python 3.14+. See:
- [Python 3.14 release notes](https://docs.python.org/3/whatsnew/3.14.html)
- [Porting annotations](https://docs.python.org/3/whatsnew/3.14.html#whatsnew314-porting-annotations)
- [PEP649](https://peps.python.org/pep-0649/) and [PEP749](https://peps.python.org/pep-0749/)

This PR uses `annotationlib` from Python 3.14 onwards and keeps backwards compatibility.

Closes deepspeedai#7673. This should unblock CF builds for py3.14: conda-forge/deepspeed-feedstock#114

One question: does DeepSpeed officially support 3.14 yet? Should we test it in CI?

---------

Signed-off-by: Santi Villalba <sdvillal@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
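A minimal sketch of the version-gated lookup described above. It is not the PR's metaclass code: `read_annotations` and the `Layer` class are hypothetical names used only for illustration, and the sketch only covers reading a finished class's own annotations.

```python
import sys


def read_annotations(cls):
    """Return a class's own annotations as a dict on old and new Pythons."""
    if sys.version_info >= (3, 14):
        # Python 3.14 evaluates annotations lazily (PEP 649/749);
        # annotationlib is the supported way to read them.
        import annotationlib
        return dict(annotationlib.get_annotations(cls))
    # Older interpreters store annotations eagerly in the class namespace.
    return dict(cls.__dict__.get("__annotations__", {}))


class Layer:  # hypothetical example class
    hidden_size: int
    name: str


print(read_annotations(Layer))
```

Keeping the `annotationlib` import inside the version check is what lets the same module import cleanly on interpreters that do not ship it.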
## Bug
`fractions.gcd` was deprecated in Python 3.5 and removed in Python 3.9. This causes an `AttributeError` on Python 3.9+.
## Fix
Replaced `fractions.gcd` with `math.gcd`, which is the standard replacement.
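For reference, a small sketch of the replacement call. The `lcm` helper is a hypothetical caller added for illustration; the commit message does not say where DeepSpeed used `fractions.gcd`.

```python
import math

# fractions.gcd no longer exists on Python 3.9+; math.gcd (added in 3.5)
# takes the same two integer arguments.
print(math.gcd(12, 18))


def lcm(a: int, b: int) -> int:
    """Least common multiple built on math.gcd (hypothetical helper)."""
    if a == 0 or b == 0:
        return 0
    return abs(a * b) // math.gcd(a, b)


print(lcm(4, 6))
```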
Fixes: deepspeedai#7837

ZeRO-0 + bf16 has two bugs in `engine.py`:
1. `FP16_UnfusedOptimizer` applies `dynamic_loss_scale` with `cur_scale=65536` but `engine.backward()` never scales the loss, so `step()` divides gradients by 65536.
2. `_take_model_step` skips `zero_grad` for bf16 without ZeRO, causing gradient accumulation.

Fix: disable loss scaling for bf16 and remove the `zero_optimization()` gate on `zero_grad`.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
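The two failure modes can be illustrated with plain arithmetic. This is a toy model, not DeepSpeed code: `step_grad` and `run_steps` are hypothetical helpers, and only the scale value 65536 comes from the commit message.

```python
CUR_SCALE = 65536.0  # initial dynamic loss scale named in the commit message


def step_grad(raw_grad, loss_was_scaled, scale):
    """Gradient the optimizer applies after dividing by `scale`."""
    grad = raw_grad * scale if loss_was_scaled else raw_grad
    return grad / scale


def run_steps(grads, zero_grad_each_step):
    """Gradient buffer contents seen by each optimizer step."""
    buf, seen = 0.0, []
    for g in grads:
        buf += g  # backward() accumulates into the gradient buffer
        seen.append(buf)
        if zero_grad_each_step:
            buf = 0.0
    return seen


# Bug 1: the loss is never scaled, but step() still unscales.
print(step_grad(0.5, loss_was_scaled=False, scale=CUR_SCALE))
# Fix: with loss scaling disabled (scale of 1), the gradient is unchanged.
print(step_grad(0.5, loss_was_scaled=False, scale=1.0))

# Bug 2: without zero_grad, each step sees the running sum.
print(run_steps([1.0, 1.0, 1.0], zero_grad_each_step=False))
print(run_steps([1.0, 1.0, 1.0], zero_grad_each_step=True))
```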
…peedai#7840) Fixes deepspeedai#7835. On torch==2.10.0, importing DeepSpeed emitted deprecation warnings from import-time JIT-decorated helpers. This change updates the compatibility path to align with PyTorch guidance while keeping import clean. --------- Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com> Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
…ost-0.18.6 Update version post release
This PR addresses deepspeedai#7677 by flattening parameter tensors on the accelerators instead of the CPU during ZeRO stage 1 and 2 initialization. This should alleviate CPU contention, with the caveat that the optimization is only used when there is enough VRAM to allocate a full copy of the parameter buffers.

On 8 x H100s and an Intel Xeon Platinum 8480+, profiling the initialization of DeepSpeed on 32 layers of `Qwen3-30B` with Z2 gives the following:

Old = ~382s
New = ~130s

-------------------------

If necessary, this optimization can be extended to allow a tiered system that trades off VRAM space against performance, which might look like the following:
```
if enough VRAM for 2x model_size:
    naive flatten
else if enough VRAM for model_size / N:
    distributed flatten across N devices
else:
    flatten on CPU
```
The distributed flatten would involve each device flattening a portion of the parameters and performing an all-gather to assemble the full flattened model. See deepspeedai#7677 for original discussion.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Kento Sugama <kentosugama@protonmail.ch>
Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>
Signed-off-by: vensen <vensenmu@gmail.com>
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: nathon <leejianwoo@gmail.com>
Co-authored-by: Vensen <vensenmu@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: jp <jsb10121249@gmail.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
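The tiered proposal in the pseudocode above could be sketched as a small selection function. This is only an illustration of the decision order: `choose_flatten_strategy`, its parameters, and the returned labels are hypothetical and not part of DeepSpeed.

```python
GIB = 1024 ** 3


def choose_flatten_strategy(free_vram_bytes, model_bytes, num_devices):
    """Pick a flatten tier following the pseudocode's three branches."""
    if free_vram_bytes >= 2 * model_bytes:
        return "naive"        # flatten the whole model on one device
    if num_devices > 0 and free_vram_bytes >= model_bytes / num_devices:
        return "distributed"  # each device flattens a shard, then all-gather
    return "cpu"              # fall back to flattening on the host


print(choose_flatten_strategy(80 * GIB, 30 * GIB, 8))
print(choose_flatten_strategy(40 * GIB, 30 * GIB, 8))
print(choose_flatten_strategy(2 * GIB, 30 * GIB, 8))
```

The sizes in the three calls are made-up values chosen to hit each branch once.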
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
This PR enables single-node shared-memory communication on Arm hosts (deepspeedai#7625).

---------

Signed-off-by: Phalani Paladugu <mailofphalani@gmail.com>
Added a new news entry about DeepSpeed ZeRO++ support for LLM distillation work at LinkedIn.
## Summary
Add support for LG AI Research's EXAONE 4.0 model family in DeepSpeed Inference V2. Closes deepspeedai#7453
## Changes
- New model implementation: `deepspeed/inference/v2/model_implementations/exaone4/`
  - `container.py`: Transformer and non-transformer parameter containers
  - `model.py`: Inference model with post-norm architecture and QK-Norm support
  - `policy.py`: Inference V2 policy
- Register EXAONE 4.0 in `engine_factory.py` and `__init__.py`
## Key architectural differences from Mistral/Llama
- **Post-norm**: RMSNorm is applied after attention/MLP outputs (not before), followed by residual addition
- **QK-Norm**: Per-head RMSNorm applied to Q and K projections after the QKV linear layer
- **Hybrid attention**: 32B model uses 3:1 sliding window/full attention ratio (via `layer_types` config)
## Supported models
- [EXAONE-4.0-1.2B](https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-1.2B) (all full attention)
- [EXAONE-4.0-32B](https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-32B) (hybrid sliding/full attention)

Requires `transformers >= 4.54.0`.
## Related
- Supersedes deepspeedai#7456 (draft, inactive for 6 months)

---------

Signed-off-by: Bias92 <pewpewplay315@gmail.com>
deepspeedai#7846) Fixes deepspeedai#7843

On HIP/ROCm (the AMD path), several CUDA-style BF16 intrinsics used in the code are not provided, e.g.:
- `__ll2bfloat16_rn`
- `__int2bfloat16_rn`
- `__short2bfloat16_rn`
- `__bfloat162uint_rn`

This causes compilation errors on HIP platforms. This PR introduces fallback paths using functions available on the HIP platform, mirroring the [conversion util in csrc](https://github.com/deepspeedai/DeepSpeed/blob/2c362837b0ef906ea7e7506bab3a625faa945cdd/csrc/includes/conversion_utils.h#L351).

The conversion paths are:
- int/uint -> bf16: convert to float (or double for 64-bit), then to bf16.
- bf16 -> int/uint: convert bf16 to float, then to the integer type.
- float -> bf16: build from bf16 via supported HIP helpers.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
`EvoformerAttnBuilder` returns `Path` instances from `include_paths`. These cause failures in `OpBuilder.builder` when they are passed to `strip_empty_entries`, which calls `len` on them; `len` is not defined for `Path` instances:
> TypeError: object of type 'PosixPath' has no len()

Fixes regression introduced in deepspeedai#7760

cc @sdvillal

Signed-off-by: Alexander Grund <alexander.grund@tu-dresden.de>
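The failure and one possible remedy can be reproduced with the standard library alone. The function below is a hypothetical stand-in written for this note; it is not DeepSpeed's `strip_empty_entries`, and the commit message does not say which side of the call the actual fix changed.

```python
import os
from pathlib import Path

# len() is not defined for Path objects, which is the reported TypeError.
try:
    len(Path("csrc") / "includes")
except TypeError as exc:
    print("TypeError:", exc)


def strip_empty_entries_sketch(entries):
    """Drop empty entries after normalizing path-like objects to str."""
    normalized = [os.fspath(e) for e in entries]
    return [e for e in normalized if len(e) > 0]


paths = [Path("csrc") / "includes", "", "third_party"]
print(strip_empty_entries_sketch(paths))
```

`os.fspath` accepts both `str` and `Path`, so callers that already pass strings are unaffected.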
…edai#7832) deepspeedai#7817 added a test to verify that we throw an error when parameters are modified in `GatheredParameters` and `modifier_rank` is None. However, the PR just checks devices and doesn't detect modifications on parameters. This causes an [error](https://github.com/deepspeedai/DeepSpeed/actions/runs/21653729382/job/62424014222) in our full test run. This PR adds the detection of parameter modifications to properly throw an error. --------- Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com> Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
PR deepspeedai#7839 introduced a regression by changing `TestZeroStaticScale` from `assert optim.dynamic_loss_scale == False` to `assert optim.loss_scale_config.dynamic_loss_scale == False`. `loss_scale_config` is not part of the ZeRO optimizer (only non-ZeRO optimizers have it), while this test runs with ZeRO optimizers. With this fix, `TestZeroStaticScale` now passes for stages 1/2/3.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
The full test workflow passed, though it is still flaky ([Success](https://github.com/deepspeedai/DeepSpeed/actions/runs/22269243373) / [Failure](https://github.com/deepspeedai/DeepSpeed/actions/runs/22266498530)).

This PR schedules a nightly run of the full test. It is launched only when there have been updates since the last successful run.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
…pspeedai#7874) Fix links and menu items for AutoTP doc

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Reduce retained AutoEP tests to critical path
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
## Summary
RFC: deepspeedai#7884

Wire `sdma_allgather` into ZeRO-3's parameter prefetch path (`_dist_allgather_fn`). When enabled, ZeRO-3 allgather routes through `mori_cpp.AllGatherIntoTensor` (intra-node SDMA copy on AMD MI300), with a transparent fallback to `dist.allgather_fn` (RCCL/NCCL) on init failure.

End-to-end demo + repro steps + verified numbers live in [`examples/sdma_allgather/README.md`](examples/sdma_allgather/README.md).

Headline (8x MI300X, DeepSpeed default ZeRO-3 buckets, 100 steps):

| | GPT-7B-ish | Qwen3-32B |
|---|---|---|
| SDMA off | 697.7 ms / step | 1402.5 ms / step |
| SDMA on | 622.0 ms / step | 1263.2 ms / step |
| **gain** | **+10.85 %** | **+9.93 %** |

Loss curves match off ↔ on, peak memory unchanged. Speedup is workload-dependent: gains shrink (or invert) when allgather can't be overlapped with compute.

Co-authored-by: wuyl1 <yangwu@amd.com>

---------

Signed-off-by: wuyl1 <yangwu@amd.com>
Signed-off-by: inkcherry <mingzhi.liu@amd.com>
Co-authored-by: wuyl1 <yangwu@amd.com>
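The commit message does not define "gain", but the headline percentages are reproduced by the relative reduction in step time, (off - on) / off. A short check under that assumption, with a hypothetical helper name:

```python
def step_time_reduction_pct(off_ms, on_ms):
    """Relative reduction in per-step time, in percent."""
    return (off_ms - on_ms) / off_ms * 100.0


# Step times are taken from the table above.
print(round(step_time_reduction_pct(697.7, 622.0), 2))
print(round(step_time_reduction_pct(1402.5, 1263.2), 2))
```

Note that this is a time reduction, not a throughput ratio; off / on - 1 would give a larger number for the same measurements.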
…lier Remove AutoEP backward loss multiplier
…-groups-mpu Use active MPU for AutoEP sequence-parallel size
…pspeedai#8005) Fixes deepspeedai#8003
## Summary
`FastFileWriter._fini()` overwrote `self._aio_fd = INVALID_FD` without calling `os.close()`, leaking one fd per save. With unlink-based checkpoint rotation this stranded the unlinked inode in the ext4 orphan list, fs blocks were never reclaimed, and long-running save loops hit ENOSPC at iter ~60 (60 GB/iter on a 4 TB partition).

This PR adds explicit `os.fsync()` + `os.close()` in `_fini()` and a regression test that asserts no `/proc/self/fd` entry points at a deleted file after a save+close+unlink cycle.
## Verification
- 20-iteration repro of `save() / close() / unlink()` leaked 20 fds before the fix, 0 after.
- 700-iter / 42 TB / 60 h endurance run on ext4/NVMe: `df_used` stable at 736 GB (drift +281 MB / 697 rotations) with the fix; same workload hit ENOSPC at iter ~60 without it.
- Performance impact: ~5% wall-time overhead from the added `os.fsync()` at ~10 GB/s peak.
## Test plan
- [x] New regression test `tests/unit/ops/aio/test_fast_file_writer_fd_close.py` verifies fd cleanup after a single save and after 5/20-iter rotation loops via `/proc/self/fd` scoped to `tmp_path`.
- [x] Gated on `async_io` compatibility, Linux, and CUDA accelerator so unsupported CI matrix entries skip cleanly.
- [x] Confirmed test FAILS without this PR's `_fini()` change and PASSES with it.
- [x] `pre-commit run --files <changed files>` clean.
## Notes
- The `__del__` assertion `assert self._aio_fd == INVALID_FD` passes even with the bug because it checks the Python attribute that `_fini` itself sets. The new test checks OS-level state via `/proc/self/fd`.
- `os.fsync()` is included for post-close durability, which is required for correctness on the unaligned-tail path that re-opens the file as buffered I/O. If maintainers prefer to drop it for performance, removing only the `os.fsync(...)` line still fixes the leak.

Happy to adjust shape, naming, or test placement to fit project conventions. Thanks for the review.
Signed-off-by: jg-heo <csjg.heo@gmail.com> Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
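A minimal sketch of the close pattern the fix describes, using only the standard library. `FileWriterSketch` and `fd_is_open` are hypothetical names; the real `FastFileWriter` uses async I/O and pinned memory, which this sketch omits. The check uses `os.fstat` rather than `/proc/self/fd` so it is not Linux-specific.

```python
import os
import tempfile

INVALID_FD = -1


class FileWriterSketch:
    """Toy writer that owns a raw file descriptor."""

    def __init__(self, path):
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)

    def write(self, data: bytes):
        os.write(self._fd, data)

    def close(self):
        if self._fd != INVALID_FD:
            os.fsync(self._fd)  # flush to storage before closing
            os.close(self._fd)  # release the descriptor at the OS level
            self._fd = INVALID_FD  # only now mark the attribute invalid


def fd_is_open(fd):
    """Check OS-level state rather than the Python attribute."""
    try:
        os.fstat(fd)
        return True
    except OSError:
        return False


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ckpt.bin")
    writer = FileWriterSketch(path)
    raw_fd = writer._fd
    writer.write(b"checkpoint bytes")
    writer.close()
    os.unlink(path)
    print("fd still open after close:", fd_is_open(raw_fd))
```

Resetting the attribute without the `os.close()` call would leave `fd_is_open(raw_fd)` true, which is the leak the regression test is designed to catch.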
Three changes to make Moonlight-16B-A3B (DeepSeek-V3 MoE) work with AutoEP + Muon + ZeRO-2: 1. e_score_correction_bias: copy the pretrained noaux_tc score-correction bias from the source gate into AutoEP routers and apply it in the TokenChoiceTopKRouter forward pass so expert selection matches the pretrained checkpoint. 2. is_expert_group: mark GroupedExperts w1/w2/w3 tensors with is_expert_group=True so Muon applies Newton-Schulz independently per expert slice rather than treating the stacked (E, I, O) tensor as a single matrix. muon_update grows an is_expert_group kwarg; all four call sites inside original_muon.py and the ZeRO-2 path in stage_1_and_2.py pass getattr(p, 'is_expert_group', False). 3. Muon + MoE param groups in engine.py: flatten dict-style param groups produced by configure_moe_param_groups before filtering by use_muon; re-tag optimizer flags after AutoEP layer replacement; add name keys for MoE group splitting; call split_params_into_different_moe_groups when the model has MoE layers. Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Two fixes addressing masahiro's review feedback on PR deepspeedai#7938:

1. Auto-fill AutoEPConfig from HF model config (auto_ep_config.py, auto_ep.py): add fill_autoep_config_from_hf() which maps HF field names to AutoEP internal names on AutoEP.__init__:
   - n_group -> num_expert_groups
   - topk_group -> num_limited_groups
   - routed_scaling_factor -> route_scale

   User-supplied values always take precedence. Without this, Moonlight (DeepSeek-V3) training used route_scale=1.0 instead of 2.446, producing systematically wrong MoE output magnitudes.
2. Restore batched Newton-Schulz in muon_update (original_muon.py): replace the per-expert Python loop with a single batched call to zeropower_via_newtonschulz5, which already supports ndim>=2 inputs. This restores GPU parallelism across all E experts per step.

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
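The field mapping and precedence rule in item 1 can be sketched with plain dictionaries. This is an illustration only: `fill_from_hf_sketch` is a hypothetical stand-in for `fill_autoep_config_from_hf()`, whose real signature and config types are not shown in the commit message, and the `n_group` / `topk_group` values are made up.

```python
# HF field name -> AutoEP internal name, as listed in the commit message.
HF_TO_AUTOEP = {
    "n_group": "num_expert_groups",
    "topk_group": "num_limited_groups",
    "routed_scaling_factor": "route_scale",
}


def fill_from_hf_sketch(user_cfg: dict, hf_cfg: dict) -> dict:
    """Fill missing AutoEP fields from the HF config; user values win."""
    filled = dict(user_cfg)
    for hf_name, ep_name in HF_TO_AUTOEP.items():
        if ep_name not in filled and hf_name in hf_cfg:
            filled[ep_name] = hf_cfg[hf_name]
    return filled


hf_cfg = {"n_group": 1, "topk_group": 1, "routed_scaling_factor": 2.446}
print(fill_from_hf_sketch({}, hf_cfg))
print(fill_from_hf_sketch({"route_scale": 1.5}, hf_cfg))
```

The second call shows the precedence rule: an explicitly supplied `route_scale` is left untouched while the other two fields are still filled in.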
Now that the per-expert Python loop is replaced with a single batched call to zeropower_via_newtonschulz5, muon_update has no dynamic control flow and can be compiled again. Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Signed-off-by: Guokai Ma <guokai.ma@intel.com>
DS4Sci EvoformerAttention currently depends on CUTLASS, but requiring users to manually set `CUTLASS_PATH` creates unnecessary friction for an otherwise standard extension build flow. This change makes CUTLASS discovery automatic while preserving `CUTLASS_PATH` as the explicit override. The discovery approach is based on PyTorch's CUDA detection pattern in `torch.utils.cpp_extension`: honor the explicit environment variable first, then infer from installed packages and conventional filesystem locations, and only fail with an actionable message when discovery cannot succeed. This improves first-run usability, CI behavior, editable installs, and package-based environments where CUTLASS may already be installed in a discoverable location. It also reduces setup divergence between users who clone CUTLASS manually and users who install NVIDIA's `nvidia-cutlass` package. DeepSpeed should already have had this because EvoformerAttention is part of DeepSpeed's extension-builder system, and extension builders should locate common build dependencies using predictable heuristics instead of requiring users to export paths manually. CUDA itself is not treated as "you must always set CUDA_HOME"; PyTorch attempts discovery first and uses the env var as a fallback. CUTLASS should follow the same principle here. --------- Signed-off-by: Max Tretikov <max@tretikov.com> Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com> Co-authored-by: Masahiro Tanaka <mtanaka@anyscale.com> Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
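The discovery order described above (explicit environment variable, then candidate locations, then an actionable error) could look roughly like the following. This is a sketch under assumptions: `find_cutlass_sketch` and its `candidates` parameter are hypothetical, the caller is assumed to supply package and filesystem locations, and the only layout check is for the `include/cutlass` directory found in a CUTLASS source tree.

```python
import os
from pathlib import Path


def find_cutlass_sketch(candidates=(), env=None):
    """Return a CUTLASS root, honoring CUTLASS_PATH before any inference."""
    env = os.environ if env is None else env
    explicit = env.get("CUTLASS_PATH")
    if explicit:
        return Path(explicit)  # the explicit override always wins
    for cand in candidates:
        root = Path(cand)
        # Accept a candidate only if it contains the CUTLASS headers.
        if (root / "include" / "cutlass").is_dir():
            return root
    raise RuntimeError(
        "CUTLASS not found: set CUTLASS_PATH or install the "
        "nvidia-cutlass package."
    )
```

Raising only after every candidate has been tried is what lets the error message be specific about the remaining options.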
…edai#7994)
## Summary
Fix critical severity security issue in `deepspeed/runtime/data_pipeline/data_sampling/data_analyzer.py`.
## Vulnerability
| Field | Value |
|-------|-------|
| **ID** | V-001 |
| **Severity** | CRITICAL |
| **Scanner** | multi_agent_ai |
| **Rule** | `V-001` |
| **File** | `deepspeed/runtime/data_pipeline/data_sampling/data_analyzer.py:75` |

**Description**: The data_analyzer.py file uses os.system() with an f-string that directly interpolates the variable metric_to_sample_fname into a shell command without any sanitization. This variable is derived from user-supplied dataset configuration or file paths. Because os.system() invokes a shell interpreter, any shell metacharacters in the variable (semicolons, backticks, dollar signs, pipes, ampersands) will be interpreted and executed as separate shell commands.
## Changes
- `deepspeed/runtime/data_pipeline/data_sampling/data_analyzer.py`
## Verification
- [x] Build passes
- [x] Scanner re-scan confirms fix
- [x] LLM code review passed

---

*Automated security fix by [OrbisAI Security](https://orbisappsec.com)*

Signed-off-by: orbisai0security <mediratta01.pally@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
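The commit message does not show the replacement code, so the following is only a general illustration of the usual remedy: pass arguments as a list to `subprocess.run` so that no shell ever parses the filename. The function name and the child command (a Python one-liner that echoes its argument) are hypothetical.

```python
import subprocess
import sys


def run_with_filename(fname: str) -> str:
    """Pass `fname` to a child process as one argument, with no shell."""
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", fname],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.rstrip("\n")


# With os.system(f"... {fname}") the part after ';' would run as a command.
hostile = "metric.bin; echo INJECTED"
print(run_with_filename(hostile))
```

Because the argument list bypasses the shell, the semicolon is just a character in the filename and the child receives the string unchanged.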
1. gram_newtonschulz: replace torch.addmm (2D only) with equivalent a*Q + Z@Q to support batched 3D expert weight tensors [num_local_experts, n, m]. Also fix diagonal() to specify dim1/dim2 for 3D tensors. 2. deepseek_v3 preset: remove e_score_correction_bias from unsupported_router_bias_names since auto_ep_layer.py already copies it correctly (lines 398-402). Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
…speedai#8009)
## Summary
Fixes deepspeedai#6961

ZeRO-3 forward crashes with `AttributeError: 'dict' object has no attribute '_in_forward'` since torch 2.5. PyTorch changed `nn.Module._parameters` from `OrderedDict` to plain `dict` (pytorch/pytorch#129164), and a plain `dict` does not allow attribute assignment.

DeepSpeed wraps every module into `ZeROOrderedDict` at engine init via `_inject_parameters`. Any module not present at that point keeps the plain dict and crashes the next forward. This includes a submodule attached after `deepspeed.initialize()` (PEFT/LoRA adapters), or a module restored by `deepspeed/compile/init_z3.py:35`.

The fix adds `ensure_zero_ordered_dict()` and calls it from the forward prologue. It wraps lazily, is idempotent, and keeps the original container so the deepcompile un-injection path still works. The epilogue gets an `isinstance` guard for modules that show up between the two hooks.

This only fixes the crash. Late-attached parameters are still not in the optimizer and not partitioned by ZeRO-3. For full ZeRO-3 semantics on a late adapter, build it inside `deepspeed.zero.Init()`.
## Tests
`tests/unit/runtime/zero/test_zero_late_module_attach.py`
- forward after attaching a Linear post-init, with `_parameters` forced to plain dict so the bug reproduces on any torch version
- repeated forwards do not re-wrap an already-wrapped module

Signed-off-by: Sung Hyun Cho <hope5487@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
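The underlying Python behavior, and the lazy idempotent wrap, can be shown without torch. `AttrDict` and `ensure_attr_dict` are hypothetical stand-ins for `ZeROOrderedDict` and `ensure_zero_ordered_dict()`; the real helper also keeps the original container, which this sketch does not model.

```python
class AttrDict(dict):
    """dict subclass: instances carry a __dict__, so attributes can be set."""


def ensure_attr_dict(params):
    """Wrap a plain dict lazily; return an existing wrapper unchanged."""
    if isinstance(params, AttrDict):
        return params
    return AttrDict(params)


plain = {"weight": 1.0}
try:
    plain._in_forward = True  # plain dict instances reject new attributes
except AttributeError as exc:
    print("plain dict:", exc)

wrapped = ensure_attr_dict(plain)
wrapped._in_forward = True
print("wrapped:", wrapped._in_forward, dict(wrapped))
print("idempotent:", ensure_attr_dict(wrapped) is wrapped)
```

Calling the helper from a forward prologue is safe to repeat because the `isinstance` check makes the second and later calls a no-op.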
## What Removes an obsolete paragraph from the `DeepSpeedCPUAdam` constructor docstring. ## Why The docstring described a `step()` option that "updates optimizer states and copies the parameters back to GPU at the same time" — the old `adam_update_copy` kernel, invoked via `step(fp16_param_groups=...)`. That fused-copy path no longer exists in the codebase: - `csrc/adam/cpu_adam.cpp` binds only `adam_update` (no `adam_update_copy`). - `DeepSpeedCPUAdam.step()` / `step_subgroup()` take no `fp16_param_groups` argument and only call `adam_update`. So the "two options" text is stale and misleading to anyone reading the API. Docstring-only change; no functional impact. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Signed-off-by: Lucas Pirola <lucas@pirola.eu> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
## Summary The modal test now shows the error because of the combination of PyTorch v2.7 and Transformer `main` branch, and it is blocking PRs. To address it, we improve our test workflows as follows. - Add manual dependency version inputs for the torch-latest CI workflows and default the torch-latest family to PyTorch 2.10 plus Transformers git `main`. - Let CPU and AWS full torch-latest runs select either released Transformers package versions or an explicit Transformers git ref for manual validation. - Let Modal torch-latest runs select supported PyTorch/CUDA image presets and an optional Transformers git ref, defaulting to `2.10.0-cuda12.8` and Transformers git `main`. ## Known follow-up - The AWS full real CI lane for PyTorch 2.10 plus Transformers main reached `Unit tests (parallel)` but failed with 33 failures. Some of these may overlap with fixes in deepspeedai#8015; I am opening this PR now so the workflow/input changes can be reviewed while those failures are handled separately. - CPU and Modal real CI validation for the requested tuple passed. --------- Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
) Run the FastFileWriter fd-close regression in the sequential CI bucket without pytest-forked. The test exercises the real torch.save() through FastFileWriter with async I/O and pinned memory. The scheduled AWS [full CI failure](deepspeedai#8015) happens before the fd-close assertion because the sequential bucket still runs under --forked, and CUDA-backed pinned memory is not safe in that forked worker context. Marking this regression as sequential keeps it out of the parallel bucket, and removing --forked from the sequential run lets it test the intended close/unlink behavior. --------- Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
This PR makes GitHub Actions check names unique so required status checks can be configured reliably. GitHub's protected-branch documentation recommends unique job names across workflows when requiring specific status checks: https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/managing-protected-branches/about-protected-branches Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
## What
DeepSpeed's bf16 `bf16_optimizer_states` option and `offload_optimizer` are currently mutually exclusive, per the support matrix in `docs/_pages/config-json.md`:

| ZeRO 1/2/3 | `bf16_optimizer_states=false` | `bf16_optimizer_states=true` |
|---|---|---|
| | requires ZeRO-Offload + `DeepSpeedCPUAdam`; states **fp32, on CPU** | supported **without offload**; states **bf16, on GPU** |

This PR fills the missing cell: `bf16_optimizer_states=true` *together with* `offload_optimizer: {device: cpu}` for ZeRO 1/2/3, with Adam moments held in **bf16** *and* offloaded to **CPU host RAM**. That reduces the offloaded optimizer state from ~10 to ~6 bytes/param (bf16 master + two bf16 moments) with no added GPU memory.
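The ~10 and ~6 bytes/param figures follow from simple accounting. The 6-byte breakdown (bf16 master + two bf16 moments) is stated in the description; the 10-byte breakdown used below (bf16 master + two fp32 moments) is an inference from it, and the helper name and 30B parameter count are illustrative.

```python
BF16_BYTES, FP32_BYTES = 2, 4


def offloaded_bytes_per_param(moment_bytes, master_bytes=BF16_BYTES):
    """Master weight plus the two Adam moments (momentum and variance)."""
    return master_bytes + 2 * moment_bytes


fp32_states = offloaded_bytes_per_param(FP32_BYTES)
bf16_states = offloaded_bytes_per_param(BF16_BYTES)
print(fp32_states, bf16_states)

# Host RAM for a hypothetical 30B-parameter model, in GB (1e9 bytes).
num_params = 30_000_000_000
print(num_params * fp32_states / 1e9, num_params * bf16_states / 1e9)
```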
## Why
CPU offload currently forces fp32 optimizer states; for large models the offloaded optimizer state dominates host RAM. Keeping the moments in bf16 (matching the already-bf16 master weights) cuts that footprint substantially while keeping the state off the GPU.
## How
`DeepSpeedCPUAdam` already supports bf16 momentum/variance through its `fp32_optimizer_states` constructor flag; the feature was simply not wired up. No C++/CUDA kernel changes.
- **`engine.py`**: `_configure_basic_optimizer` builds `DeepSpeedCPUAdam` / `ZenFlowCPUAdam` with `fp32_optimizer_states=False` when `bf16_optimizer_states` is set (a user-supplied value is popped to avoid a keyword clash, and overridden with a warning).
- **`base_optimizer.py`**: `_configure_master_weights` runs the offload + `DeepSpeedCPUAdam` validator whenever offload is configured (not only for the fp32-states case), and asserts a user-provided optimizer actually stores bf16 moments.
- **`stage3.py` / `stage_1_and_2.py`**: pass `offload_enabled` through.
- **`config-json.md`**: updated bf16 support matrix.
## Backward compatibility
`false`+offload and `true`+no-offload configs are unaffected: the default resolves to `fp32_optimizer_states=True` (prior behavior), and the no-offload bf16-states path (FusedAdam on GPU) is untouched.
## Numerics
`bf16_optimizer_states` continues to require `bf16_master_weights_and_grads`, so master weights are bf16, identical in precision to the existing on-GPU bf16-states path. CPU Adam computes updates in fp32 internally and rounds moments to bf16 (round-to-nearest-even), matching that path.
## Testing
- `tests/unit/ops/adam/test_cpu_adam.py`: `DeepSpeedCPUAdam` bf16 moment allocation + fp32 parity.
- `tests/unit/v1/half_precision/test_bf16.py`: `bf16_optimizer_states` + CPU offload across ZeRO 1/2/3 (extends `TestBF16MasterWeightsGradients`), plus a guard test that a user-provided `DeepSpeedCPUAdam` must opt into bf16 moments.

All new and affected existing tests pass; `TestBF16MasterWeightsGradients` (9 cases) was verified on a 2-GPU host.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Signed-off-by: Lucas Pirola <lucas@pirola.eu>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Fix: the following test command may hang.
```
pytest -v runtime/zenflow/test_zf.py::TestZenFlowDistributed::test_zenflow_distributed[epoch-1-4-False-0-3]
```
The reason is that when `param.selected_indices` is empty, its dtype is `torch.float32` instead of `torch.int64`. Using the empty float32 tensor as an index, as in `grad_2d[param.selected_indices, :]`, causes a hang. To fix this, the PR casts `param.selected_indices` to int64 when it is empty, which is the case in which its original dtype is `torch.float32`.

Signed-off-by: binchengxiong <binchengxiong@alibaba-inc.com>
Co-authored-by: binchengxiong <binchengxiong@alibaba-inc.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Author: @PKUWZP & @delock Blog post introducing Muon optimizer support in DeepSpeed, covering how it integrates with ZeRO Stage 2/3, measured convergence and memory results, and the roadmap ahead. --------- Signed-off-by: Ma, Guokai <guokai.ma@intel.com> Signed-off-by: Ma, Guokai <guokai.ma@gmail.com> Signed-off-by: Guokai Ma <guokai.ma@intel.com>
…ai#7990) This PR is based on deepspeedai#7975 and fixes CI errors. Thanks to @mingxiang1006 for providing the fix.

---------

Signed-off-by: Guokai Ma <guokai.ma@intel.com>
`LinearFunctionForZeroStage3` uses the legacy `forward(ctx, ...)` pattern, which is incompatible with `torch.func` transforms (`torch.func.grad`, `torch.func.grad_and_value`, `vmap`, etc.):
```
RuntimeError: In order to use an autograd.Function with functorch transforms (vmap, grad, jvp, jacrev, ...), it must override the setup_context staticmethod.
```
This affects any library that uses `torch.func` internally on a ZeRO-3 model.
## Fix
Fixes deepspeedai#7913
## Note
As pointed out by @zhangj1an in deepspeedai#7913, `PostBackwardFunctionModule` and `PreBackwardFunctionForModule` in `parameter_offload.py` have the same issue. Those will be addressed in a follow-up commit within this PR.

---------

Signed-off-by: Sung Hyun Cho <hope5487@gmail.com>
Signed-off-by: Zhang <jianmusings@gmail.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Zhang Jian <jianmusings@gmail.com>
Co-authored-by: zhangj1an <jianmusings@gmail.com>
Co-authored-by: Zhang Jian <zhang.jian@u.nus.edu>
Co-authored-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Add AutoEP
@codex