Add AutoEP #5

Open
tohtana wants to merge 185 commits into tohtana/add_autoep_review from tohtana/add_autoep
Conversation

@tohtana
Owner

@tohtana tohtana commented Feb 8, 2026

Add AutoEP
@codex

sfc-gh-truwase and others added 6 commits February 3, 2026 22:26
Correctly handle `ds_grad_is_ready` in ZeRO2

---------

Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
The current code has the following issues:
- `use_default_specs: false` doesn't work
- Injection by the traditional pattern runs even when custom patterns
are set
- `mpu` needs to be passed to `deepspeed.initialize` (HF integration
doesn't pass mpu)

This PR fixes AutoTP setup to respect `use_default_specs: false` and
disable the traditional injection path when custom patterns are enabled.
Also, when `mpu` is not passed, we create a TP group in the
initialization process.


With these changes, the [related
tests](https://github.com/deepspeedai/DeepSpeed/tree/master/tests/unit/model_parallelism)
pass and [all AutoTP
examples](https://github.com/tohtana/DeepSpeedExamples/tree/tohtana/custom_auto_tp/training/tensor_parallel)
in DeepSpeedExamples work now
([PR](deepspeedai/DeepSpeedExamples#998)).

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
@tohtana
Owner Author

tohtana commented Feb 8, 2026

@codex

@chatgpt-codex-connector

To use Codex here, create a Codex account and connect to github.

@tohtana tohtana changed the base branch from master to tohtana/add_autoep_review February 8, 2026 07:27
@tohtana
Owner Author

tohtana commented Feb 9, 2026

@codex

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fd07c93a5e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread deepspeed/checkpoint/autoep_universal.py Outdated
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
@tohtana
Owner Author

tohtana commented Feb 9, 2026

@codex

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cabfebcdca

Comment on lines +527 to +529
if self.return_router_logits:
    logits = self._cached_router_logits
    self._cached_router_logits = None

P2 Badge Populate router logits when returning tuple output

When _detect_forward_contract sets return_router_logits=True for legacy MoE blocks (router_logits_capture_target == "moe_block"), _register_logit_hook is not installed and _cached_router_logits is never set. The forward path then returns (output, None) here, which breaks callers that expect actual router logits (e.g., OutputRecorder/z-loss paths that rely on the second return value). This only shows up for models using the MoE-block tuple contract, but in that case the logits are silently missing.


sdvillal and others added 16 commits February 9, 2026 16:22
Current metaclasses for layers and parameters access annotations in a
way that is incompatible with python 3.14+

See:
- [Python 3.14 release
notes](https://docs.python.org/3/whatsnew/3.14.html)
- [Porting
annotations](https://docs.python.org/3/whatsnew/3.14.html#whatsnew314-porting-annotations)
- [PEP649](https://peps.python.org/pep-0649/) and
[PEP749](https://peps.python.org/pep-0749/)

This PR uses annotationlib from python 3.14 onwards and keeps backwards
compatibility.
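
A minimal sketch of the version gate described above. The helper name `get_class_annotations` is hypothetical and is not taken from the PR; the actual metaclass changes may be structured differently.

```python
import sys


def get_class_annotations(cls):
    """Return the annotations declared directly on ``cls`` as a plain dict."""
    if sys.version_info >= (3, 14):
        # Python 3.14 evaluates annotations lazily (PEP 649 / PEP 749);
        # annotationlib is the supported way to read them.
        import annotationlib

        return dict(annotationlib.get_annotations(cls))
    # Older interpreters store annotations eagerly in the class namespace.
    return dict(cls.__dict__.get("__annotations__", {}))


class Example:
    size: int
    name: str


print(get_class_annotations(Example))
```

Importing `annotationlib` inside the branch keeps the module importable on interpreters older than 3.14, where that module does not exist.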

closes deepspeedai#7673
should unblock CF builds for py3.14
conda-forge/deepspeed-feedstock#114

One open question: does DeepSpeed officially support Python 3.14 yet?
Should we test it in CI?

---------

Signed-off-by: Santi Villalba <sdvillal@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
## Bug
`fractions.gcd` was deprecated in Python 3.5 and removed in Python 3.9.
This causes an `AttributeError` on Python 3.9+.

## Fix
Replaced `fractions.gcd` with `math.gcd` which is the standard
replacement.
Fixes: deepspeedai#7837
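
For illustration, a short sketch of the replacement call. The `lcm` helper is a hypothetical example of typical gcd usage, not code from the PR.

```python
import math


def lcm(a: int, b: int) -> int:
    """Least common multiple built on math.gcd (available since Python 3.5)."""
    if a == 0 or b == 0:
        return 0
    return abs(a * b) // math.gcd(a, b)


# math.gcd is a drop-in replacement for the removed fractions.gcd
# when both arguments are integers.
print(math.gcd(12, 18))
print(lcm(4, 6))
```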

ZeRO-0 + bf16 has two bugs in `engine.py`: 
1. `FP16_UnfusedOptimizer` applies `dynamic_loss_scale` with
`cur_scale=65536` but `engine.backward()` never scales the loss, so
`step()` divides gradients by 65536
2. `_take_model_step` skips `zero_grad` for bf16 without ZeRO, causing
gradient accumulation.

Fix: disable loss scaling for bf16 and remove the `zero_optimization()`
gate on `zero_grad`.
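
The first bug can be illustrated with plain arithmetic. This is a toy model of the scale/unscale pairing, not DeepSpeed code; the function name is hypothetical.

```python
def stepped_gradient(raw_grad: float, cur_scale: float, loss_was_scaled: bool) -> float:
    """Gradient seen by the update when step() always divides by cur_scale."""
    # backward() only multiplies by the scale if the loss was scaled first.
    backward_grad = raw_grad * cur_scale if loss_was_scaled else raw_grad
    # step() unconditionally unscales.
    return backward_grad / cur_scale


CUR_SCALE = 65536.0

# Matched scale/unscale: the gradient is unchanged.
print(stepped_gradient(1.0, CUR_SCALE, loss_was_scaled=True))
# Unscaled loss but unscaling step: the gradient shrinks by a factor of 65536.
print(stepped_gradient(1.0, CUR_SCALE, loss_was_scaled=False))
```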

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
…peedai#7840)

Fixes deepspeedai#7835.

On torch==2.10.0, importing DeepSpeed emitted deprecation warnings from
import-time JIT-decorated helpers.
This change updates the compatibility path to align with PyTorch
guidance while keeping import clean.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
This PR addresses deepspeedai#7677 by flattening parameter tensors on the
accelerators instead of the CPU during zero stage 1 and 2
initialization. This should alleviate CPU contention, with the caveat
that the optimization is only used when there is enough VRAM to allocate
a full copy of the parameter buffers.

On 8 x H100s and an Intel Xeon Platinum 8480+, profiling the
initialization of DeepSpeed on 32 layers of `Qwen3-30B` with Z2 gives
the following:

Old = ~382s
New = ~130s

-------------------------

If necessary, this optimization can be extended to allow a tiered
system that trades off VRAM space against performance, which might look
like the following:

```
if enough VRAM for 2x model_size:
    naive flatten
else if enough VRAM for model_size / N:
    distributed flatten across N devices
else:
    flatten on CPU
```

The distributed flatten would involve each device flattening a portion
of the parameters and performing an all-gather to assemble the full
flattened model. See deepspeedai#7677 for original discussion.
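
The tiered selection above could be sketched as a small decision function. The function name, parameters, and return labels are hypothetical; the thresholds are copied literally from the pseudocode and are not a tested policy.

```python
def choose_flatten_strategy(free_vram_bytes: int, model_size_bytes: int, num_devices: int) -> str:
    """Pick a flatten strategy following the tiers in the pseudocode above."""
    if free_vram_bytes >= 2 * model_size_bytes:
        # Enough room for the parameters plus a full flattened copy.
        return "naive"
    if num_devices > 0 and free_vram_bytes >= model_size_bytes / num_devices:
        # Each device flattens a 1/N shard; an all-gather assembles the result.
        return "distributed"
    # Not enough accelerator memory: fall back to flattening on the CPU.
    return "cpu"


GIB = 1024**3
print(choose_flatten_strategy(80 * GIB, 30 * GIB, 8))
print(choose_flatten_strategy(40 * GIB, 30 * GIB, 8))
print(choose_flatten_strategy(2 * GIB, 30 * GIB, 8))
```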

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Kento Sugama <kentosugama@protonmail.ch>
Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>
Signed-off-by: vensen <vensenmu@gmail.com>
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: nathon <leejianwoo@gmail.com>
Co-authored-by: Vensen <vensenmu@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: jp <jsb10121249@gmail.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
This PR enables shared memory communication in single node for arm hosts
- deepspeedai#7625

<img width="908" height="108" alt="image"
src="https://github.com/user-attachments/assets/a5d1a5c7-f28e-4129-9503-cc2b477993ac"
/>

---------

Signed-off-by: Phalani Paladugu <mailofphalani@gmail.com>
Added a new news entry about DeepSpeed ZeRO++ support for LLM
distillation work at LinkedIn.
## Summary
Add support for LG AI Research's EXAONE 4.0 model family in DeepSpeed
Inference V2.

Closes deepspeedai#7453

## Changes
- New model implementation:
  `deepspeed/inference/v2/model_implementations/exaone4/`
  - `container.py`: Transformer and non-transformer parameter containers
  - `model.py`: Inference model with post-norm architecture and QK-Norm
    support
  - `policy.py`: Inference V2 policy
- Register EXAONE 4.0 in `engine_factory.py` and `__init__.py`

## Key architectural differences from Mistral/Llama
- **Post-norm**: RMSNorm is applied after attention/MLP outputs (not
before), followed by residual addition
- **QK-Norm**: Per-head RMSNorm applied to Q and K projections after the
QKV linear layer
- **Hybrid attention**: 32B model uses 3:1 sliding window/full attention
ratio (via `layer_types` config)

## Supported models
- [EXAONE-4.0-1.2B](https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-1.2B)
(all full attention)
- [EXAONE-4.0-32B](https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-32B)
(hybrid sliding/full attention)

Requires `transformers >= 4.54.0`.

## Related
- Supersedes deepspeedai#7456 (draft, inactive for 6 months)

---------

Signed-off-by: Bias92 <pewpewplay315@gmail.com>
deepspeedai#7846)

Fixes deepspeedai#7843

On HIP/ROCm (the AMD path), several CUDA-style BF16 intrinsics used in
the code are not provided, e.g.:
- `__ll2bfloat16_rn`
- `__int2bfloat16_rn`
- `__short2bfloat16_rn`
- `__bfloat162uint_rn`

This causes compilation errors on HIP platforms.

This PR introduces fallback paths using functions available on the HIP
platform, mirroring the [conversion util in
csrc](https://github.com/deepspeedai/DeepSpeed/blob/2c362837b0ef906ea7e7506bab3a625faa945cdd/csrc/includes/conversion_utils.h#L351).
The conversion paths are:

- int/uint -> bf16: convert to float (or double for 64-bit), then to
bf16.
- bf16 -> int/uint: convert bf16 to float, then to the integer type.
- float -> bf16: build from bf16 via supported HIP helpers.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
`EvoformerAttnBuilder` returns instances of `Path` from `include_paths`
which then cause failures in `OpBuilder.builder` when passing them to
`strip_empty_entries` that calls `len` on them which isn't defined for
`Path` instances:
>   TypeError: object of type 'PosixPath' has no len()

Fixes regression introduced in deepspeedai#7760
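
A minimal reproduction of the failure mode and one possible fix (converting to `str` before filtering). Here `strip_empty_entries` is a stand-in with the `len`-based behavior described above, and `normalize_paths` is a hypothetical helper; the PR's actual change may differ.

```python
from pathlib import Path


def strip_empty_entries(entries):
    """Stand-in for the builder helper: drop empty entries using len()."""
    return [entry for entry in entries if len(entry) > 0]


def normalize_paths(paths):
    """Convert Path objects to str so len()-based filtering works."""
    return [str(p) for p in paths]


include_paths = [Path("csrc") / "includes", ""]

try:
    strip_empty_entries(include_paths)
except TypeError as exc:
    # Path objects do not define __len__, so the filter raises.
    print(f"without normalization: {exc}")

print(strip_empty_entries(normalize_paths(include_paths)))
```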

cc @sdvillal

Signed-off-by: Alexander Grund <alexander.grund@tu-dresden.de>
…edai#7832)

deepspeedai#7817 added a test to verify that we throw an error when parameters are
modified in `GatheredParameters` and `modifier_rank` is None. However,
the PR just checks devices and doesn't detect modifications on
parameters.
This causes an
[error](https://github.com/deepspeedai/DeepSpeed/actions/runs/21653729382/job/62424014222)
in our full test run.

This PR adds the detection of parameter modifications to properly throw
an error.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
PR deepspeedai#7839 introduced a regression by changing `TestZeroStaticScale` from
`assert optim.dynamic_loss_scale == False` to `assert
optim.loss_scale_config.dynamic_loss_scale == False`.
`loss_scale_config` is not part of the ZeRO optimizers (only non-ZeRO
optimizers have it), while this test runs with ZeRO optimizers.

With this fix, `TestZeroStaticScale` now passes for stages 1/2/3.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
The full test workflow passed, though it is still flaky
([Success](https://github.com/deepspeedai/DeepSpeed/actions/runs/22269243373)
/
[Failure](https://github.com/deepspeedai/DeepSpeed/actions/runs/22266498530))

This PR schedules a nightly run of the full test. It is launched only
when there have been updates since the last successful run.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
…pspeedai#7874)

Fix links and menu items for AutoTP doc

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
tohtana and others added 30 commits May 14, 2026 01:03
Reduce retained AutoEP tests to critical path
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
## Summary
RFC: deepspeedai#7884

Wire `sdma_allgather` into ZeRO-3's parameter prefetch path
(`_dist_allgather_fn`). When enabled, ZeRO-3 allgather routes through
`mori_cpp.AllGatherIntoTensor` (intra-node SDMA copy on AMD MI300), with
a transparent fallback to `dist.allgather_fn` (RCCL/NCCL) on init
failure.

End-to-end demo + repro steps + verified numbers live in
[`examples/sdma_allgather/README.md`](examples/sdma_allgather/README.md).

Headline (8x MI300X, DeepSpeed default ZeRO-3 buckets, 100 steps):

| | GPT-7B-ish | Qwen3-32B |
|---|---|---|
| SDMA off | 697.7 ms / step | 1402.5 ms / step |
| SDMA on  | 622.0 ms / step | 1263.2 ms / step |
| **gain** | **+10.85 %**    | **+9.93 %**      |

Loss curves match off ↔ on, peak memory unchanged.
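
The gain row in the table above is consistent with the relative reduction in step time, `(off - on) / off`. A quick check of that arithmetic, with a hypothetical function name:

```python
def step_time_gain_percent(baseline_ms: float, sdma_ms: float) -> float:
    """Relative step-time reduction of the SDMA run versus the baseline."""
    return (baseline_ms - sdma_ms) / baseline_ms * 100.0


print(round(step_time_gain_percent(697.7, 622.0), 2))    # GPT-7B-ish column
print(round(step_time_gain_percent(1402.5, 1263.2), 2))  # Qwen3-32B column
```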

Speedup is workload-dependent: gains shrink (or invert) when allgather
can't be overlapped with compute.

Co-authored-by: wuyl1 <yangwu@amd.com>

---------

Signed-off-by: wuyl1 <yangwu@amd.com>
Signed-off-by: inkcherry <mingzhi.liu@amd.com>
Co-authored-by: wuyl1 <yangwu@amd.com>
…lier

Remove AutoEP backward loss multiplier
…-groups-mpu

Use active MPU for AutoEP sequence-parallel size
…pspeedai#8005)

Fixes deepspeedai#8003

## Summary

`FastFileWriter._fini()` overwrote `self._aio_fd = INVALID_FD` without
calling `os.close()`, leaking one fd per save. With unlink-based
checkpoint rotation this stranded the unlinked inode in the ext4
orphan list, fs blocks were never reclaimed, and long-running save
loops hit ENOSPC at iter ~60 (60 GB/iter on a 4 TB partition).

This PR adds explicit `os.fsync()` + `os.close()` in `_fini()` and a
regression test that asserts no `/proc/self/fd` entry points at a
deleted file after a save+close+unlink cycle.
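
A sketch of the close pattern described above, using only the standard library. `close_fd` and `fd_is_open` are hypothetical helpers, not the `FastFileWriter` implementation; the point is that the check inspects OS-level state rather than a Python attribute.

```python
import os
import tempfile

INVALID_FD = -1


def close_fd(fd: int) -> int:
    """Flush and close ``fd``; return the sentinel to store in its place."""
    if fd != INVALID_FD:
        os.fsync(fd)  # push written data to storage before closing
        os.close(fd)  # release the descriptor so the inode can be reclaimed
    return INVALID_FD


def fd_is_open(fd: int) -> bool:
    """Ask the OS whether ``fd`` still refers to an open file."""
    try:
        os.fstat(fd)
        return True
    except OSError:
        return False


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoint.bin")
    raw_fd = os.open(path, os.O_WRONLY | os.O_CREAT)
    os.write(raw_fd, b"payload")
    stored_fd = close_fd(raw_fd)
    # Overwriting the attribute alone would leave raw_fd open at the OS level.
    print(stored_fd == INVALID_FD, fd_is_open(raw_fd))
```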

## Verification

- 20-iteration repro of `save() / close() / unlink()` leaked 20 fds
  before the fix, 0 after.
- 700-iter / 42 TB / 60 h endurance run on ext4/NVMe: `df_used`
  stable at 736 GB (drift +281 MB / 697 rotations) with the fix;
  same workload hit ENOSPC at iter ~60 without it.
- Performance impact: ~5% wall-time overhead from the added
  `os.fsync()` at ~10 GB/s peak.

## Test plan

- [x] New regression test
  `tests/unit/ops/aio/test_fast_file_writer_fd_close.py` verifies fd
  cleanup after a single save and after 5/20-iter rotation loops via
  `/proc/self/fd` scoped to `tmp_path`.
- [x] Gated on `async_io` compatibility, Linux, and CUDA accelerator
  so unsupported CI matrix entries skip cleanly.
- [x] Confirmed test FAILS without this PR's `_fini()` change and
  PASSES with it.
- [x] `pre-commit run --files <changed files>` clean.

## Notes

- The `__del__` assertion `assert self._aio_fd == INVALID_FD` passes
  even with the bug because it checks the Python attribute that
  `_fini` itself sets. The new test checks OS-level state via
  `/proc/self/fd`.
- `os.fsync()` is included for post-close durability — required for
  correctness on the unaligned-tail path that re-opens the file as
  buffered I/O. If maintainers prefer to drop it for performance,
  removing only the `os.fsync(...)` line still fixes the leak.

Happy to adjust shape, naming, or test placement to fit project
conventions. Thanks for the review.

Signed-off-by: jg-heo <csjg.heo@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Three changes to make Moonlight-16B-A3B (DeepSeek-V3 MoE) work with
AutoEP + Muon + ZeRO-2:

1. e_score_correction_bias: copy the pretrained noaux_tc score-correction
   bias from the source gate into AutoEP routers and apply it in the
   TokenChoiceTopKRouter forward pass so expert selection matches the
   pretrained checkpoint.

2. is_expert_group: mark GroupedExperts w1/w2/w3 tensors with
   is_expert_group=True so Muon applies Newton-Schulz independently per
   expert slice rather than treating the stacked (E, I, O) tensor as a
   single matrix.  muon_update grows an is_expert_group kwarg; all four
   call sites inside original_muon.py and the ZeRO-2 path in
   stage_1_and_2.py pass getattr(p, 'is_expert_group', False).

3. Muon + MoE param groups in engine.py: flatten dict-style param groups
   produced by configure_moe_param_groups before filtering by use_muon;
   re-tag optimizer flags after AutoEP layer replacement; add name keys
   for MoE group splitting; call split_params_into_different_moe_groups
   when the model has MoE layers.

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Two fixes addressing masahiro's review feedback on PR deepspeedai#7938:

1. Auto-fill AutoEPConfig from HF model config (auto_ep_config.py,
   auto_ep.py): add fill_autoep_config_from_hf() which maps HF field
   names to AutoEP internal names on AutoEP.__init__:
   - n_group          -> num_expert_groups
   - topk_group       -> num_limited_groups
   - routed_scaling_factor -> route_scale
   User-supplied values always take precedence. Without this, Moonlight
   (DeepSeek-V3) training used route_scale=1.0 instead of 2.446,
   producing systematically wrong MoE output magnitudes.

2. Restore batched Newton-Schulz in muon_update (original_muon.py):
   replace the per-expert Python loop with a single batched call to
   zeropower_via_newtonschulz5, which already supports ndim>=2 inputs.
   This restores GPU parallelism across all E experts per step.
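
The field mapping in item 1 above can be sketched with plain dictionaries. This is an illustration only: `fill_from_hf` and the dict-based configs are hypothetical and do not reproduce the signature of `fill_autoep_config_from_hf()`; the `n_group` and `topk_group` values in the example are made up.

```python
# HF config field name -> AutoEP internal name, as listed in the commit message.
HF_TO_AUTOEP = {
    "n_group": "num_expert_groups",
    "topk_group": "num_limited_groups",
    "routed_scaling_factor": "route_scale",
}


def fill_from_hf(user_config: dict, hf_config: dict) -> dict:
    """Fill missing AutoEP fields from the HF config; user values win."""
    filled = dict(user_config)
    for hf_name, autoep_name in HF_TO_AUTOEP.items():
        if autoep_name not in filled and hf_name in hf_config:
            filled[autoep_name] = hf_config[hf_name]
    return filled


hf_config = {"n_group": 1, "topk_group": 1, "routed_scaling_factor": 2.446}

print(fill_from_hf({}, hf_config))
print(fill_from_hf({"route_scale": 1.5}, hf_config))
```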

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Now that the per-expert Python loop is replaced with a single batched
call to zeropower_via_newtonschulz5, muon_update has no dynamic control
flow and can be compiled again.

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Signed-off-by: Guokai Ma <guokai.ma@intel.com>
DS4Sci EvoformerAttention currently depends on CUTLASS, but requiring
users to manually set `CUTLASS_PATH` creates unnecessary friction for an
otherwise standard extension build flow. This change makes CUTLASS
discovery automatic while preserving `CUTLASS_PATH` as the explicit
override.

The discovery approach is based on PyTorch's CUDA detection pattern in
`torch.utils.cpp_extension`: honor the explicit environment variable
first, then infer from installed packages and conventional filesystem
locations, and only fail with an actionable message when discovery
cannot succeed.
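
A minimal sketch of that discovery order, assuming the caller supplies the candidate directories. `find_cutlass` and `has_cutlass_headers` are hypothetical names, and failing fast on an invalid `CUTLASS_PATH` is a design assumption rather than the PR's confirmed behavior.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Mapping


def has_cutlass_headers(root: Path) -> bool:
    """A CUTLASS tree is recognized by include/cutlass/cutlass.h."""
    return (root / "include" / "cutlass" / "cutlass.h").is_file()


def find_cutlass(env: Mapping[str, str], candidates: Iterable[Path]) -> Path:
    """Honor CUTLASS_PATH first, then probe candidates, then fail clearly."""
    explicit = env.get("CUTLASS_PATH")
    if explicit:
        root = Path(explicit)
        if has_cutlass_headers(root):
            return root
        raise RuntimeError(
            f"CUTLASS_PATH={explicit} does not contain include/cutlass/cutlass.h"
        )
    for root in candidates:
        if has_cutlass_headers(root):
            return root
    raise RuntimeError(
        "CUTLASS not found; set CUTLASS_PATH to a CUTLASS checkout or install"
    )


with tempfile.TemporaryDirectory() as tmp:
    fake_root = Path(tmp) / "cutlass"
    (fake_root / "include" / "cutlass").mkdir(parents=True)
    (fake_root / "include" / "cutlass" / "cutlass.h").write_text("")
    found = find_cutlass({}, [Path(tmp) / "missing", fake_root])
    print(found == fake_root)
```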

This improves first-run usability, CI behavior, editable installs, and
package-based environments where CUTLASS may already be installed in a
discoverable location. It also reduces setup divergence between users
who clone CUTLASS manually and users who install NVIDIA's
`nvidia-cutlass` package.

DeepSpeed should already behave this way: EvoformerAttention is part of
DeepSpeed's extension-builder system, and extension builders should
locate common build dependencies using predictable heuristics instead of
requiring users to export paths manually. CUDA itself is not treated as
"you must always set CUDA_HOME"; PyTorch honors the environment variable
when it is set and otherwise attempts discovery. CUTLASS should follow
the same principle here.

---------

Signed-off-by: Max Tretikov <max@tretikov.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
…edai#7994)

## Summary
Fix critical severity security issue in
`deepspeed/runtime/data_pipeline/data_sampling/data_analyzer.py`.

## Vulnerability
| Field | Value |
|-------|-------|
| **ID** | V-001 |
| **Severity** | CRITICAL |
| **Scanner** | multi_agent_ai |
| **Rule** | `V-001` |
| **File** | `deepspeed/runtime/data_pipeline/data_sampling/data_analyzer.py:75` |

**Description**: The data_analyzer.py file uses os.system() with an
f-string that directly interpolates the variable metric_to_sample_fname
into a shell command without any sanitization. This variable is derived
from user-supplied dataset configuration or file paths. Because
os.system() invokes a shell interpreter, any shell metacharacters in the
variable (semicolons, backticks, dollar signs, pipes, ampersands) will
be interpreted and executed as separate shell commands.
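
The commit message does not say how the call was rewritten. A common way to avoid this class of bug is to pass arguments as a list to `subprocess.run`, which does not invoke a shell. The sketch below shows that general pattern, not the PR's actual change; `echo_argument` is a hypothetical helper.

```python
import subprocess
import sys


def echo_argument(untrusted: str) -> str:
    """Pass ``untrusted`` to a child process as a single argv entry."""
    result = subprocess.run(
        # List form: no shell is involved, so metacharacters stay literal.
        [sys.executable, "-c", "import sys; print(sys.argv[1])", untrusted],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.rstrip("\n")


# With os.system and an f-string, the part after ';' would run as a command.
print(echo_argument("metrics.bin; echo injected"))
```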

## Changes
- `deepspeed/runtime/data_pipeline/data_sampling/data_analyzer.py`

## Verification
- [x] Build passes
- [x] Scanner re-scan confirms fix
- [x] LLM code review passed

---
*Automated security fix by [OrbisAI Security](https://orbisappsec.com)*

Signed-off-by: orbisai0security <mediratta01.pally@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
1. gram_newtonschulz: replace torch.addmm (2D only) with equivalent
   a*Q + Z@Q to support batched 3D expert weight tensors
   [num_local_experts, n, m]. Also fix diagonal() to specify dim1/dim2
   for 3D tensors.

2. deepseek_v3 preset: remove e_score_correction_bias from
   unsupported_router_bias_names since auto_ep_layer.py already
   copies it correctly (lines 398-402).

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
…speedai#8009)

## Summary

Fixes deepspeedai#6961

ZeRO-3 forward crashes with `AttributeError: 'dict' object has no
attribute '_in_forward'` since torch 2.5. PyTorch changed
`nn.Module._parameters` from `OrderedDict` to plain `dict`
(pytorch/pytorch#129164), and a plain `dict` does not allow attribute
assignment.

DeepSpeed wraps every module into `ZeROOrderedDict` at engine init via
`_inject_parameters`. Any module not present at that point keeps the
plain dict and crashes the next forward. This includes a submodule
attached after `deepspeed.initialize()` (PEFT/LoRA adapters), or a
module restored by `deepspeed/compile/init_z3.py:35`.

The fix adds `ensure_zero_ordered_dict()` and calls it from the forward
prologue. It wraps lazily, is idempotent, and keeps the original
container so the deepcompile un-injection path still works. The epilogue
gets an `isinstance` guard for modules that show up between the two
hooks.

This only fixes the crash. Late-attached parameters are still not in the
optimizer and not partitioned by ZeRO-3. For full ZeRO-3 semantics on a
late adapter, build it inside `deepspeed.zero.Init()`.
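
The underlying Python behavior can be shown without PyTorch: a plain `dict` rejects attribute assignment, while a `dict` subclass accepts it. `AttrDict` and `ensure_attr_dict` are hypothetical stand-ins for `ZeROOrderedDict` and `ensure_zero_ordered_dict()`; unlike the real fix, this sketch copies the container instead of keeping the original.

```python
class AttrDict(dict):
    """Hypothetical dict subclass; subclass instances accept attributes."""


def ensure_attr_dict(container: dict) -> dict:
    """Wrap lazily and idempotently: wrapped containers pass through."""
    if isinstance(container, AttrDict):
        return container
    return AttrDict(container)


plain = {"weight": 1.0}
try:
    plain._in_forward = True  # plain dict instances have no __dict__
except AttributeError as exc:
    print(exc)

wrapped = ensure_attr_dict(plain)
wrapped._in_forward = True
print(wrapped._in_forward, ensure_attr_dict(wrapped) is wrapped)
```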

## Tests

`tests/unit/runtime/zero/test_zero_late_module_attach.py`

- forward after attaching a Linear post-init, with `_parameters` forced
to plain dict so the bug reproduces on any torch version
- repeated forwards do not re-wrap an already-wrapped module

Signed-off-by: Sung Hyun Cho <hope5487@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
## What

Removes an obsolete paragraph from the `DeepSpeedCPUAdam` constructor
docstring.

## Why

The docstring described a `step()` option that "updates optimizer states
and copies the parameters back to GPU at the same time", i.e. the old
`adam_update_copy` kernel, invoked via `step(fp16_param_groups=...)`.
That fused-copy path no longer exists in the codebase:

- `csrc/adam/cpu_adam.cpp` binds only `adam_update` (no
  `adam_update_copy`).
- `DeepSpeedCPUAdam.step()` / `step_subgroup()` take no
  `fp16_param_groups` argument and only call `adam_update`.

So the "two options" text is stale and misleading to anyone reading the
API. Docstring-only change; no functional impact.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Signed-off-by: Lucas Pirola <lucas@pirola.eu>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
## Summary

The Modal test now fails because of the combination of PyTorch v2.7 and
the Transformers `main` branch, and it is blocking PRs. To address this,
we improve our test workflows as follows.

- Add manual dependency version inputs for the torch-latest CI workflows
and default the torch-latest family to PyTorch 2.10 plus Transformers
git `main`.
- Let CPU and AWS full torch-latest runs select either released
Transformers package versions or an explicit Transformers git ref for
manual validation.
- Let Modal torch-latest runs select supported PyTorch/CUDA image
presets and an optional Transformers git ref, defaulting to
`2.10.0-cuda12.8` and Transformers git `main`.

## Known follow-up

- The AWS full real CI lane for PyTorch 2.10 plus Transformers main
reached `Unit tests (parallel)` but failed with 33 failures. Some of
these may overlap with fixes in deepspeedai#8015; I am opening this PR now so the
workflow/input changes can be reviewed while those failures are handled
separately.
- CPU and Modal real CI validation for the requested tuple passed.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
)

Run the FastFileWriter fd-close regression in the sequential CI bucket
without pytest-forked.

The test exercises the real torch.save() through FastFileWriter with
async I/O and pinned memory. The scheduled AWS [full CI
failure](deepspeedai#8015) happens
before the fd-close assertion because the sequential bucket still runs
under --forked, and CUDA-backed pinned memory is not safe in that forked
worker context.

Marking this regression as sequential keeps it out of the parallel
bucket, and removing --forked from the sequential run lets it test the
intended close/unlink behavior.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
This PR makes GitHub Actions check names unique so required status
checks can be configured reliably.

GitHub's protected-branch documentation recommends unique job names
across workflows when requiring specific status checks:
https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/managing-protected-branches/about-protected-branches

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
## What

DeepSpeed's bf16 `bf16_optimizer_states` option and `offload_optimizer`
are currently mutually exclusive, per the support matrix in
`docs/_pages/config-json.md`:

| ZeRO 1/2/3 | `bf16_optimizer_states=false` | `bf16_optimizer_states=true` |
|---|---|---|
| | requires ZeRO-Offload + `DeepSpeedCPUAdam`; states **fp32, on CPU** | supported **without offload**; states **bf16, on GPU** |

This PR fills the missing cell: `bf16_optimizer_states=true` *together
with* `offload_optimizer: {device: cpu}` for ZeRO 1/2/3, with Adam
moments held in **bf16** *and* offloaded to **CPU host RAM**. That
reduces the offloaded optimizer state from ~10 to ~6 bytes/param (bf16
master + two bf16 moments) with no added GPU memory.
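
The byte counts can be checked with simple arithmetic. The ~6 figure follows the breakdown stated above; reading the ~10 figure as a bf16 master plus two fp32 moments is an assumption, and the function name is hypothetical.

```python
BF16_BYTES = 2
FP32_BYTES = 4


def offloaded_bytes_per_param(master_bytes: int, moment_bytes: int) -> int:
    """Master weight plus the two Adam moments (momentum and variance)."""
    return master_bytes + 2 * moment_bytes


# Assumed prior layout: bf16 master + two fp32 moments.
print(offloaded_bytes_per_param(BF16_BYTES, FP32_BYTES))
# Layout described in this PR: bf16 master + two bf16 moments.
print(offloaded_bytes_per_param(BF16_BYTES, BF16_BYTES))
```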

## Why

CPU offload currently forces fp32 optimizer states; for large models the
offloaded optimizer state dominates host RAM. Keeping the moments in
bf16 (matching the already-bf16 master weights) cuts that footprint
substantially while keeping the state off the GPU.

## How

`DeepSpeedCPUAdam` already supports bf16 momentum/variance through its
`fp32_optimizer_states` constructor flag; the feature was simply not
wired up. No C++/CUDA kernel changes.

- **`engine.py`**: `_configure_basic_optimizer` builds
  `DeepSpeedCPUAdam` / `ZenFlowCPUAdam` with
  `fp32_optimizer_states=False` when `bf16_optimizer_states` is set (a
  user-supplied value is popped to avoid a keyword clash, and overridden
  with a warning).
- **`base_optimizer.py`**: `_configure_master_weights` runs the offload +
  `DeepSpeedCPUAdam` validator whenever offload is configured (not only
  for the fp32-states case), and asserts a user-provided optimizer
  actually stores bf16 moments.
- **`stage3.py` / `stage_1_and_2.py`**: pass `offload_enabled` through.
- **`config-json.md`**: updated bf16 support matrix.

## Backward compatibility

`false`+offload and `true`+no-offload configs are unaffected: the
default resolves to `fp32_optimizer_states=True` (prior behavior), and
the no-offload bf16-states path (FusedAdam on GPU) is untouched.

## Numerics

`bf16_optimizer_states` continues to require
`bf16_master_weights_and_grads`, so master weights are bf16, identical
in precision to the existing on-GPU bf16-states path. CPU Adam computes
updates in fp32 internally and rounds moments to bf16
(round-to-nearest-even), matching that path.

## Testing

- `tests/unit/ops/adam/test_cpu_adam.py`: `DeepSpeedCPUAdam` bf16 moment
  allocation + fp32 parity.
- `tests/unit/v1/half_precision/test_bf16.py`: `bf16_optimizer_states` +
  CPU offload across ZeRO 1/2/3 (extends
  `TestBF16MasterWeightsGradients`), plus a guard test that a
  user-provided `DeepSpeedCPUAdam` must opt into bf16 moments.

All new and affected existing tests pass;
`TestBF16MasterWeightsGradients` (9 cases) was verified on a 2-GPU host.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Signed-off-by: Lucas Pirola <lucas@pirola.eu>
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Fix: the following test command may hang.
`pytest -v runtime/zenflow/test_zf.py::TestZenFlowDistributed::test_zenflow_distributed[epoch-1-4-False-0-3]`
When `param.selected_indices` is empty, its dtype is `torch.float32` instead of `torch.int64`. Using that empty float32 tensor as an index, as in `grad_2d[param.selected_indices, :]`, causes the hang. This fix casts `param.selected_indices` to int64 when it is empty, which is the case in which its dtype is `torch.float32`.

Signed-off-by: binchengxiong <binchengxiong@alibaba-inc.com>
Co-authored-by: binchengxiong <binchengxiong@alibaba-inc.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Author: @PKUWZP & @delock
Blog post introducing Muon optimizer support in DeepSpeed, covering how
it integrates with ZeRO Stage 2/3, measured convergence and memory
results, and the roadmap ahead.

---------

Signed-off-by: Ma, Guokai <guokai.ma@intel.com>
Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Signed-off-by: Guokai Ma <guokai.ma@intel.com>
…ai#7990)

This PR is based on deepspeedai#7975 and fixes CI errors. Thanks to @mingxiang1006
for providing the fix.

---------

Signed-off-by: Guokai Ma <guokai.ma@intel.com>
`LinearFunctionForZeroStage3` uses the legacy `forward(ctx, ...)`
pattern which is incompatible with `torch.func` transforms
(`torch.func.grad`, `torch.func.grad_and_value`, `vmap`, etc.):
```
RuntimeError: In order to use an autograd.Function with functorch transforms
(vmap, grad, jvp, jacrev, ...), it must override the setup_context staticmethod.
```

This affects any library that uses `torch.func` internally on a ZeRO-3
model.

## Fix

Fixes deepspeedai#7913 

## Note

As pointed out by @zhangj1an in deepspeedai#7913, `PostBackwardFunctionModule` and
`PreBackwardFunctionForModule` in `parameter_offload.py` have the same
issue. Those will be addressed in a follow-up commit within this PR.

---------

Signed-off-by: Sung Hyun Cho <hope5487@gmail.com>
Signed-off-by: Zhang <jianmusings@gmail.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Zhang Jian <jianmusings@gmail.com>
Co-authored-by: zhangj1an <jianmusings@gmail.com>
Co-authored-by: Zhang Jian <zhang.jian@u.nus.edu>
Co-authored-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>