AutoEP + Muon fixes for Moonlight (DeepSeek-V3 MoE) #25
Closed
delock wants to merge 82 commits into
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Currently the CI full test shows a [CUDA reinit error](https://github.com/deepspeedai/DeepSpeed/actions/runs/24444633640/job/71417719445). This PR includes the following fixes:

- Fix `compute_capability_args()` in JIT mode to read `TORCH_CUDA_ARCH_LIST` before calling `torch.cuda.get_device_capability()` and restore JIT builder state after `jit_load()`. It also adds regression tests for the explicit-arch, bad-fork, and restore paths.
- Delay initialization of CUDA streams in DeepCompile.

After this fix, the full test [passed](https://github.com/deepspeedai/DeepSpeed/actions/runs/24508304055/job/71632434455) again.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
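The lookup order described in the first fix can be sketched as follows. This is a minimal illustration and not DeepSpeed's builder code: `resolve_arch_list`, its `env` mapping, and the injected `probe_device` callable are hypothetical names standing in for the environment lookup and for `torch.cuda.get_device_capability()`.

```python
def resolve_arch_list(env, probe_device):
    """Return compute capabilities, preferring an explicit arch list.

    env:          mapping that plays the role of os.environ (hypothetical).
    probe_device: zero-argument callable returning (major, minor); stands in
                  for torch.cuda.get_device_capability(), which touches CUDA.
    """
    raw = env.get("TORCH_CUDA_ARCH_LIST", "").strip()
    if raw:
        # Entries may be separated by spaces or semicolons, e.g. "8.0;9.0+PTX".
        return [arch for arch in raw.replace(";", " ").split() if arch]
    # Only query the device when no explicit list was given.
    major, minor = probe_device()
    return [f"{major}.{minor}"]


calls = []


def fake_probe():
    calls.append("probed")
    return (9, 0)


explicit = resolve_arch_list({"TORCH_CUDA_ARCH_LIST": "8.0;9.0+PTX"}, fake_probe)
calls_after_explicit = list(calls)
fallback = resolve_arch_list({}, fake_probe)
print(explicit, fallback, calls)
```

With an explicit list the probe is never invoked, which is the property that matters when the process must not initialize CUDA (for example before a fork).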
… step (#7981)

## Summary

ZeRO-1/2 + `offload_optimizer` + `gradient_accumulation_steps=1` with multiple `engine.backward()` calls per optimizer step (via `set_gradient_accumulation_boundary()`, formalized in #7665) silently drops all but the last backward's gradient.

`copy_grads_in_partition` only called `async_accumulate_grad_in_cpu_via_gpu` under `if gradient_accumulation_steps > 1`, so with `ga_steps=1` intermediate backwards' reduced grads were never stored. The boundary `async_inplace_copy_grad_to_fp32_buffer_from_gpu` then overwrote (not added) the fp32 buffer with the last chunk only.

ZeRO-3 + offload and non-offload ZeRO-1/2 are unaffected.

## Fix

Replace the `ga > 1` gate with one that fires exactly when a CPU accumulator is needed:

```python
if self.micro_step_id > 0 or not self.is_gradient_accumulation_boundary:
    self.async_accumulate_grad_in_cpu_via_gpu(param)
```

- `ga_steps=1` + single `backward()` → skipped. No CPU buffer, no extra copy. Fast path preserved.
- `ga_steps=1` + multi-backward → accumulates correctly across calls.
- `ga_steps>1` → identical to prior behaviour.

## Measurement

2x H100, 3-layer MLP, Adam, lr=1e-3, N=4 backwards/step, ga_steps=1

Max param diff vs no-offload reference:

| | fp32 | bf16 |
| ------ | ------------------------------------ | -------------------- |
| Before | 2.00e-03 (wrong, around 2 x lr) | — |
| After | 7.45e-09 (noise) | 0.00e+00 |

## Tests

New `tests/unit/v1/zero/test_zero2_offload_multi_backward.py`, parametrized over ZeRO-1/2:

- multi-backward offload matches no-offload
- single-backward unchanged
- multi-step state-leak guard
- single-backward allocates no CPU buffer (perf guard)
- `ga_steps>1` + offload unchanged (#7967 regression guard)

---------

Signed-off-by: Sung Hyun Cho <hope5487@gmail.com>
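The behaviour of the new gate across the three cases listed in the fix can be checked with a small truth-table sketch. The function names here are hypothetical, and the sketch assumes `micro_step_id` counts backward calls within one optimizer step starting from 0.

```python
def needs_cpu_accumulation(micro_step_id, is_boundary):
    """New gate: accumulate on CPU unless this is the only backward of the step."""
    return micro_step_id > 0 or not is_boundary


def old_gate(gradient_accumulation_steps):
    """Previous gate, as described above: keyed only on the configured ga_steps."""
    return gradient_accumulation_steps > 1


# ga_steps=1, single backward: the first micro step is also the boundary.
single = needs_cpu_accumulation(micro_step_id=0, is_boundary=True)

# ga_steps=1, four backwards per step: only the last call is the boundary.
multi = [needs_cpu_accumulation(i, is_boundary=(i == 3)) for i in range(4)]

print(single, multi, old_gate(1))
```

The single-backward case skips accumulation, every call in the multi-backward case accumulates (including the boundary call, because `micro_step_id > 0`), and the old gate would have skipped all of them with `ga_steps=1`.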
…rd/backward visits (#7980)

In PyTorch AOT Autograd, having tensors requiring grad in inputs doesn't guarantee backward graph compilation. If no output requires grad and no input requiring grad is mutated, aot_autograd skips backward compilation (see [1]).

DeepCompile previously required backward compilation for every forward graph that required grad, but relied solely on the existence of tensors requiring grad. This mismatch caused unbalanced forward/backward visits, leaving graphs unvisited in `frames_needing_bwd`. The patched FunctionMeta then remained effective during backward execution, raising KeyError when removing the (already-removed) frame IDs from the `frames_needing_bwd` set. A reproduction can be found at [2].

Simply putting a guard on the set removal operation is insufficient: the backward graph would still be recompiled on each iteration, severely impacting performance.

Instead of duplicating how AOT Autograd determines whether to compile the backward graph, use the fact that a joint graph requires a backward pass if and only if it is partitioned into a forward and a backward module. The frame IDs of partitioned graphs are collected in the patched partition functions and then used to determine `needs_backward` in the forward compile function. `backend_fn` is not a proper place for the second step since autograd creates fw/bw compile functions before partitioning a joint graph.

References

[1] https://github.com/pytorch/pytorch/blob/aea31e0c306e2315bf6d84255e0dde7adf09762a/torch/_functorch/aot_autograd.py#L618
[2] https://gist.github.com/eternalNight/96d6bc60e2bf566fda1300154d0e89dc

Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
Fix CPUAdam same-step subgroup drift in ZeRO-3 (#7819)

This PR ports the fix from #7820 to the latest DeepSpeed version. It makes `Adam_Optimizer::IncrementStep` idempotent for repeated calls at the same logical step and avoids unnecessary recomputation when the step has not changed.

ZeRO-3/SuperOffload can invoke multiple subgroup updates within a single logical step on a shared native optimizer object. The previous logic mixed multiply and recompute paths, producing non-bit-identical bias-correction metadata across subgroup calls.

This change aligns the step-transition logic in both the CPU and XPU headers, clarifies first-step and non-sequential-step behavior, and prevents unnecessary work on repeated same-step updates. It also adds CPUAdam regression tests covering subgroup-style repeated same-step updates through both `step_subgroup()` and `step()` with parameter swapping.

Signed-off-by: st_bang <st.bang@dgist.ac.kr>
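The idempotence property can be illustrated with a toy model of the bias-correction bookkeeping. This is a sketch under assumptions, not a port of the C++ header: `StepState` and its fields are hypothetical, and which transitions the real `IncrementStep` handles by multiplying versus recomputing is not stated above.

```python
class StepState:
    """Toy model of Adam bias-correction bookkeeping (hypothetical names)."""

    def __init__(self, beta1, beta2):
        self.beta1, self.beta2 = beta1, beta2
        self.step = 0
        self.beta1_pow = 1.0
        self.beta2_pow = 1.0

    def increment_step(self, step):
        if step == self.step:
            # Same logical step (e.g. another subgroup): leave state untouched.
            return
        if step == self.step + 1:
            # Sequential step: one multiply per running power.
            self.beta1_pow *= self.beta1
            self.beta2_pow *= self.beta2
        else:
            # Non-sequential step: recompute the powers from scratch.
            self.beta1_pow = self.beta1 ** step
            self.beta2_pow = self.beta2 ** step
        self.step = step


state = StepState(0.9, 0.999)
snapshots = []
for _ in range(3):  # three subgroup updates within logical step 1
    state.increment_step(1)
    snapshots.append((state.beta1_pow, state.beta2_pow))
print(snapshots)
```

Because the same-step branch returns before touching any state, every subgroup sees bit-identical powers, whichever path produced them on the first call.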
Fix #6596 --------- Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
# Fix BF16_Optimizer last-microbatch grad leak under ZeRO-1
## Summary
`DeepSpeedEngine._backward_epilogue` calls `allreduce_gradients()`
**before** `optimizer.backward_epilogue()`. For `BF16_Optimizer` (used
when `bf16` model + `grad_accum_dtype: fp32` + ZeRO stage 1) without
`immediate_grad_update`, this means the **boundary microbatch's gradient
is added to the rank-local fp32 accumulator AFTER the cross-rank
allreduce, so it is silently skipped from the average.**
The bias is `(world_size − 1) / world_size × 1 /
gradient_accumulation_steps` of the per-step gradient. Because the bias
scales with per-microbatch grad weight, training trajectories visibly
diverge depending on `per_device_train_batch_size` even with identical
effective batch size — the symptom users see is loss / grad-norm curves
drifting apart between otherwise-equivalent configs.
The bug is reproducible in DeepSpeed 0.18.6 through current `master`
(0.18.10 at time of writing).
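As a worked instance of the bias formula above, the following sketch evaluates it exactly for two ranks. `leaked_fraction` is a hypothetical helper name; the arithmetic is just the expression from the summary.

```python
from fractions import Fraction


def leaked_fraction(world_size, ga_steps):
    """Share of the per-step gradient that misses the cross-rank average."""
    return Fraction(world_size - 1, world_size) * Fraction(1, ga_steps)


for ga in (4, 16):
    print(f"world_size=2, ga_steps={ga}: {leaked_fraction(2, ga)}")
```

With two ranks the affected share is 1/8 at 4 accumulation steps and 1/32 at 16, so two configurations with the same effective batch size but different per-device batch sizes are biased by different amounts.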
## Fix
Swap the order so `optimizer.backward_epilogue()` runs before
`allreduce_gradients()`, with `exit_backward()` left after.
`exit_backward()` only manages backward-hook state
(`_backward_hook_state`); it has no ordering dependency on the gradient
accumulator.
```python
def _backward_epilogue(self):
    self._stop_timers(self.engine_timers.backward_inner_timers)
    self._start_timers(self.engine_timers.backward_reduce_timers)
    # NEW: run backward_epilogue() before allreduce so the boundary microbatch
    # grad lands in the optimizer accumulator that gets reduced.
    if isinstance(self.optimizer, ZeROOptimizer):
        self.optimizer.backward_epilogue()
    if self.enable_backward_allreduce and not self.inside_no_sync_ctxt:
        self.allreduce_gradients()
    if isinstance(self.optimizer, ZeROOptimizer):
        self.optimizer.exit_backward()
    see_memory_usage("Engine after backward", force=self.memory_breakdown())
    self._stop_timers(self.engine_timers.backward_reduce_timers)
    self._stop_timers(self.engine_timers.backward_timers)
```
Diff: +10 / −1, single file (`deepspeed/runtime/engine.py`).
## Root cause walkthrough
The bug requires **both** of the following to be true:
1. The accumulator that `optimizer.backward_epilogue()` mutates is the
**same tensor** that `engine.allreduce_gradients()` later reduces, AND
2. The accumulator is updated *only* by `optimizer.backward_epilogue()`
(no per-param hooks updating it inline during backward).
Both conditions hold for `BF16_Optimizer` without
`immediate_grad_update`:
- It maintains a **separate fp32 accumulator**
(`fp32_groups_gradients_flat`) — distinct from `param.grad`.
- Its `backward_epilogue()` calls `update_hp_grads()` which casts each
param's bf16 `lp.grad` to fp32 and adds it into that accumulator (and
only this code path fills the accumulator when
`immediate_grad_update=False`).
- `engine.allreduce_gradients()` → `buffered_allreduce_fallback()` →
`optimizer.get_grads_for_reduction()` returns **the same
`non_expert_gradients` list = `fp32_groups_gradients_flat`**.
So on the gradient-accumulation boundary microbatch:
1. `loss.backward()` populates bf16 `lp.grad` for that microbatch.
2. `_backward_epilogue` first calls `allreduce_gradients()`. The fp32
accumulator at this point contains microbatches `0..ga-2`'s grads
(summed locally on each rank). The allreduce averages **only that**
across ranks.
3. `_backward_epilogue` then calls `optimizer.backward_epilogue()` →
`update_hp_grads()` → adds the boundary microbatch's local `lp.grad` to
the now-allreduced accumulator.
Result, per rank `i`:
```
fp32_buffer_rank_i = avg_ranks(Σ_{m=0..ga-2} grad_m) + local_grad_{ga-1}_rank_i
                     └──── shared across ranks ────┘   └─ rank-private leak ──┘
```
ZeRO-1 partitions optimizer states across ranks, so each rank then runs
`optimizer.step()` on its slice of this rank-divergent buffer;
`update_lp_params()` allgathers the bf16 params back. The effective
gradient applied to parameter `p` is:
```
g_p = avg_ranks(prior microbatches' grad for p) + local_grad_last_for_p_at_owning_rank
```
i.e. the boundary microbatch's contribution captures only `1 /
world_size` of the cross-rank average, biasing the global gradient by
`(world_size − 1) / world_size × 1 / ga_steps`.
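The ordering effect walked through above can be reproduced with scalars in place of tensors. This is a toy simulation, not DeepSpeed code: `allreduce_mean` and `run_cycle` are hypothetical stand-ins for the cross-rank average and for one gradient-accumulation cycle, and the gradient values are made up for illustration.

```python
def allreduce_mean(per_rank):
    """Stand-in for a cross-rank average: every rank receives the mean."""
    mean = sum(per_rank) / len(per_rank)
    return [mean] * len(per_rank)


def run_cycle(grads, epilogue_first):
    """grads[rank][microbatch] are scalar gradients; returns the per-rank buffer."""
    ga_steps = len(grads[0])
    # Non-boundary microbatches accumulate locally on each rank.
    acc = [sum(rank_grads[: ga_steps - 1]) for rank_grads in grads]
    last = [rank_grads[-1] for rank_grads in grads]
    if epilogue_first:
        # Fixed order: add the boundary grad, then reduce.
        acc = [a + g for a, g in zip(acc, last)]
        return allreduce_mean(acc)
    # Buggy order: reduce first, then add the rank-local boundary grad.
    acc = allreduce_mean(acc)
    return [a + g for a, g in zip(acc, last)]


grads = [[1.0, 2.0, 3.0, 4.0],   # rank 0
         [2.0, 4.0, 6.0, 8.0]]   # rank 1
buggy = run_cycle(grads, epilogue_first=False)
fixed = run_cycle(grads, epilogue_first=True)
print("buggy:", buggy)
print("fixed:", fixed)
```

In the buggy order the two ranks end up with different buffers (13.0 and 17.0), while the fixed order gives both ranks the true average of 15.0.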
## Impact on other optimizers (no behavior change)
| Optimizer | Accumulator | How accumulator is filled | Reduction path | Affected by current bug? | Affected by this fix? |
|---|---|---|---|---|---|
| `BF16_Optimizer` (immediate_grad_update=False) | separate `fp32_groups_gradients_flat` | only via `optimizer.backward_epilogue()` → `update_hp_grads()` | `engine.allreduce_gradients` → `buffered_allreduce_fallback` → `optimizer.get_grads_for_reduction()` returns the same fp32 buffer | **Yes — leak** | **Yes — leak fixed** |
| `BF16_Optimizer` (immediate_grad_update=True) | same fp32 buffer | per-param hooks (`create_grad_acc_hooks`) fire inline during backward | same allreduce path | No (hooks already filled buffer before allreduce) | No-op (`update_hp_grads` early-returns when `immediate_grad_update`) |
| `DeepSpeedZeroOptimizer_Stage1And2` (ZeRO-1, default for bf16+bf16-grad-accum) | `param.grad` directly + ipg buckets | hooks fire inline during backward (`overlap_comm=True` default), or `reduce_gradients()` walks all params at boundary | `engine.allreduce_gradients` takes the `if hasattr(self.optimizer, 'reduce_gradients')` branch → `optimizer.reduce_gradients()` walks all params; the boundary microbatch's grad is already on `param.grad` (autograd populates this before `_backward_epilogue` runs) | No | No-op (`Stage1And2.backward_epilogue` does not mutate the reduction buffer) |
| `DeepSpeedZeroOptimizer_Stage3` | partitioned via `overlapping_partition_gradients_reduce_epilogue()` | reduce-scatter inline during backward via hooks | `engine.allreduce_gradients` takes the `if zero_optimization_partition_gradients()` branch → calls overlapping epilogue, which is fed by hooks | No | No-op |
In short: the fix is functionally relevant **only** for `BF16_Optimizer`
without `immediate_grad_update`. For every other ZeRO optimizer the
change is observably a no-op because their `backward_epilogue` does not
mutate the buffer being reduced.
## Reproducer
The minimum reproducer is a 2-rank standalone script that runs one
gradient-accumulation cycle and prints the per-rank fp32 accumulator
norm at each microbatch and immediately before `engine.step()`. With the
bug present the per-rank values **disagree at the boundary microbatch
and going into the optimizer step**; with the fix they agree.
Save as `probe_bf16_grad_accum.py`:
```python
"""Probe whether DeepSpeed's BF16_Optimizer leaks the boundary microbatch grad
out of the cross-rank average. Run with two ranks, e.g.:
deepspeed --num_gpus 2 probe_bf16_grad_accum.py
or
accelerate launch --num_processes 2 --num_machines 1 probe_bf16_grad_accum.py
"""
import os
import torch
import torch.nn as nn
import deepspeed
import torch.distributed as dist


def main():
    rank = int(os.environ.get("RANK", "0"))
    world = int(os.environ.get("WORLD_SIZE", "1"))
    GA_STEPS = 4
    HIDDEN = 64
    BATCH = 4
    torch.manual_seed(0)  # SAME init across ranks
    model = nn.Sequential(
        nn.Linear(HIDDEN, HIDDEN),
        nn.GELU(),
        nn.Linear(HIDDEN, HIDDEN),
    ).to(torch.bfloat16).cuda()
    ds_config = {
        "bf16": {"enabled": True},  # set "immediate_grad_update": True to also bypass the bug
        "data_types": {"grad_accum_dtype": "fp32"},
        "communication_data_type": "fp32",
        "zero_optimization": {
            "stage": 1,
            "overlap_comm": True,
            "contiguous_gradients": True,
            "reduce_scatter": True,
            "allgather_partitions": True,
            "allgather_bucket_size": 200000000,
            "reduce_bucket_size": 200000000,
        },
        "train_micro_batch_size_per_gpu": BATCH,
        "gradient_accumulation_steps": GA_STEPS,
        "train_batch_size": BATCH * world * GA_STEPS,
        "gradient_clipping": 0.0,
        "steps_per_print": 9999,
    }
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, betas=(0.9, 0.95))
    engine, _, _, _ = deepspeed.initialize(model=model, optimizer=optimizer, config=ds_config)
    bf16_opt = engine.optimizer
    if rank == 0:
        print(f"[INFO] DeepSpeed optimizer class: {type(bf16_opt).__name__}", flush=True)
        print(f"[INFO] grad_acc_dtype = {bf16_opt.grad_acc_dtype}", flush=True)
        print(f"[INFO] world={world} ga={GA_STEPS} batch={BATCH}", flush=True)

    def buffer_summary(label):
        bufs = bf16_opt.fp32_groups_gradients_flat
        local_norm = sum(b.detach().to(torch.float64).norm().item() ** 2 for b in bufs) ** 0.5
        rank_buf = torch.tensor([local_norm], device="cuda", dtype=torch.float64)
        gathered = [torch.zeros_like(rank_buf) for _ in range(world)]
        dist.all_gather(gathered, rank_buf)
        if rank == 0:
            vals = [g.item() for g in gathered]
            diffs = [(v - vals[0]) / vals[0] * 100 if vals[0] != 0 else 0 for v in vals]
            print(f"[{label}] per_rank={vals} diff_pct_vs_rank0={diffs}", flush=True)
        dist.barrier()

    buffer_summary("init (zero)")
    # Different inputs per rank so per-rank grads differ.
    torch.manual_seed(100 + rank)
    inputs = [torch.randn(BATCH, HIDDEN, device="cuda", dtype=torch.bfloat16) for _ in range(GA_STEPS)]
    targets = [torch.randn(BATCH, HIDDEN, device="cuda", dtype=torch.bfloat16) for _ in range(GA_STEPS)]
    for i in range(GA_STEPS):
        is_boundary = (i == GA_STEPS - 1)
        engine.set_gradient_accumulation_boundary(is_boundary=is_boundary)  # what accelerate's wrapper does
        out = engine(inputs[i])
        loss = ((out - targets[i]) ** 2).mean()
        engine.backward(loss)
        buffer_summary(f"after backward microbatch {i} (boundary={is_boundary})")
    buffer_summary("BEFORE engine.step()")
    engine.step()


if __name__ == "__main__":
    main()
```
## Verification
### Probe (synthetic, 2 GPUs, 1 grad-accum cycle)
Run on `master` (bug):
```
[init (zero)] per_rank=[0.0, 0.0 ] diff = 0%
[after microbatch 0 (boundary=False)] per_rank=[0.1495, 0.1378] diff = -7.78% (local-only accumulation)
[after microbatch 1 (boundary=False)] per_rank=[0.1998, 0.2098] diff = +5.03%
[after microbatch 2 (boundary=False)] per_rank=[0.2322, 0.2434] diff = +4.82%
[after microbatch 3 (boundary=True)] per_rank=[0.2206, 0.2123] diff = -3.80% ← bug
[BEFORE engine.step()] per_rank=[0.2206, 0.2123] diff = -3.80% ← bug
```
Run on this PR (fixed):
```
[init (zero)] per_rank=[0.0, 0.0 ] diff = 0%
[after microbatch 0 (boundary=False)] per_rank=[0.1495, 0.1378] diff = -7.78%
[after microbatch 1 (boundary=False)] per_rank=[0.1998, 0.2098] diff = +5.03%
[after microbatch 2 (boundary=False)] per_rank=[0.2322, 0.2434] diff = +4.82%
[after microbatch 3 (boundary=True)] per_rank=[0.1942, 0.1942] diff = 0.00% ← fixed
[BEFORE engine.step()] per_rank=[0.1942, 0.1942] diff = 0.00% ← fixed
```
The same agreement is reproduced by the existing `bf16: {
immediate_grad_update: true }` workaround, which uses per-param hooks to
fill the fp32 accumulator inline during backward (and is therefore not
affected by the `_backward_epilogue` ordering).
### End-to-end training (HuggingFace Trainer + accelerate + DeepSpeed, 2× A100)
A small custom Qwen3-derived model (~64M params, bf16, ZeRO-1 with
`grad_accum_dtype: fp32`), 10 optimizer steps, identical seed and data
ordering, identical effective batch size (`global_batch_size = 64`),
only `per_device_train_batch_size` varies (so
`gradient_accumulation_steps = global_batch_size / (per_device *
world_size)` differs).
| Configuration | Run A: per_device=2, ga=16 | Run B: per_device=8, ga=4 | Final loss gap |
|---|---|---|---|
| DeepSpeed `master` + `grad_accum_dtype: fp32` (broken) | `train_loss = 6.896` | `train_loss = 6.999` | **0.103** |
| No DeepSpeed (DDP + native grad-accum, control) | `train_loss = 6.9035` | `train_loss = 6.9037` | 0.0002 (bf16 noise) |
| `master` + `bf16.immediate_grad_update: true` (existing workaround) | `train_loss = 6.9057` | `train_loss = 6.9057` | < 0.0001 |
| **This PR + original config** | `train_loss = 6.9057` | `train_loss = 6.9059` | 0.0002 (bf16 noise) |
The broken case also produces qualitatively misleading instabilities —
e.g. at step 5 in the broken run, B's grad-norm spikes to **17.0** vs
A's **1.35** (≈ 12× ratio), while in the fixed case the two grad-norm
trajectories agree to within bf16 noise at every step.
Per-step loss / grad-norm trajectories under the fixed engine (this PR),
for completeness:
| step | A loss | A gnorm | B loss | B gnorm |
|---|---|---|---|---|
| 1 | 9.1907 | 8.0613 | 9.1907 | 8.0615 |
| 2 | 8.1962 | 5.0553 | 8.1961 | 5.0561 |
| 3 | 7.2035 | 2.2668 | 7.2035 | 2.2683 |
| 4 | 7.0588 | 3.1079 | 7.0618 | 3.1118 |
| 5 | 6.5661 | 2.6192 | 6.5627 | 2.4634 |
| 6 | 6.3097 | 1.8798 | 6.3086 | 1.8913 |
| 7 | 6.1332 | 1.2811 | 6.1317 | 1.2584 |
| 8 | 6.1297 | 2.8776 | 6.1305 | 2.9574 |
| 9 | 6.1739 | 1.4336 | 6.1748 | 1.4687 |
| 10 | 6.0950 | 1.5941 | 6.0957 | 1.6031 |
## Notes
- `BF16_Optimizer` users on `master` who are not pinning
`per_device_train_batch_size` may see silently degraded training when
sweeping per-device batch sizes (the symptom that triggered this
investigation). The bug also makes per-rank model weights briefly
diverge between the optimizer step and the next `update_lp_params()`
allgather, which means cross-rank invariants (e.g. asserts that compare
per-rank state) can flip behavior depending on
`gradient_accumulation_steps`.
- Tested on DeepSpeed 0.18.6 (where the bug was first observed) and
confirmed unchanged on `master` (0.18.10).
- No new tests are added in this PR, but a regression test that asserts
cross-rank fp32 buffer agreement after the boundary microbatch in
`BF16_Optimizer` would be a natural follow-up.
---------
Signed-off-by: Max Yu <18641481+maxyu1115@users.noreply.github.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
Co-authored-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
`topk_masked_gates` was previously being used across the tokens
dimension to determine which tokens had the highest importance for each
expert. However, it was using logits rather than probabilities to
determine this.
This was causing print statements like
```
if dist.get_rank() == 0:
    print(f"Mask mean: {mask.float().mean()}")
    print(f"Capacity mask mean: {capacity_mask.mean()}")
mask = torch.logical_and(mask, capacity_mask)
if dist.get_rank() == 0:
    print(f"Mask (after AND) mean: {mask.float().mean()}")
```
to often yield values like
```
Mask mean: 0.0625
Capacity mask mean: 0.0625
Mask (after AND) mean: 0.005908316932618618
```
and in turn the average number of routed experts per token was as low as
`0.001`.
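One way the two masks can end up almost disjoint is sketched below in plain Python. It is an illustration under assumptions, not the DeepSpeed gating code: `kept_tokens` is a hypothetical helper, routing is top-1, and entries for unrouted (token, expert) pairs are assumed to be filled with 0, so that negative logits of routed tokens rank below them while probabilities never do.

```python
import math


def softmax(row):
    exps = [math.exp(v - max(row)) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def kept_tokens(scores, capacity):
    """Count routed (token, expert) pairs that survive the capacity mask."""
    num_tokens, num_experts = len(scores), len(scores[0])
    # Top-1 routing: each token goes to its highest-scoring expert.
    routed = [max(range(num_experts), key=lambda e: row[e]) for row in scores]
    # Masked gates: the score where routed, 0 elsewhere (assumed fill value).
    masked = [[row[e] if e == routed[t] else 0.0 for e in range(num_experts)]
              for t, row in enumerate(scores)]
    kept = 0
    for e in range(num_experts):
        # Capacity mask: top-`capacity` tokens for this expert by masked score.
        order = sorted(range(num_tokens), key=lambda t: masked[t][e], reverse=True)
        chosen = set(order[:capacity])
        # AND with the routing mask: only tokens actually routed here count.
        kept += sum(1 for t in chosen if routed[t] == e)
    return kept


logits = [[-1.0, -2.0], [-0.5, -3.0], [-4.0, -1.5], [-2.5, -0.2]]
probs = [softmax(row) for row in logits]
kept_by_logits = kept_tokens(logits, capacity=2)
kept_by_probs = kept_tokens(probs, capacity=2)
print(kept_by_logits, kept_by_probs)
```

With these all-negative logits the capacity selection by logits picks only unrouted tokens, so nothing survives the AND, whereas selecting by probabilities keeps all four routed tokens.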
---------
Signed-off-by: Daniel Shen <dandanshen2002@gmail.com>
## Summary

Addresses #7912.

This PR adds DeepSpeed-specific NVTX domain support for instrumentation ranges while preserving the existing fallback behavior.

## Changes

- Add a `DeepSpeed` NVTX domain name for `instrument_w_nvtx`.
- Extend accelerator `range_push` / `range_pop` APIs with optional `domain` and `category` arguments.
- Use the NVIDIA `nvtx` package domain API in the CUDA accelerator when available.
- Fall back to `torch.cuda.nvtx` when the `nvtx` package is unavailable.
- Keep non-CUDA accelerator behavior unchanged by accepting and ignoring the optional arguments.
- Add focused unit tests for domain instrumentation, CUDA domain usage, and fallback behavior.

## Tests

### Compile check

```bash
PYTHONNOUSERSITE=1 /home/xdu/anaconda3/envs/simlingo/bin/python -m py_compile \
  deepspeed/utils/nvtx.py \
  accelerator/abstract_accelerator.py \
  accelerator/cuda_accelerator.py \
  accelerator/cpu_accelerator.py \
  accelerator/hpu_accelerator.py \
  accelerator/mlu_accelerator.py \
  accelerator/mps_accelerator.py \
  accelerator/npu_accelerator.py \
  accelerator/sdaa_accelerator.py \
  accelerator/xpu_accelerator.py \
  tests/unit/utils/test_nvtx.py
```

Output:

```text
Passed with no output.
```

### Unit tests

```bash
PYTHONNOUSERSITE=1 PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 /home/xdu/anaconda3/envs/simlingo/bin/python -m pytest \
  tests/unit/utils/test_nvtx.py \
  tests/unit/accelerator/test_accelerator.py -v
```

Key output:

```text
NVTX instrumentation calls: [('push', '_sample_nvtx_function', 'DeepSpeed', None), ('pop', 'DeepSpeed')]
CUDA NVTX domain calls: [('push', 'my_range', 'zero'), ('pop',)]
CUDA torch.nvtx fallback calls: [('push', 'my_range'), ('pop',)]
11 passed, 4 warnings in 1.88s
```

### Pre-commit

```bash
PRE_COMMIT_HOME=/tmp/pre-commit-cache PYTHONNOUSERSITE=1 /home/xdu/anaconda3/envs/simlingo/bin/python -m pre_commit run --files \
  accelerator/abstract_accelerator.py \
  accelerator/cpu_accelerator.py \
  accelerator/cuda_accelerator.py \
  accelerator/hpu_accelerator.py \
  accelerator/mlu_accelerator.py \
  accelerator/mps_accelerator.py \
  accelerator/npu_accelerator.py \
  accelerator/sdaa_accelerator.py \
  accelerator/xpu_accelerator.py \
  deepspeed/utils/nvtx.py \
  tests/unit/utils/test_nvtx.py
```

Output:

```text
All hooks passed.
```

Signed-off-by: heurry <restart12212022@163.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Author: @delock and @PKUWZP

## Summary

Integrate Gram Newton-Schulz (Gram NS) as the default orthogonalization method for the Muon optimizer, with a configurable `ns_method` switch to fall back to the original iteration when needed. Based on the Gram Newton-Schulz method from https://tridao.me/blog/2026/gram-newton-schulz/

## Motivation

Standard Newton-Schulz iterates on the full rectangular matrix X (n × m). Gram NS iterates on the much smaller Gram matrix R = X @ X.T (n × n), which is significantly cheaper when m >> n — the common case for transformer weight matrices (typical aspect ratio α ≈ 5).

## Changes

- Add `zeropower_via_gram_newtonschulz` in `original_muon.py` with fp16 compute (better precision than bf16 at the same cost) and a restart at iteration 2 for half-precision stability
- Add `ns_method` parameter (`"gram"` | `"standard"`) to `muon_update` and all Muon optimizer classes
- Thread `ns_method` through ZeRO Stage 1/2/3 call sites and DeepSpeed JSON config
- Automatic fallback to standard NS for square matrices (m ≤ n) where Gram NS has no FLOP advantage
- Documentation and unit tests for both methods across ZeRO Stage 1, 2, and 3

## Usage

```json
"optimizer": {
  "type": "Muon",
  "params": {
    "ns_method": "gram"
  }
}
```

Set `"ns_method": "standard"` to disable Gram NS and revert to original behavior (e.g., for debugging convergence issues).

Performance improvement:

<img width="1630" height="409" alt="image" src="https://github.com/user-attachments/assets/66364bb0-3a99-4cab-a428-10f31b7ae5fa" />

---------

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Co-authored-by: Zhipeng Wang <zhipeng.rainbowserie@gmail.com>
…LLM) (#7984)

## Description

Hello DeepSpeed Team! 👋

This PR directly addresses the **"Multimodal model support"** goal outlined in the **DeepSpeed Roadmap Q2 2026 (#7861)**. It introduces **AutoSP (Sequence Parallelism) support for Multimodal Models (ViT + LLM)** out of the box.

As noted in the roadmap, multimodal models handle significantly longer sequence lengths, making SP critical. This PR automates the injection of DeepSpeed Ulysses-based sequence parallelism into multimodal architectures, removing the need for manual and error-prone engineering efforts. This is a consolidated PR of several incremental features developed and thoroughly tested in my fork.

### 🎯 Related Issue

- Addresses the Multimodal model support item in **#7861 (DeepSpeed Roadmap Q2 2026)**.
- Builds upon the AutoSP foundation introduced in #7860.

### 🌟 Key Features & Contributions

1. **AutoSP Scaffolding & Detector (`auto_wrap_model_for_sp`)**:
   - Introduced a scanning utility to automatically detect ViT encoders and LLM decoders within a multimodal model.
   - Automatically wraps LLM decoder attention layers with DeepSpeed's existing `DistributedAttention`.
2. **ViT Sequence Parallelism (`UlyssesSPViTAttention`)**:
   - Implemented a Ulysses-style `Gather-Compute-Scatter` sequence parallel wrapper tailored for non-causal ViT attention layers.
   - Significantly reduces the memory footprint of ViT Feed-Forward Networks (FFN) and LayerNorms across the sequence dimension.
3. **Cross-Modal Fusion Adapters (Phase 2)**:
   - Handled the complex sequence scatter/gather at the vision-language boundary to ensure the LLM decoder receives uniformly sharded fused sequences.
   - Supported architectures include:
     - **LLaVA** (`LlavaFusionAdapter`): Visual token splice replacing image placeholders.
     - **InternVL** (`InternVLFusionAdapter`): `IMG_CONTEXT` token splice.
     - **Qwen2-VL** (`Qwen2VLFusionAdapter`): Vision_start/end bounded splice.

### 🧪 Testing & Validation

To ensure this PR does not break any existing functionality and is numerically sound, comprehensive tests have been added:

- **Numerical Equivalence Tests**: Added multi-GPU tests (`tests/unit/sequence_parallelism/test_autosp_equivalence.py`) verifying that the SP-wrapped path across N ranks produces the **exact same numerical results** as the equivalent single-device (non-SP) computation.
- **Integration Tests**: End-to-end mock integration tests validating the full pipeline from ViT to fusion adapter.
- **Benchmarks Provided**: Included a multimodal SP benchmark script (`benchmarks/autosp/bench_multimodal_sp.py`) to easily verify throughput scaling and peak GPU memory reduction.

*(All tests pass cleanly on 2 GPUs with `NCCL_P2P_DISABLE=1`)*

### 🚧 Known Limitations & Future Work

To be fully transparent, there are a few limitations in the current design that I plan to improve in follow-up iterations (or would love guidance on from the team):

1. **Manual Wrapping for Fusion Layers**: While ViT and LLM attentions are wrapped automatically, the vision projection layer currently requires manual wrapping with `ModalityFusionSPAdapter` due to varying HF model implementations. Fully automating Phase 2 is a logical next step.
2. **ViT SP Trade-off**: The current `UlyssesSPViTAttention` uses a Gather-Compute-Scatter approach. While it successfully reduces FFN memory by $1/N$, it still computes the full attention matrix on every rank. A true All-to-All sequence-to-head transposition for Opaque ViT layers is something I am actively exploring.
3. **Padding Attention Mask**: When `fused_len % world_size != 0`, zero-padding is applied. Currently, the global `attention_mask` is not automatically intercepted and patched, which might require user attention during inference.

---

I would deeply appreciate any feedback or suggestions from the maintainers! I am more than happy to make any required adjustments, refactorings, or add further test cases to get this perfectly aligned with the Q2 roadmap and DeepSpeed's standards. Thank you for your time reviewing this! 🚀

---------

Signed-off-by: nathon-lee <leejianwoo@gmail.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: nathon-lee <248585198+nathon-lee@users.noreply.github.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Add office hours link to README Signed-off-by: Logan Adams <loadams@microsoft.com>
After #7986, `tests/unit/moe/test_moe.py::TestTopkGate::test` started failing because the expected mask changed. This was caught by the full CI run ([log](https://github.com/deepspeedai/DeepSpeed/actions/runs/25724245746)). This PR updates `TestTopkGate`'s `drop_policy='probs'` expected mask for the `logits2` case to match the probability-based capacity selection introduced by #7986. Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
…e-skip AutoEP: skip AutoEP subtrees in AutoTP traversal
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Fix AutoEP PR 7938 review regressions
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
…w-followups Fix AutoEP PR #7938 review follow-ups
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
…ode-layout AutoEP: support DeepSeek-V3 remote-code layout
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
…nups AutoEP PR #7938 review cleanups
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Fix AutoEP PR 7938 review findings
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Reduce retained AutoEP tests to critical path
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
…lier Remove AutoEP backward loss multiplier
…-groups-mpu Use active MPU for AutoEP sequence-parallel size
Three changes to make Moonlight-16B-A3B (DeepSeek-V3 MoE) work with AutoEP + Muon + ZeRO-2:

1. e_score_correction_bias: copy the pretrained noaux_tc score-correction bias from the source gate into AutoEP routers and apply it in the TokenChoiceTopKRouter forward pass so expert selection matches the pretrained checkpoint.
2. is_expert_group: mark GroupedExperts w1/w2/w3 tensors with is_expert_group=True so Muon applies Newton-Schulz independently per expert slice rather than treating the stacked (E, I, O) tensor as a single matrix. muon_update grows an is_expert_group kwarg; all four call sites inside original_muon.py and the ZeRO-2 path in stage_1_and_2.py pass getattr(p, 'is_expert_group', False).
3. Muon + MoE param groups in engine.py: flatten dict-style param groups produced by configure_moe_param_groups before filtering by use_muon; re-tag optimizer flags after AutoEP layer replacement; add name keys for MoE group splitting; call split_params_into_different_moe_groups when the model has MoE layers.

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
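The role of the score-correction bias in the first change can be sketched for a single token. This is a simplified illustration, not the `TokenChoiceTopKRouter` code: `route_token` is a hypothetical function, and the sigmoid gate and the choice to take combine weights from the uncorrected scores are assumptions borrowed from the DeepSeek-V3 convention rather than details stated in the commit message.

```python
import math


def route_token(logits, correction_bias, top_k):
    """Pick experts by bias-corrected score; weight them by the raw score."""
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]  # sigmoid gate (assumed)
    biased = [s + b for s, b in zip(scores, correction_bias)]
    # Selection uses the corrected scores...
    chosen = sorted(range(len(scores)), key=lambda e: biased[e], reverse=True)[:top_k]
    # ...while the combine weights use the uncorrected scores (assumed).
    total = sum(scores[e] for e in chosen)
    return chosen, [scores[e] / total for e in chosen]


logits = [2.0, 1.0, 0.0, -1.0]
without_bias, _ = route_token(logits, [0.0, 0.0, 0.0, 0.0], top_k=2)
with_bias, weights = route_token(logits, [0.0, 0.0, 0.5, 0.0], top_k=2)
print(without_bias, with_bias, weights)
```

Dropping the bias changes which experts are selected (experts 0 and 1 instead of 2 and 0 here), which is why a router that ignores it would not match the pretrained checkpoint.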
Two fixes addressing masahiro's review feedback on PR #7938:

1. Auto-fill AutoEPConfig from HF model config (auto_ep_config.py, auto_ep.py): add fill_autoep_config_from_hf() which maps HF field names to AutoEP internal names on AutoEP.__init__:
   - n_group -> num_expert_groups
   - topk_group -> num_limited_groups
   - routed_scaling_factor -> route_scale
   User-supplied values always take precedence. Without this, Moonlight (DeepSeek-V3) training used route_scale=1.0 instead of 2.446, producing systematically wrong MoE output magnitudes.
2. Restore batched Newton-Schulz in muon_update (original_muon.py): replace the per-expert Python loop with a single batched call to zeropower_via_newtonschulz5, which already supports ndim>=2 inputs. This restores GPU parallelism across all E experts per step.

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
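The field mapping and precedence rule in the first fix can be sketched with plain dictionaries. This is not the real `fill_autoep_config_from_hf()`: the helper name `fill_from_hf`, the use of key absence to mean "not user-supplied", and the `n_group` / `topk_group` values are illustrative assumptions; only the three name pairs and the 2.446 scale come from the commit message.

```python
HF_TO_AUTOEP = {
    "n_group": "num_expert_groups",
    "topk_group": "num_limited_groups",
    "routed_scaling_factor": "route_scale",
}


def fill_from_hf(user_cfg, hf_cfg):
    """Return a copy of user_cfg with unset fields filled from hf_cfg."""
    filled = dict(user_cfg)
    for hf_name, autoep_name in HF_TO_AUTOEP.items():
        # A value the user already set always wins over the HF value.
        if autoep_name not in filled and hf_name in hf_cfg:
            filled[autoep_name] = hf_cfg[hf_name]
    return filled


# n_group / topk_group values below are placeholders, not Moonlight's actual config.
hf_cfg = {"n_group": 1, "topk_group": 1, "routed_scaling_factor": 2.446}
auto = fill_from_hf({}, hf_cfg)
override = fill_from_hf({"route_scale": 1.0}, hf_cfg)
print(auto)
print(override)
```

An empty user config picks up `route_scale=2.446` from the HF fields, while an explicit `route_scale` is left untouched.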
Now that the per-expert Python loop is replaced with a single batched call to zeropower_via_newtonschulz5, muon_update has no dynamic control flow and can be compiled again. Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Signed-off-by: Guokai Ma <guokai.ma@intel.com>
1. gram_newtonschulz: replace torch.addmm (2D only) with equivalent a*Q + Z@Q to support batched 3D expert weight tensors [num_local_experts, n, m]. Also fix diagonal() to specify dim1/dim2 for 3D tensors.
2. deepseek_v3 preset: remove e_score_correction_bias from unsupported_router_bias_names since auto_ep_layer.py already copies it correctly (lines 398-402).

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Author
Closing - this PR included the full history due to rebase. Will create a clean PR with only the 5 new commits.
Summary
Fixes to make Moonlight-16B-A3B (DeepSeek-V3 MoE) work with AutoEP + Muon + ZeRO-2, rebased onto the latest PR deepspeedai#7938 HEAD.
Changes
- `e_score_correction_bias`: Copy pretrained noaux_tc score-correction bias into AutoEP routers.
- `is_expert_group`: Muon applies NS independently per expert slice. Fix `gram_newtonschulz` for batched 3D tensors.
- `fill_autoep_config_from_hf`: Auto-fill AutoEPConfig from HF model config.
- Remove `e_score_correction_bias` from unsupported list.
- `ns_method` in expert group path: Respect `ns_method` when `is_expert_group=True`.

Tested with Moonlight-16B-A3B on MMLU, MBPP, GSM8K (4x H200, AutoEP ep_size=4, ZeRO-2 + Muon).
Related
For PR deepspeedai#7938.