AutoEP + Muon fixes for Moonlight (DeepSeek-V3 MoE) #25

Closed
delock wants to merge 82 commits into tohtana:add_autoep from deepspeedai:gma/autoep-muon-fixes

@delock delock commented May 19, 2026

Summary

Fixes to make Moonlight-16B-A3B (DeepSeek-V3 MoE) work with AutoEP + Muon + ZeRO-2, rebased onto the latest PR deepspeedai#7938 HEAD.

Changes

  1. e_score_correction_bias: Copy pretrained noaux_tc score-correction bias into AutoEP routers.
  2. is_expert_group: Muon applies NS independently per expert slice. Fix gram_newtonschulz for batched 3D tensors.
  3. fill_autoep_config_from_hf: Auto-fill AutoEPConfig from HF model config.
  4. Muon + MoE param groups: Flatten dict-style param groups; re-tag optimizer flags; split MoE groups.
  5. deepseek_v3 preset: Remove e_score_correction_bias from unsupported list.
  6. ns_method in expert group path: Respect ns_method when is_expert_group=True.
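
As a rough illustration of change 1, the sketch below shows how a noaux_tc-style score-correction bias can change which experts are selected without changing the gate weights themselves. It assumes the DeepSeek-V3 convention that the bias is added only for top-k selection, with the combine weights taken from the unbiased scores and renormalised. `route_token` and its arguments are hypothetical names, not the AutoEP router API.

```python
def route_token(scores, bias, top_k):
    """Pick top_k experts by (score + bias); weight them by the unbiased score."""
    biased = [s + b for s, b in zip(scores, bias)]
    chosen = sorted(range(len(scores)), key=lambda e: biased[e], reverse=True)[:top_k]
    total = sum(scores[e] for e in chosen)
    # Assumption: weights are renormalised over the selected experts.
    return {e: scores[e] / total for e in chosen}


scores = [0.50, 0.30, 0.20, 0.10]
no_bias = route_token(scores, [0.0, 0.0, 0.0, 0.0], top_k=2)
with_bias = route_token(scores, [0.0, 0.0, 0.15, 0.0], top_k=2)
print(sorted(no_bias), sorted(with_bias))
```

If the pretrained bias is not copied into the router, selection falls back to the unbiased ranking, which is why expert choice would no longer match the checkpoint.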

Tested with Moonlight-16B-A3B on MMLU, MBPP, GSM8K (4x H200, AutoEP ep_size=4, ZeRO-2 + Muon).

Related

For PR deepspeedai#7938.

tohtana and others added 30 commits April 16, 2026 04:25
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Currently the CI full test shows a [CUDA reinit
error](https://github.com/deepspeedai/DeepSpeed/actions/runs/24444633640/job/71417719445).
This PR includes the following fixes:

- Fix `compute_capability_args()` in JIT mode to read
`TORCH_CUDA_ARCH_LIST` before calling
`torch.cuda.get_device_capability()`, and restore JIT builder state
after `jit_load()`. Also add regression tests for the explicit-arch,
bad-fork, and restore paths.
- Delay initialization of CUDA streams in DeepCompile
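
The ordering in the first fix can be sketched as below: an explicit arch list is honoured first, and the device is only queried when none is given. This is a minimal illustration, not the builder code. `compute_capabilities` is a hypothetical helper, and `query_device` stands in for `torch.cuda.get_device_capability()`.

```python
import os


def compute_capabilities(query_device, env=os.environ):
    """Return capability strings, preferring TORCH_CUDA_ARCH_LIST over a device query."""
    arch_list = env.get("TORCH_CUDA_ARCH_LIST", "").strip()
    if arch_list:
        # Entries are separated by ';' or spaces and may carry a '+PTX' suffix.
        entries = arch_list.replace(";", " ").split()
        return [entry.split("+")[0] for entry in entries]
    # Only touch the device when no explicit arch list was given.
    major, minor = query_device()
    return [f"{major}.{minor}"]
```

With this ordering, a process that sets `TORCH_CUDA_ARCH_LIST` never reaches the device query at all.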

After this fix, the full test
[passed](https://github.com/deepspeedai/DeepSpeed/actions/runs/24508304055/job/71632434455)
again.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
… step (#7981)

## Summary

ZeRO-1/2 + `offload_optimizer` + `gradient_accumulation_steps=1` with
multiple `engine.backward()` calls per optimizer step (via
`set_gradient_accumulation_boundary()`, formalized in #7665) silently
drops all but the last backward's gradient.

`copy_grads_in_partition` only called
`async_accumulate_grad_in_cpu_via_gpu` under `if
gradient_accumulation_steps > 1`, so with `ga_steps=1` intermediate
backwards' reduced grads were never stored. The boundary
`async_inplace_copy_grad_to_fp32_buffer_from_gpu` then overwrote (not
added) the fp32 buffer with the last chunk only.

ZeRO-3 + offload and non-offload ZeRO-1/2 are unaffected.

## Fix

Replace the `ga > 1` gate with one that fires exactly when a CPU
accumulator is needed:

```python
if self.micro_step_id > 0 or not self.is_gradient_accumulation_boundary:
    self.async_accumulate_grad_in_cpu_via_gpu(param)
```

- `ga_steps=1` + single `backward()` → skipped. No CPU buffer, no extra
copy. Fast path preserved.
- `ga_steps=1` + multi-backward → accumulates correctly across calls.
- `ga_steps>1` → identical to prior behaviour.
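
The three cases above can be checked with a toy model of the gate. The sketch assumes `micro_step_id` starts at 0 for the first backward of an optimizer step. `run_step` is a hypothetical scalar simulation of what reaches the fp32 buffer, not DeepSpeed code.

```python
def needs_cpu_accumulator(micro_step_id, is_boundary):
    """New gate: accumulate whenever more than one backward feeds this step."""
    return micro_step_id > 0 or not is_boundary


def run_step(grads, boundaries, use_new_gate, ga_steps=1):
    """Return the value that reaches the fp32 buffer at the boundary."""
    cpu_acc = 0.0
    used_acc = False
    for micro_step_id, (grad, is_boundary) in enumerate(zip(grads, boundaries)):
        if use_new_gate:
            gate = needs_cpu_accumulator(micro_step_id, is_boundary)
        else:
            gate = ga_steps > 1  # old gate
        if gate:
            cpu_acc += grad
            used_acc = True
        if is_boundary:
            # With an accumulator the buffer gets the running sum,
            # otherwise it is overwritten with the last gradient only.
            return cpu_acc if used_acc else grad


multi = ([1.0, 2.0, 3.0, 4.0], [False, False, False, True])
old_result = run_step(*multi, use_new_gate=False)
new_result = run_step(*multi, use_new_gate=True)
single_result = run_step([5.0], [True], use_new_gate=True)
```

In this model the old gate keeps only the last gradient, the new gate keeps the sum, and the single-backward case never touches the accumulator.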

## Measurement

Setup: 2x H100, 3-layer MLP, Adam, lr=1e-3, N=4 backwards/step, ga_steps=1.

Max param diff vs no-offload reference:

|        | fp32                                 | bf16                 |
| ------ | ------------------------------------ | -------------------- |
| Before | 2.00e-03 (wrong, around 2 x lr) | —                    |
| After  | 7.45e-09  (noise)             | 0.00e+00 |

## Tests

New `tests/unit/v1/zero/test_zero2_offload_multi_backward.py`,
parametrized over ZeRO-1/2:

- multi-backward offload matches no-offload
- single-backward unchanged
- multi-step state-leak guard
- single-backward allocates no CPU buffer (perf guard)
- `ga_steps>1` + offload unchanged (#7967 regression guard)

---------

Signed-off-by: Sung Hyun Cho <hope5487@gmail.com>
…rd/backward visits (#7980)

In PyTorch AOT Autograd, having tensors requiring grad in inputs doesn't
guarantee backward graph compilation. If no output requires grad and no
input requiring grad is mutated, aot_autograd skips backward compilation
(see [1]).

DeepCompile previously expected backward compilation for every forward
graph that required grad, but decided this solely from the existence of
`requires_grad` tensors. This mismatch caused unbalanced forward/backward
visits, leaving graphs unvisited in `frames_needing_bwd`. The patched
FunctionMeta then remained effective during backward execution, raising
KeyError when removing the (already-removed) frame IDs from the
`frames_needing_bwd` set. A reproduction can be found at [2].

Simply guarding the set-removal operation is insufficient: the
backward graph would still be recompiled on each iteration, severely
impacting performance.

Instead of duplicating how AOT Autograd determines whether to compile
the backward graph, use the fact that a joint graph requires a backward
pass if and only if it is partitioned into a forward and a backward
module. The frame IDs of partitioned graphs are collected in the patched
partition functions and then used to determine `needs_backward` in the
forward compile function. `backend_fn` is not a proper place for the
second step since autograd creates fw/bw compile functions before
partitioning a joint graph.
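
The bookkeeping described above can be sketched as a toy model: a patched partition function records which frames were actually split, and the forward compile step consults that record. All names here (`partitioned_frames`, `patched_partition`, `needs_backward`) are hypothetical and do not mirror the DeepCompile or AOT Autograd APIs.

```python
partitioned_frames = set()


def patched_partition(frame_id, joint_graph, partition_fn):
    """Record the frame, then delegate to the real partitioner."""
    partitioned_frames.add(frame_id)
    return partition_fn(joint_graph)


def needs_backward(frame_id):
    """A frame needs a backward visit only if its joint graph was partitioned."""
    return frame_id in partitioned_frames


# Frame 3 is partitioned into forward/backward modules; frame 4 never is.
fw, bw = patched_partition(3, "joint", lambda g: (g + "_fw", g + "_bw"))
```

Keying the decision on the partition call avoids re-deriving AOT Autograd's own rules for when a backward graph is compiled.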

References

[1]
https://github.com/pytorch/pytorch/blob/aea31e0c306e2315bf6d84255e0dde7adf09762a/torch/_functorch/aot_autograd.py#L618
[2]
https://gist.github.com/eternalNight/96d6bc60e2bf566fda1300154d0e89dc

Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
Fix CPUAdam same-step subgroup drift in ZeRO-3 (#7819)

This PR ports the fix from #7820 to the latest DeepSpeed version.

It makes `Adam_Optimizer::IncrementStep` idempotent for repeated calls
at the same logical step and avoids unnecessary recomputation when the
step has not changed.

ZeRO-3/SuperOffload can invoke multiple subgroup updates within a single
logical step on a shared native optimizer object. The previous logic
mixed multiply and recompute paths, producing non-bit-identical
bias-correction metadata across subgroup calls.

This change aligns the step-transition logic in both the CPU and XPU
headers, clarifies first-step and non-sequential-step behavior, and
prevents unnecessary work on repeated same-step updates.
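
The step-transition logic can be modelled in a few lines of Python. This is a sketch of the idea only: the real code is the C++ `Adam_Optimizer::IncrementStep`, and the class and field names below are hypothetical.

```python
class StepState:
    """Tracks beta powers used for Adam bias correction."""

    def __init__(self, beta1, beta2):
        self.beta1, self.beta2 = beta1, beta2
        self.step = 0
        self.beta1_pow = 1.0
        self.beta2_pow = 1.0

    def increment_step(self, step):
        if step == self.step:
            return  # same logical step: keep the metadata bit-identical
        if step == self.step + 1:
            # Sequential step: cheap incremental update.
            self.beta1_pow *= self.beta1
            self.beta2_pow *= self.beta2
        else:
            # Non-sequential step: recompute from scratch.
            self.beta1_pow = self.beta1 ** step
            self.beta2_pow = self.beta2 ** step
        self.step = step

    def bias_corrections(self):
        return 1.0 - self.beta1_pow, 1.0 - self.beta2_pow


state = StepState(0.9, 0.999)
state.increment_step(1)
first = state.bias_corrections()
state.increment_step(1)  # second subgroup, same logical step
second = state.bias_corrections()
```

Because the repeated call returns early, every subgroup updated within one logical step sees exactly the same bias-correction values.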

It also adds CPUAdam regression tests covering subgroup-style repeated
same-step updates through both `step_subgroup()` and `step()` with
parameter swapping.

Signed-off-by: st_bang <st.bang@dgist.ac.kr>
Fix #6596

---------

Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
# Fix BF16_Optimizer last-microbatch grad leak under ZeRO-1

## Summary

`DeepSpeedEngine._backward_epilogue` calls `allreduce_gradients()`
**before** `optimizer.backward_epilogue()`. For `BF16_Optimizer` (used
when `bf16` model + `grad_accum_dtype: fp32` + ZeRO stage 1) without
`immediate_grad_update`, this means the **boundary microbatch's gradient
is added to the rank-local fp32 accumulator AFTER the cross-rank
allreduce, so it is silently skipped from the average.**

The bias is `(world_size − 1) / world_size × 1 /
gradient_accumulation_steps` of the per-step gradient. Because the bias
scales with per-microbatch grad weight, training trajectories visibly
diverge depending on `per_device_train_batch_size` even with identical
effective batch size — the symptom users see is loss / grad-norm curves
drifting apart between otherwise-equivalent configs.

The bug is reproducible in DeepSpeed 0.18.6 through current `master`
(0.18.10 at time of writing).

## Fix

Swap the order so `optimizer.backward_epilogue()` runs before
`allreduce_gradients()`, with `exit_backward()` left after.
`exit_backward()` only manages backward-hook state
(`_backward_hook_state`); it has no ordering dependency on the gradient
accumulator.

```python
def _backward_epilogue(self):
    self._stop_timers(self.engine_timers.backward_inner_timers)
    self._start_timers(self.engine_timers.backward_reduce_timers)
    # NEW: run backward_epilogue() before allreduce so the boundary microbatch
    # grad lands in the optimizer accumulator that gets reduced.
    if isinstance(self.optimizer, ZeROOptimizer):
        self.optimizer.backward_epilogue()

    if self.enable_backward_allreduce and not self.inside_no_sync_ctxt:
        self.allreduce_gradients()

    if isinstance(self.optimizer, ZeROOptimizer):
        self.optimizer.exit_backward()

    see_memory_usage("Engine after backward", force=self.memory_breakdown())
    self._stop_timers(self.engine_timers.backward_reduce_timers)
    self._stop_timers(self.engine_timers.backward_timers)
```

Diff: +10 / −1, single file (`deepspeed/runtime/engine.py`).

## Root cause walkthrough

The bug requires **both** of the following to be true:

1. The accumulator that `optimizer.backward_epilogue()` mutates is the
**same tensor** that `engine.allreduce_gradients()` later reduces, AND
2. The accumulator is updated *only* by `optimizer.backward_epilogue()`
(no per-param hooks updating it inline during backward).

Both conditions hold for `BF16_Optimizer` without
`immediate_grad_update`:

- It maintains a **separate fp32 accumulator**
(`fp32_groups_gradients_flat`) — distinct from `param.grad`.
- Its `backward_epilogue()` calls `update_hp_grads()` which casts each
param's bf16 `lp.grad` to fp32 and adds it into that accumulator (and
only this code path fills the accumulator when
`immediate_grad_update=False`).
- `engine.allreduce_gradients()` → `buffered_allreduce_fallback()` →
`optimizer.get_grads_for_reduction()` returns **the same
`non_expert_gradients` list = `fp32_groups_gradients_flat`**.

So on the gradient-accumulation boundary microbatch:
1. `loss.backward()` populates bf16 `lp.grad` for that microbatch.
2. `_backward_epilogue` first calls `allreduce_gradients()`. The fp32
accumulator at this point contains microbatches `0..ga-2`'s grads
(summed locally on each rank). The allreduce averages **only that**
across ranks.
3. `_backward_epilogue` then calls `optimizer.backward_epilogue()` →
`update_hp_grads()` → adds the boundary microbatch's local `lp.grad` to
the now-allreduced accumulator.

Result, per rank `i`:

```
fp32_buffer_rank_i = avg_ranks(Σ_{m=0..ga-2} grad_m) + local_grad_{ga-1}_rank_i
                     └────── shared across ranks ─────┘   └─── rank-private leak ───┘
```

ZeRO-1 partitions optimizer states across ranks, so each rank then runs
`optimizer.step()` on its slice of this rank-divergent buffer;
`update_lp_params()` allgathers the bf16 params back. The effective
gradient applied to parameter `p` is:

```
g_p = avg_ranks(prior microbatches' grad for p) + local_grad_last_for_p_at_owning_rank
```

i.e. the boundary microbatch's contribution captures only `1 /
world_size` of the cross-rank average, biasing the global gradient by
`(world_size − 1) / world_size × 1 / ga_steps`.
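
The two orderings can be compared with a scalar toy model. The sketch treats each microbatch gradient as a single number per rank and uses exact fractions. `accumulate` is a hypothetical helper that mirrors the formula above, not DeepSpeed code.

```python
from fractions import Fraction


def accumulate(grads, fixed):
    """grads[rank][microbatch] -> per-rank fp32 buffer value before step()."""
    world = len(grads)

    def avg_over_ranks(values):
        return sum(values, Fraction(0)) / world

    if fixed:
        # Epilogue first: every microbatch is in the buffer when the allreduce runs.
        return [avg_over_ranks([sum(g, Fraction(0)) for g in grads])] * world
    # Buggy order: the allreduce sees microbatches 0..ga-2 only,
    # then each rank adds its own last microbatch locally.
    shared = avg_over_ranks([sum(g[:-1], Fraction(0)) for g in grads])
    return [shared + g[-1] for g in grads]


grads = [[1, 2, 3, 4], [5, 6, 7, 8]]  # 2 ranks, ga = 4
buggy = accumulate(grads, fixed=False)
fixed = accumulate(grads, fixed=True)
print(buggy, fixed)
```

In this model each rank's buggy buffer differs from the correct value by its local last-microbatch gradient minus the cross-rank average of that gradient, which is the rank-private term in the formula above.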

## Impact on other optimizers (no behavior change)

| Optimizer | Accumulator | How accumulator is filled | Reduction path | Affected by current bug? | Affected by this fix? |
|---|---|---|---|---|---|
| `BF16_Optimizer` (immediate_grad_update=False) | separate `fp32_groups_gradients_flat` | only via `optimizer.backward_epilogue()` → `update_hp_grads()` | `engine.allreduce_gradients` → `buffered_allreduce_fallback` → `optimizer.get_grads_for_reduction()` returns the same fp32 buffer | **Yes (leak)** | **Yes (leak fixed)** |
| `BF16_Optimizer` (immediate_grad_update=True) | same fp32 buffer | per-param hooks (`create_grad_acc_hooks`) fire inline during backward | same allreduce path | No (hooks already filled buffer before allreduce) | No-op (`update_hp_grads` early-returns when `immediate_grad_update`) |
| `DeepSpeedZeroOptimizer_Stage1And2` (ZeRO-1, default for bf16+bf16-grad-accum) | `param.grad` directly + ipg buckets | hooks fire inline during backward (`overlap_comm=True` default), or `reduce_gradients()` walks all params at boundary | `engine.allreduce_gradients` takes the `if hasattr(self.optimizer, 'reduce_gradients')` branch → `optimizer.reduce_gradients()` walks all params; the boundary microbatch's grad is already on `param.grad` (autograd populates this before `_backward_epilogue` runs) | No | No-op (`Stage1And2.backward_epilogue` does not mutate the reduction buffer) |
| `DeepSpeedZeroOptimizer_Stage3` | partitioned via `overlapping_partition_gradients_reduce_epilogue()` | reduce-scatter inline during backward via hooks | `engine.allreduce_gradients` takes the `if zero_optimization_partition_gradients()` branch → calls overlapping epilogue, which is fed by hooks | No | No-op |

In short: the fix is functionally relevant **only** for `BF16_Optimizer`
without `immediate_grad_update`. For every other ZeRO optimizer the
change is observably a no-op because their `backward_epilogue` does not
mutate the buffer being reduced.

## Reproducer

The minimum reproducer is a 2-rank standalone script that runs one
gradient-accumulation cycle and prints the per-rank fp32 accumulator
norm at each microbatch and immediately before `engine.step()`. With the
bug present the per-rank values **disagree at the boundary microbatch
and going into the optimizer step**; with the fix they agree.

Save as `probe_bf16_grad_accum.py`:

```python
"""Probe whether DeepSpeed's BF16_Optimizer leaks the boundary microbatch grad
out of the cross-rank average. Run with two ranks, e.g.:
    deepspeed --num_gpus 2 probe_bf16_grad_accum.py
or
    accelerate launch --num_processes 2 --num_machines 1 probe_bf16_grad_accum.py
"""
import os
import torch
import torch.nn as nn
import deepspeed
import torch.distributed as dist


def main():
    rank = int(os.environ.get("RANK", "0"))
    world = int(os.environ.get("WORLD_SIZE", "1"))
    GA_STEPS = 4
    HIDDEN = 64
    BATCH = 4

    torch.manual_seed(0)  # SAME init across ranks
    model = nn.Sequential(
        nn.Linear(HIDDEN, HIDDEN),
        nn.GELU(),
        nn.Linear(HIDDEN, HIDDEN),
    ).to(torch.bfloat16).cuda()

    ds_config = {
        "bf16": {"enabled": True},  # set "immediate_grad_update": True to also bypass the bug
        "data_types": {"grad_accum_dtype": "fp32"},
        "communication_data_type": "fp32",
        "zero_optimization": {
            "stage": 1,
            "overlap_comm": True,
            "contiguous_gradients": True,
            "reduce_scatter": True,
            "allgather_partitions": True,
            "allgather_bucket_size": 200000000,
            "reduce_bucket_size": 200000000,
        },
        "train_micro_batch_size_per_gpu": BATCH,
        "gradient_accumulation_steps": GA_STEPS,
        "train_batch_size": BATCH * world * GA_STEPS,
        "gradient_clipping": 0.0,
        "steps_per_print": 9999,
    }
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, betas=(0.9, 0.95))
    engine, _, _, _ = deepspeed.initialize(model=model, optimizer=optimizer, config=ds_config)
    bf16_opt = engine.optimizer
    if rank == 0:
        print(f"[INFO] DeepSpeed optimizer class: {type(bf16_opt).__name__}", flush=True)
        print(f"[INFO] grad_acc_dtype = {bf16_opt.grad_acc_dtype}", flush=True)
        print(f"[INFO] world={world} ga={GA_STEPS} batch={BATCH}", flush=True)

    def buffer_summary(label):
        bufs = bf16_opt.fp32_groups_gradients_flat
        local_norm = sum(b.detach().to(torch.float64).norm().item() ** 2 for b in bufs) ** 0.5
        rank_buf = torch.tensor([local_norm], device="cuda", dtype=torch.float64)
        gathered = [torch.zeros_like(rank_buf) for _ in range(world)]
        dist.all_gather(gathered, rank_buf)
        if rank == 0:
            vals = [g.item() for g in gathered]
            diffs = [(v - vals[0]) / vals[0] * 100 if vals[0] != 0 else 0 for v in vals]
            print(f"[{label}] per_rank={vals}  diff_pct_vs_rank0={diffs}", flush=True)
        dist.barrier()

    buffer_summary("init (zero)")

    # Different inputs per rank so per-rank grads differ.
    torch.manual_seed(100 + rank)
    inputs = [torch.randn(BATCH, HIDDEN, device="cuda", dtype=torch.bfloat16) for _ in range(GA_STEPS)]
    targets = [torch.randn(BATCH, HIDDEN, device="cuda", dtype=torch.bfloat16) for _ in range(GA_STEPS)]

    for i in range(GA_STEPS):
        is_boundary = (i == GA_STEPS - 1)
        engine.set_gradient_accumulation_boundary(is_boundary=is_boundary)  # what accelerate's wrapper does
        out = engine(inputs[i])
        loss = ((out - targets[i]) ** 2).mean()
        engine.backward(loss)
        buffer_summary(f"after backward microbatch {i} (boundary={is_boundary})")

    buffer_summary("BEFORE engine.step()")

    engine.step()


if __name__ == "__main__":
    main()
```

## Verification

### Probe (synthetic, 2 GPUs, 1 grad-accum cycle)

Run on `master` (bug):

```
[init (zero)]                            per_rank=[0.0,    0.0   ]   diff = 0%
[after microbatch 0  (boundary=False)]   per_rank=[0.1495, 0.1378]   diff = -7.78%   (local-only accumulation)
[after microbatch 1  (boundary=False)]   per_rank=[0.1998, 0.2098]   diff = +5.03%
[after microbatch 2  (boundary=False)]   per_rank=[0.2322, 0.2434]   diff = +4.82%
[after microbatch 3  (boundary=True)]    per_rank=[0.2206, 0.2123]   diff = -3.80%   ← bug
[BEFORE engine.step()]                   per_rank=[0.2206, 0.2123]   diff = -3.80%   ← bug
```

Run on this PR (fixed):

```
[init (zero)]                            per_rank=[0.0,    0.0   ]   diff = 0%
[after microbatch 0  (boundary=False)]   per_rank=[0.1495, 0.1378]   diff = -7.78%
[after microbatch 1  (boundary=False)]   per_rank=[0.1998, 0.2098]   diff = +5.03%
[after microbatch 2  (boundary=False)]   per_rank=[0.2322, 0.2434]   diff = +4.82%
[after microbatch 3  (boundary=True)]    per_rank=[0.1942, 0.1942]   diff = 0.00%   ← fixed
[BEFORE engine.step()]                   per_rank=[0.1942, 0.1942]   diff = 0.00%   ← fixed
```

The same agreement is reproduced by the existing `bf16: {
immediate_grad_update: true }` workaround, which uses per-param hooks to
fill the fp32 accumulator inline during backward (and is therefore not
affected by the `_backward_epilogue` ordering).

### End-to-end training (HuggingFace Trainer + accelerate + DeepSpeed, 2× A100)

A small custom Qwen3-derived model (~64M params, bf16, ZeRO-1 with
`grad_accum_dtype: fp32`), 10 optimizer steps, identical seed and data
ordering, identical effective batch size (`global_batch_size = 64`),
only `per_device_train_batch_size` varies (so
`gradient_accumulation_steps = global_batch_size / (per_device *
world_size)` differs).

| Configuration | Run A: per_device=2, ga=16 | Run B: per_device=8, ga=4 | Final loss gap |
|---|---|---|---|
| DeepSpeed `master` + `grad_accum_dtype: fp32` (broken) | `train_loss = 6.896` | `train_loss = 6.999` | **0.103** |
| No DeepSpeed (DDP + native grad-accum, control) | `train_loss = 6.9035` | `train_loss = 6.9037` | 0.0002 (bf16 noise) |
| `master` + `bf16.immediate_grad_update: true` (existing workaround) | `train_loss = 6.9057` | `train_loss = 6.9057` | < 0.0001 |
| **This PR + original config** | `train_loss = 6.9057` | `train_loss = 6.9059` | 0.0002 (bf16 noise) |

The broken case also produces qualitatively misleading instabilities —
e.g. at step 5 in the broken run, B's grad-norm spikes to **17.0** vs
A's **1.35** (≈ 12× ratio), while in the fixed case the two grad-norm
trajectories agree to within bf16 noise at every step.

Per-step loss / grad-norm trajectories under the fixed engine (this PR),
for completeness:

| step | A loss | A gnorm | B loss | B gnorm |
|---|---|---|---|---|
| 1 | 9.1907 | 8.0613 | 9.1907 | 8.0615 |
| 2 | 8.1962 | 5.0553 | 8.1961 | 5.0561 |
| 3 | 7.2035 | 2.2668 | 7.2035 | 2.2683 |
| 4 | 7.0588 | 3.1079 | 7.0618 | 3.1118 |
| 5 | 6.5661 | 2.6192 | 6.5627 | 2.4634 |
| 6 | 6.3097 | 1.8798 | 6.3086 | 1.8913 |
| 7 | 6.1332 | 1.2811 | 6.1317 | 1.2584 |
| 8 | 6.1297 | 2.8776 | 6.1305 | 2.9574 |
| 9 | 6.1739 | 1.4336 | 6.1748 | 1.4687 |
| 10 | 6.0950 | 1.5941 | 6.0957 | 1.6031 |

## Notes

- `BF16_Optimizer` users on `master` who are not pinning
`per_device_train_batch_size` may see silently degraded training when
sweeping per-device batch sizes (the symptom that triggered this
investigation). The bug also makes per-rank model weights briefly
diverge between the optimizer step and the next `update_lp_params()`
allgather, which means cross-rank invariants (e.g. asserts that compare
per-rank state) can flip behavior depending on
`gradient_accumulation_steps`.
- Tested on DeepSpeed 0.18.6 (where the bug was first observed) and
confirmed unchanged on `master` (0.18.10).
- No new tests are added in this PR, but a regression test that asserts
cross-rank fp32 buffer agreement after the boundary microbatch in
`BF16_Optimizer` would be a natural follow-up.

---------

Signed-off-by: Max Yu <18641481+maxyu1115@users.noreply.github.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
Co-authored-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
`topk_masked_gates` was previously used along the token dimension to
determine which tokens had the highest importance for each expert, but
it ranked them by logits rather than probabilities.

As a result, debug prints such as
```
if dist.get_rank() == 0:
    print(f"Mask mean: {mask.float().mean()}")
    print(f"Capacity mask mean: {capacity_mask.mean()}")
mask = torch.logical_and(mask, capacity_mask)
if dist.get_rank() == 0:
    print(f"Mask (after AND) mean: {mask.float().mean()}")
```

often yielded values like
```
Mask mean: 0.0625
Capacity mask mean: 0.0625
Mask (after AND) mean: 0.005908316932618618
```

and in turn the average number of routed experts per token was as low as
`0.001`.
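
A small example shows why ranking tokens by logits and by probabilities can disagree. Logits are only comparable within one token, whereas the softmax over experts puts tokens on a common scale. This is a standalone sketch with made-up numbers. `best_token_for_expert` is a hypothetical helper, not the gating code.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def best_token_for_expert(values, expert):
    """Index of the token with the largest value for the given expert."""
    return max(range(len(values)), key=lambda t: values[t][expert])


# Rows are tokens, columns are experts.
logits = [[5.0, 4.0], [3.0, 0.0]]
probs = [softmax(row) for row in logits]
by_logits = best_token_for_expert(logits, expert=0)
by_probs = best_token_for_expert(probs, expert=0)
```

Token 0 has the larger raw logit for expert 0, but token 1 assigns expert 0 a much higher probability. A capacity mask built from logits can therefore keep different tokens than a routing mask built from probabilities, which would shrink their intersection.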

---------

Signed-off-by: Daniel Shen <dandanshen2002@gmail.com>
## Summary

Addresses #7912.

This PR adds DeepSpeed-specific NVTX domain support for instrumentation
ranges while preserving the existing fallback behavior.

## Changes

- Add a `DeepSpeed` NVTX domain name for `instrument_w_nvtx`.
- Extend accelerator `range_push` / `range_pop` APIs with optional
`domain` and `category` arguments.
- Use the NVIDIA `nvtx` package domain API in the CUDA accelerator when
available.
- Fall back to `torch.cuda.nvtx` when the `nvtx` package is unavailable.
- Keep non-CUDA accelerator behavior unchanged by accepting and ignoring
the optional arguments.
- Add focused unit tests for domain instrumentation, CUDA domain usage,
and fallback behavior.
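
The prefer-domain-then-fall-back behaviour in the list above can be sketched with injected backends. This is an illustration under assumptions, not the accelerator code. `make_range_api` is a hypothetical factory. `nvtx_pkg` is assumed to expose `push_range(message=..., domain=..., category=...)` and `pop_range(domain=...)` like the NVIDIA `nvtx` package, and `torch_nvtx` is assumed to behave like `torch.cuda.nvtx`.

```python
def make_range_api(nvtx_pkg=None, torch_nvtx=None):
    """Build range_push/range_pop that prefer a domain-aware backend."""

    def range_push(msg, domain=None, category=None):
        if nvtx_pkg is not None:
            nvtx_pkg.push_range(message=msg, domain=domain, category=category)
        elif torch_nvtx is not None:
            # No domain support here: the optional arguments are dropped.
            torch_nvtx.range_push(msg)

    def range_pop(domain=None):
        if nvtx_pkg is not None:
            nvtx_pkg.pop_range(domain=domain)
        elif torch_nvtx is not None:
            torch_nvtx.range_pop()

    return range_push, range_pop
```

Accelerators without NVTX support would correspond to passing neither backend, in which case both calls accept the arguments and do nothing.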

## Tests

### Compile check

```bash
PYTHONNOUSERSITE=1 /home/xdu/anaconda3/envs/simlingo/bin/python -m py_compile \
  deepspeed/utils/nvtx.py \
  accelerator/abstract_accelerator.py \
  accelerator/cuda_accelerator.py \
  accelerator/cpu_accelerator.py \
  accelerator/hpu_accelerator.py \
  accelerator/mlu_accelerator.py \
  accelerator/mps_accelerator.py \
  accelerator/npu_accelerator.py \
  accelerator/sdaa_accelerator.py \
  accelerator/xpu_accelerator.py \
  tests/unit/utils/test_nvtx.py
```

Output:

```text
Passed with no output.
```

### Unit tests

```bash
PYTHONNOUSERSITE=1 PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 /home/xdu/anaconda3/envs/simlingo/bin/python -m pytest \
  tests/unit/utils/test_nvtx.py \
  tests/unit/accelerator/test_accelerator.py -v
```

Key output:

```text
NVTX instrumentation calls: [('push', '_sample_nvtx_function', 'DeepSpeed', None), ('pop', 'DeepSpeed')]
CUDA NVTX domain calls: [('push', 'my_range', 'zero'), ('pop',)]
CUDA torch.nvtx fallback calls: [('push', 'my_range'), ('pop',)]
11 passed, 4 warnings in 1.88s
```

### Pre-commit

```bash
PRE_COMMIT_HOME=/tmp/pre-commit-cache PYTHONNOUSERSITE=1 /home/xdu/anaconda3/envs/simlingo/bin/python -m pre_commit run --files \
  accelerator/abstract_accelerator.py \
  accelerator/cpu_accelerator.py \
  accelerator/cuda_accelerator.py \
  accelerator/hpu_accelerator.py \
  accelerator/mlu_accelerator.py \
  accelerator/mps_accelerator.py \
  accelerator/npu_accelerator.py \
  accelerator/sdaa_accelerator.py \
  accelerator/xpu_accelerator.py \
  deepspeed/utils/nvtx.py \
  tests/unit/utils/test_nvtx.py
```

Output:

```text
All hooks passed.
```


Signed-off-by: heurry <restart12212022@163.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Authors: @delock and @PKUWZP

## Summary
Integrate Gram Newton-Schulz (Gram NS) as the default orthogonalization
method for the Muon optimizer, with a configurable `ns_method` switch to
fall back to the original iteration when needed.
Based on the Gram Newton-Schulz method from
https://tridao.me/blog/2026/gram-newton-schulz/
## Motivation
Standard Newton-Schulz iterates on the full rectangular matrix X (n ×
m). Gram NS iterates on the much smaller Gram matrix R = X @ X.T (n ×
n), which is significantly cheaper when m >> n — the common case for
transformer weight matrices (typical aspect ratio α ≈ 5).
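
To see why iterating on the Gram matrix can reproduce the standard iteration, note that each quintic step has the form X ← P(R) X with R = X Xᵀ and P(R) = aI + bR + cR². The next Gram matrix is then P R P, so the whole iteration can be tracked with n × n matrices and applied to X once at the end. The pure-Python sketch below checks that equivalence on a small matrix. It is not `zeropower_via_gram_newtonschulz`: there is no fp16 compute, no restart, and no input normalisation, and the coefficients are an assumption taken from the commonly used reference Muon values.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def poly(r, a, b, c):
    """Return a*I + b*R + c*R@R for a square matrix R."""
    n = len(r)
    r2 = matmul(r, r)
    return [[a * (i == j) + b * r[i][j] + c * r2[i][j] for j in range(n)] for i in range(n)]


COEFFS = (3.4445, -4.7750, 2.0315)  # assumed quintic coefficients


def standard_ns(x, steps=5):
    for _ in range(steps):
        p = poly(matmul(x, transpose(x)), *COEFFS)
        x = matmul(p, x)  # n x m product every iteration
    return x


def gram_ns(x, steps=5):
    n = len(x)
    r = matmul(x, transpose(x))  # n x n Gram matrix, formed once
    q = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(steps):
        p = poly(r, *COEFFS)
        q = matmul(p, q)             # accumulate the n x n transform
        r = matmul(matmul(p, r), p)  # Gram matrix of the next iterate
    return matmul(q, x)  # single n x m product at the end


x = [[0.3, -0.1, 0.2, 0.0, 0.1, -0.2], [0.1, 0.4, -0.3, 0.2, 0.0, 0.1]]
max_diff = max(
    abs(u - v)
    for row_s, row_g in zip(standard_ns(x), gram_ns(x))
    for u, v in zip(row_s, row_g)
)
```

Inside the loop the Gram variant touches only n × n matrices, which is where the saving for m >> n comes from.
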
## Changes
- Add `zeropower_via_gram_newtonschulz` in `original_muon.py` with fp16
compute (better precision than bf16 at the same cost)
and a restart at iteration 2 for half-precision stability
- Add `ns_method` parameter (`"gram"` | `"standard"`) to `muon_update`
and all Muon optimizer classes
- Thread `ns_method` through ZeRO Stage 1/2/3 call sites and DeepSpeed
JSON config
- Automatic fallback to standard NS for square matrices (m ≤ n) where
Gram NS has no FLOP advantage
- Documentation and unit tests for both methods across ZeRO Stage 1, 2,
and 3

## Usage
```json
"optimizer": {
    "type": "Muon",
    "params": {
        "ns_method": "gram"
    }
}
```

Set `"ns_method": "standard"` to disable Gram NS and revert to the original behavior (e.g., for debugging convergence issues).

Performance improvement:
<img width="1630" height="409" alt="image"
src="https://github.com/user-attachments/assets/66364bb0-3a99-4cab-a428-10f31b7ae5fa"
/>

---------

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Co-authored-by: Zhipeng Wang <zhipeng.rainbowserie@gmail.com>
…LLM) (#7984)

## Description

Hello DeepSpeed Team! 👋 

This PR directly addresses the **"Multimodal model support"** goal
outlined in the **DeepSpeed Roadmap Q2 2026 (#7861)**.

It introduces **AutoSP (Sequence Parallelism) support for Multimodal
Models (ViT + LLM)** out of the box. As noted in the roadmap, multimodal
models handle significantly longer sequence lengths, making SP critical.
This PR automates the injection of DeepSpeed Ulysses-based sequence
parallelism into multimodal architectures, removing the need for manual
and error-prone engineering efforts.

This is a consolidated PR of several incremental features developed and
thoroughly tested in my fork.

### 🎯 Related Issue
- Addresses the Multimodal model support item in **#7861 (DeepSpeed
Roadmap Q2 2026)**.
- Builds upon the AutoSP foundation introduced in #7860.

### 🌟 Key Features & Contributions

1. **AutoSP Scaffolding & Detector (`auto_wrap_model_for_sp`)**:
- Introduced a scanning utility to automatically detect ViT encoders and
LLM decoders within a multimodal model.
- Automatically wraps LLM decoder attention layers with DeepSpeed's
existing `DistributedAttention`.

2. **ViT Sequence Parallelism (`UlyssesSPViTAttention`)**:
- Implemented a Ulysses-style `Gather-Compute-Scatter` sequence parallel
wrapper tailored for non-causal ViT attention layers.
- Significantly reduces the memory footprint of ViT Feed-Forward
Networks (FFN) and LayerNorms across the sequence dimension.

3. **Cross-Modal Fusion Adapters (Phase 2)**:
- Handled the complex sequence scatter/gather at the vision-language
boundary to ensure the LLM decoder receives uniformly sharded fused
sequences.
   - Supported architectures include:
- **LLaVA** (`LlavaFusionAdapter`): Visual token splice replacing image
placeholders.
- **InternVL** (`InternVLFusionAdapter`): `IMG_CONTEXT` token splice.
- **Qwen2-VL** (`Qwen2VLFusionAdapter`): Vision_start/end bounded
splice.

### 🧪 Testing & Validation

To ensure this PR does not break any existing functionality and is
numerically sound, comprehensive tests have been added:

- **Numerical Equivalence Tests**: Added multi-GPU tests
(`tests/unit/sequence_parallelism/test_autosp_equivalence.py`) verifying
that the SP-wrapped path across N ranks produces the **exact same
numerical results** as the equivalent single-device (non-SP)
computation.
- **Integration Tests**: End-to-end mock integration tests validating
the full pipeline from ViT to fusion adapter.
- **Benchmarks Provided**: Included a multimodal SP benchmark script
(`benchmarks/autosp/bench_multimodal_sp.py`) to easily verify throughput
scaling and peak GPU memory reduction.

*(All tests pass cleanly on 2 GPUs with `NCCL_P2P_DISABLE=1`)*

### 🚧 Known Limitations & Future Work

To be fully transparent, there are a few limitations in the current
design that I plan to improve in follow-up iterations (or would love
guidance on from the team):

1. **Manual Wrapping for Fusion Layers**: While ViT and LLM attentions
are wrapped automatically, the vision projection layer currently
requires manual wrapping with `ModalityFusionSPAdapter` due to varying
HF model implementations. Fully automating Phase 2 is a logical next
step.
2. **ViT SP Trade-off**: The current `UlyssesSPViTAttention` uses a
Gather-Compute-Scatter approach. While it successfully reduces FFN
memory to roughly $1/N$ of the original, it still computes the full
attention matrix on every rank. A true All-to-All sequence-to-head
transposition for opaque ViT layers is something I am actively exploring.
3. **Padding Attention Mask**: When `fused_len % world_size != 0`,
zero-padding is applied. Currently, the global `attention_mask` is not
automatically intercepted and patched, which might require user
attention during inference.
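
For limitation 3, the padding arithmetic and the matching mask extension a user might apply by hand can be sketched as follows. `pad_for_sequence_parallel` is a hypothetical helper working on plain lists, not part of the PR, and it assumes padded positions should simply be masked out.

```python
def pad_for_sequence_parallel(token_ids, attention_mask, world_size, pad_id=0):
    """Pad to a multiple of world_size and split into equal per-rank shards."""
    pad_len = (-len(token_ids)) % world_size
    padded_ids = token_ids + [pad_id] * pad_len
    # Padded positions must not be attended to.
    padded_mask = attention_mask + [0] * pad_len
    shard = len(padded_ids) // world_size
    shards = [padded_ids[r * shard:(r + 1) * shard] for r in range(world_size)]
    return shards, padded_mask


shards, mask = pad_for_sequence_parallel([1, 2, 3, 4, 5], [1] * 5, world_size=4)
```

Here a fused sequence of length 5 on 4 ranks is padded to length 8, so each rank holds 2 positions and the last 3 mask entries are zero.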

---

I would deeply appreciate any feedback or suggestions from the
maintainers! I am more than happy to make any required adjustments,
refactorings, or add further test cases to get this perfectly aligned
with the Q2 roadmap and DeepSpeed's standards.

Thank you for your time reviewing this! 🚀

---------

Signed-off-by: nathon-lee <leejianwoo@gmail.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: nathon-lee <248585198+nathon-lee@users.noreply.github.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Add office hours link to README

Signed-off-by: Logan Adams <loadams@microsoft.com>
After #7986, `tests/unit/moe/test_moe.py::TestTopkGate::test` started
failing because the expected mask changed. This was caught by the full
CI run
([log](https://github.com/deepspeedai/DeepSpeed/actions/runs/25724245746)).

This PR updates `TestTopkGate`'s `drop_policy='probs'` expected mask for
the `logits2` case to match the probability-based capacity selection
introduced by #7986.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
tohtana and others added 27 commits May 13, 2026 15:32
…e-skip

AutoEP: skip AutoEP subtrees in AutoTP traversal
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
…w-followups

Fix AutoEP PR #7938 review follow-ups
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
…ode-layout

AutoEP: support DeepSeek-V3 remote-code layout
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Reduce retained AutoEP tests to critical path
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
…lier

Remove AutoEP backward loss multiplier
…-groups-mpu

Use active MPU for AutoEP sequence-parallel size
Three changes to make Moonlight-16B-A3B (DeepSeek-V3 MoE) work with
AutoEP + Muon + ZeRO-2:

1. e_score_correction_bias: copy the pretrained noaux_tc score-correction
   bias from the source gate into AutoEP routers and apply it in the
   TokenChoiceTopKRouter forward pass so expert selection matches the
   pretrained checkpoint.

2. is_expert_group: mark GroupedExperts w1/w2/w3 tensors with
   is_expert_group=True so Muon applies Newton-Schulz independently per
   expert slice rather than treating the stacked (E, I, O) tensor as a
   single matrix.  muon_update grows an is_expert_group kwarg; all four
   call sites inside original_muon.py and the ZeRO-2 path in
   stage_1_and_2.py pass getattr(p, 'is_expert_group', False).

3. Muon + MoE param groups in engine.py: flatten dict-style param groups
   produced by configure_moe_param_groups before filtering by use_muon;
   re-tag optimizer flags after AutoEP layer replacement; add name keys
   for MoE group splitting; call split_params_into_different_moe_groups
   when the model has MoE layers.
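
To illustrate item 2, here is a minimal sketch of why the stacked expert tensor should be orthogonalized per slice. `newton_schulz5_sketch` is a hypothetical stand-in written for this example, not the `muon_update` code from the PR. It uses the quintic coefficients commonly found in public Muon implementations, runs in float64 for a clean comparison, and the tensor sizes are made up.

```python
import torch


def newton_schulz5_sketch(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Quintic Newton-Schulz over the last two dims; leading dims act as a batch."""
    a, b, c = 3.4445, -4.7750, 2.0315
    transposed = G.size(-2) > G.size(-1)
    X = G.mT if transposed else G
    # Normalize each matrix by its own Frobenius norm.
    X = X / (torch.linalg.matrix_norm(X, keepdim=True) + 1e-7)
    for _ in range(steps):
        A = X @ X.mT
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.mT if transposed else X


torch.manual_seed(0)
stacked = torch.randn(4, 8, 16, dtype=torch.float64)  # (E, I, O), hypothetical sizes

# Per-expert treatment: one batched call, or an explicit loop over expert slices.
batched = newton_schulz5_sketch(stacked)
looped = torch.stack([newton_schulz5_sketch(stacked[e]) for e in range(stacked.size(0))])

# Single-matrix treatment: flatten the expert dimension into the rows.
flattened = newton_schulz5_sketch(stacked.reshape(-1, stacked.size(-1))).reshape_as(stacked)
```

The batched call and the per-expert loop agree, while the flattened variant normalizes and orthogonalizes all experts jointly and therefore yields a different update.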

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Two fixes addressing Masahiro's review feedback on PR #7938:

1. Auto-fill AutoEPConfig from HF model config (auto_ep_config.py,
   auto_ep.py): add fill_autoep_config_from_hf() which maps HF field
   names to AutoEP internal names on AutoEP.__init__:
   - n_group          -> num_expert_groups
   - topk_group       -> num_limited_groups
   - routed_scaling_factor -> route_scale
   User-supplied values always take precedence. Without this, Moonlight
   (DeepSeek-V3) training used route_scale=1.0 instead of 2.446,
   producing systematically wrong MoE output magnitudes.

2. Restore batched Newton-Schulz in muon_update (original_muon.py):
   replace the per-expert Python loop with a single batched call to
   zeropower_via_newtonschulz5, which already supports ndim>=2 inputs.
   This restores GPU parallelism across all E experts per step.
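
The field mapping in item 1 can be sketched as follows. This is not the actual `fill_autoep_config_from_hf()` implementation: `SketchAutoEPConfig` and `fill_from_hf_sketch` are hypothetical names, and treating `None` as "not set by the user" is an assumption made for the example. Only the three field-name pairs and the `2.446` value come from the commit message. The group values are placeholders.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Optional

# HF field name -> AutoEP field name, as listed in the commit message.
HF_TO_AUTOEP = {
    "n_group": "num_expert_groups",
    "topk_group": "num_limited_groups",
    "routed_scaling_factor": "route_scale",
}


@dataclass
class SketchAutoEPConfig:
    # None means "not supplied by the user" in this sketch (an assumption).
    num_expert_groups: Optional[int] = None
    num_limited_groups: Optional[int] = None
    route_scale: Optional[float] = None


def fill_from_hf_sketch(cfg: SketchAutoEPConfig, hf_config) -> SketchAutoEPConfig:
    """Fill unset fields from the HF config; user-supplied values win."""
    for hf_name, ep_name in HF_TO_AUTOEP.items():
        if getattr(cfg, ep_name) is None and hasattr(hf_config, hf_name):
            setattr(cfg, ep_name, getattr(hf_config, hf_name))
    return cfg


# Placeholder HF config; only routed_scaling_factor=2.446 is taken from the text.
hf = SimpleNamespace(n_group=1, topk_group=1, routed_scaling_factor=2.446)
filled = fill_from_hf_sketch(SketchAutoEPConfig(num_limited_groups=2), hf)
```

Here `num_limited_groups=2` is kept because the user set it, while the other two fields are taken from the HF config.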

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Now that the per-expert Python loop is replaced with a single batched
call to zeropower_via_newtonschulz5, muon_update has no dynamic control
flow and can be compiled again.

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
Signed-off-by: Guokai Ma <guokai.ma@intel.com>
1. gram_newtonschulz: replace torch.addmm (2D only) with equivalent
   a*Q + Z@Q to support batched 3D expert weight tensors
   [num_local_experts, n, m]. Also fix diagonal() to specify dim1/dim2
   for 3D tensors.

2. deepseek_v3 preset: remove e_score_correction_bias from
   unsupported_router_bias_names since auto_ep_layer.py already
   copies it correctly (lines 398-402).
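
A small check of the two tensor-shape points in item 1, under stated assumptions: the exact `torch.addmm` call inside `gram_newtonschulz` is not shown here, so the form `torch.addmm(Q, Z, Q, beta=a)` is assumed, and the coefficient and tensor sizes are arbitrary.

```python
import torch

torch.manual_seed(0)
a = 0.5  # arbitrary coefficient for the example

# 2D case: addmm(Q, Z, Q, beta=a) computes a*Q + Z@Q.
Q2 = torch.randn(4, 4, dtype=torch.float64)
Z2 = torch.randn(4, 4, dtype=torch.float64)
ref2 = torch.addmm(Q2, Z2, Q2, beta=a)
alt2 = a * Q2 + Z2 @ Q2

# 3D case: the plain expression broadcasts over the expert dimension,
# whereas addmm only accepts 2D matrices and must be applied per slice.
Q3 = torch.randn(3, 4, 4, dtype=torch.float64)  # [num_local_experts, n, n]
Z3 = torch.randn(3, 4, 4, dtype=torch.float64)
out3 = a * Q3 + Z3 @ Q3
per_expert = torch.stack([torch.addmm(Q3[e], Z3[e], Q3[e], beta=a) for e in range(3)])

# diagonal() defaults to dim1=0, dim2=1, which mixes the expert dim into the diagonal.
diag_default = torch.diagonal(Q3)                        # shape (4, 3)
diag_per_expert = torch.diagonal(Q3, dim1=-2, dim2=-1)   # shape (3, 4)
```

With `dim1=-2, dim2=-1`, each row of `diag_per_expert` is the diagonal of one expert's matrix, which is the per-expert quantity a batched iteration needs.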

Signed-off-by: Ma, Guokai <guokai.ma@gmail.com>
@delock delock requested a review from tohtana as a code owner May 19, 2026 06:58
@delock
Author

delock commented May 19, 2026

Closing - this PR included the full history due to rebase. Will create a clean PR with only the 5 new commits.

@delock delock closed this May 19, 2026