AutoEP + Muon fixes for Moonlight (DeepSeek-V3 MoE) by delock · Pull Request #25 · tohtana/DeepSpeed

delock · 2026-05-19T06:58:29Z

Summary

Fixes to make Moonlight-16B-A3B (DeepSeek-V3 MoE) work with AutoEP + Muon + ZeRO-2, rebased onto the latest PR deepspeedai#7938 HEAD.

Changes

e_score_correction_bias: Copy pretrained noaux_tc score-correction bias into AutoEP routers.
is_expert_group: Muon applies NS independently per expert slice. Fix gram_newtonschulz for batched 3D tensors.
fill_autoep_config_from_hf: Auto-fill AutoEPConfig from HF model config.
Muon + MoE param groups: Flatten dict-style param groups; re-tag optimizer flags; split MoE groups.
deepseek_v3 preset: Remove e_score_correction_bias from unsupported list.
ns_method in expert group path: Respect ns_method when is_expert_group=True.

Tested with Moonlight-16B-A3B on MMLU, MBPP, GSM8K (4x H200, AutoEP ep_size=4, ZeRO-2 + Muon).

Currently the CI full test shows a [CUDA reinit error](https://github.com/deepspeedai/DeepSpeed/actions/runs/24444633640/job/71417719445). This PR includes the following fixes: - Fix `compute_capability_args()` in JIT mode to read `TORCH_CUDA_ARCH_LIST` before calling `torch.cuda.get_device_capability()` and restores JIT builder state after `jit_load()`. It also adds regression tests for the explicit-arch, bad-fork, and restore paths. - Delay initialization of CUDA streams in DeepCompile After this fix, the full test [passed](https://github.com/deepspeedai/DeepSpeed/actions/runs/24508304055/job/71632434455) again. --------- Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

… step (#7981) ## Summary ZeRO-1/2 + `offload_optimizer` + `gradient_accumulation_steps=1` with multiple `engine.backward()` calls per optimizer step (via `set_gradient_accumulation_boundary()`, formalized in #7665) silently drops all but the last backward's gradient. `copy_grads_in_partition` only called `async_accumulate_grad_in_cpu_via_gpu` under `if gradient_accumulation_steps > 1`, so with `ga_steps=1` intermediate backwards' reduced grads were never stored. The boundary `async_inplace_copy_grad_to_fp32_buffer_from_gpu` then overwrote (not added) the fp32 buffer with the last chunk only. ZeRO-3 + offload and non-offload ZeRO-1/2 are unaffected. ## Fix Replace the `ga > 1` gate with one that fires exactly when a CPU accumulator is needed: ```python if self.micro_step_id > 0 or not self.is_gradient_accumulation_boundary: self.async_accumulate_grad_in_cpu_via_gpu(param) ``` - `ga_steps=1` + single `backward()` → skipped. No CPU buffer, no extra copy. Fast path preserved. - `ga_steps=1` + multi-backward → accumulates correctly across calls. - `ga_steps>1` → identical to prior behaviour. ## Measurement 2x H100, 3-layer MLP, Adam, lr=1e-3, N=4 backwards/step, ga_steps=1 Max param diff vs no-offload reference: | | fp32 | bf16 | | ------ | ------------------------------------ | -------------------- | | Before | 2.00e-03 (wrong, around 2 x lr) | — | | After | 7.45e-09 (noise) | 0.00e+00 | ## Tests New `tests/unit/v1/zero/test_zero2_offload_multi_backward.py`, parametrized over ZeRO-1/2: multi-backward offload matches no-offload / single-backward unchanged / multi-step state-leak guard / single-backward allocates no CPU buffer (perf guard) / `ga_steps>1` + offload unchanged (#7967 regression guard). --------- Signed-off-by: Sung Hyun Cho <hope5487@gmail.com>

…rd/backward visits (#7980) In PyTorch AOT Autograd, having tensors requiring grad in inputs doesn't guarantee backward graph compilation. If no output requires grad and no input requiring grad is mutated, aot_autograd skips backward compilation (see [1]). DeepCompile previously required backward compilation for every forward graph which required grad, but relied solely on the existence of require_grad tensors. This mismatch caused unbalanced forward/backward visits, leaving graphs unvisited in `frames_needing_bwd`. The patched FunctionMeta then remained effective during backward execution, raising KeyError when removing the (already-removed) frame IDs from the `frames_needing_bwd` set. A reproduction can be found at [2]. Simply put a guard on the set removal operation is insufficient. The backward graph is still recompiled on each iteration, severely impacting performance. Instead of duplicating how AOT Autograd determines whether to compile the backward graph, use the fact that a joint graph requires a backward pass if and only if it is partitioned into a forward and a backward module. The frame IDs of partitioned graphs are collected in the patched partition functions and then used to determine `needs_backward` in the forward compile function. `backend_fn` is not a proper place for the second step since autograd creates fw/bw compile functions before partitioning a joint graph. References [1] https://github.com/pytorch/pytorch/blob/aea31e0c306e2315bf6d84255e0dde7adf09762a/torch/_functorch/aot_autograd.py#L618 [2] https://gist.github.com/eternalNight/96d6bc60e2bf566fda1300154d0e89dc Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>

Fix CPUAdam same-step subgroup drift in ZeRO-3 (#7819) This PR ports the fix from #7820 to the latest DeepSpeed version. It makes `Adam_Optimizer::IncrementStep` idempotent for repeated calls at the same logical step and avoids unnecessary recomputation when the step has not changed. ZeRO-3/SuperOffload can invoke multiple subgroup updates within a single logical step on a shared native optimizer object. The previous logic mixed multiply and recompute paths, producing non-bit-identical bias-correction metadata across subgroup calls. This change aligns the step-transition logic in both the CPU and XPU headers, clarifies first-step and non-sequential-step behavior, and prevents unnecessary work on repeated same-step updates. It also adds CPUAdam regression tests covering subgroup-style repeated same-step updates through both `step_subgroup()` and `step()` with parameter swapping. Signed-off-by: st_bang <st.bang@dgist.ac.kr>

Fix #6596 --------- Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>

# Fix BF16_Optimizer last-microbatch grad leak under ZeRO-1 ## Summary `DeepSpeedEngine._backward_epilogue` calls `allreduce_gradients()` **before** `optimizer.backward_epilogue()`. For `BF16_Optimizer` (used when `bf16` model + `grad_accum_dtype: fp32` + ZeRO stage 1) without `immediate_grad_update`, this means the **boundary microbatch's gradient is added to the rank-local fp32 accumulator AFTER the cross-rank allreduce, so it is silently skipped from the average.** The bias is `(world_size − 1) / world_size × 1 / gradient_accumulation_steps` of the per-step gradient. Because the bias scales with per-microbatch grad weight, training trajectories visibly diverge depending on `per_device_train_batch_size` even with identical effective batch size — the symptom users see is loss / grad-norm curves drifting apart between otherwise-equivalent configs. The bug is reproducible in DeepSpeed 0.18.6 through current `master` (0.18.10 at time of writing). ## Fix Swap the order so `optimizer.backward_epilogue()` runs before `allreduce_gradients()`, with `exit_backward()` left after. `exit_backward()` only manages backward-hook state (`_backward_hook_state`); it has no ordering dependency on the gradient accumulator. ```python def _backward_epilogue(self): self._stop_timers(self.engine_timers.backward_inner_timers) self._start_timers(self.engine_timers.backward_reduce_timers) # NEW: run backward_epilogue() before allreduce so the boundary microbatch # grad lands in the optimizer accumulator that gets reduced. if isinstance(self.optimizer, ZeROOptimizer): self.optimizer.backward_epilogue() if self.enable_backward_allreduce and not self.inside_no_sync_ctxt: self.allreduce_gradients() if isinstance(self.optimizer, ZeROOptimizer): self.optimizer.exit_backward() see_memory_usage("Engine after backward", force=self.memory_breakdown()) self._stop_timers(self.engine_timers.backward_reduce_timers) self._stop_timers(self.engine_timers.backward_timers) ``` Diff: +10 / −1, single file (`deepspeed/runtime/engine.py`). ## Root cause walkthrough The bug requires **both** of the following to be true: 1. The accumulator that `optimizer.backward_epilogue()` mutates is the **same tensor** that `engine.allreduce_gradients()` later reduces, AND 2. The accumulator is updated *only* by `optimizer.backward_epilogue()` (no per-param hooks updating it inline during backward). Both conditions hold for `BF16_Optimizer` without `immediate_grad_update`: - It maintains a **separate fp32 accumulator** (`fp32_groups_gradients_flat`) — distinct from `param.grad`. - Its `backward_epilogue()` calls `update_hp_grads()` which casts each param's bf16 `lp.grad` to fp32 and adds it into that accumulator (and only this code path fills the accumulator when `immediate_grad_update=False`). - `engine.allreduce_gradients()` → `buffered_allreduce_fallback()` → `optimizer.get_grads_for_reduction()` returns **the same `non_expert_gradients` list = `fp32_groups_gradients_flat`**. So on the gradient-accumulation boundary microbatch: 1. `loss.backward()` populates bf16 `lp.grad` for that microbatch. 2. `_backward_epilogue` first calls `allreduce_gradients()`. The fp32 accumulator at this point contains microbatches `0..ga-2`'s grads (summed locally on each rank). The allreduce averages **only that** across ranks. 3. `_backward_epilogue` then calls `optimizer.backward_epilogue()` → `update_hp_grads()` → adds the boundary microbatch's local `lp.grad` to the now-allreduced accumulator. Result, per rank `i`: ``` fp32_buffer_rank_i = avg_ranks(Σ_{m=0..ga-2} grad_m) + local_grad_{ga-1}_rank_i └────── shared across ranks ─────┘ └─── rank-private leak ───┘ ``` ZeRO-1 partitions optimizer states across ranks, so each rank then runs `optimizer.step()` on its slice of this rank-divergent buffer; `update_lp_params()` allgathers the bf16 params back. The effective gradient applied to parameter `p` is: ``` g_p = avg_ranks(prior microbatches' grad for p) + local_grad_last_for_p_at_owning_rank ``` i.e. the boundary microbatch's contribution captures only `1 / world_size` of the cross-rank average, biasing the global gradient by `(world_size − 1) / world_size × 1 / ga_steps`. ## Impact on other optimizers (no behavior change) | Optimizer | Accumulator | How accumulator is filled | Reduction path | Affected by current bug? | Affected by this fix? | |---|---|---|---|---|---| | `BF16_Optimizer` (immediate_grad_update=False) | separate `fp32_groups_gradients_flat` | only via `optimizer.backward_epilogue()` → `update_hp_grads()` | `engine.allreduce_gradients` → `buffered_allreduce_fallback` → `optimizer.get_grads_for_reduction()` returns the same fp32 buffer | **Yes — leak** | **Yes — leak fixed** | | `BF16_Optimizer` (immediate_grad_update=True) | same fp32 buffer | per-param hooks (`create_grad_acc_hooks`) fire inline during backward | same allreduce path | No (hooks already filled buffer before allreduce) | No-op (`update_hp_grads` early-returns when `immediate_grad_update`) | | `DeepSpeedZeroOptimizer_Stage1And2` (ZeRO-1, default for bf16+bf16-grad-accum) | `param.grad` directly + ipg buckets | hooks fire inline during backward (`overlap_comm=True` default), or `reduce_gradients()` walks all params at boundary | `engine.allreduce_gradients` takes the `if hasattr(self.optimizer, 'reduce_gradients')` branch → `optimizer.reduce_gradients()` walks all params; the boundary microbatch's grad is already on `param.grad` (autograd populates this before `_backward_epilogue` runs) | No | No-op (`Stage1And2.backward_epilogue` does not mutate the reduction buffer) | | `DeepSpeedZeroOptimizer_Stage3` | partitioned via `overlapping_partition_gradients_reduce_epilogue()` | reduce-scatter inline during backward via hooks | `engine.allreduce_gradients` takes the `if zero_optimization_partition_gradients()` branch → calls overlapping epilogue, which is fed by hooks | No | No-op | In short: the fix is functionally relevant **only** for `BF16_Optimizer` without `immediate_grad_update`. For every other ZeRO optimizer the change is observably a no-op because their `backward_epilogue` does not mutate the buffer being reduced. ## Reproducer The minimum reproducer is a 2-rank standalone script that runs one gradient-accumulation cycle and prints the per-rank fp32 accumulator norm at each microbatch and immediately before `engine.step()`. With the bug present the per-rank values **disagree at the boundary microbatch and going into the optimizer step**; with the fix they agree. Save as `probe_bf16_grad_accum.py`: ```python """Probe whether DeepSpeed's BF16_Optimizer leaks the boundary microbatch grad out of the cross-rank average. Run with two ranks, e.g.: deepspeed --num_gpus 2 probe_bf16_grad_accum.py or accelerate launch --num_processes 2 --num_machines 1 probe_bf16_grad_accum.py """ import os import torch import torch.nn as nn import deepspeed import torch.distributed as dist def main(): rank = int(os.environ.get("RANK", "0")) world = int(os.environ.get("WORLD_SIZE", "1")) GA_STEPS = 4 HIDDEN = 64 BATCH = 4 torch.manual_seed(0) # SAME init across ranks model = nn.Sequential( nn.Linear(HIDDEN, HIDDEN), nn.GELU(), nn.Linear(HIDDEN, HIDDEN), ).to(torch.bfloat16).cuda() ds_config = { "bf16": {"enabled": True}, # set "immediate_grad_update": True to also bypass the bug "data_types": {"grad_accum_dtype": "fp32"}, "communication_data_type": "fp32", "zero_optimization": { "stage": 1, "overlap_comm": True, "contiguous_gradients": True, "reduce_scatter": True, "allgather_partitions": True, "allgather_bucket_size": 200000000, "reduce_bucket_size": 200000000, }, "train_micro_batch_size_per_gpu": BATCH, "gradient_accumulation_steps": GA_STEPS, "train_batch_size": BATCH * world * GA_STEPS, "gradient_clipping": 0.0, "steps_per_print": 9999, } optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, betas=(0.9, 0.95)) engine, _, _, _ = deepspeed.initialize(model=model, optimizer=optimizer, config=ds_config) bf16_opt = engine.optimizer if rank == 0: print(f"[INFO] DeepSpeed optimizer class: {type(bf16_opt).__name__}", flush=True) print(f"[INFO] grad_acc_dtype = {bf16_opt.grad_acc_dtype}", flush=True) print(f"[INFO] world={world} ga={GA_STEPS} batch={BATCH}", flush=True) def buffer_summary(label): bufs = bf16_opt.fp32_groups_gradients_flat local_norm = sum(b.detach().to(torch.float64).norm().item() ** 2 for b in bufs) ** 0.5 rank_buf = torch.tensor([local_norm], device="cuda", dtype=torch.float64) gathered = [torch.zeros_like(rank_buf) for _ in range(world)] dist.all_gather(gathered, rank_buf) if rank == 0: vals = [g.item() for g in gathered] diffs = [(v - vals[0]) / vals[0] * 100 if vals[0] != 0 else 0 for v in vals] print(f"[{label}] per_rank={vals} diff_pct_vs_rank0={diffs}", flush=True) dist.barrier() buffer_summary("init (zero)") # Different inputs per rank so per-rank grads differ. torch.manual_seed(100 + rank) inputs = [torch.randn(BATCH, HIDDEN, device="cuda", dtype=torch.bfloat16) for _ in range(GA_STEPS)] targets = [torch.randn(BATCH, HIDDEN, device="cuda", dtype=torch.bfloat16) for _ in range(GA_STEPS)] for i in range(GA_STEPS): is_boundary = (i == GA_STEPS - 1) engine.set_gradient_accumulation_boundary(is_boundary=is_boundary) # what accelerate's wrapper does out = engine(inputs[i]) loss = ((out - targets[i]) ** 2).mean() engine.backward(loss) buffer_summary(f"after backward microbatch {i} (boundary={is_boundary})") buffer_summary("BEFORE engine.step()") engine.step() if __name__ == "__main__": main() ``` ## Verification ### Probe (synthetic, 2 GPUs, 1 grad-accum cycle) Run on `master` (bug): ``` [init (zero)] per_rank=[0.0, 0.0 ] diff = 0% [after microbatch 0 (boundary=False)] per_rank=[0.1495, 0.1378] diff = -7.78% (local-only accumulation) [after microbatch 1 (boundary=False)] per_rank=[0.1998, 0.2098] diff = +5.03% [after microbatch 2 (boundary=False)] per_rank=[0.2322, 0.2434] diff = +4.82% [after microbatch 3 (boundary=True)] per_rank=[0.2206, 0.2123] diff = -3.80% ← bug [BEFORE engine.step()] per_rank=[0.2206, 0.2123] diff = -3.80% ← bug ``` Run on this PR (fixed): ``` [init (zero)] per_rank=[0.0, 0.0 ] diff = 0% [after microbatch 0 (boundary=False)] per_rank=[0.1495, 0.1378] diff = -7.78% [after microbatch 1 (boundary=False)] per_rank=[0.1998, 0.2098] diff = +5.03% [after microbatch 2 (boundary=False)] per_rank=[0.2322, 0.2434] diff = +4.82% [after microbatch 3 (boundary=True)] per_rank=[0.1942, 0.1942] diff = 0.00% ← fixed [BEFORE engine.step()] per_rank=[0.1942, 0.1942] diff = 0.00% ← fixed ``` The same agreement is reproduced by the existing `bf16: { immediate_grad_update: true }` workaround, which uses per-param hooks to fill the fp32 accumulator inline during backward (and is therefore not affected by the `_backward_epilogue` ordering). ### End-to-end training (HuggingFace Trainer + accelerate + DeepSpeed, 2× A100) A small custom Qwen3-derived model (~64M params, bf16, ZeRO-1 with `grad_accum_dtype: fp32`), 10 optimizer steps, identical seed and data ordering, identical effective batch size (`global_batch_size = 64`), only `per_device_train_batch_size` varies (so `gradient_accumulation_steps = global_batch_size / (per_device * world_size)` differs). | Configuration | Run A: per_device=2, ga=16 | Run B: per_device=8, ga=4 | Final loss gap | |---|---|---|---| | DeepSpeed `master` + `grad_accum_dtype: fp32` (broken) | `train_loss = 6.896` | `train_loss = 6.999` | **0.103** | | No DeepSpeed (DDP + native grad-accum, control) | `train_loss = 6.9035` | `train_loss = 6.9037` | 0.0002 (bf16 noise) | | `master` + `bf16.immediate_grad_update: true` (existing workaround) | `train_loss = 6.9057` | `train_loss = 6.9057` | < 0.0001 | | **This PR + original config** | `train_loss = 6.9057` | `train_loss = 6.9059` | 0.0002 (bf16 noise) | The broken case also produces qualitatively misleading instabilities — e.g. at step 5 in the broken run, B's grad-norm spikes to **17.0** vs A's **1.35** (≈ 12× ratio), while in the fixed case the two grad-norm trajectories agree to within bf16 noise at every step. Per-step loss / grad-norm trajectories under the fixed engine (this PR), for completeness: | step | A loss | A gnorm | B loss | B gnorm | |---|---|---|---|---| | 1 | 9.1907 | 8.0613 | 9.1907 | 8.0615 | | 2 | 8.1962 | 5.0553 | 8.1961 | 5.0561 | | 3 | 7.2035 | 2.2668 | 7.2035 | 2.2683 | | 4 | 7.0588 | 3.1079 | 7.0618 | 3.1118 | | 5 | 6.5661 | 2.6192 | 6.5627 | 2.4634 | | 6 | 6.3097 | 1.8798 | 6.3086 | 1.8913 | | 7 | 6.1332 | 1.2811 | 6.1317 | 1.2584 | | 8 | 6.1297 | 2.8776 | 6.1305 | 2.9574 | | 9 | 6.1739 | 1.4336 | 6.1748 | 1.4687 | | 10 | 6.0950 | 1.5941 | 6.0957 | 1.6031 | ## Notes - `BF16_Optimizer` users on `master` who are not pinning `per_device_train_batch_size` may see silently degraded training when sweeping per-device batch sizes (the symptom that triggered this investigation). The bug also makes per-rank model weights briefly diverge between the optimizer step and the next `update_lp_params()` allgather, which means cross-rank invariants (e.g. asserts that compare per-rank state) can flip behavior depending on `gradient_accumulation_steps`. - Tested on DeepSpeed 0.18.6 (where the bug was first observed) and confirmed unchanged on `master` (0.18.10). - No new tests are added in this PR, but a regression test that asserts cross-rank fp32 buffer agreement after the boundary microbatch in `BF16_Optimizer` would be a natural follow-up. --------- Signed-off-by: Max Yu <18641481+maxyu1115@users.noreply.github.com> Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com> Co-authored-by: Ma, Guokai <guokai.ma@gmail.com> Co-authored-by: Masahiro Tanaka <mtanaka@anyscale.com>