Enable bf16 check_grad_overflow by default (matching fp16) by yongzhe-wang · Pull Request #8035 · deepspeedai/DeepSpeed

yongzhe-wang · 2026-05-29T03:38:21Z

Summary

Flip DeepSpeedBF16Config.check_grad_overflow default from False to True, so bf16 users get the same gradient-overflow protection that fp16 users already get by default.

Motivation

The bf16 documentation states bf16 "does not require loss scaling" (deepspeed.ai/docs/config-json/), but this overstates the safety guarantee for the bf16 + ZeRO-2 (non-offload) partition-flat gradient accumulation path. We reproduced a deterministic catastrophic NaN under a small set of training conditions:

- ZeRO-2 (non-offload) + bf16
- Mixture-of-Transformers (modality-specific transformer branches)
- Heterogeneous per-sample loss masks (e.g. 50% action-invalid samples in robotics VLA training)

Under these conditions, a single bf16 element in engine.optimizer.averaged_gradients[i] overflows to +inf. The downstream Adam.step then computes inf / sqrt(inf) = NaN in a fused kernel, which simultaneously corrupts the partition slice's exp_avg, exp_avg_sq, and fp32 master weights. The next forward pass propagates NaN through every layer; the training run is dead with no useful diagnostic. Reproduced consistently in DeepSpeed 0.16.9 - 0.17.1 at step ~22 with our internal repro.
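The failure mode described above can be illustrated with a scalar toy version of the Adam update. This is a minimal sketch, not DeepSpeed's fused kernel: `adam_step` is a hypothetical helper, it runs on ordinary Python floats rather than bf16 tensors, and the `+inf` gradient is injected by hand to stand in for the overflowed element.

```python
import math


def adam_step(param, grad, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam update on a single scalar (illustrative only)."""
    m = beta1 * m + (1 - beta1) * grad         # first moment (exp_avg)
    v = beta2 * v + (1 - beta2) * grad * grad  # second moment (exp_avg_sq)
    m_hat = m / (1 - beta1 ** step)            # bias correction
    v_hat = v / (1 - beta2 ** step)
    # With grad = +inf this is inf / (inf + eps), which evaluates to NaN.
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# Step 1: a single overflowed gradient element reaches the optimizer.
param, m, v = adam_step(param=0.5, grad=float("inf"), m=0.0, v=0.0, step=1)

# Step 2: the gradient is finite again, but the state is already poisoned.
param2, m2, v2 = adam_step(param=param, grad=0.01, m=m, v=v, step=2)
```

In this toy version the weight becomes NaN on the first bad step, while both moment estimates become `inf` and stay there, so later finite gradients cannot bring the weight back.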

The infrastructure to detect and skip such steps was correctly added by #6976 (the check_grad_overflow option, the DeepSpeedZeroOptimizer.check_overflow method, and the step-skip logic at stage_1_and_2.step() lines ~2128-2143). However, the default was set to False for bf16, so users who hit this condition do not receive the protection unless they enable the option explicitly.
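The detect-and-skip idea can be sketched in a few lines. This is only an illustration of the pattern, not the code in stage_1_and_2.py: `has_overflow` and `guarded_step` are hypothetical names, the update is plain SGD on Python lists, and a real multi-rank run would also need every rank to agree on the overflow flag before skipping.

```python
import math


def has_overflow(grads):
    """Return True if any gradient element is inf or NaN."""
    return any(not math.isfinite(g) for g in grads)


def guarded_step(params, grads, lr=0.1):
    """Apply a plain SGD update, or skip it when the gradients overflowed.

    Returns (new_params, skipped). On a skipped step the parameters are
    returned unchanged, so one bad gradient cannot poison the weights.
    """
    if has_overflow(grads):
        return list(params), True
    return [p - lr * g for p, g in zip(params, grads)], False


# A step with an overflowed element is skipped; a clean step is applied.
skipped_params, skipped = guarded_step([1.0, 2.0], [0.5, float("inf")])
updated_params, skipped_clean = guarded_step([1.0, 2.0], [0.5, 1.0])
```

The check runs before any optimizer state is touched, which is why skipping leaves both the weights and the moment estimates intact.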

Change

Single line: check_grad_overflow: bool = False -> check_grad_overflow: bool = True in DeepSpeedBF16Config. Updated docstring + bf16 example block accordingly.

Backward compatibility

Users who have benchmarked the check as too expensive AND have separately confirmed their bf16 path cannot overflow can opt out by setting:

```json
"bf16": {
  "enabled": true,
  "check_grad_overflow": false
}
```
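Setting the option explicitly, rather than relying on the default, keeps behavior the same across DeepSpeed releases on either side of this change. The sketch below builds such a config in Python; `build_ds_config` is a hypothetical helper, and the batch size and ZeRO stage are placeholder values chosen to match the scenario in this PR.

```python
import json


def build_ds_config(check_grad_overflow=True):
    """Build a small DeepSpeed-style config dict with the bf16 check pinned.

    Only the "bf16" block reflects this PR; the other values are placeholders.
    """
    return {
        "train_micro_batch_size_per_gpu": 1,  # placeholder value
        "zero_optimization": {"stage": 2},    # ZeRO-2, as in the scenario above
        "bf16": {
            "enabled": True,
            "check_grad_overflow": check_grad_overflow,
        },
    }


# Serialize to the JSON text that would go into ds_config.json.
config_text = json.dumps(build_ds_config(), indent=2)
```

Passing `check_grad_overflow=False` produces the opt-out form shown above.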

The runtime cost is one isfinite-style scan over the gradient partition per optimizer step (already implemented in DeepSpeedZeroOptimizer.check_overflow); typically under 1% of step wallclock.

Test plan

Reproducer (private repo): with default False, run dies at step ~22; with True, training survives via DeepSpeed's existing skip-step path.
Existing CI should pass unchanged; this PR only changes a default value in precision_config.py.

DeepSpeed bf16 has a documented assumption that bf16's wider dynamic range eliminates the need for skip-on-overflow handling (per the BF16 docs at deepspeed.ai/docs/config-json/: "Training with bfloat16 does not require loss scaling").

However, ZeRO-2 (non-offload) + partition-flat gradient accumulation can produce a bf16 element overflow in averaged_gradients[i] under specific training patterns, including:

- Mixture-of-Transformers (modality-specific transformer branches)
- Heterogeneous per-sample loss masks (validity dropout)
- Long-tail gradient distributions

When this happens, Adam.step in a fused kernel computes inf/sqrt(inf) = NaN, corrupting the partition slice's m, v, and master weights in a single update. The next forward propagates NaN through every layer and the training run is dead with no useful diagnostic.

This behavior is reproducible in DeepSpeed 0.16.9 - 0.17.1 with a small reproducer (ZeRO-2 + bf16 + MoT + 50% action-invalid samples, NaN at step ~22).

PR deepspeedai#6976 introduced the `check_grad_overflow` option and the underlying step-skip logic in stage_1_and_2.step() already handles overflow correctly when the check fires, but the option defaults to False for bf16, leaving these users unprotected.

This commit flips the default to True so bf16 users get the same protection fp16 users already get by default. Users who have benchmarked the check as too expensive AND have separately confirmed their bf16 path cannot overflow can opt out by setting `"bf16": {"check_grad_overflow": false}` in their ds_config.json.

The runtime cost is one L2-norm-style scan over the gradient partition per optimizer step (already implemented in DeepSpeedZeroOptimizer.check_overflow); typically under 1% of step wallclock in our measurements.

Refs:

- Issue deepspeedai#5242 (open): bf16+ZeRO-2 NaN on real training runs
- PR deepspeedai#6976: introduced check_grad_overflow option (default False)

yongzhe-wang requested review from tjruwase and tohtana as code owners May 29, 2026 03:38

yongzhe-wang wants to merge 1 commit into deepspeedai:master from yongzhe-wang:fix/bf16-check-grad-overflow-default-true
