Enable bf16 check_grad_overflow by default (matching fp16) #8035
Open
yongzhe-wang wants to merge 1 commit into
DeepSpeed bf16 has a documented assumption that bf16's wider dynamic range eliminates the need for skip-on-overflow handling (per the BF16 docs at deepspeed.ai/docs/config-json/: "Training with bfloat16 does not require loss scaling"). However, ZeRO-2 (non-offload) + partition-flat gradient accumulation can produce a bf16 element overflow in averaged_gradients[i] under specific training patterns, including:

- Mixture-of-Transformers (modality-specific transformer branches)
- Heterogeneous per-sample loss masks (validity dropout)
- Long-tail gradient distributions

When this happens, Adam.step in a fused kernel computes inf/sqrt(inf) = NaN, corrupting the partition slice's m, v, and master weights in a single update. The next forward propagates NaN through every layer and the training run is dead with no useful diagnostic.

This behavior is reproducible in DeepSpeed 0.16.9 - 0.17.1 with a small reproducer (ZeRO-2 + bf16 + MoT + 50% action-invalid samples, NaN at step ~22).

PR deepspeedai#6976 introduced the `check_grad_overflow` option, and the underlying step-skip logic in stage_1_and_2.step() already handles overflow correctly when the check fires, but the option defaults to False for bf16, leaving these users unprotected. This commit flips the default to True so bf16 users get the same protection fp16 users already get by default.

Users who have benchmarked the check as too expensive AND have separately confirmed their bf16 path cannot overflow can opt out by setting `"bf16": {"check_grad_overflow": false}` in their ds_config.json.

The runtime cost is one L2-norm-style scan over the gradient partition per optimizer step (already implemented in DeepSpeedZeroOptimizer.check_overflow); typically under 1% of step wallclock in our measurements.

Refs:
- Issue deepspeedai#5242 (open): bf16+ZeRO-2 NaN on real training runs
- PR deepspeedai#6976: introduced check_grad_overflow option (default False)
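The failure mode in the commit message can be illustrated with a scalar, textbook Adam update. This is a minimal sketch, not DeepSpeed's fused kernel: `adam_step` and its hyperparameter defaults are hypothetical names chosen for illustration, and plain Python floats stand in for the bf16 gradient and fp32 optimizer state.

```python
import math


def adam_step(param, grad, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar Adam update in textbook form (illustrative only)."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** step)             # bias correction
    v_hat = v / (1.0 - beta2 ** step)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# A finite gradient gives a finite update.
p_ok, m_ok, v_ok = adam_step(1.0, 0.5, 0.0, 0.0, step=1)

# A single overflowed gradient makes both moments inf, and the
# update term inf / sqrt(inf) evaluates to NaN, so the weight is lost.
p_bad, m_bad, v_bad = adam_step(1.0, float("inf"), 0.0, 0.0, step=1)

print(p_ok, m_ok, v_ok)
print(p_bad, m_bad, v_bad)
```

In this sketch the moments stay at `inf` on later steps even when subsequent gradients are finite, so the weight never recovers without external intervention, which is why skipping the step before the update matters.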
Summary
Flip `DeepSpeedBF16Config.check_grad_overflow` default from `False` to `True`, so bf16 users get the same gradient-overflow protection that fp16 users already get by default.

Motivation
The bf16 documentation states bf16 "does not require loss scaling" (deepspeed.ai/docs/config-json/), but this overstates the safety guarantee for the bf16 + ZeRO-2 (non-offload) partition-flat gradient accumulation path. We reproduced a deterministic catastrophic NaN under a small set of training conditions:
Under these conditions, a single bf16 element in `engine.optimizer.averaged_gradients[i]` overflows to `+inf`. The downstream `Adam.step` then computes `inf / sqrt(inf) = NaN` in a fused kernel, which simultaneously corrupts the partition slice's `exp_avg`, `exp_avg_sq`, and fp32 master weights. The next forward pass propagates NaN through every layer; the training run is dead with no useful diagnostic. Reproduced consistently in DeepSpeed 0.16.9 - 0.17.1 at step ~22 with our internal repro.

The infrastructure to detect and skip such steps was correctly added by #6976 (`check_grad_overflow` option, `DeepSpeedZeroOptimizer.check_overflow` method, and step-skip logic at `stage_1_and_2.step()` lines ~2128-2143). However, the default was set to `False` for bf16, so users hitting this condition do not receive the protection.

Change
Single line: `check_grad_overflow: bool = False` -> `check_grad_overflow: bool = True` in `DeepSpeedBF16Config`. Updated docstring + bf16 example block accordingly.

Backward compatibility
Users who have benchmarked the check as too expensive AND have separately confirmed their bf16 path cannot overflow can opt out by setting:
```json
"bf16": {
"enabled": true,
"check_grad_overflow": false
}
```
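For setups that build the DeepSpeed config as a Python dict instead of a JSON file, the same opt-out might look like the sketch below. Only the `bf16` block comes from this PR; the batch size and ZeRO stage are placeholder values assumed for illustration.

```python
import json

# Hypothetical minimal config; only the "bf16" block mirrors the PR text.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # placeholder value
    "zero_optimization": {"stage": 2},    # the ZeRO-2 path discussed here
    "bf16": {
        "enabled": True,
        # Explicit opt-out of the overflow check this PR enables by default.
        "check_grad_overflow": False,
    },
}

# Serialize to the JSON form that would go in ds_config.json.
ds_config_json = json.dumps(ds_config, indent=2)
print(ds_config_json)
```

Setting the key explicitly, rather than relying on the default, keeps the behavior stable across DeepSpeed versions on either side of this change.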
The runtime cost is one isfinite-style scan over the gradient partition per optimizer step (already implemented in `DeepSpeedZeroOptimizer.check_overflow`); typically under 1% of step wallclock.

Related
`check_grad_overflow` option and underlying skip logic

Test plan
`False`, run dies at step ~22; with `True`, training survives via DeepSpeed's existing skip-step path.
`precision_config.py`.
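The skip-step behavior the test plan relies on can be sketched in a few lines. This is an illustration of the general technique only, assuming a flat list of float gradients and a plain SGD update; `has_overflow` and `guarded_step` are hypothetical helpers and do not reproduce the logic in `DeepSpeedZeroOptimizer.check_overflow` or `stage_1_and_2.step()`.

```python
import math


def has_overflow(grads):
    """Return True if any gradient element is inf or NaN."""
    return any(not math.isfinite(g) for g in grads)


def guarded_step(params, grads, lr=0.5):
    """Apply a plain SGD update unless the gradients overflowed.

    Returns (new_params, skipped). On overflow the parameters are
    returned unchanged, so one bad step cannot poison the weights.
    """
    if has_overflow(grads):
        return list(params), True
    return [p - lr * g for p, g in zip(params, grads)], False


# Normal step: parameters move.
updated, skipped_ok = guarded_step([1.0, 2.0], [0.5, 0.5])

# Overflowed step: parameters are left untouched and the step is skipped.
kept, skipped_bad = guarded_step([1.0, 2.0], [0.5, float("inf")])

print(updated, skipped_ok)
print(kept, skipped_bad)
```

The check is a single pass over the gradients before any optimizer state is touched, which matches the PR's description of the cost as one scan per optimizer step.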