From 77da67e68446a0d7c28e456d3f8bacedb7a75ed8 Mon Sep 17 00:00:00 2001
From: Yongzhe Wang <yzwang2020@gmail.com>
Date: Fri, 29 May 2026 11:37:52 +0800
Subject: [PATCH] Enable bf16 check_grad_overflow by default (matching fp16)

DeepSpeed's bf16 path assumes that bf16's wider dynamic range removes
the need for skip-on-overflow handling (the BF16 docs at
deepspeed.ai/docs/config-json/ state: "Training with bfloat16 does
not require loss scaling"). However, ZeRO-2 (non-offload) with
partition-flat gradient accumulation can still overflow a bf16
element in averaged_gradients[i] under specific training patterns,
including:

  - Mixture-of-Transformers (modality-specific transformer branches)
  - Heterogeneous per-sample loss masks (validity dropout)
  - Long-tail gradient distributions

When this happens, Adam.step in a fused kernel computes inf/sqrt(inf)
= NaN, corrupting the partition slice's m, v, and master weights in
a single update. The next forward pass propagates NaN through every
layer, and the training run fails with no useful diagnostic.
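
The arithmetic can be illustrated with a minimal scalar sketch. This
is plain Python, not DeepSpeed's fused kernel; `adam_step` is a
hypothetical helper and bias correction is omitted for brevity:

```python
import math

def adam_step(w, m, v, g, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One scalar Adam update (illustrative; bias correction omitted)."""
    m = b1 * m + (1 - b1) * g        # first moment becomes inf when g is inf
    v = b2 * v + (1 - b2) * g * g    # second moment becomes inf when g is inf
    w = w - lr * m / (math.sqrt(v) + eps)  # inf / inf evaluates to NaN
    return w, m, v

w, m, v = 1.0, 0.0, 0.0
w, m, v = adam_step(w, m, v, g=0.5)        # healthy step
w, m, v = adam_step(w, m, v, g=math.inf)   # one overflowed gradient element
w, m, v = adam_step(w, m, v, g=0.5)        # finite gradient again
print(w, m, v)  # the state does not recover from the single bad step
```

Once m and v hold inf, later finite gradients cannot bring them back,
which is why a single unskipped step is enough to end the run.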

The failure reproduces in DeepSpeed 0.16.9 through 0.17.1 with a
small test case (ZeRO-2 + bf16 + MoT + 50% action-invalid samples,
NaN at step ~22). PR #6976 introduced the `check_grad_overflow`
option, and the underlying step-skip logic in stage_1_and_2.step()
already handles overflow correctly when the check fires. The option
defaults to False for bf16, however, leaving these users unprotected.

This commit flips the default to True so bf16 users get the same
protection fp16 users already get by default. Users who have
benchmarked the check as too expensive AND have separately confirmed
their bf16 path cannot overflow can opt out by setting
`"bf16": {"check_grad_overflow": false}` in their ds_config.json.

The runtime cost is one L2-norm-style scan over the gradient
partition per optimizer step (already implemented in
DeepSpeedZeroOptimizer.check_overflow), typically under 1% of step
wall-clock time in our measurements.
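
As a rough sketch of the check-then-skip pattern (not the DeepSpeed
implementation): `has_inf_or_nan` and `guarded_step` are hypothetical
names, and a plain SGD update stands in for Adam to keep it short:

```python
import math

def has_inf_or_nan(grads):
    """Single pass over the gradients; True if any element is inf or NaN."""
    return any(not math.isfinite(g) for g in grads)

def guarded_step(weights, grads, lr=0.1):
    """Apply a plain SGD update unless the gradients overflowed."""
    if has_inf_or_nan(grads):
        return weights, True   # skip the step, leave weights untouched
    return [w - lr * g for w, g in zip(weights, grads)], False

weights = [1.0, 2.0]
weights, skipped_ok = guarded_step(weights, [0.5, -0.5])       # applied
weights, skipped_bad = guarded_step(weights, [math.inf, 0.1])  # skipped
print(weights, skipped_ok, skipped_bad)
```

The scan touches each gradient element once, so its cost grows
linearly with the partition size.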

Refs:
  - Issue #5242 (open): bf16+ZeRO-2 NaN on real training runs
  - PR #6976: introduced check_grad_overflow option (default False)
---
 deepspeed/runtime/precision_config.py | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/deepspeed/runtime/precision_config.py b/deepspeed/runtime/precision_config.py
index efec5c9d00c8..c2bb16b04e6d 100644
--- a/deepspeed/runtime/precision_config.py
+++ b/deepspeed/runtime/precision_config.py
@@ -24,7 +24,7 @@
 "bf16": {
   "enabled": true,
   "immediate_grad_update": false,
-  "check_grad_overflow": false
+  "check_grad_overflow": true
 }
 '''
 BFLOAT16 = "bf16"
@@ -53,9 +53,20 @@ class DeepSpeedBF16Config(DeepSpeedConfigModel):
     Apply gradient updates immediately rather than delayed.
     """
 
-    check_grad_overflow: bool = False
-    """
-    Check for gradient overflows and underflows
+    check_grad_overflow: bool = True
+    """
+    Detect gradient overflow/underflow before the optimizer step and skip the step
+    when detected. Defaults to True (matching fp16) because bf16 partition-flat
+    gradient accumulation in ZeRO-2 with heterogeneous per-sample loss masks (e.g.
+    Mixture-of-Transformers + per-sample validity dropout) can produce a bf16 element
+    that overflows to +inf in averaged_gradients[i]. Without this check, Adam.step
+    computes inf/sqrt(inf)=NaN inside a fused kernel, simultaneously corrupting
+    thousands of parameter tensors and ending the training run with no useful
+    diagnostic. Set False only if you have measured this check to be too expensive
+    and have separately confirmed your bf16 path cannot overflow.
+    See:
+      - github.com/deepspeedai/DeepSpeed/issues/5242
+      - github.com/deepspeedai/DeepSpeed/pull/6976 (introduced the option)
     """
 
     bf16_master_weights_and_grads: bool = False