
[BUG] crashed training gpt model using 3d parallelization #7922

@yuenxq

Description


Pretraining a GPT model with 3D parallelism (4 pipeline stages, model parallelism 4, and ZeRO stage 1) always crashes for me. The crash message follows:

[rank54]: File "/.../miniconda3/lib/python3.13/site-packages/deepspeed/runtime/engine.py", line 2392, in _backward_post_hook

[rank54]: raise RuntimeError(error_msg)

[rank54]: RuntimeError: Loss scaling is required for this configuration, but backward() was called directly without scaling the loss. Please use one of the following: 1. engine.backward(loss) 2. engine.scale(loss).backward()

[rank88]:[W325 16:06:00.516742289 ProcessGroupNCCL.cpp:3955] Warning: [PG ID 0 PG GUID 0(default_pg) Rank 88] An unbatched P2P op (send/recv) was called on this ProcessGroup with size 128. In eager initialization mode, unbatched P2P ops are treated as independent collective ops, and are thus serialized with all other ops on this ProcessGroup, including other P2P ops. To avoid serialization, either create additional independent ProcessGroups for the P2P ops or use batched P2P ops. You can squash this warning by setting the environment variable TORCH_NCCL_SHOW_EAGER_INIT_P2P_SERIALIZATION_WARNING to false. (function operator())
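The RuntimeError indicates the engine was configured to require loss scaling (typically fp16 mixed precision with dynamic loss scaling), but `backward()` was called directly on the loss tensor instead of going through the engine. Since the reporter's `ds_config` is not shown, the following is only an assumed sketch of the kind of fp16 section that makes DeepSpeed enforce this requirement:

```json
{
  "train_batch_size": 128,
  "zero_optimization": {
    "stage": 1
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 16
  }
}
```

With a config like this, the training loop must call `engine.backward(loss)` (or `engine.scale(loss).backward()`), as the traceback suggests, rather than `loss.backward()`. Note that with pipeline parallelism the loop normally drives training through `engine.train_batch()`, which performs the scaled backward pass internally, so hitting this error may mean a raw `loss.backward()` is being invoked somewhere outside that path.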

To Reproduce
Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior
A clear and concise description of what you expected to happen.

ds_report output
Please run ds_report to give us details about your setup.

Screenshots
If applicable, add screenshots to help explain your problem.

System info (please complete the following information):

  • OS: [e.g. Ubuntu 18.04]
  • GPU count and types [e.g. two machines with x8 A100s each]
  • Interconnects (if applicable) [e.g., two machines connected with 100 Gbps IB]
  • Python version
  • Any other relevant info about your setup

Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else?

Docker context
Are you using a specific docker image that you can share?

Additional context
Add any other context about the problem here.


    Labels

    bug, training
