
[BUG] crashed training gpt model using 3d parallelization #7922

@yuenxq

Description


Pretraining a GPT model with 3D parallelism (4 pipeline stages, model parallelism 4, and ZeRO stage 1) always crashes for me. The crash message follows:

[rank54]: File "/.../miniconda3/lib/python3.13/site-packages/deepspeed/runtime/engine.py", line 2392, in _backward_post_hook

[rank54]: raise RuntimeError(error_msg)

[rank54]: RuntimeError: Loss scaling is required for this configuration, but backward() was called directly without scaling the loss. Please use one of the following: 1. engine.backward(loss) 2. engine.scale(loss).backward()

[rank88]:[W325 16:06:00.516742289 ProcessGroupNCCL.cpp:3955] Warning: [PG ID 0 PG GUID 0(default_pg) Rank 88] An unbatched P2P op (send/recv) was called on this ProcessGroup with size 128. In eager initialization mode, unbatched P2P ops are treated as independent collective ops, and are thus serialized with all other ops on this ProcessGroup, including other P2P ops. To avoid serialization, either create additional independent ProcessGroups for the P2P ops or use batched P2P ops. You can squash this warning by setting the environment variable TORCH_NCCL_SHOW_EAGER_INIT_P2P_SERIALIZATION_WARNING to false. (function operator())
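The RuntimeError indicates the engine was configured to require loss scaling (typically fp16 mixed precision with dynamic loss scaling), but `backward()` was called directly on the loss tensor instead of going through the engine. Since the reporter's `ds_config` is not shown, the following is only an assumed sketch of the kind of fp16 section that makes DeepSpeed enforce this requirement:

```json
{
  "train_batch_size": 128,
  "zero_optimization": {
    "stage": 1
  },
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 16
  }
}
```

With a config like this, the training loop must call `engine.backward(loss)` (or `engine.scale(loss).backward()`), as the traceback suggests, rather than `loss.backward()`. Note that with pipeline parallelism the loop normally drives training through `engine.train_batch()`, which performs the scaled backward pass internally, so hitting this error may mean a raw `loss.backward()` is being invoked somewhere outside that path.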

To Reproduce
Steps to reproduce the behavior:

  1. Go to '...'
  2. Click on '....'
  3. Scroll down to '....'
  4. See error

Expected behavior
A clear and concise description of what you expected to happen.

ds_report output
Please run ds_report to give us details about your setup.

Screenshots
If applicable, add screenshots to help explain your problem.

System info (please complete the following information):

  • OS: [e.g. Ubuntu 18.04]
  • GPU count and types [e.g. two machines with x8 A100s each]
  • Interconnects (if applicable) [e.g., two machines connected with 100 Gbps IB]
  • Python version
  • Any other relevant info about your setup

Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else?

Docker context
Are you using a specific docker image that you can share?

Additional context
Add any other context about the problem here.


    Labels

    bug, training
