This repository was archived by the owner on Feb 6, 2025. It is now read-only.

Failure to initiate communication #22

@YuMJie

Description

Communication initialization fails when the number of nodes is set to 3.

The failure occurs in get_group.

The relevant part of the failure output follows (run on 3 nodes, each with 2 GPUs):

Traceback (most recent call last):
  File "pretrain_gpt.py", line 149, in <module>
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step_func,
  File "/workspace/aceso/runtime/megatron/training.py", line 113, in pretrain
    initialize_megatron(extra_args_provider=extra_args_provider,
  File "/workspace/aceso/runtime/megatron/initialize.py", line 86, in initialize_megatron
    finish_mpu_init()
  File "/workspace/aceso/runtime/megatron/initialize.py", line 67, in finish_mpu_init
    _initialize_distributed()
  File "/workspace/aceso/runtime/megatron/initialize.py", line 209, in _initialize_distributed
    mpu.initialize_model_parallel_flexpipe()
  File "/workspace/aceso/runtime/megatron/mpu/initialize.py", line 288, in initialize_model_parallel_flexpipe
    get_group(ranks)    
  File "/workspace/aceso/runtime/megatron/mpu/initialize.py", line 560, in get_group
    group_bits = bitmap(ranks)
  File "/workspace/aceso/runtime/megatron/mpu/initialize.py", line 555, in bitmap
    raise ValueError("rank {} out of range ({})".format(rank, len(bits)))
ValueError: rank 6 out of range (6)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2618839) of binary: /usr/bin/python3
Traceback (most recent call last):
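
To make the error message easier to read, here is a minimal sketch of a range check that would produce the same message. It is not the actual `bitmap` from `megatron/mpu/initialize.py`, whose source is not shown in this issue. The `world_size` parameter and the function body are assumptions reconstructed only from the `ValueError` text. With 3 nodes and 2 GPUs each there are 6 processes, so valid ranks are 0 through 5, and a rank list containing 6 cannot fit in a 6-entry bitmap.

```python
def bitmap(ranks, world_size):
    """Hypothetical stand-in for the bitmap() helper seen in the traceback.

    Returns a 0/1 list of length world_size with a 1 at every rank in the group.
    The signature is an assumption; only the error message comes from the log.
    """
    bits = [0] * world_size
    for rank in ranks:
        # Valid ranks are 0 .. world_size - 1.
        if not 0 <= rank < len(bits):
            raise ValueError("rank {} out of range ({})".format(rank, len(bits)))
        bits[rank] = 1
    return bits


# 3 nodes x 2 GPUs per node, as in the report.
world_size = 3 * 2

# A group made of in-range ranks is accepted.
print(bitmap([0, 1], world_size))

# A group that references rank 6 reproduces the reported message.
try:
    bitmap([6, 7], world_size)
except ValueError as err:
    print(err)
```

If the real helper behaves like this sketch, the message suggests that the rank lists passed to `get_group` were generated for more ranks than the 6 processes actually launched. That is one possible reading of the log, not a confirmed diagnosis.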
