
Fork safety #7918

@Flamefire

Description


It looks like deepspeed is not fork-safe:

  • During import it checks builder.is_compatible() for all ops
  • Some ops use torch.cuda.get_device_properties
  • This initializes the CUDA context

See

compatible_ops[op_name] = op_compatible

When the process then forks to run in parallel, any access to CUDA through PyTorch will fail:

RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
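The mechanism can be illustrated with a CUDA-free stand-in (all names below are hypothetical; the real check lives inside the CUDA runtime): a "context" initialized in the parent is inherited by the fork()ed child, where any further use fails.

```python
import multiprocessing as mp
import os

_context_pid = None  # toy stand-in for "which process owns the CUDA context"

def get_device_properties():
    """Stand-in for torch.cuda.get_device_properties: the first call
    initializes the context; a fork()ed child then sees a foreign context."""
    global _context_pid
    if _context_pid is None:
        _context_pid = os.getpid()
    elif _context_pid != os.getpid():
        raise RuntimeError("Cannot re-initialize CUDA in forked subprocess")
    return {"major": 8}

def _child(queue):
    try:
        get_device_properties()
        queue.put("ok")
    except RuntimeError as exc:
        queue.put(f"error: {exc}")

def demo():
    get_device_properties()       # the import-time compatibility check (parent)
    ctx = mp.get_context("fork")  # pytest --forked also forks the test process
    queue = ctx.Queue()
    proc = ctx.Process(target=_child, args=(queue,))
    proc.start()
    message = queue.get()
    proc.join()
    return message
```

Here demo() returns an error mirroring the RuntimeError above, because the child inherits the parent's context. With the "spawn" start method the child would start with a fresh, uninitialized context instead and succeed.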

I'm not sure why this isn't fully consistent, but I can reproduce it with:
pytest --forked tests/unit/ops/deepspeed4science/test_DS4Sci_EvoformerAttention.py -k 'test_DS4Sci_EvoformerAttention[tensor_shape1-dtype1]' -s

It fails when invoking skip_on_arch(8 if dtype == torch.bfloat16 else 7), which calls torch.cuda.get_device_properties, now in the forked subprocess.

This is an issue in general: fork-based multiprocessing cannot be used after importing deepspeed.
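One conceivable way out (a hypothetical sketch, not deepspeed's actual API) is to defer the device query until first use, so that merely importing the module never touches CUDA and later fork()s stay safe:

```python
import functools

def probe_compatibility():
    # Stand-in for builder.is_compatible(), which in deepspeed ends up calling
    # torch.cuda.get_device_properties() and thereby creates the CUDA context.
    return True

@functools.lru_cache(maxsize=None)
def compatible_ops():
    """Computed on first call rather than at import time, so fork() remains
    usable as long as it happens before the first op lookup."""
    return {"EvoformerAttention": probe_compatibility()}  # hypothetical op table
```

The cache ensures the probe runs at most once per process; a spawn-started worker would simply recompute it on its own first call.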

But it also contradicts the documentation:

Note that pytest-forked and the --forked flag are required to test CUDA functionality in distributed tests.

It seems the opposite is true: the flag must not be used.

Or am I missing something?
