It looks like DeepSpeed is not fork-safe:

- During import it checks `builder.is_compatible()` for all ops
- Some ops use `torch.cuda.get_device_properties`
- This initializes the CUDA context

See `compatible_ops[op_name] = op_compatible` in DeepSpeed/deepspeed/git_version_info.py (line 30 at 38bd11a).
When the process then forks to run tests in parallel, any access to CUDA through PyTorch will fail:

```
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
```
I'm not sure why this isn't fully consistent, but I can reproduce it with:

```
pytest --forked tests/unit/ops/deepspeed4science/test_DS4Sci_EvoformerAttention.py -k 'test_DS4Sci_EvoformerAttention[tensor_shape1-dtype1]' -s
```

It fails when invoking `skip_on_arch(8 if dtype == torch.bfloat16 else 7)`, which calls `torch.cuda.get_device_properties`, now in the forked subprocess.
This is an issue in general: fork-based multiprocessing cannot be used after importing deepspeed.
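The failure mode, and the `'spawn'` workaround that the error message points to, can be sketched without CUDA at all. Here `init_context` is an illustrative stand-in for the one-time native context that `torch.cuda.get_device_properties` creates, not a DeepSpeed or PyTorch API:

```python
import multiprocessing as mp
import os

# Simulated one-time native context: it remembers which process created it
# and refuses reuse from a fork()ed child, mimicking PyTorch's CUDA guard.
_owner_pid = None

def init_context():
    global _owner_pid
    if _owner_pid is None:
        _owner_pid = os.getpid()
    elif _owner_pid != os.getpid():
        raise RuntimeError("Cannot re-initialize in forked subprocess")

def worker(q):
    try:
        init_context()
        q.put("ok")
    except RuntimeError as exc:
        q.put(f"error: {exc}")

results = {}
if __name__ == "__main__":
    init_context()  # the "import deepspeed" step: context created in the parent
    for method in ("fork", "spawn"):
        ctx = mp.get_context(method)
        q = ctx.Queue()
        p = ctx.Process(target=worker, args=(q,))
        p.start()
        result = q.get()
        p.join()
        results[method] = result
        print(method, "->", result)
```

The forked child inherits the parent's already-initialized context and fails; the spawned child starts from a fresh interpreter and succeeds. That matches what the PyTorch error message recommends, but pytest-forked uses `fork`, which is exactly the unsafe case.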
But it also contradicts the documentation:

> Note that pytest-forked and the --forked flag are required to test CUDA functionality in distributed tests.

It seems the opposite is true: the flag must not be used.
Or am I missing anything?