Commit 3bdebc0
Fix/fix autotp universal checkpoint ci (#7937)
The full CI run
[fails](https://github.com/deepspeedai/DeepSpeed/actions/runs/23735417401/job/69138729446)
with "RuntimeError: Cannot re-initialize CUDA", raised by the tests for
universal checkpoint and AutoTP.
The failure occurs because those tests call `torch.cuda.current_device()`
under `pytest --forked`. Since the tests only touch universal checkpoint
metadata, the call is unnecessary. This PR skips constructor-time AutoTP
materialization when `mp_group` is `None`.
Partitioning still happens in real AutoTP usage, where an actual
model-parallel group is given.
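
The guard described above can be sketched roughly as follows. This is an illustration only, not the code from this commit: `ShardedLinearSketch`, `_materialize`, the injected `current_device` callable, and the `(rank, world_size)` tuple standing in for a process group are all hypothetical names chosen so the example runs without a GPU.

```python
from typing import Callable, Optional, Sequence, Tuple


class ShardedLinearSketch:
    """Hypothetical stand-in for an AutoTP layer (not DeepSpeed's class)."""

    def __init__(
        self,
        weight: Sequence[int],
        mp_group: Optional[Tuple[int, int]] = None,  # assumed (rank, world_size)
        current_device: Callable[[], int] = lambda: 0,  # stands in for a device query
    ):
        self.weight = list(weight)
        self.mp_group = mp_group
        self.device: Optional[int] = None
        self._current_device = current_device
        # Guard: with no model-parallel group, keep only the metadata and
        # skip materialization, so the device is never queried.
        if mp_group is not None:
            self._materialize()

    def _materialize(self) -> None:
        # Only reached when a real group is supplied.
        self.device = self._current_device()
        rank, world_size = self.mp_group
        shard = len(self.weight) // world_size
        self.weight = self.weight[rank * shard:(rank + 1) * shard]


device_calls = []


def fake_current_device() -> int:
    # Records each call so the skipped path can be observed.
    device_calls.append(1)
    return 0


meta_only = ShardedLinearSketch(range(8), mp_group=None,
                                current_device=fake_current_device)
partitioned = ShardedLinearSketch(range(8), mp_group=(1, 2),
                                  current_device=fake_current_device)
```

In this sketch the metadata-only instance never triggers the device query, while the instance built with a group still partitions its weight, mirroring the behavior the message describes.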
---------
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
1 parent 89bf0d2 commit 3bdebc0
1 file changed
Lines changed: 13 additions & 4 deletions
| Hunk | Original file lines | Diff lines | Change |
|---|---|---|---|
| 1 | 369-374 | 369-380 | 6 lines added (new 372-377) |
| 2 | 579-585 | 585-592 | line 582 replaced by new lines 588-589 |
| 3 | 674-680 | 681-687 | line 677 replaced by new line 684 |
| 4 | 1234-1240 | 1241-1248 | line 1237 replaced by new lines 1244-1245 |
| 5 | 1352-1358 | 1360-1367 | line 1355 replaced by new lines 1363-1364 |