Commit 2f0924a
Fix process hang in process-group shutdown (#7941)
Removing the file used as the file-store while the process-group is
still active is invalid as it is still in use.
If `reuse_dist_env` is `True`, the process group is still active and the
processes will try to read from that file, waiting for it to exist. During
shutdown (`destroy_process_group`) they wait for all threads to
join, but (at least) one is still waiting for that file. This causes
the process to hang until a PyTorch-internal timeout is reached, which
is currently ~5 minutes.
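The hang described above can be reproduced in miniature with a toy model. This is only a sketch using the standard library, not PyTorch's implementation: `wait_for_file` and `shutdown_duration` are hypothetical names, and the 0.3 s timeout stands in for the much longer internal timeout.

```python
import os
import tempfile
import threading
import time


def wait_for_file(path: str, timeout: float, poll: float = 0.01) -> bool:
    """Poll until `path` exists or `timeout` seconds have passed (toy waiter)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(poll)
    return False


def shutdown_duration(path: str, timeout: float) -> float:
    """Start a waiter thread, then measure how long joining it takes."""
    waiter = threading.Thread(target=wait_for_file, args=(path, timeout))
    waiter.start()
    start = time.monotonic()
    waiter.join()  # stand-in for the thread join performed during shutdown
    return time.monotonic() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        # The store file was removed and is never recreated,
        # so the join blocks until the waiter's timeout expires.
        removed = os.path.join(tmpdir, "store")
        print(f"join blocked for {shutdown_duration(removed, 0.3):.2f}s")
```

If the file exists, the waiter returns almost immediately and the join does not block noticeably.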
The solution is to create a unique file. I chose to put it in `tmpdir`
and add a suffix to differentiate it.
Note that `tmpdir` alone is not enough: this method is already called once
through the fixture setup, so the directory is not clean when the method is
called again later in the test execution.
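A minimal sketch of the unique-file idea, assuming a random suffix is an acceptable way to differentiate the files. The helper names (`unique_init_file`, `init_method_for`) and the `dist_file_store` prefix are hypothetical and are not taken from this commit; the only PyTorch detail relied on is that file-based rendezvous uses a `file://<path>` URL.

```python
import os
import tempfile
import uuid


def unique_init_file(tmpdir: str, prefix: str = "dist_file_store") -> str:
    """Return a path inside `tmpdir` that no earlier call has returned (hypothetical helper)."""
    # A random suffix gives every call its own store file, so a process group
    # that is still alive never waits on a path that a later call removes.
    return os.path.join(tmpdir, f"{prefix}_{uuid.uuid4().hex}")


def init_method_for(path: str) -> str:
    """Build a file-based rendezvous URL for the given path."""
    return f"file://{path}"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        first = unique_init_file(tmpdir)
        second = unique_init_file(tmpdir)
        print(init_method_for(first))
        print(first != second)
```

Because the suffix differs on every call, it does not matter whether `tmpdir` already contains a file from the fixture setup.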
CC @mrwyattii, author of #3850, which added this code
---------
Signed-off-by: Alexander Grund <alexander.grund@tu-dresden.de>
1 parent 3bdebc0 commit 2f0924a
1 file changed
Lines changed: 5 additions & 9 deletions
| Hunk | Original file lines | New file lines | Change |
|---|---|---|---|
| 1 | 12 to 17 | 12 to 18 | 1 line added (new line 15) |
| 2 | 336 to 353 | 337 to 349 | original lines 339 to 341 and 346 to 350 removed; new lines 344 to 346 added |
| 3 | 357 to 363 | 353 to 359 | original line 360 replaced by new line 356 |