[CI] Add retry if timeout#570
Conversation
There was a problem hiding this comment.
Good approach — retrying on Kit's intermittent startup deadlock is the right call for CI reliability. The refactoring into _run_once is clean and the retry-only-on-timeout logic is correct (non-zero exits still fail fast). A couple of things to consider:
Orphan cleanup between retries: The docstring says "orphaned Kit processes are cleaned up" but I don't see any cleanup code. When subprocess.run hits a timeout, it calls process.kill() which sends SIGKILL to the child PID only — not the whole process group. Since start_new_session=True puts the child in its own group, the grandchildren (Kit shader compiler, GPU workers, etc.) survive as orphans. Before retrying, you'd want to kill the entire process group, e.g.:
except subprocess.TimeoutExpired as exc:
# Kill the entire process group, not just the leader
try:
os.killpg(os.getpgid(exc.cmd_process.pid), signal.SIGKILL)
except (ProcessLookupError, OSError):
passSince subprocess.run doesn't expose the Popen object after timeout, you may need to switch to manual Popen management in _run_once to get access to the PID for group cleanup. Without this, retrying with orphaned GPU-holding processes still running could cause the retry to fail for a different reason.
Minor: The docstring could be updated to remove the "orphan cleanup" mention until that's actually implemented, to keep docs honest.
| times (env ``ISAACLAB_ARENA_SUBPROCESS_MAX_RETRIES``, default 1). | ||
|
|
||
| Args: | ||
| cmd: Command to run (list of strings). |
There was a problem hiding this comment.
Orphan process leak on timeout: subprocess.run calls process.kill() on timeout, which sends SIGKILL to the child PID only. With start_new_session=True, the child's process group (Kit shader compiler, GPU workers, etc.) won't be killed — they'll survive as orphans holding GPU resources.
Consider switching to manual Popen so you can kill the whole process group on timeout:
def _run_once(cmd, env, timeout_sec, capture_output):
kwargs = dict(env=env, start_new_session=True)
if capture_output:
kwargs.update(stdout=subprocess.PIPE, stderr=subprocess.PIPE)
proc = subprocess.Popen(cmd, **kwargs)
try:
stdout, stderr = proc.communicate(timeout=timeout_sec)
except subprocess.TimeoutExpired:
# Kill the entire process group, not just the leader
try:
os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
except (ProcessLookupError, OSError):
pass
proc.wait()
raise
if capture_output:
return subprocess.CompletedProcess(cmd, proc.returncode, stdout, stderr)
return subprocess.CompletedProcess(cmd, proc.returncode)This is especially important since the whole point of the retry is to get a clean slate — retrying with orphaned GPU processes still running might not help.
| ) | ||
| return result | ||
|
|
||
|
|
There was a problem hiding this comment.
nit: The docstring says "orphaned Kit processes are cleaned up" but I don't see cleanup code in the implementation. Consider removing that claim until it's implemented, or adding the process group kill between retries.
## Summary Subprocess-spawning tests hang indefinitely on CI. ## Causes & Fixes ### Problems From Lab: 1. Lab reports "AppLauncher doesnt quit properly after app.close(), app.quit() doesn't help either." 2. Cold startup times for tests using IS can be upwards of 10 min on Lab CI machines. Above issues apply to us, because tests hang during sub-process tests section, between the end of last test and the beginning of the next test. See detailed logs and analysis from reproducing locally [here](#568) ### Fixes 1. `SimulationApp` Force Exit: Skips `app.close()` (which can hang indefinitely in Kit's shutdown path) when the env var `ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE=1` is set. Calls a new `_kill_child_processes()` helper that walks `/proc` to `SIGKILL` all direct children before doing `os._exit(0)`, preventing orphaned Kit processes from holding GPU resources. 2. `run_subprocess` has a configuarable wall-clock timeouts and process isolation, such that when needed, it could trigger the force exit path above. 3. Add wall-clock timing and logging inside the SimulationApp start method. Keep track of how much startup time is taking on CI. ## Minor fixes 1. Add timing stats into pytest cmds such that it reports the slowests test func at the end of each section. 2. Parametrize multi-config tests: Convert nested for-loops in `test_zero_action_policy_kitchen_pick_and_place` (6 configs) and `test_zero_action_policy_gr1_open_microwave` (3 configs) into `@pytest.mark.parametrize.` Each config gets its own timeout, pass/fail, and timing. 3. Reduce num_envs in gr00t eval_runner test to speed up. ### Local validation With the repro script #568, I do not have local stalling. Log for more details. [repro_20260410_041313.log](https://github.com/user-attachments/files/26620524/repro_20260410_041313.log) ### CI Before -- timeout <img width="1219" height="170" alt="image" src="https://github.com/user-attachments/assets/2f9eabb2-403d-4257-bd84-4da508de7d00" /> ### CI After <img width="1219" height="170" alt="image" src="https://github.com/user-attachments/assets/dbaf2a7d-e3a4-4ad2-85a4-389eae962c1d" /> <img width="1198" height="472" alt="image" src="https://github.com/user-attachments/assets/8a24f1aa-4bcb-4030-b075-09f3885673c2" /> ## TODOs - test_camera_observations takes 10mins to start the app due to Kit cold start. Experimenting with a warm start before tests process here #565 - Kit itself intermittently deadlocks during startup — not because of orphans, but because Kit's internal thread synchronization fails on low-CPU runners. Experimenting with retry here #570
Summary
Kit itself intermittently deadlocks during startup — not because of orphans, but because Kit's internal thread synchronization fails on low-CPU runners. No orphans present when it happens.
Detailed description
when Kit deadlocks, the 900s timeout kills it, and the retry launches a fresh Kit process which almost certainly succeeds (since the deadlock is rare and random).
Without the retry, a single Kit deadlock fails the whole CI run. With the retry, it costs ~15 extra minutes but the test passes.