[CI] Add retry if timeout #570

Draft

xyao-nv wants to merge 1 commit into xyao/exp/ci_stall from xyao/exp/ci_stall_with_retry

Conversation

Collaborator

@xyao-nv xyao-nv commented Apr 10, 2026

Summary

Kit itself intermittently deadlocks during startup: not because of orphans, but because Kit's internal thread synchronization fails on low-CPU runners. No orphans are present when it happens.

Detailed description

  • When Kit deadlocks, the 900s timeout kills it, and the retry launches a fresh Kit process, which almost certainly succeeds (the deadlock is rare and random).

  • Without the retry, a single Kit deadlock fails the whole CI run. With the retry, it costs ~15 extra minutes but the test passes. A sketch of the retry logic follows below.
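
A minimal sketch of that retry logic, assuming a `_run_once` helper that raises `subprocess.TimeoutExpired` on timeout; the wrapper's signature here is hypothetical:

import os
import subprocess

def run_subprocess(cmd, env=None, timeout_sec=900, capture_output=False):
    max_retries = int(os.environ.get("ISAACLAB_ARENA_SUBPROCESS_MAX_RETRIES", "1"))
    for attempt in range(max_retries + 1):
        try:
            # Non-zero exit codes come back as a CompletedProcess and
            # fail fast in the caller; only timeouts trigger a retry.
            return _run_once(cmd, env, timeout_sec, capture_output)
        except subprocess.TimeoutExpired:
            if attempt == max_retries:
                raise  # retries exhausted: surface the timeout
            # Likely a Kit startup deadlock: retry with a fresh process.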

@xyao-nv xyao-nv changed the title Add retry if timeout [CI] Add retry if timeout Apr 10, 2026

@isaaclab-review-bot isaaclab-review-bot Bot left a comment


Good approach — retrying on Kit's intermittent startup deadlock is the right call for CI reliability. The refactoring into _run_once is clean and the retry-only-on-timeout logic is correct (non-zero exits still fail fast). A couple of things to consider:

Orphan cleanup between retries: The docstring says "orphaned Kit processes are cleaned up" but I don't see any cleanup code. When subprocess.run hits a timeout, it calls process.kill(), which sends SIGKILL to the child PID only, not the whole process group. Since start_new_session=True puts the child in its own group, the grandchildren (Kit shader compiler, GPU workers, etc.) survive as orphans. Before retrying, you'd want to kill the entire process group, e.g.:

except subprocess.TimeoutExpired:
    # Kill the entire process group, not just the leader.
    # NOTE: needs a handle on the Popen object (`proc`), which
    # subprocess.run does not expose after a timeout (see below).
    try:
        os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
    except (ProcessLookupError, OSError):
        pass

Since subprocess.run doesn't expose the Popen object after timeout, you may need to switch to manual Popen management in _run_once to get access to the PID for group cleanup. Without this, retrying with orphaned GPU-holding processes still running could cause the retry to fail for a different reason.

Minor: The docstring could be updated to remove the "orphan cleanup" mention until that's actually implemented, to keep docs honest.

times (env ``ISAACLAB_ARENA_SUBPROCESS_MAX_RETRIES``, default 1).

Args:
cmd: Command to run (list of strings).
Contributor

Orphan process leak on timeout: subprocess.run calls process.kill() on timeout, which sends SIGKILL to the child PID only. With start_new_session=True, the child's process group (Kit shader compiler, GPU workers, etc.) won't be killed — they'll survive as orphans holding GPU resources.

Consider switching to manual Popen so you can kill the whole process group on timeout:

import os
import signal
import subprocess

def _run_once(cmd, env, timeout_sec, capture_output):
    kwargs = dict(env=env, start_new_session=True)
    if capture_output:
        kwargs.update(stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    proc = subprocess.Popen(cmd, **kwargs)
    try:
        stdout, stderr = proc.communicate(timeout=timeout_sec)
    except subprocess.TimeoutExpired:
        # Kill the entire process group, not just the leader
        try:
            os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
        except (ProcessLookupError, OSError):
            pass
        proc.wait()  # reap the leader so it does not linger as a zombie
        raise
    if capture_output:
        return subprocess.CompletedProcess(cmd, proc.returncode, stdout, stderr)
    return subprocess.CompletedProcess(cmd, proc.returncode)
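
One note on the design: because start_new_session=True makes the child its own process-group leader, the os.killpg call above should only ever reach Kit and its descendants, never the CI runner's own process group.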

This is especially important since the whole point of the retry is to get a clean slate — retrying with orphaned GPU processes still running might not help.



Contributor

nit: The docstring says "orphaned Kit processes are cleaned up" but I don't see cleanup code in the implementation. Consider removing that claim until it's implemented, or adding the process group kill between retries.

xyao-nv added a commit that referenced this pull request Apr 10, 2026
## Summary
Subprocess-spawning tests hang indefinitely on CI.

## Causes & Fixes

### Problems

From Lab:

1. Lab reports "AppLauncher doesn't quit properly after app.close();
app.quit() doesn't help either."
2. Cold startup times for tests using Isaac Sim can be upwards of 10 min
on Lab CI machines.

The above issues apply to us because tests hang during the sub-process
test section, between the end of one test and the beginning of the next.
See detailed logs and analysis from reproducing locally
[here](#568)

### Fixes

1. `SimulationApp` force exit: skips `app.close()` (which can hang
indefinitely in Kit's shutdown path) when the env var
`ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE=1` is set. It instead calls a new
`_kill_child_processes()` helper that walks `/proc` to `SIGKILL` all
direct children before calling `os._exit(0)`, preventing orphaned Kit
processes from holding GPU resources (see the sketch after this list).

2. `run_subprocess` gets configurable wall-clock timeouts and process
isolation, so that when needed it can trigger the force-exit path above.

3. Add wall-clock timing and logging inside the `SimulationApp` start
method, to track how long startup takes on CI.
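
A minimal sketch of what the `_kill_child_processes()` helper from fix 1
could look like, assuming Linux procfs; the actual implementation may
differ in its details:

import os
import signal

def _kill_child_processes() -> None:
    """SIGKILL every direct child of this process by scanning /proc."""
    my_pid = os.getpid()
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # skip non-process entries like /proc/meminfo
        try:
            with open(f"/proc/{entry}/status") as f:
                for line in f:
                    if line.startswith("PPid:"):
                        if int(line.split()[1]) == my_pid:
                            os.kill(int(entry), signal.SIGKILL)
                        break
        except OSError:
            pass  # process already exited or is not ours to signal

With `ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE=1` set, the test process would
run this and then call `os._exit(0)` instead of `app.close()`.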

## Minor fixes

1. Add timing stats to the pytest commands so each section reports its
slowest test functions at the end.

2. Parametrize multi-config tests: convert the nested for-loops in
`test_zero_action_policy_kitchen_pick_and_place` (6 configs) and
`test_zero_action_policy_gr1_open_microwave` (3 configs) into
`@pytest.mark.parametrize`. Each config gets its own timeout, pass/fail
status, and timing (see the sketch after this list).

3. Reduce `num_envs` in the gr00t `eval_runner` test to speed it up.
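
A hedged sketch of the parametrize conversion from minor fix 2; the
config names below are hypothetical stand-ins for the real 6
configurations:

import pytest

@pytest.mark.parametrize("embodiment", ["gr1", "franka"])
@pytest.mark.parametrize("object_name", ["bowl", "cup", "plate"])
def test_zero_action_policy_kitchen_pick_and_place(embodiment, object_name):
    # Each of the 6 (embodiment, object_name) combinations now runs as
    # its own test case, with its own timeout, pass/fail status, and timing.
    ...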

### Local validation
With the repro script from #568, I no longer see local stalling. See the
attached log for details.

[repro_20260410_041313.log](https://github.com/user-attachments/files/26620524/repro_20260410_041313.log)


### CI Before -- timeout
![image](https://github.com/user-attachments/assets/2f9eabb2-403d-4257-bd84-4da508de7d00)

### CI After
![image](https://github.com/user-attachments/assets/dbaf2a7d-e3a4-4ad2-85a4-389eae962c1d)
![image](https://github.com/user-attachments/assets/8a24f1aa-4bcb-4030-b075-09f3885673c2)

## TODOs
- test_camera_observations takes 10 min to start the app due to Kit cold
start. Experimenting with a warm start before the test process here:
#565
- Kit itself intermittently deadlocks during startup, not because of
orphans, but because Kit's internal thread synchronization fails on
low-CPU runners. Experimenting with retry here:
#570
