[CI] Add retry if timeout #570

Draft

xyao-nv wants to merge 1 commit into xyao/exp/ci_stall from xyao/exp/ci_stall_with_retry

Conversation

Collaborator

@xyao-nv xyao-nv commented Apr 10, 2026

Summary

Kit itself intermittently deadlocks during startup: not because of orphans, but because Kit's internal thread synchronization fails on low-CPU runners. No orphans are present when it happens.

Detailed description

  • When Kit deadlocks, the 900s timeout kills it, and the retry launches a fresh Kit process, which almost certainly succeeds (the deadlock is rare and random).

  • Without the retry, a single Kit deadlock fails the whole CI run. With the retry, it costs ~15 extra minutes but the test passes. A sketch of the retry logic follows below.
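
A minimal sketch of that retry logic, assuming a `_run_once` helper that raises `subprocess.TimeoutExpired` on timeout; the wrapper's signature here is hypothetical:

import os
import subprocess

def run_subprocess(cmd, env=None, timeout_sec=900, capture_output=False):
    max_retries = int(os.environ.get("ISAACLAB_ARENA_SUBPROCESS_MAX_RETRIES", "1"))
    for attempt in range(max_retries + 1):
        try:
            # Non-zero exit codes come back as a CompletedProcess and
            # fail fast in the caller; only timeouts trigger a retry.
            return _run_once(cmd, env, timeout_sec, capture_output)
        except subprocess.TimeoutExpired:
            if attempt == max_retries:
                raise  # retries exhausted: surface the timeout
            # Likely a Kit startup deadlock: retry with a fresh process.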

@xyao-nv xyao-nv changed the title Add retry if timeout [CI] Add retry if timeout Apr 10, 2026

@isaaclab-review-bot isaaclab-review-bot Bot left a comment


Good approach — retrying on Kit's intermittent startup deadlock is the right call for CI reliability. The refactoring into _run_once is clean and the retry-only-on-timeout logic is correct (non-zero exits still fail fast). A couple of things to consider:

Orphan cleanup between retries: The docstring says "orphaned Kit processes are cleaned up" but I don't see any cleanup code. When subprocess.run hits a timeout, it calls process.kill(), which sends SIGKILL to the child PID only, not the whole process group. Since start_new_session=True puts the child in its own group, the grandchildren (Kit shader compiler, GPU workers, etc.) survive as orphans. Before retrying, you'd want to kill the entire process group, e.g.:

except subprocess.TimeoutExpired:
    # Kill the entire process group, not just the leader.
    # NOTE: needs a handle on the Popen object (`proc`), which
    # subprocess.run does not expose after a timeout (see below).
    try:
        os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
    except (ProcessLookupError, OSError):
        pass

Since subprocess.run doesn't expose the Popen object after timeout, you may need to switch to manual Popen management in _run_once to get access to the PID for group cleanup. Without this, retrying with orphaned GPU-holding processes still running could cause the retry to fail for a different reason.

Minor: The docstring could be updated to remove the "orphan cleanup" mention until that's actually implemented, to keep docs honest.

times (env ``ISAACLAB_ARENA_SUBPROCESS_MAX_RETRIES``, default 1).

Args:
cmd: Command to run (list of strings).
Contributor

Orphan process leak on timeout: subprocess.run calls process.kill() on timeout, which sends SIGKILL to the child PID only. With start_new_session=True, the child's process group (Kit shader compiler, GPU workers, etc.) won't be killed — they'll survive as orphans holding GPU resources.

Consider switching to manual Popen so you can kill the whole process group on timeout:

import os
import signal
import subprocess

def _run_once(cmd, env, timeout_sec, capture_output):
    kwargs = dict(env=env, start_new_session=True)
    if capture_output:
        kwargs.update(stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    proc = subprocess.Popen(cmd, **kwargs)
    try:
        stdout, stderr = proc.communicate(timeout=timeout_sec)
    except subprocess.TimeoutExpired:
        # Kill the entire process group, not just the leader
        try:
            os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
        except (ProcessLookupError, OSError):
            pass
        proc.wait()  # reap the leader so it does not linger as a zombie
        raise
    if capture_output:
        return subprocess.CompletedProcess(cmd, proc.returncode, stdout, stderr)
    return subprocess.CompletedProcess(cmd, proc.returncode)
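
One note on the design: because start_new_session=True makes the child its own process-group leader, the os.killpg call above should only ever reach Kit and its descendants, never the CI runner's own process group.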

This is especially important since the whole point of the retry is to get a clean slate — retrying with orphaned GPU processes still running might not help.



Contributor

nit: The docstring says "orphaned Kit processes are cleaned up" but I don't see cleanup code in the implementation. Consider removing that claim until it's implemented, or adding the process group kill between retries.

xyao-nv added a commit that referenced this pull request Apr 10, 2026
## Summary
Subprocess-spawning tests hang indefinitely on CI.

## Causes & Fixes

### Problems

From Lab:

1. Lab reports "AppLauncher doesn't quit properly after app.close();
app.quit() doesn't help either."
2. Cold startup times for tests using Isaac Sim can be upwards of 10 min
on Lab CI machines.

The above issues apply to us because tests hang during the sub-process
test section, between the end of one test and the beginning of the next.
See detailed logs and analysis from reproducing locally
[here](#568)

### Fixes

1. `SimulationApp` force exit: skips `app.close()` (which can hang
indefinitely in Kit's shutdown path) when the env var
`ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE=1` is set. It instead calls a new
`_kill_child_processes()` helper that walks `/proc` to `SIGKILL` all
direct children before calling `os._exit(0)`, preventing orphaned Kit
processes from holding GPU resources (see the sketch after this list).

2. `run_subprocess` gets configurable wall-clock timeouts and process
isolation, so that when needed it can trigger the force-exit path above.

3. Add wall-clock timing and logging inside the `SimulationApp` start
method, to track how long startup takes on CI.
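
A minimal sketch of what the `_kill_child_processes()` helper from fix 1
could look like, assuming Linux procfs; the actual implementation may
differ in its details:

import os
import signal

def _kill_child_processes() -> None:
    """SIGKILL every direct child of this process by scanning /proc."""
    my_pid = os.getpid()
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # skip non-process entries like /proc/meminfo
        try:
            with open(f"/proc/{entry}/status") as f:
                for line in f:
                    if line.startswith("PPid:"):
                        if int(line.split()[1]) == my_pid:
                            os.kill(int(entry), signal.SIGKILL)
                        break
        except OSError:
            pass  # process already exited or is not ours to signal

With `ISAACLAB_ARENA_FORCE_EXIT_ON_COMPLETE=1` set, the test process would
run this and then call `os._exit(0)` instead of `app.close()`.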

## Minor fixes

1. Add timing stats to the pytest commands so each section reports its
slowest test functions at the end.

2. Parametrize multi-config tests: convert the nested for-loops in
`test_zero_action_policy_kitchen_pick_and_place` (6 configs) and
`test_zero_action_policy_gr1_open_microwave` (3 configs) into
`@pytest.mark.parametrize`. Each config gets its own timeout, pass/fail
status, and timing (see the sketch after this list).

3. Reduce `num_envs` in the gr00t `eval_runner` test to speed it up.
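
A hedged sketch of the parametrize conversion from minor fix 2; the
config names below are hypothetical stand-ins for the real 6
configurations:

import pytest

@pytest.mark.parametrize("embodiment", ["gr1", "franka"])
@pytest.mark.parametrize("object_name", ["bowl", "cup", "plate"])
def test_zero_action_policy_kitchen_pick_and_place(embodiment, object_name):
    # Each of the 6 (embodiment, object_name) combinations now runs as
    # its own test case, with its own timeout, pass/fail status, and timing.
    ...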

### Local validation
With the repro script from #568, I no longer see local stalling. See the
attached log for details.

[repro_20260410_041313.log](https://github.com/user-attachments/files/26620524/repro_20260410_041313.log)


### CI Before -- timeout
![image](https://github.com/user-attachments/assets/2f9eabb2-403d-4257-bd84-4da508de7d00)

### CI After
![image](https://github.com/user-attachments/assets/dbaf2a7d-e3a4-4ad2-85a4-389eae962c1d)
![image](https://github.com/user-attachments/assets/8a24f1aa-4bcb-4030-b075-09f3885673c2)

## TODOs
- test_camera_observations takes 10 min to start the app due to Kit cold
start. Experimenting with a warm start before the test process here:
#565
- Kit itself intermittently deadlocks during startup, not because of
orphans, but because Kit's internal thread synchronization fails on
low-CPU runners. Experimenting with retry here:
#570
