Add GR00T closed-loop E2E CI job by xyao-nv · Pull Request #675 · isaac-sim/IsaacLab-Arena

xyao-nv · 2026-05-13T02:26:20Z

Summary

Short description of the change (max 50 chars)

Detailed description

What was the reason for the change?
What has been changed?
What is the impact of this change?

isaaclab-review-bot

🤖 Code Review – PR #675: Add GR00T closed-loop E2E CI job

Summary

This PR introduces a new end-to-end CI job that validates the full train → serve → eval pipeline for the GR00T closed-loop policy using a sidecar container pattern. The architecture is thoughtful—baking a bootstrap script into the image while keeping the actual recipe in the repo for PR-reviewability is a solid design choice.

Overall Assessment: 👍 Looks good to merge

🔄 Latest Update (`d8f01bd`)

Major refactoring and infrastructure improvements:

1. Removed monkey-patching in favor of proper recorder architecture

Deleted isaaclab_arena/utils/locomanip_mimic_patch.py - this was patching PreStepActionsRecorder and DataGenerator.generate() at runtime
Created proper G1LocomanipRecorderManagerCfg in isaaclab_arena_g1/g1_env/mdp/recorders/g1_locomanip_recorder_cfg.py
Mimic env configs now use mimic_recorder_config instead of runtime monkey-patching
This is a significant code quality improvement 🎉

2. Compatibility wrapper scripts

.github/scripts/ci_gr00t_bootstrap.sh and .github/scripts/ci_gr00t_train_and_serve.sh are thin forwarders
These exist because the published CI image hardcodes paths that look for scripts under .github/scripts/
Clean solution that maintains backward compatibility while moving the real logic to isaaclab_arena_gr00t/docker/

3. Dockerfile dual-mode build (`ci` vs `training`)

BUILD_MODE arg selects which model to download:
- ci: baked Arena G1 locomanip checkpoint for fast E2E testing
- training: base GR00T-N1.6-3B for fine-tuning
New push_to_ngc.sh script supports both modes with clear CLI interface

4. New OSMO workflow system (`osmo/` restructuring)

Python-based workflow submission replaces manual YAML templates
submit_evaluation_workflow.py with proper argparse CLI
Modular task/workflow architecture (tasks/, workflows/)
Policy types enum (zero_action, replay)
Good separation of concerns and extensibility

5. G1 tensor device fix

-navigate_cmd = self.get_navigation_cmd_from_actions(actions_clone)
+navigate_cmd = self.get_navigation_cmd_from_actions(actions_clone).cpu()

Commands sent to CPU before P-controller processing
Prevents device mismatch errors in locomanip control flow

6. Physics collision fix

Added invisible kinematic shelf support patch for GPU simulation
The imported shelf mesh has uneven collision - small objects can fall through
Clean workaround using procedural_table asset

7. Test infrastructure

New test_gr00t_remote_closedloop_policy_runner.py - E2E smoke test against live server
Properly marked with @pytest.mark.gr00t_remote_e2e
Uses env vars for server config (ISAACLAB_ARENA_GR00T_REMOTE_HOST, etc.)

🔄 Previous Update (`d7c3a01`)

Removed model_path from Arena-side client configs - This is a solid architectural cleanup:

Separation of concerns - The Arena client config no longer references where the model checkpoint lives. That's server-side knowledge only—the client just needs obs/action translation settings and the language instruction.
Files cleaned up:
- 5 YAML policy configs: removed model_path field entirely
- Gr00tClosedloopPolicyConfig: removed the field and its HuggingFace-vs-local validation logic
- Documentation: clarified these configs are for the "Arena GR00T evaluation client"

🏗️ Architecture & Design

Strengths:

The sidecar pattern elegantly solves the chicken-and-egg problem of needing GR00T container + checkout code
Bootstrap polling mechanism is clean and handles the service-before-checkout race correctly
Keeping ci_gr00t_train_and_serve.sh in the repo means changes are PR-reviewable
Removing monkey-patching in favor of proper recorder config is excellent software engineering
OSMO workflow system is well-structured and extensible

Considerations:

The compatibility wrapper scripts are necessary tech debt - document when they can be removed
Version table in README now shows release/0.2.1 and release/0.1.0 - good to keep current

✅ What Works Well

Proper recorder architecture - No more runtime patching of PreStepActionsRecorder
Comprehensive diagnostics - DNS/TCP probes make CI failures debuggable
Defensive shell scripting - set -euo pipefail everywhere
Dual-mode Docker build - CI and training images from same Dockerfile
Test markers - gr00t_remote_e2e properly separates E2E tests
OSMO modularity - Clean task/workflow separation

🔍 Minor Notes

README version table - Added release/0.2.1 and release/0.1.0 entries ✓
pytest.ini - New gr00t_remote_e2e marker registered ✓
Asset scales - Test constants APPLE_SCALE and PLATE_SCALE extracted for clarity ✓

✅ Security & Safety

No secrets in scripts ✓
NGC credentials handled via GitHub secrets ✓
No elevated permissions required ✓

Verdict: This update brings substantial code quality improvements—particularly removing the monkey-patching in favor of proper architecture. The OSMO workflow restructuring and Docker dual-mode builds are well-designed. Ready to merge.

Reviewed by IsaacLab Review Bot 🤖

Label-gated GHA job (gr00t-closedloop / workflow_dispatch) that runs a tiny single-GPU finetune in a sidecar GR00T container, launches the policy server on the resulting checkpoint, and runs Isaac Sim closed-loop eval against it via Gr00tRemoteClosedloopPolicy. Verifies the train -> serve -> eval plumbing on opt-in PRs without paying the cost on every push; the model is barely trained so the assertion is just "ran without error". The sidecar bakes /workspace/ci_bootstrap.sh into the gr00t image so the actual train+serve recipe lives under .github/scripts/ and versions with PRs. The bootstrap polls for the arena workspace bind-mount to populate (services start before checkout) then execs the wrapper. Renames the gr00t image to gr00t1_6_arena_ci to reflect its dual CI/OSMO role. Signed-off-by: Xinjie Yao <xyao@nvidia.com>

The test_gr00t_closedloop_e2e job's if-gate accepts workflow_dispatch events, but workflow_dispatch wasn't listed in the workflow's top-level on: block, so the "Run workflow" button never appeared and the manual escape hatch was unreachable. Add it. Signed-off-by: Xinjie Yao <xyao@nvidia.com>

The pytorch base image hard-pins typer==0.15.2 in /etc/pip/constraint.txt, but huggingface-hub[cli] needs typer>=0.16 (for the suggest_commands kwarg). Without this, the docker build fails on the hf install step. Remove the typer line from the constraint file (same pattern already used for dill and scipy higher in the Dockerfile), then upgrade typer alongside the CLI install. Signed-off-by: Xinjie Yao <xyao@nvidia.com>

Drop the label-gate and workflow_dispatch-only if-condition so the job runs on every pull request and every push to main, matching the trigger model of the other jobs in this workflow. Signed-off-by: Xinjie Yao <xyao@nvidia.com>

Make timeouts on the closed-loop job actually debuggable. Adds an init DNS + TCP probe to confirm the service container is reachable before polling the application protocol; per-iteration logs include the exception class (not just str) plus a periodic TCP probe so logs make clear whether the port is closed or the server is still loading; and on timeout, prints final DNS + TCP probe results and the last PolicyClient traceback. Cap step timeout at 10 min and align loop iter count with it. Signed-off-by: Xinjie Yao <xyao@nvidia.com>

Two compounding bugs were causing the wait step to time out with zero log output. PolicyClient takes a timeout_ms ctor arg but never applies it to the underlying ZMQ REQ socket, so the first call to get_modality_config() blocked indefinitely on recv() and our retry loop never ran. Python's stdout is fully buffered when piped (as in GHA logs), so when the step was killed at the job-level timeout, the buffered diagnostic prints died with it -- making the failure look silent. Use ZMQ directly with explicit RCVTIMEO/SNDTIMEO and a fresh socket per attempt (so a timeout doesn't wedge REQ-REP state). Switch the probe endpoint to ping (lighter than get_modality_config). Force unbuffered stdout via PYTHONUNBUFFERED, python -u, and flush=True on every print. Signed-off-by: Xinjie Yao <xyao@nvidia.com>

Plain python in the gr00t image can't import gr00t pyproject deps (tyro, etc.) because the Dockerfile installs them into /workspace/.venv via uv sync, not into system Python. CI surfaced this as ModuleNotFoundError: tyro at launch_finetune.py:7. Match OSMO finetune.yaml's entry.sh and prefix both calls with uv run; keep exec on the server so SIGTERM at container teardown reaches python. Signed-off-by: Xinjie Yao <xyao@nvidia.com>

Signed-off-by: Xinjie Yao <xyao@nvidia.com>

xyao-nv force-pushed the xyao/gr00t-closedloop-ci-job branch from 659295d to 8e96a00 Compare May 13, 2026 04:42

isaaclab-review-bot Bot reviewed May 13, 2026

View reviewed changes

xyao-nv force-pushed the xyao/gr00t-closedloop-ci-job branch from d68116e to d7c3a01 Compare May 14, 2026 23:53

xyao-nv added 12 commits May 15, 2026 12:07

Run GR00T closed-loop E2E on every PR

b50cd0f

Drop the label-gate and workflow_dispatch-only if-condition so the job runs on every pull request and every push to main, matching the trigger model of the other jobs in this workflow. Signed-off-by: Xinjie Yao <xyao@nvidia.com>

skip

a2c48ab

skip

1d7d931

Fix GR00T CI policy config validation

f966881

Signed-off-by: Xinjie Yao <xyao@nvidia.com>

Remove GR00T model path from policy configs

44fc3fe

Signed-off-by: Xinjie Yao <xyao@nvidia.com>

Modularize GR00T remote CI smoke test

7d6fe6c

Signed-off-by: Xinjie Yao <xyao@nvidia.com>

xyao-nv force-pushed the xyao/gr00t-closedloop-ci-job branch from 76626de to 7d6fe6c Compare May 15, 2026 19:26

xyao-nv added 2 commits May 15, 2026 12:30

disable gr00t docker push

7177b32

Fix GR00T CI sidecar startup

d8f01bd

Signed-off-by: Xinjie Yao <xyao@nvidia.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add GR00T closed-loop E2E CI job#675

Add GR00T closed-loop E2E CI job#675
xyao-nv wants to merge 14 commits into
mainfrom
xyao/gr00t-closedloop-ci-job

xyao-nv commented May 13, 2026

Uh oh!

isaaclab-review-bot Bot left a comment •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

xyao-nv commented May 13, 2026

Summary

Detailed description

Uh oh!

isaaclab-review-bot Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

🤖 Code Review – PR #675: Add GR00T closed-loop E2E CI job

Summary

🔄 Latest Update (d8f01bd)

1. Removed monkey-patching in favor of proper recorder architecture

2. Compatibility wrapper scripts

3. Dockerfile dual-mode build (ci vs training)

4. New OSMO workflow system (osmo/ restructuring)

5. G1 tensor device fix

6. Physics collision fix

7. Test infrastructure

🔄 Previous Update (d7c3a01)

🏗️ Architecture & Design

✅ What Works Well

🔍 Minor Notes

✅ Security & Safety

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

isaaclab-review-bot Bot left a comment •

edited

Loading

🔄 Latest Update (`d8f01bd`)

3. Dockerfile dual-mode build (`ci` vs `training`)

4. New OSMO workflow system (`osmo/` restructuring)

🔄 Previous Update (`d7c3a01`)