Skip to content

Add GR00T closed-loop E2E CI job#675

Draft
xyao-nv wants to merge 14 commits into
mainfrom
xyao/gr00t-closedloop-ci-job
Draft

Add GR00T closed-loop E2E CI job#675
xyao-nv wants to merge 14 commits into
mainfrom
xyao/gr00t-closedloop-ci-job

Conversation

@xyao-nv
Copy link
Copy Markdown
Collaborator

@xyao-nv xyao-nv commented May 13, 2026

Summary

Short description of the change (max 50 chars)

Detailed description

  • What was the reason for the change?
  • What has been changed?
  • What is the impact of this change?

@xyao-nv xyao-nv force-pushed the xyao/gr00t-closedloop-ci-job branch from 659295d to 8e96a00 Compare May 13, 2026 04:42
Copy link
Copy Markdown
Contributor

@isaaclab-review-bot isaaclab-review-bot Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Code Review – PR #675: Add GR00T closed-loop E2E CI job

Summary

This PR introduces a new end-to-end CI job that validates the full train → serve → eval pipeline for the GR00T closed-loop policy using a sidecar container pattern. The architecture is thoughtful—baking a bootstrap script into the image while keeping the actual recipe in the repo for PR-reviewability is a solid design choice.

Overall Assessment: 👍 Looks good to merge


🔄 Latest Update (d8f01bd)

Major refactoring and infrastructure improvements:

1. Removed monkey-patching in favor of proper recorder architecture

  • Deleted isaaclab_arena/utils/locomanip_mimic_patch.py - this was patching PreStepActionsRecorder and DataGenerator.generate() at runtime
  • Created proper G1LocomanipRecorderManagerCfg in isaaclab_arena_g1/g1_env/mdp/recorders/g1_locomanip_recorder_cfg.py
  • Mimic env configs now use mimic_recorder_config instead of runtime monkey-patching
  • This is a significant code quality improvement 🎉

2. Compatibility wrapper scripts

  • .github/scripts/ci_gr00t_bootstrap.sh and .github/scripts/ci_gr00t_train_and_serve.sh are thin forwarders
  • These exist because the published CI image hardcodes paths that look for scripts under .github/scripts/
  • Clean solution that maintains backward compatibility while moving the real logic to isaaclab_arena_gr00t/docker/

3. Dockerfile dual-mode build (ci vs training)

  • BUILD_MODE arg selects which model to download:
    • ci: baked Arena G1 locomanip checkpoint for fast E2E testing
    • training: base GR00T-N1.6-3B for fine-tuning
  • New push_to_ngc.sh script supports both modes with clear CLI interface

4. New OSMO workflow system (osmo/ restructuring)

  • Python-based workflow submission replaces manual YAML templates
  • submit_evaluation_workflow.py with proper argparse CLI
  • Modular task/workflow architecture (tasks/, workflows/)
  • Policy types enum (zero_action, replay)
  • Good separation of concerns and extensibility

5. G1 tensor device fix

-navigate_cmd = self.get_navigation_cmd_from_actions(actions_clone)
+navigate_cmd = self.get_navigation_cmd_from_actions(actions_clone).cpu()
  • Commands sent to CPU before P-controller processing
  • Prevents device mismatch errors in locomanip control flow

6. Physics collision fix

  • Added invisible kinematic shelf support patch for GPU simulation
  • The imported shelf mesh has uneven collision - small objects can fall through
  • Clean workaround using procedural_table asset

7. Test infrastructure

  • New test_gr00t_remote_closedloop_policy_runner.py - E2E smoke test against live server
  • Properly marked with @pytest.mark.gr00t_remote_e2e
  • Uses env vars for server config (ISAACLAB_ARENA_GR00T_REMOTE_HOST, etc.)

🔄 Previous Update (d7c3a01)

Removed model_path from Arena-side client configs - This is a solid architectural cleanup:

  1. Separation of concerns - The Arena client config no longer references where the model checkpoint lives. That's server-side knowledge only—the client just needs obs/action translation settings and the language instruction.

  2. Files cleaned up:

    • 5 YAML policy configs: removed model_path field entirely
    • Gr00tClosedloopPolicyConfig: removed the field and its HuggingFace-vs-local validation logic
    • Documentation: clarified these configs are for the "Arena GR00T evaluation client"

🏗️ Architecture & Design

Strengths:

  • The sidecar pattern elegantly solves the chicken-and-egg problem of needing GR00T container + checkout code
  • Bootstrap polling mechanism is clean and handles the service-before-checkout race correctly
  • Keeping ci_gr00t_train_and_serve.sh in the repo means changes are PR-reviewable
  • Removing monkey-patching in favor of proper recorder config is excellent software engineering
  • OSMO workflow system is well-structured and extensible

Considerations:

  • The compatibility wrapper scripts are necessary tech debt - document when they can be removed
  • Version table in README now shows release/0.2.1 and release/0.1.0 - good to keep current

✅ What Works Well

  1. Proper recorder architecture - No more runtime patching of PreStepActionsRecorder
  2. Comprehensive diagnostics - DNS/TCP probes make CI failures debuggable
  3. Defensive shell scripting - set -euo pipefail everywhere
  4. Dual-mode Docker build - CI and training images from same Dockerfile
  5. Test markers - gr00t_remote_e2e properly separates E2E tests
  6. OSMO modularity - Clean task/workflow separation

🔍 Minor Notes

  1. README version table - Added release/0.2.1 and release/0.1.0 entries ✓
  2. pytest.ini - New gr00t_remote_e2e marker registered ✓
  3. Asset scales - Test constants APPLE_SCALE and PLATE_SCALE extracted for clarity ✓

✅ Security & Safety

  • No secrets in scripts ✓
  • NGC credentials handled via GitHub secrets ✓
  • No elevated permissions required ✓

Verdict: This update brings substantial code quality improvements—particularly removing the monkey-patching in favor of proper architecture. The OSMO workflow restructuring and Docker dual-mode builds are well-designed. Ready to merge.

Reviewed by IsaacLab Review Bot 🤖

@xyao-nv xyao-nv force-pushed the xyao/gr00t-closedloop-ci-job branch from d68116e to d7c3a01 Compare May 14, 2026 23:53
xyao-nv added 12 commits May 15, 2026 12:07
Label-gated GHA job (gr00t-closedloop / workflow_dispatch) that runs a
tiny single-GPU finetune in a sidecar GR00T container, launches the
policy server on the resulting checkpoint, and runs Isaac Sim
closed-loop eval against it via Gr00tRemoteClosedloopPolicy. Verifies
the train -> serve -> eval plumbing on opt-in PRs without paying the
cost on every push; the model is barely trained so the assertion is
just "ran without error".

The sidecar bakes /workspace/ci_bootstrap.sh into the gr00t image so
the actual train+serve recipe lives under .github/scripts/ and
versions with PRs. The bootstrap polls for the arena workspace
bind-mount to populate (services start before checkout) then execs the
wrapper. Renames the gr00t image to gr00t1_6_arena_ci to reflect its
dual CI/OSMO role.

Signed-off-by: Xinjie Yao <xyao@nvidia.com>
The test_gr00t_closedloop_e2e job's if-gate accepts workflow_dispatch
events, but workflow_dispatch wasn't listed in the workflow's top-level
on: block, so the "Run workflow" button never appeared and the manual
escape hatch was unreachable. Add it.

Signed-off-by: Xinjie Yao <xyao@nvidia.com>
The pytorch base image hard-pins typer==0.15.2 in
/etc/pip/constraint.txt, but huggingface-hub[cli] needs typer>=0.16
(for the suggest_commands kwarg). Without this, the docker build fails
on the hf install step. Remove the typer line from the constraint file
(same pattern already used for dill and scipy higher in the
Dockerfile), then upgrade typer alongside the CLI install.

Signed-off-by: Xinjie Yao <xyao@nvidia.com>
Drop the label-gate and workflow_dispatch-only if-condition so the job
runs on every pull request and every push to main, matching the
trigger model of the other jobs in this workflow.

Signed-off-by: Xinjie Yao <xyao@nvidia.com>
Make timeouts on the closed-loop job actually debuggable. Adds an init
DNS + TCP probe to confirm the service container is reachable before
polling the application protocol; per-iteration logs include the
exception class (not just str) plus a periodic TCP probe so logs make
clear whether the port is closed or the server is still loading; and
on timeout, prints final DNS + TCP probe results and the last
PolicyClient traceback.

Cap step timeout at 10 min and align loop iter count with it.

Signed-off-by: Xinjie Yao <xyao@nvidia.com>
Two compounding bugs were causing the wait step to time out with zero
log output. PolicyClient takes a timeout_ms ctor arg but never applies
it to the underlying ZMQ REQ socket, so the first call to
get_modality_config() blocked indefinitely on recv() and our retry
loop never ran. Python's stdout is fully buffered when piped (as in
GHA logs), so when the step was killed at the job-level timeout, the
buffered diagnostic prints died with it -- making the failure look
silent.

Use ZMQ directly with explicit RCVTIMEO/SNDTIMEO and a fresh socket per
attempt (so a timeout doesn't wedge REQ-REP state). Switch the probe
endpoint to ping (lighter than get_modality_config). Force unbuffered
stdout via PYTHONUNBUFFERED, python -u, and flush=True on every print.

Signed-off-by: Xinjie Yao <xyao@nvidia.com>
Plain python in the gr00t image can't import gr00t pyproject deps
(tyro, etc.) because the Dockerfile installs them into /workspace/.venv
via uv sync, not into system Python. CI surfaced this as
ModuleNotFoundError: tyro at launch_finetune.py:7. Match OSMO
finetune.yaml's entry.sh and prefix both calls with uv run; keep exec
on the server so SIGTERM at container teardown reaches python.

Signed-off-by: Xinjie Yao <xyao@nvidia.com>
Signed-off-by: Xinjie Yao <xyao@nvidia.com>
Signed-off-by: Xinjie Yao <xyao@nvidia.com>
Signed-off-by: Xinjie Yao <xyao@nvidia.com>
@xyao-nv xyao-nv force-pushed the xyao/gr00t-closedloop-ci-job branch from 76626de to 7d6fe6c Compare May 15, 2026 19:26
xyao-nv added 2 commits May 15, 2026 12:30
Signed-off-by: Xinjie Yao <xyao@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant