Add GR00T closed-loop E2E CI job#675
Conversation
659295d to
8e96a00
Compare
There was a problem hiding this comment.
🤖 Code Review – PR #675: Add GR00T closed-loop E2E CI job
Summary
This PR introduces a new end-to-end CI job that validates the full train → serve → eval pipeline for the GR00T closed-loop policy using a sidecar container pattern. The architecture is thoughtful—baking a bootstrap script into the image while keeping the actual recipe in the repo for PR-reviewability is a solid design choice.
Overall Assessment: 👍 Looks good to merge
🔄 Latest Update (d8f01bd)
Major refactoring and infrastructure improvements:
1. Removed monkey-patching in favor of proper recorder architecture
- Deleted
isaaclab_arena/utils/locomanip_mimic_patch.py- this was patchingPreStepActionsRecorderandDataGenerator.generate()at runtime - Created proper
G1LocomanipRecorderManagerCfginisaaclab_arena_g1/g1_env/mdp/recorders/g1_locomanip_recorder_cfg.py - Mimic env configs now use
mimic_recorder_configinstead of runtime monkey-patching - This is a significant code quality improvement 🎉
2. Compatibility wrapper scripts
.github/scripts/ci_gr00t_bootstrap.shand.github/scripts/ci_gr00t_train_and_serve.share thin forwarders- These exist because the published CI image hardcodes paths that look for scripts under
.github/scripts/ - Clean solution that maintains backward compatibility while moving the real logic to
isaaclab_arena_gr00t/docker/
3. Dockerfile dual-mode build (ci vs training)
BUILD_MODEarg selects which model to download:ci: baked Arena G1 locomanip checkpoint for fast E2E testingtraining: base GR00T-N1.6-3B for fine-tuning
- New
push_to_ngc.shscript supports both modes with clear CLI interface
4. New OSMO workflow system (osmo/ restructuring)
- Python-based workflow submission replaces manual YAML templates
submit_evaluation_workflow.pywith proper argparse CLI- Modular task/workflow architecture (
tasks/,workflows/) - Policy types enum (
zero_action,replay) - Good separation of concerns and extensibility
5. G1 tensor device fix
-navigate_cmd = self.get_navigation_cmd_from_actions(actions_clone)
+navigate_cmd = self.get_navigation_cmd_from_actions(actions_clone).cpu()- Commands sent to CPU before P-controller processing
- Prevents device mismatch errors in locomanip control flow
6. Physics collision fix
- Added invisible kinematic shelf support patch for GPU simulation
- The imported shelf mesh has uneven collision - small objects can fall through
- Clean workaround using
procedural_tableasset
7. Test infrastructure
- New
test_gr00t_remote_closedloop_policy_runner.py- E2E smoke test against live server - Properly marked with
@pytest.mark.gr00t_remote_e2e - Uses env vars for server config (
ISAACLAB_ARENA_GR00T_REMOTE_HOST, etc.)
🔄 Previous Update (d7c3a01)
Removed model_path from Arena-side client configs - This is a solid architectural cleanup:
-
Separation of concerns - The Arena client config no longer references where the model checkpoint lives. That's server-side knowledge only—the client just needs obs/action translation settings and the language instruction.
-
Files cleaned up:
- 5 YAML policy configs: removed
model_pathfield entirely Gr00tClosedloopPolicyConfig: removed the field and its HuggingFace-vs-local validation logic- Documentation: clarified these configs are for the "Arena GR00T evaluation client"
- 5 YAML policy configs: removed
🏗️ Architecture & Design
Strengths:
- The sidecar pattern elegantly solves the chicken-and-egg problem of needing GR00T container + checkout code
- Bootstrap polling mechanism is clean and handles the service-before-checkout race correctly
- Keeping
ci_gr00t_train_and_serve.shin the repo means changes are PR-reviewable - Removing monkey-patching in favor of proper recorder config is excellent software engineering
- OSMO workflow system is well-structured and extensible
Considerations:
- The compatibility wrapper scripts are necessary tech debt - document when they can be removed
- Version table in README now shows
release/0.2.1andrelease/0.1.0- good to keep current
✅ What Works Well
- Proper recorder architecture - No more runtime patching of
PreStepActionsRecorder - Comprehensive diagnostics - DNS/TCP probes make CI failures debuggable
- Defensive shell scripting -
set -euo pipefaileverywhere - Dual-mode Docker build - CI and training images from same Dockerfile
- Test markers -
gr00t_remote_e2eproperly separates E2E tests - OSMO modularity - Clean task/workflow separation
🔍 Minor Notes
- README version table - Added
release/0.2.1andrelease/0.1.0entries ✓ - pytest.ini - New
gr00t_remote_e2emarker registered ✓ - Asset scales - Test constants
APPLE_SCALEandPLATE_SCALEextracted for clarity ✓
✅ Security & Safety
- No secrets in scripts ✓
- NGC credentials handled via GitHub secrets ✓
- No elevated permissions required ✓
Verdict: This update brings substantial code quality improvements—particularly removing the monkey-patching in favor of proper architecture. The OSMO workflow restructuring and Docker dual-mode builds are well-designed. Ready to merge.
Reviewed by IsaacLab Review Bot 🤖
d68116e to
d7c3a01
Compare
Label-gated GHA job (gr00t-closedloop / workflow_dispatch) that runs a tiny single-GPU finetune in a sidecar GR00T container, launches the policy server on the resulting checkpoint, and runs Isaac Sim closed-loop eval against it via Gr00tRemoteClosedloopPolicy. Verifies the train -> serve -> eval plumbing on opt-in PRs without paying the cost on every push; the model is barely trained so the assertion is just "ran without error". The sidecar bakes /workspace/ci_bootstrap.sh into the gr00t image so the actual train+serve recipe lives under .github/scripts/ and versions with PRs. The bootstrap polls for the arena workspace bind-mount to populate (services start before checkout) then execs the wrapper. Renames the gr00t image to gr00t1_6_arena_ci to reflect its dual CI/OSMO role. Signed-off-by: Xinjie Yao <xyao@nvidia.com>
The test_gr00t_closedloop_e2e job's if-gate accepts workflow_dispatch events, but workflow_dispatch wasn't listed in the workflow's top-level on: block, so the "Run workflow" button never appeared and the manual escape hatch was unreachable. Add it. Signed-off-by: Xinjie Yao <xyao@nvidia.com>
The pytorch base image hard-pins typer==0.15.2 in /etc/pip/constraint.txt, but huggingface-hub[cli] needs typer>=0.16 (for the suggest_commands kwarg). Without this, the docker build fails on the hf install step. Remove the typer line from the constraint file (same pattern already used for dill and scipy higher in the Dockerfile), then upgrade typer alongside the CLI install. Signed-off-by: Xinjie Yao <xyao@nvidia.com>
Drop the label-gate and workflow_dispatch-only if-condition so the job runs on every pull request and every push to main, matching the trigger model of the other jobs in this workflow. Signed-off-by: Xinjie Yao <xyao@nvidia.com>
Make timeouts on the closed-loop job actually debuggable. Adds an init DNS + TCP probe to confirm the service container is reachable before polling the application protocol; per-iteration logs include the exception class (not just str) plus a periodic TCP probe so logs make clear whether the port is closed or the server is still loading; and on timeout, prints final DNS + TCP probe results and the last PolicyClient traceback. Cap step timeout at 10 min and align loop iter count with it. Signed-off-by: Xinjie Yao <xyao@nvidia.com>
Two compounding bugs were causing the wait step to time out with zero log output. PolicyClient takes a timeout_ms ctor arg but never applies it to the underlying ZMQ REQ socket, so the first call to get_modality_config() blocked indefinitely on recv() and our retry loop never ran. Python's stdout is fully buffered when piped (as in GHA logs), so when the step was killed at the job-level timeout, the buffered diagnostic prints died with it -- making the failure look silent. Use ZMQ directly with explicit RCVTIMEO/SNDTIMEO and a fresh socket per attempt (so a timeout doesn't wedge REQ-REP state). Switch the probe endpoint to ping (lighter than get_modality_config). Force unbuffered stdout via PYTHONUNBUFFERED, python -u, and flush=True on every print. Signed-off-by: Xinjie Yao <xyao@nvidia.com>
Plain python in the gr00t image can't import gr00t pyproject deps (tyro, etc.) because the Dockerfile installs them into /workspace/.venv via uv sync, not into system Python. CI surfaced this as ModuleNotFoundError: tyro at launch_finetune.py:7. Match OSMO finetune.yaml's entry.sh and prefix both calls with uv run; keep exec on the server so SIGTERM at container teardown reaches python. Signed-off-by: Xinjie Yao <xyao@nvidia.com>
Signed-off-by: Xinjie Yao <xyao@nvidia.com>
Signed-off-by: Xinjie Yao <xyao@nvidia.com>
Signed-off-by: Xinjie Yao <xyao@nvidia.com>
76626de to
7d6fe6c
Compare
Signed-off-by: Xinjie Yao <xyao@nvidia.com>
Summary
Short description of the change (max 50 chars)
Detailed description