fix(onboard): reject Jetson sandbox GPU passthrough#3965
Conversation
📝 WalkthroughWalkthroughGPU sandbox passthrough on Jetson is blocked: Jetson-specific messages added, sandbox GPU mode defaults to disabled on Jetson unless forced, preflight validation is centralized and checks CDI specs, and gateway reconciliation aborts early for Jetson passthrough requests. ChangesJetson GPU Passthrough Unsupported Blocking
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/lib/onboard.ts (1)
1210-1232: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick winCollapse the duplicated sandbox-GPU preflight path.
config.errorsis now validated in three places across the same flow, and the resolve/opt-out logic is duplicated between first-run and--resume. That already pushedsrc/lib/onboard.tsover the onboard-entrypoint budget, and it makes future Jetson behavior easier to drift.♻️ One way to centralize it
function validateSandboxGpuPreflight(config: SandboxGpuConfig): void { - validateSandboxGpuConfigErrors(config); if (!config.sandboxGpuEnabled) return; if (process.platform !== "linux") return; ... } +function resolveSandboxGpuPreflightState( + gpu: ReturnType<typeof nim.detectGpu>, + options: { + flag: SandboxGpuFlag; + device: string | null; + optedOutGpuPassthrough?: boolean; + noGpu?: boolean; + persistedGpuPassthrough?: boolean; + gpuRequested?: boolean; + }, +) { + const sandboxGpuConfig = resolveSandboxGpuConfig(gpu, { + flag: options.flag, + device: options.device, + }); + validateSandboxGpuConfigErrors(sandboxGpuConfig); + return { + sandboxGpuConfig, + optedOutGpuPassthrough: + options.optedOutGpuPassthrough === true || + options.noGpu === true || + (options.gpuRequested !== true && options.persistedGpuPassthrough === false) || + !sandboxGpuConfig.sandboxGpuEnabled, + }; +}Also applies to: 3311-3321, 9374-9392
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 1210 - 1232, Centralize sandbox GPU preflight validation by removing duplicated config.errors checks and duplicate resolve/opt-out logic: make validateSandboxGpuConfigErrors the single validator invoked once at the onboarding entrypoint (instead of calling it inside validateSandboxGpuPreflight and at the other duplicate sites), refactor validateSandboxGpuConfigErrors to return an error list or throw instead of calling process.exit directly, and update validateSandboxGpuPreflight (and the other duplicated blocks around the same flow) to only perform platform/CDI detection and remediation messaging using the validated config; ensure references to validateSandboxGpuPreflight and validateSandboxGpuConfigErrors are updated so no duplicate error-handling paths remain.
🧹 Nitpick comments (1)
src/lib/onboard.ts (1)
9511-9516: Run the core onboard E2Es for this Jetson gateway-reuse handoff.This call now decides gateway reuse with Jetson platform context threaded in, so I'd still want the core onboard flows exercised before merge, especially
cloud-e2e,sandbox-operations-e2e,rebuild-openclaw-e2e, andopenshell-gateway-upgrade-e2e.As per coding guidelines,
src/lib/onboard.ts: This file contains core onboarding logic. Changes here affect the full sandbox creation and configuration flow. E2E test recommendation includescloud-e2e,sandbox-operations-e2e,rebuild-openclaw-e2e, andopenshell-gateway-upgrade-e2e.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 9511 - 9516, The change to gateway reuse decision now passes Jetson context into reconcileGatewayGpuReuseForGpuIntent (call sites referencing gatewayReuseState, gpuPassthrough, GATEWAY_NAME, recordedSandboxName/requestedSandboxName and hostGpuPlatform), so before merging run the core onboard end-to-end suites to validate the full sandbox flows: cloud-e2e, sandbox-operations-e2e, rebuild-openclaw-e2e, and openshell-gateway-upgrade-e2e; fix any failures exposed by those tests (investigate reconcileGatewayGpuReuseForGpuIntent behavior with Jetson gpu?.platform values and adjust logic or tests accordingly) and only merge once all listed E2Es pass.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@src/lib/onboard.ts`:
- Around line 1210-1232: Centralize sandbox GPU preflight validation by removing
duplicated config.errors checks and duplicate resolve/opt-out logic: make
validateSandboxGpuConfigErrors the single validator invoked once at the
onboarding entrypoint (instead of calling it inside validateSandboxGpuPreflight
and at the other duplicate sites), refactor validateSandboxGpuConfigErrors to
return an error list or throw instead of calling process.exit directly, and
update validateSandboxGpuPreflight (and the other duplicated blocks around the
same flow) to only perform platform/CDI detection and remediation messaging
using the validated config; ensure references to validateSandboxGpuPreflight and
validateSandboxGpuConfigErrors are updated so no duplicate error-handling paths
remain.
---
Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 9511-9516: The change to gateway reuse decision now passes Jetson
context into reconcileGatewayGpuReuseForGpuIntent (call sites referencing
gatewayReuseState, gpuPassthrough, GATEWAY_NAME,
recordedSandboxName/requestedSandboxName and hostGpuPlatform), so before merging
run the core onboard end-to-end suites to validate the full sandbox flows:
cloud-e2e, sandbox-operations-e2e, rebuild-openclaw-e2e, and
openshell-gateway-upgrade-e2e; fix any failures exposed by those tests
(investigate reconcileGatewayGpuReuseForGpuIntent behavior with Jetson
gpu?.platform values and adjust logic or tests accordingly) and only merge once
all listed E2Es pass.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 92bed78d-1b8a-4b0c-8208-6fe808258d64
📒 Files selected for processing (7)
src/lib/onboard.tssrc/lib/onboard/gateway-gpu-passthrough.test.tssrc/lib/onboard/gateway-gpu-passthrough.tssrc/lib/onboard/gpu-recovery.test.tssrc/lib/onboard/gpu-recovery.tssrc/lib/onboard/sandbox-gpu-mode.test.tssrc/lib/onboard/sandbox-gpu-mode.ts
d0a5a63 to
ace53da
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/lib/onboard/gpu-recovery.ts (1)
110-116: ⚡ Quick winSkip registry lookup when Jetson is already known unsupported.
reportGpuPassthroughRecoverystill evaluatesloadNames()even whenunsupportedPlatform === "jetson", but that branch never uses names. Short-circuiting avoids unnecessary registry IO and keeps the early-reject path tighter.Proposed refactor
export function reportGpuPassthroughRecovery( emit: (line: string) => void, loadNames: () => string[] = getRegisteredSandboxNamesForGpuRecovery, options: GpuPassthroughRecoveryOptions = {}, ): void { - for (const line of gpuPassthroughRecoveryLines(loadNames(), options)) emit(line); + const names = + options.unsupportedPlatform === "jetson" ? null : loadNames(); + for (const line of gpuPassthroughRecoveryLines(names, options)) emit(line); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard/gpu-recovery.ts` around lines 110 - 116, Short-circuit the registry lookup in reportGpuPassthroughRecovery: if options.unsupportedPlatform === "jetson" do not call loadNames() (which triggers registry IO); instead call gpuPassthroughRecoveryLines with an empty names array (or otherwise iterate its output) and emit those lines, otherwise call loadNames() as today. Update reportGpuPassthroughRecovery to conditionally obtain names only when unsupportedPlatform is not "jetson" (referencing reportGpuPassthroughRecovery, loadNames, getRegisteredSandboxNamesForGpuRecovery, gpuPassthroughRecoveryLines, and GpuPassthroughRecoveryOptions).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/lib/onboard/gpu-recovery.ts`:
- Around line 110-116: Short-circuit the registry lookup in
reportGpuPassthroughRecovery: if options.unsupportedPlatform === "jetson" do not
call loadNames() (which triggers registry IO); instead call
gpuPassthroughRecoveryLines with an empty names array (or otherwise iterate its
output) and emit those lines, otherwise call loadNames() as today. Update
reportGpuPassthroughRecovery to conditionally obtain names only when
unsupportedPlatform is not "jetson" (referencing reportGpuPassthroughRecovery,
loadNames, getRegisteredSandboxNamesForGpuRecovery, gpuPassthroughRecoveryLines,
and GpuPassthroughRecoveryOptions).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1034411c-0705-4748-911e-536e862853b2
📒 Files selected for processing (8)
src/lib/onboard.tssrc/lib/onboard/gateway-gpu-passthrough.test.tssrc/lib/onboard/gateway-gpu-passthrough.tssrc/lib/onboard/gpu-recovery.test.tssrc/lib/onboard/gpu-recovery.tssrc/lib/onboard/sandbox-gpu-mode.test.tssrc/lib/onboard/sandbox-gpu-mode.tssrc/lib/onboard/sandbox-gpu-preflight.ts
ace53da to
361578c
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (3)
src/lib/onboard/sandbox-gpu-mode.ts (1)
48-49: 💤 Low valueConsider clarifying whether Jetson GPU passthrough is permanently unsupported.
The comment states that GPU sandbox passthrough "does not currently work on Jetson," which might suggest this is a temporary limitation. Based on the error messages
JETSON_SANDBOX_GPU_UNSUPPORTED_MESSAGEand the PR objectives, this appears to be an architectural constraint rather than a temporary gap. If Jetson GPU passthrough is permanently unsupported, consider removing "currently" to avoid setting incorrect expectations.📝 Suggested clarification
- // GPU sandbox passthrough does not currently work on Jetson; keep auto/default - // behavior on the CPU sandbox path unless the user explicitly forces GPU. + // GPU sandbox passthrough is not supported on Jetson/Tegra; keep auto/default + // behavior on the CPU sandbox path unless the user explicitly forces GPU.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard/sandbox-gpu-mode.ts` around lines 48 - 49, Update the comment in sandbox-gpu-mode.ts to remove the word "currently" and state that Jetson GPU sandbox passthrough is unsupported (architectural constraint) rather than a temporary limitation; also ensure any related messaging/constants like JETSON_SANDBOX_GPU_UNSUPPORTED_MESSAGE reflect this permanent status so the comment and messages are consistent with the code's behavior.src/lib/onboard/sandbox-gpu-mode.test.ts (1)
110-114: 💤 Low valueConsider checking both error messages for consistency.
The test for
jetsonFlagEnableverifies the unsupported message but not the workaround message, while thejetsonExplicitEnabletest above checks both. For consistency and completeness, consider asserting bothJETSON_SANDBOX_GPU_UNSUPPORTED_MESSAGEandJETSON_SANDBOX_GPU_WORKAROUND_MESSAGEare present.✅ Suggested assertion addition
const jetsonFlagEnable = resolveSandboxGpuConfig(jetson, { flag: "enable", env: {} }); expect(jetsonFlagEnable.mode).toBe("1"); expect(jetsonFlagEnable.errors.join("\n")).toContain( JETSON_SANDBOX_GPU_UNSUPPORTED_MESSAGE, ); +expect(jetsonFlagEnable.errors.join("\n")).toContain( + JETSON_SANDBOX_GPU_WORKAROUND_MESSAGE, +);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard/sandbox-gpu-mode.test.ts` around lines 110 - 114, The test for jetsonFlagEnable should assert both JETSON_SANDBOX_GPU_UNSUPPORTED_MESSAGE and JETSON_SANDBOX_GPU_WORKAROUND_MESSAGE are present in the errors like the jetsonExplicitEnable test does; update the assertion for resolveSandboxGpuConfig(...) stored in jetsonFlagEnable to check that jetsonFlagEnable.errors.join("\n") contains both JETSON_SANDBOX_GPU_UNSUPPORTED_MESSAGE and JETSON_SANDBOX_GPU_WORKAROUND_MESSAGE so the two tests are consistent.src/lib/onboard/gateway-gpu-passthrough.ts (1)
144-147: Run targeted onboarding E2E coverage for this early-exit path.Given this changes control flow before gateway inspection/cleanup, I’d run at least
cloud-e2e,sandbox-operations-e2e, andchannels-stop-start-e2efrom the recommended nightly matrix on this branch.As per coding guidelines, “E2E test recommendation: cloud-e2e … sandbox-operations-e2e … channels-stop-start-e2e …”.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard/gateway-gpu-passthrough.ts` around lines 144 - 147, This early-exit path (when gpuPassthrough && hostGpuPlatform === "jetson" in gateway-gpu-passthrough.ts) bypasses gateway inspection/cleanup, so add targeted E2E coverage: create/enable tests in the cloud-e2e, sandbox-operations-e2e, and channels-stop-start-e2e suites that trigger gpuPassthrough=true and hostGpuPlatform="jetson" (via env/fixture) and assert that reportGpuPassthroughRecovery is invoked and the process exits as expected, and also verify whether gateway inspection/cleanup behavior is executed or intentionally skipped; ensure tests fail if the early exit accidentally skips required cleanup.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/lib/onboard/gateway-gpu-passthrough.ts`:
- Around line 144-147: This early-exit path (when gpuPassthrough &&
hostGpuPlatform === "jetson" in gateway-gpu-passthrough.ts) bypasses gateway
inspection/cleanup, so add targeted E2E coverage: create/enable tests in the
cloud-e2e, sandbox-operations-e2e, and channels-stop-start-e2e suites that
trigger gpuPassthrough=true and hostGpuPlatform="jetson" (via env/fixture) and
assert that reportGpuPassthroughRecovery is invoked and the process exits as
expected, and also verify whether gateway inspection/cleanup behavior is
executed or intentionally skipped; ensure tests fail if the early exit
accidentally skips required cleanup.
In `@src/lib/onboard/sandbox-gpu-mode.test.ts`:
- Around line 110-114: The test for jetsonFlagEnable should assert both
JETSON_SANDBOX_GPU_UNSUPPORTED_MESSAGE and JETSON_SANDBOX_GPU_WORKAROUND_MESSAGE
are present in the errors like the jetsonExplicitEnable test does; update the
assertion for resolveSandboxGpuConfig(...) stored in jetsonFlagEnable to check
that jetsonFlagEnable.errors.join("\n") contains both
JETSON_SANDBOX_GPU_UNSUPPORTED_MESSAGE and JETSON_SANDBOX_GPU_WORKAROUND_MESSAGE
so the two tests are consistent.
In `@src/lib/onboard/sandbox-gpu-mode.ts`:
- Around line 48-49: Update the comment in sandbox-gpu-mode.ts to remove the
word "currently" and state that Jetson GPU sandbox passthrough is unsupported
(architectural constraint) rather than a temporary limitation; also ensure any
related messaging/constants like JETSON_SANDBOX_GPU_UNSUPPORTED_MESSAGE reflect
this permanent status so the comment and messages are consistent with the code's
behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: acaca63e-a1c5-4033-bd27-c5ee19a4bdd1
📒 Files selected for processing (8)
src/lib/onboard.tssrc/lib/onboard/gateway-gpu-passthrough.test.tssrc/lib/onboard/gateway-gpu-passthrough.tssrc/lib/onboard/gpu-recovery.test.tssrc/lib/onboard/gpu-recovery.tssrc/lib/onboard/sandbox-gpu-mode.test.tssrc/lib/onboard/sandbox-gpu-mode.tssrc/lib/onboard/sandbox-gpu-preflight.ts
Jetson and Tegra hosts cannot use the OpenShell sandbox GPU passthrough path, so keep auto/default mode on CPU and fail forced GPU requests before gateway or sandbox creation. Recovery hints now direct Jetson users to CPU mode instead of destroy/recreate loops. Fixes NVIDIA#3710 Fixes NVIDIA#3473 Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
361578c to
480d048
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/lib/onboard.ts (1)
3283-3292: Run the recommended onboard E2Es for this path.This touches core preflight and gateway-reuse behavior in
src/lib/onboard.ts, so I'd run the targeted onboard E2Es before merge—at leastcloud-e2e,sandbox-operations-e2e, andopenshell-gateway-upgrade-e2e.As per coding guidelines "src/lib/onboard.ts: This file contains core onboarding logic. Changes here affect the full sandbox creation and configuration flow."
Also applies to: 9346-9364, 9483-9497
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/lib/onboard.ts` around lines 3283 - 3292, This change touches core onboarding/preflight logic (functions nim.detectGpu, resolveSandboxGpuConfig, exitOnSandboxGpuConfigErrors and the optedOutGpuPassthrough condition in src/lib/onboard.ts) so before merging run the targeted onboard end-to-end suites: cloud-e2e, sandbox-operations-e2e, and openshell-gateway-upgrade-e2e (also verify the other modified regions around the file referenced in the review). Fix any failures uncovered by these E2Es (adjust config handling, flags, or gating logic in the referenced functions) and re-run until all three suites pass.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 3283-3292: This change touches core onboarding/preflight logic
(functions nim.detectGpu, resolveSandboxGpuConfig, exitOnSandboxGpuConfigErrors
and the optedOutGpuPassthrough condition in src/lib/onboard.ts) so before
merging run the targeted onboard end-to-end suites: cloud-e2e,
sandbox-operations-e2e, and openshell-gateway-upgrade-e2e (also verify the other
modified regions around the file referenced in the review). Fix any failures
uncovered by these E2Es (adjust config handling, flags, or gating logic in the
referenced functions) and re-run until all three suites pass.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f50e3734-0711-474c-b14a-f330b0c1f23e
📒 Files selected for processing (8)
src/lib/onboard.tssrc/lib/onboard/gateway-gpu-passthrough.test.tssrc/lib/onboard/gateway-gpu-passthrough.tssrc/lib/onboard/gpu-recovery.test.tssrc/lib/onboard/gpu-recovery.tssrc/lib/onboard/sandbox-gpu-mode.test.tssrc/lib/onboard/sandbox-gpu-mode.tssrc/lib/onboard/sandbox-gpu-preflight.ts
Summary
Jetson/Tegra hosts now stay on CPU sandbox mode for default/auto GPU behavior, and explicit sandbox GPU passthrough requests fail early with Jetson-specific guidance before gateway or sandbox creation. This prevents the forced-GPU gateway recovery loop shared by the Jetson/Tegra reports.
Related Issue
Fixes #3710 (#3710)
Fixes #3473 (#3473)
Changes
autoand unset sandbox GPU modes on the CPU path unless GPU passthrough is explicitly forced.Type of Change
Verification
Targeted verification passed; broad local test hooks have unrelated pre-existing/full-suite failures documented below.
npx prek run --all-filespassesnpm testpassesmake docsbuilds without warnings (doc changes only)Passed locally:
npx vitest run src/lib/onboard/sandbox-gpu-mode.test.ts src/lib/onboard/gpu-recovery.test.ts src/lib/onboard/gateway-gpu-passthrough.test.tsnpx vitest run test/onboard.test.ts src/lib/onboard/preflight.test.tsnpm run typecheck:clinpm run build:clinpx @biomejs/biome check src/lib/onboard.ts src/lib/onboard/gateway-gpu-passthrough.ts src/lib/onboard/gpu-recovery.ts src/lib/onboard/sandbox-gpu-mode.ts src/lib/onboard/gateway-gpu-passthrough.test.ts src/lib/onboard/gpu-recovery.test.ts src/lib/onboard/sandbox-gpu-mode.test.tsNEMOCLAW_NONINTERACTIVE=1 node ./bin/nemoclaw.js onboard --fresh --non-interactive --yes --yes-i-accept-third-party-software --gpu --name codex-jetson-reproexits before gateway/sandbox creation on this non-GPU x86 host.codex review -c sandbox_mode="danger-full-access" --uncommittedreported no actionable findings after fixes.Known local failures:
npm testfails intest/fetch-guard-patch-regression.test.tsbecause the test executes a local removal of/usr/local/lib/node_modules/openclawand hitsPermission denied; this reproduces in isolation and is unrelated to Jetson onboarding.test/cli.test.tsruns intermittently timed out in CLI dispatch tests under load; the affected tests passed when isolated (sandbox inspection help uses native oclif usage,rejects malformed logs flags before calling OpenShell, andchannels mutation dry-run paths dispatch through oclif).npx prek run --all-filesfails only at thetest-clihook for the same fetch-guard failure plus one full-suite CLI timeout; non-test hooks and plugin tests passed.Signed-off-by: Yimo Jiang yimoj@nvidia.com
Summary by CodeRabbit
Bug Fixes
Improvements
New Features