Skip to content

fix(onboard): reject Jetson sandbox GPU passthrough#3965

Open
yimoj wants to merge 1 commit into
NVIDIA:mainfrom
yimoj:fix/3710-jetson-gpu-passthrough
Open

fix(onboard): reject Jetson sandbox GPU passthrough#3965
yimoj wants to merge 1 commit into
NVIDIA:mainfrom
yimoj:fix/3710-jetson-gpu-passthrough

Conversation

@yimoj
Copy link
Copy Markdown
Contributor

@yimoj yimoj commented May 21, 2026

Summary

Jetson/Tegra hosts now stay on CPU sandbox mode for default/auto GPU behavior, and explicit sandbox GPU passthrough requests fail early with Jetson-specific guidance before gateway or sandbox creation. This prevents the forced-GPU gateway recovery loop shared by the Jetson/Tegra reports.

Related Issue

Fixes #3710 (#3710)
Fixes #3473 (#3473)

Changes

  • Keep Jetson/Tegra auto and unset sandbox GPU modes on the CPU path unless GPU passthrough is explicitly forced.
  • Reject forced Jetson/Tegra sandbox GPU passthrough before CDI, gateway reuse, gateway creation, or sandbox creation.
  • Replace Jetson CPU-only gateway recovery advice with CPU-mode guidance instead of destroy/recreate instructions.
  • Add focused tests for Jetson forced/default GPU mode and Jetson gateway recovery messaging.

Type of Change

  • Code change (feature, bug fix, or refactor)
  • Code change with doc updates
  • Doc only (prose changes, no code sample modifications)
  • Doc only (includes code sample changes)

Verification

Targeted verification passed; broad local test hooks have unrelated pre-existing/full-suite failures documented below.

  • npx prek run --all-files passes
  • npm test passes
  • Tests added or updated for new or changed behavior
  • No secrets, API keys, or credentials committed
  • Docs updated for user-facing behavior changes
  • make docs builds without warnings (doc changes only)
  • Doc pages follow the style guide (doc changes only)
  • New doc pages include SPDX header and frontmatter (new pages only)

Passed locally:

  • npx vitest run src/lib/onboard/sandbox-gpu-mode.test.ts src/lib/onboard/gpu-recovery.test.ts src/lib/onboard/gateway-gpu-passthrough.test.ts
  • npx vitest run test/onboard.test.ts src/lib/onboard/preflight.test.ts
  • npm run typecheck:cli
  • npm run build:cli
  • npx @biomejs/biome check src/lib/onboard.ts src/lib/onboard/gateway-gpu-passthrough.ts src/lib/onboard/gpu-recovery.ts src/lib/onboard/sandbox-gpu-mode.ts src/lib/onboard/gateway-gpu-passthrough.test.ts src/lib/onboard/gpu-recovery.test.ts src/lib/onboard/sandbox-gpu-mode.test.ts
  • Worktree CLI repro: NEMOCLAW_NONINTERACTIVE=1 node ./bin/nemoclaw.js onboard --fresh --non-interactive --yes --yes-i-accept-third-party-software --gpu --name codex-jetson-repro exits before gateway/sandbox creation on this non-GPU x86 host.
  • Jetson seam checks confirm forced GPU returns the unsupported Jetson/Tegra error and default Jetson mode stays CPU.
  • codex review -c sandbox_mode="danger-full-access" --uncommitted reported no actionable findings after fixes.

Known local failures:

  • npm test fails in test/fetch-guard-patch-regression.test.ts because the test executes a local removal of /usr/local/lib/node_modules/openclaw and hits Permission denied; this reproduces in isolation and is unrelated to Jetson onboarding.
  • Full-suite test/cli.test.ts runs intermittently timed out in CLI dispatch tests under load; the affected tests passed when isolated (sandbox inspection help uses native oclif usage, rejects malformed logs flags before calling OpenShell, and channels mutation dry-run paths dispatch through oclif).
  • npx prek run --all-files fails only at the test-cli hook for the same fetch-guard failure plus one full-suite CLI timeout; non-test hooks and plugin tests passed.

Signed-off-by: Yimo Jiang yimoj@nvidia.com

Summary by CodeRabbit

  • Bug Fixes

    • GPU passthrough on Jetson is now treated as unsupported and surfaces clear, early error messaging; gateway reuse aborts immediately to avoid unsafe operations.
  • Improvements

    • Stronger sandbox-GPU validation during setup and resume with immediate failures for invalid configs.
    • Jetson-specific guidance forces CPU fallback by default and provides clearer workaround instructions.
  • New Features

    • Linux-only GPU preflight checks that validate container runtime CDI specs and print actionable remediation steps.

Review Change Stack

@yimoj yimoj self-assigned this May 21, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 21, 2026

📝 Walkthrough

Walkthrough

GPU sandbox passthrough on Jetson is blocked: Jetson-specific messages added, sandbox GPU mode defaults to disabled on Jetson unless forced, preflight validation is centralized and checks CDI specs, and gateway reconciliation aborts early for Jetson passthrough requests.

Changes

Jetson GPU Passthrough Unsupported Blocking

Layer / File(s) Summary
Jetson GPU unsupported messaging constants and recovery options
src/lib/onboard/sandbox-gpu-mode.ts, src/lib/onboard/gpu-recovery.ts, src/lib/onboard/gpu-recovery.test.ts
Jetson-specific GPU sandbox passthrough error and workaround messages are exported; GpuPassthroughRecoveryOptions enables platform-specific recovery messaging used by gpuPassthroughRecoveryLines and reportGpuPassthroughRecovery. Tests validate Jetson-specific hints.
Sandbox GPU mode resolution with Jetson detection
src/lib/onboard/sandbox-gpu-mode.ts, src/lib/onboard/sandbox-gpu-mode.test.ts
resolveSandboxGpuMode forces GPU mode to "0" when Jetson is detected and env is unset/"auto", preserves explicit overrides, and resolveSandboxGpuConfig appends Jetson-specific errors when GPU mode is "1". Tests updated for these behaviors.
GPU preflight config validation refactoring
src/lib/onboard.ts, src/lib/onboard/sandbox-gpu-preflight.ts
Onboard preflight now resolves sandboxGpuConfig early and calls exitOnSandboxGpuConfigErrors immediately; CDI-spec existence and preflight validation are delegated to sandbox-gpu-preflight.ts. Resume preflight mirrors this pattern.
Gateway GPU reconciliation with Jetson blocking
src/lib/onboard/gateway-gpu-passthrough.ts, src/lib/onboard/gateway-gpu-passthrough.test.ts, src/lib/onboard.ts
reconcileGatewayGpuReuseForGpuIntent accepts `hostGpuPlatform?: NvidiaPlatform

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#3670: Related gateway GPU reuse/reconciliation updates that this PR extends with host platform handling.
  • NVIDIA/NemoClaw#3618: Related Jetson GPU passthrough behavior and sandbox config adjustments.
  • NVIDIA/NemoClaw#3181: Related preflight changes around optedOutGpuPassthrough and onboarding messaging.

Suggested labels

NemoClaw CLI, fix, Platform: ARM64, v0.0.47

Suggested reviewers

  • ericksoa
  • cv

Poem

🐇 I hopped through the code with a curious zap,
Jetson said "stop" — so I drew up a map.
Mode set to zero, recovery lines neat,
Messages guide users to a safer seat.
Hooray for clarity — I munch on a carrot and nap.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The pull request title directly describes the main change: rejecting Jetson sandbox GPU passthrough. This aligns with the primary objective of ensuring Jetson hosts remain on CPU sandbox mode and preventing forced GPU passthrough.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/lib/onboard.ts (1)

1210-1232: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Collapse the duplicated sandbox-GPU preflight path.

config.errors is now validated in three places across the same flow, and the resolve/opt-out logic is duplicated between first-run and --resume. That already pushed src/lib/onboard.ts over the onboard-entrypoint budget, and it makes future Jetson behavior easier to drift.

♻️ One way to centralize it
 function validateSandboxGpuPreflight(config: SandboxGpuConfig): void {
-  validateSandboxGpuConfigErrors(config);
   if (!config.sandboxGpuEnabled) return;
   if (process.platform !== "linux") return;
   ...
 }
 
+function resolveSandboxGpuPreflightState(
+  gpu: ReturnType<typeof nim.detectGpu>,
+  options: {
+    flag: SandboxGpuFlag;
+    device: string | null;
+    optedOutGpuPassthrough?: boolean;
+    noGpu?: boolean;
+    persistedGpuPassthrough?: boolean;
+    gpuRequested?: boolean;
+  },
+) {
+  const sandboxGpuConfig = resolveSandboxGpuConfig(gpu, {
+    flag: options.flag,
+    device: options.device,
+  });
+  validateSandboxGpuConfigErrors(sandboxGpuConfig);
+  return {
+    sandboxGpuConfig,
+    optedOutGpuPassthrough:
+      options.optedOutGpuPassthrough === true ||
+      options.noGpu === true ||
+      (options.gpuRequested !== true && options.persistedGpuPassthrough === false) ||
+      !sandboxGpuConfig.sandboxGpuEnabled,
+  };
+}

Also applies to: 3311-3321, 9374-9392

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 1210 - 1232, Centralize sandbox GPU
preflight validation by removing duplicated config.errors checks and duplicate
resolve/opt-out logic: make validateSandboxGpuConfigErrors the single validator
invoked once at the onboarding entrypoint (instead of calling it inside
validateSandboxGpuPreflight and at the other duplicate sites), refactor
validateSandboxGpuConfigErrors to return an error list or throw instead of
calling process.exit directly, and update validateSandboxGpuPreflight (and the
other duplicated blocks around the same flow) to only perform platform/CDI
detection and remediation messaging using the validated config; ensure
references to validateSandboxGpuPreflight and validateSandboxGpuConfigErrors are
updated so no duplicate error-handling paths remain.
🧹 Nitpick comments (1)
src/lib/onboard.ts (1)

9511-9516: Run the core onboard E2Es for this Jetson gateway-reuse handoff.

This call now decides gateway reuse with Jetson platform context threaded in, so I'd still want the core onboard flows exercised before merge, especially cloud-e2e, sandbox-operations-e2e, rebuild-openclaw-e2e, and openshell-gateway-upgrade-e2e.

As per coding guidelines, src/lib/onboard.ts: This file contains core onboarding logic. Changes here affect the full sandbox creation and configuration flow. E2E test recommendation includes cloud-e2e, sandbox-operations-e2e, rebuild-openclaw-e2e, and openshell-gateway-upgrade-e2e.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 9511 - 9516, The change to gateway reuse
decision now passes Jetson context into reconcileGatewayGpuReuseForGpuIntent
(call sites referencing gatewayReuseState, gpuPassthrough, GATEWAY_NAME,
recordedSandboxName/requestedSandboxName and hostGpuPlatform), so before merging
run the core onboard end-to-end suites to validate the full sandbox flows:
cloud-e2e, sandbox-operations-e2e, rebuild-openclaw-e2e, and
openshell-gateway-upgrade-e2e; fix any failures exposed by those tests
(investigate reconcileGatewayGpuReuseForGpuIntent behavior with Jetson
gpu?.platform values and adjust logic or tests accordingly) and only merge once
all listed E2Es pass.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/lib/onboard.ts`:
- Around line 1210-1232: Centralize sandbox GPU preflight validation by removing
duplicated config.errors checks and duplicate resolve/opt-out logic: make
validateSandboxGpuConfigErrors the single validator invoked once at the
onboarding entrypoint (instead of calling it inside validateSandboxGpuPreflight
and at the other duplicate sites), refactor validateSandboxGpuConfigErrors to
return an error list or throw instead of calling process.exit directly, and
update validateSandboxGpuPreflight (and the other duplicated blocks around the
same flow) to only perform platform/CDI detection and remediation messaging
using the validated config; ensure references to validateSandboxGpuPreflight and
validateSandboxGpuConfigErrors are updated so no duplicate error-handling paths
remain.

---

Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 9511-9516: The change to gateway reuse decision now passes Jetson
context into reconcileGatewayGpuReuseForGpuIntent (call sites referencing
gatewayReuseState, gpuPassthrough, GATEWAY_NAME,
recordedSandboxName/requestedSandboxName and hostGpuPlatform), so before merging
run the core onboard end-to-end suites to validate the full sandbox flows:
cloud-e2e, sandbox-operations-e2e, rebuild-openclaw-e2e, and
openshell-gateway-upgrade-e2e; fix any failures exposed by those tests
(investigate reconcileGatewayGpuReuseForGpuIntent behavior with Jetson
gpu?.platform values and adjust logic or tests accordingly) and only merge once
all listed E2Es pass.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 92bed78d-1b8a-4b0c-8208-6fe808258d64

📥 Commits

Reviewing files that changed from the base of the PR and between dc63189 and d0a5a63.

📒 Files selected for processing (7)
  • src/lib/onboard.ts
  • src/lib/onboard/gateway-gpu-passthrough.test.ts
  • src/lib/onboard/gateway-gpu-passthrough.ts
  • src/lib/onboard/gpu-recovery.test.ts
  • src/lib/onboard/gpu-recovery.ts
  • src/lib/onboard/sandbox-gpu-mode.test.ts
  • src/lib/onboard/sandbox-gpu-mode.ts

@yimoj yimoj force-pushed the fix/3710-jetson-gpu-passthrough branch from d0a5a63 to ace53da Compare May 21, 2026 03:18
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/lib/onboard/gpu-recovery.ts (1)

110-116: ⚡ Quick win

Skip registry lookup when Jetson is already known unsupported.

reportGpuPassthroughRecovery still evaluates loadNames() even when unsupportedPlatform === "jetson", but that branch never uses names. Short-circuiting avoids unnecessary registry IO and keeps the early-reject path tighter.

Proposed refactor
 export function reportGpuPassthroughRecovery(
   emit: (line: string) => void,
   loadNames: () => string[] = getRegisteredSandboxNamesForGpuRecovery,
   options: GpuPassthroughRecoveryOptions = {},
 ): void {
-  for (const line of gpuPassthroughRecoveryLines(loadNames(), options)) emit(line);
+  const names =
+    options.unsupportedPlatform === "jetson" ? null : loadNames();
+  for (const line of gpuPassthroughRecoveryLines(names, options)) emit(line);
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard/gpu-recovery.ts` around lines 110 - 116, Short-circuit the
registry lookup in reportGpuPassthroughRecovery: if options.unsupportedPlatform
=== "jetson" do not call loadNames() (which triggers registry IO); instead call
gpuPassthroughRecoveryLines with an empty names array (or otherwise iterate its
output) and emit those lines, otherwise call loadNames() as today. Update
reportGpuPassthroughRecovery to conditionally obtain names only when
unsupportedPlatform is not "jetson" (referencing reportGpuPassthroughRecovery,
loadNames, getRegisteredSandboxNamesForGpuRecovery, gpuPassthroughRecoveryLines,
and GpuPassthroughRecoveryOptions).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/lib/onboard/gpu-recovery.ts`:
- Around line 110-116: Short-circuit the registry lookup in
reportGpuPassthroughRecovery: if options.unsupportedPlatform === "jetson" do not
call loadNames() (which triggers registry IO); instead call
gpuPassthroughRecoveryLines with an empty names array (or otherwise iterate its
output) and emit those lines, otherwise call loadNames() as today. Update
reportGpuPassthroughRecovery to conditionally obtain names only when
unsupportedPlatform is not "jetson" (referencing reportGpuPassthroughRecovery,
loadNames, getRegisteredSandboxNamesForGpuRecovery, gpuPassthroughRecoveryLines,
and GpuPassthroughRecoveryOptions).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1034411c-0705-4748-911e-536e862853b2

📥 Commits

Reviewing files that changed from the base of the PR and between d0a5a63 and ace53da.

📒 Files selected for processing (8)
  • src/lib/onboard.ts
  • src/lib/onboard/gateway-gpu-passthrough.test.ts
  • src/lib/onboard/gateway-gpu-passthrough.ts
  • src/lib/onboard/gpu-recovery.test.ts
  • src/lib/onboard/gpu-recovery.ts
  • src/lib/onboard/sandbox-gpu-mode.test.ts
  • src/lib/onboard/sandbox-gpu-mode.ts
  • src/lib/onboard/sandbox-gpu-preflight.ts

@yimoj yimoj force-pushed the fix/3710-jetson-gpu-passthrough branch from ace53da to 361578c Compare May 21, 2026 03:24
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (3)
src/lib/onboard/sandbox-gpu-mode.ts (1)

48-49: 💤 Low value

Consider clarifying whether Jetson GPU passthrough is permanently unsupported.

The comment states that GPU sandbox passthrough "does not currently work on Jetson," which might suggest this is a temporary limitation. Based on the error messages JETSON_SANDBOX_GPU_UNSUPPORTED_MESSAGE and the PR objectives, this appears to be an architectural constraint rather than a temporary gap. If Jetson GPU passthrough is permanently unsupported, consider removing "currently" to avoid setting incorrect expectations.

📝 Suggested clarification
-  // GPU sandbox passthrough does not currently work on Jetson; keep auto/default
-  // behavior on the CPU sandbox path unless the user explicitly forces GPU.
+  // GPU sandbox passthrough is not supported on Jetson/Tegra; keep auto/default
+  // behavior on the CPU sandbox path unless the user explicitly forces GPU.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard/sandbox-gpu-mode.ts` around lines 48 - 49, Update the comment
in sandbox-gpu-mode.ts to remove the word "currently" and state that Jetson GPU
sandbox passthrough is unsupported (architectural constraint) rather than a
temporary limitation; also ensure any related messaging/constants like
JETSON_SANDBOX_GPU_UNSUPPORTED_MESSAGE reflect this permanent status so the
comment and messages are consistent with the code's behavior.
src/lib/onboard/sandbox-gpu-mode.test.ts (1)

110-114: 💤 Low value

Consider checking both error messages for consistency.

The test for jetsonFlagEnable verifies the unsupported message but not the workaround message, while the jetsonExplicitEnable test above checks both. For consistency and completeness, consider asserting both JETSON_SANDBOX_GPU_UNSUPPORTED_MESSAGE and JETSON_SANDBOX_GPU_WORKAROUND_MESSAGE are present.

✅ Suggested assertion addition
 const jetsonFlagEnable = resolveSandboxGpuConfig(jetson, { flag: "enable", env: {} });
 expect(jetsonFlagEnable.mode).toBe("1");
 expect(jetsonFlagEnable.errors.join("\n")).toContain(
   JETSON_SANDBOX_GPU_UNSUPPORTED_MESSAGE,
 );
+expect(jetsonFlagEnable.errors.join("\n")).toContain(
+  JETSON_SANDBOX_GPU_WORKAROUND_MESSAGE,
+);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard/sandbox-gpu-mode.test.ts` around lines 110 - 114, The test
for jetsonFlagEnable should assert both JETSON_SANDBOX_GPU_UNSUPPORTED_MESSAGE
and JETSON_SANDBOX_GPU_WORKAROUND_MESSAGE are present in the errors like the
jetsonExplicitEnable test does; update the assertion for
resolveSandboxGpuConfig(...) stored in jetsonFlagEnable to check that
jetsonFlagEnable.errors.join("\n") contains both
JETSON_SANDBOX_GPU_UNSUPPORTED_MESSAGE and JETSON_SANDBOX_GPU_WORKAROUND_MESSAGE
so the two tests are consistent.
src/lib/onboard/gateway-gpu-passthrough.ts (1)

144-147: Run targeted onboarding E2E coverage for this early-exit path.

Given this changes control flow before gateway inspection/cleanup, I’d run at least cloud-e2e, sandbox-operations-e2e, and channels-stop-start-e2e from the recommended nightly matrix on this branch.

As per coding guidelines, “E2E test recommendation: cloud-e2e … sandbox-operations-e2e … channels-stop-start-e2e …”.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard/gateway-gpu-passthrough.ts` around lines 144 - 147, This
early-exit path (when gpuPassthrough && hostGpuPlatform === "jetson" in
gateway-gpu-passthrough.ts) bypasses gateway inspection/cleanup, so add targeted
E2E coverage: create/enable tests in the cloud-e2e, sandbox-operations-e2e, and
channels-stop-start-e2e suites that trigger gpuPassthrough=true and
hostGpuPlatform="jetson" (via env/fixture) and assert that
reportGpuPassthroughRecovery is invoked and the process exits as expected, and
also verify whether gateway inspection/cleanup behavior is executed or
intentionally skipped; ensure tests fail if the early exit accidentally skips
required cleanup.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/lib/onboard/gateway-gpu-passthrough.ts`:
- Around line 144-147: This early-exit path (when gpuPassthrough &&
hostGpuPlatform === "jetson" in gateway-gpu-passthrough.ts) bypasses gateway
inspection/cleanup, so add targeted E2E coverage: create/enable tests in the
cloud-e2e, sandbox-operations-e2e, and channels-stop-start-e2e suites that
trigger gpuPassthrough=true and hostGpuPlatform="jetson" (via env/fixture) and
assert that reportGpuPassthroughRecovery is invoked and the process exits as
expected, and also verify whether gateway inspection/cleanup behavior is
executed or intentionally skipped; ensure tests fail if the early exit
accidentally skips required cleanup.

In `@src/lib/onboard/sandbox-gpu-mode.test.ts`:
- Around line 110-114: The test for jetsonFlagEnable should assert both
JETSON_SANDBOX_GPU_UNSUPPORTED_MESSAGE and JETSON_SANDBOX_GPU_WORKAROUND_MESSAGE
are present in the errors like the jetsonExplicitEnable test does; update the
assertion for resolveSandboxGpuConfig(...) stored in jetsonFlagEnable to check
that jetsonFlagEnable.errors.join("\n") contains both
JETSON_SANDBOX_GPU_UNSUPPORTED_MESSAGE and JETSON_SANDBOX_GPU_WORKAROUND_MESSAGE
so the two tests are consistent.

In `@src/lib/onboard/sandbox-gpu-mode.ts`:
- Around line 48-49: Update the comment in sandbox-gpu-mode.ts to remove the
word "currently" and state that Jetson GPU sandbox passthrough is unsupported
(architectural constraint) rather than a temporary limitation; also ensure any
related messaging/constants like JETSON_SANDBOX_GPU_UNSUPPORTED_MESSAGE reflect
this permanent status so the comment and messages are consistent with the code's
behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: acaca63e-a1c5-4033-bd27-c5ee19a4bdd1

📥 Commits

Reviewing files that changed from the base of the PR and between ace53da and 361578c.

📒 Files selected for processing (8)
  • src/lib/onboard.ts
  • src/lib/onboard/gateway-gpu-passthrough.test.ts
  • src/lib/onboard/gateway-gpu-passthrough.ts
  • src/lib/onboard/gpu-recovery.test.ts
  • src/lib/onboard/gpu-recovery.ts
  • src/lib/onboard/sandbox-gpu-mode.test.ts
  • src/lib/onboard/sandbox-gpu-mode.ts
  • src/lib/onboard/sandbox-gpu-preflight.ts

Jetson and Tegra hosts cannot use the OpenShell sandbox GPU passthrough path, so keep auto/default mode on CPU and fail forced GPU requests before gateway or sandbox creation. Recovery hints now direct Jetson users to CPU mode instead of destroy/recreate loops.

Fixes NVIDIA#3710

Fixes NVIDIA#3473

Signed-off-by: Yimo Jiang <yimoj@nvidia.com>
@yimoj yimoj force-pushed the fix/3710-jetson-gpu-passthrough branch from 361578c to 480d048 Compare May 21, 2026 03:32
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/lib/onboard.ts (1)

3283-3292: Run the recommended onboard E2Es for this path.

This touches core preflight and gateway-reuse behavior in src/lib/onboard.ts, so I'd run the targeted onboard E2Es before merge—at least cloud-e2e, sandbox-operations-e2e, and openshell-gateway-upgrade-e2e.

As per coding guidelines "src/lib/onboard.ts: This file contains core onboarding logic. Changes here affect the full sandbox creation and configuration flow."

Also applies to: 9346-9364, 9483-9497

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/lib/onboard.ts` around lines 3283 - 3292, This change touches core
onboarding/preflight logic (functions nim.detectGpu, resolveSandboxGpuConfig,
exitOnSandboxGpuConfigErrors and the optedOutGpuPassthrough condition in
src/lib/onboard.ts) so before merging run the targeted onboard end-to-end
suites: cloud-e2e, sandbox-operations-e2e, and openshell-gateway-upgrade-e2e
(also verify the other modified regions around the file referenced in the
review). Fix any failures uncovered by these E2Es (adjust config handling,
flags, or gating logic in the referenced functions) and re-run until all three
suites pass.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/lib/onboard.ts`:
- Around line 3283-3292: This change touches core onboarding/preflight logic
(functions nim.detectGpu, resolveSandboxGpuConfig, exitOnSandboxGpuConfigErrors
and the optedOutGpuPassthrough condition in src/lib/onboard.ts) so before
merging run the targeted onboard end-to-end suites: cloud-e2e,
sandbox-operations-e2e, and openshell-gateway-upgrade-e2e (also verify the other
modified regions around the file referenced in the review). Fix any failures
uncovered by these E2Es (adjust config handling, flags, or gating logic in the
referenced functions) and re-run until all three suites pass.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f50e3734-0711-474c-b14a-f330b0c1f23e

📥 Commits

Reviewing files that changed from the base of the PR and between 361578c and 480d048.

📒 Files selected for processing (8)
  • src/lib/onboard.ts
  • src/lib/onboard/gateway-gpu-passthrough.test.ts
  • src/lib/onboard/gateway-gpu-passthrough.ts
  • src/lib/onboard/gpu-recovery.test.ts
  • src/lib/onboard/gpu-recovery.ts
  • src/lib/onboard/sandbox-gpu-mode.test.ts
  • src/lib/onboard/sandbox-gpu-mode.ts
  • src/lib/onboard/sandbox-gpu-preflight.ts

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

v0.0.49 Release target

Projects

None yet

2 participants