Skip to content

Revert "fix(snapshot): use gateway metadata for VM-driver health checks (#3784)"#3929

Closed
jyaunches wants to merge 1 commit into
mainfrom
revert/pr-3784-snapshot-gateway-metadata
Closed

Revert "fix(snapshot): use gateway metadata for VM-driver health checks (#3784)"#3929
jyaunches wants to merge 1 commit into
mainfrom
revert/pr-3784-snapshot-gateway-metadata

Conversation

@jyaunches
Copy link
Copy Markdown
Contributor

@jyaunches jyaunches commented May 20, 2026

Summary

Verification

  • git revert --signoff --no-edit 36491d2fdab5f575c0e4b1a799346247a6742242
  • git diff --check HEAD^ HEAD

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced gateway status detection for Docker-based sandboxes, improving snapshot operation reliability
  • Tests

    • Strengthened regression tests for snapshot operations with unavailable gateways

Review Change Stack

…ks (#3784)"

This reverts commit 36491d2.

Signed-off-by: Julie Yaunches <jyaunches@nvidia.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 20, 2026

📝 Walkthrough

Walkthrough

This PR updates sandbox snapshot gateway detection for Docker-based sandboxes. The code replaces metadata-based health checks with a Docker-specific probe that parses openshell status output for a "Connected" signal, and simplifies the test helper to inline stopped gateway environment setup for regression testing.

Changes

Docker gateway probe and test updates

Layer / File(s) Summary
Docker gateway probe and imports
src/lib/actions/sandbox/snapshot.ts
Adds probeDockerDriverGatewayRunning() to check openshell status output for "Status: Connected" after ANSI stripping, and routes probeGatewayRunning through this Docker-specific path when the sandbox uses the docker driver. Removes isGatewayHealthy import and reorganizes state registry imports to support the updated probing logic.
Stopped gateway test environment
test/snapshot-gateway-guard.test.ts
Rewrites makeStoppedGatewayEnv to inline all temp HOME setup, sandbox registry JSON creation, and fake openshell/docker shell scripts with stale sandbox list and docker inspect returning false, replacing prior helper function dependencies for #2673 regression coverage.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~18 minutes

Possibly related PRs

  • NVIDIA/NemoClaw#3784: Both PRs change src/lib/actions/sandbox/snapshot.ts gateway-running probing logic and adjust test/snapshot-gateway-guard.test.ts around stopped/healthy gateway guard behavior.

Suggested labels

fix, NemoClaw CLI, OpenShell

Poem

🐰 A snapshot once checked gateways by reputation,
But Docker demands direct observation—
Status Connected, the rabbit espies,
With ANSI stripped and tests by our side. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: reverting a specific previous commit about snapshot gateway metadata health checks.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch revert/pr-3784-snapshot-gateway-metadata

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint skipped: no ESLint configuration detected in root package.json. To enable, add eslint to devDependencies.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

E2E Advisor Recommendation

Required E2E: snapshot-commands-e2e
Optional E2E: sandbox-operations-e2e, sandbox-survival-e2e

Dispatch hint: snapshot-commands-e2e

Auto-dispatched E2E: snapshot-commands-e2e via nightly-e2e.yaml at 74db9b7532bb9de910eb08884789f2f0c1fd7f6cnightly run

Workflow run

Full advisor summary

E2E Recommendation Advisor

Base: origin/main
Head: HEAD
Confidence: high

Required E2E

  • snapshot-commands-e2e (medium (~30 min, live sandbox, NVIDIA_API_KEY)): Required because the changed command implementation is the snapshot create/restore gate itself. This existing E2E validates install, live sandbox snapshot create/list/restore, targeted restore, and restored workspace state against a real OpenShell sandbox.

Optional E2E

  • sandbox-operations-e2e (high (~60 min, live sandbox)): Useful adjacent confidence for gateway/sandbox lifecycle behavior because the change alters how snapshot code decides whether the gateway is really running; this E2E exercises gateway kill/recovery, registry rebuild, sandbox status/list/logs, and multi-sandbox isolation.
  • sandbox-survival-e2e (medium (~30 min, live sandbox)): Optional lower-cost gateway restart recovery check covering onboard, inference, gateway stop/start, and sandbox survival; relevant to gateway liveness semantics but less directly tied to snapshot commands than snapshot-commands-e2e.

New E2E recommendations

  • snapshot gateway stopped negative path (high): Existing snapshot-commands-e2e covers the happy-path snapshot lifecycle, but this PR specifically fixes/changes rejection when OpenShell reports stale sandbox state while the gateway is stopped. Add a real E2E phase that stops/kills the gateway container after onboarding and asserts snapshot create and snapshot restore fail with the intended guard message rather than trusting stale openshell sandbox list.
    • Suggested test: Extend test/e2e/test-snapshot-commands.sh or add a dedicated snapshot gateway guard E2E that stops the gateway and validates snapshot create/restore rejection.
  • VM-driver snapshot compatibility (medium): The VM-driver positive unit test was removed, and current Linux Docker snapshot E2E will not prove that macOS/VM-driver gateways without the legacy openshell-cluster-nemoclaw container can still create snapshots. If VM-driver snapshots are supported, add platform coverage or a scenario that exercises snapshot create on that driver.
    • Suggested test: Add a VM-driver/macOS snapshot create E2E or scenario coverage for healthy non-legacy-cluster gateways.

Dispatch hint

  • Workflow: nightly-e2e.yaml
  • jobs input: snapshot-commands-e2e

@github-actions
Copy link
Copy Markdown
Contributor

PR Review Advisor

Recommendation: blocked
Confidence: high
Analyzed HEAD: 74db9b7532bb9de910eb08884789f2f0c1fd7f6c
Findings: 4 blocker(s), 2 warning(s), 0 suggestion(s)

This is an automated advisory review. A human maintainer must make the final merge decision.

Limitations: Review used trusted PR metadata, status context data, linked issue/comment text, and the provided diff; no PR scripts, tests, package-manager commands, or workflows were executed.; Current CI and E2E jobs were still pending at the time of review.; No E2E Advisor recommendation comment was available in the provided context.; CodeRabbit review for the current PR was still pending, so final automated review findings may not yet be reflected.; The current PR is a revert; human maintainers should confirm the reason for reverting #3784 and whether reopening the #3567 behavior is intended.

Workflow run

Full advisor summary

PR Review Advisor

Base: origin/main
Head: HEAD
Analyzed SHA: 74db9b7532bb9de910eb08884789f2f0c1fd7f6c
Recommendation: blocked
Confidence: high

This revert is currently blocked by pending CI/mergeability and appears to reintroduce the known macOS VM-driver snapshot failure by removing the VM gateway metadata probe and its regression test.

Gate status

  • CI: pending — Status rollup for head SHA 74db9b7 shows multiple incomplete contexts, including E2E recommendation, wsl-e2e, macos-e2e, CodeQL, unit-vitest-linux, build-sandbox-images, build-sandbox-images-arm64, checks, ShellCheck SARIF, and CodeRabbit.
  • Mergeability: fail — GitHub GraphQL reports mergeStateStatus=BLOCKED for PR Revert "fix(snapshot): use gateway metadata for VM-driver health checks (#3784)" #3929.
  • Review threads: warning — GraphQL reviewThreads.nodes is empty, but CodeRabbit's current PR comment says it is still processing new changes and the CodeRabbit status context is PENDING.
  • Risky code tested: fail — The changed production file is a sandbox snapshot runtime path. The PR deletes the VM-driver snapshot regression test, and required E2E jobs have not passed for the current head SHA.

🔴 Blockers

  • Revert appears to reintroduce the VM-driver snapshot liveness bug (src/lib/actions/sandbox/snapshot.ts:217): The current implementation only uses the OpenShell status-based probe when registry entry openshellDriver is "docker". VM-driver sandboxes now fall through to the legacy openshell-cluster-nemoclaw Docker container inspection path, which is the behavior PR fix(snapshot): use gateway metadata for VM-driver health checks #3784 said failed on macOS Apple Silicon VM-driver sandboxes without the legacy cluster container.
    • Recommendation: Do not merge until the intended behavior is clarified. If VM-driver snapshot creation should remain supported, preserve a VM-specific metadata health path or provide a replacement fix with regression coverage. If the revert is intentional despite the regression, document the reason and the accepted user impact.
    • Evidence: probeGatewayRunning checks if (entry?.openshellDriver === "docker") and otherwise inspects openshell-cluster-nemoclaw; the reverted diff removed usesGatewayMetadataProbe(driver) => driver === "docker" || driver === "vm" and isGatewayHealthy(...).
  • VM-driver regression test was removed with no replacement (test/snapshot-gateway-guard.test.ts:104): The PR deletes the test that simulated a healthy VM-driver gateway with no legacy cluster container and asserted snapshot create --name baseline succeeds. That was the direct regression coverage for the behavior being reverted.
    • Recommendation: Restore or replace the VM-driver snapshot test if VM-driver support remains expected. At minimum, add a negative/compatibility test that proves the reverted behavior is intentionally accepted.
    • Evidence: The diff removes makeHealthyVmGatewayEnv(...) and the snapshot VM-driver gateway guard suite, including snapshot create accepts healthy macOS VM-driver gateways without legacy cluster container.
  • Required sandbox/runtime E2E coverage has not passed for the head SHA: Snapshot gateway liveness is runtime and sandbox lifecycle behavior that unit mocks cannot fully prove, especially across macOS VM-driver and WSL/Docker paths. The relevant E2E checks are still in progress or unavailable for the current head SHA.
    • Recommendation: Wait for the E2E recommendation, macos-e2e, and wsl-e2e jobs to complete successfully for 74db9b7, and verify the E2E Advisor recommendation before merge.
    • Evidence: Status rollup shows E2E recommendation=IN_PROGRESS, macos-e2e=IN_PROGRESS, and wsl-e2e=IN_PROGRESS. No E2E Advisor comments were found.
  • Merge is blocked with pending required checks: The PR is not currently mergeable and has pending CI contexts, so it fails hard gate checks independent of code review findings.
    • Recommendation: Wait for all required status checks to finish and resolve the blocked merge state before considering merge.
    • Evidence: GitHub reports mergeStateStatus=BLOCKED; 13 status contexts appear pending in the trusted context.

🟡 Warnings

🔵 Suggestions

  • None.

Acceptance coverage

Security review

  • pass — 1. Secrets and Credentials: No hardcoded secrets, API keys, passwords, tokens, PEM files, or credential material are introduced in the changed production or test files.
  • pass — 2. Input Validation and Data Sanitization: No new user-controlled parsing surface is added. Existing sandbox name handling and command argument construction are not expanded by this revert.
  • pass — 3. Authentication and Authorization: No endpoints, authentication checks, authorization logic, token validation, or privilege boundaries are added or modified.
  • pass — 4. Dependencies and Third-Party Libraries: No new dependencies, package manifests, registries, or version pins are changed.
  • warning — 5. Error Handling and Logging: The reverted VM-driver behavior can emit the generic Failed to query live sandbox state from OpenShell error for healthy VM-driver gateways, which is not a secret leak but can obscure diagnosis and recovery.
  • pass — 6. Cryptography and Data Protection: Not applicable — no cryptographic operations, algorithms, key handling, or data protection mechanisms are changed.
  • pass — 7. Configuration and Security Headers: No HTTP endpoints, CORS/CSP settings, Dockerfiles, container privileges, exposed ports, or security configuration defaults are changed.
  • warning — 8. Security Testing: The PR preserves stopped legacy gateway guard tests but deletes VM-driver runtime regression coverage. For sandbox lifecycle code, the missing macOS/VM path coverage is a security-posture concern because health-check regressions can affect fail-closed/fail-open assumptions.
  • warning — 9. Holistic Security Posture: The change appears to fail closed for VM-driver snapshot creation rather than introduce an obvious sandbox escape, SSRF bypass, credential leak, or policy bypass. However, it degrades sandbox lifecycle reliability and removes coverage in a high-risk area, so it should not be treated as clean without E2E validation.

Test / E2E status

  • Test depth: e2e_required — This PR modifies src/lib/actions/sandbox/snapshot.ts, a sandbox runtime/lifecycle path. Unit tests with mocked openshell and docker are useful but cannot prove macOS VM-driver, Docker-driver, WSL, gateway metadata, and legacy cluster behavior across real environments. The PR also deletes the specific VM-driver regression test.
  • E2E Advisor: not_found (not found)
  • Required E2E jobs: E2E recommendation, macos-e2e, wsl-e2e
  • Missing for analyzed SHA: E2E recommendation, macos-e2e, wsl-e2e

✅ What looks good

Review completeness

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
test/snapshot-gateway-guard.test.ts (1)

65-67: ⚡ Quick win

Exercise the docker-driver gateway probe path in this fixture.

Line 65-Line 67 currently omits openshellDriver: "docker", so this setup can validate the fallback inspect path instead of the new docker-specific probe branch in snapshot.ts.

Suggested fixture update
       sandboxes: {
         alpha: {
           name: "alpha",
+          openshellDriver: "docker",
           model: "test-model",
           provider: "nvidia-prod",
           gpuEnabled: false,
           policies: [],
         },
@@
     [
       "#!/bin/sh",
+      'if [ "$1" = "status" ]; then',
+      '  printf "Status: Disconnected\\n"',
+      "  exit 0",
+      "fi",
       'if [ "$1" = "sandbox" ] && [ "$2" = "list" ]; then',
       '  printf "NAME STATUS\\nalpha Ready\\n"',
       "  exit 0",
       "fi",
       "exit 0",

Also applies to: 83-89

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/snapshot-gateway-guard.test.ts` around lines 65 - 67, Update the test
fixture objects used by the snapshot gateway guard so they exercise the
docker-specific probe branch: add openshellDriver: "docker" to the "alpha"
fixture object (the one with name: "alpha", model: "test-model") referenced in
this test and likewise add openshellDriver: "docker" to the other fixture block
around lines 83-89; this ensures snapshot.ts's docker-driver gateway probe path
is executed instead of the fallback inspect path when running the test.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@test/snapshot-gateway-guard.test.ts`:
- Around line 65-67: Update the test fixture objects used by the snapshot
gateway guard so they exercise the docker-specific probe branch: add
openshellDriver: "docker" to the "alpha" fixture object (the one with name:
"alpha", model: "test-model") referenced in this test and likewise add
openshellDriver: "docker" to the other fixture block around lines 83-89; this
ensures snapshot.ts's docker-driver gateway probe path is executed instead of
the fallback inspect path when running the test.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b086b367-db9a-48df-a636-8b829f9a1228

📥 Commits

Reviewing files that changed from the base of the PR and between 95cc0de and 74db9b7.

📒 Files selected for processing (2)
  • src/lib/actions/sandbox/snapshot.ts
  • test/snapshot-gateway-guard.test.ts

@github-actions
Copy link
Copy Markdown
Contributor

Selective E2E Results — ❌ Some jobs failed

Run: 26189690588
Target ref: 74db9b7532bb9de910eb08884789f2f0c1fd7f6c
Workflow ref: main
Requested jobs: snapshot-commands-e2e
Summary: 0 passed, 1 failed, 0 skipped

Job Result
snapshot-commands-e2e ❌ failure

Failed jobs: snapshot-commands-e2e. Check run artifacts for logs.

@ericksoa ericksoa closed this May 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants