fix(channels): teardown session preset + sandbox auth files on remove#4001
fix(channels): teardown session preset + sandbox auth files on remove#4001laitingsheng wants to merge 3 commits into
Conversation
Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
📝 WalkthroughWalkthroughAdds in-sandbox durable per-channel cleanup (exec + SSH fallback), removes channel names from ChangesChannel Removal Full Teardown
Sequence Diagram(s)sequenceDiagram
participant CLI as NemoClaw CLI
participant RemoveFlow as removeSandboxChannel
participant Exec as openshell sandbox exec
participant SSH as openshell sandbox ssh
participant Registry as Sandboxes Registry
participant Session as onboardSession
participant Rebuilder as promptAndRebuild
CLI->>RemoveFlow: request channels remove <channel>
RemoveFlow->>Exec: run "rm -rf <agent-session-dir> && echo 'Cleared in-sandbox'"
alt exec success
Exec-->>RemoveFlow: success sentinel
else exec null / unreachable
RemoveFlow->>SSH: perform SSH rm -rf attempt
SSH-->>RemoveFlow: success / failure
end
alt success (QR)
RemoveFlow->>Registry: remove channel from messagingChannels
Registry-->>RemoveFlow: ack
RemoveFlow->>Session: drop channel from session.policyPresets
Session-->>RemoveFlow: ack
RemoveFlow->>Rebuilder: queue promptAndRebuild
else failure (QR)
RemoveFlow-->>CLI: abort with non-zero exit (do not rebuild)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Comment |
E2E Advisor RecommendationRequired E2E: Dispatch hint: Auto-dispatched E2E: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
PR Review AdvisorRecommendation: blocked This is an automated advisory review. A human maintainer must make the final merge decision. Limitations: Review was read-only; no tests, package-manager commands, sandbox commands, PR scripts, or E2E jobs were executed by this advisor.; Analysis relied on the provided deterministic GitHub/CI context and the supplied diff, with targeted read-only inspection of changed files.; CI and E2E results were still pending for the specified head SHA 9ce04c1.; The E2E Advisor comment found in PR discussion auto-dispatched required jobs for d371b33; no required E2E pass evidence was available for the current head SHA.; A selective E2E result for prior SHA d371b33 reported channels-stop-start-e2e failure, but that result is not for the requested head SHA 9ce04c1.; No live WhatsApp pairing/removal/rebuild/DM evidence was available to verify the final user-visible acceptance condition.; Open PR overlap data indicates docs/reference/commands.mdx is concurrently touched by multiple active PRs, so final drift should be rechecked before merge. Full advisor summaryPR Review AdvisorBase: The PR targets the live channel-removal bug path and adds useful teardown tests/docs, but merge is blocked by pending CI/blocked merge state, missing required E2E evidence for head 9ce04c1, and the repository monolith-growth gate. Gate status
🔴 Blockers
🟡 Warnings
🔵 Suggestions
Acceptance coverage
Security review
Test / E2E status
✅ What looks good
Review completeness
|
There was a problem hiding this comment.
🧹 Nitpick comments (2)
test/channels-remove-full-teardown.test.ts (2)
221-261: ⚡ Quick winAlign Telegram fixture state and assert full channel unregister outcome.
This case only asserts
session.policyPresets. Because preamblesessionStore.messagingChannelsis still["whatsapp"], the test can miss regressions in channel list teardown for the removed Telegram channel.Proposed patch
const ctx = module.exports; +ctx.sessionStore.messagingChannels = ["telegram"]; const registryOverride = require(${JSON.stringify(path.join(repoRoot, "dist", "lib", "state/registry.js"))}); registryOverride.getSandbox = () => ({ name: "test-sb", agent: "openclaw", @@ await ctx.channelModule.removeSandboxChannel("test-sb", { channel: "telegram" }); process.stdout.write("\\n__RESULT__" + JSON.stringify({ sessionPolicyPresets: ctx.sessionStore.policyPresets, + sessionMessagingChannels: ctx.sessionStore.messagingChannels, }) + "\\n"); @@ assert.deepEqual( payload.sessionPolicyPresets, ["npm", "pypi", "brew"], "other presets must remain after removing a token-based channel", ); + assert.deepEqual( + payload.sessionMessagingChannels, + [], + "messagingChannels should be empty after removing the only channel", + );🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/channels-remove-full-teardown.test.ts` around lines 221 - 261, The test only asserts session.policyPresets after calling ctx.channelModule.removeSandboxChannel("test-sb", { channel: "telegram" }) but does not align or assert the sandbox/session messaging channel list; ensure the Telegram fixture state matches the preamble by setting sessionStore.messagingChannels (or ctx.sessionStore.messagingChannels) to include "telegram" before invoking removeSandboxChannel and add assertions that the session's messagingChannels no longer include "telegram" (and that other channels remain unchanged) in addition to the existing session.policyPresets checks so the test fails on regressions that leave a removed channel registered.
27-32: ⚡ Quick winSanitize inherited messaging env vars before spawning the helper script.
Forwarding full
process.envhere can leak CI/local messaging credentials into the spawned run and make channel-removal assertions flaky. Strip messaging-related keys first, then applyextraEnvfor test-specific overrides.Proposed patch
function runScript(scriptBody: string, extraEnv: Record<string, string> = {}): SpawnSyncReturns<string> { const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), "nemoclaw-3998-")); const scriptPath = path.join(tmpDir, "script.js"); fs.writeFileSync(scriptPath, scriptBody); + const sanitizedEnv: NodeJS.ProcessEnv = { ...process.env }; + for (const key of Object.keys(sanitizedEnv)) { + if (/^(DISCORD_|TELEGRAM_)/.test(key)) delete sanitizedEnv[key]; + } const result = spawnSync(process.execPath, [scriptPath], { cwd: repoRoot, encoding: "utf-8", env: { - ...process.env, + ...sanitizedEnv, HOME: tmpDir, NEMOCLAW_NON_INTERACTIVE: "1", ...extraEnv, }, timeout: 15000,Based on learnings: for hermetic messaging-channel spawned tests in this repo, remove inherited unrelated messaging env vars such as
DISCORD_*andTELEGRAM_*.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/channels-remove-full-teardown.test.ts` around lines 27 - 32, The test currently spreads the entire process.env into the spawned helper (env: { ...process.env, HOME: tmpDir, NEMOCLAW_NON_INTERACTIVE: "1", ...extraEnv }) which can leak CI/local messaging credentials; instead create a sanitized env object by cloning process.env, delete or filter out messaging-related keys (e.g. keys matching prefixes like "DISCORD_" and "TELEGRAM_"), then use that sanitizedEnv when composing the final env (set HOME: tmpDir and NEMOCLAW_NON_INTERACTIVE plus ...extraEnv) so messaging vars are stripped before the spawn.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@test/channels-remove-full-teardown.test.ts`:
- Around line 221-261: The test only asserts session.policyPresets after calling
ctx.channelModule.removeSandboxChannel("test-sb", { channel: "telegram" }) but
does not align or assert the sandbox/session messaging channel list; ensure the
Telegram fixture state matches the preamble by setting
sessionStore.messagingChannels (or ctx.sessionStore.messagingChannels) to
include "telegram" before invoking removeSandboxChannel and add assertions that
the session's messagingChannels no longer include "telegram" (and that other
channels remain unchanged) in addition to the existing session.policyPresets
checks so the test fails on regressions that leave a removed channel registered.
- Around line 27-32: The test currently spreads the entire process.env into the
spawned helper (env: { ...process.env, HOME: tmpDir, NEMOCLAW_NON_INTERACTIVE:
"1", ...extraEnv }) which can leak CI/local messaging credentials; instead
create a sanitized env object by cloning process.env, delete or filter out
messaging-related keys (e.g. keys matching prefixes like "DISCORD_" and
"TELEGRAM_"), then use that sanitizedEnv when composing the final env (set HOME:
tmpDir and NEMOCLAW_NON_INTERACTIVE plus ...extraEnv) so messaging vars are
stripped before the spawn.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d72ec710-b09d-4414-9b09-6fc8dcaa95af
📒 Files selected for processing (2)
src/lib/actions/sandbox/policy-channel.tstest/channels-remove-full-teardown.test.ts
…solation Address review on #4001: - Refuse to proceed to rebuild when the in-sandbox cleanup for a QR-paired channel fails. Without this guard, a failed `rm -rf` of the auth blob would still queue a rebuild, the state_dirs backup would re-capture the Baileys session, and #3998 would recur. Move the cleanup ahead of the registry/policy/session mutations so an early bail leaves the sandbox in a state where a re-run can complete the removal cleanly. Token-based channels keep the prior best-effort cleanup since token revocation already breaks the bot. - Add a failure-path test asserting that QR-channel cleanup failure exits non-zero, does NOT queue a rebuild, and leaves the registry, policy preset, and session.policyPresets unchanged. - Stop the test subprocess from inheriting messaging-channel env vars (`TELEGRAM_*`, `DISCORD_*`, `SLACK_*`, `WECHAT_*`, `WEIXIN_*`, `WHATSAPP_*`). A local/CI shell with one of these set would otherwise perturb the channel path taken inside the spawned test. - Fix the telegram fixture: registry now reports the channel under test (`telegram`, not stale `whatsapp`) and the test asserts that `messagingChannels` is emptied via `updateSandbox` alongside the session.policyPresets strip. - Document the new behaviour in `docs/reference/commands.mdx` and `docs/manage-sandboxes/messaging-channels.mdx` — the destructive in-sandbox cleanup path per agent, and the early-bail contract for QR-paired channels when the sandbox is unreachable. Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
|
🌿 Preview your docs: https://nvidia-preview-pr-4001.docs.buildwithfern.com/nemoclaw |
Selective E2E Results — ✅ All requested jobs passedRun: 26228799819
|
Selective E2E Results — ❌ Some jobs failedRun: 26231106795
|
Nightly E2E (test/e2e/test-channels-stop-start.sh) regressed on the previous commit because `executeSandboxExecCommand` can return null on a transient openshell wrapper hiccup (banner/marker race) even when the underlying sandbox is reachable. Treating that null as "cleanup failed" aborted `channels remove whatsapp` on every live-sandbox run. Wrap rm with a NEMOCLAW_CHANNEL_CLEAR_OK sentinel and check stdout for that exact string instead of trusting the wrapper's status. Try `openshell sandbox exec` first and fall back to `executeSandboxCommand` (SSH) if the exec path does not surface the sentinel — mirrors the existing pattern in process-recovery.ts:286-296 for gateway recovery. Both paths must fail before we abort the remove. Add a regression test covering the SSH-fallback recovery path; rename and stiffen the abort test to require BOTH transports to fail. Signed-off-by: Tinson Lai <tinsonl@nvidia.com>
There was a problem hiding this comment.
🧹 Nitpick comments (2)
docs/manage-sandboxes/messaging-channels.mdx (1)
163-163: ⚡ Quick winSplit into separate sentences and use active voice.
This line has two style issues:
One sentence per line: The semicolon connects two independent clauses that should be on separate lines to improve diff readability. As per coding guidelines, flag when multiple sentences appear on the same line.
Passive voice: "stay untouched" is passive. Rewrite using active voice, for example: "the command leaves the registry, policy preset, and
session.policyPresetsunchanged" or "the command does not modify the registry, policy preset, orsession.policyPresets".Proposed rewrite
Split the semicolon-joined clauses and convert to active voice:
-If neither transport can reach a running sandbox for a QR-paired channel, the command exits non-zero and asks you to start the sandbox and re-run; the registry, policy preset, and `session.policyPresets` stay untouched so a re-run can complete the removal cleanly. +If neither transport can reach a running sandbox for a QR-paired channel, the command exits non-zero and asks you to start the sandbox and re-run. +The command does not modify the registry, policy preset, or `session.policyPresets` so a re-run can complete the removal cleanly.As per coding guidelines: "One sentence per line in source (makes diffs readable)" and "Active voice required. Flag passive constructions."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/manage-sandboxes/messaging-channels.mdx` at line 163, Split the semicolon-joined clause into two separate sentences (one per line) and convert the passive phrase "stay untouched" to active voice; for example, change the sentence to: "If neither transport can reach a running sandbox for a QR-paired channel, the command exits non-zero and asks you to start the sandbox and re-run. The command leaves the registry, policy preset, and `session.policyPresets` unchanged so a re-run can complete the removal cleanly." Update the line containing `session.policyPresets` accordingly.test/channels-remove-full-teardown.test.ts (1)
310-327: ⚡ Quick winAssert SSH fallback was actually attempted in the dual-failure test.
This case says “both exec and SSH cleanup fail,” but it only verifies
sandboxExecCalls. A regression that skips SSH fallback would still pass.Proposed test tightening
process.stdout.write("\\n__RESULT__" + JSON.stringify({ sandboxExecCalls: ctx.sandboxExecCalls, + sandboxSshCalls: ctx.sandboxSshCalls, sessionPolicyPresets: ctx.sessionStore.policyPresets, removedPresets: ctx.removedPresets, registryUpdates: ctx.registryUpdates, callOrder: ctx.callOrder, exitCode: ctx.getExitCode(), @@ process.stdout.write("\\n__RESULT__" + JSON.stringify({ sandboxExecCalls: ctx.sandboxExecCalls, + sandboxSshCalls: ctx.sandboxSshCalls, sessionPolicyPresets: ctx.sessionStore.policyPresets, removedPresets: ctx.removedPresets, registryUpdates: ctx.registryUpdates, callOrder: ctx.callOrder, exitCode: ctx.getExitCode(), @@ const cleanupCalls = payload.sandboxExecCalls.filter((c: { command: string }) => c.command.startsWith("rm -rf"), ); assert.equal(cleanupCalls.length, 1, "expected the rm -rf attempt that failed"); + assert.equal( + payload.sandboxSshCalls.length, + 1, + "SSH fallback must also be attempted before aborting", + );Also applies to: 341-366
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/channels-remove-full-teardown.test.ts` around lines 310 - 327, The test currently only asserts sandbox exec attempts (ctx.sandboxExecCalls) for the dual-failure "both exec and SSH cleanup fail" scenario; add an assertion that the SSH fallback was invoked by checking the SSH-call tracking structure (e.g., ctx.sshExecCalls or whatever variable records SSH attempts) after the run and include the same check in the parallel block around lines 341-366 so the test fails if SSH fallback was skipped; locate the teardown/result printing blocks that reference ctx.sandboxExecCalls, ctx.removedPresets, ctx.registryUpdates, ctx.callOrder and augment them to also include/assert ctx.sshExecCalls (or the actual SSH-tracking symbol) to verify SSH was attempted.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@docs/manage-sandboxes/messaging-channels.mdx`:
- Line 163: Split the semicolon-joined clause into two separate sentences (one
per line) and convert the passive phrase "stay untouched" to active voice; for
example, change the sentence to: "If neither transport can reach a running
sandbox for a QR-paired channel, the command exits non-zero and asks you to
start the sandbox and re-run. The command leaves the registry, policy preset,
and `session.policyPresets` unchanged so a re-run can complete the removal
cleanly." Update the line containing `session.policyPresets` accordingly.
In `@test/channels-remove-full-teardown.test.ts`:
- Around line 310-327: The test currently only asserts sandbox exec attempts
(ctx.sandboxExecCalls) for the dual-failure "both exec and SSH cleanup fail"
scenario; add an assertion that the SSH fallback was invoked by checking the
SSH-call tracking structure (e.g., ctx.sshExecCalls or whatever variable records
SSH attempts) after the run and include the same check in the parallel block
around lines 341-366 so the test fails if SSH fallback was skipped; locate the
teardown/result printing blocks that reference ctx.sandboxExecCalls,
ctx.removedPresets, ctx.registryUpdates, ctx.callOrder and augment them to also
include/assert ctx.sshExecCalls (or the actual SSH-tracking symbol) to verify
SSH was attempted.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 58129e74-7856-47cd-8b70-1b5cc9f73262
📒 Files selected for processing (4)
docs/manage-sandboxes/messaging-channels.mdxdocs/reference/commands.mdxsrc/lib/actions/sandbox/policy-channel.tstest/channels-remove-full-teardown.test.ts
✅ Files skipped from review due to trivial changes (1)
- docs/reference/commands.mdx
Selective E2E Results — ✅ All requested jobs passedRun: 26236091097
|
Summary
nemoclaw <name> channels remove <channel>left two pieces of state behind, so the channel resurrected itself on the next rebuild —policy-listre-listed the channel's preset and the in-sandbox bot library (Baileys for WhatsApp) auto-reconnected from saved auth files. The operator sawchannels statusempty but DMs still got a reply.This change strips the channel from
session.policyPresetsso onboard --resume's preset reconciliation cannot re-apply the preset, andrm -rfs the channel's durable state inside the sandbox before queuing the rebuild so the state_dirs backup snapshots an empty dir.Related Issue
Fixes #3998
Changes
src/lib/actions/sandbox/policy-channel.ts:dropChannelFromSessionPolicyPresets()— strips the channel fromsession.policyPresetsso policy-selection.ts:184 does not target it on resume.clearSandboxChannelDurableState()— agent-aware sandbox-execrm -rfof the channel's durable state dir. Hermes (platforms/umbrella) gets${configDir}/platforms/<channel>/; OpenClaw (channel name declared as its own state_dir) gets${configDir}/<channel>/. Path is validated to live under/sandbox/.<x>/with no..segments before interpolation into the rm command.removeSandboxChannelafterremoveChannelPresetIfPresentand beforepromptAndRebuildso the rebuild's backup excludes the auth files.test/channels-remove-full-teardown.test.ts: regression test — covers OpenClaw + Hermes whatsapp removal (session.policyPresets stripped, sandboxrm -rfissued for the right path, ordering invariant: clear precedes rebuild) and a token-based channel (telegram) confirming the session-preset strip applies symmetrically.Type of Change
Verification
npx prek run --all-filespassesnpm testpassesmake docsbuilds without warnings (doc changes only)Signed-off-by: Tinson Lai tinsonl@nvidia.com
Summary by CodeRabbit
Bug Fixes
Tests
Documentation