Abandon Script#1226
Conversation
Code reviewNo issues found. Checked for bugs and CLAUDE.md compliance. I'm Claude, leaving this review on Jim's behalf. 🤖 Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎. |
jimmyp
left a comment
There was a problem hiding this comment.
Need to address these comments
cb9761e to
583eb46
Compare
583eb46 to
89077d9
Compare
|
Rebased on top of the new foundational PR #1236 (async migration of SilentProcessRunner). The diff is now focused on the abandon feature itself — the async-migration plumbing has moved to the base PR. Previous head was preserved at git tag claude-safety-2026-05-25-pre-split. |
89077d9 to
0c23f2e
Compare
…, lighthouse comments, async tests - Strip SilentProcessRunner.cs to signature-only async change (move polyfill, EnableRaisingEvents, and await WaitForExitAsync to #1226 abandon PR) - Apply lighthouse comment pattern uniformly to all sync-over-async sites including test scaffolding and the second/third sites in LinuxServiceConfigurator - Improve LinuxServiceConfigurator justification: name the sync interface chain (IServiceConfigurator -> AbstractCommand -> ICommand -> Topshelf) and the no-SynchronizationContext property of Main/Topshelf worker threads - Move SilentProcessRunnerFixture sync-over-async comment to actual call site - Convert LinuxConfigureServiceHelperFixture test methods to async Task, eliminating sync-over-async entirely in that file Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
0c23f2e to
e5ca633
Compare
AbandonScript was waiting up to 5 seconds inside the RPC handler for the running script to reach Complete state before returning the response. The server (Octopus) polls GetStatus afterwards anyway (same pattern as the cancel flow), so the synchronous wait on the Tentacle side just burns RPC time without adding behavior. Integration tests updated to assert against polled GetStatus instead of directly against the abandon response state. Addresses #1226 review: ScriptServiceV2.cs:153 ("Which callers. I dont think we want to do this"). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Explains why the abandoned-script path tolerates Delete() failure (process still alive, may hold open handles) while the normal completion path surfaces failures. Addresses #1226 review: ScriptServiceV2.cs:176 ('Add some comments'). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- RunningScript: differentiate abandoned (log "Script execution abandoned." + AbandonedExitCode) from cancelled (existing behavior). The abandon-specific catch checks abandonToken.IsCancellationRequested so the cancel path is unaffected. - SilentProcessRunner: extract SafelyCancelOutputAndErrorRead helper used by both the abandon catch and the normal completion path. Keeps the two paths consistent (a missed CancelXxxRead leaves async readers firing during dispose). - ClientAndTentacle.CreateScriptServiceV2Client: documented why this helper exists (test needs direct RPC access; TentacleClient deliberately doesn't expose AbandonScript today). - SilentProcessRunnerFixture: documented the 2s elapsed-time assertion (it's the contract: abandon returns promptly without waiting for the underlying process. Without the bound, the test would silently lose the entire contract). - CapabilitiesServiceV2Test: documented why version==null means "latest tentacle under test" (vs older S3-fetched builds for backwards-compat coverage). - ClientScriptExecutionAbandon: split into two focused tests (cancel-fails-then-abandon, and mutex-release). Deleted the multi-level-deep test: the abandon mechanism doesn't touch the script process tree, so process tree depth is irrelevant to the abandon path (Hitman with the env var short- circuits before any kill regardless of depth). - ScriptBuilder: removed AppendRaw (was only used by the deleted multi-level test). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
07bd272 to
4ecdfba
Compare
|
Rebased on top of #1236's Pattern 1 refactor. The abandon-feature commits now build on a cleaner async migration (single bridges instead of multiple in-method GetAwaiter calls). Previous head preserved at tag claude-safety-2026-05-26-pre-pattern1-rebase. |
a70c72a to
ace3984
Compare
|
|
||
| // Abandon first (resolves the wait to -48), then cancel is also requested. | ||
| abandonCts.Cancel(); | ||
| cancelCts.Cancel(); |
There was a problem hiding this comment.
I'm not sure what we are testing here. Explain it to me
| // ClientScriptExecutionAbandon.AbandonScript_WithNoPriorCancel_KillsTheProcess, and the | ||
| // un-killable (survives) case by AbandonScript_WhenCancelFailsToKillProcess. (We do NOT | ||
| // disable kill here: setting that env var is process-wide and leaks into the Tentacle | ||
| // subprocesses other integration tests spawn.) |
| // Abandon stops waiting and returns the distinct AbandonedExitCode. Abandon also | ||
| // best-effort-kills the process — the kill itself is asserted by | ||
| // ClientScriptExecutionAbandon.AbandonScript_WithNoPriorCancel_KillsTheProcess, and the | ||
| // un-killable (survives) case by AbandonScript_WhenCancelFailsToKillProcess. (We do NOT |
There was a problem hiding this comment.
Where are the other test cases. Do we need more coverage here?
| namespace Octopus.Tentacle.Tests.Integration | ||
| { | ||
| [IntegrationTestTimeout] | ||
| public class CapabilitiesServiceV2Test : IntegrationTest |
There was a problem hiding this comment.
Dont we need a test for our new capability here?
| await FluentActions.Invoking(async () => await stuckExecution).Should().ThrowAsync<OperationCanceledException>(); | ||
|
|
||
| // Outcome: a second FullIsolation script with the same mutex now runs — only possible if abandon released it. | ||
| var secondStartFile = Path.Combine(clientTentacle.TemporaryDirectory.DirectoryPath, "second-start"); |
There was a problem hiding this comment.
This should be a second test
| File.WriteAllText(releaseFile, ""); | ||
|
|
||
| // Cancel path only — never NotSupportedException. | ||
| await FluentActions.Invoking(async () => await execution).Should().ThrowAsync<OperationCanceledException>(); |
There was a problem hiding this comment.
What is this asserting?
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Delete the dead abandon-filtered catch in RunningScript; a script cancelled while still acquiring the isolation mutex reports CanceledExitCode. Document that at both the script-service (ScriptServiceV2Fixture) and client (ClientScriptExecutionAbandon) levels, with a comment marking the re-add point. - Re-add the capabilities list-length guard as an exact-set assertion, abandon-aware via Version.HasAbandonScript, in CapabilitiesServiceV2Test. - Extract EnsureProcessKilled test helper that asserts the kill so a leaked process surfaces loudly on CI; fold in TryKillGrandchild; drop the redundant AbandonToken_WhenCancelAlsoRequested runner test. - Comment cleanups: name the abandon-token consumption site in ScriptServiceV2, remove the SupportsAbandon comment, revert SafelyWaitForAllOutput to main's. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ild's exit) The asserting EnsureProcessKilled failed CancellationToken_WhenUnixGrandchildHoldsRedirectedPipes on the Ubuntu CI agent: the grandchild is reparented to init, so it isn't our child and Process.WaitForExit can't reliably observe it exit. Keep the asserting helper for the abandon test (our child), revert the grandchild sites to best-effort TryKillGrandchild. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| : await RunScriptAsync(shellPath, writer, runningScriptToken, abandonToken); | ||
| } | ||
| } | ||
| // Fires when the caller cancelled the script and the underlying process honored the cancellation token. |
There was a problem hiding this comment.
This doesnt make sense. Explain it again
| } | ||
| expected.Add(nameof(IScriptServiceV2)); | ||
| if (version.HasAbandonScript()) | ||
| expected.Add(nameof(ScriptServiceV2.AbandonScriptAsync)); |
There was a problem hiding this comment.
I'm not sure this is right
|
|
||
| var tentacleClient = clientTentacle.TentacleClient; | ||
|
|
||
| var scriptExecution = Task.Run(async () => await tentacleClient.ExecuteScript(firstCommand, CancellationToken)); |
There was a problem hiding this comment.
This should be start script not execute script
There was a problem hiding this comment.
Fix all comments in the test as you go. All tests except ExecuteScript_WhenCancellationStaysPending_EscalatesToAbandon_AndReleasesMutex should use start and not execute
There was a problem hiding this comment.
We should have tests for both start and execute paths
| await tentacleClient.AbandonScript(firstCommand.ScriptTicket, CancellationToken); | ||
|
|
||
| var (result, _) = await scriptExecution; | ||
| result.ExitCode.Should().Be(ScriptExitCodes.AbandonedExitCode); |
There was a problem hiding this comment.
Also assert the process is still running in all of these tests so we know we're actually setting the test up right
|
|
||
| [Test] | ||
| [TentacleConfigurations(scriptServiceToTest: ScriptServiceVersionToTest.Version2)] | ||
| public async Task ExecuteScript_WhenAScriptWaitingOnTheMutexIsCancelledAndAbandoned_ReturnsCancelleExitCode(TentacleConfigurationTestCase tentacleConfigurationTestCase) |
There was a problem hiding this comment.
Change this to make explicit calls to start, cancel and abandon. We shouldnt need TentacleDebugDisableProcessKill_UNSAFE_FOR_PRODUCTION either
…estructure Abandon support was decided by ScriptServiceVersion == V2, but old V2 Tentacles (v7.1, v8.0) are V2 yet predate the abandon verb. Determine it from the advertised AbandonScript capability: - Re-add client CapabilitiesResponseV2.HasAbandonScript(); the selector surfaces it; the orchestrator captures it once on the start result (CommandContext.SupportsAbandon) and gates escalation on it. Removed SupportsAbandon from ScriptServiceVersion (it can't carry it - the record is value-compared in ScriptExecutorFactory). Selector unit test covers all three cases. Test feedback: - CapabilitiesServiceV2Test: focused decorator/K8s tests back to Contain/NotContain; cross-version AbandonScriptIsAdvertisedOnlyByVersionsThatSupportIt proves old versions don't advertise abandon. - SilentProcessRunnerFixture: renamed to AbandonToken_WhenProcessCanBeKilled_..., assert the process is gone, trim the comment. - ClientScriptExecutionAbandon: drive explicit StartScript/CancelScript/AbandonScript; assert the process is still running (stuck) vs gone (killable) via PID; rename for the killable/stuck split; clarify the V1 no-escalation comment. - RunningScript: correct the cancel-catch comment (OCE comes from mutex Acquire, not the process). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
025b768 to
86ef03a
Compare
A force-push from a stale base overwrote 4 'Apply suggestion from @jimmyp' commits (025b768). Restoring them, reconciled with the capability-gating rewrite: - SilentProcessRunner: restore the swallow/abandon-wait comment wording verbatim. - KubernetesScriptServiceV1Executor: drop 'Hung pods are recovered by deleting the pod instead.' - ClientScriptExecutionAbandon: reword the no-prior-cancel releaseFile comment. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| if (version.HasScriptServiceV2()) | ||
| { | ||
| capabilities.Should().Contain(nameof(IScriptServiceV2)); | ||
| expectedCapabilitiesCount++; |
There was a problem hiding this comment.
Add these counts back
…hAbandon) + review fixes #1 Move abandon-support back onto the version (option A). The selector picks ScriptServiceVersion2WithAbandon when the Tentacle advertised AbandonScript, else ScriptServiceVersion2; the factory routes both to the V2 executor; the executor stamps the selected version onto the CommandContext, so the orchestrator gates on ScripServiceVersionUsed.SupportsAbandon. Reverts the CommandContext flag / capture-once / selector tuple. Capability string is nameof(IScriptServiceV2.AbandonScript) on both sides (server advertise + client check, unit fixture, integration test). Deleted the mocked selector unit test (redundant with the version approach). #2 CapabilitiesServiceV2Test: restore the count assertions (abandon-aware). #3 ClientScriptExecutionAbandon: capture each spawned PID and force-kill in TearDown so nothing leaks on CI. #4 SilentProcessRunnerFixture: add AbandonToken_WhenProcessCanNotBeKilled (asserts the process is still running, then cleans it up). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| } | ||
|
|
||
| if (scriptServiceToUse == ScriptServiceVersion.ScriptServiceVersion2) | ||
| if (scriptServiceToUse == ScriptServiceVersion.ScriptServiceVersion2 || scriptServiceToUse == ScriptServiceVersion.ScriptServiceVersion2WithAbandon) |
There was a problem hiding this comment.
Lets create ScriptServiceVersion.IsV2 to centralise this and remove the comment below
…NotSupported assertion - Add ScriptServiceVersion.IsV2 (true for both V2 variants); ScriptExecutorFactory dispatches on it and the explanatory comment is gone. - Split the escalation test: ExecuteScript_WhenCancellationStaysPending_EscalatesToAbandon now just proves the run terminates; the mutex-release outcome is its own test, ..._AndReleasesTheIsolationMutex. - V1 test: assert the run does NOT throw NotSupportedException (rather than asserting it throws OCE) and rename to ExecuteScript_OnV1Tentacle_DoesNotEscalateToNotSupportedAbandon. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…fect The mutex-release test was a strict superset of _EscalatesToAbandon (both asserted OCE). Drop the redundant OCE assertion from _AndReleasesTheIsolationMutex so it asserts only the mutex release (second script runs); _EscalatesToAbandon keeps the run-terminates (OCE) assertion. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Why
Tentacle needs to release the script-isolation mutex even when a script can't be cancelled. Philips is running into situations where CrowdStrike and Rapid7 contention freezes the script causing cancel commands to not be able to complete.
This PR adds an
AbandonScriptcommand that causes Tentacle to stop monitoring the process and release the mutex, write a line into the script log and accept new work. If we can kill the OS process we do, but if we can't we stop waiting on it and leave it running.What changed
AbandonScriptRPC onIScriptServiceV2/IAsyncScriptServiceV2/IAsyncClientScriptServiceV2: fires the abandon token and returns the current status of the scriptRunningScriptlogs the abandonment and returns-48(AbandonedExitCode); the isolation-mutexusingunwinds, releasing it.SilentProcessRunnerbest-effort kills the process on abandon, then stops waiting.CompleteScriptAsynctolerates aworkspace.Deletefailure for an abandoned script (the still running script can hold file handles etc)TentacleClient.ExecuteScriptnow has an optionalabandonAfterCancellationPendingForparamete once cancellation has been pending that long it abandons the script.AbandonScriptAsynccapability.TentacleDebugDisableProcessKill_UNSAFE_FOR_PRODUCTIONwhich makesHitmana no-op instead of killing the script process.Before and after
Cancel only helps if the process actually dies. When it doesn't, the mutex stays held and every later deployment to that Tentacle queues forever. Today the only fix is an RDP onto the machine to kill the script or a reboot of the machine. Abandon changes that. After the cancel timeout the server abandons the script and the Tentacle releases the mutex, so work flows again.
Before
flowchart TD A["CancelScript fires the cancel token"] --> B["cancel.Register runs DoOurBestToCleanUp:<br/>Hitman.TryKillProcessAndChildrenRecursively, then process.Close()"] B --> C{"What happens to the OS process?"} C -->|"killed cleanly"| D["process.WaitForExit() returns"] C -->|"child dies, reparented grandchild holds the pipes"| E["process.Close() releases the redirected pipes"] E --> D C -->|"hung script won't die (CrowdStrike + Rapid7)"| F["process.WaitForExit() blocks forever<br/>isolation mutex never released; every later deploy queues"] D --> G["drain stdout/stderr, SafelyGetExitCode, return the process exit code"]After
flowchart TD A["CancelScript fires the cancel token"] --> B["cancel.Register runs DoOurBestToCleanUp:<br/>Hitman.TryKillProcessAndChildrenRecursively, then process.Close()"] B --> C{"What happens to the OS process?"} C -->|"killed cleanly"| D["waitForExit task completes"] C -->|"child dies, reparented grandchild holds the pipes"| E["process.Close() releases the redirected pipes"] E --> D C -->|"hung script won't die"| W["await Task.WhenAny(waitForExit, WaitForAbandon(abandon)) stays pending"] D --> N["WhenAny returns, abandon not set: drain, return the process exit code"] W --> AB["AbandonScript fires the abandon token<br/>(server escalates after abandonAfterCancellationPendingFor)"] AB --> X["WhenAny returns via WaitForAbandon: DoOurBestToCleanUp once more (best-effort),<br/>log 'Tentacle has abandoned this script', return AbandonedExitCode (-48)"] X --> R["isolation mutex released, work flows again;<br/>OS process left running if it still won't die"]Risks & mitigations
Machines may start collecting stuck script processes that aren't monitored by tentacle. Any real mitigation of this is beyond our control which is why we have added product documentation around this feature to make users aware of their responisibilities.
Workspace files may leak on Windows. However the workspace gets cleaned up on Tentacle restart or by out-of-band cleanup mechanisms.
When abandoning scripts that aren't fully frozen, but maybe just slow to respond there's a race between the script completing naturally and abandonment, the script can finish between the time we call abandon and abandon completing. If the script exits in this gap, we dont make any guarantees we will return the correct error code. This is a natural race condition for the user clicking cancel button and the script completing anyway, so there is no real justification to mitigate this
During CI builds we start scripts and simulate them hanging, which could result in stray processes hanging around on build agents. We make best attempts to clean these up
For Kubernetes agent owners
K8s agents do NOT advertise abandon. The capability set only includes
AbandonScriptAsyncfor non-Kubernetes Tentacles. K8s agents continue to advertise their own capability set without abandon support.Why: K8s scripts run in pods. A hung pod is recoverable by deleting it, which the server can do through the existing K8s lifecycle. The Philips-style hang doesn't translate to k8s pods.
How to review
[JIM_BOT.EXE v2.13]