Abandon Script by jimmyp · Pull Request #1226 · OctopusDeploy/OctopusTentacle

jimmyp · 2026-05-21T03:56:59Z

Why

Tentacle needs to release the script-isolation mutex even when a script can't be cancelled. Philips is running into situations where CrowdStrike and Rapid7 contention freezes the script causing cancel commands to not be able to complete.

This PR adds an AbandonScript command that causes Tentacle to stop monitoring the process and release the mutex, write a line into the script log and accept new work. If we can kill the OS process we do, but if we can't we stop waiting on it and leave it running.

What changed

AbandonScript RPC on IScriptServiceV2 / IAsyncScriptServiceV2 / IAsyncClientScriptServiceV2: fires the abandon token and returns the current status of the script
RunningScript logs the abandonment and returns -48 (AbandonedExitCode); the isolation-mutex using unwinds, releasing it.
SilentProcessRunner best-effort kills the process on abandon, then stops waiting.
CompleteScriptAsync tolerates a workspace.Delete failure for an abandoned script (the still running script can hold file handles etc)
TentacleClient.ExecuteScript now has an optional abandonAfterCancellationPendingFor paramete once cancellation has been pending that long it abandons the script.
Non-Kubernetes Tentacles advertise the AbandonScriptAsync capability.
For simulating frozen scripts I had to add TentacleDebugDisableProcessKill_UNSAFE_FOR_PRODUCTION which makes Hitman a no-op instead of killing the script process.

Before and after

Cancel only helps if the process actually dies. When it doesn't, the mutex stays held and every later deployment to that Tentacle queues forever. Today the only fix is an RDP onto the machine to kill the script or a reboot of the machine. Abandon changes that. After the cancel timeout the server abandons the script and the Tentacle releases the mutex, so work flows again.

Before

flowchart TD
    A["CancelScript fires the cancel token"] --> B["cancel.Register runs DoOurBestToCleanUp:<br/>Hitman.TryKillProcessAndChildrenRecursively, then process.Close()"]
    B --> C{"What happens to the OS process?"}
    C -->|"killed cleanly"| D["process.WaitForExit() returns"]
    C -->|"child dies, reparented grandchild holds the pipes"| E["process.Close() releases the redirected pipes"]
    E --> D
    C -->|"hung script won't die (CrowdStrike + Rapid7)"| F["process.WaitForExit() blocks forever<br/>isolation mutex never released; every later deploy queues"]
    D --> G["drain stdout/stderr, SafelyGetExitCode, return the process exit code"]

After

flowchart TD
    A["CancelScript fires the cancel token"] --> B["cancel.Register runs DoOurBestToCleanUp:<br/>Hitman.TryKillProcessAndChildrenRecursively, then process.Close()"]
    B --> C{"What happens to the OS process?"}
    C -->|"killed cleanly"| D["waitForExit task completes"]
    C -->|"child dies, reparented grandchild holds the pipes"| E["process.Close() releases the redirected pipes"]
    E --> D
    C -->|"hung script won't die"| W["await Task.WhenAny(waitForExit, WaitForAbandon(abandon)) stays pending"]
    D --> N["WhenAny returns, abandon not set: drain, return the process exit code"]
    W --> AB["AbandonScript fires the abandon token<br/>(server escalates after abandonAfterCancellationPendingFor)"]
    AB --> X["WhenAny returns via WaitForAbandon: DoOurBestToCleanUp once more (best-effort),<br/>log 'Tentacle has abandoned this script', return AbandonedExitCode (-48)"]
    X --> R["isolation mutex released, work flows again;<br/>OS process left running if it still won't die"]

Risks & mitigations

Machines may start collecting stuck script processes that aren't monitored by tentacle. Any real mitigation of this is beyond our control which is why we have added product documentation around this feature to make users aware of their responisibilities.

Workspace files may leak on Windows. However the workspace gets cleaned up on Tentacle restart or by out-of-band cleanup mechanisms.

When abandoning scripts that aren't fully frozen, but maybe just slow to respond there's a race between the script completing naturally and abandonment, the script can finish between the time we call abandon and abandon completing. If the script exits in this gap, we dont make any guarantees we will return the correct error code. This is a natural race condition for the user clicking cancel button and the script completing anyway, so there is no real justification to mitigate this

During CI builds we start scripts and simulate them hanging, which could result in stray processes hanging around on build agents. We make best attempts to clean these up

For Kubernetes agent owners

K8s agents do NOT advertise abandon. The capability set only includes AbandonScriptAsync for non-Kubernetes Tentacles. K8s agents continue to advertise their own capability set without abandon support.

Why: K8s scripts run in pods. A hung pod is recoverable by deleting it, which the server can do through the existing K8s lifecycle. The Philips-style hang doesn't translate to k8s pods.

How to review

Is the abandon behaviour right? (mutex released, OS process killed if we can and left alone if we can't, log message honest)
Given this is a significant change to Tentacle, is there anything else we can do to add safety nets or reduce risk?

[JIM_BOT.EXE v2.13]

jimmyp · 2026-05-22T02:11:15Z

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

I'm Claude, leaving this review on Jim's behalf.

🤖 Generated with Claude Code

_{- If this code review was useful, please react with 👍. Otherwise, react with 👎.}

jimmyp

Need to address these comments

jimmyp · 2026-05-25T02:38:04Z

Rebased on top of the new foundational PR #1236 (async migration of SilentProcessRunner). The diff is now focused on the abandon feature itself — the async-migration plumbing has moved to the base PR. Previous head was preserved at git tag claude-safety-2026-05-25-pre-split.

…, lighthouse comments, async tests - Strip SilentProcessRunner.cs to signature-only async change (move polyfill, EnableRaisingEvents, and await WaitForExitAsync to #1226 abandon PR) - Apply lighthouse comment pattern uniformly to all sync-over-async sites including test scaffolding and the second/third sites in LinuxServiceConfigurator - Improve LinuxServiceConfigurator justification: name the sync interface chain (IServiceConfigurator -> AbstractCommand -> ICommand -> Topshelf) and the no-SynchronizationContext property of Main/Topshelf worker threads - Move SilentProcessRunnerFixture sync-over-async comment to actual call site - Convert LinuxConfigureServiceHelperFixture test methods to async Task, eliminating sync-over-async entirely in that file Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

AbandonScript was waiting up to 5 seconds inside the RPC handler for the running script to reach Complete state before returning the response. The server (Octopus) polls GetStatus afterwards anyway (same pattern as the cancel flow), so the synchronous wait on the Tentacle side just burns RPC time without adding behavior. Integration tests updated to assert against polled GetStatus instead of directly against the abandon response state. Addresses #1226 review: ScriptServiceV2.cs:153 ("Which callers. I dont think we want to do this"). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Explains why the abandoned-script path tolerates Delete() failure (process still alive, may hold open handles) while the normal completion path surfaces failures. Addresses #1226 review: ScriptServiceV2.cs:176 ('Add some comments'). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- RunningScript: differentiate abandoned (log "Script execution abandoned." + AbandonedExitCode) from cancelled (existing behavior). The abandon-specific catch checks abandonToken.IsCancellationRequested so the cancel path is unaffected. - SilentProcessRunner: extract SafelyCancelOutputAndErrorRead helper used by both the abandon catch and the normal completion path. Keeps the two paths consistent (a missed CancelXxxRead leaves async readers firing during dispose). - ClientAndTentacle.CreateScriptServiceV2Client: documented why this helper exists (test needs direct RPC access; TentacleClient deliberately doesn't expose AbandonScript today). - SilentProcessRunnerFixture: documented the 2s elapsed-time assertion (it's the contract: abandon returns promptly without waiting for the underlying process. Without the bound, the test would silently lose the entire contract). - CapabilitiesServiceV2Test: documented why version==null means "latest tentacle under test" (vs older S3-fetched builds for backwards-compat coverage). - ClientScriptExecutionAbandon: split into two focused tests (cancel-fails-then-abandon, and mutex-release). Deleted the multi-level-deep test: the abandon mechanism doesn't touch the script process tree, so process tree depth is irrelevant to the abandon path (Hitman with the env var short- circuits before any kill regardless of depth). - ScriptBuilder: removed AppendRaw (was only used by the deleted multi-level test). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jimmyp · 2026-05-26T03:08:09Z

Rebased on top of #1236's Pattern 1 refactor. The abandon-feature commits now build on a cleaner async migration (single bridges instead of multiple in-method GetAwaiter calls). Previous head preserved at tag claude-safety-2026-05-26-pre-pattern1-rebase.

jimmyp · 2026-06-03T22:49:40Z

+
+            // Abandon first (resolves the wait to -48), then cancel is also requested.
+            abandonCts.Cancel();
+            cancelCts.Cancel();


I'm not sure what we are testing here. Explain it to me

jimmyp · 2026-06-03T22:49:49Z

+            // ClientScriptExecutionAbandon.AbandonScript_WithNoPriorCancel_KillsTheProcess, and the
+            // un-killable (survives) case by AbandonScript_WhenCancelFailsToKillProcess. (We do NOT
+            // disable kill here: setting that env var is process-wide and leaks into the Tentacle
+            // subprocesses other integration tests spawn.)


remove this comment

jimmyp · 2026-06-03T22:51:20Z

+            // Abandon stops waiting and returns the distinct AbandonedExitCode. Abandon also
+            // best-effort-kills the process — the kill itself is asserted by
+            // ClientScriptExecutionAbandon.AbandonScript_WithNoPriorCancel_KillsTheProcess, and the
+            // un-killable (survives) case by AbandonScript_WhenCancelFailsToKillProcess. (We do NOT


Where are the other test cases. Do we need more coverage here?

jimmyp · 2026-06-03T22:53:14Z

 namespace Octopus.Tentacle.Tests.Integration
 {
    [IntegrationTestTimeout]
    public class CapabilitiesServiceV2Test : IntegrationTest


Dont we need a test for our new capability here?

jimmyp · 2026-06-04T02:49:18Z

+            await FluentActions.Invoking(async () => await stuckExecution).Should().ThrowAsync<OperationCanceledException>();
+
+            // Outcome: a second FullIsolation script with the same mutex now runs — only possible if abandon released it.
+            var secondStartFile = Path.Combine(clientTentacle.TemporaryDirectory.DirectoryPath, "second-start");


This should be a second test

jimmyp · 2026-06-04T02:50:17Z

+            File.WriteAllText(releaseFile, "");
+
+            // Cancel path only — never NotSupportedException.
+            await FluentActions.Invoking(async () => await execution).Should().ThrowAsync<OperationCanceledException>();


What is this asserting?

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- Delete the dead abandon-filtered catch in RunningScript; a script cancelled while still acquiring the isolation mutex reports CanceledExitCode. Document that at both the script-service (ScriptServiceV2Fixture) and client (ClientScriptExecutionAbandon) levels, with a comment marking the re-add point. - Re-add the capabilities list-length guard as an exact-set assertion, abandon-aware via Version.HasAbandonScript, in CapabilitiesServiceV2Test. - Extract EnsureProcessKilled test helper that asserts the kill so a leaked process surfaces loudly on CI; fold in TryKillGrandchild; drop the redundant AbandonToken_WhenCancelAlsoRequested runner test. - Comment cleanups: name the abandon-token consumption site in ScriptServiceV2, remove the SupportsAbandon comment, revert SafelyWaitForAllOutput to main's. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ild's exit) The asserting EnsureProcessKilled failed CancellationToken_WhenUnixGrandchildHoldsRedirectedPipes on the Ubuntu CI agent: the grandchild is reparented to init, so it isn't our child and Process.WaitForExit can't reliably observe it exit. Keep the asserting helper for the abandon test (our child), revert the grandchild sites to best-effort TryKillGrandchild. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jimmyp · 2026-06-04T23:18:59Z

+                                : await RunScriptAsync(shellPath, writer, runningScriptToken, abandonToken);
                        }
                    }
+                    // Fires when the caller cancelled the script and the underlying process honored the cancellation token.


This doesnt make sense. Explain it again

jimmyp · 2026-06-04T23:27:27Z

-            }
+                expected.Add(nameof(IScriptServiceV2));
+            if (version.HasAbandonScript())
+                expected.Add(nameof(ScriptServiceV2.AbandonScriptAsync));


I'm not sure this is right

jimmyp · 2026-06-04T23:28:23Z

+
+            var tentacleClient = clientTentacle.TentacleClient;
+
+            var scriptExecution = Task.Run(async () => await tentacleClient.ExecuteScript(firstCommand, CancellationToken));


This should be start script not execute script

Fix all comments in the test as you go. All tests except ExecuteScript_WhenCancellationStaysPending_EscalatesToAbandon_AndReleasesMutex should use start and not execute

We should have tests for both start and execute paths

jimmyp · 2026-06-04T23:29:48Z

+            await tentacleClient.AbandonScript(firstCommand.ScriptTicket, CancellationToken);
+
+            var (result, _) = await scriptExecution;
+            result.ExitCode.Should().Be(ScriptExitCodes.AbandonedExitCode);


Also assert the process is still running in all of these tests so we know we're actually setting the test up right

jimmyp · 2026-06-04T23:35:32Z

+
+        [Test]
+        [TentacleConfigurations(scriptServiceToTest: ScriptServiceVersionToTest.Version2)]
+        public async Task ExecuteScript_WhenAScriptWaitingOnTheMutexIsCancelledAndAbandoned_ReturnsCancelleExitCode(TentacleConfigurationTestCase tentacleConfigurationTestCase)


Change this to make explicit calls to start, cancel and abandon. We shouldnt need TentacleDebugDisableProcessKill_UNSAFE_FOR_PRODUCTION either

…estructure Abandon support was decided by ScriptServiceVersion == V2, but old V2 Tentacles (v7.1, v8.0) are V2 yet predate the abandon verb. Determine it from the advertised AbandonScript capability: - Re-add client CapabilitiesResponseV2.HasAbandonScript(); the selector surfaces it; the orchestrator captures it once on the start result (CommandContext.SupportsAbandon) and gates escalation on it. Removed SupportsAbandon from ScriptServiceVersion (it can't carry it - the record is value-compared in ScriptExecutorFactory). Selector unit test covers all three cases. Test feedback: - CapabilitiesServiceV2Test: focused decorator/K8s tests back to Contain/NotContain; cross-version AbandonScriptIsAdvertisedOnlyByVersionsThatSupportIt proves old versions don't advertise abandon. - SilentProcessRunnerFixture: renamed to AbandonToken_WhenProcessCanBeKilled_..., assert the process is gone, trim the comment. - ClientScriptExecutionAbandon: drive explicit StartScript/CancelScript/AbandonScript; assert the process is still running (stuck) vs gone (killable) via PID; rename for the killable/stuck split; clarify the V1 no-escalation comment. - RunningScript: correct the cancel-catch comment (OCE comes from mutex Acquire, not the process). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@jimmyp

A force-push from a stale base overwrote 4 'Apply suggestion from @jimmyp' commits (025b768). Restoring them, reconciled with the capability-gating rewrite: - SilentProcessRunner: restore the swallow/abandon-wait comment wording verbatim. - KubernetesScriptServiceV1Executor: drop 'Hung pods are recovered by deleting the pod instead.' - ClientScriptExecutionAbandon: reword the no-prior-cancel releaseFile comment. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jimmyp · 2026-06-05T05:07:51Z

            if (version.HasScriptServiceV2())
-            {
                capabilities.Should().Contain(nameof(IScriptServiceV2));
-                expectedCapabilitiesCount++;


Add these counts back

…hAbandon) + review fixes #1 Move abandon-support back onto the version (option A). The selector picks ScriptServiceVersion2WithAbandon when the Tentacle advertised AbandonScript, else ScriptServiceVersion2; the factory routes both to the V2 executor; the executor stamps the selected version onto the CommandContext, so the orchestrator gates on ScripServiceVersionUsed.SupportsAbandon. Reverts the CommandContext flag / capture-once / selector tuple. Capability string is nameof(IScriptServiceV2.AbandonScript) on both sides (server advertise + client check, unit fixture, integration test). Deleted the mocked selector unit test (redundant with the version approach). #2 CapabilitiesServiceV2Test: restore the count assertions (abandon-aware). #3 ClientScriptExecutionAbandon: capture each spawned PID and force-kill in TearDown so nothing leaks on CI. #4 SilentProcessRunnerFixture: add AbandonToken_WhenProcessCanNotBeKilled (asserts the process is still running, then cleans it up). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jimmyp · 2026-06-05T05:50:36Z

            }

-            if (scriptServiceToUse == ScriptServiceVersion.ScriptServiceVersion2)
+            if (scriptServiceToUse == ScriptServiceVersion.ScriptServiceVersion2 || scriptServiceToUse == ScriptServiceVersion.ScriptServiceVersion2WithAbandon)


Lets create ScriptServiceVersion.IsV2 to centralise this and remove the comment below

…NotSupported assertion - Add ScriptServiceVersion.IsV2 (true for both V2 variants); ScriptExecutorFactory dispatches on it and the explanatory comment is gone. - Split the escalation test: ExecuteScript_WhenCancellationStaysPending_EscalatesToAbandon now just proves the run terminates; the mutex-release outcome is its own test, ..._AndReleasesTheIsolationMutex. - V1 test: assert the run does NOT throw NotSupportedException (rather than asserting it throws OCE) and rename to ExecuteScript_OnV1Tentacle_DoesNotEscalateToNotSupportedAbandon. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…fect The mutex-release test was a strict superset of _EscalatesToAbandon (both asserted OCE). Drop the redundant OCE assertion from _AndReleasesTheIsolationMutex so it asserts only the mutex release (second script runs); _EscalatesToAbandon keeps the run-terminates (OCE) assertion. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jimmyp requested a review from a team as a code owner May 21, 2026 03:57

jimmyp commented May 21, 2026

View reviewed changes