Skip to content

[PR2 / DO NOT MERGE until servers support abandon] Async WaitForExitAsync; grandchildren require abandon#1244

Draft
jimmyp wants to merge 61 commits into
mainfrom
jimpelletier/eft-3295-pr2-async-unlink
Draft

[PR2 / DO NOT MERGE until servers support abandon] Async WaitForExitAsync; grandchildren require abandon#1244
jimmyp wants to merge 61 commits into
mainfrom
jimpelletier/eft-3295-pr2-async-unlink

Conversation

@jimmyp
Copy link
Copy Markdown
Contributor

@jimmyp jimmyp commented Jun 1, 2026

Follow-up to #1226. Do not merge until #1226 has shipped and all customer servers know how to abandon a script.

Switches SilentProcessRunner from #1226's synchronous WaitForExit() + Task.WhenAny(abandon) to async WaitForExitAsync, for performance (no thread blocked per running script).

With the async wait, process.Close() can't be used to unblock the grandchild-holds-pipes case (it races the Exited event the async wait depends on), so:

  • cancel no longer unsticks a re-parented grandchild on its own — abandon is required (cancel-then-abandon).
  • cancel returns the real kill exit code (137/143) instead of -1.

This is why it must wait for #1226: until servers reliably call abandon, a grandchild-holding script could no longer be unstuck by cancel alone.

The two grandchild tests are flipped to assert the cancel-then-abandon contract.

🤖 Generated with Claude Code

jimmyp and others added 30 commits May 25, 2026 12:17
Replaces the sync WaitForExit() with await WaitForExitAsync(cancel).
The cancel token is passed directly so the existing cancel semantics
are preserved: cancel firing throws OCE from the await and unwinds.
DoOurBestToCleanUp continues to fire on cancel via cancel.Register
exactly as it did in the sync version.

Adds a net48 polyfill for WaitForExitAsync using Process.Exited +
TaskCompletionSource.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ISilentProcessRunner, SilentProcessRunnerWrapper, and the
SilentProcessRunnerExtended helpers now return Task<int> and call
SilentProcessRunner.ExecuteCommandAsync directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CommandLineRunner.Execute is consumed by Octopus.Manager.Tentacle
(a WPF app), so the public method stays sync. It now blocks on the
new SilentProcessRunner.ExecuteCommandAsync via GetAwaiter().GetResult()
— safe because the WPF callers dispatch through ThreadPool.QueueUserWorkItem,
which has no synchronisation context.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
RunningScript now awaits SilentProcessRunner.ExecuteCommandAsync
through a new RunScriptAsync helper. The monitored-startup path
also awaits the async helper inside Task.Run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The six immediate sync callers of SilentProcessRunner now go through
ExecuteCommandAsync(...).GetAwaiter().GetResult():
  - Octopus.Manager.Tentacle PowerShellPrerequisite (WPF installer)
  - KubernetesDirectoryInformationProvider (IMemoryCache factory)
  - SystemCtlHelper (2 sites — start and sudo retry)
  - LinuxServiceConfigurator (3 sites — chmod, systemctl probe, sudo probe)
  - WindowsServiceConfigurator (sc.exe wrapper)

Each site gets a comment explaining why it must be sync and why
blocking on a thread-pool worker is deadlock-safe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Updates Kubernetes integration test setup helpers, PowerShell
startup-detection tests, integration support, and Linux test
fixtures to call the new async ExecuteCommandAsync API.

Tests that don't await directly (NUnit static helpers, cache
factories) block on .GetAwaiter().GetResult() and document why
it's deadlock-safe on the test threadpool.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rewrites the six sync-boundary comments to name the canonical pattern
("sync-over-async"), link Stephen Cleary's "Don't Block on Async Code"
reference, and keep the "we are here / we do this / safe because"
structure. Removes em-dashes per style preference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…, lighthouse comments, async tests

- Strip SilentProcessRunner.cs to signature-only async change (move polyfill,
  EnableRaisingEvents, and await WaitForExitAsync to #1226 abandon PR)
- Apply lighthouse comment pattern uniformly to all sync-over-async sites
  including test scaffolding and the second/third sites in LinuxServiceConfigurator
- Improve LinuxServiceConfigurator justification: name the sync interface chain
  (IServiceConfigurator -> AbstractCommand -> ICommand -> Topshelf) and the
  no-SynchronizationContext property of Main/Topshelf worker threads
- Move SilentProcessRunnerFixture sync-over-async comment to actual call site
- Convert LinuxConfigureServiceHelperFixture test methods to async Task,
  eliminating sync-over-async entirely in that file

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…daries

Consolidates 9 sync-over-async sites into 5 single-method bridges, each
of which is a tiny sync wrapper over a private async implementation. All
internal helpers become fully async, removing the in-method GetAwaiter
calls in SystemCtlHelper and LinuxServiceConfigurator entirely.

Bridges:
- PowerShellPrerequisite.Check
- KubernetesDirectoryInformationProvider.GetPathUsedBytes
- LinuxServiceConfigurator.ConfigureService (replaces 5 prior bridges)
- WindowsServiceConfigurator.ConfigureService
- CommandLineRunner.Execute (also exposes ExecuteAsync for async callers)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address PR review on #1236: the sync-over-async hop at the
IKubernetesDirectoryInformationProvider boundary was carrying two
consumers — a sync override (EnsureDiskHasEnoughFreeSpace) and an async
one (CreateScriptContainer). Expose async siblings (GetPathUsedBytesAsync
on the provider, GetStorageInformationAsync on KubernetesPhysicalFileSystem)
so the async caller awaits directly. The sync GetPathUsedBytes and
GetStorageInformation remain for the IOctopusFileSystem override, and the
lighthouse comment now names that exact consumer instead of waving at a
non-existent background sweeper.
Address PR review on #1236:
- Plainer two-section "Why this is sync / Why blocking is safe" format
  across all sync-over-async sites; tests add a "Why low risk" line per
  reviewer ask.
- KubernetesDirectoryInformationProvider, PowerShellPrerequisite,
  Linux/WindowsServiceConfigurator, KubernetesAgentInstaller,
  SilentProcessRunnerFixture, LinuxTestUserPrincipal (previously had
  no justification, now does).
- LinuxTentacleFetcher.ExtractTarGzip: both callers are already async, so
  flip the helper to ExtractTarGzipAsync and remove the sync-over-async
  entirely. Update NugetTentacleFetcher.ExtractTentacle to await.
- Fix WhenSettingUpPollingTentacle_TelemetryEventShouldBeSent: builder
  was stubbing only the sync Execute overloads, so the ExecuteAsync
  switch in ReviewAndRunScriptTabViewModel returned a default false
  Task and the telemetry callback never fired. Stub both overloads.
Address PR review on #1236: the sync and async GetStorageInformation
variants had duplicated body. Factor the bytesTotal lookup + tuple
assembly into a private sync helper so the sync/async pair we need stays
DRY.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Adds CancellationToken abandon parameter
- Switches the await from sync body to await WaitForExitAsync(abandon)
- Returns AbandonedExitCode when abandon fires before process exits
- Adds net48 polyfill (WaitForExitAsyncNetFramework)
- Sets process.EnableRaisingEvents = true
- Removes process.Close() race in DoOurBestToCleanUp
- Adds SafelyCancelOutputAndErrorRead helper
- Adds TentacleDebugDisableProcessKill test affordance to Hitman

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four caller sites that bridge sync interfaces to the (now-takes-abandon)
ExecuteCommandAsync pass CancellationToken.None for the new abandon
parameter (these callers don't need abandon semantics):
- PowerShellPrerequisite.CheckPowerShellIsInstalledAsync
- KubernetesDirectoryInformationProvider.GetDriveBytesUsingDuAsync
- WindowsServiceConfigurator.ScAsync
- CommandLineRunner.ExecuteAsync

SystemCtlHelper and LinuxServiceConfigurator call the CommandLineInvocation
extension overloads, which don't expose an abandon parameter — they default
to CancellationToken.None internally, so no edits needed there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… cancelled

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ace cleanup

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t helper

Test infrastructure (Kubernetes setup, tentacle fetcher, PowerShell tests)
and the IAsyncClientScriptServiceV2 test decorator now thread the new
abandon parameter through. Also extracts SafelyCancelOutputAndErrorRead in
SilentProcessRunner so the normal-completion path and the abandon-catch
path stay in sync.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CapabilitiesServiceV2 was advertising nameof(IAsyncClientScriptServiceV2
.AbandonScriptAsync) which evaluates to the method name "AbandonScriptAsync",
not the intended capability name. Both the unit test fixture
(CapabilitiesServiceV2Fixture) and the EFT-3295 commit subject expect
"AbandonScriptV2"; the integration test in CapabilitiesServiceV2Test had the
same nameof() mistake.

Replace with the literal "AbandonScriptV2" in production and the integration
test, drop the now-unused IAsyncClientScriptServiceV2 import. Verify against
the server's expected capability name before merging.
Address PR review on #1226: the method waits for whatever PID you pass it,
not specifically a grandchild process. Renamed to WaitForPidFileAsync and
updated call sites. Also updated the timeout-exception message inside the
method to drop grandchild-specific wording.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR review on #1226:
- ScriptServiceV2.StartScriptAsync: tighter comment on the double-call
  guard at the script-state check (per Jim's suggestion).
- ScriptServiceV2.AbandonScriptAsync: rewrite the abandon-effect comment
  to spell out the WaitForExitAsync return path and the server's
  follow-up GetStatus pattern.
- ScriptServiceV2.CompleteScriptAsync: remove a stale comment on the
  running-script bookkeeping disposal, expand the abandon-aware
  workspace.Delete comment, and link to the customer docs and
  AbandonScriptCommandV2 contract.
- ScriptServiceV2.WasAbandoned: extracted helper covering the
  state-store-load + AbandonedExitCode check so the conditional in
  CompleteScriptAsync reads cleanly.
- RunningScript.Execute: condensed the three catch-clause comments into
  one-liners explaining why each fires (abandon vs cancel vs mutex
  timeout) rather than restating the behaviour.
…ests

Address PR review on #1226:
- Production: replace the "AbandonScriptV2" literal with
  nameof(ScriptServiceV2.AbandonScriptAsync) on the non-Kubernetes
  capabilities path, matching the nameof pattern used for every other
  capability string in the codebase. Wire string is now "AbandonScriptAsync".
- Unit fixture: same nameof in both AbandonScript-related assertions.
- Drop the redundant .Count.Should().Be(N) assertions; BeEquivalentTo
  already checks element count.
- Integration test: revert the AbandonScript assertions that were added
  inside CapabilitiesFromAnOlderTentacleWhichHasNoCapabilitiesService_...
  and CapabilitiesServiceDoesNotReturnKubernetesScriptServiceForNonKubernetesTentacle.
  Each test's scope is its own concern; AbandonScript advertisement gets a
  dedicated test (LatestTentacle_AdvertisesAbandonScriptCapability),
  modeled on the K8s-specific test pattern in the same file.
PR review on #1226:
- Rewrote the outputResetEvent.Wait comment per Jim's suggestion (explains
  the EOF wait semantics, the no-pipe-close behaviour, and the grandchild
  case).
- Clarified the 'workspace log writer' comment by naming the failure mode
  explicitly (late OutputDataReceived after the workspace was disposed
  would throw ObjectDisposedException).
- Expanded the process.Close() removal comment with the full old-sync /
  new-async explanation Jim wrote, including a link to the Microsoft docs
  on Process.WaitForExitAsync output-processing behaviour.
- Renamed TentacleDebugDisableProcessKill to
  TentacleDebugDisableProcessKill_UNSAFE_FOR_PRODUCTION so the variable's
  danger is obvious to anyone reading it; updated the declaration in
  EnvironmentVariables and the references in SilentProcessRunner's Hitman
  helper and the ClientScriptExecutionAbandon integration tests.
PR review on #1226: the call sites that didn't have a token chain to
forward were passing `cancel: CancellationToken.None, abandon:
CancellationToken.None` explicitly to ExecuteCommandAsync. The
signature already defaults both, so the explicit None is noise. Drop
it at every site that doesn't have a real token to pass.

Leaves the defaults in place so CA2016 can be enabled later to catch
the 'I have a token in scope and forgot to forward it' case at the
analyzer level.
Address PR review on #1226: the inline #if NETFRAMEWORK / #else block
at the WaitForExit call site, plus the separate WaitForExitAsyncNetFramework
helper, are collapsed into a single WaitForProcessExitAsync method whose
body contains the pragma. The call site reads as one ordinary await; the
comments above it refer to WaitForProcessExitAsync as the named place
where we block, matching the long process.Close() removal comment.
Address PR review on #1226: ClientAndTentacle previously exposed the
HalibutRuntime so the abandon integration test could construct its own
IAsyncClientScriptServiceV2 proxy and call AbandonScriptAsync directly.
TentacleClient now surfaces AbandonScript as a first-class method, the
test uses it, and the direct-Halibut exposure is rolled back.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
jimmyp and others added 28 commits May 28, 2026 14:00
Address PR review on #1226: both 'when workspace.Delete fails' tests
were constructing a second ScriptServiceV2 with a mock workspace
factory only for CompleteScriptAsync, while StartScriptAsync /
AbandonScriptAsync had run on the fixture-level service. Because
running-script state is per-instance, the new service didn't know
the script existed, and the test never actually reached the
abandon-aware Delete tolerance code.

Refactored both tests to run on a single test-specific ScriptServiceV2
instance built with a DeleteThrowingScriptWorkspaceFactory decorator
that forwards every member through to a real workspace except Delete,
which throws. The decorator replaces the NSubstitute-based helper so
any workspace member StartScript or RunningScript may call passes
through to the real implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR review on #1226:
- Move process.EnableRaisingEvents = true inside #if NETFRAMEWORK so it
  only fires on the netframework target where the polyfill needs it.
  Microsoft docs confirm Process.WaitForExitAsync sets the flag itself
  on .NET 5+, so the unconditional assignment was only doing real work
  on the framework target.
- Drop EFT-3295 ticket reference from the abandon-token comment; just
  say "the abandon feature".
- Extract the netframework polyfill body from WaitForProcessExitAsync
  into a separate WaitForProcessExitAsyncNetFrameworkPolyfill method so
  both pragma branches are one-liners.
- Shrink the long process.Close() removal comment to a 3-line pointer;
  the full history lives in a PR comment on the line.
…xposure

PR review on #1226: abandon no longer needs the direct Halibut bypass
now that TentacleClient.AbandonScript exists. Refactored the two
ClientScriptExecutionAbandon test sites to call TentacleClient.CancelScript
and TentacleClient.GetStatus through new ScriptTicket-based extension
helpers in TentacleClientExtensionMethods (which synthesize a CommandContext
for ScriptServiceVersion2 internally, since the tests are interleaving
RPCs alongside an in-flight ExecuteScript task and don't carry a
CommandContext from a prior orchestrated call). Removed
CreateScriptServiceV2Client from ClientAndTentacle along with the now-unused
Halibut and ScriptServiceV2 contract usings. The halibutRuntime field stays
because LegacyTentacleClientBuilder still uses it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR review on #1226: replaced the [SetUp]-method approach for
constructing the SUT with a ScriptServiceV2Builder. Defaults match
what SetUp did; tests opt into mock overrides via .WithX(...) chained
calls. The CompleteScript_*WhenWorkspaceDeleteFails* tests now build
their SUT through the builder with the throwing-Delete factory and
mock log injected via the new With methods.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Drop the explanatory comment block above the TentacleClient
  CancelScript/GetStatus extension overloads per the latest review note.
- Move process.EnableRaisingEvents = true off the process configuration
  block and into WaitForProcessExitAsyncNetFrameworkPolyfill itself, so
  the pragma stops polluting ExecuteCommandAsync's body. The polyfill
  only runs on netframework, so EnableRaisingEvents only fires there;
  on .NET 8+ Process.WaitForExitAsync sets the flag itself.
Address PR review on #1226: source is cleaner without the pointer
comment. The full historical reasoning lives in a top-level PR comment
on #1226 (tagged for Luke's attention) so anyone reconsidering this in
the future has to find it via the PR history rather than the source.
- Delete the two plan files under docs/superpowers/plans/. Per Luke
  and Jim: plans are merge artefacts, specs stay (Jim is asking Luke
  to experiment with keeping specs for intent-discovery).
- Drop the unused `using System.Threading;` from
  KubernetesDirectoryInformationProvider.cs (leftover from the earlier
  cancel-token-then-removed pass).
- Assert that the sleep process is still running in the abandon test
  before the finally-block force-kills it. We don't care in production
  whether the script keeps running after abandon; the assertion is
  test-fixture confidence that we exercised the abandon path rather
  than accidentally cancelling.
Address PR review on #1226: Luke flagged the middle-layer abandon test
as not pulling its weight given coverage at the top (TentacleClient
integration) and bottom (SilentProcessRunner unit) layers. Deleted the
test plus the WaitForPidFileAsync / SafelyReadPidFile helpers that
only it used.
Address PR review on #1226: 7-line explanation shrunk to a 2-line
intent statement.
CapabilitiesFromAnOlderTentacle... and CapabilitiesServiceDoesNotReturnKubernetesScriptServiceForNonKubernetesTentacle:
were asserting capabilities.Count.Should().Be(expectedCapabilitiesCount)
without accounting for the new AbandonScript advertisement on Latest.
The Count check was belt-and-braces; the per-capability Contain/NotContain
assertions still carry the test's purpose, and the new dedicated
LatestTentacle_AdvertisesAbandonScriptCapability covers the abandon
advertisement on its own. Drop the Count assertion.

CancellationToken_WhenUnixGrandchildHoldsRedirectedPipes_ShouldNotHang:
the 30s task.Wait bound was hitting on slow Linux CI agents. Worst case
expected is ~10s; a real regression hangs indefinitely. Bump to 60s,
which preserves the regression-detection property without flaking on
slow agents.
Previous push only bumped the Unix task.Wait. CI showed 10 builds
still failing on grandchild-pipes tests — most likely the asymmetric
30s bounds (Windows task.Wait, Unix WaitForPidFile) were still being
hit on slow CI agents. Make both grandchild tests symmetric at 60s on
the PID-file wait and the cancel-completion wait. Regression-detection
property preserved because a real regression hangs indefinitely.
WaitForExitAsync on net8.0 waits for stream EOF after the Exited event
fires. A re-parented grandchild holding our redirected stdout/stderr open
prevents EOF, so the await hangs indefinitely on the cancel path.

Close() releases the pipe handles immediately, matching what main does
for the sync version. The cancel.Register callback runs Hitman.Kill +
Close, the readers see EOF, and the await returns.

Local Unix grandchild test goes from 60s+ hang to 80ms pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Process.WaitForExitAsync on net8.0 waits for stream EOF after the Exited
event fires (per the dotnet/runtime source). A re-parented grandchild
that inherits our redirected stdout/stderr holds the pipes open, EOF
never arrives, and the await hangs.

Adding process.Close() to cancel cleanup (matching main's sync version)
broke things differently: Close clears the Process object's stream
fields AND aborts the in-flight WaitForExitAsync, which broke
ShouldCancelPing and the AbandonScript_WhenCancelFailsToKillProcess
tests.

Instead, pass a linked cancel∪abandon token to WaitForExitAsync. When
cancel fires, Hitman.Kill runs synchronously via cancel.Register, the
linked token cancels the await, and we fall through to the existing
bounded stream-drain cleanup. The brief WaitForExit(5000) ensures the
process has actually terminated before SafelyGetExitCode reads
process.ExitCode.

Local Unix grandchild test: 62ms (was 60s+ hang).
Local ShouldCancelPing: still passes (2s).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…arantees

The previous single-catch on the linked-token await hijacked the abandon
path: cancel firing first caused the cancel-OCE catch to win, even when
the user-intended sequence was cancel-then-abandon. The AbandonScript
integration tests broke as a result.

Split into two catches with different exit-code guarantees:

  * Abandon catch: returns AbandonedExitCode unconditionally. Abandon
    is inherently a race against natural script exit — if the script
    happened to finish at the same moment we got the abandon signal,
    reporting "abandoned" is still semantically right.

  * Cancel catch: briefly WaitForExit(5000) so the kill takes effect,
    then falls through to SafelyGetExitCode for the real exit code (137
    on Linux SIGKILL, 1 on Windows TerminateProcess). Only falls back
    to AbandonedExitCode if the kill genuinely didn't land within 5s
    (e.g., Hitman disabled by test, or process is stuck) — the only
    honest signal available when there's no real exit code yet.

Local Unix grandchild test: 115ms (was 60s+ hang).
Local ShouldCancelPing: 2s with real exit code 137 preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Behavior-preserving cleanup of the cancel OperationCanceledException catch
in SilentProcessRunner. No change to returned exit codes for any path.

- Extract the 5s magic number into CancelKillGraceMilliseconds.
- Remove the nested try/catch around WaitForExit(int): on a started,
  non-detached Process it cannot throw, so the empty catch was dead
  defensiveness (and a nested try inside a catch).
- Use WaitForExit(int)'s own bool result instead of a second HasExited read.
- Rewrite comments to cite, per decision, the test that requires it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… exit

Cancel no longer breaks the WaitForExitAsync wait. It kills via cancel.Register
and waits for the process to exit naturally, returning the real exit code. Only
the abandon token breaks the wait early (-> AbandonedExitCode), leaving the OS
process running.

Consequence: a script that won't exit -- genuinely stuck, or a re-parented
grandchild holding the redirected pipes so stream EOF never arrives -- keeps
waiting until the caller abandons it. Cancel alone can no longer stop such a
script; cancel-then-abandon is required.

- Drop the linked (abandon + cancel) token, the cancel catch, and the bounded grace.
- Flip both grandchild tests to cancel-then-abandon: cancel alone keeps waiting,
  abandon stops it and returns AbandonedExitCode.

The linked-token alternative is preserved on branch
jimpelletier/eft-3295-cancel-linked-token.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cancel behaves exactly as main: kill + Close() in cancel.Register, so cancel --
including the re-parented-grandchild-holds-pipes case -- is handled by Close
releasing the pipe handles. No async, no server dependency, no behaviour change
to cancel.

Abandon is the only addition: we race the blocking process.WaitForExit() against
a task that completes when the abandon token fires (Task.WhenAny). Abandon wins
-> AbandonedExitCode, OS process left running. A genuinely un-killable script
hangs the wait until abandon fires.

First of two PRs. PR2 switches to async WaitForExitAsync (perf) and makes
grandchildren require abandon, once all servers know how to abandon.

- Restore Close() in DoOurBestToCleanUp; remove the async WaitForExitAsync
  wrapper/polyfill (PR1 uses sync WaitForExit).
- Grandchild tests assert cancel alone does not hang (Close handles it).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…oad is .NET 5+)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fold the PR1/PR2 split doc into the main abandon spec and update it to the
current design: best-effort-kill-on-abandon (AbandonScriptAsync -> Cancel() +
Abandon()), the sync-WhenAny (PR1) vs async-WaitForExitAsync (PR2) split, the
ExecuteScript abandonAfterCancellationPendingFor param, and the ADR-42 "not
killed" reversal. Deletes the now-absorbed split doc.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Abandon now best-effort-kills before releasing: the runner's abandon branch calls
DoOurBestToCleanUp (kill, idempotent) then returns AbandonedExitCode. A process
survives abandon only when the kill genuinely can't land (stuck / re-parented
grandchild). Closes the abuse vector where abandon could leave a killable process
running unmanaged.

The runner keys the abandoned result on `abandon.IsCancellationRequested` (not on
which WhenAny task won), so it is deterministic when both cancel and abandon fire.

NOTE — deviation from the spec's Section 2: the spec proposed AbandonScriptAsync
calling Cancel() then Abandon() (kill via the cancel machinery). That has a race —
cancel's kill resolves the wait to the cancel exit code before the abandon token is
observed (returned -1 instead of -48, proven by AbandonScript_WithNoPriorCancel).
Doing the kill sequentially in the runner's abandon branch is race-free. Spec to be
updated to match.

Tests: AbandonScript_WithNoPriorCancel_KillsTheProcess (abandon with no prior cancel
kills, -48); AbandonToken_WhenBothTokensFire (deterministic -48); AbandonToken
fixture tests no longer set the kill-disable env var process-wide (it leaked into
integration-test Tentacle subprocesses and caused flakiness).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…to AbandonScript

New optional parameter on TentacleClient.ExecuteScript (and ITentacleClient /
ObservingScriptOrchestrator):

    TimeSpan? abandonAfterCancellationPendingFor = null

The orchestrator's cancellation poll loop records when the CancellationToken first
fired; once elapsed crosses the threshold AND the AbandonScriptV2 capability is
advertised, it calls AbandonScript instead of CancelScript. Capability absent ->
keep cancelling (skip cleanly). The capability string is "AbandonScriptAsync"
(nameof(ScriptServiceV2.AbandonScriptAsync)), what CapabilitiesServiceV2 advertises.
ScriptExecutionResult.ExitCode (-48) is the source of truth; no shared Contracts
constant.

AbandonScript added to IScriptExecutor: ScriptServiceV2Executor does the real RPC +
capability check; V1 / Kubernetes executors fall back to CancelScript.

Tests: ObservingScriptOrchestratorAbandonTests, ScriptServiceV2ExecutorAbandonTests,
CapabilitiesResponseV2ExtensionMethodsTests, and integration
ExecuteScript_WhenCancellationStaysPending_EscalatesToAbandon.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…, not via AbandonScriptAsync calling Cancel()

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Rebuilt on top of the current PR1 (#1226): Change 1 (abandon best-effort-kills)
and Change 2 (ExecuteScript abandon-timeout). Swaps the runner from PR1's
synchronous WaitForExit + Task.WhenAny to async WaitForExitAsync (perf: no thread
blocked per running script). With the async wait, Close() can't unblock the
grandchild case (it races the Exited event WaitForExitAsync depends on), so
DoOurBestToCleanUp no longer Close()s and cancel alone can no longer unstick a
re-parented grandchild -- abandon is required. The abandon branch still
best-effort-kills (Change 1) before returning -48. Grandchild fixture tests
flipped from ShouldNotHang (cancel-via-Close) to CancelThenAbandon_StopsWaiting.

DO NOT MERGE until PR1 has shipped and all customer servers know how to abandon,
otherwise a grandchild-holding script can no longer be unstuck by cancel alone.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jimmyp jimmyp force-pushed the jimpelletier/eft-3295-pr2-async-unlink branch from d84e9e7 to d137217 Compare June 1, 2026 23:15
…-48)

Same race as PR1: firing cancel then abandon and demanding -48 races on a killable
process (cancel's kill makes WaitForExitAsync complete with the killed code before
abandon is observed). The race does not exist in production -- cancel and abandon are
sequential RPCs, abandon sent only after cancel fails to unstick the script. Renamed
and fire abandon first for determinism; deterministic stuck case covered by the
integration test AbandonScript_WhenCancelFailsToKillProcess.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant