[PR2 / DO NOT MERGE until servers support abandon] Async WaitForExitAsync; grandchildren require abandon#1244
Draft
jimmyp wants to merge 61 commits into
Draft
[PR2 / DO NOT MERGE until servers support abandon] Async WaitForExitAsync; grandchildren require abandon#1244jimmyp wants to merge 61 commits into
jimmyp wants to merge 61 commits into
Conversation
Replaces the sync WaitForExit() with await WaitForExitAsync(cancel). The cancel token is passed directly so the existing cancel semantics are preserved: cancel firing throws OCE from the await and unwinds. DoOurBestToCleanUp continues to fire on cancel via cancel.Register exactly as it did in the sync version. Adds a net48 polyfill for WaitForExitAsync using Process.Exited + TaskCompletionSource. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ISilentProcessRunner, SilentProcessRunnerWrapper, and the SilentProcessRunnerExtended helpers now return Task<int> and call SilentProcessRunner.ExecuteCommandAsync directly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CommandLineRunner.Execute is consumed by Octopus.Manager.Tentacle (a WPF app), so the public method stays sync. It now blocks on the new SilentProcessRunner.ExecuteCommandAsync via GetAwaiter().GetResult() — safe because the WPF callers dispatch through ThreadPool.QueueUserWorkItem, which has no synchronisation context. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
RunningScript now awaits SilentProcessRunner.ExecuteCommandAsync through a new RunScriptAsync helper. The monitored-startup path also awaits the async helper inside Task.Run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The six immediate sync callers of SilentProcessRunner now go through ExecuteCommandAsync(...).GetAwaiter().GetResult(): - Octopus.Manager.Tentacle PowerShellPrerequisite (WPF installer) - KubernetesDirectoryInformationProvider (IMemoryCache factory) - SystemCtlHelper (2 sites — start and sudo retry) - LinuxServiceConfigurator (3 sites — chmod, systemctl probe, sudo probe) - WindowsServiceConfigurator (sc.exe wrapper) Each site gets a comment explaining why it must be sync and why blocking on a thread-pool worker is deadlock-safe. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Updates Kubernetes integration test setup helpers, PowerShell startup-detection tests, integration support, and Linux test fixtures to call the new async ExecuteCommandAsync API. Tests that don't await directly (NUnit static helpers, cache factories) block on .GetAwaiter().GetResult() and document why it's deadlock-safe on the test threadpool. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rewrites the six sync-boundary comments to name the canonical pattern
("sync-over-async"), link Stephen Cleary's "Don't Block on Async Code"
reference, and keep the "we are here / we do this / safe because"
structure. Removes em-dashes per style preference.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…, lighthouse comments, async tests - Strip SilentProcessRunner.cs to signature-only async change (move polyfill, EnableRaisingEvents, and await WaitForExitAsync to #1226 abandon PR) - Apply lighthouse comment pattern uniformly to all sync-over-async sites including test scaffolding and the second/third sites in LinuxServiceConfigurator - Improve LinuxServiceConfigurator justification: name the sync interface chain (IServiceConfigurator -> AbstractCommand -> ICommand -> Topshelf) and the no-SynchronizationContext property of Main/Topshelf worker threads - Move SilentProcessRunnerFixture sync-over-async comment to actual call site - Convert LinuxConfigureServiceHelperFixture test methods to async Task, eliminating sync-over-async entirely in that file Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…daries Consolidates 9 sync-over-async sites into 5 single-method bridges, each of which is a tiny sync wrapper over a private async implementation. All internal helpers become fully async, removing the in-method GetAwaiter calls in SystemCtlHelper and LinuxServiceConfigurator entirely. Bridges: - PowerShellPrerequisite.Check - KubernetesDirectoryInformationProvider.GetPathUsedBytes - LinuxServiceConfigurator.ConfigureService (replaces 5 prior bridges) - WindowsServiceConfigurator.ConfigureService - CommandLineRunner.Execute (also exposes ExecuteAsync for async callers) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address PR review on #1236: the sync-over-async hop at the IKubernetesDirectoryInformationProvider boundary was carrying two consumers — a sync override (EnsureDiskHasEnoughFreeSpace) and an async one (CreateScriptContainer). Expose async siblings (GetPathUsedBytesAsync on the provider, GetStorageInformationAsync on KubernetesPhysicalFileSystem) so the async caller awaits directly. The sync GetPathUsedBytes and GetStorageInformation remain for the IOctopusFileSystem override, and the lighthouse comment now names that exact consumer instead of waving at a non-existent background sweeper.
Address PR review on #1236: - Plainer two-section "Why this is sync / Why blocking is safe" format across all sync-over-async sites; tests add a "Why low risk" line per reviewer ask. - KubernetesDirectoryInformationProvider, PowerShellPrerequisite, Linux/WindowsServiceConfigurator, KubernetesAgentInstaller, SilentProcessRunnerFixture, LinuxTestUserPrincipal (previously had no justification, now does). - LinuxTentacleFetcher.ExtractTarGzip: both callers are already async, so flip the helper to ExtractTarGzipAsync and remove the sync-over-async entirely. Update NugetTentacleFetcher.ExtractTentacle to await. - Fix WhenSettingUpPollingTentacle_TelemetryEventShouldBeSent: builder was stubbing only the sync Execute overloads, so the ExecuteAsync switch in ReviewAndRunScriptTabViewModel returned a default false Task and the telemetry callback never fired. Stub both overloads.
Address PR review on #1236: the sync and async GetStorageInformation variants had duplicated body. Factor the bytesTotal lookup + tuple assembly into a private sync helper so the sync/async pair we need stays DRY.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Adds CancellationToken abandon parameter - Switches the await from sync body to await WaitForExitAsync(abandon) - Returns AbandonedExitCode when abandon fires before process exits - Adds net48 polyfill (WaitForExitAsyncNetFramework) - Sets process.EnableRaisingEvents = true - Removes process.Close() race in DoOurBestToCleanUp - Adds SafelyCancelOutputAndErrorRead helper - Adds TentacleDebugDisableProcessKill test affordance to Hitman Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four caller sites that bridge sync interfaces to the (now-takes-abandon) ExecuteCommandAsync pass CancellationToken.None for the new abandon parameter (these callers don't need abandon semantics): - PowerShellPrerequisite.CheckPowerShellIsInstalledAsync - KubernetesDirectoryInformationProvider.GetDriveBytesUsingDuAsync - WindowsServiceConfigurator.ScAsync - CommandLineRunner.ExecuteAsync SystemCtlHelper and LinuxServiceConfigurator call the CommandLineInvocation extension overloads, which don't expose an abandon parameter — they default to CancellationToken.None internally, so no edits needed there. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… cancelled Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ace cleanup Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t helper Test infrastructure (Kubernetes setup, tentacle fetcher, PowerShell tests) and the IAsyncClientScriptServiceV2 test decorator now thread the new abandon parameter through. Also extracts SafelyCancelOutputAndErrorRead in SilentProcessRunner so the normal-completion path and the abandon-catch path stay in sync. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CapabilitiesServiceV2 was advertising nameof(IAsyncClientScriptServiceV2 .AbandonScriptAsync) which evaluates to the method name "AbandonScriptAsync", not the intended capability name. Both the unit test fixture (CapabilitiesServiceV2Fixture) and the EFT-3295 commit subject expect "AbandonScriptV2"; the integration test in CapabilitiesServiceV2Test had the same nameof() mistake. Replace with the literal "AbandonScriptV2" in production and the integration test, drop the now-unused IAsyncClientScriptServiceV2 import. Verify against the server's expected capability name before merging.
Address PR review on #1226: the method waits for whatever PID you pass it, not specifically a grandchild process. Renamed to WaitForPidFileAsync and updated call sites. Also updated the timeout-exception message inside the method to drop grandchild-specific wording. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR review on #1226: - ScriptServiceV2.StartScriptAsync: tighter comment on the double-call guard at the script-state check (per Jim's suggestion). - ScriptServiceV2.AbandonScriptAsync: rewrite the abandon-effect comment to spell out the WaitForExitAsync return path and the server's follow-up GetStatus pattern. - ScriptServiceV2.CompleteScriptAsync: remove a stale comment on the running-script bookkeeping disposal, expand the abandon-aware workspace.Delete comment, and link to the customer docs and AbandonScriptCommandV2 contract. - ScriptServiceV2.WasAbandoned: extracted helper covering the state-store-load + AbandonedExitCode check so the conditional in CompleteScriptAsync reads cleanly. - RunningScript.Execute: condensed the three catch-clause comments into one-liners explaining why each fires (abandon vs cancel vs mutex timeout) rather than restating the behaviour.
…ests Address PR review on #1226: - Production: replace the "AbandonScriptV2" literal with nameof(ScriptServiceV2.AbandonScriptAsync) on the non-Kubernetes capabilities path, matching the nameof pattern used for every other capability string in the codebase. Wire string is now "AbandonScriptAsync". - Unit fixture: same nameof in both AbandonScript-related assertions. - Drop the redundant .Count.Should().Be(N) assertions; BeEquivalentTo already checks element count. - Integration test: revert the AbandonScript assertions that were added inside CapabilitiesFromAnOlderTentacleWhichHasNoCapabilitiesService_... and CapabilitiesServiceDoesNotReturnKubernetesScriptServiceForNonKubernetesTentacle. Each test's scope is its own concern; AbandonScript advertisement gets a dedicated test (LatestTentacle_AdvertisesAbandonScriptCapability), modeled on the K8s-specific test pattern in the same file.
PR review on #1226: - Rewrote the outputResetEvent.Wait comment per Jim's suggestion (explains the EOF wait semantics, the no-pipe-close behaviour, and the grandchild case). - Clarified the 'workspace log writer' comment by naming the failure mode explicitly (late OutputDataReceived after the workspace was disposed would throw ObjectDisposedException). - Expanded the process.Close() removal comment with the full old-sync / new-async explanation Jim wrote, including a link to the Microsoft docs on Process.WaitForExitAsync output-processing behaviour. - Renamed TentacleDebugDisableProcessKill to TentacleDebugDisableProcessKill_UNSAFE_FOR_PRODUCTION so the variable's danger is obvious to anyone reading it; updated the declaration in EnvironmentVariables and the references in SilentProcessRunner's Hitman helper and the ClientScriptExecutionAbandon integration tests.
PR review on #1226: the call sites that didn't have a token chain to forward were passing `cancel: CancellationToken.None, abandon: CancellationToken.None` explicitly to ExecuteCommandAsync. The signature already defaults both, so the explicit None is noise. Drop it at every site that doesn't have a real token to pass. Leaves the defaults in place so CA2016 can be enabled later to catch the 'I have a token in scope and forgot to forward it' case at the analyzer level.
Address PR review on #1226: the inline #if NETFRAMEWORK / #else block at the WaitForExit call site, plus the separate WaitForExitAsyncNetFramework helper, are collapsed into a single WaitForProcessExitAsync method whose body contains the pragma. The call site reads as one ordinary await; the comments above it refer to WaitForProcessExitAsync as the named place where we block, matching the long process.Close() removal comment.
Address PR review on #1226: ClientAndTentacle previously exposed the HalibutRuntime so the abandon integration test could construct its own IAsyncClientScriptServiceV2 proxy and call AbandonScriptAsync directly. TentacleClient now surfaces AbandonScript as a first-class method, the test uses it, and the direct-Halibut exposure is rolled back. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address PR review on #1226: both 'when workspace.Delete fails' tests were constructing a second ScriptServiceV2 with a mock workspace factory only for CompleteScriptAsync, while StartScriptAsync / AbandonScriptAsync had run on the fixture-level service. Because running-script state is per-instance, the new service didn't know the script existed, and the test never actually reached the abandon-aware Delete tolerance code. Refactored both tests to run on a single test-specific ScriptServiceV2 instance built with a DeleteThrowingScriptWorkspaceFactory decorator that forwards every member through to a real workspace except Delete, which throws. The decorator replaces the NSubstitute-based helper so any workspace member StartScript or RunningScript may call passes through to the real implementation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR review on #1226: - Move process.EnableRaisingEvents = true inside #if NETFRAMEWORK so it only fires on the netframework target where the polyfill needs it. Microsoft docs confirm Process.WaitForExitAsync sets the flag itself on .NET 5+, so the unconditional assignment was only doing real work on the framework target. - Drop EFT-3295 ticket reference from the abandon-token comment; just say "the abandon feature". - Extract the netframework polyfill body from WaitForProcessExitAsync into a separate WaitForProcessExitAsyncNetFrameworkPolyfill method so both pragma branches are one-liners. - Shrink the long process.Close() removal comment to a 3-line pointer; the full history lives in a PR comment on the line.
…xposure PR review on #1226: abandon no longer needs the direct Halibut bypass now that TentacleClient.AbandonScript exists. Refactored the two ClientScriptExecutionAbandon test sites to call TentacleClient.CancelScript and TentacleClient.GetStatus through new ScriptTicket-based extension helpers in TentacleClientExtensionMethods (which synthesize a CommandContext for ScriptServiceVersion2 internally, since the tests are interleaving RPCs alongside an in-flight ExecuteScript task and don't carry a CommandContext from a prior orchestrated call). Removed CreateScriptServiceV2Client from ClientAndTentacle along with the now-unused Halibut and ScriptServiceV2 contract usings. The halibutRuntime field stays because LegacyTentacleClientBuilder still uses it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR review on #1226: replaced the [SetUp]-method approach for constructing the SUT with a ScriptServiceV2Builder. Defaults match what SetUp did; tests opt into mock overrides via .WithX(...) chained calls. The CompleteScript_*WhenWorkspaceDeleteFails* tests now build their SUT through the builder with the throwing-Delete factory and mock log injected via the new With methods. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Drop the explanatory comment block above the TentacleClient CancelScript/GetStatus extension overloads per the latest review note. - Move process.EnableRaisingEvents = true off the process configuration block and into WaitForProcessExitAsyncNetFrameworkPolyfill itself, so the pragma stops polluting ExecuteCommandAsync's body. The polyfill only runs on netframework, so EnableRaisingEvents only fires there; on .NET 8+ Process.WaitForExitAsync sets the flag itself.
- Delete the two plan files under docs/superpowers/plans/. Per Luke and Jim: plans are merge artefacts, specs stay (Jim is asking Luke to experiment with keeping specs for intent-discovery). - Drop the unused `using System.Threading;` from KubernetesDirectoryInformationProvider.cs (leftover from the earlier cancel-token-then-removed pass). - Assert that the sleep process is still running in the abandon test before the finally-block force-kills it. We don't care in production whether the script keeps running after abandon; the assertion is test-fixture confidence that we exercised the abandon path rather than accidentally cancelling.
Address PR review on #1226: Luke flagged the middle-layer abandon test as not pulling its weight given coverage at the top (TentacleClient integration) and bottom (SilentProcessRunner unit) layers. Deleted the test plus the WaitForPidFileAsync / SafelyReadPidFile helpers that only it used.
Address PR review on #1226: 7-line explanation shrunk to a 2-line intent statement.
CapabilitiesFromAnOlderTentacle... and CapabilitiesServiceDoesNotReturnKubernetesScriptServiceForNonKubernetesTentacle: were asserting capabilities.Count.Should().Be(expectedCapabilitiesCount) without accounting for the new AbandonScript advertisement on Latest. The Count check was belt-and-braces; the per-capability Contain/NotContain assertions still carry the test's purpose, and the new dedicated LatestTentacle_AdvertisesAbandonScriptCapability covers the abandon advertisement on its own. Drop the Count assertion. CancellationToken_WhenUnixGrandchildHoldsRedirectedPipes_ShouldNotHang: the 30s task.Wait bound was hitting on slow Linux CI agents. Worst case expected is ~10s; a real regression hangs indefinitely. Bump to 60s, which preserves the regression-detection property without flaking on slow agents.
Previous push only bumped the Unix task.Wait. CI showed 10 builds still failing on grandchild-pipes tests — most likely the asymmetric 30s bounds (Windows task.Wait, Unix WaitForPidFile) were still being hit on slow CI agents. Make both grandchild tests symmetric at 60s on the PID-file wait and the cancel-completion wait. Regression-detection property preserved because a real regression hangs indefinitely.
WaitForExitAsync on net8.0 waits for stream EOF after the Exited event fires. A re-parented grandchild holding our redirected stdout/stderr open prevents EOF, so the await hangs indefinitely on the cancel path. Close() releases the pipe handles immediately, matching what main does for the sync version. The cancel.Register callback runs Hitman.Kill + Close, the readers see EOF, and the await returns. Local Unix grandchild test goes from 60s+ hang to 80ms pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…dles" This reverts commit f1cfcc1.
Process.WaitForExitAsync on net8.0 waits for stream EOF after the Exited event fires (per the dotnet/runtime source). A re-parented grandchild that inherits our redirected stdout/stderr holds the pipes open, EOF never arrives, and the await hangs. Adding process.Close() to cancel cleanup (matching main's sync version) broke things differently: Close clears the Process object's stream fields AND aborts the in-flight WaitForExitAsync, which broke ShouldCancelPing and the AbandonScript_WhenCancelFailsToKillProcess tests. Instead, pass a linked cancel∪abandon token to WaitForExitAsync. When cancel fires, Hitman.Kill runs synchronously via cancel.Register, the linked token cancels the await, and we fall through to the existing bounded stream-drain cleanup. The brief WaitForExit(5000) ensures the process has actually terminated before SafelyGetExitCode reads process.ExitCode. Local Unix grandchild test: 62ms (was 60s+ hang). Local ShouldCancelPing: still passes (2s). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…arantees
The previous single-catch on the linked-token await hijacked the abandon
path: cancel firing first caused the cancel-OCE catch to win, even when
the user-intended sequence was cancel-then-abandon. The AbandonScript
integration tests broke as a result.
Split into two catches with different exit-code guarantees:
* Abandon catch: returns AbandonedExitCode unconditionally. Abandon
is inherently a race against natural script exit — if the script
happened to finish at the same moment we got the abandon signal,
reporting "abandoned" is still semantically right.
* Cancel catch: briefly WaitForExit(5000) so the kill takes effect,
then falls through to SafelyGetExitCode for the real exit code (137
on Linux SIGKILL, 1 on Windows TerminateProcess). Only falls back
to AbandonedExitCode if the kill genuinely didn't land within 5s
(e.g., Hitman disabled by test, or process is stuck) — the only
honest signal available when there's no real exit code yet.
Local Unix grandchild test: 115ms (was 60s+ hang).
Local ShouldCancelPing: 2s with real exit code 137 preserved.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Behavior-preserving cleanup of the cancel OperationCanceledException catch in SilentProcessRunner. No change to returned exit codes for any path. - Extract the 5s magic number into CancelKillGraceMilliseconds. - Remove the nested try/catch around WaitForExit(int): on a started, non-detached Process it cannot throw, so the empty catch was dead defensiveness (and a nested try inside a catch). - Use WaitForExit(int)'s own bool result instead of a second HasExited read. - Rewrite comments to cite, per decision, the test that requires it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… exit Cancel no longer breaks the WaitForExitAsync wait. It kills via cancel.Register and waits for the process to exit naturally, returning the real exit code. Only the abandon token breaks the wait early (-> AbandonedExitCode), leaving the OS process running. Consequence: a script that won't exit -- genuinely stuck, or a re-parented grandchild holding the redirected pipes so stream EOF never arrives -- keeps waiting until the caller abandons it. Cancel alone can no longer stop such a script; cancel-then-abandon is required. - Drop the linked (abandon + cancel) token, the cancel catch, and the bounded grace. - Flip both grandchild tests to cancel-then-abandon: cancel alone keeps waiting, abandon stops it and returns AbandonedExitCode. The linked-token alternative is preserved on branch jimpelletier/eft-3295-cancel-linked-token. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cancel behaves exactly as main: kill + Close() in cancel.Register, so cancel -- including the re-parented-grandchild-holds-pipes case -- is handled by Close releasing the pipe handles. No async, no server dependency, no behaviour change to cancel. Abandon is the only addition: we race the blocking process.WaitForExit() against a task that completes when the abandon token fires (Task.WhenAny). Abandon wins -> AbandonedExitCode, OS process left running. A genuinely un-killable script hangs the wait until abandon fires. First of two PRs. PR2 switches to async WaitForExitAsync (perf) and makes grandchildren require abandon, once all servers know how to abandon. - Restore Close() in DoOurBestToCleanUp; remove the async WaitForExitAsync wrapper/polyfill (PR1 uses sync WaitForExit). - Grandchild tests assert cancel alone does not hang (Close handles it). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…oad is .NET 5+) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fold the PR1/PR2 split doc into the main abandon spec and update it to the current design: best-effort-kill-on-abandon (AbandonScriptAsync -> Cancel() + Abandon()), the sync-WhenAny (PR1) vs async-WaitForExitAsync (PR2) split, the ExecuteScript abandonAfterCancellationPendingFor param, and the ADR-42 "not killed" reversal. Deletes the now-absorbed split doc. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Abandon now best-effort-kills before releasing: the runner's abandon branch calls DoOurBestToCleanUp (kill, idempotent) then returns AbandonedExitCode. A process survives abandon only when the kill genuinely can't land (stuck / re-parented grandchild). Closes the abuse vector where abandon could leave a killable process running unmanaged. The runner keys the abandoned result on `abandon.IsCancellationRequested` (not on which WhenAny task won), so it is deterministic when both cancel and abandon fire. NOTE — deviation from the spec's Section 2: the spec proposed AbandonScriptAsync calling Cancel() then Abandon() (kill via the cancel machinery). That has a race — cancel's kill resolves the wait to the cancel exit code before the abandon token is observed (returned -1 instead of -48, proven by AbandonScript_WithNoPriorCancel). Doing the kill sequentially in the runner's abandon branch is race-free. Spec to be updated to match. Tests: AbandonScript_WithNoPriorCancel_KillsTheProcess (abandon with no prior cancel kills, -48); AbandonToken_WhenBothTokensFire (deterministic -48); AbandonToken fixture tests no longer set the kill-disable env var process-wide (it leaked into integration-test Tentacle subprocesses and caused flakiness). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…to AbandonScript
New optional parameter on TentacleClient.ExecuteScript (and ITentacleClient /
ObservingScriptOrchestrator):
TimeSpan? abandonAfterCancellationPendingFor = null
The orchestrator's cancellation poll loop records when the CancellationToken first
fired; once elapsed crosses the threshold AND the AbandonScriptV2 capability is
advertised, it calls AbandonScript instead of CancelScript. Capability absent ->
keep cancelling (skip cleanly). The capability string is "AbandonScriptAsync"
(nameof(ScriptServiceV2.AbandonScriptAsync)), what CapabilitiesServiceV2 advertises.
ScriptExecutionResult.ExitCode (-48) is the source of truth; no shared Contracts
constant.
AbandonScript added to IScriptExecutor: ScriptServiceV2Executor does the real RPC +
capability check; V1 / Kubernetes executors fall back to CancelScript.
Tests: ObservingScriptOrchestratorAbandonTests, ScriptServiceV2ExecutorAbandonTests,
CapabilitiesResponseV2ExtensionMethodsTests, and integration
ExecuteScript_WhenCancellationStaysPending_EscalatesToAbandon.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…, not via AbandonScriptAsync calling Cancel() Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Rebuilt on top of the current PR1 (#1226): Change 1 (abandon best-effort-kills) and Change 2 (ExecuteScript abandon-timeout). Swaps the runner from PR1's synchronous WaitForExit + Task.WhenAny to async WaitForExitAsync (perf: no thread blocked per running script). With the async wait, Close() can't unblock the grandchild case (it races the Exited event WaitForExitAsync depends on), so DoOurBestToCleanUp no longer Close()s and cancel alone can no longer unstick a re-parented grandchild -- abandon is required. The abandon branch still best-effort-kills (Change 1) before returning -48. Grandchild fixture tests flipped from ShouldNotHang (cancel-via-Close) to CancelThenAbandon_StopsWaiting. DO NOT MERGE until PR1 has shipped and all customer servers know how to abandon, otherwise a grandchild-holding script can no longer be unstuck by cancel alone. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
d84e9e7 to
d137217
Compare
…-48) Same race as PR1: firing cancel then abandon and demanding -48 races on a killable process (cancel's kill makes WaitForExitAsync complete with the killed code before abandon is observed). The race does not exist in production -- cancel and abandon are sequential RPCs, abandon sent only after cancel fails to unstick the script. Renamed and fire abandon first for determinism; deterministic stuck case covered by the integration test AbandonScript_WhenCancelFailsToKillProcess. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Follow-up to #1226. Do not merge until #1226 has shipped and all customer servers know how to abandon a script.
Switches
SilentProcessRunnerfrom #1226's synchronousWaitForExit()+Task.WhenAny(abandon)to asyncWaitForExitAsync, for performance (no thread blocked per running script).With the async wait,
process.Close()can't be used to unblock the grandchild-holds-pipes case (it races theExitedevent the async wait depends on), so:This is why it must wait for #1226: until servers reliably call abandon, a grandchild-holding script could no longer be unstuck by cancel alone.
The two grandchild tests are flipped to assert the cancel-then-abandon contract.
🤖 Generated with Claude Code