Make SilentProcessRunner.Execute async#1236

Merged
jimmyp merged 12 commits into main from jimpelletier/eft-3295-async-migration-base
May 28, 2026

Conversation

Contributor

@jimmyp jimmyp commented May 25, 2026

Why

Tentacle needs to release the script-isolation mutex even when a script can't be cancelled. Philips is running into situations where CrowdStrike and Rapid7 contention freezes the script running on the Tentacle and our cancel signal can't shake it loose. The only recovery today is to restart Tentacle.

To make that possible in a follow-up PR, SilentProcessRunner.ExecuteCommand needs to become async, which makes everything that calls it async too.

This PR just adds the changes to method signatures that need to become async, to make the next riskier PR easier to review.

What changed

Replaces SilentProcessRunner.ExecuteCommand with ExecuteCommandAsync, plus the matching interfaces (ISilentProcessRunner, ICommandLineRunner) and their wrappers, and updates all ~20 callers across production code, integration tests, and test scaffolding.

There are 5 places we synchronously call these new async call paths, each documented well in code:

  • KubernetesDirectoryInformationProvider.GetPathUsedBytes (only called via an IOctopusFileSystem sync override)
  • PowerShellPrerequisite.Check (IPrerequisite is sync, used by the WPF installer)
  • CommandLineRunner.Execute (sync wrapper for the WPF installer flow)
  • LinuxServiceConfigurator and WindowsServiceConfigurator (Topshelf's ServiceCommand → AbstractCommand → ICommand.Start chain is sync top to bottom)

Risks & mitigations

The classic sync-over-async deadlock is the obvious worry here. .GetAwaiter().GetResult() can hang when the calling thread has a SynchronizationContext, because the awaited continuation is posted back to the very thread that is blocked waiting for it. See each call site for why it is safe.
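
For readers unfamiliar with the pattern, here is a minimal sketch of the kind of sync bridge this PR adds at each boundary. `RunProcessAsync` is a hypothetical stand-in for `SilentProcessRunner.ExecuteCommandAsync`; the real signature takes more parameters and is not reproduced here.

```csharp
using System.Threading;
using System.Threading.Tasks;

public static class SyncBridgeSketch
{
    // Hypothetical stand-in for an async process runner; returns an exit code.
    public static async Task<int> RunProcessAsync(CancellationToken cancel)
    {
        await Task.Delay(50, cancel); // pretend the process takes a moment
        return 0;
    }

    // Sync bridge: blocks the calling thread until the async call completes.
    // This is sync-over-async. It is only safe when the calling thread has no
    // SynchronizationContext that the awaited continuation must resume on.
    public static int RunProcess(CancellationToken cancel)
        => RunProcessAsync(cancel).GetAwaiter().GetResult();
}
```

Called from a console `Main` or a thread-pool worker, `SyncBridgeSketch.RunProcess(CancellationToken.None)` blocks for the duration of the call and then returns the exit code.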

Mock vs. production drift in tests was a surprising issue I ran into: in places where we mock Execute, we needed to switch the stub to ExecuteAsync. We hit this on WhenSettingUpPollingTentacle_TelemetryEventShouldBeSent, and the wizard model builder now stubs both.

For Kubernetes agent owners

This PR touches your ownership area but should cause no behaviour changes.

  • KubernetesDirectoryInformationProvider: exposes sync (existing) and async (new) ways to read path usage. Same du invocation, same 30-second cache; we need both because one downstream is sync and another is async.
  • KubernetesPhysicalFileSystem: same shape. Sync GetStorageInformation stays, new async GetStorageInformationAsync added. Both go through one private helper so the two never drift.
  • KubernetesScriptPodCreator.CreateScriptContainer: previously blocked on async work indirectly. Now awaits the new async sibling directly. No behaviour change, just one fewer sync-over-async hop on your script-pod creation path.
  • KubernetesAgentInstaller.Dispose (integration test scaffolding only): still blocks on async because Dispose is sync.
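
A rough sketch of the "sync and async siblings share one private helper" shape described above. All names, the constructor, and the tuple layout are assumptions for illustration; the real `KubernetesPhysicalFileSystem` members are not shown in this conversation.

```csharp
using System;
using System.Threading.Tasks;

public class StorageInfoSketch
{
    readonly Func<Task<long>> getUsedBytesAsync; // hypothetical async "du" lookup
    readonly long totalBytes;                    // hypothetical volume size

    public StorageInfoSketch(Func<Task<long>> getUsedBytesAsync, long totalBytes)
    {
        this.getUsedBytesAsync = getUsedBytesAsync;
        this.totalBytes = totalBytes;
    }

    // Sync variant for sync-only callers: blocks on the async lookup.
    public (long FreeBytes, long TotalBytes) GetStorageInformation()
        => Build(getUsedBytesAsync().GetAwaiter().GetResult());

    // Async variant for callers that are async end to end.
    public async Task<(long FreeBytes, long TotalBytes)> GetStorageInformationAsync()
        => Build(await getUsedBytesAsync());

    // Shared sync helper: the only place the result is assembled,
    // so the two public variants cannot drift apart.
    (long FreeBytes, long TotalBytes) Build(long usedBytes)
        => (totalBytes - usedBytes, totalBytes);
}
```

With `new StorageInfoSketch(() => Task.FromResult(40L), 100L)`, both variants produce `(60, 100)`.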

What we'd love your eyes on: there's one remaining sync-over-async hop in the K8s agent at EnsureDiskHasEnoughFreeSpace, because it implements a sync filesystem interface. It blocks on the async du path. This is safe in a normal .NET console app because there's no SynchronizationContext for the awaited continuation to come back to. If you know of anything in the K8s agent's host startup (pod watchers, bootstrap, a custom task scheduler) that would install a SynchronizationContext and break that assumption, the disk-space check is where the deadlock would surface. Worth a quick sanity check against your mental model of how the agent boots up.

How to review

  • Is this change safe?
  • Is there anything you would do to de-risk it further beyond a green build?

jimmyp and others added 6 commits May 25, 2026 12:17
Replaces the sync WaitForExit() with await WaitForExitAsync(cancel).
The cancel token is passed directly so the existing cancel semantics
are preserved: cancel firing throws OCE from the await and unwinds.
DoOurBestToCleanUp continues to fire on cancel via cancel.Register
exactly as it did in the sync version.

Adds a net48 polyfill for WaitForExitAsync using Process.Exited +
TaskCompletionSource.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
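
For context, a polyfill built from Process.Exited and TaskCompletionSource could look roughly like the sketch below. This is not the code from the commit; the method name `WaitForExitPolyfillAsync` is hypothetical, chosen so it does not clash with the built-in `Process.WaitForExitAsync` on newer runtimes.

```csharp
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public static class ProcessWaitSketch
{
    public static Task WaitForExitPolyfillAsync(this Process process, CancellationToken cancel)
    {
        var tcs = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);

        // Exited is only raised when EnableRaisingEvents is true.
        process.EnableRaisingEvents = true;
        process.Exited += (_, __) => tcs.TrySetResult(true);

        // The process may have exited before the handler was attached.
        if (process.HasExited)
            tcs.TrySetResult(true);

        // Cancelling ends the wait with an OperationCanceledException.
        // It does not kill the process; that remains the caller's job.
        var registration = cancel.Register(() => tcs.TrySetCanceled(cancel));
        tcs.Task.ContinueWith(_ => registration.Dispose(), TaskScheduler.Default);

        return tcs.Task;
    }
}
```

The `HasExited` check after subscribing covers the race where the process finishes between start and subscription; `TrySetResult` is idempotent, so a double signal is harmless.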
ISilentProcessRunner, SilentProcessRunnerWrapper, and the
SilentProcessRunnerExtended helpers now return Task<int> and call
SilentProcessRunner.ExecuteCommandAsync directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CommandLineRunner.Execute is consumed by Octopus.Manager.Tentacle
(a WPF app), so the public method stays sync. It now blocks on the
new SilentProcessRunner.ExecuteCommandAsync via GetAwaiter().GetResult()
— safe because the WPF callers dispatch through ThreadPool.QueueUserWorkItem,
which has no synchronisation context.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
RunningScript now awaits SilentProcessRunner.ExecuteCommandAsync
through a new RunScriptAsync helper. The monitored-startup path
also awaits the async helper inside Task.Run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The six immediate sync callers of SilentProcessRunner now go through
ExecuteCommandAsync(...).GetAwaiter().GetResult():
  - Octopus.Manager.Tentacle PowerShellPrerequisite (WPF installer)
  - KubernetesDirectoryInformationProvider (IMemoryCache factory)
  - SystemCtlHelper (2 sites — start and sudo retry)
  - LinuxServiceConfigurator (3 sites — chmod, systemctl probe, sudo probe)
  - WindowsServiceConfigurator (sc.exe wrapper)

Each site gets a comment explaining why it must be sync and why
blocking on a thread-pool worker is deadlock-safe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Updates Kubernetes integration test setup helpers, PowerShell
startup-detection tests, integration support, and Linux test
fixtures to call the new async ExecuteCommandAsync API.

Tests that don't await directly (NUnit static helpers, cache
factories) block on .GetAwaiter().GetResult() and document why
it's deadlock-safe on the test threadpool.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rewrites the six sync-boundary comments to name the canonical pattern
("sync-over-async"), link Stephen Cleary's "Don't Block on Async Code"
reference, and keep the "we are here / we do this / safe because"
structure. Removes em-dashes per style preference.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
process.BeginErrorReadLine();

process.WaitForExit();
#if NETFRAMEWORK
Contributor Author

Remove all changes from this method implementation, this should be a signature change only in this class

}
}

#if NETFRAMEWORK
Contributor Author

Remove this and put it in the next PR


var exitCode = SilentProcessRunner.ExecuteCommand(
// Dispose() cannot be made async; .GetAwaiter().GetResult() is safe here
// because this runs in test teardown (not inside an async context with a sync-blocking SynchronizationContext).
Contributor Author

This doesn't match our lighthouse example comment on the wpf installer. Specifically where it calls out the pattern and links the blog post

public void Dispose()
{
var exitCode = SilentProcessRunner.ExecuteCommand(
// Dispose() cannot be made async; .GetAwaiter().GetResult() is safe here
Contributor Author

This also doesn't follow our lighthouse comment example


var chmodCmd = new CommandLineInvocation("/bin/bash", $"-c \"chmod 777 {scriptPath}\"");
chmodCmd.ExecuteCommand();
// Safe: sync test helper, no synchronisation context.
Contributor Author

Does not follow our lighthouse comment example

Contributor Author

Also there are many instances of this in this class. Can we create a helper method to dry it up or push this sync to async transition higher?

log,
CancellationToken.None);
cancel: CancellationToken.None)
// Safe: static void helper, no synchronisation context.
Contributor Author

Does not follow our lighthouse comment example

{
var commandLineInvocation = new CommandLineInvocation("/bin/bash", arguments);
var result = commandLineInvocation.ExecuteCommand();
// Safe: constructor-time helper, no synchronisation context.
Contributor Author

Does not follow our lighthouse comment example

}

// Sync-over-async is safe here: NUnit runs tests on a plain ThreadPool thread with no
// synchronisation context, so there is no risk of deadlock.
Contributor Author

Does not follow our lighthouse comment example and is not at the actual call site

var result = commandLineInvocation.ExecuteCommand();
// We're in WriteUnitFile, called from IServiceConfigurator.ConfigureService implementations
// which are sync (called from the Tentacle service-management CLI), so we block on the
// async call with .GetAwaiter().GetResult().
Contributor Author

It's not clear why these things have to be sync.

// We're in WriteUnitFile, called from IServiceConfigurator.ConfigureService implementations
// which are sync (called from the Tentacle service-management CLI), so we block on the
// async call with .GetAwaiter().GetResult().
// This is sync-over-async but is safe because the CLI dispatches us on a plain
Contributor Author

How do we know the CLI dispatches us on a thread-pool worker?

var commandLineInvocation = new CommandLineInvocation("/bin/bash", "-c \"command -v systemctl >/dev/null\"");
var result = commandLineInvocation.ExecuteCommand();
// Same sync-over-async boundary as WriteUnitFile: CheckSystemPrerequisites is called
// from the sync ConfigureService path, on a plain thread-pool worker.
Contributor Author

Does not follow our lighthouse comment example

{
var commandLineInvocation = new CommandLineInvocation("/bin/bash", "-c \"sudo -vn 2> /dev/null\"");
var result = commandLineInvocation.ExecuteCommand();
// Same sync-over-async boundary as IsSystemdInstalled.
Contributor Author

Does not follow our lighthouse comment example

…, lighthouse comments, async tests

- Strip SilentProcessRunner.cs to signature-only async change (move polyfill,
  EnableRaisingEvents, and await WaitForExitAsync to #1226 abandon PR)
- Apply lighthouse comment pattern uniformly to all sync-over-async sites
  including test scaffolding and the second/third sites in LinuxServiceConfigurator
- Improve LinuxServiceConfigurator justification: name the sync interface chain
  (IServiceConfigurator -> AbstractCommand -> ICommand -> Topshelf) and the
  no-SynchronizationContext property of Main/Topshelf worker threads
- Move SilentProcessRunnerFixture sync-over-async comment to actual call site
- Convert LinuxConfigureServiceHelperFixture test methods to async Task,
  eliminating sync-over-async entirely in that file

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
// synchronously, so we block on the async call with .GetAwaiter().GetResult().
// This is sync-over-async but is safe because the installer dispatches us on a
// plain thread-pool worker. No captured SynchronizationContext, so no deadlock.
// See https://blog.stephencleary.com/2012/07/dont-block-on-async-code.html
Contributor Author

@LukeButters do you have enough knowledge of this repo to verify Claude's claims on each of these? Or shall I organise a sync for us to go through them one by one?

Contributor

Maybe @sburmanoctopus knows about this, he has done UI programming.

Contributor

From my WPF days, blocking would cause the UI to freeze, which wasn't great.

When async/await came along, it was amazing, because you could do async stuff as long as you added .ConfigureAwait(true) to the end. This would ensure the logic returned to the calling thread, which was the UI thread. This freed up the UI thread for other things while async code was running, and didn't block the UI thread anymore!

So by default, I would suggest using .ConfigureAwait(true).

However, in saying that, I have no context on where this code is being called, how it's being called etc. It's possible something else is dealing with the thread locking issue, or this is being called from a place where we must block the UI thread.

Contributor Author

It's being called from Window.Start via DispatchHelper.Background here, which queues it via ThreadPool.QueueUserWorkItem. I think that means it is not on the UI thread, and because ThreadPool.QueueUserWorkItem doesn't add a synchronisation context, we aren't susceptible to blocking?

I'd love it if someone could double check my thinking here. Claude can you give your 2c as well?

Contributor

Ah yes.

Yeah, the Start method is being called by the Loaded event, which is on the UI thread. But like you said, since the Start method calls ThreadPool.QueueUserWorkItem, any work done there is on a thread from the thread pool (so not the UI thread).

In which case, all we would be doing here is blocking the thread from the thread pool. If there aren't many processes being run on the thread pool, I don't think this is an issue. Especially for a manager UI.

It would be nice to see if we could make that whole call chain async, but I'm not sure that's worth the effort.
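
The reasoning in this thread can be checked with a small sketch: work queued with ThreadPool.QueueUserWorkItem does not see the synchronisation context of the thread that queued it. The sketch below imitates a UI thread by installing a plain SynchronizationContext on a dedicated caller thread; it does not use WPF or the real DispatchHelper.Background, whose internals are assumed here, not shown.

```csharp
using System.Threading;

public static class BackgroundDispatchSketch
{
    // Returns true if the thread-pool work item observed a SynchronizationContext.
    public static bool WorkItemSeesContext()
    {
        var sawContext = true;
        var caller = new Thread(() =>
        {
            // Pretend to be a UI thread by installing a context on the caller.
            SynchronizationContext.SetSynchronizationContext(new SynchronizationContext());

            using var done = new ManualResetEventSlim();
            ThreadPool.QueueUserWorkItem(_ =>
            {
                // The caller's context is not carried over to the pool thread.
                sawContext = SynchronizationContext.Current != null;
                done.Set();
            });
            done.Wait();
        });
        caller.Start();
        caller.Join();
        return sawContext;
    }
}
```

`BackgroundDispatchSketch.WorkItemSeesContext()` returns `false`, which is the property the `.GetAwaiter().GetResult()` call in the installer flow relies on.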

Contributor Author

Yeah I tried. It runs into an issue when another host doesn't have an async main entry point. I could push it to there. But I think I'm happy with where I've gotten to so far

// This is sync-over-async but is safe because the cache factory runs on a plain
// thread-pool worker. No captured SynchronizationContext, so no deadlock.
// See https://blog.stephencleary.com/2012/07/dont-block-on-async-code.html
var exitCode = silentProcessRunner.ExecuteCommandAsync("du", $"-s -B 1 {directoryPath}", "/", stdOut.Add, stdErr.Add)
Contributor

@jimmyp Might want to check with modern deployments on this one or push the async.

Contributor Author

Took the second option @LukeButters: pushed the async sibling up so the in-process async caller awaits directly.

The two consumers of this method had different async-ness:

  • KubernetesScriptPodCreator.CreateScriptContainer is async end-to-end. It used to call sync GetStorageInformation() → sync GetPathUsedBytes() → block on async. That hop is gone — CreateScriptContainer now awaits a new GetStorageInformationAsync() / GetPathUsedBytesAsync() pair.
  • KubernetesPhysicalFileSystem.EnsureDiskHasEnoughFreeSpace is the IOctopusFileSystem override and is sync because the interface is sync. The sync GetPathUsedBytes still exists for this one consumer.

So we still have one sync-over-async on the file-system override path. Hoping you can do a sanity check on it @liam-mackie since you wrote the original — the relevant safety claim is:

The Kubernetes agent is a console process (static int Main), so .NET doesn't install a default SynchronizationContext. I've also confirmed by binary inspection that Halibut 8.1.1633 (the version Tentacle uses) contains no references to SynchronizationContext or SetSynchronizationContext — its request-handler threads don't install one either. As long as nothing in the K8s-agent host startup (pod watcher, bootstrap wrappers, anything you'd recognise) installs one, the remaining .GetAwaiter().GetResult() on the EnsureDiskHasEnoughFreeSpace path is deadlock-safe.

An example of what would make it unsafe (so you know what to look for): something like SynchronizationContext.SetSynchronizationContext(new SomeCustomContext()) early in Main, or a UI-style context being installed by a library before the file-system code runs. Anything custom there would mean await inside GetPathUsedBytesAsync could try to resume on a thread that's blocked on .GetResult() — classic deadlock.
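
To make the failure mode concrete, here is a small sketch that reproduces it under a timeout. `QueueOnlyContext` is a hypothetical context that queues continuations and never runs them, standing in for a context whose only pumping thread is the one that is blocked. Nothing like it exists in Tentacle as far as this thread establishes.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical context: posted continuations are queued and never pumped.
public sealed class QueueOnlyContext : SynchronizationContext
{
    readonly ConcurrentQueue<Action> pending = new ConcurrentQueue<Action>();

    public override void Post(SendOrPostCallback d, object state)
        => pending.Enqueue(() => d(state));
}

public static class DeadlockSketch
{
    static async Task<int> WorkAsync()
    {
        await Task.Delay(50); // resumes on the captured context, if there is one
        return 0;
    }

    // Returns true if blocking on WorkAsync finished within one second.
    public static bool BlockingCompletes(bool installContext)
    {
        var completed = false;
        var thread = new Thread(() =>
        {
            if (installContext)
                SynchronizationContext.SetSynchronizationContext(new QueueOnlyContext());

            // Wait with a timeout stands in for .GetAwaiter().GetResult(),
            // so the sketch reports the hang instead of hanging itself.
            completed = WorkAsync().Wait(TimeSpan.FromSeconds(1));
        });
        thread.Start();
        thread.Join();
        return completed;
    }
}
```

`DeadlockSketch.BlockingCompletes(false)` returns `true`, while `DeadlockSketch.BlockingCompletes(true)` returns `false`: with a context installed, the continuation is posted to a queue that the blocked thread never drains.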

Does anything in the agent startup do that? If not, I'll leave the sync wrapper as-is.

Contributor Author

@jimmyp jimmyp May 26, 2026

Sorry @liam-mackie, Claude got enthusiastic and posted this before I was really ready to reach out to you. Are you the right person to talk to? If so, great, but let me shape this up a bit better before you bother engaging with it.

Contributor

I've got context here, and I'm probably the best person in this particular instance, but it's likely worth reaching out to the k8s requests channel for more information.

jimmyp and others added 2 commits May 26, 2026 12:13
…daries

Consolidates 9 sync-over-async sites into 5 single-method bridges, each
of which is a tiny sync wrapper over a private async implementation. All
internal helpers become fully async, removing the in-method GetAwaiter
calls in SystemCtlHelper and LinuxServiceConfigurator entirely.

Bridges:
- PowerShellPrerequisite.Check
- KubernetesDirectoryInformationProvider.GetPathUsedBytes
- LinuxServiceConfigurator.ConfigureService (replaces 5 prior bridges)
- WindowsServiceConfigurator.ConfigureService
- CommandLineRunner.Execute (also exposes ExecuteAsync for async callers)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address PR review on #1236: the sync-over-async hop at the
IKubernetesDirectoryInformationProvider boundary was carrying two
consumers — a sync override (EnsureDiskHasEnoughFreeSpace) and an async
one (CreateScriptContainer). Expose async siblings (GetPathUsedBytesAsync
on the provider, GetStorageInformationAsync on KubernetesPhysicalFileSystem)
so the async caller awaits directly. The sync GetPathUsedBytes and
GetStorageInformation remain for the IOctopusFileSystem override, and the
lighthouse comment now names that exact consumer instead of waving at a
non-existent background sweeper.
Comment thread source/Octopus.Manager.Tentacle/PreReq/PowerShellPrerequisite.cs

Action<string> log = s => logger.Information(s);
var exitCode = SilentProcessRunner.ExecuteCommand(
// We're in a synchronous public static helper (ExtractTarGzip). The method
Contributor Author

We need to talk about where this is called from and why this method needs to be sync.

var exitCode = SilentProcessRunner.ExecuteCommand(

// We're in a synchronous test helper (Execute) that exposes a sync int
// return and out parameters. The method must return synchronously, so we
Contributor Author

Why must this method return synchronously?

Comment thread source/Octopus.Tentacle.Tests/Util/LinuxTestUserPrincipal.cs
this.directoryInformationCache = directoryInformationCache;
}

// Sync-over-async bridge for the one remaining sync caller: KubernetesPhysicalFileSystem
Contributor Author

This requires context from our conversation to understand. Make this match the same format as the other justifications and make it clear and succinct.

Comment thread source/Octopus.Tentacle/Startup/LinuxServiceConfigurator.cs
serviceConfigurationState);
}

// We're at the IServiceConfigurator boundary. IServiceConfigurator is consumed by
Contributor Author

Apply the same feedback here as we gave on the Linux service configurator.

Address PR review on #1236:
- Plainer two-section "Why this is sync / Why blocking is safe" format
  across all sync-over-async sites; tests add a "Why low risk" line per
  reviewer ask.
- KubernetesDirectoryInformationProvider, PowerShellPrerequisite,
  Linux/WindowsServiceConfigurator, KubernetesAgentInstaller,
  SilentProcessRunnerFixture, LinuxTestUserPrincipal (previously had
  no justification, now does).
- LinuxTentacleFetcher.ExtractTarGzip: both callers are already async, so
  flip the helper to ExtractTarGzipAsync and remove the sync-over-async
  entirely. Update NugetTentacleFetcher.ExtractTentacle to await.
- Fix WhenSettingUpPollingTentacle_TelemetryEventShouldBeSent: builder
  was stubbing only the sync Execute overloads, so the ExecuteAsync
  switch in ReviewAndRunScriptTabViewModel returned a default false
  Task and the telemetry callback never fired. Stub both overloads.
Comment thread source/Octopus.Tentacle/Kubernetes/KubernetesPhysicalFileSystem.cs
// deadlock when the calling thread has a SynchronizationContext. The
// Kubernetes agent is a console app and doesn't set one up, so there's
// nothing for the awaited continuation to wait on.
// See https://blog.stephencleary.com/2012/07/dont-block-on-async-code.html
Contributor Author

@liam-mackie I think this is the only place I need your eyes.

This change is because we are making something async lower down the call stack to make it easier to cancel or abandon a script when it is not behaving. Most Tentacle code is sync, so we end up blocking on async code, which can cause issues. I think the comment above justifies that the instance below won't cause any issues, but let me know if you think the rationale misses something.

Contributor

That sounds about right to me. I don't think the agent has anything that would add a SynchronizationContext. @APErebus, just double checking you don't have anything to add here, since you have a bunch of experience with this.

Address PR review on #1236: the sync and async GetStorageInformation
variants had duplicated body. Factor the bytesTotal lookup + tuple
assembly into a private sync helper so the sync/async pair we need stays
DRY.
@jimmyp jimmyp requested a review from liam-mackie May 28, 2026 04:39
@jimmyp jimmyp changed the title from "Migrate SilentProcessRunner to async" to "Make SilentProcessRunner.Execute async" May 28, 2026
@jimmyp jimmyp merged commit 9b89a57 into main May 28, 2026
54 checks passed
@jimmyp jimmyp deleted the jimpelletier/eft-3295-async-migration-base branch May 28, 2026 07:43

4 participants