Skip to content

fix: bound concurrent subprocess launches to avoid fd exhaustion#275

Open
fortmarek wants to merge 5 commits into
mainfrom
fix/bound-concurrent-process-launches
Open

fix: bound concurrent subprocess launches to avoid fd exhaustion#275
fortmarek wants to merge 5 commits into
mainfrom
fix/bound-concurrent-process-launches

Conversation

@fortmarek
Copy link
Copy Markdown
Member

@fortmarek fortmarek commented Jun 2, 2026

What changed

Bound the number of concurrently-running subprocesses with an AsyncResourceLimiter (a cancellation-aware async semaphore), sized from the current soft RLIMIT_NOFILE. Each CommandRunner.run launch acquires a permit for the duration it holds its pipe file descriptors.

Why

Every running subprocess holds open pipe file descriptors (stdout + stderr, plus the transient pipe used to locate the executable). Before #249, the blocking waitUntilExit kept a cooperative thread busy per launch, which implicitly capped how many subprocesses ran at once. #249 switched to an async wait (correctly fixing thread starvation) — but that removed the implicit bound, so a caller that fans out many concurrent commands can run the process out of file descriptors.

The failure surfaces as a raw NSPOSIXErrorDomain Code=9 "Bad file descriptor" from Process.run(). It's especially easy to hit on CI, where the default soft RLIMIT_NOFILE is low (256 on macOS). Reported downstream in Tuist: tuist graph on a very large project (the AWS SDK for iOS) aborts with exactly this error.

The fix

AsyncResourceLimiter mirrors the pattern already used in tuist/FileSystem for bounding concurrent file operations:

CommandRunner holds an instance limiter (defaulting to a shared, fd-derived one) so resource usage stays bounded regardless of how many commands a caller fans out, rather than relying on the ambient fd limit being high enough.

Validation

Committed regression test (red/green). CommandRunnerRaceTests.boundsConcurrentSubprocessLaunches runs 60 commands through a runner capped at 4 and asserts that the number of subprocesses alive at any instant (observed via per-process marker files) never exceeds the cap:

  • With the limiter bypassed: Observed 60 concurrent subprocesses, expected at most 4fails.
  • With the limiter: peak ≤ 4 → passes.

swift build and swift test pass (15 tests across 3 suites, including the existing concurrency race suite).

Symptom-level e2e (manual). A standalone harness depending on this package, run as a fresh process with the soft fd limit forced to 256, reproduces the exact NSPOSIXErrorDomain Code=9 "Bad file descriptor" on the pre-fix code under a high-concurrency mix of large-output and short commands; with the limiter the same load completes with zero errors. (This is done out-of-suite because a swift test runner process doesn't enforce a lowered soft fd limit for new descriptors the way a fresh process does, so the committed test asserts the concurrency-bound invariant instead.)

Review follow-ups

Two issues raised in review of the limiter were addressed in follow-up commits.

P1 — withPermit didn't compile under strict concurrency

CI failed to compile on macOS, Ubuntu, and Windows with non-sendable result type 'T' cannot be sent from nonisolated context. withPermit is actor-isolated and returns its operation's result out to a nonisolated caller, so the result genuinely crosses an isolation boundary and T must be Sendable. Constrained withPermit<T: Sendable>; the only caller returns Void, so nothing downstream is affected. (Bumping the CI Swift version was considered and rejected — the diagnostic is a real cross-isolation send, not a toolchain quirk, and a library must compile on the Swift versions its consumers use.)

P2 — stream cancellation could miss queued launches

continuation.onTermination was installed only after the producer task had acquired a permit and built the Process. If a consumer dropped/cancelled the stream while the task was still parked in withPermit waiting for a permit, the detached task was never cancelled; once a permit freed up it launched an orphaned subprocess with no consumer. (Assigning onTermination after the iterator has already cancelled does not replay the cancellation callback, so the late assignment couldn't recover it.)

Fix: install onTermination at the stream-builder level, before the permit is awaited, and cancel() the producer task from it. Cancellation now unwinds the task whether it is waiting for a permit (via the limiter's withTaskCancellationHandlercancelWaiter) or already running (the published process is terminated). A Task.isCancelled check right after process.run() closes the narrow publish-before-run race.

Committed regression test (red/green). CommandRunnerRaceTests.cancellingWhileWaitingForPermit_doesNotLaunchSubprocess holds the only permit with one command, queues a second, cancels it while it is still waiting, then releases the permit and asserts the second command never launched a subprocess:

  • Without the fix: the orphaned producer launches once the permit frees → sentinel file appears → fails.
  • With the fix: the producer is cancelled while waiting → no launch → passes.

Full suite is green (16 tests across 3 suites).

fortmarek and others added 5 commits June 2, 2026 17:53
Each running subprocess holds open pipe file descriptors. Since #249 made the
subprocess wait async, launches are no longer implicitly bounded by the
cooperative thread pool, so a caller that fans out many concurrent commands can
run the process out of file descriptors — surfacing as
`NSPOSIXErrorDomain Code=9 "Bad file descriptor"` from `Process.run()`. This is
especially easy to hit on CI, where the default soft `RLIMIT_NOFILE` is low
(256 on macOS).

Introduce an `AsyncResourceLimiter` (a cancellation-aware async semaphore,
mirroring tuist/FileSystem) and wrap each launch in it. The limit is derived
from the current soft file-descriptor limit, so it adapts when the limit is
raised, and it suspends rather than blocks while waiting — preserving #249's
fix for thread starvation.

Verified with a stress harness: under a forced 256-fd limit, a high-concurrency
mix of large-output and short commands previously threw "Bad file descriptor";
with the limiter the same load completes with zero errors. Existing tests
(including the concurrency race suite) pass.
Adds `boundsConcurrentSubprocessLaunches`, which runs many commands through a
runner with a small cap and asserts the number of subprocesses alive at once
never exceeds it (observed via per-process marker files). It fails without the
limiter (peak == fan-out) and passes with it.

To make the bound testable deterministically, `CommandRunner` now holds an
instance limiter (defaulting to the shared, fd-derived one) and gains an
internal initializer that takes an explicit `maximumConcurrentProcesses`.
withPermit returns its operation's result across the actor's isolation boundary, so under StrictConcurrency the result type must be Sendable. CI (Swift 6.1, all platforms) failed to compile with 'non-sendable result type T cannot be sent from nonisolated context'. The only caller returns Void, so the bound is free.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
continuation.onTermination was installed only after the producer task had acquired a permit and created the Process. A consumer that cancelled the stream while the task was still parked in AsyncResourceLimiter.withPermit waiting for a permit left the detached task uncancelled; once a permit freed up it launched an orphaned subprocess with no consumer.

Install the handler at the stream-builder level, before the permit is awaited, and cancel the producer task from it. Cancellation unwinds the task whether it is waiting for a permit (via the limiter's withTaskCancellationHandler) or already running (the published process is terminated). A post-run isCancelled check closes the publish-before-run race.

Adds a red/green regression test: with a single permit, a queued command cancelled while waiting must never launch a subprocess.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Capturing a subprocess's output requires two pipes (~4 file descriptors), which
is the dominant per-process descriptor cost. Many callers don't need the output
at all, yet still paid for the pipes.

Add an `OutputRedirection` option to `run`:
- `.capture` (default, unchanged): pipes + `CommandEvent`s.
- `.discard`: redirects stdout/stderr to the null device — no pipes, no events.
- `.inherit`: the subprocess inherits the parent's stdout/stderr — no pipes.

This mirrors swift-subprocess's discard/redirect output modes and lets callers
that don't capture output drop their descriptor cost to zero, easing pressure
on the file-descriptor table alongside the concurrency limiter. The new variant
is additive; the existing `run` signatures are unchanged.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant