fix: bound concurrent subprocess launches to avoid fd exhaustion#275
Open
fortmarek wants to merge 5 commits into
Open
fix: bound concurrent subprocess launches to avoid fd exhaustion#275fortmarek wants to merge 5 commits into
fortmarek wants to merge 5 commits into
Conversation
Each running subprocess holds open pipe file descriptors. Since #249 made the subprocess wait async, launches are no longer implicitly bounded by the cooperative thread pool, so a caller that fans out many concurrent commands can run the process out of file descriptors — surfacing as `NSPOSIXErrorDomain Code=9 "Bad file descriptor"` from `Process.run()`. This is especially easy to hit on CI, where the default soft `RLIMIT_NOFILE` is low (256 on macOS). Introduce an `AsyncResourceLimiter` (a cancellation-aware async semaphore, mirroring tuist/FileSystem) and wrap each launch in it. The limit is derived from the current soft file-descriptor limit, so it adapts when the limit is raised, and it suspends rather than blocks while waiting — preserving #249's fix for thread starvation. Verified with a stress harness: under a forced 256-fd limit, a high-concurrency mix of large-output and short commands previously threw "Bad file descriptor"; with the limiter the same load completes with zero errors. Existing tests (including the concurrency race suite) pass.
Adds `boundsConcurrentSubprocessLaunches`, which runs many commands through a runner with a small cap and asserts the number of subprocesses alive at once never exceeds it (observed via per-process marker files). It fails without the limiter (peak == fan-out) and passes with it. To make the bound testable deterministically, `CommandRunner` now holds an instance limiter (defaulting to the shared, fd-derived one) and gains an internal initializer that takes an explicit `maximumConcurrentProcesses`.
withPermit returns its operation's result across the actor's isolation boundary, so under StrictConcurrency the result type must be Sendable. CI (Swift 6.1, all platforms) failed to compile with 'non-sendable result type T cannot be sent from nonisolated context'. The only caller returns Void, so the bound is free. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
continuation.onTermination was installed only after the producer task had acquired a permit and created the Process. A consumer that cancelled the stream while the task was still parked in AsyncResourceLimiter.withPermit waiting for a permit left the detached task uncancelled; once a permit freed up it launched an orphaned subprocess with no consumer. Install the handler at the stream-builder level, before the permit is awaited, and cancel the producer task from it. Cancellation unwinds the task whether it is waiting for a permit (via the limiter's withTaskCancellationHandler) or already running (the published process is terminated). A post-run isCancelled check closes the publish-before-run race. Adds a red/green regression test: with a single permit, a queued command cancelled while waiting must never launch a subprocess. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Capturing a subprocess's output requires two pipes (~4 file descriptors), which is the dominant per-process descriptor cost. Many callers don't need the output at all, yet still paid for the pipes. Add an `OutputRedirection` option to `run`: - `.capture` (default, unchanged): pipes + `CommandEvent`s. - `.discard`: redirects stdout/stderr to the null device — no pipes, no events. - `.inherit`: the subprocess inherits the parent's stdout/stderr — no pipes. This mirrors swift-subprocess's discard/redirect output modes and lets callers that don't capture output drop their descriptor cost to zero, easing pressure on the file-descriptor table alongside the concurrency limiter. The new variant is additive; the existing `run` signatures are unchanged.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changed
Bound the number of concurrently-running subprocesses with an
AsyncResourceLimiter(a cancellation-aware async semaphore), sized from the current softRLIMIT_NOFILE. EachCommandRunner.runlaunch acquires a permit for the duration it holds its pipe file descriptors.Why
Every running subprocess holds open pipe file descriptors (stdout + stderr, plus the transient pipe used to locate the executable). Before #249, the blocking
waitUntilExitkept a cooperative thread busy per launch, which implicitly capped how many subprocesses ran at once. #249 switched to an async wait (correctly fixing thread starvation) — but that removed the implicit bound, so a caller that fans out many concurrent commands can run the process out of file descriptors.The failure surfaces as a raw
NSPOSIXErrorDomain Code=9 "Bad file descriptor"fromProcess.run(). It's especially easy to hit on CI, where the default softRLIMIT_NOFILEis low (256 on macOS). Reported downstream in Tuist:tuist graphon a very large project (the AWS SDK for iOS) aborts with exactly this error.The fix
AsyncResourceLimitermirrors the pattern already used in tuist/FileSystem for bounding concurrent file operations:min(256, (softLimit - 32) / fdsPerProcess)— so it adapts if the limit is raised at startup, and falls back to a safe default when the limit can't be read.CommandRunnerholds an instance limiter (defaulting to a shared, fd-derived one) so resource usage stays bounded regardless of how many commands a caller fans out, rather than relying on the ambient fd limit being high enough.Validation
Committed regression test (red/green).
CommandRunnerRaceTests.boundsConcurrentSubprocessLaunchesruns 60 commands through a runner capped at 4 and asserts that the number of subprocesses alive at any instant (observed via per-process marker files) never exceeds the cap:Observed 60 concurrent subprocesses, expected at most 4→ fails.swift buildandswift testpass (15 tests across 3 suites, including the existing concurrency race suite).Symptom-level e2e (manual). A standalone harness depending on this package, run as a fresh process with the soft fd limit forced to 256, reproduces the exact
NSPOSIXErrorDomain Code=9 "Bad file descriptor"on the pre-fix code under a high-concurrency mix of large-output and short commands; with the limiter the same load completes with zero errors. (This is done out-of-suite because aswift testrunner process doesn't enforce a lowered soft fd limit for new descriptors the way a fresh process does, so the committed test asserts the concurrency-bound invariant instead.)Review follow-ups
Two issues raised in review of the limiter were addressed in follow-up commits.
P1 —
withPermitdidn't compile under strict concurrencyCI failed to compile on macOS, Ubuntu, and Windows with
non-sendable result type 'T' cannot be sent from nonisolated context.withPermitisactor-isolated and returns its operation's result out to a nonisolated caller, so the result genuinely crosses an isolation boundary andTmust beSendable. ConstrainedwithPermit<T: Sendable>; the only caller returnsVoid, so nothing downstream is affected. (Bumping the CI Swift version was considered and rejected — the diagnostic is a real cross-isolation send, not a toolchain quirk, and a library must compile on the Swift versions its consumers use.)P2 — stream cancellation could miss queued launches
continuation.onTerminationwas installed only after the producer task had acquired a permit and built theProcess. If a consumer dropped/cancelled the stream while the task was still parked inwithPermitwaiting for a permit, the detached task was never cancelled; once a permit freed up it launched an orphaned subprocess with no consumer. (AssigningonTerminationafter the iterator has already cancelled does not replay the cancellation callback, so the late assignment couldn't recover it.)Fix: install
onTerminationat the stream-builder level, before the permit is awaited, andcancel()the producer task from it. Cancellation now unwinds the task whether it is waiting for a permit (via the limiter'swithTaskCancellationHandler→cancelWaiter) or already running (the published process is terminated). ATask.isCancelledcheck right afterprocess.run()closes the narrow publish-before-run race.Committed regression test (red/green).
CommandRunnerRaceTests.cancellingWhileWaitingForPermit_doesNotLaunchSubprocessholds the only permit with one command, queues a second, cancels it while it is still waiting, then releases the permit and asserts the second command never launched a subprocess:Full suite is green (16 tests across 3 suites).