Retry transient EAGAIN when spawning a subprocess#312
Open
broken-circle wants to merge 1 commit into
Open
Conversation
The spawn path surfaced a transient `EAGAIN` from `clone3()`/`fork()` (Linux, Android, FreeBSD, OpenBSD) or posix_spawn (Darwin) directly as `SubprocessError.spawnFailed`. `EAGAIN` from the fork side signals a momentary per-uid or system task-count limit, so a single refusal was reported as a permanent failure. Add a bounded retry around the spawn syscall: up to `maxSpawnAttempts` attempts with jittered exponential backoff, after which the original error is surfaced unchanged. Restrict retries to a fork-side `EAGAIN`, where the kernel created nothing and re-attempting is clean, identified on the `fork()`/`clone3()` path by the shim returning without a process descriptor, and on Darwin by the absence of pre-fork setup (plain `posix_spawn()`, which leaves no child to reap). An exec-side failure, which can carry a child, is left to the existing handling. Place the backoff in the async context between attempts rather than in the C shim, which holds a process-wide fork lock with signals blocked and runs on a single shared worker thread; sleeping there would stall every other spawn. Cancellation during a backoff stops retrying and surfaces the pending failure rather than a cancellation error.
Member
Author
|
The FreeBSD CI is failing inside the swift-system dependency, unrelated to this change: The CI for this PR resolved to swift-system 1.7.0, while the more recent CI run for #311 resolved to 1.6.5. Might need to pin this dependency if there was a regression. See apple/swift-system#316. Fixed by apple/swift-system#319. CI should be all green on the next run. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #156.
The spawn path surfaces a transient
EAGAINfromclone3()/fork()(Linux, Android, FreeBSD, OpenBSD) orposix_spawn()(Darwin) as a terminalSubprocessError.spawnFailed.EAGAINfrom the fork side is a momentary per-uid or system task-count limit, not a permanent condition, so a single refusal is reported as failure.This PR adds a bounded retry around the spawn: up to
maxSpawnAttempts(currently6, which is the initial attempt plus 5 retries) with jittered exponential backoff (up to ~31 ms), after which the original error surfaces unchanged. Genuine exhaustion still surfaces in tens of milliseconds rather than hanging. Only a fork-sideEAGAINis retried, where the kernel created nothing so re-attempting is clean: onfork()/clone3()identified byprocessDescriptor == .invalidDescriptor, and on Darwin by the absence of pre-fork setup. An exec-side failure is left to the existing handling. Windows is excluded; its spawn failures are reported as Win32 errors, notEAGAIN.The backoff runs in the async context between attempts, not in the shim, which holds a process-wide fork lock with signals blocked and runs on one shared worker thread; sleeping there would stall every other spawn.
Related: While in the spawn path I noticed two pre-existing cleanup gaps on the exec/prefork side, unrelated to this change. On
fork()/clone3(), an exec failure with an errno outside{ENOENT, EACCES, ENOTDIR}throws without closing theclone3()-allocatedpidfd; and on Darwin, the pre-fork path doesn't reap its child on exec failure. This PR's retry is fork-side only, so it neither touches nor amplifies either. I'll open a follow-up PR to fix these.