Skip to content

Retry transient EAGAIN when spawning a subprocess#312

Open
broken-circle wants to merge 1 commit into
swiftlang:mainfrom
broken-circle:transient-EAGAIN-156
Open

Retry transient EAGAIN when spawning a subprocess#312
broken-circle wants to merge 1 commit into
swiftlang:mainfrom
broken-circle:transient-EAGAIN-156

Conversation

@broken-circle

@broken-circle broken-circle commented Jun 12, 2026

Copy link
Copy Markdown
Member

Closes #156.

The spawn path surfaces a transient EAGAIN from clone3()/fork() (Linux, Android, FreeBSD, OpenBSD) or posix_spawn() (Darwin) as a terminal SubprocessError.spawnFailed. EAGAIN from the fork side is a momentary per-uid or system task-count limit, not a permanent condition, so a single refusal is reported as failure.

This PR adds a bounded retry around the spawn: up to maxSpawnAttempts (currently 6, which is the initial attempt plus 5 retries) with jittered exponential backoff (up to ~31 ms), after which the original error surfaces unchanged. Genuine exhaustion still surfaces in tens of milliseconds rather than hanging. Only a fork-side EAGAIN is retried, where the kernel created nothing so re-attempting is clean: on fork()/clone3() identified by processDescriptor == .invalidDescriptor, and on Darwin by the absence of pre-fork setup. An exec-side failure is left to the existing handling. Windows is excluded; its spawn failures are reported as Win32 errors, not EAGAIN.

The backoff runs in the async context between attempts, not in the shim, which holds a process-wide fork lock with signals blocked and runs on one shared worker thread; sleeping there would stall every other spawn.

Related: While in the spawn path I noticed two pre-existing cleanup gaps on the exec/prefork side, unrelated to this change. On fork()/clone3(), an exec failure with an errno outside {ENOENT, EACCES, ENOTDIR} throws without closing the clone3()-allocated pidfd; and on Darwin, the pre-fork path doesn't reap its child on exec failure. This PR's retry is fork-side only, so it neither touches nor amplifies either. I'll open a follow-up PR to fix these.

The spawn path surfaced a transient `EAGAIN` from `clone3()`/`fork()`
(Linux, Android, FreeBSD, OpenBSD) or posix_spawn (Darwin) directly as
`SubprocessError.spawnFailed`. `EAGAIN` from the fork side signals a
momentary per-uid or system task-count limit, so a single refusal was
reported as a permanent failure.

Add a bounded retry around the spawn syscall: up to `maxSpawnAttempts`
attempts with jittered exponential backoff, after which the original
error is surfaced unchanged. Restrict retries to a fork-side `EAGAIN`,
where the kernel created nothing and re-attempting is clean, identified
on the `fork()`/`clone3()` path by the shim returning without a process
descriptor, and on Darwin by the absence of pre-fork setup (plain
`posix_spawn()`, which leaves no child to reap). An exec-side failure,
which can carry a child, is left to the existing handling.

Place the backoff in the async context between attempts rather than in
the C shim, which holds a process-wide fork lock with signals blocked
and runs on a single shared worker thread; sleeping there would stall
every other spawn. Cancellation during a backoff stops retrying and
surfaces the pending failure rather than a cancellation error.
@broken-circle

broken-circle commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

The FreeBSD CI is failing inside the swift-system dependency, unrelated to this change:

[38/59] Compiling SystemPackage RawBuffer.swift
.../Constants.swift:635:3: error: cannot find 'O_CLOFORK' in scope
633 | internal var _O_CLOFORK: CInt {
634 |   #if !os(WASI) && !os(Linux) && !os(Android) && !canImport(Darwin)
635 |   O_CLOFORK
    |   `- error: cannot find 'O_CLOFORK' in scope
636 |   #else
637 |   0

The CI for this PR resolved to swift-system 1.7.0, while the more recent CI run for #311 resolved to 1.6.5. Might need to pin this dependency if there was a regression.

See apple/swift-system#316. Fixed by apple/swift-system#319. CI should be all green on the next run.

@broken-circle broken-circle marked this pull request as ready for review June 12, 2026 01:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

stressTestWithLittleOutput fails nondeterministically with EAGAIN

1 participant