Fix macOS handoff flake: tolerate EINVAL when arming SO_RCVTIMEO on a closed peer #2
Merged
… closed peer
The supervisor arms a per-recv liveness timeout with `SO_RCVTIMEO`
(`UnixStream::set_read_timeout`) before every control-socket read. On macOS
and the BSDs, `sosetoptlock` rejects *any* `setsockopt` with `EINVAL` once a
socket is fully shut down (`SS_CANTRCVMORE | SS_CANTSENDMORE`) — exactly the
state a peer that closed its end leaves behind. Linux has no such check:
`setsockopt` succeeds and the following read returns EOF.
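The platform difference described above can be probed with a small standalone sketch using only the Rust standard library: create a socketpair, close one end, then arm the timeout on the other. The names `ArmOutcome` and `arm_after_peer_close` are illustrative and not part of this change; the sketch assumes errno 22 is `EINVAL`, which holds on Linux, macOS and the BSDs.

```rust
use std::io::{self, Read};
use std::os::unix::net::UnixStream;
use std::time::Duration;

/// Assumption: EINVAL is errno 22 on Linux, macOS and the BSDs.
const EINVAL: i32 = 22;

/// What happened when SO_RCVTIMEO was armed after the peer closed.
#[derive(Debug, PartialEq)]
pub enum ArmOutcome {
    /// setsockopt succeeded and the following read returned EOF (Linux).
    ArmedThenEof,
    /// setsockopt itself failed with EINVAL (the macOS/BSD case).
    RejectedEinval,
}

/// Hypothetical probe: close one end of a socketpair, then arm the timeout.
pub fn arm_after_peer_close() -> io::Result<ArmOutcome> {
    let (mut ours, theirs) = UnixStream::pair()?;
    drop(theirs); // the peer closes its end before we arm the timeout

    match ours.set_read_timeout(Some(Duration::from_secs(10))) {
        Ok(()) => {
            let mut buf = [0u8; 1];
            match ours.read(&mut buf)? {
                0 => Ok(ArmOutcome::ArmedThenEof),
                _ => Err(io::Error::new(
                    io::ErrorKind::InvalidData,
                    "unexpected data on a closed socketpair",
                )),
            }
        }
        // The failure mode this PR tolerates: arming fails, not the read.
        Err(e) if e.raw_os_error() == Some(EINVAL) => Ok(ArmOutcome::RejectedEinval),
        Err(e) => Err(e),
    }
}
```

On Linux this should report `ArmedThenEof`. On macOS it may report `RejectedEinval`, depending on whether the peer's close has already set both shutdown flags.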
So when a successor (or incumbent) closed its end of the control socket right
as the supervisor reached `set_read_timeout`, the call returned EINVAL. That
surfaced as `ready read failed: io: Invalid argument (os error 22)` and
aborted the handoff. It was flaky because it depended on the peer's close
propagating into both shutdown flags before the supervisor armed the timeout —
a race against the successor's fork/exec/exit window. Different scenarios lost
the race on different runs.
Proven at the syscall level by instrumenting the failing `setsockopt`:
`fd` was a valid connected `SOCK_STREAM`, `SO_ERROR=0`, the timeval was
`{10,0}` (unquestionably valid), yet both `set_read_timeout` and a hand-rolled
`setsockopt(SO_RCVTIMEO)` returned EINVAL. Two states observed: peer closed
with an empty buffer (clean EOF) and — critically — peer closed with a
complete 28-byte `Ready` frame still buffered. A blind "treat EINVAL as
failure" would have dropped that buffered `Ready` and aborted a handoff the
successor actually completed.
`arm_recv_timeout()` arms the timeout, and on EINVAL confirms via a
non-blocking `MSG_PEEK` that the read cannot block (peer gone: buffered frame
is delivered, otherwise EOF) before proceeding to read without the timeout.
A genuine EINVAL on a still-open empty socket is surfaced rather than risking
an unbounded blocking read. The Linux path is unchanged (setsockopt never
fails there). Applied to all three arming sites (Ready/`read_until`,
seal-wait, Hello). The Linux build keeps the atomic `SOCK_CLOEXEC`
socketpair creation.
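The arm-then-confirm logic above can be sketched roughly as follows. This is not the code from this PR: the `Armed` enum, the return type, and the hand-declared `recv` binding are assumptions made to keep the example dependency-free (real code would more likely use the `libc` crate). The flag values are hard-coded on the assumption that `MSG_PEEK` is 0x2 everywhere and `MSG_DONTWAIT` is 0x40 on Linux and 0x80 on macOS/BSD.

```rust
use std::ffi::{c_int, c_void};
use std::io;
use std::os::unix::io::AsRawFd;
use std::os::unix::net::UnixStream;
use std::time::Duration;

// Assumption: minimal hand-written binding instead of the `libc` crate.
unsafe extern "C" {
    fn recv(fd: c_int, buf: *mut c_void, len: usize, flags: c_int) -> isize;
}

const EINVAL: i32 = 22; // same value on Linux, macOS and the BSDs
const MSG_PEEK: c_int = 0x2;
#[cfg(any(target_os = "linux", target_os = "android"))]
const MSG_DONTWAIT: c_int = 0x40;
#[cfg(not(any(target_os = "linux", target_os = "android")))]
const MSG_DONTWAIT: c_int = 0x80;

/// How the following read is bounded (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub enum Armed {
    /// SO_RCVTIMEO is set; the read is bounded by the timeout.
    WithTimeout,
    /// Arming failed with EINVAL, but a peek showed the read cannot block.
    ReadCannotBlock,
}

pub fn arm_recv_timeout(sock: &UnixStream, timeout: Duration) -> io::Result<Armed> {
    match sock.set_read_timeout(Some(timeout)) {
        Ok(()) => Ok(Armed::WithTimeout),
        Err(e) if e.raw_os_error() == Some(EINVAL) => {
            // Non-blocking peek: consumes nothing and never waits.
            let mut byte = 0u8;
            let n = unsafe {
                recv(
                    sock.as_raw_fd(),
                    &mut byte as *mut u8 as *mut c_void,
                    1,
                    MSG_PEEK | MSG_DONTWAIT,
                )
            };
            if n >= 0 {
                // n > 0: buffered data will be delivered; n == 0: EOF.
                Ok(Armed::ReadCannotBlock)
            } else {
                // Peek failed (typically EAGAIN: socket open and empty):
                // surface the original EINVAL instead of reading unbounded.
                Err(e)
            }
        }
        Err(e) => Err(e),
    }
}
```

A caller would invoke this before each control-socket read and proceed only on `Ok`. Checking `raw_os_error()` rather than `ErrorKind::InvalidInput` keeps the standard library's own rejection of a zero `Duration`, which carries no OS error code, out of the tolerated path.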
Also fixes two pre-existing, macOS-timing-exposed test issues surfaced by
looping the suite:
- Several crash scenarios asserted O's asynchronously-written `resume-called`
(and N's startup `successor-pid`) marker with an instant `marker_exists`
immediately after the crashed process exited, racing the survivor's
recovery. Converted to the bounded `wait_marker(..., 3s)` already used
elsewhere — a real "never resumes" bug still fails after 3s, so this absorbs
scheduling jitter without masking regressions.
- `count_open_fds` in the stress test read Linux-only `/proc/self/fd`; now uses
`/dev/fd` on macOS/BSD so the FD-leak check runs (and passes) cross-platform.
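The two test-helper changes above might look roughly like the sketch below. The signatures are assumptions: the source only shows the call shape `wait_marker(..., 3s)` and the name `count_open_fds`, so the parameter types, the 10 ms poll interval, and the `FD_DIR` constant are illustrative.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::thread;
use std::time::{Duration, Instant};

/// Poll until `marker` exists or `timeout` elapses; true if it appeared.
/// A marker that never appears still fails, just after the full timeout.
pub fn wait_marker(marker: &Path, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if marker.exists() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        thread::sleep(Duration::from_millis(10)); // assumed poll interval
    }
}

/// Directory that lists this process's open file descriptors.
#[cfg(target_os = "linux")]
const FD_DIR: &str = "/proc/self/fd";
#[cfg(not(target_os = "linux"))]
const FD_DIR: &str = "/dev/fd";

/// Count open fds by listing FD_DIR. The listing's own directory handle
/// may be included, so compare two counts rather than trusting one value.
pub fn count_open_fds() -> io::Result<usize> {
    let mut count = 0;
    for entry in fs::read_dir(FD_DIR)? {
        entry?;
        count += 1;
    }
    Ok(count)
}
```

A leak check would take a count before and after the stress loop and assert that the two are equal.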
Verified: `crash_matrix` green over 50 consecutive runs and the named EINVAL
scenarios over 40 more; full `cargo test --workspace --lib --tests` green;
clippy and fmt clean. ARCHITECTURE.md updated to document the macOS
`SO_RCVTIMEO`-on-shutdown behavior.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
5e3e621 to aad5609
jaredLunde added a commit that referenced this pull request on Jun 5, 2026
The bug
On macOS, ~1–2 of 22 `crash_matrix` fault-injection scenarios failed per run, a different subset each time, all with the same supervisor-log signature: EINVAL on the supervisor's timeout-bounded read of a control frame. Linux is unaffected.
Root cause (proven at the syscall level)
The supervisor arms a per-recv liveness timeout with `SO_RCVTIMEO` (`UnixStream::set_read_timeout`) before each control-socket read. macOS/BSD `sosetoptlock` rejects any `setsockopt` with `EINVAL` once a socket is fully shut down (`SS_CANTRCVMORE | SS_CANTSENDMORE`), which is exactly the state a peer that closed its end leaves behind. Linux has no such check.
So when the successor (or incumbent) closed its socket end right as the supervisor reached `set_read_timeout`, the call returned EINVAL instead of the read returning a clean EOF. It was flaky because it raced the peer's fork/exec/exit window, which is why a different subset failed each run.
Instrumentation at the failing `setsockopt` confirmed: `fd` was a valid connected `SOCK_STREAM`, `SO_ERROR=0`, timeval `{10,0}` (valid), yet both `set_read_timeout` and a hand-rolled `setsockopt(SO_RCVTIMEO)` returned EINVAL. Two states observed (by `so_nread`):
- State A: peer closed with an empty buffer (clean EOF).
- State B: peer closed with a complete 28-byte `Ready` frame still buffered.
State B is why the task's "no blind retry / no swallow" constraint matters: treating EINVAL as failure/EOF would drop a buffered `Ready` and abort a handoff the successor actually completed.
Fix
`arm_recv_timeout()` arms the timeout and, on EINVAL, confirms via a non-blocking `MSG_PEEK` that the read cannot block (peer gone → buffered frame is delivered, otherwise EOF) before reading without the timeout. A genuine EINVAL on a still-open empty socket is surfaced rather than risking an unbounded blocking read. Linux path unchanged; atomic `SOCK_CLOEXEC` socketpair creation preserved. Applied to all three arming sites (`read_until`/Ready, seal-wait, Hello).
Test fixes (pre-existing, macOS-timing-exposed)
- Several crash scenarios asserted O's asynchronously-written `resume-called` (and N's startup `successor-pid`) marker with an instant `marker_exists` right after the crashed process exited, racing the survivor's recovery. Converted to the bounded `wait_marker(…, 3s)` already used elsewhere in the file. A real "never resumes" bug still fails after 3s, so this absorbs scheduling jitter without masking regressions.
- `count_open_fds` read Linux-only `/proc/self/fd`; now uses `/dev/fd` on macOS/BSD so the FD-leak stress check runs (and passes) cross-platform, also confirming the fix leaks no fds.
Verification (macOS)
- `crash_matrix` green over 50 consecutive runs; the three task-named EINVAL scenarios over 40 more.
- `cargo test --workspace --lib --tests` green.
- `cargo clippy --workspace --all-targets` and `cargo fmt --check` clean.
- ARCHITECTURE.md updated to document the macOS `SO_RCVTIMEO`-on-shutdown behavior.
🤖 Generated with Claude Code