fix(cluster): preserve pipelined MembersUpdate across join handoff — deterministic bootstrap election (SPEC-305)#47
Merged
Conversation
A joining node could permanently miss the MembersUpdate the master sends immediately after the JoinResponse. LengthDelimitedCodec may buffer the coalesced next frame in its read buffer; recovering the raw stream via Framed::into_inner() and re-wrapping it discarded that buffer, dropping the update and leaving the node stuck on a stale membership view (e.g. 2 of 3 Active members). Carry the live Framed through the seed-join handoff (SeedAttemptOutcome, held_streams, listen_for_master_elected, promotion loop) into a new handle_peer_framed read loop instead of round-tripping through into_inner(), so buffered frames survive. Single-master and lowest-id election invariants are unchanged; the safety-valve self-promote path is never reached. Reproduced ~20-30% failure under CPU starvation before; 0/15 after.
Replace the bare convergence wait with wait_for_stable_master: gate all assertions on (a) every node seeing N Active members, (b) exactly one is_master(), and (c) the same unique master identity observed across a short re-sample, so the tests can never assert against a transient mid-election snapshot. Reads production ClusterState (current_view()/is_master()), no test-only shadow state. Apply to parallel_boot_elects_single_master and parallel_boot_with_loopback_delays; add a 3-active-members check to the delays test. Document the convergence budget's derivation from the formation constants and persist the investigation conclusion (test-timing vs real bug, safety-valve never reached, run count) as a tracked WHY-comment, since .specflow/ is gitignored.
Deterministically pins the buffer-preservation invariant the formation fix restored: a MembersUpdate written back-to-back with the JoinResponse (coalesced in one TCP read) must remain readable from the same Framed after the JoinResponse is consumed. The prior into_inner()+re-wrap handoff dropped the codec read buffer, losing the update; that reproduced only under CPU starvation (~20-30%). This loopback test forces the coalescing every run. The simulation harness (sim/cluster.rs) pre-seeds ClusterState and does not drive ClusterFormationService over TCP, so it cannot exercise the election handoff without the out-of-scope formation refactor; this focused loopback regression is the in-repo behavioral coverage for the fixed branch.
Deploying topgun with
|
| Latest commit: |
4b83dda
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://3df5cb56.topgun-f45.pages.dev |
| Branch Preview URL: | https://sf-446-flaky-bootstrap-elect.topgun-f45.pages.dev |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
SPEC-305 (from TODO-446) — the "flaky"
cluster_bootstrapmaster-election tests were masking areal election bug, not just test timing. Per TODO-446's mandate (rule out a genuine
split-brain before declaring it a timing flake), the root cause was found and fixed, then the
tests were made deterministic.
Root cause + fix
cluster/formation.rs— a pipelinedMembersUpdatecould be lost across the joinhandoff, letting parallel boot under loopback delays converge to the wrong/duplicate master.
Fix preserves the pipelined
MembersUpdateacross the handoff.tests/cluster_bootstrap_test.rs— settle on a stable single-master before asserting(deterministic settle condition, not a wall-clock wait) + a regression test for the pipelined
MembersUpdatehandoff.Caveat for a future CI gate (carried forward as a TODO)
The test functions are
parallel_boot_elects_single_master/parallel_boot_with_loopback_delays. Acargo test cluster_bootstrapname-substring filtermatches neither → reports
0 passed(silent false-green). The correct selector is--test cluster_bootstrap_test(the test target). Tracked so it isn't lost with the archived spec.Verification
Reviewed via the SpecFlow impl-review gate. 2 Rust files. CI gate: Rust + Simulation
(cargo test + sim + clippy
--all-targets --all-features -D warnings+ fmt).