fix(cluster): preserve pipelined MembersUpdate across join handoff — deterministic bootstrap election (SPEC-305) by ivkan · Pull Request #47 · TopGunBuild/topgun

ivkan · 2026-06-11T08:07:31Z

Summary

SPEC-305 (from TODO-446) — the "flaky" cluster_bootstrap master-election tests were masking a
real election bug, not just test timing. Per TODO-446's mandate (rule out a genuine
split-brain before declaring it a timing flake), the root cause was found and fixed, then the
tests were made deterministic.

Root cause + fix

cluster/formation.rs — a pipelined MembersUpdate could be lost across the join
handoff, letting parallel boot under loopback delays converge to the wrong/duplicate master.
Fix preserves the pipelined MembersUpdate across the handoff.
tests/cluster_bootstrap_test.rs — settle on a stable single-master before asserting
(deterministic settle condition, not a wall-clock wait) + a regression test for the pipelined
MembersUpdate handoff.

Caveat for a future CI gate (carried forward as a TODO)

The test functions are parallel_boot_elects_single_master /
parallel_boot_with_loopback_delays. A cargo test cluster_bootstrap name-substring filter
matches neither → reports 0 passed (silent false-green). The correct selector is
--test cluster_bootstrap_test (the test target). Tracked so it isn't lost with the archived spec.

Verification

Reviewed via the SpecFlow impl-review gate. 2 Rust files. CI gate: Rust + Simulation
(cargo test + sim + clippy --all-targets --all-features -D warnings + fmt).

A joining node could permanently miss the MembersUpdate the master sends immediately after the JoinResponse. LengthDelimitedCodec may buffer the coalesced next frame in its read buffer; recovering the raw stream via Framed::into_inner() and re-wrapping it discarded that buffer, dropping the update and leaving the node stuck on a stale membership view (e.g. 2 of 3 Active members). Carry the live Framed through the seed-join handoff (SeedAttemptOutcome, held_streams, listen_for_master_elected, promotion loop) into a new handle_peer_framed read loop instead of round-tripping through into_inner(), so buffered frames survive. Single-master and lowest-id election invariants are unchanged; the safety-valve self-promote path is never reached. Reproduced ~20-30% failure under CPU starvation before; 0/15 after.

Replace the bare convergence wait with wait_for_stable_master: gate all assertions on (a) every node seeing N Active members, (b) exactly one is_master(), and (c) the same unique master identity observed across a short re-sample, so the tests can never assert against a transient mid-election snapshot. Reads production ClusterState (current_view()/is_master()), no test-only shadow state. Apply to parallel_boot_elects_single_master and parallel_boot_with_loopback_delays; add a 3-active-members check to the delays test. Document the convergence budget's derivation from the formation constants and persist the investigation conclusion (test-timing vs real bug, safety-valve never reached, run count) as a tracked WHY-comment, since .specflow/ is gitignored.

Deterministically pins the buffer-preservation invariant the formation fix restored: a MembersUpdate written back-to-back with the JoinResponse (coalesced in one TCP read) must remain readable from the same Framed after the JoinResponse is consumed. The prior into_inner()+re-wrap handoff dropped the codec read buffer, losing the update; that reproduced only under CPU starvation (~20-30%). This loopback test forces the coalescing every run. The simulation harness (sim/cluster.rs) pre-seeds ClusterState and does not drive ClusterFormationService over TCP, so it cannot exercise the election handoff without the out-of-scope formation refactor; this focused loopback regression is the in-repo behavioral coverage for the fixed branch.

cloudflare-workers-and-pages · 2026-06-11T08:07:37Z

Deploying topgun with Cloudflare Pages

Latest commit:	`4b83dda`
Status:	✅ Deploy successful!
Preview URL:	https://3df5cb56.topgun-f45.pages.dev
Branch Preview URL:	https://sf-446-flaky-bootstrap-elect.topgun-f45.pages.dev

View logs

ivkan added 3 commits June 10, 2026 22:51

ivkan merged commit d3f1d05 into main Jun 11, 2026
10 checks passed

ivkan deleted the sf-446-flaky-bootstrap-election branch June 11, 2026 08:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(cluster): preserve pipelined MembersUpdate across join handoff — deterministic bootstrap election (SPEC-305)#47

fix(cluster): preserve pipelined MembersUpdate across join handoff — deterministic bootstrap election (SPEC-305)#47
ivkan merged 3 commits into
mainfrom
sf-446-flaky-bootstrap-election

ivkan commented Jun 11, 2026

Uh oh!

cloudflare-workers-and-pages Bot commented Jun 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ivkan commented Jun 11, 2026

Summary

Root cause + fix

Caveat for a future CI gate (carried forward as a TODO)

Verification

Uh oh!

cloudflare-workers-and-pages Bot commented Jun 11, 2026

Deploying topgun with Cloudflare Pages

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant