docs: add chaos auto-rejoin race detection guide + MySQL MGR case #148
weicao wants to merge 2 commits
Engine-neutral soak design that catches the class of race where an HA engine's rejoin path fires START GROUP_REPLICATION (or equivalent) once, hits a transient SQL/RPC failure, and falls back to a periodic HA loop that no longer converges. Under sustained chaos this leaves the rejoining pod permanently OFFLINE while cluster.phase=Running.

Includes 5-element soak design, 7-signal detection list, layer classification + escalation packet, and patch validation plan.

Case appendix: MySQL MGR auto-rejoin race observed 2026-05-15 in two independent runs (William chaos.sh C1+C2; Henry 4h soak), both matching the mgr-server JoinCurrentMemberToCluster once-and-return pattern. Tracked as task #27 in #mysql.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
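The once-and-return pattern described above, and the bounded retry + reconnect alternative that the patch hypothesis names, can be contrasted in a minimal sketch. This is not the mgr-server code: `rejoin_once`, `rejoin_bounded`, `flaky`, `TransientError`, and the default attempt count and backoff are all hypothetical names and values chosen for illustration, and the callables stand in for whatever actually issues START GROUP_REPLICATION and re-opens the connection.

```python
import time
from typing import Callable


class TransientError(Exception):
    """Stands in for a transient SQL/RPC failure during rejoin (hypothetical)."""


def rejoin_once(start_group_replication: Callable[[], None]) -> bool:
    """Once-and-return: one attempt; a transient failure leaves the member OFFLINE."""
    try:
        start_group_replication()
        return True
    except TransientError:
        return False


def rejoin_bounded(
    start_group_replication: Callable[[], None],
    reconnect: Callable[[], None],
    max_attempts: int = 5,      # assumed bound, not taken from the PR
    base_delay_s: float = 0.0,  # assumed linear backoff step
) -> bool:
    """Bounded retry: reconnect and retry after each transient failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            start_group_replication()
            return True
        except TransientError:
            if attempt == max_attempts:
                break
            reconnect()
            time.sleep(base_delay_s * attempt)
    return False


def flaky(failures: int) -> Callable[[], None]:
    """Build a fake rejoin call that fails `failures` times, then succeeds."""
    remaining = {"n": failures}

    def call() -> None:
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise TransientError("connection reset")

    return call
```

With a call that fails twice before succeeding, `rejoin_once(flaky(2))` gives up after the first failure, while `rejoin_bounded(flaky(2), reconnect)` succeeds on the third attempt. The bound matters as much as the retry: an unbounded loop would hide a persistent failure instead of surfacing it.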
weicao
left a comment
Review note: useful guide and case split. One scope correction before I would call it fully clean: docs/addon-chaos-rejoin-race-detection-guide.md:128 says the cluster surface phase eventually returns Running, but both MySQL evidence variants we are citing had the cluster stay Updating beyond the normal recovery window. Please reword this detection indicator to cover both outcomes, e.g. "the surface phase may return Running and mask the issue, or may remain Updating; either way pod-local view / role-label evidence is authoritative." That keeps the generic method aligned with the case evidence.
Per William's review on PR #148: the original Detection Indicator #7 only described the Running-mask variant. Both William's chaos.sh case and Henry's soak case actually had the cluster phase stuck at Updating for the whole window, not flipping back to Running. Reworded #7 to cover both variants (Running-mask and Updating-stuck) and to emphasize the rule: always cross-check the pod-local view + role label, never rely on cluster phase alone.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
weicao
left a comment
Re-reviewed the phase wording fix. Detection indicator now covers both Running-mask and Updating-stuck variants and keeps the evidence anchor on pod-local view + role label, which matches the MySQL case evidence. GitHub will not let me approve a PR under the same account, but from my side this PR is OK.
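The reworded indicator can be read as a small decision rule: the pod-local view and role label decide health, and the cluster phase only names which stuck variant was seen. The sketch below is one way to encode that under stated assumptions; `PodObservation`, `classify`, the verdict strings, and the 300-second recovery window are hypothetical and not taken from the guide, and how the three inputs are collected is left out.

```python
from dataclasses import dataclass


@dataclass
class PodObservation:
    cluster_phase: str         # surface phase, e.g. "Running" or "Updating"
    local_member_state: str    # pod-local view, e.g. "ONLINE" or "OFFLINE"
    role_label: str            # role label on the pod; "" when absent
    seconds_since_fault: float


def classify(obs: PodObservation, recovery_window_s: float = 300.0) -> str:
    """Classify a rejoining pod; the window default is an assumed value."""
    # Pod-local view + role label are authoritative; phase is ignored here.
    if obs.local_member_state == "ONLINE" and obs.role_label != "":
        return "recovered"
    # Inside the normal recovery window, an unhealthy pod is not yet a finding.
    if obs.seconds_since_fault <= recovery_window_s:
        return "recovering"
    # Past the window: phase only labels the variant, it never clears the pod.
    if obs.cluster_phase == "Running":
        return "stuck-running-mask"
    if obs.cluster_phase == "Updating":
        return "stuck-updating"
    return "stuck-other"
```

For example, `classify(PodObservation("Running", "OFFLINE", "", 900))` yields `"stuck-running-mask"`, and the same observation with phase `"Updating"` yields `"stuck-updating"`; both are reported as stuck, which is the point of the rewording.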
Final docs-gate check: useful topic, but not mergeable yet. Blockers I see:
No technical objection to the direction; this is a hygiene / public-doc boundary pass before merge.
Summary
- docs/addon-chaos-rejoin-race-detection-guide.md: engine-neutral soak design that catches the HA engine auto-rejoin race (5-element design + 7-signal detection + layer classification + patch validation plan).
- docs/cases/mysql/mysql-mgr-auto-rejoin-race-case.md: engine-specific MySQL MGR case backing the guide, with two independent observations (William chaos.sh 2026-05-15T04:49Z + Henry 4h soak 2026-05-15T18:43Z), both matching the mgr-server/internal/syncer/JoinCurrentMemberToCluster once-and-return pattern. Trigger conditions + patch hypothesis documented.

Why now
Test plan
JoinCurrentMemberToCluster bounded retry + reconnect) clears patch-version 4h soak validation.

🤖 Generated with Claude Code