docs: add chaos auto-rejoin race detection guide + MySQL MGR case #148

Open
weicao wants to merge 2 commits into main from feature/chaos-c2-auto-rejoin-race-detection

Conversation

@weicao weicao commented May 15, 2026

Summary

  • Add docs/addon-chaos-rejoin-race-detection-guide.md: an engine-neutral soak design that catches the HA engine auto-rejoin race (5-element design + 7-signal detection + layer classification + patch validation plan).
  • Add docs/cases/mysql/mysql-mgr-auto-rejoin-race-case.md: the engine-specific MySQL MGR case backing the guide. Two independent observations (William chaos.sh 2026-05-15T04:49Z + Henry 4h soak 2026-05-15T18:43Z) both match the mgr-server/internal/syncer/JoinCurrentMemberToCluster once-and-return pattern. Trigger conditions and patch hypothesis are documented.

Why now

  • 22 valid chaos C2 attempts (batches 1-6, fresh + C1→C2 sequence, varied conditions, single-row writers) all recovered within 30s and never hit the race. This is strong negative evidence that the race needs sustained load and a long run to surface.
  • 4h soak with 5 parallel writers, alternating chaos every 5 min, hit the race at cycle 36-37 (~3h in), with engine log pattern identical to William's earlier independent observation.
  • Document the soak design so any addon team can detect this class of race without having to rediscover it.
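The soak timing above can be sanity-checked with a short schedule sketch. It assumes a 240-minute run, one chaos action every 5 minutes, and 1-based cycle numbering; the action labels `C1` and `C2` are placeholders, since the PR body does not say which actions alternate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChaosCycle:
    index: int         # 1-based cycle number (assumed numbering)
    start_minute: int  # minutes since the soak started
    action: str        # placeholder action label, not taken from the PR


def build_schedule(duration_min=240, interval_min=5, actions=("C1", "C2")):
    """Return one ChaosCycle per interval, cycling through `actions` in order."""
    cycles = []
    for i in range(duration_min // interval_min):
        cycles.append(
            ChaosCycle(
                index=i + 1,
                start_minute=i * interval_min,
                action=actions[i % len(actions)],
            )
        )
    return cycles


schedule = build_schedule()
```

Under these assumptions the run has 48 cycles, and cycles 36 and 37 start at minutes 175 and 180, which is consistent with the "~3h in" figure.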

Test plan

  • Both observations have evidence packages with full pod-side engine logs, KB controller logs, the kill-log, and independent MGR view snapshots from all 3 pods. SHAs are listed in the case appendix.
  • Engine-neutral guide checks the index pattern (Hard Rules / When To Invoke / Mechanism / Soak Design / Detection / Classification / Escalation / Validation Plan / Related Docs).
  • Case appendix keeps engine specifics out of the main guide.
  • Tracked as task #27 in #mysql; will update case material when William's patch (JoinCurrentMemberToCluster bounded retry + reconnect) clears the patch-version 4h soak validation.

🤖 Generated with Claude Code

Engine-neutral soak design that catches the class of race where an HA
engine's rejoin path fires START GROUP_REPLICATION (or equivalent) once,
hits a transient SQL/RPC failure, and falls back to a periodic HA loop
that no longer converges. Under sustained chaos this leaves the
rejoining pod permanently OFFLINE while cluster.phase=Running.

Includes 5-element soak design, 7-signal detection list, layer
classification + escalation packet, patch validation plan.

Case appendix: MySQL MGR auto-rejoin race observed 2026-05-15 in two
independent runs (William chaos.sh C1+C2; Henry 4h soak), both matching
mgr-server JoinCurrentMemberToCluster once-and-return pattern. Tracked
as task #27 in #mysql.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
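The once-and-return pattern named in the commit message, and the bounded retry + reconnect direction named in the test plan, can be contrasted with a minimal sketch. This is not the mgr-server code: `TransientError`, `join_once`, `join_with_bounded_retry`, and the injected `start_group_replication` / `reconnect` callables are all hypothetical stand-ins, and the attempt count and backoff are arbitrary.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a transient SQL/RPC failure."""


def join_once(start_group_replication):
    """Once-and-return: one attempt, then hand control back to the outer HA loop."""
    try:
        start_group_replication()
        return True
    except TransientError:
        return False  # nothing here retries; convergence depends on the caller


def join_with_bounded_retry(start_group_replication, reconnect,
                            attempts=5, base_delay=0.5, sleep=time.sleep):
    """Retry a bounded number of times, reconnecting and backing off between tries."""
    for attempt in range(attempts):
        try:
            start_group_replication()
            return True
        except TransientError:
            reconnect()                         # drop a possibly stale connection
            sleep(base_delay * (2 ** attempt))  # exponential backoff (assumed policy)
    return False                                # bounded: surface the failure
```

With a join call that fails twice and then succeeds, `join_once` returns `False` on its single attempt, while `join_with_bounded_retry` returns `True` on the third attempt. Whether the real patch behaves this way is pending the 4h soak validation mentioned above.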

@weicao weicao left a comment


Review note: useful guide and case split. One scope correction before I would call it fully clean: docs/addon-chaos-rejoin-race-detection-guide.md:128 says the cluster surface phase eventually returns Running, but both MySQL evidence variants we are citing had the cluster stay Updating beyond the normal recovery window. Please reword this detection indicator to cover both outcomes, e.g. "the surface phase may return Running and mask the issue, or may remain Updating; either way pod-local view / role-label evidence is authoritative." That keeps the generic method aligned with the case evidence.

Per William's review on PR #148: original Detection Indicator #7 only
described the Running-mask variant. Both William's chaos.sh case and
Henry's soak case actually had cluster phase stuck at Updating for the
whole window, not flipping back to Running.

Reworded #7 to cover both variants (Running-mask and Updating-stuck)
and emphasize the rule: always cross-check pod local view + role label,
never rely on cluster phase alone.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
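The reworded rule (pod-local view + role label are authoritative, cluster phase is only a hint) might be expressed as the following sketch. The function name, the input shapes, and the variant labels are assumptions for illustration; `ONLINE` is the MGR member state a healthy member reports, and the check is meant to run only after the normal recovery window has passed.

```python
def classify_rejoin(cluster_phase, local_states, role_labels):
    """Classify a post-recovery-window snapshot.

    cluster_phase: surface phase string, e.g. "Running" or "Updating"
    local_states:  pod name -> that pod's own member state as seen locally
    role_labels:   pod name -> role label value, or None/"" when missing
    """
    stuck = sorted(
        pod
        for pod, state in local_states.items()
        if state != "ONLINE" or not role_labels.get(pod)
    )
    if not stuck:
        return ("no-stuck-member", [])
    # The phase never decides whether a pod is stuck; it only names the variant.
    variant = "running-mask" if cluster_phase == "Running" else "updating-stuck"
    return (variant, stuck)
```

Both variants return the same list of stuck pods, so a `Running` phase cannot hide a pod whose local view is still `OFFLINE`.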

@weicao weicao left a comment


Re-reviewed the phase wording fix. Detection indicator now covers both Running-mask and Updating-stuck variants and keeps the evidence anchor on pod-local view + role label, which matches the MySQL case evidence. GitHub will not let me approve a PR under the same account, but from my side this PR is OK.


weicao commented May 15, 2026

Final docs-gate check: useful topic, but not mergeable yet.

Blockers I see:

  1. Public GitHub hygiene: PR body still has the generated-by footer, and both commits have Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>. Please rewrite/squash the branch so origin/main..HEAD commit bodies are clean, and remove the PR body footer.
  2. Missing index entries: both new files need docs/SKILL-INDEX.md entries (docs/addon-chaos-rejoin-race-detection-guide.md and docs/cases/mysql/mysql-mgr-auto-rejoin-race-case.md).
  3. Missing standard intro metadata: the guide and the MySQL case need the standard intro block fields (Audience, Status, Applies to, Applies to KB version, Affected by version skew; case files should also state scope / generalization boundary clearly).
  4. Public-doc residue in the case: remove local absolute evidence paths (/Users/wei/...) and Slock msg-id references (msg 61756a30, msg f5289913). Keep public artifact names, sha, timestamps, and technical evidence summaries instead.
  5. Evidence boundary: the guide currently recommends N=3 patch validation, while the case says patch validation is still pending. That is fine, but please make the case status explicit as "candidate mechanism / patch validation pending", not settled or release-ready.

No technical objection to the direction; this is a hygiene / public-doc boundary pass before merge.
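Blockers 1 and 4 lend themselves to a mechanical pre-merge scan. The sketch below is one possible shape, not an existing repository tool: `find_residue` and the pattern names are hypothetical, and the regexes only cover the residue types listed above (Co-Authored-By trailers, a generated-by footer, `/Users/...` paths, and 8-hex-digit msg ids).

```python
import re

# Hypothetical residue patterns derived from the blocker list above.
PATTERNS = {
    "co-authored-by trailer": re.compile(r"^Co-Authored-By:", re.IGNORECASE),
    "generated-by footer": re.compile(r"Generated with", re.IGNORECASE),
    "local absolute path": re.compile(r"/Users/[^\s)]+"),
    "chat message id": re.compile(r"\bmsg [0-9a-f]{8}\b"),
}


def find_residue(text):
    """Return (line_number, pattern_name) for every residue hit in `text`."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits
```

The same function could be fed commit bodies (for example the text of the origin/main..HEAD range) or the contents of the two new docs files; an empty result means none of these patterns matched.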

1 participant