From 9ec0bc73c623b794a511cd9a101b3ec05b7ebc83 Mon Sep 17 00:00:00 2001
From: Wei Cao
Date: Sat, 16 May 2026 03:49:01 +0800
Subject: [PATCH 1/2] docs: add chaos auto-rejoin race detection guide + MySQL MGR case

Engine-neutral soak design that catches the class of race where an HA
engine's rejoin path fires START GROUP_REPLICATION (or equivalent) once,
hits a transient SQL/RPC failure, and falls back to a periodic HA loop
that no longer converges. Under sustained chaos this leaves the rejoining
pod permanently OFFLINE while cluster.phase=Running.

Includes 5-element soak design, 7-signal detection list, layer
classification + escalation packet, patch validation plan.

Case appendix: MySQL MGR auto-rejoin race observed 2026-05-15 in two
independent runs (William chaos.sh C1+C2; Henry 4h soak), both matching
mgr-server JoinCurrentMemberToCluster once-and-return pattern. Tracked as
task #27 in #mysql.

Co-Authored-By: Claude Opus 4.7
---
 ...addon-chaos-rejoin-race-detection-guide.md | 195 ++++++++++++++++++
 .../mysql/mysql-mgr-auto-rejoin-race-case.md  | 115 +++++++++++
 2 files changed, 310 insertions(+)
 create mode 100644 docs/addon-chaos-rejoin-race-detection-guide.md
 create mode 100644 docs/cases/mysql/mysql-mgr-auto-rejoin-race-case.md

diff --git a/docs/addon-chaos-rejoin-race-detection-guide.md b/docs/addon-chaos-rejoin-race-detection-guide.md
new file mode 100644
index 0000000..75f5ae3
--- /dev/null
+++ b/docs/addon-chaos-rejoin-race-detection-guide.md
@@ -0,0 +1,195 @@
+# Chaos Auto-Rejoin Race Detection (Soak Design)
+
+When an addon's HA engine rejoins a cluster after a pod restart, the rejoin path
+typically calls one or more engine-level commands (e.g. `START GROUP_REPLICATION`,
+`pg_basebackup` + `replica start`, `RECOVER PRIMARY`, etc.). If that path fires
+once and returns on a transient failure (SQL i/o timeout, socket reset, partial
+write), and relies only on an external HA periodic retry to recover, then under
+sustained load + repeated chaos the failure mode is **the rejoining pod never
+catches up and falls out of the group permanently**. Within minutes the cluster
+degenerates from N replicas to 1 healthy replica + (N-1) Error pods, even though
+each individual chaos event "recovered" by surface phase metrics.
+
+This guide is engine-neutral. It captures the soak design that detects this
+class of race, the detection indicators, and the layer classification.
+
+## When To Invoke
+
+Use this guide when:
+
+- An HA-capable addon (MGR / WSREP / Galera / OB / Paxos / sentinel + replica)
+  has passed N-of-N short chaos rounds with surface phase = Running but you do
+  not yet trust long-duration behavior.
+- A test or production report mentions "pod restarted, rejoined", and you want
+  to know whether the rejoin path is hard-tested.
+- A rare incident report says "two replicas lost role label after a pod kill"
+  or "secondaries became OFFLINE while primary still sees them healthy".
+- You are deciding what soak design satisfies the long-duration release standard
+  (third condition of the release matrix).
+
+## Failure mode (engine-neutral)
+
+```
+T0    : 3 healthy replicas, all writing OK, role labels set
+T1    : kill pod-X (the rejoining target)
+T1+   : pod-X restarts, engine startup script reaches the "join existing group"
+        step (engine-specific name varies)
+T1+s  : the join SQL/RPC call hits a transient failure (i/o timeout, connection
+        refused, group view temporarily stale)
+T1+s+ε: join call returns the original error and exits the startup script step
+T1+...: HA daemon's periodic loop tries to bring this pod back, but because the
+        original join was already attempted and the engine's local state machine
+        is now in an indeterminate sub-state (e.g. group_replication_local_address
+        already set, member list believes it was added), retries no longer
+        converge to a healthy ONLINE
+RESULT: cluster.phase goes Updating → Running on the surface (other N-1
+        replicas form a quorum), but the rejoining pod is permanently Failed/Error
+        with no role label. Repeat chaos a few more times → another pod hits the
+        same race → cluster degenerates further
+```
+
+The race is invisible if:
+- the chaos test only measures `cluster.phase == Running` after each cycle
+- the chaos test does not query MGR/WSREP/SHOW SLAVE STATUS view from each pod
+  independently
+- the soak is short enough that only one or two cycles run before stop
+
+## Soak design that catches this race
+
+Five concrete elements. Each is needed; missing any one masks the failure mode.
+
+### 1. Long duration
+
+At least **3 hours**, ideally 4+. Race exposure scales with number of rejoin
+attempts × probability per attempt. With ~5 min between chaos cycles, 3 hours
+gives ~36 rejoins: a ~70% chance (1 − (29/30)^36) of hitting a 1-in-30 race.
+
+### 2. Continuous concurrent write load throughout the entire window
+
+Multiple writer threads (≥3, ≥1 INSERT/0.5s each), inserting into a replicated
+table. The write load:
+- pressures the rejoining engine's commit/apply path while it is starting up
+- increases the probability of the rejoin SQL/RPC hitting a transient failure
+- guarantees that "data was being written when the rejoining pod was killed",
+  matching real-world load patterns
+
+A single 1-row-per-second writer is too gentle.
+
+### 3. Repeated kill alternating victim role
+
+Cycle over chaos events: cycle 1 kill primary, cycle 2 kill secondary,
+cycle 3 kill primary, and so on. Do this for the entire soak window.
+
+Always-kill-secondary or always-kill-primary masks asymmetries in the rejoin
+path (e.g. the primary's rejoin path may be different from a secondary's).
+
+### 4. Per-cycle multi-replica view consistency check
+
+After each chaos cycle recovers (cluster.phase=Running), query the HA group
+membership view **from every replica independently**, not just from the
+primary. Compare the views. They must all agree on `ONLINE_count`,
+`PRIMARY_count`, and `member set`.
+
+Example for MGR:
+
+```sql
+-- run on each pod independently, compare results
+SELECT COUNT(*), SUM(MEMBER_STATE='ONLINE'), SUM(MEMBER_ROLE='PRIMARY')
+FROM performance_schema.replication_group_members;
+```
+
+If the views disagree, mark the cycle as `view-split` and snapshot evidence
+immediately. Do NOT continue chaos cycles after a view split; you'll lose
+the evidence to the next kill.
+
+### 5. Early-stop on degeneration
+
+If `find_role_label primary` or `find_role_label secondary` returns empty in
+the next cycle, do not skip and keep cycling. **Snapshot evidence and stop.**
+Empty role labels mean the engine's rejoin path failed to restore membership
+on at least one replica.
+
+## Detection indicators
+
+When the race triggers, you will see (in order):
+
+1. The killed pod restarts.
+2. Pod-side engine log shows the join call, then a transient error (SQL i/o
+   timeout / connection reset / etc.), then no further successful join.
+3. Pod-side engine log shows the HA daemon's periodic retry loop running but
+   never reaching ONLINE.
+4. KB controller log shows `probe event failed exit code: 1` repeatedly for
+   that pod (because role probe queries the engine which now reports OFFLINE
+   or no member info).
+5. KB controller log shows `handle role change event ... role="" originalRole=""`
+   — the role label on that pod is cleared.
+6. `kubectl get pod` shows the pod stuck in Error or 0/4 Ready forever, until
+   manual restart.
+7. The cluster surface phase eventually returns to Running once the other N-1
+   replicas form a quorum, masking the failure if you only watch `.status.phase`.
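The cross-check these indicators call for (pod-local group view plus role label, with the cluster phase treated only as a hint) can be sketched as a small classifier. This is a minimal sketch under stated assumptions, not the harness used in the runs: `PodView`, `classify_cycle`, and every verdict string except `view-split` are hypothetical names, and the per-pod inputs are assumed to have been collected already (for MGR, one `replication_group_members` result per pod).

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class PodView:
    """Group membership as reported by ONE pod (hypothetical shape)."""
    online_count: int
    primary_count: int
    members: FrozenSet[str]


def classify_cycle(
    cluster_phase: str,
    views: Dict[str, Optional[PodView]],  # None = pod could not be queried
    role_labels: Dict[str, str],          # pod name -> role label, "" if cleared
    expected_replicas: int,
) -> str:
    """Return a verdict for one chaos cycle from pod-local evidence.

    The cluster phase never makes a cycle healthy on its own; it is only
    used to name which masking variant a degenerate cycle shows.
    """
    reachable = [v for v in views.values() if v is not None]

    # 1. Any disagreement between pod-local views is a view split.
    if len(set(reachable)) > 1:
        return "view-split"

    # 2. Pod-local health: every pod answers and holds a role label, and the
    #    agreed view shows all replicas ONLINE with exactly one primary.
    healthy = (
        bool(reachable)
        and len(reachable) == expected_replicas
        and all(role_labels.get(pod) for pod in views)
        and reachable[0].online_count == expected_replicas
        and reachable[0].primary_count == 1
    )
    if healthy:
        return "healthy"

    # 3. Degenerate: name the variant by what the surface phase claims.
    return "running-mask" if cluster_phase == "Running" else "updating-stuck"
```

A soak loop would call this once per cycle and snapshot evidence and stop on any verdict other than `healthy`, in line with the early-stop rule in element 5.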
+
+## Layer classification
+
+| Layer | When to assign |
+|---|---|
+| environment | only if vcluster / host kubelet / network is observably broken (API timeouts, node not ready, syncer halted). If the API responds and other pods recover, NOT environment. |
+| runner / harness | only if a clean control cycle (without writers / without repeated kill) also degenerates. If the soak ran clean for N cycles before failure, NOT harness. |
+| KubeBlocks controller / DataProtection / syncer | only if controller-side reconcile is the first failure (e.g. controller never updates cluster.phase, role label cleared by KB controller without engine event). If KB controller is faithfully reflecting the engine's OFFLINE / probe-failed state, NOT KB controller. |
+| **addon / engine product** | when pod-side engine log shows the join call returning an error that the engine's startup script does not retry, and the engine's HA periodic loop does not recover. **This is the default assignment for the auto-rejoin race.** |
+
+## Route to addon / engine team: escalation packet
+
+```text
+Auto-rejoin race finding:
+- engine + version:
+- cluster + namespace:
+- timeline:
+  - T0 cluster healthy
+  - T-kill-1 first chaos event
+  - T-degen first cycle where role disappeared
+- evidence files:
+  - pod-X engine log around T-degen (look for the failing join call)
+  - KB controller log probe event failed window
+  - kill log with cycle index
+  - 3-pod independent group view snapshots
+- candidate mechanism / hypothesis (from engine log lines):
+- minimum patch sketch:
+  - bounded retry of the join call
+  - reconnect engine local socket on each retry
+  - preserve the original error return path for permanent failures
+- soak validation plan after patch:
+  - same soak design (≥3h, ≥3 writers, alternating chaos, view check, early-stop)
+  - patch must run at least 1 full clean soak
+```
+
+## Patch-version validation
+
+After the engine team produces a patch:
+
+1. Build engine image locally or pull patched image (see
+   `local-build-sideload-test-image` skill).
+2. Sideload to the test vcluster.
+3. Verify live `imageID` / build info matches the patched commit (do not trust
+   tag-only deployment).
+4. Run the same soak design (same writers, same cycle interval, same duration).
+5. **At least one full soak must complete with `view-split=0` AND
+   `role-disappearance=0`.** If any cycle still degenerates, the patch does not fix
+   the race; gather new evidence and return to engine team.
+6. Repeat the soak at least 2 more times (N=3 floor for probabilistic events)
+   to bound the residual rate.
+
+## Related skills / docs
+
+- `skills/soak-test-classification/SKILL.md`: classify each cycle into
+  invariant-break / product-path-failure / harness-race / external-environmental-cascade
+- `skills/first-blocker-classification/SKILL.md`: 5-layer classification before
+  routing
+- `skills/continuous-test-until-release-ready/SKILL.md`: when soak is green for
+  a while, decide whether to stop, extend, or claim release-ready
+- `skills/local-build-sideload-test-image/SKILL.md`: build + sideload patch
+  images for validation
+- `docs/addon-vcluster-bounded-convergence-window-guide.md`: vcluster timing
+  multipliers when running soak inside vcluster
+- `docs/cases/mysql/mysql-mgr-auto-rejoin-race-case.md`: MySQL MGR case
+  (William 2026-05-15T04:49Z + Henry 2026-05-15T18:43Z soak; mgr-server
+  `JoinCurrentMemberToCluster` → `START GROUP_REPLICATION` once-and-return)
diff --git a/docs/cases/mysql/mysql-mgr-auto-rejoin-race-case.md b/docs/cases/mysql/mysql-mgr-auto-rejoin-race-case.md
new file mode 100644
index 0000000..8015c09
--- /dev/null
+++ b/docs/cases/mysql/mysql-mgr-auto-rejoin-race-case.md
@@ -0,0 +1,115 @@
+# MySQL MGR Auto-Rejoin Race Case
+
+Engine-specific case material for the engine-neutral guide
+`docs/addon-chaos-rejoin-race-detection-guide.md`.
+
+## Context
+
+- Engine: MySQL 8.0.33 (KubeBlocks MySQL addon 1.0.2).
+- Topology: MGR 3-replica.
+- Test environment: vcluster on idc2 (`mason-idc2-mysql-vc-2`).
+- Observed in two independent runs on the same day:
+  1. **William 2026-05-15T04:49 UTC**: original `tests/chaos.sh` C1 + C2 sequence
+     run, cluster `mysql-chaos-53410`. Evidence:
+     `/Users/wei/.slock/agents/5b8c8201-7b26-4882-a6bd-ed7f9c0e711b/evidence/mysql-tests-chaos-20260515T044954Z.tar.gz`
+     sha `19b429c6451c1966fd5877ef0ea2aa944668aff70c05363aaa7ec8c1a48a5859`.
+  2. **Henry 2026-05-15T15:35 UTC start, T18:43 UTC degenerate**: 4h soak,
+     cluster `mysql-chaos-soak-2712`. Evidence:
+     `/Users/wei/.slock/agents/beeaf92e-9737-4a15-a301-72d3aaca3991/work/chaos-c2-soak/mysql-chaos-c2-soak-product-event-20260516T034000Z.tar.gz`
+     sha `736a42cbb954ee54ff89a1f74154ba5da9f347b300c5e6898ae66796a68a6808`.
+
+## Pre-finding negative N
+
+Before the 4h soak, the chaos C2 lane ran:
+
+- Batch 1 (1 run, fresh cluster + kill secondary): 20s recovered, no view-split.
+- Batch 2 (4 runs, fresh cluster + kill secondary): all 20s, no view-split.
+- Batch 3 (4 runs, C1 → wait Running → C2 sequence matching William's
+  `tests/chaos.sh`): all 20s, no view-split.
+- Batch 4 (3 valid runs of 4; 1 SKIP forced mysql-2; with pre-C2 triple view
+  snapshot + 1 writer/s + alternating kill target): all 20s, no view-split.
+- Batch 5 (5 valid runs of 6; 1 SKIP; same conditions as batch 4, higher N):
+  all 20s, no view-split.
+- Batch 6 (5 valid runs of 6; tightened C1→C2 timing to skip
+  `wait_cluster_running`, +5 parallel writers, +rotated kill target including
+  forced mysql-2): all 20s, no view-split.
+
+**Cumulative: 22 valid chaos C2 attempts under varied conditions, all
+recovered within 30s, 0 view-split.** This was strong negative evidence and
+worth recording, but it did NOT prove the race was absent; it only bounded
+the per-cycle probability.
+
+## Findings: soak run (Henry)
+
+Script: `work/chaos-c2-soak.sh`. Duration target 14400s (4h),
+`PARALLEL_WRITERS=5`, `CYCLE_SECONDS=300` (5 min). Alternating
+primary/secondary kill.
+
+Timeline:
+
+| Phase | Cycle range | Wall time UTC | Status |
+|---|---|---|---|
+| Pre-chaos | n/a | T15:35 - T15:36 | clean: cluster Running, table created, writers started |
+| Healthy chaos | cycle 1 - 37 | T15:36 - T18:43 | 37 cycles all `recovered phase=Running` in 20-31s; `views_consistent count=3` every cycle |
+| First degenerate | cycle 36 (kill mysql-0 secondary) | T18:38:50 UTC | mysql-0 restart, `Current member is not in cluster, add it to cluster` → `init mgr plugin` (twice) → `start group replication` → 5s later `read tcp 127.0.0.1:59478->127.0.0.1:3306: i/o timeout` → `Join member to cluster failed: invalid connection`. mysql-0 stuck out of group. |
+| Second degenerate | cycle 37 (kill mysql-2 primary) | T18:43:42 UTC | cluster phase → Updating, mysql-1 promoted to primary. mysql-2 restart at T18:44:06, same pattern: `Join member to cluster failed: invalid connection` at T18:44:11. mysql-2 stuck out of group. |
+| Degenerate steady | cycle 38 - 61 | T18:48 - T19:34 | 24 cycles return `no primary/secondary found` (only mysql-1 has role label), kill skipped. KB controller logs 135 `probe event failed exit code: 1` events + clears role labels on mysql-0 and mysql-2. |
+| Soak natural end | n/a | T19:35 UTC | cleanup: trap EXIT fires, cluster delete issued |
+
+Match to William's case:
+
+| Signal | William 04:49Z | Henry soak (18:43+) |
+|---|---|---|
+| pod-side engine log loop `init mgr plugin` + `SET GLOBAL group_replication_group_seeds` | mysql-0 at T04:57:51+, 6 min after kill | mysql-0 at T18:38:55, mysql-2 at T18:44:06 |
+| pod-side error `Join member to cluster failed: invalid connection` / SQL i/o timeout | yes (mysql-0) | yes (mysql-0 and mysql-2) |
+| KB controller `probe event failed exit code: 1` | 75 (mysql-0) + 45 (mysql-1) + 60 (mysql-2) over 80s | 135 over ~10 min window |
+| KB controller `handle role change ... role="" originalRole=""` | mysql-0 at T04:51:48, mysql-2 at T04:51:47 | mysql-2 at T18:44:06, mysql-0 earlier |
+| cluster.phase=Updating beyond normal recovery window | yes (>900s) | yes (cycles 38+ never recovered to "kill kill" baseline) |
+
+Variant difference: William's per-pod independent views diverged (primary saw 3
+ONLINE, two secondaries each saw self-OFFLINE alone). Henry's variant had 2
+pods drop role entirely while mysql-1 stayed primary. Both are manifestations
+of the same underlying race in
+`mgr-server/internal/syncer/JoinCurrentMemberToCluster`: `START
+GROUP_REPLICATION` fires once, hits a transient SQL i/o timeout, returns
+without retry, and the engine state machine can no longer be cleanly recovered
+by HA daemon periodic ticks.
+
+## Trigger conditions
+
+The soak reproduced the race when ALL of the following held:
+
+1. Cluster runs continuously ≥3h (gives sufficient rejoin attempts).
+2. ≥5 parallel writers continuously inserting (~25 INSERT/s).
+3. Repeated chaos kill every ~5 min, alternating primary/secondary.
+4. mysql-0 (or any single pod) is killed multiple times, which increases the
+   chance it hits the race on a rejoin.
+
+Below this threshold (22 attempts in our chaos C2 lane before the soak), the
+race is rare enough that no single batch hit it. The short batches (≤30s
+recovery, ≤6 runs per batch) do not give enough rejoin opportunities.
+
+## Patch hypothesis (William, msg 61756a30)
+
+Root cause: `mgr-server/internal/syncer/JoinCurrentMemberToCluster` calls
+`START GROUP_REPLICATION`. If the underlying SQL connection times out once,
+the function returns the error and exits. The next attempt is only made by the
+HA daemon periodic loop, but by then the engine's local state may already be
+in an indeterminate sub-state (group seeds set, member partially registered).
+
+Minimum patch:
+- bounded retry loop around `START GROUP_REPLICATION` (e.g. 3 attempts with
+  exponential backoff)
+- reconnect the local MySQL socket between retries (so socket-level i/o timeout
+  is not the propagating failure)
+- preserve original error return path so permanent failures still surface
+
+## Validation plan (Henry, msg f5289913)
+
+After the patch lands and the engine image is built (local-build + sideload to
+vc-2), re-run the same 4h soak with identical parameters (5 writers, 5 min
+cycle, alternating kill). Acceptance condition:
+- At least 1 full 4h soak with `view-split=0` AND `role-disappearance=0`.
+- Repeat ≥2 more soaks at the same conditions (N=3 floor for probabilistic
+  release validation).
+- Optionally extend to longer soak (24h) per release standard third condition.
From a2c06182c93ff7a4a263c3602ee18d8d62ffc3dd Mon Sep 17 00:00:00 2001
From: Wei Cao
Date: Sat, 16 May 2026 04:02:09 +0800
Subject: [PATCH 2/2] docs(chaos-rejoin-race): clarify cluster phase observation variants

Per William review on PR #148: original Detection Indicator #7 only
described the Running-mask variant. Both William's chaos.sh case and
Henry's soak case actually had cluster phase stuck at Updating for the
whole window, not flipping back to Running.

Reworded #7 to cover both variants (Running-mask and Updating-stuck) and
emphasize the rule: always cross-check pod local view + role label, never
rely on cluster phase alone.

Co-Authored-By: Claude Opus 4.7
---
 docs/addon-chaos-rejoin-race-detection-guide.md | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/docs/addon-chaos-rejoin-race-detection-guide.md b/docs/addon-chaos-rejoin-race-detection-guide.md
index 75f5ae3..90281d8 100644
--- a/docs/addon-chaos-rejoin-race-detection-guide.md
+++ b/docs/addon-chaos-rejoin-race-detection-guide.md
@@ -125,8 +125,19 @@ When the race triggers, you will see (in order):
    — the role label on that pod is cleared.
 6. `kubectl get pod` shows the pod stuck in Error or 0/4 Ready forever, until
    manual restart.
-7. The cluster surface phase eventually returns to Running once the other N-1
-   replicas form a quorum, masking the failure if you only watch `.status.phase`.
+7. Cluster surface `.status.phase` lies about overall health, in one of two
+   variants:
+   - **Running-mask**: the other N-1 replicas form a quorum and KB controller
+     flips phase back to Running; the OFFLINE pod is left out but cluster
+     phase looks healthy. If you only watch `.status.phase` you will miss it.
+   - **Updating-stuck**: role probe failures keep KB controller from
+     re-flipping phase to Running; phase stays Updating for the entire
+     window. If your test only sets a 5-min OpsRequest timeout you may
+     classify this as a different failure ("OpsRequest hung") and miss the
+     underlying engine race.
+   - Both variants occurred in the same engine pattern on 2026-05-15 (see
+     case appendix). Bottom line: **always cross-check pod local view +
+     role label**, never rely on cluster phase alone.
 
 ## Layer classification