apecloud · weicao · May 15, 2026 · May 15, 2026
diff --git a/docs/addon-chaos-rejoin-race-detection-guide.md b/docs/addon-chaos-rejoin-race-detection-guide.md
@@ -0,0 +1,206 @@
+# Chaos Auto-Rejoin Race Detection (Soak Design)
+
+When an addon's HA engine rejoins a cluster after a pod restart, the rejoin path
+typically calls one or more engine-level commands (e.g. `START GROUP_REPLICATION`,
+`pg_basebackup` + `replica start`, `RECOVER PRIMARY`). If that path fires once and
+returns on a transient failure (SQL i/o timeout, socket reset, partial write) and
+relies only on an external periodic HA retry to recover, then under sustained
+load plus repeated chaos the failure mode is **the rejoining pod never
+catches up and falls out of the group permanently**. Within minutes the cluster
+degenerates from N replicas to 1 healthy replica + (N-1) Error pods, even though
+each individual chaos event "recovered" according to surface phase metrics.
+
+This guide is engine-neutral. It captures the soak design that detects this
+class of race, the detection indicators, and the layer classification.
+
+## When To Invoke
+
+Use this guide when:
+
+- An HA-capable addon (MGR / WSREP / Galera / OB / Paxos / sentinel + replica)
+  has passed N-of-N short chaos rounds with surface phase = Running but you do
+  not yet trust long-duration behavior.
+- A test or production report mentions "pod restarted, rejoined", and you want
+  to know whether the rejoin path is hard-tested.
+- A rare incident report says "two replicas lost role label after a pod kill"
+  or "secondaries became OFFLINE while primary still sees them healthy".
+- You are deciding what soak design satisfies the long-duration release standard
+  (third condition of the release matrix).
+
+## Failure mode (engine-neutral)
+
+```
+T0  : 3 healthy replicas, all writing OK, role labels set
+T1  : kill pod-X (the rejoining target)
+T1+ : pod-X restarts, engine startup script reaches the "join existing group"
+      step (engine-specific name varies)
+T1+s: the join SQL/RPC call hits a transient (i/o timeout, connection refused,
+      group view temporarily stale)
+T1+s+ε: join call returns the original error and exits the startup script step
+T1+...: HA daemon's periodic loop tries to bring this pod back, but because the
+      original join was already attempted and the engine's local state machine
+      is now in an indeterminate sub-state (e.g. group_replication_local_address
+      already set, member list believes it was added), retries no longer
+      converge to a healthy ONLINE
+RESULT: cluster.phase goes Updating → Running on the surface (other N-1
+      replicas form a quorum), but the rejoining pod is permanently Failed/Error
+      with no role label. Repeat chaos a few more times → another pod hits the
+      same race → cluster degenerates further
+```
+
+The race is invisible if any of the following holds:
+- the chaos test only measures `cluster.phase == Running` after each cycle
+- the chaos test does not query the engine's own membership view (MGR group
+  members, WSREP status, `SHOW SLAVE STATUS`) from each pod independently
+- the soak is short enough that only one or two cycles run before it stops
+
+## Soak design that catches this race
+
+Five concrete elements. Each is needed; missing any one masks the failure mode.
+
+### 1. Long duration
+
+At least **3 hours**, ideally 4+. Race exposure scales with number of rejoin
+attempts × probability per attempt. With ~5 min between chaos cycles, 3 hours
+gives ~36 rejoins. Assuming independent attempts, a 1-in-30 per-rejoin race then shows up at least once with roughly 70% probability.
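The exposure arithmetic in this section can be checked with a short calculation. This is a minimal sketch that assumes every rejoin attempt hits the race independently and with the same probability; the function names are illustrative and do not come from any soak script.

```python
def rejoins_in_window(soak_hours: float, cycle_minutes: float) -> int:
    """Number of whole chaos cycles (one rejoin each) that fit in the soak window."""
    if soak_hours < 0 or cycle_minutes <= 0:
        raise ValueError("soak_hours must be >= 0 and cycle_minutes must be > 0")
    return int(soak_hours * 60 // cycle_minutes)


def p_at_least_one_hit(p_per_rejoin: float, rejoins: int) -> float:
    """Probability that the race fires at least once across `rejoins` attempts.

    Assumes attempts are independent, which is a simplification.
    """
    if not 0.0 <= p_per_rejoin <= 1.0:
        raise ValueError("p_per_rejoin must be within [0, 1]")
    if rejoins < 0:
        raise ValueError("rejoins must be >= 0")
    return 1.0 - (1.0 - p_per_rejoin) ** rejoins
```

Under that assumption, `p_at_least_one_hit(1 / 30, rejoins_in_window(3, 5))` comes out at roughly 0.70, while a two-cycle run stays below 0.07, which is why short chaos rounds rarely see the race.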
+
+### 2. Continuous concurrent write load throughout the entire window
+
+Multiple writer threads (≥3, ≥1 INSERT/0.5s each), inserting into a replicated
+table. The write load:
+- pressures the rejoining engine's commit/apply path while it is starting up
+- increases the probability of the rejoin SQL/RPC hitting a transient failure
+- guarantees that "data was being written when the rejoining pod was killed",
+  matching real-world load patterns
+
+A single 1-row-per-second writer is too gentle.
+
+### 3. Repeated kill alternating victim role
+
+Alternate the victim role across chaos events: cycle 1 kills the primary, cycle 2 kills a secondary,
+cycle 3 kills the primary, and so on for the entire soak window.
+
+Always-kill-secondary or always-kill-primary masks asymmetries in the rejoin
+path (e.g. the primary's rejoin path may be different from a secondary's).
+
+### 4. Per-cycle multi-replica view consistency check
+
+After each chaos cycle recovers (cluster.phase=Running), query the HA group
+membership view **from every replica independently**, not just from the
+primary. Compare the views. They must all agree on `ONLINE_count`,
+`PRIMARY_count`, and `member set`.
+
+Example for MGR:
+
+```sql
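+-- run on each pod independently, then compare all four columns across pods
+SELECT COUNT(*) AS member_count,
+       SUM(MEMBER_STATE = 'ONLINE') AS online_count,
+       SUM(MEMBER_ROLE = 'PRIMARY') AS primary_count,
+       GROUP_CONCAT(MEMBER_ID ORDER BY MEMBER_ID) AS member_set
+FROM performance_schema.replication_group_members;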
+```
+
+If the views disagree, mark the cycle as `view-split` and snapshot evidence
+immediately. Do NOT continue chaos cycles after a view split — you'll lose
+the evidence to the next kill.
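The comparison step could look like the sketch below. It assumes the per-pod query results have already been collected into a dictionary keyed by pod name. `GroupView`, `classify_cycle`, and the extra `degraded` verdict (all pods agree, but on an unhealthy group) are illustrative names introduced here, not part of the guide or of any harness.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class GroupView:
    """What one pod reports about the group when queried locally."""
    online_count: int
    primary_count: int
    members: FrozenSet[str]


def classify_cycle(views: Dict[str, GroupView], expected_replicas: int) -> str:
    """Return 'view-split', 'degraded', or 'consistent' for one chaos cycle."""
    if not views:
        raise ValueError("need at least one pod view")
    distinct = set(views.values())
    if len(distinct) > 1:
        # Pods disagree with each other: stop cycling and snapshot evidence.
        return "view-split"
    view = next(iter(distinct))
    if view.online_count != expected_replicas or view.primary_count != 1:
        # Pods agree, but on a group that is smaller or leaderless.
        return "degraded"
    return "consistent"
```

A pod that sees only itself as a member produces a different `GroupView` from the primary's, so the split is caught even when the primary still reports three ONLINE members.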
+
+### 5. Early-stop on degeneration
+
+If `find_role_label primary` or `find_role_label secondary` returns empty in
+the next cycle, do not just skip the kill and keep cycling. **Snapshot evidence and stop.**
+Empty role labels mean the engine's rejoin path failed to restore membership
+on at least one replica.
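Elements 3 and 5 combine into a small control loop. The sketch below is a simplified illustration: `find_role_label`, `kill_pod`, and `snapshot_evidence` are injected callables standing in for whatever the harness provides (their signatures are assumptions), and the wait-for-recovery and view-check steps between kills are omitted.

```python
from typing import Callable, Optional


def run_chaos_cycles(
    max_cycles: int,
    find_role_label: Callable[[str], Optional[str]],
    kill_pod: Callable[[str], None],
    snapshot_evidence: Callable[[str], None],
) -> Optional[int]:
    """Alternate primary/secondary kills; stop at the first missing role label.

    Returns the cycle index where the soak stopped early, or None if every
    cycle found a victim.
    """
    for cycle in range(1, max_cycles + 1):
        # Odd cycles target the primary, even cycles a secondary.
        role = "primary" if cycle % 2 == 1 else "secondary"
        victim = find_role_label(role)
        if not victim:
            # Degeneration: preserve evidence instead of skipping the kill.
            snapshot_evidence(f"cycle {cycle}: no pod carries role label {role!r}")
            return cycle
        kill_pod(victim)
    return None
```

Stopping at the first empty label keeps the logs of the failed rejoin intact; a loop that merely skips the kill keeps running against a cluster that has already degenerated.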
+
+## Detection indicators
+
+When the race triggers, you will see (in order):
+
+1. The killed pod restarts.
+2. Pod-side engine log shows the join call, then a transient error (SQL i/o
+   timeout / connection reset / etc), then no further successful join.
+3. Pod-side engine log shows the HA daemon's periodic retry loop running but
+   never reaching ONLINE.
+4. KB controller log shows `probe event failed exit code: 1` repeatedly for
+   that pod (because role probe queries the engine which now reports OFFLINE
+   or no member info).
+5. KB controller log shows `handle role change event ... role="" originalRole=""`
+   — the role label on that pod is cleared.
+6. `kubectl get pod` shows the pod stuck in Error or 0/4 Ready forever, until
+   manual restart.
+7. Cluster surface `.status.phase` lies about overall health, in one of two
+   variants:
+   - **Running-mask**: the other N-1 replicas form a quorum and KB controller
+     flips phase back to Running; the OFFLINE pod is left out but cluster
+     phase looks healthy. If you only watch `.status.phase` you will miss it.
+   - **Updating-stuck**: role probe failures keep KB controller from
+     re-flipping phase to Running; phase stays Updating for the entire
+     window. If your test only sets a 5-min OpsRequest timeout you may
+     classify this as a different failure ("OpsRequest hung") and miss the
+     underlying engine race.
+   - Both variants occurred in the same engine pattern on 2026-05-15 (see
+     `docs/cases/mysql/mysql-mgr-auto-rejoin-race-case.md`). Bottom line: **always
+     cross-check pod local view + role label**, never rely on cluster phase alone.
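The cross-check in item 7 can be expressed as a small function. This is a sketch under stated assumptions: it is meant to run only after the normal recovery window has elapsed, `role_labels` maps pod name to its current role label (empty or `None` when cleared), and `member_states` maps pod name to the state that pod reports about itself. All names are illustrative.

```python
from typing import Dict, List, Optional, Tuple


def cross_check(
    phase: str,
    role_labels: Dict[str, Optional[str]],
    member_states: Dict[str, str],
) -> Tuple[str, List[str]]:
    """Compare cluster phase against per-pod role label and local member state."""
    suspect = sorted(
        pod
        for pod, role in role_labels.items()
        if not role or member_states.get(pod) != "ONLINE"
    )
    if not suspect:
        return ("ok", [])
    if phase == "Running":
        # Phase looks healthy while some pods are out of the group.
        return ("running-mask", suspect)
    if phase == "Updating":
        # Phase never flipped back; do not file this as a plain ops timeout.
        return ("updating-stuck", suspect)
    return ("degraded", suspect)
```

The point of returning the suspect pods alongside the variant is that the verdict never depends on `.status.phase` alone.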
+
+## Layer classification
+
+| Layer | When to assign |
+|---|---|
+| environment | only if vcluster / host kubelet / network is observably broken (API timeouts, node not ready, syncer halted). If the API responds and other pods recover, NOT environment. |
+| runner / harness | only if a clean control cycle (without writers / without repeated kill) also degenerates. If the soak ran clean for N cycles before failure, NOT harness. |
+| KubeBlocks controller / DataProtection / syncer | only if controller-side reconcile is the first failure (e.g. controller never updates cluster.phase, role label cleared by KB controller without engine event). If KB controller is faithfully reflecting the engine's OFFLINE / probe-failed state, NOT KB controller. |
+| **addon / engine product** | when pod-side engine log shows the join call returning an error that the engine's startup script does not retry, and the engine's HA periodic loop does not recover. **This is the default assignment for the auto-rejoin race.** |
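The table can be read as an ordered decision. The sketch below assumes the rows are evaluated top to bottom, which the table itself does not state explicitly; the boolean inputs are judgments the investigator makes from the evidence, and the function name is illustrative.

```python
def assign_layer(
    environment_broken: bool,
    clean_control_degenerates: bool,
    controller_failed_first: bool,
    join_error_not_retried: bool,
) -> str:
    """Map the four evidence questions from the table to a layer name."""
    if environment_broken:
        # API timeouts, node not ready, syncer halted.
        return "environment"
    if clean_control_degenerates:
        # A cycle without writers / repeated kill also degenerates.
        return "runner / harness"
    if controller_failed_first:
        # Controller-side reconcile is the first failure, not a reflection of it.
        return "KubeBlocks controller / DataProtection / syncer"
    if join_error_not_retried:
        # Join call errored, startup script did not retry, HA loop did not recover.
        return "addon / engine product"
    return "unclassified"
```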
+
+## Route to addon / engine team — escalation packet
+
+```text
+Auto-rejoin race finding:
+- engine + version:
+- cluster + namespace:
+- timeline:
+  - T0 cluster healthy
+  - T-kill-1 first chaos event
+  - T-degen first cycle where role disappeared
+- evidence files:
+  - pod-X engine log around T-degen (look for the failing join call)
+  - KB controller log probe event failed window
+  - kill log with cycle index
+  - 3-pod independent group view snapshots
+- candidate mechanism / hypothesis (from engine log lines):
+- minimum patch sketch:
+  - bounded retry of the join call
+  - reconnect engine local socket on each retry
+  - preserve the original error return path for permanent failures
+- soak validation plan after patch:
+  - same soak design (≥3h, ≥3 writers, alternating chaos, view check, early-stop)
+  - patch must run at least 1 full clean soak
+```
+
+## Patch-version validation
+
+After the engine team produces a patch:
+
+1. Build engine image locally or pull patched image (see
+   `local-build-sideload-test-image` skill).
+2. Sideload to the test vcluster.
+3. Verify live `imageID` / build info matches the patched commit (do not trust
+   tag-only deployment).
+4. Run the same soak design (same writers, same cycle interval, same duration).
+5. **At least one full soak must complete with `view-split=0` AND
+   `role-disappearance=0`.** If any cycle still degenerates, the patch does not fix
+   the race; gather new evidence and return to the engine team.
+6. Repeat the soak at least 2 more times (N=3 floor for probabilistic events)
+   to bound the residual rate.
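Steps 5 and 6 amount to a simple acceptance rule. The sketch below is one possible encoding, assuming each soak is summarized by three fields; `SoakResult` and `patch_accepted` are illustrative names, not part of any existing tooling.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SoakResult:
    completed: bool            # soak ran for its full planned duration
    view_splits: int           # cycles marked view-split
    role_disappearances: int   # cycles where a role label went missing


def is_clean(result: SoakResult) -> bool:
    return (
        result.completed
        and result.view_splits == 0
        and result.role_disappearances == 0
    )


def patch_accepted(results: Sequence[SoakResult], required_clean_soaks: int = 3) -> bool:
    """Accept only if enough soaks ran and every one of them was clean."""
    return len(results) >= required_clean_soaks and all(is_clean(r) for r in results)
```

A single degenerate soak fails the whole set, which matches the rule that any remaining degeneration sends the patch back to the engine team.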
+
+## Related skills / docs
+
+- `skills/soak-test-classification/SKILL.md` — classify each cycle into
+  invariant-break / product-path-failure / harness-race / external-environmental-cascade
+- `skills/first-blocker-classification/SKILL.md` — 5-layer classification before
+  routing
+- `skills/continuous-test-until-release-ready/SKILL.md` — when soak is green for
+  a while, decide whether to stop, extend, or claim release-ready
+- `skills/local-build-sideload-test-image/SKILL.md` — build + sideload patch
+  images for validation
+- `docs/addon-vcluster-bounded-convergence-window-guide.md` — vcluster timing
+  multipliers when running soak inside vcluster
+- `docs/cases/mysql/mysql-mgr-auto-rejoin-race-case.md` — MySQL MGR case
+  (William 2026-05-15T04:49Z + Henry 2026-05-15T18:43Z soak; mgr-server
+  `JoinCurrentMemberToCluster` → `START GROUP_REPLICATION` once-and-return)
diff --git a/docs/cases/mysql/mysql-mgr-auto-rejoin-race-case.md b/docs/cases/mysql/mysql-mgr-auto-rejoin-race-case.md
@@ -0,0 +1,115 @@
+# MySQL MGR Auto-Rejoin Race Case
+
+Engine-specific case material for the engine-neutral guide
+`docs/addon-chaos-rejoin-race-detection-guide.md`.
+
+## Context
+
+- Engine: MySQL 8.0.33 (KubeBlocks MySQL addon 1.0.2).
+- Topology: MGR 3-replica.
+- Test environment: vcluster on idc2 (`mason-idc2-mysql-vc-2`).
+- Observed in two independent runs on the same day:
+  1. **William 2026-05-15T04:49 UTC** — original `tests/chaos.sh` C1 + C2 sequence
+     run, cluster `mysql-chaos-53410`. Evidence:
+     `/Users/wei/.slock/agents/5b8c8201-7b26-4882-a6bd-ed7f9c0e711b/evidence/mysql-tests-chaos-20260515T044954Z.tar.gz`
+     sha `19b429c6451c1966fd5877ef0ea2aa944668aff70c05363aaa7ec8c1a48a5859`.
+  2. **Henry 2026-05-15T15:35 UTC start, T18:43 UTC degenerate** — 4h soak,
+     cluster `mysql-chaos-soak-2712`. Evidence:
+     `/Users/wei/.slock/agents/beeaf92e-9737-4a15-a301-72d3aaca3991/work/chaos-c2-soak/mysql-chaos-c2-soak-product-event-20260516T034000Z.tar.gz`
+     sha `736a42cbb954ee54ff89a1f74154ba5da9f347b300c5e6898ae66796a68a6808`.
+
+## Pre-finding negative N
+
+Before the 4h soak, the chaos C2 lane ran:
+
+- Batch 1 (1 run, fresh cluster + kill secondary): 20s recovered, no view-split.
+- Batch 2 (4 runs, fresh cluster + kill secondary): all 20s, no view-split.
+- Batch 3 (4 runs, C1 → wait Running → C2 sequence matching William's
+  `tests/chaos.sh`): all 20s, no view-split.
+- Batch 4 (3 valid runs of 4; 1 SKIP forced mysql-2; with pre-C2 triple view
+  snapshot + 1 writer/s + alternating kill target): all 20s, no view-split.
+- Batch 5 (5 valid runs of 6; 1 SKIP; same conditions as batch 4, higher N):
+  all 20s, no view-split.
+- Batch 6 (5 valid runs of 6; tightened C1→C2 timing to skip
+  `wait_cluster_running`, +5 parallel writers, +rotated kill target including
+  forced mysql-2): all 20s, no view-split.
+
+**Cumulative: 22 valid chaos C2 attempts under varied conditions, all
+recovered within 30s, 0 view-split.** This was strong negative evidence and
+worth recording, but it did NOT prove the race was absent — it only bounded
+the per-cycle probability.
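How loosely 22 clean attempts bound the per-cycle probability can be estimated with the standard zero-failure bound. This is a sketch that assumes independent, identically distributed attempts, which the varied batch conditions only approximate; the function name is illustrative.

```python
def zero_failure_upper_bound(clean_attempts: int, confidence: float = 0.95) -> float:
    """Upper bound on the per-attempt probability after n attempts with zero hits.

    Solves (1 - p) ** n == 1 - confidence for p.
    """
    if clean_attempts < 1:
        raise ValueError("clean_attempts must be >= 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / clean_attempts)
```

`zero_failure_upper_bound(22)` is about 0.127, so at 95% confidence the 22 clean attempts could not rule out a race as frequent as roughly 1 in 8 per rejoin.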
+
+## Findings: soak run (Henry)
+
+Script: `work/chaos-c2-soak.sh`. Duration target 14400s (4h),
+`PARALLEL_WRITERS=5`, `CYCLE_SECONDS=300` (5 min). Alternating
+primary/secondary kill.
+
+Timeline:
+
+| Phase | Cycle range | Wall time UTC | Status |
+|---|---|---|---|
+| Pre-chaos | n/a | T15:35 - T15:36 | cluster Running, table created, writers started; clean |
+| Healthy chaos | cycle 1 - 37 | T15:36 - T18:43 | 37 cycles all `recovered phase=Running` in 20-31s; `views_consistent count=3` every cycle |
+| First degenerate | cycle 36 (kill mysql-0 secondary) | T18:38:50 UTC | mysql-0 restart, `Current member is not in cluster, add it to cluster` → `init mgr plugin` (twice) → `start group replication` → 5s later `read tcp 127.0.0.1:59478->127.0.0.1:3306: i/o timeout` → `Join member to cluster failed: invalid connection`. mysql-0 stuck out of group. |
+| Second degenerate | cycle 37 (kill mysql-2 primary) | T18:43:42 UTC | cluster phase → Updating, mysql-1 promoted to primary. mysql-2 restart at T18:44:06, same pattern: `Join member to cluster failed: invalid connection` at T18:44:11. mysql-2 stuck out of group. |
+| Degenerate steady | cycle 38 - 61 | T18:48 - T19:34 | 24 cycles return `no primary/secondary found` (only mysql-1 has role label), kill skipped. KB controller logs 135 `probe event failed exit code: 1` events + clears role labels on mysql-0 and mysql-2. |
+| Soak natural end | n/a | T19:35 UTC | trap EXIT fires, cluster delete issued; cleanup |
+
+Match to William's case:
+
+| Signal | William 04:49Z | Henry soak (18:43+) |
+|---|---|---|
+| pod-side engine log loop `init mgr plugin` + `SET GLOBAL group_replication_group_seeds` | mysql-0 at T04:57:51+, 6 min after kill | mysql-0 at T18:38:55, mysql-2 at T18:44:06 |
+| pod-side error `Join member to cluster failed: invalid connection` / SQL i/o timeout | yes (mysql-0) | yes (mysql-0 and mysql-2) |
+| KB controller `probe event failed exit code: 1` | 75 (mysql-0) + 45 (mysql-1) + 60 (mysql-2) over 80s | 135 over ~10 min window |
+| KB controller `handle role change ... role="" originalRole=""` | mysql-0 at T04:51:48, mysql-2 at T04:51:47 | mysql-2 at T18:44:06, mysql-0 earlier |
+| cluster.phase=Updating beyond normal recovery window | yes (>900s) | yes (cycles 38+ never recovered to the alternating-kill baseline) |
+
+Variant difference: William's pod independent views diverged (primary saw 3
+ONLINE, two secondaries each saw self-OFFLINE alone). Henry's variant had 2
+pods drop role entirely while mysql-1 stayed primary. Both are manifestations
+of the same underlying race in
+`mgr-server/internal/syncer/JoinCurrentMemberToCluster`: `START
+GROUP_REPLICATION` fires once, hits a transient SQL i/o timeout, returns
+without retry, and the engine state machine can no longer be cleanly recovered
+by HA daemon periodic ticks.
+
+## Trigger conditions
+
+The race recurs reliably when ALL of:
+
+1. Cluster runs continuously ≥3h (gives sufficient rejoin attempts).
+2. ≥5 parallel writers continuously inserting (~25 INSERT/s).
+3. Repeated chaos kill every ~5 min, alternating primary/secondary.
+4. mysql-0 (or any single pod) is killed multiple times — increases the chance
+   it hits the race on a rejoin.
+
+Below this threshold (22 attempts in our chaos C2 lane before the soak), the
+race is rare enough that no single batch hit it. The short batches (≤30s
+recovery, at most 6 runs per batch) do not provide enough rejoin opportunities.
+
+## Patch hypothesis (William, msg 61756a30)
+
+Root cause: `mgr-server/internal/syncer/JoinCurrentMemberToCluster` calls
+`START GROUP_REPLICATION`. If the underlying SQL connection times out once,
+the function returns the error and exits. The next attempt is only made by the
+HA daemon periodic loop, but by then the engine's local state may already be
+in an indeterminate sub-state (group seeds set, member partially registered).
+
+Minimum patch:
+- bounded retry loop around `START GROUP_REPLICATION` (e.g. 3 attempts with
+  exponential backoff)
+- reconnect the local MySQL socket between retries (so socket-level i/o timeout
+  is not the propagating failure)
+- preserve original error return path so permanent failures still surface
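The three patch bullets can be illustrated with a language-neutral sketch. This is not the mgr-server code: `EngineConn`, `TransientJoinError`, and the `connect` factory are hypothetical stand-ins, and classifying which driver errors count as transient is left to the real implementation.

```python
import time
from typing import Callable, Optional, Protocol


class EngineConn(Protocol):
    def execute(self, statement: str) -> None: ...
    def close(self) -> None: ...


class TransientJoinError(Exception):
    """Hypothetical marker for retryable failures such as an i/o timeout."""


def join_with_bounded_retry(
    connect: Callable[[], EngineConn],
    attempts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    last_error: Optional[TransientJoinError] = None
    for attempt in range(attempts):
        conn = connect()  # fresh connection per attempt, never a reused socket
        try:
            conn.execute("START GROUP_REPLICATION")
            return
        except TransientJoinError as err:
            last_error = err  # retryable: remember it and try again
        finally:
            conn.close()
        if attempt < attempts - 1:
            sleep(base_delay_s * 2 ** attempt)  # exponential backoff
    assert last_error is not None
    raise last_error  # retries exhausted: surface the original error
```

Any exception other than `TransientJoinError` propagates on the first attempt, so permanent failures still surface through the original error path.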
+
+## Validation plan (Henry, msg f5289913)
+
+After the patch lands and the engine image is built (local-build + sideload to
+vc-2), re-run the same 4h soak with identical parameters (5 writers, 5 min
+cycle, alternating kill). Acceptance condition:
+- At least 1 full 4h soak with `view-split=0` AND `role-disappearance=0`.
+- Repeat ≥2 more soaks at the same conditions (N=3 floor for probabilistic
+  release validation).
+- Optionally extend to longer soak (24h) per release standard third condition.