206 changes: 206 additions & 0 deletions docs/addon-chaos-rejoin-race-detection-guide.md
@@ -0,0 +1,206 @@
# Chaos Auto-Rejoin Race Detection (Soak Design)

When an addon's HA engine rejoins a cluster after a pod restart, the rejoin path
typically calls one or more engine-level commands (e.g. `START GROUP_REPLICATION`,
`pg_basebackup` + `replica start`, `RECOVER PRIMARY`). If that path tries once and
returns on a transient failure (SQL i/o timeout, socket reset, partial write) and
relies only on an external periodic HA retry to recover, then under sustained
load plus repeated chaos the failure mode is **the rejoining pod never catches
up and falls out of the group permanently**. Within minutes the cluster
degenerates from N replicas to 1 healthy replica + (N-1) Error pods, even though
each individual chaos event "recovered" according to surface phase metrics.

This guide is engine-neutral. It captures the soak design that detects this
class of race, the detection indicators, and the layer classification.

## When To Invoke

Use this guide when:

- An HA-capable addon (MGR / WSREP / Galera / OB / Paxos / sentinel + replica)
has passed N-of-N short chaos rounds with surface phase = Running but you do
not yet trust long-duration behavior.
- A test or production report mentions "pod restarted, rejoined", and you want
to know whether the rejoin path is hard-tested.
- A rare incident report says "two replicas lost role label after a pod kill"
or "secondaries became OFFLINE while primary still sees them healthy".
- You are deciding what soak design satisfies the long-duration release standard
(third condition of the release matrix).

## Failure mode (engine-neutral)

```
T0      : 3 healthy replicas, all writing OK, role labels set
T1      : kill pod-X (the rejoining target)
T1+     : pod-X restarts, engine startup script reaches the "join existing group"
          step (engine-specific name varies)
T1+s    : the join SQL/RPC call hits a transient (i/o timeout, connection refused,
          group view temporarily stale)
T1+s+ε  : join call returns the original error and exits the startup script step
T1+...  : HA daemon's periodic loop tries to bring this pod back, but because the
          original join was already attempted and the engine's local state machine
          is now in an indeterminate sub-state (e.g. group_replication_local_address
          already set, member list believes it was added), retries no longer
          converge to a healthy ONLINE
RESULT  : cluster.phase goes Updating → Running on the surface (other N-1
          replicas form a quorum), but the rejoining pod is permanently Failed/Error
          with no role label. Repeat chaos a few more times → another pod hits the
          same race → cluster degenerates further
```

The race is invisible if:
- the chaos test only measures `cluster.phase == Running` after each cycle
- the chaos test does not query the engine's membership view (MGR, WSREP,
`SHOW SLAVE STATUS`) from each pod independently
- the soak is short enough that only one or two cycles run before stop

## Soak design that catches this race

Five concrete elements. Each is needed; missing any one masks the failure mode.

### 1. Long duration

At least **3 hours**, ideally 4+. Race exposure scales with number of rejoin
attempts × probability per attempt. With ~5 min between chaos cycles, 3 hours
gives ~36 rejoins, enough that a race firing once in 30 rejoins is more likely
than not to appear at least once.
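
The arithmetic behind this floor can be checked directly. The sketch below assumes every rejoin is an independent trial with a fixed per-attempt race probability, which is a simplification (real attempts share load and engine state). The function names are illustrative and not part of any harness.

```python
def rejoin_attempts(duration_s: int, cycle_s: int) -> int:
    """Number of chaos cycles (one rejoin each) that fit in the soak window."""
    return duration_s // cycle_s


def p_at_least_one_hit(p_per_attempt: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent rejoins hits the race."""
    return 1.0 - (1.0 - p_per_attempt) ** attempts


for hours in (1, 3, 4):
    n = rejoin_attempts(hours * 3600, 300)  # ~5 min between chaos cycles
    p = p_at_least_one_hit(1 / 30, n)       # assumed 1-in-30 race
    print(f"{hours}h soak: {n} rejoins, P(at least one hit) = {p:.2f}")
```

Under these assumptions a 1-hour soak exposes a 1-in-30 race only about a third of the time, a 3-hour soak about 70% of the time, and a 4-hour soak about 80% of the time, which is why a single clean short run says little.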

### 2. Continuous concurrent write load throughout the entire window

Multiple writer threads (≥3 writers, each issuing at least one INSERT every
0.5s), inserting into a replicated table. The write load:
- pressures the rejoining engine's commit/apply path while it is starting up
- increases the probability of the rejoin SQL/RPC hitting a transient failure
- guarantees that "data was being written when the rejoining pod was killed",
matching real-world load patterns

A single 1-row-per-second writer is too gentle.
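
A minimal sketch of such a writer pool follows. It is engine-neutral: `insert_row` is a hypothetical callable that you would back with a real client issuing one INSERT into the replicated table, and `start_writers` is an illustrative name, not an existing harness function. Insert failures are counted instead of aborting the writer, since errors are expected while a pod is being killed.

```python
import threading
from typing import Callable, Dict, List


def start_writers(insert_row: Callable[[int, int], None], stop: threading.Event,
                  writers: int = 3, interval_s: float = 0.5):
    """Start writer threads; return (threads, stats) where stats counts ok/failed inserts."""
    stats: Dict[str, int] = {"ok": 0, "failed": 0}
    lock = threading.Lock()

    def loop(writer_id: int) -> None:
        seq = 0
        while not stop.is_set():
            try:
                insert_row(writer_id, seq)  # assumed: one INSERT per call
                outcome = "ok"
            except Exception:
                # Errors are expected during a kill; keep the load running.
                outcome = "failed"
            with lock:
                stats[outcome] += 1
            seq += 1
            stop.wait(interval_s)  # sleeps, but returns immediately once stop is set

    threads: List[threading.Thread] = [
        threading.Thread(target=loop, args=(i,), daemon=True) for i in range(writers)
    ]
    for t in threads:
        t.start()
    return threads, stats
```

Set the `stop` event only at the end of the soak window, so the load covers every kill and every rejoin rather than just the quiet periods in between.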

### 3. Repeated kill alternating victim role

Alternate the victim role across chaos events: cycle 1 kills the primary,
cycle 2 kills a secondary, cycle 3 kills the primary, and so on for the entire
soak window.

Always-kill-secondary or always-kill-primary masks asymmetries in the rejoin
path (e.g. the primary's rejoin path may be different from a secondary's).
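
The alternation itself is a one-liner. In this sketch `victim_role` is a hypothetical helper name, and cycles are numbered from 1 as in the text above.

```python
def victim_role(cycle: int) -> str:
    """Role to kill in a 1-based chaos cycle: odd cycles primary, even cycles secondary."""
    if cycle < 1:
        raise ValueError("cycle numbers start at 1")
    return "primary" if cycle % 2 == 1 else "secondary"


schedule = [victim_role(c) for c in range(1, 7)]
print(schedule)
```

Resolve the role to a concrete pod name at kill time from the live role labels rather than caching it at soak start, because the primary moves after every primary kill.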

### 4. Per-cycle multi-replica view consistency check

After each chaos cycle recovers (cluster.phase=Running), query the HA group
membership view **from every replica independently**, not just from the
primary. Compare the views. They must all agree on `ONLINE_count`,
`PRIMARY_count`, and `member set`.

Example for MGR:

```sql
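-- Run on each pod independently, then compare the rows across pods.
SELECT COUNT(*)                     AS member_count,
       SUM(MEMBER_STATE = 'ONLINE') AS online_count,
       SUM(MEMBER_ROLE = 'PRIMARY') AS primary_count,
       GROUP_CONCAT(MEMBER_HOST ORDER BY MEMBER_HOST) AS member_set
FROM performance_schema.replication_group_members;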
```

If the views disagree, mark the cycle as `view-split` and snapshot evidence
immediately. Do NOT continue chaos cycles after a view split — you'll lose
the evidence to the next kill.
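
The comparison step can be sketched as follows, assuming the harness has already collected one view per pod. `GroupView` and `classify_views` are hypothetical names, and the extra `degraded` label (all pods agree, but on an unhealthy group) is an addition of this sketch rather than a term used elsewhere in this guide.

```python
from typing import Dict, FrozenSet, NamedTuple


class GroupView(NamedTuple):
    online_count: int
    primary_count: int
    members: FrozenSet[str]


def classify_views(views: Dict[str, GroupView], expected_members: int) -> str:
    """Return 'consistent', 'view-split', or 'degraded' for one chaos cycle."""
    if not views:
        raise ValueError("no views collected")
    if len(set(views.values())) > 1:
        return "view-split"  # pods disagree about the group
    view = next(iter(views.values()))
    if view.online_count != expected_members or view.primary_count != 1:
        return "degraded"    # pods agree, but the group is not healthy
    return "consistent"


full = frozenset({"pod-0", "pod-1", "pod-2"})
split = {
    "pod-0": GroupView(0, 0, frozenset({"pod-0"})),  # sees only itself, not ONLINE
    "pod-1": GroupView(3, 1, full),                  # primary still sees 3 ONLINE
    "pod-2": GroupView(0, 0, frozenset({"pod-2"})),
}
print(classify_views(split, expected_members=3))
```

Comparing whole tuples, including the member set, catches the case where counts happen to match while the pods disagree about who is in the group.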

### 5. Early-stop on degeneration

If `find_role_label primary` or `find_role_label secondary` returns empty in
the next cycle, do not skip the kill and carry on cycling. **Snapshot evidence
and stop.** Empty role labels mean the engine's rejoin path failed to restore
membership on at least one replica.
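
As a sketch, the per-cycle decision reduces to a small gate in front of the kill. `next_step` and its return labels are hypothetical; the inputs stand for the results of the role-label lookups and of the view check from the previous element.

```python
from typing import Optional


def next_step(primary_pod: Optional[str], secondary_pod: Optional[str],
              views_consistent: bool) -> str:
    """Decide what the soak loop does before the next kill."""
    if not views_consistent:
        return "snapshot-and-stop"  # view split: preserve evidence first
    if not primary_pod or not secondary_pod:
        return "snapshot-and-stop"  # a role label vanished: degeneration
    return "kill-next-victim"


print(next_step("pod-1", "", views_consistent=True))
```

A loop that logs "no primary/secondary found", skips the kill, and keeps cycling produces hours of uninformative output while the pod-side logs that explain the failure age out.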

## Detection indicators

When the race triggers, you will see (in order):

1. The killed pod restarts.
2. Pod-side engine log shows the join call, then a transient error (SQL i/o
timeout / connection reset / etc), then no further successful join.
3. Pod-side engine log shows the HA daemon's periodic retry loop running but
never reaching ONLINE.
4. KB controller log shows `probe event failed exit code: 1` repeatedly for
that pod (because role probe queries the engine which now reports OFFLINE
or no member info).
5. KB controller log shows `handle role change event ... role="" originalRole=""`
— the role label on that pod is cleared.
6. `kubectl get pod` shows the pod stuck in Error or 0/4 Ready forever, until
manual restart.
7. Cluster surface `.status.phase` lies about overall health, in one of two
variants:
- **Running-mask**: the other N-1 replicas form a quorum and KB controller
flips phase back to Running; the OFFLINE pod is left out but cluster
phase looks healthy. If you only watch `.status.phase` you will miss it.
- **Updating-stuck**: role probe failures keep KB controller from
re-flipping phase to Running; phase stays Updating for the entire
window. If your test only sets a 5-min OpsRequest timeout you may
classify this as a different failure ("OpsRequest hung") and miss the
underlying engine race.
- Both variants occurred in the same engine pattern on 2026-05-15 (see
case appendix). Bottom line: **always cross-check pod local view +
role label**, never rely on cluster phase alone.
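
One way to encode that cross-check is sketched below. It assumes it is called only after the normal recovery window has elapsed, and that the harness already knows how many pods carry a role label and how many report ONLINE in their own local view. The function name and the `recovering` / `degraded` labels are hypothetical.

```python
def classify_surface_signal(cluster_phase: str, replicas: int,
                            pods_with_role: int, pods_online_locally: int) -> str:
    """Cross-check cluster phase against role labels and pod-local views."""
    members_ok = pods_with_role == replicas and pods_online_locally == replicas
    if members_ok:
        return "healthy" if cluster_phase == "Running" else "recovering"
    if cluster_phase == "Running":
        return "running-mask"    # phase looks fine, membership is not
    if cluster_phase == "Updating":
        return "updating-stuck"  # phase never flips back after the kill
    return "degraded"


print(classify_surface_signal("Running", replicas=3, pods_with_role=1, pods_online_locally=1))
```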

## Layer classification

| Layer | When to assign |
|---|---|
| environment | only if vcluster / host kubelet / network is observably broken (API timeouts, node not ready, syncer halted). If the API responds and other pods recover, NOT environment. |
| runner / harness | only if a clean control cycle (without writers / without repeated kill) also degenerates. If the soak ran clean for N cycles before failure, NOT harness. |
| KubeBlocks controller / DataProtection / syncer | only if controller-side reconcile is the first failure (e.g. controller never updates cluster.phase, role label cleared by KB controller without engine event). If KB controller is faithfully reflecting the engine's OFFLINE / probe-failed state, NOT KB controller. |
| **addon / engine product** | when pod-side engine log shows the join call returning an error that the engine's startup script does not retry, and the engine's HA periodic loop does not recover. **This is the default assignment for the auto-rejoin race.** |

## Route to addon / engine team — escalation packet

```text
Auto-rejoin race finding:
- engine + version:
- cluster + namespace:
- timeline:
- T0 cluster healthy
- T-kill-1 first chaos event
- T-degen first cycle where role disappeared
- evidence files:
- pod-X engine log around T-degen (look for the failing join call)
- KB controller log probe event failed window
- kill log with cycle index
- 3-pod independent group view snapshots
- candidate mechanism / hypothesis (from engine log lines):
- minimum patch sketch:
- bounded retry of the join call
- reconnect engine local socket on each retry
- preserve the original error return path for permanent failures
- soak validation plan after patch:
- same soak design (≥3h, ≥3 writers, alternating chaos, view check, early-stop)
- patch must run at least 1 full clean soak
```

## Patch-version validation

After the engine team produces a patch:

1. Build engine image locally or pull patched image (see
`local-build-sideload-test-image` skill).
2. Sideload to the test vcluster.
3. Verify live `imageID` / build info matches the patched commit (do not trust
tag-only deployment).
4. Run the same soak design (same writers, same cycle interval, same duration).
5. **At least one full soak must complete with `view-split=0` AND
`role-disappearance=0`.** If any cycle still degenerates, the patch does not
fix the race; gather new evidence and return to the engine team.
6. Repeat the soak at least 2 more times (N=3 floor for probabilistic events)
to bound the residual rate.
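
Steps 5 and 6 can be expressed as a simple acceptance gate. This is a sketch: the record fields and the three verdict labels are hypothetical names chosen for illustration, not an existing report format.

```python
from typing import Dict, List


def patch_verdict(soaks: List[Dict[str, int]], min_clean_soaks: int = 3) -> str:
    """Each soak record: {'completed': 0 or 1, 'view_split': n, 'role_disappearance': n}."""
    for soak in soaks:
        if soak["view_split"] > 0 or soak["role_disappearance"] > 0:
            return "rejected"  # any degenerate cycle means the race is still present
    clean = sum(1 for soak in soaks if soak["completed"])
    if clean >= min_clean_soaks:
        return "validated"
    return "needs-more-soaks"


one_clean = [{"completed": 1, "view_split": 0, "role_disappearance": 0}]
print(patch_verdict(one_clean))
```

A soak that was interrupted before the end of its window counts as neither clean nor failed here; it simply does not move the verdict.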

## Related skills / docs

- `skills/soak-test-classification/SKILL.md` — classify each cycle into
invariant-break / product-path-failure / harness-race / external-environmental-cascade
- `skills/first-blocker-classification/SKILL.md` — 5-layer classification before
routing
- `skills/continuous-test-until-release-ready/SKILL.md` — when soak is green for
a while, decide whether to stop, extend, or claim release-ready
- `skills/local-build-sideload-test-image/SKILL.md` — build + sideload patch
images for validation
- `docs/addon-vcluster-bounded-convergence-window-guide.md` — vcluster timing
multipliers when running soak inside vcluster
- `docs/cases/mysql/mysql-mgr-auto-rejoin-race-case.md` — MySQL MGR case
(William 2026-05-15T04:49Z + Henry 2026-05-15T18:43Z soak; mgr-server
`JoinCurrentMemberToCluster` → `START GROUP_REPLICATION` once-and-return)
115 changes: 115 additions & 0 deletions docs/cases/mysql/mysql-mgr-auto-rejoin-race-case.md
@@ -0,0 +1,115 @@
# MySQL MGR Auto-Rejoin Race Case

Engine-specific case material for the engine-neutral guide
`docs/addon-chaos-rejoin-race-detection-guide.md`.

## Context

- Engine: MySQL 8.0.33 (KubeBlocks MySQL addon 1.0.2).
- Topology: MGR 3-replica.
- Test environment: vcluster on idc2 (`mason-idc2-mysql-vc-2`).
- Observed in two independent runs on the same day:
1. **William 2026-05-15T04:49 UTC** — original `tests/chaos.sh` C1 + C2 sequence
run, cluster `mysql-chaos-53410`. Evidence:
`/Users/wei/.slock/agents/5b8c8201-7b26-4882-a6bd-ed7f9c0e711b/evidence/mysql-tests-chaos-20260515T044954Z.tar.gz`
sha `19b429c6451c1966fd5877ef0ea2aa944668aff70c05363aaa7ec8c1a48a5859`.
2. **Henry 2026-05-15T15:35 UTC start, T18:43 UTC degenerate** — 4h soak,
cluster `mysql-chaos-soak-2712`. Evidence:
`/Users/wei/.slock/agents/beeaf92e-9737-4a15-a301-72d3aaca3991/work/chaos-c2-soak/mysql-chaos-c2-soak-product-event-20260516T034000Z.tar.gz`
sha `736a42cbb954ee54ff89a1f74154ba5da9f347b300c5e6898ae66796a68a6808`.

## Pre-finding negative N

Before the 4h soak, the chaos C2 lane ran:

- Batch 1 (1 run, fresh cluster + kill secondary): 20s recovered, no view-split.
- Batch 2 (4 runs, fresh cluster + kill secondary): all 20s, no view-split.
- Batch 3 (4 runs, C1 → wait Running → C2 sequence matching William's
`tests/chaos.sh`): all 20s, no view-split.
- Batch 4 (3 valid runs of 4; 1 SKIP forced mysql-2; with pre-C2 triple view
snapshot + 1 writer/s + alternating kill target): all 20s, no view-split.
- Batch 5 (5 valid runs of 6; 1 SKIP; same conditions as batch 4, higher N):
all 20s, no view-split.
- Batch 6 (5 valid runs of 6; tightened C1→C2 timing to skip
`wait_cluster_running`, +5 parallel writers, +rotated kill target including
forced mysql-2): all 20s, no view-split.

**Cumulative: 22 valid chaos C2 attempts under varied conditions, all
recovered within 30s, 0 view-split.** This was strong negative evidence and
worth recording, but it did NOT prove the race was absent — it only bounded
the per-cycle probability.
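
How loose that bound is can be made concrete. The sketch below assumes independent attempts with a constant per-attempt failure probability `p`, and finds the largest `p` still compatible with seeing zero failures at a given confidence level. The helper name is hypothetical.

```python
def upper_bound_failure_rate(clean_attempts: int, confidence: float = 0.95) -> float:
    """Largest per-attempt failure probability consistent with zero failures.

    Solves (1 - p) ** clean_attempts = 1 - confidence for p.
    """
    if clean_attempts < 1:
        raise ValueError("need at least one attempt")
    return 1.0 - (1.0 - confidence) ** (1.0 / clean_attempts)


print(f"{upper_bound_failure_rate(22):.3f}")
```

For 22 clean attempts the 95% bound is roughly 0.13, so a race hitting as often as 1 rejoin in 8 could not be ruled out by the short batches alone, let alone one in the 1-in-30 range.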

## Findings: soak run (Henry)

Script: `work/chaos-c2-soak.sh`. Duration target 14400s (4h),
`PARALLEL_WRITERS=5`, `CYCLE_SECONDS=300` (5 min). Alternating
primary/secondary kill.

Timeline:

| Phase | Cycle range | Wall time UTC | Status |
|---|---|---|---|
| Pre-chaos | n/a | T15:35 - T15:36 | clean: cluster Running, table created, writers started |
| Healthy chaos | cycle 1 - 37 | T15:36 - T18:43 | 37 cycles all `recovered phase=Running` in 20-31s; `views_consistent count=3` every cycle |
| First degenerate | cycle 36 (kill mysql-0 secondary) | T18:38:50 UTC | mysql-0 restart, `Current member is not in cluster, add it to cluster` → `init mgr plugin` (twice) → `start group replication` → 5s later `read tcp 127.0.0.1:59478->127.0.0.1:3306: i/o timeout` → `Join member to cluster failed: invalid connection`. mysql-0 stuck out of group. |
| Second degenerate | cycle 37 (kill mysql-2 primary) | T18:43:42 UTC | cluster phase → Updating, mysql-1 promoted to primary. mysql-2 restart at T18:44:06, same pattern: `Join member to cluster failed: invalid connection` at T18:44:11. mysql-2 stuck out of group. |
| Degenerate steady | cycle 38 - 61 | T18:48 - T19:34 | 24 cycles return `no primary/secondary found` (only mysql-1 has role label), kill skipped. KB controller logs 135 `probe event failed exit code: 1` events + clears role labels on mysql-0 and mysql-2. |
| Soak natural end | n/a | T19:35 UTC | cleanup: trap EXIT fires, cluster delete issued |

Match to William's case:

| Signal | William 04:49Z | Henry soak (18:43+) |
|---|---|---|
| pod-side engine log loop `init mgr plugin` + `SET GLOBAL group_replication_group_seeds` | mysql-0 at T04:57:51+, 6 min after kill | mysql-0 at T18:38:55, mysql-2 at T18:44:06 |
| pod-side error `Join member to cluster failed: invalid connection` / SQL i/o timeout | yes (mysql-0) | yes (mysql-0 and mysql-2) |
| KB controller `probe event failed exit code: 1` | 75 (mysql-0) + 45 (mysql-1) + 60 (mysql-2) over 80s | 135 over ~10 min window |
| KB controller `handle role change ... role="" originalRole=""` | mysql-0 at T04:51:48, mysql-2 at T04:51:47 | mysql-2 at T18:44:06, mysql-0 earlier |
| cluster.phase=Updating beyond normal recovery window | yes (>900s) | yes (cycles 38+ never recovered to "kill kill" baseline) |

Variant difference: William's pod independent views diverged (primary saw 3
ONLINE, two secondaries each saw self-OFFLINE alone). Henry's variant had 2
pods drop role entirely while mysql-1 stayed primary. Both are manifestations
of the same underlying race in
`mgr-server/internal/syncer/JoinCurrentMemberToCluster`: `START
GROUP_REPLICATION` fires once, hits a transient SQL i/o timeout, returns
without retry, and the engine state machine can no longer be cleanly recovered
by HA daemon periodic ticks.

## Trigger conditions

The race recurs reliably when ALL of the following hold:

1. Cluster runs continuously ≥3h (gives sufficient rejoin attempts).
2. ≥5 parallel writers continuously inserting (~25 INSERT/s).
3. Repeated chaos kill every ~5 min, alternating primary/secondary.
4. mysql-0 (or any single pod) is killed multiple times — increases the chance
it hits the race on a rejoin.

Below this threshold (22 attempts in our chaos C2 lane before the soak), the
race is rare enough that no single batch hit it. Short batches (≤30s recovery,
at most 6 runs per batch) do not provide enough rejoin opportunities.

## Patch hypothesis (William, msg 61756a30)

Root cause: `mgr-server/internal/syncer/JoinCurrentMemberToCluster` calls
`START GROUP_REPLICATION`. If the underlying SQL connection times out once,
the function returns the error and exits. The next attempt is only made by the
HA daemon periodic loop, but by then the engine's local state may already be
in an indeterminate sub-state (group seeds set, member partially registered).

Minimum patch:
- bounded retry loop around `START GROUP_REPLICATION` (e.g. 3 attempts with
exponential backoff)
- reconnect the local MySQL socket between retries (so socket-level i/o timeout
is not the propagating failure)
- preserve original error return path so permanent failures still surface
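
The three bullets can be sketched as follows. This is written in Python for brevity and is not the `mgr-server` code: `connect` and `start_join` are hypothetical callables standing in for opening the local MySQL connection and issuing `START GROUP_REPLICATION`, and treating `TimeoutError` / `ConnectionError` as the transient class is an assumption of the sketch.

```python
import time
from typing import Callable, Tuple, Type

TRANSIENT: Tuple[Type[BaseException], ...] = (TimeoutError, ConnectionError)


def join_with_retry(connect: Callable[[], object],
                    start_join: Callable[[object], None],
                    attempts: int = 3, base_delay_s: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Run the join up to `attempts` times; return the attempt number that succeeded."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        conn = None
        try:
            conn = connect()   # fresh local connection on every attempt
            start_join(conn)   # assumed: issues START GROUP_REPLICATION
            return attempt
        except TRANSIENT:
            if attempt == attempts:
                raise          # retries exhausted: surface the original error
            sleep(base_delay_s * 2 ** (attempt - 1))  # exponential backoff: 1s, 2s, ...
        finally:
            close = getattr(conn, "close", None)
            if close is not None:
                close()        # never reuse a connection that may have timed out
```

Any exception outside `TRANSIENT` propagates on the first attempt, which keeps the original error return path for permanent failures. Whether the engine also needs an explicit reset of its local state between attempts is not covered by this sketch.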

## Validation plan (Henry, msg f5289913)

After the patch lands and the engine image is built (local-build + sideload to
vc-2), re-run the same 4h soak with identical parameters (5 writers, 5 min
cycle, alternating kill). Acceptance condition:
- At least 1 full 4h soak with `view-split=0` AND `role-disappearance=0`.
- Repeat ≥2 more soaks at the same conditions (N=3 floor for probabilistic
release validation).
- Optionally extend to longer soak (24h) per release standard third condition.