85 changes: 85 additions & 0 deletions docs/cases/sqlserver/sqlserver-ch60-2replica-identity-swap-case.md
@@ -0,0 +1,85 @@
# SQL Server CH60 2-replica identity-swap case

> **Engine**: SQL Server AlwaysOn AG with `DB_FAILOVER=ON, CLUSTER_TYPE=EXTERNAL`
> **Case for**: [`../../test/addon-sync-secondary-kill-quorum-aware-invariant-guide.md`](../../test/addon-sync-secondary-kill-quorum-aware-invariant-guide.md)
> **Triggered**: 2026-05-18 → 2026-05-19, idc2 sqlserver vcluster `mason-idc2-etcd-vc`
> **Source artifacts**: `kubeblocks-tests/artifacts/sqlserver-ch60-2rep-n10-strength-155303/` (TS=155303)

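## Background

The SQL Server addon CH60 chaos case passed all N=10 cycles on the 3-replica topology (1P+2S, TS=132132) and passed a single cycle on the 2-replica topology (1P+1S, TS=143306). An N=10 strength loop on the 2-replica topology (TS=155303) then failed intermittently: cycle-001 FAIL, cycle-002/003/004 PASS (a 25% FAIL rate over those four cycles). Root cause: the CH60 strict primary-identity invariant was not tiered by topology, so by-design SQL Server AG behavior was misclassified as a failure.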

## Reproduction data

| Cycle | Topology | Result | pre_delete primary | after primary | identity invariant |
|---|---|---|---|---|---|
| TS=143306 single cycle | 2-rep | PASS | `mssql-0/4f697e1c` | `mssql-0/4f697e1c` | satisfied (by chance, ~75%) |
| TS=155303 cycle-001 | 2-rep | **FAIL** | `mssql-0/02d07bdf` | `mssql-1/e68ac7f1` | violated (identity swap) |
| TS=155303 cycle-002 | 2-rep | PASS | (per cycles.tsv) | invariant satisfied | satisfied |
| TS=155303 cycle-003 | 2-rep | PASS | (per cycles.tsv) | invariant satisfied | satisfied |
| TS=155303 cycle-004 | 2-rep | PASS | (per cycles.tsv) | invariant satisfied | satisfied |
| TS=132132 cycle-001..010 | 3-rep | 10/10 PASS | invariant always satisfied | — | satisfied |

## Timeline (TS=155303 cycle-001)

```text
15:54:30Z  pre_delete primary = mssql-0 UID 02d07bdf
15:54:30Z  delete pod mssql-ch-7859-mssql-1 (the only secondary)
15:55:04Z  write id=1 from mssql-0 OK
15:55:12Z  write id=2 from mssql-0 OK
15:55:19Z  write id=3 from mssql-0 → "TCP Provider Error 0x2746 Communication
           link failure Timeout expired" mid-flight
           readback returns SQL "Msg 983: Unable to access availability
           database 'db1' because the database replica is not in the
           PRIMARY or SECONDARY role"; mssql-0 enters RESOLVING
15:56:33Z  addon log on recreated mssql-1 (UID e68ac7f1):
           "Replica is now PRIMARY"
15:56:36Z  role-count audit: kb_primary_count=2, apps_primary_count=0
           (KB label catch-up shows both pods as primary at once)
15:56:50Z  converged: kb_primary_count=1 (mssql-1 is the sole primary)
           mssql-0 relabeled secondary
15:57:29Z+ steady state: 1P+1S HEALTHY, mssql-1=PRIMARY, mssql-0=SECONDARY
```

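The dual-primary label window lasted 14 seconds (7 samples at a 2-second interval). `apps_primary_count` stayed at 0 throughout: the kbagent role probe strictly reflected the AG state and never mislabeled a replica.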
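
The window length can be recomputed from the role-count samples. The following is a minimal sketch, not the runner's actual code: it assumes a hypothetical three-column TSV (`epoch`, `kb_primary_count`, `apps_primary_count`) and a fixed sampling interval, while the real `role-counts` artifact layout is not shown in this case and may differ. The function name `dual_primary_window_seconds` is illustrative.

```bash
# Sketch: longest dual-primary label window in a role-count sample file.
# Assumed (hypothetical) layout: <epoch>\t<kb_primary_count>\t<apps_primary_count>
# Assumed fixed sampling interval in seconds (default 2).
dual_primary_window_seconds() {
  awk -F'\t' -v step="${2:-2}" '
    $2 + 0 > 1 { run++; if (run > longest) longest = run; next }  # extend current run
               { run = 0 }                                        # run broken
    END        { print longest * step }                           # samples * interval
  ' "$1"
}
```

Under these assumptions, seven consecutive samples with `kb_primary_count=2` at a 2-second interval yield 14, which matches the window reported above.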

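## Data integrity (no data lost)

- All 15 client-acked writes are in the final table
- 1 write failed at the client but committed on the server (commit_unknown=1)
- 0 duplicates
- The AG ended at 1P+1S HEALTHY/ONLINE/CONNECTED/HEALTHY (roles reversed from the initial state)

The FAIL came solely from the strict identity invariant; the data path was intact.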
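
The three counters above can be derived by joining the writer history against the ids read back from the final table. A minimal sketch under stated assumptions: the file layouts (`<id>\t<ack|failed>` for the history, one id per line for the final table) and the function name `reconcile_writes` are hypothetical, not the format of the real `writes-ch60` artifact.

```bash
# Sketch of a three-track reconcile (ack / commit_unknown / duplicate).
# Assumed (hypothetical) inputs:
#   $1 history.tsv   : "<id>\t<ack|failed>" per client write attempt
#   $2 final_ids.txt : one id per row read back from the final table
reconcile_writes() {
  awk -F'\t' '
    FILENAME == ARGV[1] { seen[$1]++; next }        # first file: final table ids
    $2 == "ack"    && !($1 in seen) { lost++ }      # acked but missing: real loss
    $2 == "failed" &&  ($1 in seen) { unknown++ }   # client failed, server committed
    END {
      for (id in seen) if (seen[id] > 1) dup++      # same id landed more than once
      printf "ack_lost=%d commit_unknown=%d duplicate=%d\n", lost, unknown, dup
      exit (lost > 0 || dup > 0)                    # non-zero only on a real failure
    }
  ' "$2" "$1"
}
```

With this shape, `commit_unknown` is reported but does not fail the check; only a lost ack or a duplicate does.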

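## Why this is not a SQL Server bug

SQL Server AG with `DB_FAILOVER=ON, CLUSTER_TYPE=EXTERNAL`:

1. **Synchronous commit**: before committing a transaction the primary waits for the secondary's ACK. If the secondary is unreachable, the primary cannot commit and the write path blocks
2. **Quorum against split-brain**: the sync secondary is a quorum member. In a 2-replica topology, losing the secondary means losing quorum; the primary itself enters the RESOLVING state (rejecting reads and writes) to avoid split-brain with an unknown peer
3. **Re-election does not pin the original primary**: when the secondary is rebuilt and rejoins the AG, leader re-election depends on several factors (LSN progress, connection order, replica configuration); Always On does not guarantee that the original primary identity is restored

On the 3-replica topology (1P+2S), killing 1 secondary leaves 1P+1S, which still forms a quorum; the primary does not enter RESOLVING and there is no identity swap.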

## Fix (as decided by Jerry, 8cba9fe8 / 820c2482)

The CH60 invariant in `kubeblocks-tests/sqlserver/chaos.sh` is tiered by `SQLSERVER_CHAOS_PROFILE`:

- `3replica` (default): keep the strict identity invariant `pre_delete_primary == after_primary`
- `2replica`: relaxed. Require only `AG converges 1P+1S HEALTHY/ONLINE/CONNECTED` + the writer three-track invariant + cleanup PASS. For a transient `kb_primary_count=2`, record its duration and the final converged state, and escalate to failure only if the final state is still dual-primary. Only `apps_primary > 1`, duplicated data, or a lost ack counts as a real failure

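## Case boundaries

- Does not invalidate the 3-replica N=10 run (TS=132132, 10/10 PASS): different topology
- Does not invalidate the first soak (TS=114631), the 3-replica N=1 baseline
- The TS=143306 2-replica single-cycle PASS was a probabilistic event (~75% PASS), not deterministic
- C02 is promoted to Covered only after the patched runner achieves 10/10 PASS on 2-replica N=10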

## Related evidence paths

- `kubeblocks-tests/artifacts/sqlserver-ch60-2rep-n10-strength-155303/CLOSEOUT.md`: full closeout (lane status, §1 detailed cycle-001 evidence, §2 first-blocker tiered exclusion table, §3 disposition)
- `cycle-001/artifact/001-chaos-secondary-kill-{targets,primary-after}-mssql-ch-7859.tsv`: kill targets + after-state
- `cycle-001/artifact/001-chaos-role-counts-mssql-ch-7859.tsv`: 113-sample timeline, including the 14 s dual-primary window
- `cycle-001/artifact/001-chaos-writes-ch60-mssql-ch-7859-history.tsv`: writer history (write 3 triggers RESOLVING; Msg 983 persists for ~1.5 min)
- `cycle-001/cycle.log`: the addon's self-reported `"Replica is now PRIMARY"` line
1 change: 1 addition & 0 deletions docs/test/README.md
@@ -38,6 +38,7 @@
- [`addon-soak-test-result-classification-guide.md`](addon-soak-test-result-classification-guide.md): 4-state result classification for long-running (soak) tests (invariant-break / product-path-failure / harness-race / external-environmental-cascade)
- [`addon-chaos-soak-counter-closeout-guide.md`](addon-chaos-soak-counter-closeout-guide.md): counter fields required in every chaos / soak closeout: valid sample count, bad-signal count, recovery interval, cleanup, evidence sha, conclusion boundaries
- [`addon-auto-failover-threshold-vs-recovery-race-guide.md`](addon-auto-failover-threshold-vs-recovery-race-guide.md): design contract for the race window between the auto-failover threshold and K8s container recovery time
- [`addon-sync-secondary-kill-quorum-aware-invariant-guide.md`](addon-sync-secondary-kill-quorum-aware-invariant-guide.md): chaos invariants for secondary-kill on synchronous replicas must be tiered by quorum / remaining sync replica count; the 2-replica topology drops strict `pre_delete_primary == after_primary` in favor of AG convergence + three-track data checks + cleanup
- [`addon-ship-readiness-multi-phase-validation-guide.md`](addon-ship-readiness-multi-phase-validation-guide.md): continuous-testing rules before an addon approaches the release bar: release bar = sufficient coverage + multiple high-intensity rounds with no product issues + a long soak with no product issues
- [`addon-physical-backup-restore-verification-guide.md`](addon-physical-backup-restore-verification-guide.md): a physical backup method must pass three stages: non-empty Backup artifact, marker read back after restore into a new Cluster, PVC data still present after pod deletion

114 changes: 114 additions & 0 deletions docs/test/addon-sync-secondary-kill-quorum-aware-invariant-guide.md
@@ -0,0 +1,114 @@
# Invariants for secondary-kill tests on synchronous replicas: a topology-tiered guide

> **Audience**: addon dev / TL / chaos test author
> **Status**: stable
> **Applies to**: any KB addon whose engine uses **synchronous-commit replication with replica quorum** for write durability. Examples: SQL Server AlwaysOn AG with `DB_FAILOVER=ON`, PostgreSQL `synchronous_standby_names` ANY/FIRST, MySQL Group Replication with `group_replication_consistency=AFTER`, MongoDB replica set with `writeConcern: majority`, OceanBase paxos majority, TiDB pd majority. Engine details do not change the points of this guide.
> **Applies to KB version**: any
> **Affected by version skew**: not affected by KB version; this document is a design contract for chaos test invariants
> **Related**:
> - [`addon-auto-failover-threshold-vs-recovery-race-guide.md`](addon-auto-failover-threshold-vs-recovery-race-guide.md): the case where a killed primary recovers so fast that automatic failover never fires
> - [`addon-chaos-writer-three-track-commit-verdict-guide.md`](addon-chaos-writer-three-track-commit-verdict-guide.md): the three writer tracks, ack / commit-unknown / duplicate

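This guide is for chaos test authors and addresses a common false FAIL: **"I only killed the secondary, so why did the primary change too?"** The answer is usually not an engine bug but **a test invariant that was not tiered by topology**. In a synchronous-replica architecture the secondary is a member of the primary's commit quorum. A 2-replica topology has only one secondary; killing it costs the primary its quorum, and the primary itself enters a RESOLVING / read-only / write-rejecting state. After the secondary is rebuilt, AG re-election does not guarantee that the original primary identity is restored. This is the designed split-brain protection of synchronous-replica engines, not a defect. For a CH60-style "kill secondary" test to yield stable, repeatable verdicts across topologies, the invariant must be tiered by topology (more precisely, by whether the sync replicas remaining after the kill still form a quorum).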

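## This document in plain language

### What problem this document solves

You write a chaos case, "kill the secondary, primary writes continue uninterrupted", and run it 10 times on a 3-replica topology with 10 PASSes: primary identity unchanged, AG recovers to a healthy 1P+2S, every write lands. You then run the same case once on a 2-replica topology, it also passes, and you record "this case also passes on the 2-replica topology".

On the second and third runs, however, it starts to FAIL intermittently: the primary identity has "swapped", the old primary is now a secondary and the rebuilt secondary is now the primary. No data was lost and the AG is healthy, yet the invariant `pre_delete_primary == after_primary` failed.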

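This is not an engine bug; **the test invariant is out of step with the synchronous-replica quorum model**:

- 3-replica (1P + 2S): killing 1 S leaves 1P + 1S, which still forms a quorum; the primary does not enter RESOLVING and keeps its identity, so `pre_delete_primary == after_primary` is a reasonable invariant
- 2-replica (1P + 1S): after killing the 1 S, **the remaining replica no longer forms a quorum**. By split-brain protection design, the engine puts the primary itself into RESOLVING / read-only / write-rejecting; after the secondary is rebuilt the AG re-elects, and who becomes primary depends on LSN / connection order / replica configuration, with **no guarantee of restoring** the original. `pre_delete_primary == after_primary` should never have been a required invariant on this topology

Forcing the 3-replica invariant onto a 2-replica topology produces an intermittent FAIL because the invariant was not tiered by topology, not because the engine is defective.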

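### Decisions you can make after reading

- **When writing a chaos case**: know which invariants are topology-aware and which are topology-independent
- **When reading a chaos FAIL report**: quickly tell "engine bug" from "invariant does not apply to this topology"
- **When writing a closeout**: clearly separate "this invariant does not apply to this topology" from "the engine has a real bug"
- **When writing COVERAGE.md / the release matrix**: know why the same case is scored per topology, and why a single-cycle PASS does not mean "this topology passes"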

## General methodology

### Hard rule: the invariant must reflect the topology's quorum behavior

By topology, "kill secondary" chaos cases should carry different invariants.

| Topology | Quorum still holds after killing a secondary? | Strict identity invariant holds? | Recommended invariant |
|---|---|---|---|
| 3-replica sync (1P+2S) | Yes (1P+1S remains and still forms a quorum) | Yes: primary does not enter RESOLVING | **`pre_delete_primary == after_primary`** + secondary UID rotate + data integrity + AG recovers to 1P+2S |
| 2-replica sync (1P+1S) | **No**: primary loses quorum and enters RESOLVING | **No**: primary identity may switch | **`AG converges to 1P+1S HEALTHY with data integrity preserved`**: primary identity may change, data must not be lost |
| N-replica sync (1P+(N-1)S, N≥3) | Yes (killing 1 S leaves 1P+(N-2)S, still a majority) | Yes | Same as 3-replica |
| Async secondary | Quorum does not depend on the secondary | Yes | Same as 3-replica |

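The core test: **after the secondary is killed, do the remaining sync replicas still form a quorum?** If yes, the identity invariant is valid; if no, the invariant must be relaxed to require only convergence + data integrity.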
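
The decision can be sketched as a small helper. This is an illustration only: it assumes "quorum" means a strict majority of the original sync replica count, which matches the rows of the table but is a simplification of how any real engine or cluster manager decides quorum. The names `quorum_survives_kill` and `identity_invariant_mode` are hypothetical, and async secondaries are out of scope for this helper.

```bash
# Sketch: pick the invariant tier from replica count (assumes simple majority quorum).
quorum_survives_kill() {
  local total="$1" killed="${2:-1}"
  local remaining=$((total - killed))
  # strict majority of the original membership must remain
  [ $((remaining * 2)) -gt "$total" ]
}

identity_invariant_mode() {
  if quorum_survives_kill "$1" "${2:-1}"; then
    echo strict    # pre_delete_primary == after_primary may be asserted
  else
    echo relaxed   # only convergence + data integrity may be asserted
  fi
}

identity_invariant_mode 3   # strict
identity_invariant_mode 2   # relaxed
```

Deriving the tier from the replica count rather than hard-coding profile names keeps the rule the same for any N≥3.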

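### Data-integrity invariants: apply to every topology

Regardless of topology, the following data invariants should all hold:

- every client-acked write is in the final table (no acknowledged write is lost)
- writes that failed at the client but committed on the server are recorded separately as `commit_unknown` (not a loss under ack semantics, but must be traceable)
- no duplicates (the same logical id never lands twice)
- the AG finally converges to healthy (pod count matches the topology; all replicas ONLINE/CONNECTED/HEALTHY or the engine's equivalent state)
- role-count never shows a "logical dual-primary" (`apps_primary_count > 1`); a short dual-primary window at the KB role label layer is label sync lag, not an engine-level dual primary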
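
The role-count rule in the last bullet can be sketched as follows. The TSV layout (`epoch`, `kb_primary_count`, `apps_primary_count`) and the function name `audit_role_counts` are assumptions made for illustration; they are not taken from the runner. The check is strict on the engine-level count at every sample, and looks at the KB label count only in its final state.

```bash
# Sketch: role audit over a role-count sample file.
# Assumed (hypothetical) layout: <epoch>\t<kb_primary_count>\t<apps_primary_count>
audit_role_counts() {
  awk -F'\t' '
    BEGIN { apps_max = 0; kb_last = 0 }
    $3 + 0 > apps_max { apps_max = $3 + 0 }   # engine-level maximum over all samples
    { kb_last = $2 + 0 }                      # label-level value of the last sample
    END {
      printf "apps_primary_max=%d kb_primary_final=%d\n", apps_max, kb_last
      # transient kb_primary_count=2 is tolerated; a final value other than 1 is not
      exit !(apps_max <= 1 && kb_last == 1)
    }
  ' "$1"
}
```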

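### Implementation details: how to write it

Pseudocode (the `assert_*` helpers are assumed to be provided by the runner):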

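```bash
# An unset profile falls back to 3replica, the documented default.
case "${SQLSERVER_CHAOS_PROFILE:-3replica}" in
  2replica)
    # Synchronous replicas: the only sync secondary is killed.
    # Primary identity may change; this is by-design product behavior, not a bug.
    assert_ag_converges_to_1p1s_healthy
    assert_writer_reconcile_data_integrity   # ack=present, commit_unknown=tracked, dup=0
    assert_role_audit_apps_primary_max_eq_1  # strict at the engine level
    # Deliberately no primary identity assertion here.
    ;;
  3replica|default)
    # Quorum still holds after the kill.
    assert_primary_identity_unchanged        # strict
    assert_secondary_uid_rotated
    assert_ag_converges_to_1p2s_healthy
    assert_writer_reconcile_data_integrity
    assert_role_audit_apps_primary_max_eq_1
    ;;
  *)
    # Fail loudly instead of silently skipping every assertion.
    echo "unknown SQLSERVER_CHAOS_PROFILE: ${SQLSERVER_CHAOS_PROFILE}" >&2
    exit 2
    ;;
esac
```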

### Anti-pattern

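| Anti-pattern | Why it is wrong | Correct approach |
|---|---|---|
| One invariant shared across every replica count | 2-replica sync quorum behavior differs from N-replica | Branch by profile; make the invariant topology-aware |
| Passing a sync-secondary kill because primary identity did not change | On 2-replica this is a probabilistic PASS, not a deterministic one | On 2-replica, require only convergence + data integrity when the sync secondary is killed |
| Promoting to Covered in the release matrix after a single-cycle PASS | A probabilistic invariant can PASS spuriously at small N | Promote to Covered only after an N-multiplier run of 10+ cycles |
| Treating `apps_primary_count==1` with `kb_primary_count==2` as a dual-primary failure | The KB role label layer lags by a few seconds; this is not an engine-level dual primary | Assert on `apps_primary_count` (engine level); use `kb_primary_count` for monitoring only |
| Reporting "engine bug" as soon as an identity swap is seen | An identity swap caused by sync quorum loss is by-design engine behavior | First check the topology table to see whether the invariant applies, then localize layer by layer |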
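
Why a single-cycle PASS is weak evidence can be shown with simple arithmetic. If each cycle passes a strict-identity check with probability p and cycles are independent, all N cycles pass with probability p^N. Both inputs are assumptions: independence is a modeling simplification, and the ~75% rate comes from only four cycles in the SQL Server case, so treat the numbers as illustrative. The function name `all_pass_probability` is hypothetical.

```bash
# Sketch: probability that N independent cycles all pass at per-cycle pass rate p.
all_pass_probability() {
  awk -v p="$1" -v n="$2" 'BEGIN { printf "%.4f\n", p ^ n }'
}

all_pass_probability 0.75 1    # 0.7500
all_pass_probability 0.75 10   # 0.0563
```

Under these assumptions a lone PASS is the likely outcome even when the invariant is wrong for the topology, while a clean 10-cycle run would be rare; this is the reasoning behind requiring 10+ cycles before promotion.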

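## Verification gate (5 minimal contracts)

After changing the invariant, run these 5 minimal contracts:

1. **3-replica** profile, N≥10 cycles: the strict identity invariant should PASS 10/10
2. **2-replica** profile, N≥10 cycles: the relaxed invariant (converge + data integrity) should PASS 10/10
3. **2-replica** profile, N≥10 cycles: the number of identity swaps actually observed should be greater than 0 (confirms that relaxing the invariant was necessary)
4. The data-integrity invariant PASSes 10/10 on both topologies (no data loss / dup=0 / commit_unknown traceable)
5. The closeout distinguishes the strict identity invariant from the relaxed convergence invariant and does not mix them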
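
Contracts 2 and 3 can be checked mechanically from a per-cycle summary. A minimal sketch, assuming a hypothetical four-column TSV (`cycle`, `PASS|FAIL`, `pre_delete_primary`, `after_primary`); the actual `cycles.tsv` columns are not shown in this guide and may differ, and `check_2replica_gate` is an illustrative name.

```bash
# Sketch: gate check for a 2-replica run (contracts 2 and 3).
# Assumed (hypothetical) layout: <cycle>\t<PASS|FAIL>\t<pre_delete_primary>\t<after_primary>
check_2replica_gate() {
  awk -F'\t' -v min="${2:-10}" '
    { total++ }
    $2 == "PASS" { pass++ }
    $3 != $4     { swaps++ }     # primary identity changed across the kill
    END {
      printf "cycles=%d pass=%d identity_swaps=%d\n", total, pass, swaps
      ok = (total >= min && pass == total && swaps > 0)
      exit !ok
    }
  ' "$1"
}
```

The function exits non-zero when there are fewer than `min` cycles, when any cycle failed, or when no identity swap was observed, since in the last case the run gives no evidence that the relaxed invariant was exercised.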

## Engine cases

For the on-site record of the SQL Server CH60 secondary-kill identity swap on the 2-replica topology, see [`../cases/sqlserver/sqlserver-ch60-2replica-identity-swap-case.md`](../cases/sqlserver/sqlserver-ch60-2replica-identity-swap-case.md).

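## Boundaries with other guides

- [`addon-auto-failover-threshold-vs-recovery-race-guide.md`](addon-auto-failover-threshold-vs-recovery-race-guide.md) covers the case where **a killed primary recovers too fast, the threshold is never crossed, and failover does not fire**; this guide covers the case where **killing the sync secondary loses quorum, the primary itself is pushed into RESOLVING, and AG re-election may swap identity**. One is a single fault that is too weak, the other a single fault that hits the quorum boundary; they point in opposite directions.
- [`addon-chaos-writer-three-track-commit-verdict-guide.md`](addon-chaos-writer-three-track-commit-verdict-guide.md) covers the three-track verdict for the data-integrity invariant (ack / commit_unknown / duplicate); this guide references it as the invariant required on every topology.
- [`addon-test-baseline-standard-guide.md`](addon-test-baseline-standard-guide.md) covers the baseline / Runnable / Covered level thresholds; this guide adds an implicit requirement for "Covered": topology-aware invariants + N-multiplier runs rather than a single cycle.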