diff --git a/docs/cases/sqlserver/sqlserver-ch60-2replica-identity-swap-case.md b/docs/cases/sqlserver/sqlserver-ch60-2replica-identity-swap-case.md
new file mode 100644
index 0000000..b1512c1
--- /dev/null
+++ b/docs/cases/sqlserver/sqlserver-ch60-2replica-identity-swap-case.md
@@ -0,0 +1,85 @@
+# SQL Server CH60 2-replica identity-swap case
+
+> **Engine**: SQL Server AlwaysOn AG with `DB_FAILOVER=ON, CLUSTER_TYPE=EXTERNAL`
+> **Case for**: [`../../test/addon-sync-secondary-kill-quorum-aware-invariant-guide.md`](../../test/addon-sync-secondary-kill-quorum-aware-invariant-guide.md)
+> **Triggered**: 2026-05-18 → 2026-05-19, idc2 sqlserver vcluster `mason-idc2-etcd-vc`
+> **Source artifacts**: `kubeblocks-tests/artifacts/sqlserver-ch60-2rep-n10-strength-155303/` (TS=155303)
+
+## Background
+
+The SQL Server addon CH60 chaos case passed all N=10 cycles on the 3-replica topology (1P+2S, TS=132132) and passed a single cycle on the 2-replica topology (1P+1S, TS=143306). The N=10 strength loop on the 2-replica topology (TS=155303) then failed intermittently: cycle-001 FAIL, cycle-002/003/004 PASS (a 25% FAIL rate). Cause of the FAIL: the CH60 strict primary-identity invariant was not tiered by topology, so it misclassified by-design SQL Server AG behavior as a failure.
+
+## Reproduction data
+
+| Cycle | Topology | Result | pre_delete primary | after primary | identity invariant |
+|---|---|---|---|---|---|
+| TS=143306 single cycle | 2-rep | PASS | `mssql-0/4f697e1c` | `mssql-0/4f697e1c` | satisfied (luck, ~75%) |
+| TS=155303 cycle-001 | 2-rep | **FAIL** | `mssql-0/02d07bdf` | `mssql-1/e68ac7f1` | violated (identity swap) |
+| TS=155303 cycle-002 | 2-rep | PASS | (per cycles.tsv) | invariant satisfied | satisfied |
+| TS=155303 cycle-003 | 2-rep | PASS | (per cycles.tsv) | invariant satisfied | satisfied |
+| TS=155303 cycle-004 | 2-rep | PASS | (per cycles.tsv) | invariant satisfied | satisfied |
+| TS=132132 cycle-001..010 | 3-rep | 10/10 PASS | invariant always satisfied | - | satisfied |
+
+## Timeline (TS=155303 cycle-001)
+
+```text
+15:54:30Z pre_delete primary = mssql-0 UID 02d07bdf
+15:54:30Z delete pod mssql-ch-7859-mssql-1 (the only
secondary)
+15:55:04Z write id=1 from mssql-0 OK
+15:55:12Z write id=2 from mssql-0 OK
+15:55:19Z write id=3 from mssql-0 → "TCP Provider Error 0x2746 Communication
+          link failure Timeout expired" mid-flight
+          readback returns SQL "Msg 983: Unable to access availability
+          database 'db1' because the database replica is not in the
+          PRIMARY or SECONDARY role": mssql-0 enters RESOLVING
+15:56:33Z addon log on recreated mssql-1 (UID e68ac7f1):
+          "Replica is now PRIMARY"
+15:56:36Z role-count audit: kb_primary_count=2, apps_primary_count=0
+          (KB label catch-up shows both pods as primary at the same time)
+15:56:50Z converged: kb_primary_count=1 (mssql-1 is the sole primary)
+          mssql-0 relabeled secondary
+15:57:29Z+ steady state: 1P+1S HEALTHY, mssql-1=PRIMARY, mssql-0=SECONDARY
+```
+
+The dual-primary label window lasted 14 seconds (7 samples at 2-second intervals). `apps_primary_count` stayed at 0 throughout: the kbagent role probe strictly reflected AG state and never mislabeled a replica.
+
+## Data integrity (no data lost)
+
+- All 15 client-acked writes are present in the final table
+- 1 write failed on the client but committed on the server (commit_unknown=1)
+- 0 duplicates
+- The AG ended at 1P+1S HEALTHY/ONLINE/CONNECTED/HEALTHY (roles reversed relative to the initial state)
+
+The FAIL comes only from the strict identity invariant; the data path is intact.
+
+## Why this is not a SQL Server bug
+
+SQL Server AG with `DB_FAILOVER=ON, CLUSTER_TYPE=EXTERNAL`:
+
+1. **Synchronous commit**: before committing a transaction, the primary waits for the secondary's ACK. If the secondary is unreachable, the primary cannot commit and the write path blocks
+2. **Quorum prevents split-brain**: the sync secondary is a quorum member. On the 2-replica topology, losing the secondary means losing quorum; the primary itself enters the RESOLVING state (rejecting reads and writes) to avoid a split-brain with a peer whose state is unknown
+3. 
**Re-election does not pin the original primary**: when the secondary is recreated and rejoins the AG, which replica wins the leader re-election depends on several factors (LSN progress, connection order, replica configuration); Always On does not guarantee that the original primary identity is restored
+
+On the 3-replica topology (1P+2S), killing one secondary leaves 1P+1S, which still forms a quorum, so the primary does not enter RESOLVING and no identity swap occurs.
+
+## Fix (per Jerry's decision, 8cba9fe8 / 820c2482)
+
+The CH60 invariant in `kubeblocks-tests/sqlserver/chaos.sh` is tiered by `SQLSERVER_CHAOS_PROFILE`:
+
+- `3replica` (default): keep the strict identity invariant `pre_delete_primary == after_primary`
+- `2replica`: relaxed. Require only `AG converges 1P+1S HEALTHY/ONLINE/CONNECTED` + the writer three-track invariant + cleanup PASS. For a short-lived `kb_primary_count=2`, record its duration and the final converged state, and escalate to failure only if dual-primary persists at the end. Only apps_primary > 1, duplicated data, or a lost ack count as real failures
+
+## Case boundaries
+
+- Does not invalidate the 3-replica N=10 (TS=132132) 10/10 PASS: different topology
+- Does not invalidate the first soak (TS=114631) 3-replica N=1 baseline
+- The TS=143306 2-replica single-cycle PASS is a probabilistic event (~75% PASS), not deterministic
+- C02 is promoted to Covered only after the patched runner reaches 10/10 PASS on 2-replica N=10
+
+## Related evidence paths
+
+- `kubeblocks-tests/artifacts/sqlserver-ch60-2rep-n10-strength-155303/CLOSEOUT.md`: full closeout (lane status, §1 detailed cycle-001 evidence, §2 first-blocker layered exclusion table, §3 disposition)
+- `cycle-001/artifact/001-chaos-secondary-kill-{targets,primary-after}-mssql-ch-7859.tsv`: kill targets + after-state
+- `cycle-001/artifact/001-chaos-role-counts-mssql-ch-7859.tsv`: 113-sample timeline, including the 14s dual-primary window
+- `cycle-001/artifact/001-chaos-writes-ch60-mssql-ch-7859-history.tsv`: writer history (write 3 triggers RESOLVING; Msg 983 persists for ~1.5 min)
+- `cycle-001/cycle.log`: the addon's self-reported `"Replica is now PRIMARY"` line
diff --git a/docs/test/README.md b/docs/test/README.md
index 7cc6109..f1359a4 100644
--- a/docs/test/README.md
+++ b/docs/test/README.md
@@ -38,6 +38,7 @@
 - [`addon-soak-test-result-classification-guide.md`](addon-soak-test-result-classification-guide.md): 4-state result classification for long-running tests (invariant-break / product-path-failure / harness-race / external-environmental-cascade)
 - [`addon-chaos-soak-counter-closeout-guide.md`](addon-chaos-soak-counter-closeout-guide.md): chaos / soak closeout
required counter fields: valid sample count, bad-signal count, recovery interval, cleanup, evidence sha, conclusion boundary
 - [`addon-auto-failover-threshold-vs-recovery-race-guide.md`](addon-auto-failover-threshold-vs-recovery-race-guide.md): design contract for the race window between the auto-failover threshold and K8s container recovery time
+- [`addon-sync-secondary-kill-quorum-aware-invariant-guide.md`](addon-sync-secondary-kill-quorum-aware-invariant-guide.md): chaos invariants for synchronous-replica secondary-kill cases must be tiered by quorum / remaining sync replica count; the 2-replica topology drops the strict `pre_delete_primary == after_primary` check in favor of AG convergence + the three-track data verdict + cleanup
 - [`addon-ship-readiness-multi-phase-validation-guide.md`](addon-ship-readiness-multi-phase-validation-guide.md): continuous-testing rules for an addon approaching the release bar: release bar = sufficient coverage + multiple high-intensity rounds with no product issues + long runs with no product issues
 - [`addon-physical-backup-restore-verification-guide.md`](addon-physical-backup-restore-verification-guide.md): a physical backup method must pass three stages: the Backup artifact is non-empty, restoring into a new Cluster reads back the marker, and the PVC data survives pod deletion
diff --git a/docs/test/addon-sync-secondary-kill-quorum-aware-invariant-guide.md b/docs/test/addon-sync-secondary-kill-quorum-aware-invariant-guide.md
new file mode 100644
index 0000000..aa78d8a
--- /dev/null
+++ b/docs/test/addon-sync-secondary-kill-quorum-aware-invariant-guide.md
@@ -0,0 +1,114 @@
+# Sync-replica secondary-kill test invariants: a topology-tiered guide
+
+> **Audience**: addon dev / TL / chaos test author
+> **Status**: stable
+> **Applies to**: any KB addon whose engine uses **synchronous-commit replication with replica quorum** for write durability.
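Examples: SQL Server AlwaysOn AG with `DB_FAILOVER=ON`, PostgreSQL `synchronous_standby_names` ANY/FIRST, MySQL Group Replication with `group_replication_consistency=AFTER`, MongoDB replica set with `writeConcern: majority`, OceanBase Paxos majority, TiDB PD majority. Engine details do not change the points made in this guide.
+> **Applies to KB version**: any
+> **Affected by version skew**: none. This document is a design contract for chaos test invariants and does not depend on the KB version
+> **Related**:
+> - [`addon-auto-failover-threshold-vs-recovery-race-guide.md`](addon-auto-failover-threshold-vs-recovery-race-guide.md): the case where recovery after killing the primary is so fast that automatic failover never fires
+> - [`addon-chaos-writer-three-track-commit-verdict-guide.md`](addon-chaos-writer-three-track-commit-verdict-guide.md): the writer ack / commit-unknown / duplicate three tracks
+
+This guide is for chaos test authors and addresses a common false FAIL: **"I only killed the secondary, so why did the primary change too?"** The answer is usually not an engine bug but that **the test invariant was not tiered by topology**. In a synchronous-replica architecture the secondary is a member of the primary's commit quorum. A 2-replica topology has only one secondary, so killing it costs the primary its quorum and pushes the primary itself into a RESOLVING / read-only / write-rejecting state. After the secondary is recreated, the AG re-election does not guarantee that the original primary identity is restored. This is by-design split-brain protection in synchronous-replica engines, not a defect. For a CH60-style "kill secondary" test to produce stable, repeatable verdicts across topologies, the invariant must be tiered by topology (more precisely, by whether the sync replicas remaining after the kill still form a quorum).
+
+## This document in plain language
+
+### What problem this document solves
+
+You write a chaos case, "kill the secondary, primary writes stay uninterrupted", and it passes 10 out of 10 runs on the 3-replica topology: the primary identity is unchanged, the AG recovers to a healthy 1P+2S, and every write lands in the database. You then run the same case once on the 2-replica topology, it also passes, and you record "this case also passes on the 2-replica topology".
+
+On the second and third runs, however, it starts failing intermittently: the primary identity has "swapped", with the old primary now a secondary and the recreated secondary now the primary. No data was lost and the AG is healthy, but the invariant `pre_delete_primary == after_primary` failed.
+
+This is not an engine bug. **The test invariant is misaligned with the synchronous-replica quorum model**:
+
+- 3-replica (1P + 2S): killing one S leaves 1P + 1S, which still forms a quorum; the primary does not enter RESOLVING and keeps its identity, so `pre_delete_primary == after_primary` is a reasonable invariant
+- 2-replica (1P + 1S): after killing the one S, **the remaining replica no longer forms a quorum**. By its split-brain protection design, the engine puts the primary itself into RESOLVING / read-only / write-rejecting; after the secondary is recreated the AG re-elects, and which replica becomes primary depends on LSN / connection order / replica configuration, with **no guarantee of restoring the original**. `pre_delete_primary == after_primary` should never have been a required invariant on this topology
+
+Forcing the 3-replica invariant onto the 2-replica topology yields an intermittent FAIL; that reflects an invariant that was not tiered by topology, not an engine defect.
+
+### What decisions you can make after reading
+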
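+- **When writing a chaos case**: you know which invariants are topology-aware and which are topology-independent
+- **When reading a chaos FAIL report**: you can quickly judge "is this an engine bug, or does the invariant not apply to this topology?"
+- **When writing a closeout**: you can clearly separate "this invariant does not apply to this topology" from "engine has a real bug"
+- **When writing COVERAGE.md / the release matrix**: you know why the same case must be scored separately per topology, and why a single-cycle PASS does not mean "this topology passes"
+
+## General methodology
+
+### Hard rule: the invariant must reflect the topology's quorum behavior
+
+"Kill secondary" chaos cases should carry different invariants depending on the topology.
+
+| Topology | Quorum still holds after killing the secondary? | Strict identity invariant holds? | Recommended invariant |
+|---|---|---|---|
+| 3-replica sync (1P+2S) | Yes (1P+1S remain and still form a quorum) | Yes: the primary does not enter RESOLVING | **`pre_delete_primary == after_primary`** + secondary UID rotate + data integrity + AG recovers to 1P+2S |
+| 2-replica sync (1P+1S) | **No**: the primary loses quorum and enters RESOLVING | **No**: the primary identity may switch | **`AG converges to 1P+1S HEALTHY with data integrity preserved`**: the primary identity may change, data must not be lost |
+| N-replica sync (1P+(N-1)S, N≥3) | Yes (killing one S leaves N-2 S, still a majority) | Yes | Same as 3-replica |
+| Async secondary | Quorum does not depend on the secondary | Yes | Same as 3-replica |
+
+The core question: **after the secondary is killed, do the remaining sync replicas still form a quorum?** If yes, the identity invariant is valid; if no, the invariant must be relaxed to require only convergence + data integrity.
+
+### Data-integrity invariants: apply to every topology
+
+Regardless of topology, the following data invariants should all hold:
+
+- Every client-acked write is in the final table (no acknowledged write is lost)
+- A write that failed on the client but committed on the server is recorded separately as `commit_unknown` (not a loss under ack semantics, but it must be traceable)
+- No duplicates (the same logical id is never persisted twice)
+- The AG finally converges to healthy (pod count matches the topology; all replicas are ONLINE/CONNECTED/HEALTHY or the equivalent engine state)
+- role-count never shows a "logical dual-primary" (apps_primary_count > 1); a short dual-primary window at the KB role-label layer is label sync lag, not an engine-level dual primary
+
+### Implementation details: how to write it
+
+Pseudocode:
+
+```bash
+case "${SQLSERVER_CHAOS_PROFILE:-3replica}" in   # an unset profile falls back to 3replica (the default)
+  2replica)
+    # Synchronous replica: the kill removes the only sync secondary
+    # The primary identity may change - this is by-design product behavior, not a bug
+    assert_ag_converges_to_1p1s_healthy
+    assert_writer_reconcile_data_integrity    # ack=present, commit_unknown=tracked, dup=0
+    assert_role_audit_apps_primary_max_eq_1   # strict at engine level
+    # Do not assert the primary identity invariant
+    ;;
+  3replica|default)
+    # Quorum still holds after the kill
+    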
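assert_primary_identity_unchanged         # strict
+    assert_secondary_uid_rotated
+    assert_ag_converges_to_1p2s_healthy
+    assert_writer_reconcile_data_integrity
+    assert_role_audit_apps_primary_max_eq_1
+    ;;
+esac
+```
+
+### Anti-pattern
+
+| Anti-pattern | Why it is wrong | Correct approach |
+|---|---|---|
+| One invariant shared across every replica count | 2-replica sync quorum behavior differs from N-replica | Branch by profile; make the invariant topology-aware |
+| Treating "primary identity unchanged after killing the sync secondary" as the pass criterion | On 2-replica this is a probabilistic PASS, not a deterministic one | On 2-replica, require only convergence + data integrity after killing the sync secondary |
+| Promoting to Covered in the release matrix after a single-cycle PASS | A probabilistic invariant can pass spuriously when N is small | Promote to Covered only after an N-multiplier run of 10+ cycles |
+| Treating apps_primary_count==1 with kb_primary_count==2 as a dual-primary failure | The KB role-label layer lags by a few seconds; this is not an engine-level dual primary | Base the invariant on apps_primary_count (engine-level); use kb_primary_count for monitoring only |
+| Reporting "engine bug" as soon as an identity swap appears | An identity swap caused by sync quorum loss is by-design engine behavior | First check the topology table to see whether the invariant applies, then localize layer by layer |
+
+## Verification gate (5 minimum contracts)
+
+After changing the invariant, run these 5 minimum contracts:
+
+1. **3-replica** profile, N≥10 cycles: the strict identity invariant should PASS 10/10
+2. **2-replica** profile, N≥10 cycles: the relaxed invariant (converge + data integrity) should PASS 10/10
+3. **2-replica** profile, N≥10 cycles: the number of identity swaps actually observed should be greater than 0 (confirming that relaxing the invariant was necessary)
+4. The data-integrity invariant PASSes 10/10 on both topologies (no data loss / dup=0 / commit_unknown traceable)
+5. 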
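The closeout distinguishes the strict identity invariant from the relaxed convergence invariant and does not conflate them
+
+## Engine case
+
+For the SQL Server CH60 secondary-kill identity swap as observed on the 2-replica topology, see [`../cases/sqlserver/sqlserver-ch60-2replica-identity-swap-case.md`](../cases/sqlserver/sqlserver-ch60-2replica-identity-swap-case.md).
+
+## Boundaries with other guides
+
+- [`addon-auto-failover-threshold-vs-recovery-race-guide.md`](addon-auto-failover-threshold-vs-recovery-race-guide.md) covers the case where **killing the primary is followed by recovery so fast that the threshold is never crossed and failover does not fire**; this guide covers the case where **killing the sync secondary loses quorum, the primary itself is pushed into RESOLVING, and the AG re-election may swap identity**. In the first, the single fault is not strong enough; in the second, the single fault hits the quorum boundary. They point in opposite directions.
+- [`addon-chaos-writer-three-track-commit-verdict-guide.md`](addon-chaos-writer-three-track-commit-verdict-guide.md) covers the three-track verdict for data-integrity invariants (ack / commit_unknown / duplicate); this guide relies on it as the invariant set required on every topology.
+- [`addon-test-baseline-standard-guide.md`](addon-test-baseline-standard-guide.md) covers the baseline / Runnable / Covered level thresholds; this guide adds an implicit requirement for "Covered": a topology-aware invariant plus an N-multiplier run rather than a single cycle.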
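
The core question of the hard rule (do the sync replicas left after the kill still form a quorum?) can be sketched as a small shell helper. This is a minimal sketch and not part of `chaos.sh`: the function name `select_identity_invariant` is hypothetical, and it assumes a simple majority quorum counted over all sync replicas including the primary. That assumption matches the 3-replica, 2-replica, and N-replica rows of the topology table, but engines with other quorum rules would need their own check.

```bash
# Hypothetical helper: pick the invariant tier for a "kill one sync secondary" case.
# Assumption: quorum = simple majority of all sync replicas (primary included).
# The body runs in a subshell, so the working variables do not leak to the caller.
select_identity_invariant() (
  replicas="$1"                      # total sync replicas before the kill
  majority=$(( replicas / 2 + 1 ))   # votes needed for a majority
  remaining=$(( replicas - 1 ))      # one secondary has been killed
  if [ "$remaining" -ge "$majority" ]; then
    echo "strict"    # quorum survives: pre_delete_primary == after_primary is a valid check
  else
    echo "relaxed"   # quorum lost: require convergence + data integrity only
  fi
)
```

Under these assumptions `select_identity_invariant 3` prints `strict` and `select_identity_invariant 2` prints `relaxed`, which a runner could map onto the `3replica` and `2replica` branches of the pseudocode.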