
docs(test): sync-secondary-kill quorum-aware invariant guide + sqlserver case #227

Open
weicao wants to merge 1 commit into main from weicao/sync-secondary-kill-quorum-invariant


@weicao weicao commented May 18, 2026

Summary

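  • New engine-neutral methodology: docs/test/addon-sync-secondary-kill-quorum-aware-invariant-guide.md. Chaos invariants of the secondary-kill class must be layered by quorum and by the number of remaining sync replicas; in a 2-replica topology, the strict pre_delete_primary == after_primary check is the wrong invariant.
  • New SQL Server case: docs/cases/sqlserver/sqlserver-ch60-2replica-identity-swap-case.md. It contains the real timeline of TS=155303 cycle-001, the evidence that data integrity closed, and an explanation of why this is not a SQL Server bug.
  • Updated the docs/test/README.md index to point to the new guide.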

Background

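The SQL Server CH60 chaos N=10 strength loop (TS=155303) failed probabilistically on a 2-replica topology: cycle-001 FAIL (identity swap) and cycle-002/003/004 PASS (25% FAIL rate). The same case passed 10/10 on a 3-replica topology (TS=132132).

The root cause is not a SQL Server bug. With DB_FAILOVER=ON and CLUSTER_TYPE=EXTERNAL, killing the only sync secondary of a SQL Server AlwaysOn AG makes the primary lose quorum and enter RESOLVING itself; after the secondary is rebuilt, the AG re-election is not guaranteed to restore the original primary identity. This is the designed behavior by which synchronous-replica engines prevent split-brain. The actual problem is that the CH60 invariant was not layered by topology.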
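
The layering argument can be sketched in a few lines. This is a minimal illustration, not the code in chaos.sh: the function name `invariant_tier` and the tier labels are hypothetical, and it assumes the simplified model described above, in which the primary keeps quorum only while at least one sync secondary survives the kill. It also assumes a 3-replica topology means one primary plus two sync secondaries.

```python
def invariant_tier(sync_secondaries: int, killed: int = 1) -> str:
    """Pick the invariant tier for a secondary-kill chaos case.

    Assumed model (simplified): the primary keeps quorum only while at
    least one sync secondary survives the kill.
    """
    if sync_secondaries < 1 or not 0 <= killed <= sync_secondaries:
        raise ValueError("need >= 1 sync secondary and 0 <= killed <= sync_secondaries")
    remaining = sync_secondaries - killed
    # A surviving sync secondary means the primary should not step down,
    # so the strict identity check stays valid.
    if remaining >= 1:
        return "strict-identity"
    # No sync secondary left: the primary may lose quorum, and a later
    # re-election may pick a different primary. Only convergence can be asserted.
    return "convergence"


print(invariant_tier(sync_secondaries=2))  # assumed 3-replica topology
print(invariant_tier(sync_secondaries=1))  # assumed 2-replica topology
```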

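Document split

  • guide (main document): general methodology, not bound to any specific engine. It applies to other synchronous-replica engines as well (PostgreSQL synchronous_standby_names, MySQL Group Replication AFTER, MongoDB writeConcern:majority, OceanBase paxos, TiDB pd majority).
  • case (appendix): SQL Server-specific timeline, evidence paths, and the engine details behind why this is not a SQL Server bug.

This follows the kubeblocks-addon-docs convention of "one topic per document, engine-specific content in a case appendix".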

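Boundaries

  • Does not overturn the existing 3-replica chaos invariant; that invariant remains correct for 3 replicas.
  • Does not reclassify the SQL Server CH60 first soak on 2 replicas (TS=143306, single-cycle PASS); the failure is probabilistic, not deterministic.
  • Only states that, in a 2-replica topology, the strict identity invariant should be relaxed into a composite invariant: AG convergence + three data tracks + cleanup.

The corresponding chaos.sh patch in kubeblocks-tests goes through a separate PR (apecloud/kubeblocks-tests#21), which contains the profile-aware invariant implementation and the 2-replica N=10 validation.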
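
As a rough sketch of what a profile-aware check could look like (this is not the chaos.sh implementation from the companion PR): the `CycleResult` type and its field names are hypothetical, the three boolean fields collapse the AG convergence, data-track, and cleanup checks into single flags, and it assumes the 3-replica profile evaluates the same checks plus the strict identity comparison.

```python
from dataclasses import dataclass


@dataclass
class CycleResult:
    replicas: int
    pre_delete_primary: str
    after_primary: str
    converged: bool     # hypothetical flag: cluster converged to one healthy primary
    data_intact: bool   # hypothetical flag: all data-integrity tracks closed
    cleanup_done: bool  # hypothetical flag: chaos resources were cleaned up


def cycle_passes(r: CycleResult) -> bool:
    """Evaluate one chaos cycle against a topology-aware invariant."""
    composite = r.converged and r.data_intact and r.cleanup_done
    if r.replicas >= 3:
        # A sync secondary survives the kill, so the primary must not change.
        return composite and r.pre_delete_primary == r.after_primary
    # 2-replica profile: an identity swap alone is not a failure.
    return composite


swap = CycleResult(2, "pod-0", "pod-1", True, True, True)
print(cycle_passes(swap))
```

Under these assumptions, an identity swap with closed data integrity passes on 2 replicas and still fails on 3.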

Test plan

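  • guide body is engine-neutral (no engine-specific terms such as "SQL Server" / "AG" / "DMV" appear in the §General methodology (通用方法论) section)
  • case holds the SQL Server-specific content on its own
  • guide links to the case from the §Engine cases (引擎案例) section
  • guide lists 5 validation gates (3-rep N≥10 / 2-rep N≥10 / identity swap actually observed on 2-rep / data-integrity / layered closeout)
  • README index link is in place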
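
The engine-neutrality item could be checked mechanically. A minimal sketch, assuming the general-methodology section has already been extracted into a string; the function name `engine_terms_in` is hypothetical and the term list is just the three examples named above.

```python
import re

# Terms taken from the test-plan item above; extend as needed.
ENGINE_TERMS = ("SQL Server", "AG", "DMV")


def engine_terms_in(section_text: str) -> list[str]:
    """Return the engine-specific terms that appear as whole words."""
    found = []
    for term in ENGINE_TERMS:
        # \b keeps short terms like "AG" from matching inside other words.
        if re.search(rf"\b{re.escape(term)}\b", section_text):
            found.append(term)
    return found


print(engine_terms_in("The primary loses quorum after the kill."))
```

An empty list means the section passes; matching is case-sensitive so ordinary lowercase words are not flagged.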

🤖 Generated with Claude Code

…lserver case

Lessons captured from SQL Server CH60 chaos N=10 strength loop (TS=155303): when killing
the only sync secondary in a 2-replica topology, the primary loses quorum and
enters RESOLVING; AG re-election may not restore the original primary identity.
Strict `pre_delete_primary == after_primary` invariant produced probabilistic
FAIL (1/5 cycles) — root cause is test invariant not aligned with sync-replica
quorum model, not engine bug.

Guide is engine-neutral and applies to any sync-commit replication addon
(SQL Server AG, PostgreSQL synchronous_standby_names, MySQL Group Replication
AFTER, MongoDB writeConcern:majority, OceanBase paxos, TiDB pd majority).
Engine-specific RESOLVING / identity-swap / label-window timeline kept in
sqlserver case appendix per single-topic-per-doc rule.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

weicao commented May 19, 2026

HOLD for docs gate. The guide/case split is directionally right and line budget is OK, but this PR cannot merge in current public form.

Blockers:

  1. Public hygiene: commit body still has an AI co-author trailer, and PR body still has a generated/tool footer. Please rewrite the branch so origin/main..HEAD commit messages and the PR body are clean. Use neutral wording like public hygiene grep clean; do not quote the forbidden trailer/footer text in the PR body.
  2. Version-skew wording too absolute: the guide says "Affected by version skew: not affected by the KB version". The topology idea is broadly engine-level, but role labels, probes, failover timing, and controller behavior can drift across KB/addon versions. Please calibrate to yes or partial, and state what needs revalidation.
  3. PR body stale scope: it still references the kubeblocks-tests PR as pending. Please verify whether the companion test PR / evidence state changed, then update the body so it reflects current status and boundaries.

Already checked: git diff --check clean; changed markdown links resolve; methodology guide is 114 lines and case is 85 lines, so no line-budget blocker.
