3 changes: 3 additions & 0 deletions docs/SKILL-INDEX.md
@@ -48,6 +48,7 @@
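- [`addon-resource-rich-cluster-parallel-test-guide.md`](addon-resource-rich-cluster-parallel-test-guide.md): parallel test scheduling on resource-rich clusters such as idc / idc1 / idc2 / idc4: measure capacity first, batch by suite lane, raise N step by step, isolate namespace / evidence / cleanup, and report environment pressure separately from product failures. In these IDC environments the KBE / KB runtime must use vcluster as the SUT, with the host k8s carrying only runner / helper / bootstrap. Once an environment problem is confirmed, reach Feishu user 李国银 via @Musk3. Companion skill: `parallel-test-execution`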
- [`addon-multi-ns-registry-scan-preflight-guide.md`](addon-multi-ns-registry-scan-preflight-guide.md): before launching multi-namespace / multi-topology concurrent tests or a chaos suite, split the test scope into two layers, **verified scope vs scan-only future-gate**: topologies actually run in this round get a verified conclusion; topologies that were not run but whose pre-flight scan hit a docker.io leak get a future-gate precondition. The concrete application is auditing image-source consistency across the live `ComponentVersion` and `ParametersDefinition.toolsSetup.toolConfigs[].image`. Grounded on N=2 MySQL cases (task #5 functional multi-ns concurrency + task #6 chaos vcluster pre-patch)
- [`addon-soak-test-result-classification-guide.md`](addon-soak-test-result-classification-guide.md): 4-state result classification methodology for long-running tests (24h+ soak / chaos / fault-injection) once results are in: `invariant-break` / `product-path-failure` / `harness-race` / `external-environmental-cascade`. Covers a Q1-Q5 decision tree (mermaid), an N≥2 self-validation minimum evidence threshold (harness-race / cascade must have a same-kind Succeed sample or a baseline ±1σ control sample), the ACCEPTED criterion (`invariant-break = 0 AND product-path-failure = 0`, **not** fault total = 0), grounded N=3/2 case comparisons (CH30 harness-race + CH20 external-env-cascade), and a note on AG quorum non-sticky behavior. Complements the single-failure layering methodology of [`addon-test-acceptance-and-first-blocker-guide.md`](addon-test-acceptance-and-first-blocker-guide.md) (the former adds the dimension of aggregation across multiple injections)
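- [`addon-chaos-pod-kill-vs-engine-internal-crash-guide.md`](addon-chaos-pod-kill-vs-engine-internal-crash-guide.md): a "crash" of the same engine takes three forms (K8s-layer pod-kill / crash of an engine instance-critical process / crash of an engine broker auxiliary process); the three follow entirely different recovery paths and may trigger entirely different failover behavior, so each must be tested separately. The document provides a three-axis chaos matrix, a further split of class B into three sub-axes (instance / broker(master, worker) / listener), burn-in (the same fault type at the same position ≥3 times), a position-dependence matrix (primary vs active FSFO target vs non-target standby), a cross-engine adaptation checklist, and an anti-pattern table; for the Oracle case see `cases/oracle/oracle-chaos-pod-kill-vs-smon-kill-fsfo-asymmetry-case.md`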
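
The ACCEPTED criterion quoted in the soak-test classification entry (`invariant-break = 0 AND product-path-failure = 0`, not fault total = 0) can be sketched as follows. This is only an illustrative sketch: the function name `accepted`, the flat list-of-labels input, and the extra `product-pass` label for injections that fall through the decision tree are assumptions, not something the guide is known to define.

```python
from collections import Counter

# The four states named in the index entry, plus "product-pass" (assumed label
# for injections that reach the end of the decision tree without a finding).
STATES = (
    "invariant-break",
    "product-path-failure",
    "harness-race",
    "external-environmental-cascade",
    "product-pass",
)
# Only these two states block acceptance according to the quoted criterion.
BLOCKING = ("invariant-break", "product-path-failure")


def accepted(labels):
    """Return (verdict, counts) for an iterable of per-injection state labels."""
    counts = Counter(labels)
    unknown = set(counts) - set(STATES)
    if unknown:
        raise ValueError(f"unknown state labels: {sorted(unknown)}")
    # harness-race / cascade do not block; they only show up in the counts.
    verdict = all(counts[state] == 0 for state in BLOCKING)
    return verdict, counts


if __name__ == "__main__":
    run = ["product-pass", "harness-race", "external-environmental-cascade"]
    ok, counts = accepted(run)
    print(ok, dict(counts))
```

A run with non-zero `harness-race` or `external-environmental-cascade` counts is still accepted here, which mirrors the point that the fault total alone does not decide the verdict.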

### 3. Before the environment is ready / environment-layer pitfalls

@@ -166,6 +167,7 @@
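- [`docs/addon-pr-reviewer-routing-guide.md`](addon-pr-reviewer-routing-guide.md): reviewer routing methodology for addon code PRs: first check the files touched by the current PR, the initial PR, the main authors of the addon directory, and the authors of recent related features, then pick a reviewer in the order "recent feature maintainer → initial / main author → addon owner → fallback reviewer → ask Weston with evidence". Includes the @Musk3 safe-forwarding format, the Kevin summary table, and the Oracle PR #1276 / PR #764 counterexamples; companion callable skill `addon-pr-reviewer-routing`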
- [`docs/addon-github-submission-discipline-guide.md`](addon-github-submission-discipline-guide.md): boundary discipline for multi-agent collaboration on public GitHub repositories: (1) a hard rule that AI provenance trailers (`Co-Authored-By: Claude` / `🤖 Generated with` / `noreply@anthropic.com`) must not leak, plus a fallback command chain (heredoc + `git commit --amend -m "$(... | sed)"` strip / grep self-check before push); (2) an incident-response playbook for cascades caused by multiple agents pushing concurrently to the same PR branch (force-with-lease lemma: the lease anchors on the last-fetched remote tip and does not guard against a concurrent push after a fetch / bidirectional `git log --oneline` ritual / dropped-commit owner self-recover / single-owner-execute to close out). 5 doctrines (A force-with-lease / B per-commit grep / C cascade single-owner-execute / D forensic self-check / E content-delta verify) + §5 cross-cutting rules (forensic self-review / Doctrine E shorthand / evidence-post obligation / recursive self-application)
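- [`docs/addon-soak-test-result-classification-guide.md`](addon-soak-test-result-classification-guide.md): result classification methodology for long-running tests (24h+ soak / chaos / fault-injection) once results are in. **Core framing**: the fault total / PASS-FAIL counts cannot answer "is it ACCEPTED?"; every fault injection must be placed into a 4-state schema according to "which layer failed first": (1) `invariant-break` (invariant violated → ROLLBACK) / (2) `product-path-failure` (the product recovery path failed) / (3) `harness-race` (timing race in the test tooling) / (4) `external-environmental-cascade` (cascade from the external environment). **Decision criteria**: Q1 (bad_ack > 0) → Q2 (cluster final state) → Q3 (OpsRequest Failed + N≥2 self-validation) → Q4 (duration beyond mean+3σ + correlation with an external event + control sample) → product-pass (mermaid flowchart). **N≥2 self-validation minimum evidence threshold**: harness-race needs a same-kind Succeed control, and cascade needs a baseline ±1σ control sample; a single sample cannot support a "not a product issue" conclusion. **ACCEPTED criterion**: `invariant-break = 0 AND product-path-failure = 0`; harness-race / cascade do not block acceptance but trigger the corresponding fix ticket. Grounded on an N=3 CH30 harness-race comparison (fault-026 vs 029/033) + an N=2 CH20 cascade negative control (fault-028 21min outlier vs 031 at 95s, within 1σ) + a note on AG quorum non-sticky 3-transition behavior. Forms a complementary pair with the single-failure first-blocker layering methodology of [`addon-test-acceptance-and-first-blocker-guide.md`](addon-test-acceptance-and-first-blocker-guide.md) (the former covers the aggregate dimension, the latter the single-run dimension)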
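- [`docs/addon-chaos-pod-kill-vs-engine-internal-crash-guide.md`](addon-chaos-pod-kill-vs-engine-internal-crash-guide.md): design principle for addon chaos tests: a "crash" of the same DB instance has three **entirely different** failure forms: (A) K8s-layer pod-kill / node outage; (B-instance) crash of an engine-critical process (Oracle SMON/LGWR / PG postmaster), where the engine terminates the instance itself; (B-broker) crash of only HA / replication / monitoring processes (Oracle DMON/INSV/NSV* / MySQL semi-sync threads), where the engine quietly restarts the process and the instance / container is untouched. The three **follow entirely different recovery paths** and **may trigger entirely different failover behavior**; testing one class does not mean the others are tested. The document provides: two-axis / three-axis chaos matrices, the three class-B sub-axes (instance vs broker(master, worker) vs listener), burn-in methodology (≥3 cycles / poll-point decoupling / topology rotation), a section on FSFO bimodal probabilism, a position-dependence matrix (primary vs active FSFO target vs non-target standby), a cross-engine adaptation checklist (including listener watchdog coverage), and 10 anti-patterns (including "single-shot ≠ verified" / "broker SUCCESS ≠ cluster healthy" / "do not poll the broker from the killed member"). Grounded on Oracle 19c: for rounds 1-16 see `cases/oracle/oracle-chaos-pod-kill-vs-smon-kill-fsfo-asymmetry-case.md`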
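
The pre-push self-check for AI provenance trailers mentioned in the GitHub submission discipline entry could look roughly like the sketch below. The three patterns come from that entry; the helper names `find_provenance_lines` and `strip_provenance` are hypothetical, and the guide itself describes a shell chain (`sed` / grep), so this Python version is a swapped-in illustration rather than the documented command chain.

```python
import re

# Patterns taken from the index entry; matching rules (case-insensitivity,
# whitespace tolerance) are assumptions made for this sketch.
TRAILER_PATTERNS = [
    re.compile(r"Co-Authored-By:\s*Claude", re.IGNORECASE),
    re.compile(r"🤖 Generated with"),
    re.compile(r"noreply@anthropic\.com", re.IGNORECASE),
]


def _is_provenance(line):
    """True if the line matches any of the provenance patterns."""
    return any(pattern.search(line) for pattern in TRAILER_PATTERNS)


def find_provenance_lines(message):
    """Return the lines of a commit message that would leak AI provenance."""
    return [line for line in message.splitlines() if _is_provenance(line)]


def strip_provenance(message):
    """Return the commit message with provenance lines removed."""
    kept = [line for line in message.splitlines() if not _is_provenance(line)]
    # Drop trailing blank lines left behind by the removed trailer block.
    return "\n".join(kept).rstrip() + "\n"


if __name__ == "__main__":
    msg = "fix: probe timeout\n\nbody\n\nCo-Authored-By: Claude <noreply@anthropic.com>\n"
    print(find_provenance_lines(msg))
    print(strip_provenance(msg), end="")
```

A check like this would run per commit before push; an empty result from `find_provenance_lines` is the passing condition.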

## Case materials

@@ -197,6 +199,7 @@
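- [`docs/cases/oracle/oracle-chart-vs-kb-schema-skew-multi-stage-case.md`](cases/oracle/oracle-chart-vs-kb-schema-skew-multi-stage-case.md): three-stage reversal in diagnosing why the Oracle 1.0.0-alpha.0 chart cannot be installed on a released KB: first misdiagnosed as a "whole-generation gap", then as "misaligned chart field paths", and finally established as "the chart tracks unreleased APIs on KB main (PR #10100 / #10109)". The `release-1.0` branch of the same repository is the answer. An engineering field supplement to [`addon-chart-vs-kb-schema-skew-diagnosis-guide.md`](addon-chart-vs-kb-schema-skew-diagnosis-guide.md)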
- [`docs/cases/oracle/oracle-12c-post-switchover-probe-cascade-kill.md`](cases/oracle/oracle-12c-post-switchover-probe-cascade-kill.md): reconfigure_deep Run 1→3 closed loop: Bug #12 (the liveness probe wrongly kills the pod while DBCA is running; initialDelay=600 + 90s restart window) + Bug #13 (a slow control plane after switchover makes readiness flap); 3-layer fix (cmpd probe parameters + soft failure in liveness.sh + best-effort dgmgrl in checkDBStatus.sh); Run 3 all PASS with RESTARTS=0 as evidence; an engineering field supplement to [`addon-probe-timeout-and-soft-failure-guide.md`](addon-probe-timeout-and-soft-failure-guide.md)
- [`docs/cases/oracle/oracle-12c-processes-cue-paramdef-range-case.md`](cases/oracle/oracle-12c-processes-cue-paramdef-range-case.md): reconfigure_deep T22d FAIL: the cue constraint `processes: int & >=6` is too loose, so the value 10 passes ValidatePhase → ORA-603/1092 → instance terminated → KB OpsRequest stuck in Running for 25min+; the fix `>=100` was verified in Run 3 (ValidatePhase rejects within 10s); an engineering field supplement to [`addon-paramdef-cue-range-validation-guide.md`](addon-paramdef-cue-range-validation-guide.md)
- [`docs/cases/oracle/oracle-chaos-pod-kill-vs-smon-kill-fsfo-asymmetry-case.md`](cases/oracle/oracle-chaos-pod-kill-vs-smon-kill-fsfo-asymmetry-case.md): Oracle 19c grounded chaos case library (`o19p15v9` cluster, MaxPerformance + ASYNC + FSFO threshold 30s): rounds 1-16 fully record three-axis chaos (A pod-kill / B-instance SMON+LGWR / B-broker DMON+INSV / B-listener tnslsnr) × three positions (primary / active FSFO target / non-target standby) + burn-in (round 13) + A+B-instance concurrent race (round 8). **Key findings**: (1) A and B-instance behave fundamentally differently on the primary (A self-heals via pod recreate, B goes through FSFO failover); (2) the FSFO trigger is bimodal probabilism (the watchdog tick phase relative to the FSFO threshold decides whether it fires or self-recovers); (3) F39: the `runOracle.sh` watchdog is blind to tnslsnr loss → after the listener dies, the broker reports SUCCESS + the observer heartbeat is normal + all new client connections get ORA-12541 → with no watchdog coverage it does not self-heal (Sev-1, PR #1320); (4) F38: `setup_observer.sh` waits for instance OPEN with no timeout and hardcodes `${ORACLE_SID}_0`, so if the primary never recovers the observer never comes up. An engineering field supplement to [`addon-chaos-pod-kill-vs-engine-internal-crash-guide.md`](addon-chaos-pod-kill-vs-engine-internal-crash-guide.md)

### OceanBase
