apecloud · weicao · May 13, 2026
diff --git a/docs/SKILL-INDEX.md b/docs/SKILL-INDEX.md
@@ -91,6 +91,7 @@
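 - [`addon-test-runner-portability-guide.md`](addon-test-runner-portability-guide.md): 8 cross-platform compatibility pitfalls, 7 under macOS bash 3.2 with `set -euo pipefail` plus 1 on GNU `seq` vs BSD `seq` zero-count behavior (empty arrays, env-default timing, `local x=$(cmd)`, `seq 1 0`, etc.)
 - [`addon-test-runner-cadence-discipline-guide.md`](addon-test-runner-cadence-discipline-guide.md): fixed-cadence reporting discipline during long-running runner operations; cadence is the operator's obligation, and the trigger must be independent of the runner process
 - [`addon-test-runner-write-after-bounded-role-gate-guide.md`](addon-test-runner-write-after-bounded-role-gate-guide.md): every runner write step must come after a bounded, multi-perspective role-publish gate; once the gate passes, the downstream primary must be re-anchored from the gate; the gate's write probe may run only on the K8s role-label primary, and a secondary must not write to replicated tables (otherwise an internal admin user can bypass read_only and hit error 1062); includes two cases, alpha.58 preload-after-gate and alpha.59 secondary write probe
+- [`addon-cluster-running-gate-budget-guide.md`](addon-cluster-running-gate-budget-guide.md): the T1 budget a test runner uses while waiting for Cluster CR Running should be sized as env P99 + 50% buffer (**not** the product's normal startup time); exceeding the budget is not an immediate product RED, and the four observations (phase / pod ready / events / nodes) must be collected first, then the failure classified by layer (env image-pull / env storage / env node-pressure / control-plane / product engine); budgets nest across three levels (caller / helper / kubectl); includes the OceanBase enterprise addon case of moving the default from 1200s to 1800s (not to be extrapolated to other envs)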
 
 ### 6. Collaboration / GitHub submission discipline
 

diff --git a/docs/addon-cluster-running-gate-budget-guide.md b/docs/addon-cluster-running-gate-budget-guide.md
@@ -0,0 +1,183 @@
+# Addon Test Runner — Cluster Running Gate Budget Sizing
+
+> **Audience**: addon test engineer / TL
+> **Status**: stable methodology
+> **Applies to**: any KB addon smoke / chaos / regression test that waits for a Cluster CR `status.phase=Running` (or analogous `wait_cluster_running` helper) at T1 or similar setup gate.
+> **Applies to KB version**: any.
+
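+Category: methodology topic document (not tied to any single engine).
+
+## 1. The problem this guide addresses
+
+During T1 / setup, a test runner usually waits for the Cluster CR to reach `Running` within a time budget (for example `TIMEOUT_BOOTSTRAP=1200s` / `1800s`). A budget that is too short produces **environment slow-start** false positives (the cluster would have come up, only slowly); a budget that is too long masks a **product hang** (the cluster will never come up, yet the runner keeps waiting).
+
+**Anti-pattern 1: using the product's normal start time as the budget.** Taking the engine's fastest cold-start wall-clock in a clean environment (e.g. ~3 min for OB) as the budget means every env-slow-start run is falsely reported as a product hang.
+
+**Anti-pattern 2: using one oversized number as the budget.** Setting the budget above 1 hour feels "safe", but when the product really is stuck the runner sits idle until the budget expires, leaving no usable failure scene to inspect.
+
+**Goal**: size the runner's T1 budget from the **measured env-slow-start distribution**, independently of the product's own normal start time. When the budget is exceeded, do not declare product RED directly; classify by layer instead.
+
+> **Scope**: this guide covers only the budget for T1-style "wait for Cluster CR Running" gates. Budgets for OpsRequest waits such as T-PITR.4, postReady waits, and per-statement SQL timeouts are separate budget layers; see `addon-postready-bounded-timeout-failure-classification-guide.md` §2 Rule E.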
+
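+## 2. Hard rules
+
+### Rule A: the T1 budget must be independently tunable and decoupled from the product's normal startup time
+
+```bash
+# Counter-example (hard-coded)
+wait_cluster_running "$CLUSTER" 300   # product normal startup is ~3 min; the budget allows only 5 min
+```
+
+```bash
+# OK: read from an environment variable so callers can override
+TIMEOUT_BOOTSTRAP="${TIMEOUT_BOOTSTRAP:-1800}"
+wait_cluster_running "$CLUSTER" "$TIMEOUT_BOOTSTRAP"
+```
+
+- The default is the measured env P99 + 50% buffer, not the product's normal startup time.
+- Any caller (CI, IDC, local dev, remote vcluster) can override it individually.
+
+### Rule B: the measured env-slow-start distribution is the only basis for the default
+
+The test team must maintain samples of the actual time-to-Running for clusters with the same cluster scale, the same storage backend, and the same image registry path: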
+
+```
+sample_id   image_pull_s   pvc_bind_s   observer_boot_s   total_to_Running_s   env_notes
+N=1         15             3            68                86                   warm registry, fresh sc
+N=2         180            8            68                256                  cold registry pull
+N=3         3              3            72                78                   warm
+...
+```
+
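+Collect at least 10 samples across different env states, take the P99, and add a 50% buffer on top to get the default budget. For OB on vcluster + minio registry, the measured P99 is ~1800s (including extreme cold-pull samples), so the OB harness defaults to `TIMEOUT_BOOTSTRAP=1800`. This is **not** because the product needs 30 minutes to become ready, but because the env can take that long in cold-pull / cold-PVC-attach scenarios.
+
+### Rule C: exceeding the budget is not an immediate product RED; classify by layer
+
+When `wait_cluster_running` exits on timeout, emit at least 4 observations so the upper layer can classify: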
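+
+One way to turn such a sample table into a default is sketched below. It assumes the `total_to_Running_s` column has been exported as one integer per line; `budget_from_samples` is a hypothetical helper name, and nearest-rank is only one of several accepted P99 definitions.
+
+```bash
+# Hypothetical helper (not part of any existing harness).
+# Reads time-to-Running samples in whole seconds, one per line, on stdin.
+# Prints nearest-rank P99 plus a 50% buffer, rounded up to a whole second.
+# Returns non-zero when no samples were supplied.
+budget_from_samples() {
+  sort -n | awk '
+    NF { v[++n] = $1 }                      # skip blank lines, keep sorted values
+    END {
+      if (n == 0) { exit 1 }
+      rank = int((n * 99 + 99) / 100)       # ceil(0.99 * n)
+      p99 = v[rank]
+      print int((p99 * 3 + 1) / 2)          # ceil(p99 * 1.5)
+    }'
+}
+
+# Usage with the three rows shown in the sample table above:
+printf '86\n256\n78\n' | budget_from_samples
+```
+
+For those three rows the helper prints 384. With fewer than 100 samples the nearest-rank P99 is simply the slowest sample, so a single cold-pull outlier dominates the result; that is one reason to keep collecting samples across env states.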
+
+```bash
+if ! wait_cluster_running "$CLUSTER" "$TIMEOUT_BOOTSTRAP"; then
+  # 1. Product view: what phase is the Cluster CR in?
+  phase=$(kubectl get cluster "$CLUSTER" -n "$NAMESPACE" -o jsonpath='{.status.phase}')
+  # 2. K8s view: how many pods are actually running, and which containers are ready?
+  pods_ready=$(kubectl get pod -n "$NAMESPACE" -l "app.kubernetes.io/instance=$CLUSTER" -o jsonpath='{range .items[*]}{.metadata.name}={.status.containerStatuses[*].ready}{"\n"}{end}')
+  # 3. Image view: do recent pod events show ImagePullBackOff / DeadlineExceeded?
+  events=$(kubectl get events --field-selector involvedObject.kind=Pod -n "$NAMESPACE" --sort-by='.lastTimestamp' | tail -30)
+  # 4. Node view: DiskPressure / MemoryPressure / NotReady?
+  nodes=$(kubectl get nodes -o wide)
+
+  echo "ERROR: T1 wait budget exceeded after ${TIMEOUT_BOOTSTRAP}s; classify layer before declaring product RED." >&2
+  echo "phase=$phase"
+  echo "pods_ready=$pods_ready"
+  echo "events_tail=$events"
+  echo "nodes=$nodes"
+  exit 72   # CONTROL_PLANE_STALL or wider; let upper-layer classify
+fi
+```
+
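+**Classification guide** (see the 5 layers in `addon-test-acceptance-and-first-blocker-guide.md`):
+
+| Observation | First-layer classification |
+|---|---|
+| No pod has been scheduled onto any node | env / scheduler |
+| Pods are scheduled but stuck in ImagePullBackOff / `ErrImagePull` | env / image-pull |
+| Pod PVCs stay Pending for a long time | env / storage |
+| Nodes report DiskPressure / MemoryPressure | env / node-pressure |
+| Pods are 4/4 Ready but the cluster phase is not Running | control-plane / KB controller |
+| Pods are Running but the engine process exits abnormally / is OOM-killed / refuses service | product / engine |
+| The K8s API itself returns EOF / TLS handshake timeout | env / k8s API entrance |
+
+### Rule D: an over-budget sample is not attributed to product by default as the first blocker; **classify by layer before any rerun**
+
+Lesson from practice (see Appendix A.1): in postreadyfix3, T1 exceeded its 1200s budget once, and the cluster actually reached Running only after ~30 min. **The initial classification must be "env / runner-window"**, not "product hang".
+
+Discrimination hints:
+- If continued polling after the budget shows the cluster eventually reaching Running with no product error code → **env / runner-window**.
+- If the cluster stays stuck indefinitely after the budget and the engine log shows a dead loop / panic → **product / engine**.
+- If the cluster is stuck in `Creating` while the KB controller keeps retrying reconcile → **control-plane**.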
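+
+The three hints can be written down as a small decision function. This is a sketch only: `classify_after_budget` and its three arguments are hypothetical, and how each input is gathered (a bounded re-poll after the budget, an engine log scan, a controller log scan) is left to the harness.
+
+```bash
+# Hypothetical classifier for the three hints above. All arguments are assumptions:
+#   $1 = cluster phase seen after the budget expired and a further bounded re-poll
+#   $2 = "yes" if a product error code or an engine dead loop / panic was observed
+#   $3 = "yes" if the KB controller is still retrying reconcile
+classify_after_budget() {
+  local final_phase=$1 product_error=$2 controller_retrying=$3
+  if [ "$final_phase" = "Running" ] && [ "$product_error" != "yes" ]; then
+    echo "env / runner-window"
+  elif [ "$final_phase" != "Running" ] && [ "$product_error" = "yes" ]; then
+    echo "product / engine"
+  elif [ "$final_phase" = "Creating" ] && [ "$controller_retrying" = "yes" ]; then
+    echo "control-plane"
+  else
+    # None of the three hints matched; leave the decision to a human reviewer.
+    echo "unclassified"
+  fi
+}
+
+# Usage: the cluster reached Running late and no product error was seen.
+classify_after_budget Running no no
+```
+
+The `unclassified` fallback is deliberate: the hints are not exhaustive, and a sample that matches none of them should go through the full layer table rather than default to product.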
+
+### Rule E: separate the caller-side budget from the helper-side budget
+
+```
+runner harness top-level
+  └─ wait_cluster_running budget ($TIMEOUT_BOOTSTRAP)
+        └─ inner per-poll kubectl get (--request-timeout=15s)
+              └─ K8s API server-side timeout
+```
+
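+Each outer level must be larger than every level inside it. If any level swallows the signal (for example, an inner `kubectl get` run without `--request-timeout` that hangs for 5 min), the outer budget accounting is distorted.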
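+
+A minimal sketch of a helper that respects this layering follows. It is not the harness's actual `wait_cluster_running`; the 10s default poll interval, the 15s per-request cap, and the `NAMESPACE` fallback to `default` are all assumptions chosen for illustration.
+
+```bash
+# Sketch only; interval, request timeout, and namespace fallback are assumptions.
+# Usage: wait_cluster_running <cluster> <budget_seconds> [poll_interval_seconds]
+wait_cluster_running() {
+  local cluster=$1 budget=$2 interval=${3:-10}
+  local deadline phase
+  deadline=$(( $(date +%s) + budget ))
+  while [ "$(date +%s)" -lt "$deadline" ]; do
+    # Inner budget: a hung API call costs at most 15s of the outer budget.
+    phase=$(kubectl get cluster "$cluster" -n "${NAMESPACE:-default}" \
+      --request-timeout=15s -o jsonpath='{.status.phase}' 2>/dev/null) || phase=""
+    if [ "$phase" = "Running" ]; then
+      return 0
+    fi
+    sleep "$interval"
+  done
+  return 1   # budget exceeded; the caller collects observations and classifies
+}
+```
+
+Because the deadline is checked only at the top of the loop, the last poll can overrun the budget by up to one request timeout plus one interval, which is another reason to keep the inner values small relative to `TIMEOUT_BOOTSTRAP`.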
+
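+## 3. PR review checklist
+
+Scan any PR that adds or changes `wait_cluster_running` or a T1-style wait against these 6 items:
+
+1. Does the budget come from an environment variable (e.g. `TIMEOUT_BOOTSTRAP`) rather than being hard-coded?
+2. Is the default based on the measured env P99 (rather than the product's normal startup time)?
+3. Does an over-budget exit print the 4 observations (cluster phase / pod ready / events / nodes)?
+4. Is the over-budget exit reason the generic `CLUSTER_NOT_RUNNING_WITHIN_BUDGET`, leaving classification to the upper layer?
+5. Does the inner `kubectl get` carry `--request-timeout`, so it cannot itself hang past the budget?
+6. Does the closeout / report write "over budget" directly as product RED? (It should not.)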
+
+## 4. Counter-examples vs. correct example
+
+### Counter-example 1 (product normal time used as the budget)
+
+```bash
+TIMEOUT_BOOTSTRAP=180    # "3 minutes should be enough for OB, right?"
+wait_cluster_running "$CLUSTER" $TIMEOUT_BOOTSTRAP
+```
+
+Every cold-pull / cold-PVC run is falsely reported as a product hang.
+
+### Correct example
+
+```bash
+# Measured P99 + 50%
+TIMEOUT_BOOTSTRAP="${TIMEOUT_BOOTSTRAP:-1800}"
+if ! wait_cluster_running "$CLUSTER" "$TIMEOUT_BOOTSTRAP"; then
+  collect_layer_classification_evidence "$CLUSTER" "$NAMESPACE"   # writes the 4 observations
+  echo "ERROR: T1_cluster_running_gate_exceeded_${TIMEOUT_BOOTSTRAP}s"
+  exit 72
+fi
+```
+
+### Counter-example 2 (over budget declared product RED directly)
+
+```bash
+if ! wait_cluster_running "$CLUSTER" 1200; then
+  echo "ERROR: PRODUCT BUG: cluster did not start"
+  fail_test
+fi
+```
+
+Fix: follow the correct example and §2 Rule C, collecting phase / pod / events / nodes first and only then classifying by layer.
+
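+## 5. Relationship to other skills
+
+- `addon-bounded-eventual-convergence-guide.md`: the T1 wait is an outer-level instance of bounded retry; this doc is the budgeted form of bounded retry for the cluster setup scenario.
+- `addon-postready-bounded-timeout-failure-classification-guide.md`: postReady is the inner layer of the T-PITR.5 wait, while this doc is the peer methodology for the T1 wait; for how the two budget levels nest, see §2 Rule E.
+- `addon-test-acceptance-and-first-blocker-guide.md`: the layer classification in §2 Rules C / D is taken directly from that doc.
+- `addon-evidence-discipline-guide.md`: a single over-budget sample is not "systematic falsification of the product"; this doc shares its origin with evidence-discipline.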
+
+## Appendix A: OceanBase enterprise addon cases
+
+### Case A.1: T1 budget adjustment between postreadyfix3 and postreadyfix4
+
+In early runs on 2026-05-13, the OceanBase enterprise addon PITR runtime test (`postreadyfix3`, RUN_ID `pitr-runtime-postreadyfix3-20260513T004039Z`) defaulted to `TIMEOUT_BOOTSTRAP=1200`.
+
+- After the 1200s T1 budget, the runner reported `Cluster ob-pitr-prfix3-n1 did not reach Running within 1200 s`, with phase=Creating.
+- The runner had already exited by then, but the cluster actually entered `Running` at roughly `creationTimestamp+30min`, with pods 4/4 Running.
+- Post-mortem: first-blocker layer = **env / runner-window** (not product RED):
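+  - no product error code
+  - the cluster later reached Running on its own
+  - the vcluster API entry had a transient TLS handshake timeout mid-T1 (an env signal)
+- Fix: raise the default `TIMEOUT_BOOTSTRAP` from `1200` to `1800` (measured env P99 + 50% buffer).
+- The immediately following `postreadyfix4` run (RUN_ID `pitr-runtime-postreadyfix4-20260513T013850Z`) reached Running at T1 in only ~4 minutes under the new budget, PASS 15/0/0.
+
+Scope statement: the 1800s default has been calibrated to P99 + 50% only within the env scope "idc4 vcluster + apecloud-registry.cn-zhangjiakou image source + apelocal-hostpath-default SC"; other envs should be calibrated independently.
+
+### Case A.2: 6 follow-up samples all reached Running in ≤ 4 minutes
+
+After postreadyfix4, the OceanBase enterprise addon PITR test collected 6 samples across two independent run paths (Mac+port-forward, and idc4 host-runner+in-cluster DNS); the actual T1 duration was ≤ 4 minutes in every one. This **cannot** be used as product acceptance evidence; it only supports the statement "T1 was far inside the 1800s budget under the env states of these 6 samples". The env cold-pull case did not occur in these 6 samples.
+
+End of appendix. T1 budget data for other engines or other envs should go in a separate appendix.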