apecloud · weicao · May 13, 2026
diff --git a/docs/SKILL-INDEX.md b/docs/SKILL-INDEX.md
@@ -91,6 +91,7 @@
 - [`addon-test-runner-portability-guide.md`](addon-test-runner-portability-guide.md): 8 cross-platform compatibility pitfalls: 7 under macOS bash 3.2 with `set -euo pipefail`, plus 1 on GNU `seq` vs BSD `seq` zero-count behavior (empty arrays, env-default timing, `local x=$(cmd)`, `seq 1 0`, etc.)
 - [`addon-test-runner-cadence-discipline-guide.md`](addon-test-runner-cadence-discipline-guide.md): fixed-cadence reporting discipline during long-running runner operations; cadence is the operator's obligation, and the trigger must be independent of the runner process
 - [`addon-test-runner-write-after-bounded-role-gate-guide.md`](addon-test-runner-write-after-bounded-role-gate-guide.md): every runner write step must come after a bounded, multi-perspective role-publish gate; once the gate passes, the downstream primary must be re-anchored from the gate; the gate's write probe may run only on the K8s role-label primary, and secondaries must not write to replicated tables (otherwise an internal admin user bypasses read_only and hits error 1062); includes two cases, alpha.58 preload-after-gate and alpha.59 secondary write probe
+- [`addon-runner-incluster-vcluster-access-pattern-guide.md`](addon-runner-incluster-vcluster-access-pattern-guide.md): when running addon tests on a remote IDC / cross-network vcluster, place the runner pod inside the host k8s and reach the vcluster API through in-cluster service DNS (`https://<vc-svc>.<vc-ns>.svc.cluster.local:443`), which takes the workstation network off the critical path; the workstation uses `kubectl exec/cp` only as the control path, and evidence is retrieved through a `kubectl exec ... -- tar c \| tar x` pipe (avoiding `etcdserver: request timed out` from `kubectl cp`); `nohup ... &` lets exec return immediately instead of pinning the workstation; closeout must assess host API impact and runner results **independently**; includes a case of the OceanBase enterprise addon moving from workstation + port-forward to an idc4 host-runner + in-cluster DNS (boundary evidence from 3 samples only, not extrapolated)
 
 ### 6. Collaboration / GitHub Submission Discipline
 

diff --git a/docs/addon-runner-incluster-vcluster-access-pattern-guide.md b/docs/addon-runner-incluster-vcluster-access-pattern-guide.md
@@ -0,0 +1,207 @@
+# Addon Test Runner — In-Cluster vcluster API Access Pattern (vs. Cross-Network port-forward)
+
+> **Audience**: addon test engineer / TL / runner harness owner
+> **Status**: stable methodology
+> **Applies to**: any KB addon test harness that needs to talk to a vcluster API where the vcluster runs inside a remote host k8s cluster (IDC, cloud, etc.) and the test author is on a separate workstation.
+> **Applies to KB version**: any.
+
+Category: methodology topic document (not tied to any single engine).
+
+## 1. The problem this document addresses
+
+When addon tests run against an IDC / remote vcluster, there are typically three topologies for reaching the vcluster API:
+
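+| Option | Shape | Main risks |
+|---|---|---|
+| (1) workstation + `kubectl port-forward` | The workstation runs `kubectl port-forward -n <ns> svc/<vc-svc> 16443:443`, and the test scripts run locally on the workstation against `https://127.0.0.1:16443` | The port-forward TCP stream dies silently (process alive, connection dead) and every kubectl call on the workstation times out; if this hits mid-test, any of the T1 / postReady / cleanup phases can be interrupted |
+| (2) workstation + host NodePort | The workstation talks directly to a NodePort on the host k8s (the vcluster service is exposed as a NodePort) | Still depends on a stable workstation ↔ host k8s network; adds NodePort issues such as firewalled ports and certificate SAN mismatches |
+| (3) **runner pod inside the host k8s + in-cluster service DNS** | The test scripts run in a Pod on the host k8s, and the pod reaches the vcluster API at `https://<vc-svc>.<vc-ns>.svc.cluster.local:443` directly over cluster DNS and the cluster network | Workstation network fluctuation **does not affect** the test run itself; the workstation only uses `kubectl exec/cp` for evidence collection and start/stop, and this control path is decoupled from the test run |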
+
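+**Key observation**: options (1) and (2) put workstation network stability on the critical path; option (3) takes it off. In cross-region / remote IDC scenarios, workstation network fluctuation is frequent background noise, and the test's critical path should not have to absorb it.
+
+> **Scope framing**: this doc does not require every team to use option (3). It says that once you have already hit mid-test network interruptions with option (1) or (2), option (3) is the structural way to eliminate that class of blocker, and it explains how to put it in place (kubeconfig / RBAC / evidence retrieval / start-stop control).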
+
+## 2. Hard rules
+
+### Rule A: test pods use in-cluster service DNS, with no dependency on NodePort or workstation port-forward
+
+```yaml
+# OK: the kubeconfig inside the pod points straight at the service DNS.
+# stringData lets the kubeconfig be written as plain text; the API server
+# stores it base64-encoded under data.
+apiVersion: v1
+kind: Secret
+metadata:
+  name: vc-kubeconfig
+  namespace: addon-runner-host
+stringData:
+  kubeconfig: |
+    apiVersion: v1
+    clusters:
+    - cluster:
+        certificate-authority-data: <vcluster CA>
+        server: https://ob-vcluster.ob-vcluster.svc.cluster.local:443
+      name: vc
+    # ...
+```
+
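+Notes:
+
+- The `server` field must be the in-cluster DNS name, not `127.0.0.1:<port>` and not a NodePort `<node-ip>:<port>`.
+- The CA must be the vcluster's own CA (the same one in the raw kubeconfig used from the workstation), not the host k8s CA.
+- Client certificates / tokens are still issued by the vcluster, for its admin user or a service account.
+
+### Rule B: the runner pod lives in a fixed namespace on the host k8s, stays Running, and mounts the harness PVC and the kubeconfig Secret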
+
+```yaml
+apiVersion: v1
+kind: Pod
+metadata:
+  name: addon-runner
+  namespace: addon-runner-host
+spec:
+  containers:
+  - name: runner
+    image: <ci-image-with-kubectl-and-shell>
+    command: ["sh", "-c", "sleep infinity"]
+    env:
+    - name: KUBECONFIG
+      value: /kube/config
+    volumeMounts:
+    - mountPath: /kube
+      name: vc-kubeconfig
+      readOnly: true
+    - mountPath: /work
+      name: harness
+  volumes:
+  - name: vc-kubeconfig
+    secret:
+      secretName: vc-kubeconfig
+      items:
+      - key: kubeconfig
+        path: config
+  - name: harness
+    persistentVolumeClaim:
+      claimName: addon-runner-work
+```
+
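+- `sleep infinity` keeps the pod Running, so different tests can be triggered through repeated `kubectl exec` calls without creating a pod each time.
+- Harness code and evidence output live on the PVC and survive pod restarts.
+- The kubeconfig is injected through the Secret; neither CI nor humans drop a kubeconfig directly into the pod filesystem.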
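+
+The Pod above references `claimName: addon-runner-work` without showing the claim itself. A minimal sketch of such a PersistentVolumeClaim follows; the access mode and the `5Gi` size are assumptions for illustration, and the storage class is left to the cluster default, so adjust all three to the host cluster.
+
+```yaml
+# Hypothetical PVC backing /work; size and access mode are assumed values.
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: addon-runner-work
+  namespace: addon-runner-host
+spec:
+  accessModes:
+  - ReadWriteOnce        # a single runner pod mounts the volume
+  resources:
+    requests:
+      storage: 5Gi       # assumed; size it to harness code plus evidence
+```
+
+The claim has to sit in the same namespace as the Pod, because a Pod can only mount PVCs from its own namespace.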
+
+### Rule C: start and monitor tests with `kubectl exec`, **but do not let exec itself become the critical path**
+
+```bash
+# Start the test from the workstation: background the in-pod process with nohup so exec returns immediately
+kubectl exec -n addon-runner-host addon-runner -- sh -c 'nohup bash /work/run-pitr.sh > /work/evidence/<RUN_ID>/runner.log 2>&1 & echo $!'
+```
+
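+Key points:
+
+- `nohup ... &` plus `echo $!` makes `kubectl exec` return immediately, printing the PID of the background process, instead of pinning the workstation.
+- The test process runs on its own inside the pod; even if the workstation network drops afterwards, the test keeps running.
+- Later polling is also "one exec to look at the tail, then exit", not "hold a long exec stream open".
+
+### Rule D: pull evidence from the pod back to the workstation with `kubectl exec ... -- tar c | tar x`, not `kubectl cp`
+
+`kubectl cp` is effectively a wrapper around `kubectl exec ... -- tar c`, but its implementation adds an archive-normalization layer, and once the remote etcd / API server is slow it tends to fail with `etcdserver: request timed out`. Streaming `exec ... -- tar c <DIR>` directly through a pipe is more stable.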
+
+```bash
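+# OK: stream a tar archive directly
+# "-f -" forces the archive onto stdout regardless of tar's compiled-in default
+kubectl exec -n addon-runner-host addon-runner -- sh -c "cd /work/evidence && tar cf - <RUN_ID>" > /tmp/<RUN_ID>.tar
+mkdir -p ./local-evidence   # tar -C needs the target directory to exist
+tar xf /tmp/<RUN_ID>.tar -C ./local-evidence/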
+```
+
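+### Rule E: the test's critical path must not make any hard assumption about workstation network stability
+
+The closeout must list two things as separate items: "did the workstation ↔ host k8s API entry see timeouts / EOF while the test was running" and "did that affect the test". The two are decoupled: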
+
+```
+workstation -> host k8s API : may see transient TLS handshake timeout / EOF
+   ↓
+runner pod -> vcluster API (in-cluster DNS) : independent path; unrelated to the workstation network
+```
+
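+If the upper path saw a few timeouts during the test run, the test result on the lower path can still be PASS or FAIL; the two do **not** corroborate each other and do **not** invalidate each other.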
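+
+One way to keep the two judgments apart is to give each its own section in the closeout record. The sketch below is hypothetical: the field names are not an existing template and only illustrate the separation that Rule E asks for.
+
+```yaml
+# Hypothetical closeout record; every field name here is illustrative.
+run_id: <RUN_ID>
+runner_result:                  # judged only from evidence written inside the pod
+  verdict: PASS                 # or FAIL
+  evidence: /work/evidence/<RUN_ID>/runner.log
+host_api_impact:                # judged only from workstation-side observations
+  timeout_or_eof_observed: true
+  affected_test_run: false
+  note: transient TLS handshake timeout / EOF on workstation -> host k8s API
+```
+
+Neither section is derived from the other: a `true` under `host_api_impact` does not downgrade the verdict, and a PASS does not imply that the host API path was clean.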
+
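+## 3. PR / design review checklist
+
+Five items:
+
+1. Is the `server` field of the pod's kubeconfig the in-cluster service DNS name (not a NodePort, not 127.0.0.1)?
+2. Does the pod stay Running (`sleep infinity` or similar) rather than one pod per test?
+3. Is the test started with `nohup ... &` so that exec returns immediately (without pinning the workstation)?
+4. Is evidence retrieved with `kubectl exec ... -- tar c | tar x` rather than a recursive `kubectl cp` of the whole directory?
+5. Does the closeout template have a host API impact section that is assessed **independently** of the runner result?
+
+## 4. Counter-examples vs. positive example
+
+### Counter-example 1 (running tests over a workstation port-forward)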
+
+```bash
+# on the workstation: start the port-forward
+kubectl --kubeconfig host-k8s port-forward -n ob-vcluster svc/ob-vcluster 16443:443 &
+# on the workstation: run the test
+KUBECONFIG=./vc-127.0.0.1 ./tests/pitr.sh
+```
+
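+When the `port-forward` process is alive but its TCP stream has silently died, any of the T1 / postReady / cleanup phases can hit a mid-test interruption, and the test cannot tell whether the product is stuck or the network is down.
+
+### Positive example (runner pod inside the host k8s + in-cluster DNS)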
+
+```bash
+# the workstation only acts as the control path
+kubectl exec -n addon-runner-host addon-runner -- sh -c 'nohup bash /work/run-pitr.sh > /work/evidence/<RUN_ID>/runner.log 2>&1 & echo $!'
+
+# inside the pod: KUBECONFIG=/kube/config, server=https://<vc-svc>.<vc-ns>.svc.cluster.local:443
+# the test is decoupled from the workstation network
+```
+
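+The test inside the pod keeps running while the workstation network fluctuates; once the workstation recovers, collect evidence with `kubectl exec ... -- tail` / `kubectl exec ... -- tar c`.
+
+### Counter-example 2 (`kubectl cp` of a whole directory)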
+
+```bash
+kubectl cp addon-runner:/work/evidence/<RUN_ID> ./local-evidence/<RUN_ID>
+```
+
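+When the remote etcd is slow, this hits `etcdserver: request timed out` and needs several retries, wasting budget.
+
+Fix: use the `tar c | tar x` pipe from §2 Rule D.
+
+## 5. Relationship to other skills
+
+- `addon-host-runner-job-pattern-guide.md`: that doc covers the host-side runner Job pattern (a fresh Job pod per run); this doc covers the long-running pod + in-cluster DNS pattern. They are complementary: use the Job pattern for short tasks and the sleep-infinity pattern for long sessions or repeated interaction.
+- `addon-test-runner-cadence-discipline-guide.md`: while the workstation network is down, test cadence is maintained inside the runner pod, consistent with §2 Rule C of this doc.
+- `addon-test-acceptance-and-first-blocker-guide.md`: host API entry instability is classified as env L1; §2 Rule E of this doc is consistent with that layering.
+- `addon-evidence-discipline-guide.md`: a single sample of host API instability must not be written up as a systemic env outage, consistent with evidence-discipline.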
+
+## Appendix A: OceanBase enterprise addon case
+
+### Case A.1: switching from workstation + port-forward to idc4 host-runner + in-cluster DNS
+
+Before 2026-05-13, the OceanBase enterprise addon PITR tests ran on a workstation (Mac) and reached the idc4 vcluster through `kubectl port-forward -n ob-vcluster svc/ob-vcluster 16443:443`. Two classes of mid-test interruption were hit during that period:
+
+1. **OpenAPI fetch transient**: the `kubectl apply -f -` subcommand hit `Client.Timeout exceeded while awaiting headers` while fetching the OpenAPI schema, and T-PITR.4 failed; it recovered on its own after 30s.
+2. **port-forward TCP stale**: the port-forward process was alive but the stream was dead, every T1 wait read an empty phase, and the runner reported cluster did not reach Running once the budget expired.
+
+After switching to option (3):
+- Reused the existing host pod `oceanbase-runner-host/ob-runner-d1` (image `apecloud-registry.cn-zhangjiakou.cr.aliyuncs.com/apecloud/web-ui-test:main`, sleep infinity)
+- Mounted the vcluster kubeconfig Secret, with server = `https://ob-vcluster.ob-vcluster.svc.cluster.local:443`
+- Synced the harness into the pod at `/work/oceanbase/` (including `tests/pitr.sh` `f4b3375e...`, `lib/common.sh` `fdebb102...`, `lib/cluster.sh` `9e0e8ca7...`)
+- Started tests with `kubectl exec ... -- sh -c 'nohup bash tests/pitr.sh > ... 2>&1 & echo $!'`
+- Retrieved evidence with `kubectl exec ob-runner-d1 -- sh -c "cd /work/oceanbase/evidence && tar c <RUN_ID>" > /tmp/<RUN_ID>.tar`, avoiding `etcdserver: request timed out` from `kubectl cp`
+
+Boundary of the observed samples: across the 3 sequential PITR samples after the switch, the workstation → idc4 host k8s API path saw at least 2 TLS handshake timeouts / EOFs, and **none** of them affected the execution of the in-pod test or the continuity of its evidence (`runner.log` has no gaps, and the pitr/ evidence is continuous). This is boundary evidence for these 3 samples only and is **not** extrapolated to "the workstation network has no effect at all".
+
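+### Case A.2: counter-example for comparison
+
+A T1 failure from before the switch (`pitr-runtime-runner-hardening-N3-...`):
+- The runner ran on the workstation Mac and reached the vcluster API through port-forward
+- During the T1 wait the port-forward TCP stream died and the runner read an empty `phase=`
+- After the 1800s budget the runner reported `Cluster ... did not reach Running within 1800 s`
+- The cluster had in fact entered Running a few minutes after `creationTimestamp`, with pods 4/4
+- Closeout classification = **runner-harness channel / workstation pf staleness**, not product RED
+
+Choosing option (3) from the §1 table is the structural fix for this case; writing runtime keepalive monitoring (see `addon-runner-portforward-staleness-keepalive-guide.md`, if it exists) is only a stopgap and does not remove the architecture-level risk.
+
+This is the end of the appendix. For other engines or other remote-cluster topologies, start a separate appendix.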