apecloud · weicao · May 13, 2026
diff --git a/docs/SKILL-INDEX.md b/docs/SKILL-INDEX.md
@@ -91,6 +91,7 @@
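 - [`addon-test-runner-portability-guide.md`](addon-test-runner-portability-guide.md) - 8 cross-platform compatibility pitfalls: 7 under macOS bash 3.2 + `set -euo pipefail`, plus 1 GNU `seq` vs BSD `seq` zero-count case (empty arrays, env-default timing, `local x=$(cmd)`, `seq 1 0`, etc.)
 - [`addon-test-runner-cadence-discipline-guide.md`](addon-test-runner-cadence-discipline-guide.md) - fixed-cadence reporting discipline during long-running runner operations; cadence is the operator's obligation, and the trigger must be independent of the runner process
 - [`addon-test-runner-write-after-bounded-role-gate-guide.md`](addon-test-runner-write-after-bounded-role-gate-guide.md) - every runner write step must come after a bounded, multi-perspective role-publish gate; once the gate passes, the downstream primary must be re-anchored from the gate; the gate's write probe may run only on the K8s role-label primary, and a secondary must not write to replicated tables (so that an internal admin user does not bypass read_only and then hit error 1062); includes two cases: alpha.58 preload-after-gate and alpha.59 secondary write probe
+- [`addon-runner-portforward-staleness-keepalive-guide.md`](addon-runner-portforward-staleness-keepalive-guide.md) - lightweight `/version` keepalive stop-gap for when a workstation-side `kubectl port-forward` suffers silent TCP stream death (process alive, stream dead): a precise pkill pattern that does not kill other port-forwards, INTERVAL and FAIL_THRESHOLD tuned separately, a monitor that only probes and restarts without proxying requests, and line-by-line logs that can be looked up at closeout; states explicitly that the **preferred structural fix** is moving the runner into the host k8s with in-cluster service DNS (see `addon-runner-incluster-vcluster-access-pattern-guide.md`) and that this doc is only a transitional stop-gap; includes an OceanBase enterprise addon case where the N=3 attempt hit pf-staleness, the keepalive was deployed for 1 sample, and the keepalive was later retired after the switch to the structural fix (not extrapolated to "staleness risk permanently eliminated")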
 
 ### 6. Collaboration / GitHub submission discipline
 

diff --git a/docs/addon-runner-portforward-staleness-keepalive-guide.md b/docs/addon-runner-portforward-staleness-keepalive-guide.md
@@ -0,0 +1,185 @@
+# Addon Test Runner — `kubectl port-forward` TCP Staleness and Lightweight `/version` Keepalive (Stop-gap)
+
+> **Audience**: addon test engineer / TL / runner harness owner
+> **Status**: stop-gap methodology (preferred structural fix is the in-cluster pattern; see §5)
+> **Applies to**: addon test harnesses that still depend on a workstation-side `kubectl port-forward` channel to a remote vcluster / k8s API.
+> **Applies to KB version**: any.
+
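+Category: methodology topic document (not tied to any single engine).
+
+## 1. The problem this doc addresses
+
+When `kubectl port-forward -n <ns> svc/<svc> <local>:<remote>` runs for a long time on a workstation, it has a **silent failure mode**:
+
+- The process is still alive (its PID is visible in `ps`).
+- But the underlying TCP stream is dead (keepalive packets dropped, the peer socket closed, or an intermediate proxy device dropped the connection).
+- Every kubectl request that goes through `127.0.0.1:<local>` on the workstation returns `EOF` / `Client.Timeout` / `connection refused after a previous attempt`.
+
+At this point the test runner **cannot distinguish** between:
+- Is the remote API server really down?
+- Has the vcluster control plane slowed down?
+- Or has the port-forward pipe died while the endpoint itself is healthy?
+
+Observed in practice (see Appendix A): during the T1 wait, 5/5 `/version` probes failed on the workstation, yet from the host k8s the vcluster pod and the OB pods were all Running and healthy; killing the old port-forward and starting a new one gave 5/5 PASS immediately. **The root cause was the dead port-forward TCP stream on the workstation, not a real remote outage.**
+
+**Goal (stop-gap)**: run a **lightweight monitor** on the workstation that periodically probes `/version` and, once consecutive failures reach the threshold, kills and restarts the port-forward, **so that the channel the workstation sees stays live**.
+
+> **Boundary framing**: this doc is a **stop-gap**, not a structural fix. The **preferred** approach is still to move the runner into the host k8s and use in-cluster service DNS (see `addon-runner-incluster-vcluster-access-pattern-guide.md`). When the structural switch is not possible in the short term, this doc provides a minimal workstation-side monitor.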
+
+## 2. Hard rules
+
+### Rule A - the monitor **only** probes and restarts; it does **not** proxy requests or retry
+
+```bash
+fail=0  # OK: minimal monitor loop; fail counts consecutive failed probes
+while true; do
+  if ! kubectl --kubeconfig="$KUBECONFIG_VC" get --raw=/version >/dev/null 2>&1; then
+    fail=$((fail+1))
+    if [ "$fail" -ge "$FAIL_THRESHOLD" ]; then
+      pkill -f "kubectl port-forward -n $NS svc/$SVC $LOCAL:$REMOTE" || true
+      sleep 2  # --kubeconfig goes last below so the pkill pattern above also matches the restarted process
+      nohup kubectl port-forward -n "$NS" "svc/$SVC" "$LOCAL:$REMOTE" --kubeconfig="$KUBECONFIG_HOST" >>"$LOG" 2>&1 &
+      sleep 5
+      fail=0
+    fi
+  else
+    fail=0
+  fi
+  sleep "$INTERVAL"
+done
+```
+
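+**Not allowed**:
+- Retrying business requests inside the monitor (the monitor has no knowledge of request semantics, so it cannot retry them correctly).
+- Caching or proxying any request body.
+- Starting a sidecar or a complex gateway.
+
+The monitor's **only** purpose is to restore the channel; business-side retry / bounded wait is the runner's own responsibility.
+
+### Rule B - tune probe frequency and failure threshold separately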
+
+```bash
+INTERVAL=20         # probe period (seconds)
+FAIL_THRESHOLD=2    # restart only after N consecutive failures
+```
+
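+- Too small an `INTERVAL` (< 5s) wastes workstation CPU and API server capacity on probes; too large (> 60s) means a dead stream goes undetected for too long. 20-30s is usually suitable.
+- `FAIL_THRESHOLD=2` guards against false positives: a single transient API blip does not trigger a restart.
+- Do not use `FAIL_THRESHOLD=1`: killing and restarting on every minor blip only makes things worse.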
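+
+To make the trade-off between the two knobs concrete, the following is a minimal sketch that estimates an upper bound on the time between the stream dying and the restart firing. `PROBE_TIMEOUT` and the helper name `worst_case_detect_s` are assumptions made for illustration; the monitor script in this doc does not set a per-probe timeout, and kubectl's global `--request-timeout` flag is one way to bound a single probe.
+
+```bash
+# Hypothetical helper: upper bound (seconds) between the TCP stream dying and
+# the monitor deciding to restart. Assumes the stream dies right after a good
+# probe and that each failing probe takes at most probe_timeout seconds.
+worst_case_detect_s() {
+  local interval=$1 fail_threshold=$2 probe_timeout=$3
+  echo $(( fail_threshold * (interval + probe_timeout) ))
+}
+
+INTERVAL=20
+FAIL_THRESHOLD=2
+PROBE_TIMEOUT=5   # assumption: probe is run with --request-timeout=${PROBE_TIMEOUT}s
+
+worst_case_detect_s "$INTERVAL" "$FAIL_THRESHOLD" "$PROBE_TIMEOUT"   # prints 50
+```
+
+Under these assumptions, raising `INTERVAL` to 30 moves the bound to 70s, while raising `FAIL_THRESHOLD` to 3 moves it to 75s, which is why the two values are worth tuning independently.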
+
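+### Rule C - restart must use a precise pkill pattern and must not kill other port-forwards
+
+```bash
+# Counter-example: kills every port-forward on the host
+pkill -f "kubectl port-forward"
+
+# OK: precisely matches the one managed by this monitor
+pkill -f "kubectl port-forward -n $NS svc/$SVC $LOCAL:$REMOTE"
+```
+
+Several port-forwards may be running on the workstation at once (other projects, other vclusters); do not let them kill each other.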
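+
+The difference between the two patterns can be checked offline. The sketch below is illustrative only: the namespaces, service names, and ports are hypothetical, and `grep -E` over sample command lines stands in for the extended-regex match that `pkill -f` performs against each process's full command line.
+
+```bash
+# Sample command lines (hypothetical values) standing in for `ps` output.
+cmdlines='kubectl port-forward -n ns-a svc/vc-a 8443:443
+kubectl port-forward -n ns-b svc/vc-b 9443:443'
+
+# Count how many sample command lines a pkill -f style pattern would match.
+count_matches() {
+  printf '%s\n' "$cmdlines" | grep -cE -- "$1"
+}
+
+count_matches "kubectl port-forward"                             # broad pattern: prints 2
+count_matches "kubectl port-forward -n ns-a svc/vc-a 8443:443"   # precise pattern: prints 1
+```
+
+On a real workstation, `pgrep -f "<pattern>"` lists the PIDs the same pattern would hit, which gives a cheap dry run before the pattern is wired into `pkill -f`. Because the match runs against the full command line, any flag placed between `kubectl` and `port-forward` at launch would keep the precise pattern from matching.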
+
+### Rule D - the monitor's own liveness must be visible too
+
+```bash
+log() {
+  printf '%s pf-keepalive: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$LOG_FILE"
+}
+log "monitor starting; interval=${INTERVAL}s fail_threshold=${FAIL_THRESHOLD}"
+log "probe rc!=0 (consec=${CONSEC_FAIL})"
+log "stale connection: killing existing port-forward processes"
+log "starting fresh port-forward"
+log "probe rc=0 after ${prev_fail} prior fails; channel recovered"
+```
+
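+Write one line to the log file on every state change (monitor start / probe failure / kill + restart / recovery). At closeout the log can answer:
+
+- How many times was a restart triggered?
+- What was the probe rc before and after each trigger?
+- Was any runtime phase (T1 / postReady / readback / cleanup) interrupted?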
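+
+The first two closeout questions can be answered mechanically from the log. The following is a rough sketch that assumes the line format produced by the `log()` examples in this rule (UTC timestamp first, then `pf-keepalive:`), with one `starting fresh port-forward` line per restart; the helper names `count_restarts` and `restart_times` are hypothetical.
+
+```bash
+# Count restarts: each restart is assumed to log exactly one
+# "starting fresh port-forward" line. grep -c prints 0 and exits non-zero when
+# nothing matches, so append `|| true` when calling this under `set -e`.
+count_restarts() {
+  grep -c 'pf-keepalive: starting fresh port-forward' "$1"
+}
+
+# Print the timestamp (first whitespace-separated field) of each restart, so
+# the restart time windows can be compared against runner phase boundaries.
+restart_times() {
+  awk '/pf-keepalive: starting fresh port-forward/ { print $1 }' "$1"
+}
+
+# Usage (path is an example):
+#   count_restarts /tmp/pf-keepalive-ob-vcluster.log
+#   restart_times  /tmp/pf-keepalive-ob-vcluster.log
+```
+
+Whether a phase was actually interrupted still needs a manual comparison of these timestamps with the runner's own phase log.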
+
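+### Rule E - the monitor counts **only** as **runner-harness channel hardening**, **not** as evidence of a product / addon / KB / vcluster fix
+
+The keepalive section of the closeout template must state explicitly:
+- Monitor start/stop PID + uptime.
+- Number of restarts triggered + probe state before and after each one (verbatim from log).
+- Whether runner phases were affected (which of T1 / T-PITR.4 / postReady / readback / cleanup overlapped a restart time window).
+- "The monitor is a workstation-side stop-gap; it does **not** change product / vcluster / control-plane behavior and does **not** count as evidence for PITR / addon / release evaluation."
+
+### Rule F - when the structural fix is feasible (runner moved into host k8s + in-cluster DNS), the keepalive is retired
+
+See `addon-runner-incluster-vcluster-access-pattern-guide.md`. For new tests or a new porting round, **prefer** the structural approach; do not rely on this doc's stop-gap long term. This doc is kept only for the transition period and as a historical record.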
+
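+## 3. PR review checklist
+
+Five items:
+
+1. Does the `pkill -f` pattern precisely match the port-forward managed by this monitor? (No collateral kills.)
+2. Are `INTERVAL` and `FAIL_THRESHOLD` separately tunable, and is neither set to an extreme value?
+3. Does restart follow `kill old process + sleep + nohup new process`, rather than `kill -9` + immediate restart?
+4. Are logs written line by line to a file, so that closeout can look up restart counts and time windows?
+5. Does the closeout template list a separate "keepalive behavior section", evaluated **independently** of the runner results?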
+
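+## 4. Counter-examples vs. good examples
+
+### Counter-example 1 (no monitor; the runner is left to hit the wall on its own)
+
+The port-forward on the workstation dies, and the runner's T1 wait only reports a budget timeout after 1800s, even though the host-side cluster had long been healthy during that time.
+
+Fix: the monitor script in §2 of this doc.
+
+### Counter-example 2 (business-request retry mixed into the monitor)
+
+```bash
+# Counter-example: the monitor retries on behalf of the runner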
+if ! kubectl get cluster ...; then
+  sleep 5
+  kubectl get cluster ...
+fi
+```
+
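+Business retries belong in the runner; the monitor only checks `/version` health.
+
+### Counter-example 3 (no logging)
+
+After a restart is triggered, closeout has no way to reconstruct what happened forensically. Fix: Rule D in §2 of this doc.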
+
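+## 5. Relationship to other skills
+
+- `addon-runner-incluster-vcluster-access-pattern-guide.md` - the **structural fix**: moving the runner into the host k8s with in-cluster service DNS removes this whole class of workstation port-forward risk at the root. This doc is a transitional stop-gap; for new scenarios, **prefer** that guide.
+- `addon-runner-openapi-schema-fetch-brittleness-guide.md` - disabling client-side validation with `--validate=false` complements the keepalive: the former addresses schema-fetch flakiness, the latter addresses death of the underlying TCP stream; the two are **orthogonal**.
+- `addon-test-runner-cadence-discipline-guide.md` - the monitor is an extension of the operator's obligation (the cadence trigger is independent of the runner process).
+- `addon-evidence-discipline-guide.md` - a single-sample keepalive trigger does not mean "the workstation channel is systemically unstable"; do not extrapolate.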
+
+## Appendix A - OceanBase enterprise addon case study
+
+### Case A.1 - N=3 attempt hits port-forward staleness during the T1 wait
+
+The OceanBase enterprise addon PITR runtime test, run on a workstation Mac (RUN_ID `pitr-runtime-runner-hardening-N3-...`), hit the T1 1800s budget timeout; the post-mortem found:
+
+- Within the budget the runner kept reading `phase=` (empty); from the workstation's point of view the vcluster API was completely unreachable.
+- After the T1 budget failure, 5/5 `/version` probes also failed.
+- Killing the old port-forward process (PID 64734, alive but with a dead TCP stream) and starting a new one (PID 86308) gave 5/5 PASS immediately.
+- Seen directly from the host k8s: vcluster pod `ob-vcluster-0` Running for 4h+, OB pods `ob-pitr-rh-n3-1-oceanbase-0/1` Running for 43-44m, KB controller Running, nodes Ready. **The remote side never went down.**
+
+Per the first-blocker classification in §2 and §3 of this doc, this is **runner-harness channel / workstation pf-staleness**; it is **not** a product RED, **not** a vcluster failure, and **not** an addon bug.
+
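+### Case A.2 - keepalive stop-gap deployed
+
+Fix: write `pf-keepalive-ob-vcluster.sh` (`interval=20s, fail_threshold=2`, precise pkill match, new process under nohup, logging to `/tmp/pf-keepalive-ob-vcluster.log`) and run it alongside the PITR runner.
+
+Observations (in the single sample RUN_ID `pitr-runtime-pf-keepalive-N3-...`):
+- Monitor PID 87376, alive for the whole run (ELAPSED 47m21s).
+- 3 restarts triggered: 05:33:26Z / 05:34:13Z / 05:38:51Z.
+- The probe rc before and after each trigger is fully recorded in `/tmp/pf-keepalive-ob-vcluster.log`.
+- T1 PASS (**the 1800s budget wall from the N=3 attempt did not recur**), T-PITR.4 PASS, postReady Completed, readback `1,2,3,4`, cleanup clean in all three states.
+
+**This single-sample observation only shows** that the keepalive successfully covered the failure in that one sample. It must **not** be extrapolated to "keepalive permanently eliminates pf-staleness risk".
+
+### Case A.3 - later switch to the structural fix; keepalive retired
+
+The OceanBase enterprise addon later switched to the approach in `addon-runner-incluster-vcluster-access-pattern-guide.md` (runner pod inside the idc4 host k8s + in-cluster service DNS); the workstation-side port-forward is no longer on the critical path, and the keepalive is no longer started. This case is kept as a record of the transition period.
+
+Boundary statement: the 1 sample with the keepalive deployed is not extrapolated; the 3 samples after the switch to the structural fix are likewise not extrapolated to "workstation networking never affects the runner".
+
+The appendix ends here. For other engines or other stop-gap keepalive scenarios, start a new appendix.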