From d4597531f60c8fc6d55607e348cf8a3105f22461 Mon Sep 17 00:00:00 2001
From: Wei Cao
Date: Wed, 13 May 2026 20:38:56 +0800
Subject: [PATCH] docs: add runner OpenAPI schema fetch brittleness + --validate=false guide

Methodology body covers:
- Why kubectl apply default client-side OpenAPI schema fetch is a brittle
  path that hangs / cancels on unstable remote API entry
- 5 hard rules: server-side admission must NOT be relaxed; --validate=false
  only on identified brittle apply points (never global alias); change is
  runner-harness hardening only (NOT product / addon / KB fix evidence);
  3-class apply failure classification (server admission / RBAC / network);
  consolidate brittle apply points into helper for single grep coverage
- patch-gate 3 indicators
- 5-point PR review checklist
- 2 anti-pattern vs correct-pattern pairs

Appendix A is OceanBase enterprise addon T-PITR.4 apply hit Client.Timeout
on OpenAPI body read case (RUN_ID pitr-runtime-postreadyfix-N2-...) +
runner-side helper landing on tests/pitr.sh sha 6edf74fc -> f4b3375e +
later structural fix via addon-runner-incluster-vcluster-access-pattern.

Explicit boundary: helper landing only proves apply succeeded in 1 sample;
NOT release-ready evidence.
---
 docs/SKILL-INDEX.md | 1 +
 ...-openapi-schema-fetch-brittleness-guide.md | 222 ++++++++++++++++++
 2 files changed, 223 insertions(+)
 create mode 100644 docs/addon-runner-openapi-schema-fetch-brittleness-guide.md

diff --git a/docs/SKILL-INDEX.md b/docs/SKILL-INDEX.md
index 9576711..7a9a8b8 100644
--- a/docs/SKILL-INDEX.md
+++ b/docs/SKILL-INDEX.md
@@ -91,6 +91,7 @@
 - [`addon-test-runner-portability-guide.md`](addon-test-runner-portability-guide.md): 8 cross-platform compatibility pitfalls: 7 under macOS bash 3.2 + `set -euo pipefail`, plus 1 GNU `seq` vs BSD `seq` zero-count pitfall (empty arrays, env-default timing, `local x=$(cmd)`, `seq 1 0`, etc.)
 - [`addon-test-runner-cadence-discipline-guide.md`](addon-test-runner-cadence-discipline-guide.md): fixed-cadence reporting discipline during long-running runner operations; cadence is the operator's obligation, and the trigger must be independent of the runner process
 - [`addon-test-runner-write-after-bounded-role-gate-guide.md`](addon-test-runner-write-after-bounded-role-gate-guide.md): every runner write step must come after a bounded, multi-view role-publish gate; once the gate passes, the downstream primary must be re-anchored from the gate; the gate's write probe may run only on the K8s role-label primary, and a secondary must not write to replicated tables (to avoid an internal admin user bypassing read_only and then hitting 1062); includes two cases, alpha.58 preload-after-gate and alpha.59 secondary write probe
+- [`addon-runner-openapi-schema-fetch-brittleness-guide.md`](addon-runner-openapi-schema-fetch-brittleness-guide.md): by default `kubectl apply` first fetches the OpenAPI schema for client-side validation, and remote API jitter can cancel the apply while it reads the body; use a `--validate=false` helper (not a global alias) only at brittle apply points to turn off client-side validation, while server-side admission stays in effect; the change counts as runner-harness channel hardening and is **not** evidence of product / addon / KB stability; 3-way classification of apply failures (server admission / RBAC / network transient); includes the OceanBase enterprise addon T-PITR.4 case of a transient OpenAPI fetch failure + helper landing (motivation only, not extrapolated to release-ready)
 
 ### 6. Collaboration / GitHub commit discipline
 
diff --git a/docs/addon-runner-openapi-schema-fetch-brittleness-guide.md b/docs/addon-runner-openapi-schema-fetch-brittleness-guide.md
new file mode 100644
index 0000000..180dea1
--- /dev/null
+++ b/docs/addon-runner-openapi-schema-fetch-brittleness-guide.md
@@ -0,0 +1,222 @@
+# Addon Test Runner: `kubectl apply` Client-Side OpenAPI Fetch Brittleness and Safe `--validate=false` Use
+
+> **Audience**: addon test engineer / TL / runner harness owner
+> **Status**: stable methodology
+> **Applies to**: any test harness that issues `kubectl apply -f -` (or `kubectl apply -f `) against a remote / unstable k8s API entry where the client may hang on OpenAPI schema download.
+> **Applies to KB version**: any.
+
+Category: methodology topic document (not tied to a single engine).
+
+## 1. The problem this guide addresses
+
+By default, `kubectl apply` first fetches an **OpenAPI schema** from the API server and uses it for client-side validation before it applies anything. This path carries two hidden risks:
+
+1. **The OpenAPI schema fetch travels over its own HTTP body stream.** When the remote API is slow or flaky, the fetch fails while waiting for a connection, awaiting headers, or partway through reading the body, with errors such as `Client.Timeout exceeded while awaiting headers` / `request canceled while waiting for connection` / `context deadline exceeded`. **The whole `kubectl apply` then fails**, even though the manifest is fine and the apiserver issued no admission rejection.
+2. **At that moment the test runner has no signal to tell apart** "the manifest is wrong" / "the server really rejected it" / "the client hit network jitter while fetching the schema". All three surface as `kubectl apply` exiting with rc != 0.
+
+Observed in practice (see Appendix A): during the T-PITR.4 stage of a PITR runtime test, `kubectl apply -f -` hit `failed to download openapi: ... net/http: request canceled (Client.Timeout or context cancellation while reading body)` and the OpsRequest was never created; a client retry then succeeded. **This was neither a manifest error nor a server rejection**: it was network jitter on the client validation path.
+
+**Goal**: take the brittle client-side schema fetch off the runner's critical path, **on the condition that the actual server-side admission validation is not relaxed**.
+
+> **Boundary framing**: this guide covers only the conditions for turning off `kubectl apply` client-side validation; it does not say "all validation can be turned off". Server-side admission (CR field constraints, webhooks, defaulting) must remain in effect.
+
+## 2. Hard rules
+
+### Rule A: Server-side admission must not be relaxed; client-side validation may be disabled on brittle paths
+
+```
+client side: kubectl pull OpenAPI -> validate manifest fields against schema
+                 ↓ pass
+server side: apiserver admission -> CR validation webhook -> defaulting -> persist
+```
+
+`--validate=false` turns off only the **first layer** above (client side). The second layer (server side) is decided by the apiserver and cannot be turned off from the client.
+
+**Two conditions that together are sufficient to judge it "safe to disable client validate"**:
+
+1. The CR kind in the applied manifest **has server-side validation on the cluster** (the CRD `openAPIV3Schema` is installed, or an admission webhook is active).
+2. The manifest is **generated by the runner from its own template** (not hand-written by a user), so errors can only come from a runner template bug, not from typos or wrong field names.
+
+When both conditions hold, the semantic loss from disabling client validation is ≈ 0, and the stability gain is the elimination of the whole class of transient OpenAPI fetch failures.
+
+### Rule B: Disable client validation only for apply calls on brittle paths, never globally
+
+```bash
+# Anti-pattern: global alias
+alias kubectl='kubectl --validate=false'
+```
+
+```bash
+# OK: use only at identified brittle apply points
+kubectl apply --validate=false -f - < # grep -c 'addon_apply_brittle' tests/*.sh
+brittle_apply_raw_kubectl= # grep -cE 'kubectl apply ' tests/*.sh - minus the calls that go through the helper; expect 0 or a known list
+brittle_apply_uses_validate_false= # the helper contains --validate=false internally
+```
+
+The patch-gate proves only that the helper is in place and how much it covers; it does **not** prove product / addon stability.
+
+## 4. PR review checklist
+
+Five items:
+
+1. Is `--validate=false` used only inside the helper / at identified brittle apply points, not via a global alias?
+2. Does the same PR state that server-side validation is still in effect (webhook / CRD schema exists)?
+3. Are apply failures classified by exit code / reason string (server admission / RBAC / network) rather than lumped under a single `exit 1`?
+4. Does the closeout template treat this change as runner-harness work (not a product fix)?
+5. Is the earlier sample that "once hit an OpenAPI fetch failure" recorded separately as the motivation for this change (not extrapolated, not used as release evidence)?
+
+## 5. Anti-patterns vs correct patterns
+
+### Anti-pattern 1 (global alias)
+
+```bash
+alias kubectl='kubectl --validate=false'
+./tests/all.sh
+```
+
+No apply gets client-side validation any more, so a future hand-written manifest with a wrong field can only be caught by a server rejection, which raises debugging cost.
+
+### Correct pattern (helper + local use)
+
+```bash
+# lib/common.sh
+addon_apply_brittle() {
+  kubectl apply --validate=false -f - "$@"
+}
+
+# tests/some-test.sh
+addon_apply_brittle <