Skip to content

docs: add cluster Running gate budget sizing guide#123

Open
weicao wants to merge 1 commit into
mainfrom
feature/addon-cluster-running-gate-budget
Open

docs: add cluster Running gate budget sizing guide#123
weicao wants to merge 1 commit into
mainfrom
feature/addon-cluster-running-gate-budget

Conversation

@weicao
Copy link
Copy Markdown
Contributor

@weicao weicao commented May 13, 2026

Summary

New methodology doc addon-cluster-running-gate-budget-guide.md covering how to size test-runner T1 wait_cluster_running budget based on env P99 (not product normal start time), and how to classify the failure layer when the budget is exceeded.

Body (generic methodology, version-agnostic, no engine binding):

  • §1: Anti-patterns (using product normal time / using arbitrary huge number)
  • §2: 5 hard rules — env-variable budget, P99+50% default from env sample distribution, 4 observation outputs on timeout, generic exit reason for upper-layer classification, 3-level budget nesting (caller > helper > kubectl --request-timeout)
  • §2 Rule C: 7-row layer classification hint table
  • §2 Rule D: "exceeded budget is NOT product RED until classified"
  • §3: 6-point PR review checklist
  • §4: 2 anti-pattern vs correct-pattern pairs
  • §5: relation to addon-bounded-eventual-convergence-guide.md, addon-postready-bounded-timeout-failure-classification-guide.md, addon-test-acceptance-and-first-blocker-guide.md, addon-evidence-discipline-guide.md

Appendix A is OceanBase enterprise addon postreadyfix3 (1200s false-positive due to env-slow-start cold pull) → postreadyfix4 (1800s default after env P99+50% calibration) case. Explicit boundary: 1800s default only calibrated for idc4 vcluster + apecloud registry + apelocal-hostpath SC scope; other env must independently calibrate. 6 subsequent samples T1 <= 4min, NOT product validation / NOT release-ready.

SKILL-INDEX.md updated: added entry under ### 5. 改造 runner / 工具链.

Test plan

  • Manual: methodology body has zero OB-specific commands
  • Manual: appendix has explicit boundary
  • Manual: cross-doc references resolve

🤖 Generated with Claude Code

Methodology body covers:
- Why test-runner T1 budget must be sized by env P99 (not product normal)
- 5 hard rules: env-variable budget, P99+50% default, 4 observation outputs
  on timeout, generic exit reason for upper-layer classification, 3-level
  budget nesting (caller > helper > kubectl --request-timeout)
- 7-row layer classification hint table (env scheduler / env image-pull /
  env storage / env node-pressure / control-plane / product / k8s API entry)
- 6-point PR review checklist
- 2 anti-pattern pairs

Appendix A is OceanBase enterprise addon postreadyfix3 (1200s false-positive)
vs postreadyfix4 (1800s default after env P99+50% calibration) case with
explicit "do not extrapolate to other env" boundary; 6 subsequent samples
under same env scope had T1 <= 4min, NOT product validation.
@weicao
Copy link
Copy Markdown
Contributor Author

weicao commented May 13, 2026

Blocking for merge:

  1. PR body still contains 🤖 Generated with [Claude Code].... Public PR body must not include AI/tool attribution.
  2. The new guide intro has only 4 standard fields. Please add > **Affected by version skew**: ... after Applies to KB version.
  3. This PR links to addon-postready-bounded-timeout-failure-classification-guide.md from PR docs: add postReady bounded timeout + failure classification guide #122, which is not on main yet. Either merge/rebase after docs: add postReady bounded timeout + failure classification guide #122 lands, or mark the cross-link as pending and keep this PR blocked until the target exists.

Content framing is otherwise in the right shape: budget by env distribution, classify before product RED, and appendix scope stays narrow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant