Skip to content

docs(controller): add root-cause vs fail-fast fix direction guide#240

Closed
weicao wants to merge 1 commit into
mainfrom
docs/controller-root-cause-vs-fail-fast-fix-direction
Closed

docs(controller): add root-cause vs fail-fast fix direction guide#240
weicao wants to merge 1 commit into
mainfrom
docs/controller-root-cause-vs-fail-fast-fix-direction

Conversation

@weicao
Copy link
Copy Markdown
Contributor

@weicao weicao commented May 19, 2026

Problem

Reviewing controller bug fixes this week exposed a recurring failure mode: a PR adds an early fail-fast check at the entry that originally hit the bug, the symptom disappears in that one scenario, but the root cause in the controller state-machine remains. Any other entry that writes the same illegal state later reproduces the bug.

Field example: PR #10254 v1 (27fa72416) added schema validation in pkg/operations/reconfigure.go so the Reconfigure Ops entry rejects an invalid request before writing ComponentParameter.Spec.Desired. That stops the Valkey false-success scene from that one entry, but nothing prevents another writer (Cluster API, controller reconcile, or a future ops type) from putting the same illegal value into Desired and reaching the same false-success state. Three rounds of westonnnn pushback later, v3 (08a7c6482) moved the protection into the ComponentParameter controller's own processing of Spec.Desired, where the protection covers every entry.

If we do not write down this lesson, the team will keep accepting fail-fast PRs as "fix" when they are actually short-term workarounds; the root-cause work then gets pushed off indefinitely.

Scope

New short guide docs/controller/addon-controller-root-cause-vs-fail-fast-fix-direction-guide.md (91 lines, one topic per file per addon-docs-writing).

Why

Two hard rules need to land in the team workflow:

  1. For any controller bug fix, the design phase must answer "which controller owns the illegal state, and can the protection live in that controller's processing chain"; if yes, the protection must be there (root cause). Single-entry fail-fast is only a short-term workaround when (a) the root-cause fix needs a long refactor and (b) a real user is already hitting it.
  2. The reviewer's first question on any controller bug fix PR must be "is this fail-fast at one entry, or fail-closed in the chain that owns the state". If the PR is fail-fast only and does not record a long-term root-cause direction in its body, the reviewer must block it.

The guide expresses both as decision tree, anti-pattern table, and a concrete case appendix from PR #10254 v1-v6.

Verification

  • addon-docs-writing 3Q self-check passed (jargon removed, 60-second readability, first body paragraph states the problem).
  • Length 91 lines, well under the 150-line methodology limit.
  • Case appendix uses real PR / commit SHAs; design boundary shift recorded against westonnnn's five pushback messages.
  • Cross-references to companion controller-dev guides (addon-controller-patch-identity-preservation-guide.md, addon-controller-pr-author-motivation-and-review-channel-guide.md, addon-controller-diagnostic-pr-scope-guide.md) without rule duplication.

Related

@weicao
Copy link
Copy Markdown
Contributor Author

weicao commented May 19, 2026

Superseded by Lily — coordination race; @lily takes single-owner per team consensus (msg 631df426). Draft content moved via DM. Reviewers will land on @lily's PR.

@weicao weicao closed this May 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant