feat(config): add on_failure enum for rollout continuation#317
Conversation
There was a problem hiding this comment.
Pull request overview
This PR updates SchemaBot’s configuration, storage schema, and claim predicate to replace the (previously unshipped) halt_on_failure boolean with an on_failure enum (halt | continue | pause, defaulting to halt). It also keeps halt_on_failure as a deprecated alias that normalizes at config load, aiming to settle the durable persisted shape before the feature ships.
Changes:
- Introduces
storage.OnFailure{Halt,Continue,Pause}constants and persistsApplyOperation.OnFailureas a string enum instead ofHaltOnFailure *bool. - Updates MySQL storage schema and store logic to use
apply_operations.on_failure(including the sibling-gating exemption logic). - Adds config loading normalization + validation for
on_failure, including rejectingpauseand rejecting configs that set bothon_failureandhalt_on_failure.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/storage/types.go | Adds on-failure enum constants and replaces ApplyOperation.HaltOnFailure with ApplyOperation.OnFailure. |
| pkg/storage/mysqlstore/apply_operations.go | Writes/reads on_failure, defaults empty to halt, and updates the sibling-gate exemption to key on on_failure = 'continue'. |
| pkg/storage/mysqlstore/apply_operations_test.go | Updates claim-loop tests to use and assert persistence of on_failure. |
| pkg/schema/mysql/apply_operations.sql | Changes the apply_operations table definition to store on_failure varchar(16) NOT NULL DEFAULT 'halt'. |
| pkg/api/plan_handlers.go | Switches apply-operation creation to resolve and store on_failure. |
| pkg/api/config.go | Adds EnvironmentConfig.OnFailure, normalizes deprecated halt_on_failure, validates allowed values, and updates the resolver to OnFailure(...). |
| pkg/api/config_test.go | Updates and expands tests for the new resolver, alias normalization, and validation rules. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
Introduce on_failure: halt|continue|pause (default halt) as the per-environment rollout-continuation policy, before the knob ships. Behaviour matches the prior unshipped boolean for halt/continue; pause is validated-rejected until its release machinery lands.
9067642 to
c3dd8b5
Compare
|
Reviewed for correctness and fail-closed behavior. The enum lands cleanly:
Verdict: LGTM with the above nits. 🤖 This review was generated by Claude Code (claude-fable-5) with maintainer approval. |
aparajon
left a comment
There was a problem hiding this comment.
Approving — fail-closed enum handling verified end to end. Nits in my earlier comment (stale presentation doc comment, predicate-level fail-closed test) are non-blocking.
🤖 Approval issued by Claude Code (claude-fable-5) with maintainer approval.
The on_failure column is additive; keeping the live halt_on_failure column means a rolling deploy never drops a column older pods still query. A follow-up removes it once every pod runs the on_failure code.
Thanks for the note:
|
…m predicate Only exact "continue" exempts a terminal-failed earlier sibling; any other stored value must keep it blocking. Pins the safety property even if config validation ever stops rejecting unknown values.
…enum * origin/main: (22 commits) refactor(operator): derive applies.state from all operation rows (#318) fix(operator): make PlanetScale setup-phase applies recoverable after crash (#302) test(cli): add onboard integration coverage (#332) fix(github): run all webhook-spawned goroutines with panic recovery (#329) fix(github): retain locks and check state when a PR closes with an in-flight apply (#326) fix(github): support PR comment start command (#324) feat(cli): add `onboard` and `pull` commands (#323) feat(api): expose routing tern client (#322) feat(tern): add routing client (#316) feat(api): add live schema pull primitive (#313) fix(operator): stop conflict check from failing stopped or retryable tasks (#271) fix(operator): derive apply state from tasks when stop finds no active work (#262) fix(github): fail closed on staging-first gate when target env is unlisted (#301) fix(operator): stop the engine for all active states before recording stop (#269) fix(config): validate revert_window_duration and token references (#275) fix(storage): make rows-affected checks correct under production changed-rows semantics (#307) fix(operator): write apply_operations row when creating applies via the Tern client (#268) fix(github): enforce actor authorization for rollback and unlock commands (#264) bump spirit (#321) fix(api): reject null schema_files namespaces instead of panicking (#278) ...
#317 replaced the apply_operations halt_on_failure *bool with the on_failure string enum. Resolve the presentation HaltOnFailure flag from on_failure at the storage boundary, failing closed to halting for any value other than "continue".
Patch 2 of the on_failure migration. #317 moved all code to the on_failure enum and kept halt_on_failure additively so in-flight old pods were unaffected; nothing reads the column now, so remove it. Must deploy only after the on_failure code is fully rolled out.
* feat(github): render multi-deployment apply comment from presentation model MD-10b: aggregate header (state, per-status counts, single next-action), an at-a-glance per-deployment summary, and a <details> per deployment that reuses the single-deployment renderer for full per-table fidelity. Dormant until callers dispatch on operation-row count; single-deployment UX unchanged. * fix(github): render executable next-action commands and escape multi-deployment summaries Address review on the multi-deployment apply comment: suggest only the apply-id command forms today's CLI accepts (cutover/start) and the retry form for a failed apply, since revert applies only to a deployment in its post-cutover window and no --deployment flag exists yet. HTML-escape config-derived deployment names/labels, and render a placeholder for a deployment with no detail. Document that ApplyID/Environment are always set so the renderer need not guard them. * feat(github): dispatch apply comment on deployment count MD-10c: map apply_operations to the presentation model and route the status comment to the single- or multi-deployment renderer by operation count. Dormant until the observer passes operations; single-deployment and legacy applies render byte-for-byte as today. * fix(github): map on_failure enum to the presentation halt flag #317 replaced the apply_operations halt_on_failure *bool with the on_failure string enum. Resolve the presentation HaltOnFailure flag from on_failure at the storage boundary, failing closed to halting for any value other than "continue".
What and Why?
Introduces an
on_failureenum (halt | continue | pause, defaulthalt) as the per-environment rollout-continuation policy, replacing thehalt_on_failureboolean knob. Behaviour matches the boolean for the two existing values;pauseis parsed but validation-rejected until its release machinery lands.on_failure: halt -> failed sibling blocks the rest of the rollout (default)
on_failure: continue -> terminal-failed sibling is exempted; rollout continues
on_failure: pause -> rejected in validation (follow-up)
EnvironmentConfig.OnFailure+ServerConfig.OnFailure(database, environment)resolver, defaulting tohalt.deploymentsmap; rejects unknown values andpause.on_failure varchar(16) NOT NULL DEFAULT 'halt'(mirrorscutover_policy); picked up byEnsureSchema.FindNextApplyOperationexempts a terminal-failedearlier sibling only whenon_failure = 'continue'; every other value — including unrecognized ones — fails closed and keeps blocking.Dormant until the multi-deployment fan-out is wired (one operation per apply today), so the field, column, and predicate are no-ops at runtime regardless of value.
Schema rollout safety (two-step)
apply_operations.halt_on_failurealready ships in the live schema, andEnsureSchemareconciles a renamed column by drop+add, not rename. Dropping it in the same change that addson_failurewould break still-running old pods mid-deploy, since they still queryhalt_on_failure. So this PR is the additive step — it addson_failureand keepshalt_on_failure; a follow-up removes the old column once every pod runs theon_failurecode.Patch 1 (this PR): add on_failure, keep halt_on_failure -> old pods unaffected
Patch 2 (follow-up): drop halt_on_failure -> no code references it
The new
on_failurecolumn is additive (defaulthalt), so no existing data is touched.Roadmap / upcoming PRs
This PR (the
on_failureenum surface) is merged. The runtime continuation behaviour it gates lands in the B-track continuation-projection follow-ups; cross-referenced below so the enum surface and its consumers stay correlated.on_failureenum surface (config/storage/predicate), additive columnapplies.stateover allapply_operationsrows ("Option B")continueprojection — hold apply active until siblings settle (B2)stophalts remaining siblings undercontinue(B3)halt_on_failurecolumn (patch 2 above)on_failure: pauseas a release gateon_failureinto dependency-graph edges (remove runtime knob)