feat(config): add on_failure enum for rollout continuation #317

Merged
Kiran01bm merged 4 commits into main from kiran01bm/on-failure-enum
Jun 14, 2026

@Kiran01bm Kiran01bm commented Jun 12, 2026

Collaborator

What and Why?

Introduces an on_failure enum (halt | continue | pause, default halt) as the per-environment rollout-continuation policy, replacing the halt_on_failure boolean knob. Behaviour matches the boolean for the two existing values; pause is parsed but validation-rejected until its release machinery lands.

on_failure: halt -> failed sibling blocks the rest of the rollout (default)
on_failure: continue -> terminal-failed sibling is exempted; rollout continues
on_failure: pause -> rejected in validation (follow-up)
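
A minimal sketch of how the three values above could be parsed and validated at config load. The type name `OnFailure`, the constants, and `validateOnFailure` are illustrative assumptions, not the identifiers this PR necessarily uses; the only behaviour taken from the description is "default halt, accept continue, reject pause and unknown values".

```go
package main

import (
	"errors"
	"fmt"
)

// OnFailure is a hypothetical string enum for the rollout-continuation policy.
type OnFailure string

const (
	OnFailureHalt     OnFailure = "halt"
	OnFailureContinue OnFailure = "continue"
	OnFailurePause    OnFailure = "pause"
)

// errPauseUnsupported is returned while pause is parsed but not yet usable.
var errPauseUnsupported = errors.New("on_failure: pause is not supported yet")

// validateOnFailure defaults an empty value to halt, accepts halt and
// continue, and rejects pause and any unknown value.
func validateOnFailure(raw string) (OnFailure, error) {
	switch OnFailure(raw) {
	case "":
		return OnFailureHalt, nil
	case OnFailureHalt, OnFailureContinue:
		return OnFailure(raw), nil
	case OnFailurePause:
		return "", errPauseUnsupported
	default:
		return "", fmt.Errorf("on_failure: unknown value %q", raw)
	}
}

func main() {
	for _, raw := range []string{"", "halt", "continue", "pause", "stop"} {
		v, err := validateOnFailure(raw)
		fmt.Printf("%q -> %q, err=%v\n", raw, v, err)
	}
}
```

Treating pause as a distinct, named error (rather than folding it into "unknown") keeps the value reserved, so enabling it later is a validation change and not a parser change.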

  • Config: EnvironmentConfig.OnFailure + ServerConfig.OnFailure(database, environment) resolver, defaulting to halt.
  • Validation: requires a deployments map; rejects unknown values and pause.
  • Storage: adds per-row column on_failure varchar(16) NOT NULL DEFAULT 'halt' (mirrors cutover_policy); picked up by EnsureSchema.
  • Claim predicate: FindNextApplyOperation exempts a terminal-failed earlier sibling only when on_failure = 'continue'; every other value — including unrecognized ones — fails closed and keeps blocking.
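
The fail-closed gating in the claim predicate can be illustrated with an in-memory model. This is a sketch under assumptions: the real predicate lives in SQL inside FindNextApplyOperation, and the `op` struct, the state names, and `nextClaimable` below are hypothetical stand-ins chosen only to show that the exact value "continue" is the sole exemption.

```go
package main

import "fmt"

// op is a hypothetical in-memory stand-in for an apply_operations row.
type op struct {
	Seq       int    // position within the rollout
	State     string // "pending", "completed", or "failed" (illustrative states)
	OnFailure string // value stored in the on_failure column
}

// blocksLaterSiblings reports whether an earlier sibling still gates later ones.
func blocksLaterSiblings(o op) bool {
	switch o.State {
	case "completed":
		return false
	case "failed":
		// Only the exact value "continue" lifts the gate; halt, pause,
		// empty, and unrecognized values all keep blocking.
		return o.OnFailure != "continue"
	default:
		return true // not yet terminal
	}
}

// nextClaimable returns the first pending op that no earlier sibling blocks.
// It assumes ops is already sorted by Seq.
func nextClaimable(ops []op) (op, bool) {
	for i, o := range ops {
		if o.State != "pending" {
			continue
		}
		blocked := false
		for _, earlier := range ops[:i] {
			if blocksLaterSiblings(earlier) {
				blocked = true
				break
			}
		}
		if !blocked {
			return o, true
		}
	}
	return op{}, false
}

func main() {
	for _, policy := range []string{"halt", "continue", "pause"} {
		ops := []op{{1, "failed", policy}, {2, "pending", policy}}
		_, ok := nextClaimable(ops)
		fmt.Printf("on_failure=%s claimable=%v\n", policy, ok)
	}
}
```

Writing the check as "not equal to continue" rather than "equal to halt" is what makes unrecognized values fail closed.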

Dormant until the multi-deployment fan-out is wired (one operation per apply today), so the field, column, and predicate are no-ops at runtime regardless of value.

Schema rollout safety (two-step)

apply_operations.halt_on_failure already ships in the live schema, and EnsureSchema reconciles a renamed column by drop+add, not rename. Dropping it in the same change that adds on_failure would break still-running old pods mid-deploy, since they still query halt_on_failure. So this PR is the additive step — it adds on_failure and keeps halt_on_failure; a follow-up removes the old column once every pod runs the on_failure code.

Patch 1 (this PR): add on_failure, keep halt_on_failure -> old pods unaffected
Patch 2 (follow-up): drop halt_on_failure -> no code references it

The new on_failure column is additive (default halt), so no existing data is touched.
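
The ordering constraint behind the two-patch split can be stated as a small check. This is purely illustrative: the PR describes no automated gate, and the `pod` struct and `safeToDropLegacy` are hypothetical names used only to express "Patch 2 is safe once no running pod still queries halt_on_failure".

```go
package main

import "fmt"

// pod is a hypothetical view of one running instance during a rolling deploy.
type pod struct {
	Name                string
	QueriesLegacyColumn bool // true while the pod's code still reads halt_on_failure
}

// safeToDropLegacy reports whether dropping halt_on_failure would be safe:
// only when no running pod still queries the old column.
func safeToDropLegacy(pods []pod) bool {
	for _, p := range pods {
		if p.QueriesLegacyColumn {
			return false
		}
	}
	return true
}

func main() {
	midDeploy := []pod{{"old-1", true}, {"new-1", false}}
	fullyRolled := []pod{{"new-1", false}, {"new-2", false}}
	fmt.Println("mid-deploy:", safeToDropLegacy(midDeploy))
	fmt.Println("fully rolled:", safeToDropLegacy(fullyRolled))
}
```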

Roadmap / upcoming PRs

This PR (the on_failure enum surface) is merged. The runtime continuation behaviour it gates lands in the B-track continuation-projection follow-ups; cross-referenced below so the enum surface and its consumers stay correlated.

| Upcoming change | Status | PR |
|---|---|---|
| on_failure enum surface (config/storage/predicate), additive column | ✅ Merged | #317 |
| Aggregate applies.state over all apply_operations rows ("Option B") | ✅ Merged | #318 |
| Policy-aware continue projection: hold apply active until siblings settle (B2) | Next | TBD |
| Eager failure surfacing + stop halts remaining siblings under continue (B3) | Planned | TBD |
| Drop halt_on_failure column (patch 2 above) | Planned | TBD |
| on_failure: pause as a release gate | Planned (depends on B2 + pause/release machinery) | TBD |
| Compile on_failure into dependency-graph edges (remove runtime knob) | Future (blocked on north-star Phases 4–5) | TBD |

Copilot AI review requested due to automatic review settings June 12, 2026 04:02

Copilot AI left a comment

Pull request overview

This PR updates SchemaBot’s configuration, storage schema, and claim predicate to replace the (previously unshipped) halt_on_failure boolean with an on_failure enum (halt | continue | pause, defaulting to halt). It also keeps halt_on_failure as a deprecated alias that normalizes at config load, aiming to settle the durable persisted shape before the feature ships.

Changes:

  • Introduces storage.OnFailure{Halt,Continue,Pause} constants and persists ApplyOperation.OnFailure as a string enum instead of HaltOnFailure *bool.
  • Updates MySQL storage schema and store logic to use apply_operations.on_failure (including the sibling-gating exemption logic).
  • Adds config loading normalization + validation for on_failure, including rejecting pause and rejecting configs that set both on_failure and halt_on_failure.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

| File | Description |
|---|---|
| pkg/storage/types.go | Adds on-failure enum constants and replaces ApplyOperation.HaltOnFailure with ApplyOperation.OnFailure. |
| pkg/storage/mysqlstore/apply_operations.go | Writes/reads on_failure, defaults empty to halt, and updates the sibling-gate exemption to key on on_failure = 'continue'. |
| pkg/storage/mysqlstore/apply_operations_test.go | Updates claim-loop tests to use and assert persistence of on_failure. |
| pkg/schema/mysql/apply_operations.sql | Changes the apply_operations table definition to store on_failure varchar(16) NOT NULL DEFAULT 'halt'. |
| pkg/api/plan_handlers.go | Switches apply-operation creation to resolve and store on_failure. |
| pkg/api/config.go | Adds EnvironmentConfig.OnFailure, normalizes deprecated halt_on_failure, validates allowed values, and updates the resolver to OnFailure(...). |
| pkg/api/config_test.go | Updates and expands tests for the new resolver, alias normalization, and validation rules. |

@Kiran01bm Kiran01bm marked this pull request as ready for review June 12, 2026 04:12
@Kiran01bm Kiran01bm requested review from aparajon and morgo as code owners June 12, 2026 04:12

Introduce on_failure: halt|continue|pause (default halt) as the per-environment
rollout-continuation policy, before the knob ships. Behaviour matches the prior
unshipped boolean for halt/continue; pause is validated-rejected until its
release machinery lands.
@Kiran01bm Kiran01bm force-pushed the kiran01bm/on-failure-enum branch from 9067642 to c3dd8b5 Compare June 12, 2026 12:03
@Kiran01bm Kiran01bm changed the title feat(config): replace halt_on_failure boolean with on_failure enum feat(config): add on_failure enum for rollout continuation Jun 12, 2026
@aparajon

Collaborator

Reviewed for correctness and fail-closed behavior. The enum lands cleanly: pause parsed but validation-rejected, default halt enforced at resolver, insert fallback, and column default; the claim predicate exempts only exact on_failure = 'continue' with a failed earlier sibling, so unknown values fail closed. Three small items:

  1. pkg/presentation/presentation.go:42: Operation.HaltOnFailure's doc comment still describes dereferencing the stored *bool, which this PR deletes. Suggest updating the comment, or carrying the on_failure enum through the presentation input instead of a bool, so a future pause value doesn't need a second field re-plumb. (Note this field is also touched by feat(github): dispatch apply comment on deployment count #320; whichever merges second needs to absorb the change.)

  2. Fail-closed proof at the predicate — could we add one claim-predicate test where a row carries an unrecognized on_failure value (e.g. pause) and assert the failed sibling still blocks? Config validation prevents it reaching storage today, but a storage-level test pins the safety property if validation ever changes.

  3. PR body nit — the column replacement relies on EnsureSchema drop+add; worth a one-line mention that any environment that ran the prior column loses its values (fine since it never shipped, but good to state).

Verdict: LGTM with the above nits.


🤖 This review was generated by Claude Code (claude-fable-5) with maintainer approval.

@aparajon aparajon left a comment

Collaborator

Approving — fail-closed enum handling verified end to end. Nits in my earlier comment (stale presentation doc comment, predicate-level fail-closed test) are non-blocking.

🤖 Approval issued by Claude Code (claude-fable-5) with maintainer approval.

The on_failure column is additive; keeping the live halt_on_failure
column means a rolling deploy never drops a column older pods still
query. A follow-up removes it once every pod runs the on_failure code.
@Kiran01bm

Collaborator Author
  • pkg/presentation/presentation.go:42: Operation.HaltOnFailure's doc comment still describes dereferencing the stored *bool, which this PR deletes. Suggest updating the comment, or carrying the on_failure enum through the presentation input instead of a bool, so a future pause value doesn't need a second field re-plumb. (Note this field is also touched by feat(github): dispatch apply comment on deployment count #320; whichever merges second needs to absorb the change.)
  • Fail-closed proof at the predicate — could we add one claim-predicate test where a row carries an unrecognized on_failure value (e.g. pause) and assert the failed sibling still blocks? Config validation prevents it reaching storage today, but a storage-level test pins the safety property if validation ever changes.
  • PR body nit — the column replacement relies on EnsureSchema drop+add; worth a one-line mention that any environment that ran the prior column loses its values (fine since it never shipped, but good to state).

Thanks for the note:

  1. presentation.go:42 doc comment → not actionable in feat(config): add on_failure enum for rollout continuation #317. pkg/presentation/ doesn't exist on this branch and nothing here references HaltOnFailure; that field lives in feat(github): dispatch apply comment on deployment count #320. Since #317 removes the stored *bool, the stale doc comment is #320's fix to make when it rebases. Nothing to change here.
  2. fail-closed proof at the predicate → done ✅. Added UnrecognizedOnFailureBlocksFailedSibling, which stores pause (an unrecognized value) on the failed sibling and asserts the rollout stays blocked, pinning the safety property even if config validation changes.
  3. PR body nit → with the two-patch split, feat(config): add on_failure enum for rollout continuation #317 no longer drops the column. on_failure is purely additive, so no data is lost. The drop is deferred to Patch 2.

…m predicate

Only exact "continue" exempts a terminal-failed earlier sibling; any
other stored value must keep it blocking. Pins the safety property even
if config validation ever stops rejecting unknown values.
…enum

* origin/main: (22 commits)
  refactor(operator): derive applies.state from all operation rows (#318)
  fix(operator): make PlanetScale setup-phase applies recoverable after crash (#302)
  test(cli): add onboard integration coverage (#332)
  fix(github): run all webhook-spawned goroutines with panic recovery (#329)
  fix(github): retain locks and check state when a PR closes with an in-flight apply (#326)
  fix(github): support PR comment start command (#324)
  feat(cli): add `onboard` and `pull` commands (#323)
  feat(api): expose routing tern client (#322)
  feat(tern): add routing client (#316)
  feat(api): add live schema pull primitive (#313)
  fix(operator): stop conflict check from failing stopped or retryable tasks (#271)
  fix(operator): derive apply state from tasks when stop finds no active work (#262)
  fix(github): fail closed on staging-first gate when target env is unlisted (#301)
  fix(operator): stop the engine for all active states before recording stop (#269)
  fix(config): validate revert_window_duration and token references (#275)
  fix(storage): make rows-affected checks correct under production changed-rows semantics (#307)
  fix(operator): write apply_operations row when creating applies via the Tern client (#268)
  fix(github): enforce actor authorization for rollback and unlock commands (#264)
  bump spirit (#321)
  fix(api): reject null schema_files namespaces instead of panicking (#278)
  ...
@Kiran01bm Kiran01bm merged commit c9c9a9d into main Jun 14, 2026
27 checks passed
@Kiran01bm Kiran01bm deleted the kiran01bm/on-failure-enum branch June 14, 2026 01:05
Kiran01bm added a commit that referenced this pull request Jun 14, 2026
#317 replaced the apply_operations halt_on_failure *bool with the
on_failure string enum. Resolve the presentation HaltOnFailure flag from
on_failure at the storage boundary, failing closed to halting for any
value other than "continue".
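
A minimal sketch of the mapping this commit message describes, assuming a plain string input; the function name `haltOnFailure` is illustrative and not taken from the codebase.

```go
package main

import "fmt"

// haltOnFailure derives a presentation-level boolean from the stored
// on_failure string. Every value except the exact string "continue"
// resolves to halting, so empty or unrecognized values fail closed.
func haltOnFailure(onFailure string) bool {
	return onFailure != "continue"
}

func main() {
	for _, v := range []string{"halt", "continue", "pause", ""} {
		fmt.Printf("%q -> halt=%v\n", v, haltOnFailure(v))
	}
}
```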
Kiran01bm added a commit that referenced this pull request Jun 14, 2026
Patch 2 of the on_failure migration. #317 moved all code to the
on_failure enum and kept halt_on_failure additively so in-flight old
pods were unaffected; nothing reads the column now, so remove it.
Must deploy only after the on_failure code is fully rolled out.
Kiran01bm added a commit that referenced this pull request Jun 14, 2026
* feat(github): render multi-deployment apply comment from presentation model

MD-10b: aggregate header (state, per-status counts, single next-action),
an at-a-glance per-deployment summary, and a <details> per deployment that
reuses the single-deployment renderer for full per-table fidelity. Dormant
until callers dispatch on operation-row count; single-deployment UX unchanged.

* fix(github): render executable next-action commands and escape multi-deployment summaries

Address review on the multi-deployment apply comment: suggest only the
apply-id command forms today's CLI accepts (cutover/start) and the retry
form for a failed apply, since revert applies only to a deployment in its
post-cutover window and no --deployment flag exists yet. HTML-escape
config-derived deployment names/labels, and render a placeholder for a
deployment with no detail. Document that ApplyID/Environment are always
set so the renderer need not guard them.

* feat(github): dispatch apply comment on deployment count

MD-10c: map apply_operations to the presentation model and route the
status comment to the single- or multi-deployment renderer by operation
count. Dormant until the observer passes operations; single-deployment
and legacy applies render byte-for-byte as today.

* fix(github): map on_failure enum to the presentation halt flag

#317 replaced the apply_operations halt_on_failure *bool with the
on_failure string enum. Resolve the presentation HaltOnFailure flag from
on_failure at the storage boundary, failing closed to halting for any
value other than "continue".
3 participants