feat(operator): project apply state with on_failure continue policy#337
Conversation
Under on_failure "continue" a terminally failed deployment no longer terminalizes the parent apply while siblings are still in flight: the apply is held running until the rollout settles, then takes the failed verdict. Every other policy fails closed and terminalizes immediately.
There was a problem hiding this comment.
Pull request overview
This PR makes parent applies.state derivation policy-aware for multi-operation rollouts by introducing a new rollout projection that honors on_failure: continue (hold the apply active until all sibling operations settle, then fail), while failing closed for all other policies.
Changes:
- Add
state.RolloutChildandstate.DeriveRolloutApplyState, building on the existingDeriveApplyStatebut modulating the failed case underon_failure: continue. - Update both aggregate apply-state writers (operator and local client) to use the new rollout-aware projection.
- Add unit tests covering the new projection and its integration into operator/local client projections.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/state/apply.go | Adds RolloutChild and DeriveRolloutApplyState to compute policy-aware apply state across operations. |
| pkg/state/apply_test.go | Adds truth-table tests for DeriveRolloutApplyState. |
| pkg/tern/local_apply_grouped.go | Switches aggregate apply-state derivation to DeriveRolloutApplyState and includes policy documentation. |
| pkg/tern/local_client_test.go | Adds tests ensuring continue/default policy behavior in local aggregate derivation. |
| pkg/api/operator.go | Switches operator aggregate derivation to DeriveRolloutApplyState and updates docs. |
| pkg/api/operator_test.go | Adds tests verifying operator-side projection under continue/default policies. |
Comments suppressed due to low confidence (1)
pkg/api/operator.go:761
- The terminal-to-non-terminal guard will trip in the exact scenario this PR introduces: a parent apply row may already be in a terminal state (failed) from a per-operation drive, but DeriveRolloutApplyState can correctly project it back to running while sibling operations are still in flight under on_failure: continue. Returning an error here would prevent the apply from being held active and effectively re-introduce the premature terminalization the PR is trying to eliminate.
If the intent is to only forbid reviving other terminal states, consider allowing failed → running when the derived state is running (the only non-terminal state DeriveRolloutApplyState returns from a failed base case).
if state.IsTerminalApplyState(apply.State) && !state.IsTerminalApplyState(derived) {
return fmt.Errorf("derive apply state for terminal apply %s (%d): child operations derive non-terminal state %q from parent state %q",
apply.ApplyIdentifier, apply.ID, derived, apply.State)
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
What and Why?
Makes the parent apply's state projection policy-aware so a single deployment's
terminal failure no longer terminalizes the whole apply under
on_failure: continue.The apply is held active until every sibling deployment settles, then takes the
failed verdict. Every other policy (
halt,pause, unrecognized) fails closed andterminalizes immediately — only the exact value
continueis continuable, matchingthe claim predicate.
continuegoverns rollout continuation, never the apply's pass/fail verdict: acontinuable failure still settles to
failedonce the rollout is done.Changes
DeriveRolloutApplyState+RolloutChildtopkg/state; it builds onDeriveApplyStateand only modulates the failed case.updateApplyStateFromOperationsand the LocalClient'sderiveAggregateApplyState.ContinueOnFailureis computed at each call site from the operation's storedpolicy, keeping
pkg/statefree of a storage dependency.No behavioural change for single-operation applies (one failed op is already
terminal), so this is dormant until the multi-deployment fan-out wires more than
one operation per apply.
Roadmap / upcoming PRs
This PR is the B2 projection slice. The continuation track is sequenced so each PR
is a small, reviewable step; the
on_failureenum surface (#317) is the policy input itconsumes.
applies.stateas the aggregate over allapply_operationsrows ("Option B")continueprojection (hold apply active until siblings settle)running_degradedactive-with-known-failure stateDeferred to B3 (fan-out-gated)
This PR makes the projection policy-aware; the operation-state persistence and the
apply-level terminal side-effects still treat one operation's engine result as
authoritative for the whole apply. Both are correct for single-op (so this PR is a safe
no-op today) and only need fixing once more than one operation per apply exists.
continuemarkOperationFromApplyStaterunning→ mirror takes the "leave claimable" branch → failed op never storedfailed; sibling gate reads wrongrunning→ stampscompleted_at, metrics, observers, stops polling earlyfailed → runningupdateApplyStateFromOperationsfailed → runningwhen the projection holds active