feat(operator): project apply state with on_failure continue policy#337

Merged: Kiran01bm merged 1 commit into main from kiran01bm/rollout-continue-projection on Jun 14, 2026
@Kiran01bm Kiran01bm commented Jun 14, 2026

What and Why?

Makes the parent apply's state projection policy-aware so a single deployment's
terminal failure no longer terminalizes the whole apply under on_failure: continue.
The apply is held active until every sibling deployment settles, then takes the
failed verdict. Every other policy (halt, pause, unrecognized) fails closed and
terminalizes immediately — only the exact value continue is continuable, matching
the claim predicate.

continue governs rollout continuation, never the apply's pass/fail verdict: a
continuable failure still settles to failed once the rollout is done.

child operations ──▶ DeriveRolloutApplyState
  base = DeriveApplyState(states)
  base != failed                       ─▶ base
  failed + any non-continuable child   ─▶ failed   (fail closed)
  failed + all children terminal       ─▶ failed   (verdict)
  failed + continuable + sibling live  ─▶ running   (hold active)
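
The decision table above can be sketched in Go roughly as follows. This is a minimal illustration, not the merged implementation: the `State` constants, the `isTerminal` helper, and the simplified `deriveApplyState` stand-in are all assumptions, and the sketch reads "any non-continuable child" as any child whose `ContinueOnFailure` is false (the real function may inspect only failed children).

```go
package main

import "fmt"

// State is a stand-in for the apply/operation state type in pkg/state.
// The constant names and values below are assumptions for this sketch.
type State string

const (
	Pending   State = "pending"
	Running   State = "running"
	Completed State = "completed"
	Failed    State = "failed"
)

// RolloutChild pairs one child operation's state with whether its stored
// on_failure policy is exactly "continue".
type RolloutChild struct {
	State             State
	ContinueOnFailure bool
}

func isTerminal(s State) bool { return s == Completed || s == Failed }

// deriveApplyState is a simplified, hypothetical stand-in for DeriveApplyState:
// any failed child wins, then all-completed, then any running child.
func deriveApplyState(states []State) State {
	allCompleted := len(states) > 0
	anyRunning := false
	for _, s := range states {
		switch s {
		case Failed:
			return Failed
		case Running:
			anyRunning = true
		}
		if s != Completed {
			allCompleted = false
		}
	}
	switch {
	case allCompleted:
		return Completed
	case anyRunning:
		return Running
	default:
		return Pending
	}
}

// DeriveRolloutApplyState only modulates the failed case of the base derivation.
func DeriveRolloutApplyState(children []RolloutChild) State {
	states := make([]State, 0, len(children))
	for _, c := range children {
		states = append(states, c.State)
	}
	base := deriveApplyState(states)
	if base != Failed {
		return base
	}
	allTerminal := true
	for _, c := range children {
		if !c.ContinueOnFailure {
			return Failed // fail closed: a non-continuable child terminalizes
		}
		if !isTerminal(c.State) {
			allTerminal = false
		}
	}
	if allTerminal {
		return Failed // rollout settled: take the failed verdict
	}
	return Running // continuable failure with a live sibling: hold active
}

func main() {
	held := []RolloutChild{{Failed, true}, {Running, true}}
	settled := []RolloutChild{{Failed, true}, {Completed, true}}
	fmt.Println(DeriveRolloutApplyState(held), DeriveRolloutApplyState(settled))
}
```

Keeping the policy as a plain boolean on each child means the projection itself never has to parse or look up a policy value.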

Changes

  • Add DeriveRolloutApplyState + RolloutChild to pkg/state; it builds on
    DeriveApplyState and only modulates the failed case.
  • Route both apply-state writers through it: the operator's
    updateApplyStateFromOperations and the LocalClient's deriveAggregateApplyState.
  • ContinueOnFailure is computed at each call site from the operation's stored
    policy, keeping pkg/state free of a storage dependency.

No behavioural change for single-operation applies (one failed op is already
terminal), so this is dormant until the multi-deployment fan-out wires more than
one operation per apply.
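
As a rough sketch of the call-site mapping described in the last bullet, the snippet below turns stored operation rows into projection inputs. The `operationRow` type, its field names, and `toRolloutChildren` are hypothetical names for illustration; only the rule that the exact value `continue` is continuable comes from the description above.

```go
package main

import "fmt"

// operationRow is a hypothetical stand-in for a stored apply_operations row.
type operationRow struct {
	State     string
	OnFailure string // stored on_failure policy; may be empty or unrecognized
}

// rolloutChild mirrors the shape the projection consumes in this sketch.
type rolloutChild struct {
	State             string
	ContinueOnFailure bool
}

// onFailureContinue is the only policy value treated as continuable.
const onFailureContinue = "continue"

// toRolloutChildren computes ContinueOnFailure at the call site, so the
// projection package needs no access to storage types.
func toRolloutChildren(ops []operationRow) []rolloutChild {
	children := make([]rolloutChild, 0, len(ops))
	for _, op := range ops {
		children = append(children, rolloutChild{
			State: op.State,
			// Exact match only: halt, pause, empty, and unknown values fail closed.
			ContinueOnFailure: op.OnFailure == onFailureContinue,
		})
	}
	return children
}

func main() {
	ops := []operationRow{
		{State: "failed", OnFailure: "continue"},
		{State: "running", OnFailure: "halt"},
		{State: "running", OnFailure: ""},
	}
	for _, c := range toRolloutChildren(ops) {
		fmt.Printf("%s continuable=%t\n", c.State, c.ContinueOnFailure)
	}
}
```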

Roadmap / upcoming PRs

This PR is the B2 projection slice. The continuation track is sequenced so each PR
is a small, reviewable step; the on_failure enum surface (#317) is the policy input it
consumes.

| Upcoming change | Status | PR |
| --- | --- | --- |
| B1: derive applies.state as the aggregate over all apply_operations rows ("Option B") | ✅ Merged | #318 |
| B2: policy-aware continue projection (hold apply active until siblings settle) | This PR | #337 |
| B3: eager failure surfacing + atomic write (see follow-ups below) | Next | TBD |
| B4: running_degraded active-with-known-failure state | Future | TBD |

Deferred to B3 (fan-out-gated)

This PR makes the projection policy-aware; the operation-state persistence and the
apply-level terminal side-effects still treat one operation's engine result as
authoritative for the whole apply. Both are correct for single-op (so this PR is a safe
no-op today) and only need fixing once more than one operation per apply exists.

| Issue | Where | Fan-out behaviour under continue | Fix |
| --- | --- | --- | --- |
| Operation failure not persisted | operator markOperationFromApplyState | parent held running → mirror takes the "leave claimable" branch → failed op never stored failed; sibling gate reads wrong | persist the operation's own drive result first, then re-derive the parent from operation rows |
| Terminal side-effects off per-op result | LocalClient progress finalize | engine result terminal while projection is running → stamps completed_at, metrics, observers, stops polling early | gate apply-level terminal side-effects on the projected apply state, not the per-operation engine result |
| Terminal guard rejects failed → running | updateApplyStateFromOperations guard | errors on the legitimate hold-active transition | relax to allow failed → running when the projection holds active |
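
The second row's fix could look roughly like the sketch below, which finalizes the apply only when the projected state is terminal. Everything here is hypothetical (`applyRecord`, `recordProgress`, the `Finalized` flag standing in for completed_at, metrics, and observers, and the assumed set of terminal states); it only illustrates gating on the projection rather than on the per-operation engine result.

```go
package main

import "fmt"

// applyRecord is a hypothetical stand-in for the apply row. Finalized stands in
// for the terminal side-effects (completed_at, metrics, observers, stop polling).
type applyRecord struct {
	State     string
	Finalized bool
}

// isTerminal assumes "completed" and "failed" are the terminal states.
func isTerminal(s string) bool { return s == "completed" || s == "failed" }

// recordProgress stores the projected state and finalizes only when the
// projection is terminal. engineResult is accepted to show that it is
// deliberately not consulted for apply-level side-effects.
func recordProgress(a *applyRecord, engineResult, projected string) {
	a.State = projected
	if isTerminal(projected) {
		a.Finalized = true
	}
}

func main() {
	a := &applyRecord{}
	recordProgress(a, "failed", "running") // one op failed, apply held active
	fmt.Println(a.State, a.Finalized)
	recordProgress(a, "completed", "failed") // last sibling settled
	fmt.Println(a.State, a.Finalized)
}
```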

Under on_failure "continue" a terminally failed deployment no longer
terminalizes the parent apply while siblings are still in flight: the
apply is held running until the rollout settles, then takes the failed
verdict. Every other policy fails closed and terminalizes immediately.
Copilot AI review requested due to automatic review settings June 14, 2026 03:21

Copilot AI left a comment

Pull request overview

This PR makes parent applies.state derivation policy-aware for multi-operation rollouts by introducing a new rollout projection that honors on_failure: continue (hold the apply active until all sibling operations settle, then fail), while failing closed for all other policies.

Changes:

  • Add state.RolloutChild and state.DeriveRolloutApplyState, building on the existing DeriveApplyState but modulating the failed case under on_failure: continue.
  • Update both aggregate apply-state writers (operator and local client) to use the new rollout-aware projection.
  • Add unit tests covering the new projection and its integration into operator/local client projections.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| pkg/state/apply.go | Adds RolloutChild and DeriveRolloutApplyState to compute policy-aware apply state across operations. |
| pkg/state/apply_test.go | Adds truth-table tests for DeriveRolloutApplyState. |
| pkg/tern/local_apply_grouped.go | Switches aggregate apply-state derivation to DeriveRolloutApplyState and includes policy documentation. |
| pkg/tern/local_client_test.go | Adds tests ensuring continue/default policy behavior in local aggregate derivation. |
| pkg/api/operator.go | Switches operator aggregate derivation to DeriveRolloutApplyState and updates docs. |
| pkg/api/operator_test.go | Adds tests verifying operator-side projection under continue/default policies. |

Comments suppressed due to low confidence (1)

pkg/api/operator.go:761

  • The terminal-to-non-terminal guard will trip in the exact scenario this PR introduces: a parent apply row may already be in a terminal state (failed) from a per-operation drive, but DeriveRolloutApplyState can correctly project it back to running while sibling operations are still in flight under on_failure: continue. Returning an error here would prevent the apply from being held active and effectively re-introduce the premature terminalization the PR is trying to eliminate.

If the intent is to only forbid reviving other terminal states, consider allowing failed → running when the derived state is running (the only non-terminal state DeriveRolloutApplyState returns from a failed base case).

	if state.IsTerminalApplyState(apply.State) && !state.IsTerminalApplyState(derived) {
		return fmt.Errorf("derive apply state for terminal apply %s (%d): child operations derive non-terminal state %q from parent state %q",
			apply.ApplyIdentifier, apply.ID, derived, apply.State)
	}
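
One possible shape for the relaxed guard suggested above is sketched below. It is not the code in pkg/api/operator.go: `checkApplyTransition`, the string-typed states, and the assumed set of terminal states are illustrative only. The single exception it allows is failed → running, which is the hold-active transition.

```go
package main

import "fmt"

// isTerminalApplyState assumes "completed" and "failed" are terminal;
// the real state.IsTerminalApplyState may cover more states.
func isTerminalApplyState(s string) bool { return s == "completed" || s == "failed" }

// checkApplyTransition rejects reviving a terminal apply, except for the
// failed -> running transition used to hold the apply active under continue.
func checkApplyTransition(current, derived string) error {
	if !isTerminalApplyState(current) || isTerminalApplyState(derived) {
		return nil // not a terminal-to-non-terminal transition
	}
	if current == "failed" && derived == "running" {
		return nil // legitimate hold-active projection
	}
	return fmt.Errorf("child operations derive non-terminal state %q from parent state %q",
		derived, current)
}

func main() {
	fmt.Println(checkApplyTransition("failed", "running"))
	fmt.Println(checkApplyTransition("completed", "running"))
}
```

A guard this narrow still blocks every other revival of a terminal apply, so a completed apply can never be projected back to running.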

Comment thread pkg/api/operator.go
Comment thread pkg/tern/local_apply_grouped.go
@Kiran01bm Kiran01bm marked this pull request as ready for review June 14, 2026 03:58
@Kiran01bm Kiran01bm requested review from aparajon and morgo as code owners June 14, 2026 03:58
@Kiran01bm Kiran01bm merged commit f19cb40 into main Jun 14, 2026
28 checks passed
@Kiran01bm Kiran01bm deleted the kiran01bm/rollout-continue-projection branch June 14, 2026 22:14