Skip to content

fix(tern): finalize applies on the rollout-projected state, not per-op engine result#347

Open
Kiran01bm wants to merge 2 commits into
mainfrom
kiran01bm/rollout-continue-terminal-side-effects
Open

fix(tern): finalize applies on the rollout-projected state, not per-op engine result#347
Kiran01bm wants to merge 2 commits into
mainfrom
kiran01bm/rollout-continue-terminal-side-effects

Conversation

@Kiran01bm

Copy link
Copy Markdown
Collaborator

What

Apply-level terminal side-effects — stamping completed_at, dropping the
active-applies metric, completing pending stop requests, tearing down observers,
and stopping polling — now gate on the rollout-projected apply state
(deriveAggregateApplyState over all operation rows) rather than the current
operation's engine result. Extracted into a small applyQuiesceDecision helper
used by both the grouped driver and the progress writer.

Why

Under on_failure=continue, one operation can reach a terminal engine result
(e.g. failed) while the rollout projection legitimately holds the apply running
because siblings are still in flight. Keying finalization off the per-operation
result would prematurely stamp completed_at on a running apply, double-drop the
active-applies metric, tear down observers, and stop polling before siblings
finish. Gating on the projection fixes that and keeps a completed last
operation from terminalizing a failed-projected parent.

This is a no-op for single-operation applies: the projection collapses to the
current operation's derived state, so today's behavior is unchanged. It only
takes effect once an apply owns more than one operation.

Stacking / sequencing

Branches off main (its tern files are disjoint from the operation-state-first
persistence work in #343), but it pairs with that change and should land after
it. The next slice — making the projected apply-state writes compare-and-swap so
concurrent sibling drives stop last-write-wins clobbering — stacks on this branch.

…p engine result

Apply-level terminal side-effects (completed_at, active-applies metric, pending
stop completion, observer teardown, stop-polling) now gate on the rollout
projection over all operation rows instead of the current operation's engine
result, so an on_failure=continue rollout holds the apply active until siblings
settle. No-op for single-operation applies.
Copilot AI review requested due to automatic review settings June 15, 2026 05:08

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adjusts Tern’s apply finalization logic so apply-level terminal side-effects (e.g., stamping completed_at, stopping polling, observer teardown) are gated on the rollout-projected apply state derived across sibling operations, rather than the current operation’s engine result. This prevents premature “terminalization” under on_failure=continue when one sibling reaches a terminal engine state while others are still in flight.

Changes:

  • Switch progress polling and grouped driver quiescing decisions to use the rollout-projected apply state (deriveAggregateApplyState), via a shared applyQuiesceDecision helper.
  • Remove taskStateToApplyState and update documentation to reflect that apply state is derived by aggregation rather than single-task mapping.
  • Add unit coverage for applyQuiesceDecision.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
pkg/tern/state_converters.go Removes now-obsolete single-task → apply-state helper.
pkg/tern/local_client.go Updates Progress to terminalize/quiesce based on rollout projection and shared helper.
pkg/tern/local_apply_grouped.go Introduces applyQuiesceDecision and gates grouped driver side-effects on projected apply state.
pkg/tern/apply_states_test.go Adds unit test coverage for applyQuiesceDecision.
pkg/state/README.md Updates state-derivation documentation after removing taskStateToApplyState.
Comments suppressed due to low confidence (1)

pkg/tern/local_apply_grouped.go:677

  • When the projected apply state is STOPPED, completed_at should remain nil (STOPPED is resumable). This block currently stamps completed_at for all quiesced states except failed_retryable, which conflicts with markApplyStopped() and sequential finalization behavior.
		if retryableFailure {
			apply.CompletedAt = nil
		} else {
			apply.CompletedAt = &now
		}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/tern/local_client.go Outdated
Comment thread pkg/tern/local_client.go
Comment thread pkg/tern/local_apply_grouped.go Outdated
…re message from task rows

Stopped is terminal but resumable, so quiescing must not stamp completed_at.
Gate stamping on a dedicated resumable check. Under on_failure=continue the
rollout projection can resolve an apply to a failure via a sibling while the
finishing operation's engine result is non-failed, leaving the apply persisted
as failed with no message; fall back to the failed task rows in that case.
@Kiran01bm Kiran01bm marked this pull request as ready for review June 15, 2026 06:28
@Kiran01bm Kiran01bm requested review from aparajon and morgo as code owners June 15, 2026 06:28
@chatgpt-codex-connector

Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@aparajon

Copy link
Copy Markdown
Collaborator

I think this still needs one more split before merge: “the claimed operation is done” and “the aggregate apply should quiesce” are separate decisions.

In handleAtomicProgressTick, the current operation’s tasks can derive completed/failed while deriveAggregateApplyState correctly keeps the parent apply running because a sibling is still pending/running under on_failure: continue. Gating the return/stop-polling decision only on applyQuiesceDecision(apply.State) means the current operation drive can keep polling even though its own operation is done.

Suggested shape:

  • derive the current operation state from the scoped task set;
  • if the current operation is terminal / retry-paused, return from this operation drive so the operator can persist the claimed apply_operation row;
  • only run apply-level side effects (completed_at, active-apply metric, terminal observer, stop request completion) when the rollout-projected aggregate apply state quiesces.

The followup in #343 addresses the operator-side “mark operation from its own tasks” bridge, but it still depends on this drive returning when the current operation has settled. A focused test for “current op completed/failed, sibling pending, aggregate parent remains running” would cover the case.

This comment was generated by Amp.

@aparajon

Copy link
Copy Markdown
Collaborator

I’m also worried about the Progress path using the whole apply’s task set for aggregate derivation.

currentApplyTasks := filterTasksByApply(tasks, activeTask.ApplyID) can include tasks from multiple apply_operation_ids once an apply fans out. Passing that mixed set into deriveAggregateApplyState makes applyOperationIDForTasks fail; today that failure is treated the same as the legacy/no-operation-model case and falls back to task-only derivation. That can bypass sibling operation rows and on_failure: continue policy, so a progress poll could clobber the rollout-projected parent state from an ambiguous task set.

I think this path should derive the current operation from tasks scoped to the active task’s apply_operation_id, and deriveAggregateApplyState should distinguish “no operation model” from “mixed/missing operation IDs” so the latter fails closed instead of falling back to task-only derivation.

A focused test would be: one apply with two operation IDs, a failed continue sibling plus a pending/running sibling, then a Progress poll must not task-derive the parent to a terminal state or ignore the sibling operation rows.

This comment was generated by Amp.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants