refactor(operator): derive applies.state from all operation rows #318
The per-deployment drive sourced the aggregate applies.state from only its own task states, which would clobber sibling state once an apply owns more than one operation. Derive it as the rollout projection over the apply's operation rows instead, folding in the current deployment's state.
Pull request overview
This PR refactors how applies.state is derived so it reflects the aggregate rollout state across all apply_operations rows for an apply (with the current deployment’s freshly-derived per-operation state folded in), preventing one deployment’s drive loop from incorrectly “clobbering” the parent apply state once multi-deployment fan-out is enabled.
Changes:
- Introduce `LocalClient.deriveAggregateApplyState` to derive apply state from sibling `apply_operations` rows (plus current operation’s derived state).
- Update both LocalClient drive/progress sites to use the new aggregate derivation instead of task-only derivation.
- Add unit tests to validate behavior for single-op, pending-sibling, and list-error fallback scenarios.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| pkg/tern/local_client.go | Switch apply-state update during progress to use aggregate apply-operation projection. |
| pkg/tern/local_client_test.go | Add unit tests + lightweight store stub for aggregate apply-state derivation. |
| pkg/tern/local_apply_grouped.go | Add deriveAggregateApplyState helper and use it in grouped/atomic progress ticking. |
…ssing op deriveAggregateApplyState could panic when the apply operation store is not configured, and silently dropped the current deployment's state when its row was absent from the sibling set. Both now fall back to the current deployment's derived state, consistent with the read-error path.
The direction here is exactly right — both drive sites projecting
Minor: the list-then-write isn't atomic, so concurrent sibling drives will last-write-wins from possibly stale reads. It converges on the next poll, but a comment (or conditional update) before concurrent drives become real would help.

Since this PR already has an approval, suggest addressing these as new commits rather than a force-push.

🤖 This review was generated by Claude Code (claude-fable-5) with maintainer approval.
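To make the "conditional update" suggestion concrete, here is a minimal sketch of a compare-and-set write. Everything in it is hypothetical (`applyStore`, `compareAndSetState`, the in-memory map standing in for the `applies` table); it is not code from this PR, only an illustration of refusing a write when another drive changed the state after it was read.

```go
package main

import "sync"

// applyStore is a hypothetical in-memory stand-in for the applies table.
type applyStore struct {
	mu    sync.Mutex
	state map[int64]string
}

// compareAndSetState writes next only if the stored state still equals the
// value the caller read before projecting. Against a SQL store this would
// roughly correspond to an UPDATE guarded by "WHERE id = ? AND state = ?"
// followed by a rows-affected check (an assumption, not the PR's query).
func (s *applyStore) compareAndSetState(applyID int64, expected, next string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.state[applyID] != expected {
		// A sibling drive wrote first; skip the write and re-read next poll.
		return false
	}
	s.state[applyID] = next
	return true
}
```

A drive that loses the race simply skips its write, which matches the "converges on the next poll" behaviour described above while removing the window in which a stale read overwrites a newer value.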
A transient sibling-read failure on the last-finishing deployment must not let one operation's terminal derivation become the whole apply's state while other operations are still in flight. Degraded paths now refuse to terminalize and leave the stored state for the next poll.
for posterity - flow visualization for each issue
all of these are good findings - thank you - fix plan below: 1, 3 and 4 (one part of it) are done in this PR; 2 and the other part of 4 will be dealt with in the upcoming PR.
…-aggregate-from-operations

* origin/main: (21 commits)

- fix(operator): make PlanetScale setup-phase applies recoverable after crash (#302)
- test(cli): add onboard integration coverage (#332)
- fix(github): run all webhook-spawned goroutines with panic recovery (#329)
- fix(github): retain locks and check state when a PR closes with an in-flight apply (#326)
- fix(github): support PR comment start command (#324)
- feat(cli): add `onboard` and `pull` commands (#323)
- feat(api): expose routing tern client (#322)
- feat(tern): add routing client (#316)
- feat(api): add live schema pull primitive (#313)
- fix(operator): stop conflict check from failing stopped or retryable tasks (#271)
- fix(operator): derive apply state from tasks when stop finds no active work (#262)
- fix(github): fail closed on staging-first gate when target env is unlisted (#301)
- fix(operator): stop the engine for all active states before recording stop (#269)
- fix(config): validate revert_window_duration and token references (#275)
- fix(storage): make rows-affected checks correct under production changed-rows semantics (#307)
- fix(operator): write apply_operations row when creating applies via the Tern client (#268)
- fix(github): enforce actor authorization for rollback and unlock commands (#264)
- bump spirit (#321)
- fix(api): reject null schema_files namespaces instead of panicking (#278)
- fix(spirit): only infer namespace when a single schema namespace exists (#277)
- ...

Conflicts: pkg/tern/local_client_test.go
…s are in use The undetermined-aggregate guard fired on every degraded path, including applies that do not use the operation model (tasks without an apply_operation_id, or no operation store). Those have no siblings, so a terminal per-task derivation is authoritative; failing closed stranded single-writer/legacy applies at running. Fail closed only when the tasks carry an apply_operation_id but the sibling rows cannot be read.
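The narrowed guard described in this commit message can be sketched as a small predicate. The names here (`Task`, `usesOperationModel`, `mayWriteOnDegradedPath`) and the use of a zero ID to mean "no `apply_operation_id`" are assumptions made for illustration, not the project's actual types.

```go
package main

// Task is a hypothetical, trimmed-down task record. A zero ApplyOperationID
// stands for "no apply_operation_id" (legacy or single-writer applies).
type Task struct {
	ApplyOperationID int64
}

// usesOperationModel reports whether any task is tied to an operation row,
// which is the only case where unread siblings could exist.
func usesOperationModel(tasks []Task) bool {
	for _, t := range tasks {
		if t.ApplyOperationID != 0 {
			return true
		}
	}
	return false
}

// mayWriteOnDegradedPath decides whether a per-task derivation may be written
// to applies.state when the sibling rows could not be consulted.
func mayWriteOnDegradedPath(tasks []Task, derivedTerminal bool) bool {
	if !derivedTerminal {
		// Non-terminal fallbacks are always safe to write.
		return true
	}
	// A terminal derivation is authoritative only when no siblings can exist.
	return !usesOperationModel(tasks)
}
```

Under these assumptions a legacy apply can still reach a terminal state on the degraded path, while an apply whose tasks reference operation rows stays at its stored value until the siblings become readable.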
…enum

* origin/main: (22 commits)

- refactor(operator): derive applies.state from all operation rows (#318)
- fix(operator): make PlanetScale setup-phase applies recoverable after crash (#302)
- test(cli): add onboard integration coverage (#332)
- fix(github): run all webhook-spawned goroutines with panic recovery (#329)
- fix(github): retain locks and check state when a PR closes with an in-flight apply (#326)
- fix(github): support PR comment start command (#324)
- feat(cli): add `onboard` and `pull` commands (#323)
- feat(api): expose routing tern client (#322)
- feat(tern): add routing client (#316)
- feat(api): add live schema pull primitive (#313)
- fix(operator): stop conflict check from failing stopped or retryable tasks (#271)
- fix(operator): derive apply state from tasks when stop finds no active work (#262)
- fix(github): fail closed on staging-first gate when target env is unlisted (#301)
- fix(operator): stop the engine for all active states before recording stop (#269)
- fix(config): validate revert_window_duration and token references (#275)
- fix(storage): make rows-affected checks correct under production changed-rows semantics (#307)
- fix(operator): write apply_operations row when creating applies via the Tern client (#268)
- fix(github): enforce actor authorization for rollback and unlock commands (#264)
- bump spirit (#321)
- fix(api): reject null schema_files namespaces instead of panicking (#278)
- ...
What and Why?
The per-deployment drive sources the aggregate `applies.state` from only its own task states. With one operation per apply that is correct, but once an apply owns more than one operation, one deployment's drive would overwrite the rollout-level state from just its own tasks, e.g. marking the apply `completed` while a sibling is still `pending`.

This change derives `applies.state` as the rollout projection over all of the apply's operation rows, folding in the current deployment's freshly-derived state, via a new `deriveAggregateApplyState` helper used at both drive sites.

before: `apply.State = DeriveApplyState(taskStates(thisOpTasks))`

after: `apply.State = DeriveApplyState(state of every operation row, this op folded in)`

Behaviour-identical at one operation per apply (the sibling set is the single current op, and `DeriveApplyState` round-trips its own outputs). When the sibling rows cannot be read, the projection falls back to the current op's derived state only if that state is non-terminal; a terminal derivation on incomplete sibling information is refused (the helper reports the projection as undetermined and the caller leaves the stored value for the next poll to reconcile), so a transient read failure on one deployment cannot terminalize the whole apply while siblings are still in flight. Once the multi-deployment fan-out lands, a non-terminal sibling keeps the apply non-terminal, eliminating the clobber that would otherwise re-terminalize the parent mid-rollout.

Dormant until the fan-out is wired (one operation per apply today).
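The behaviour described above can be sketched roughly as follows. This is a simplified illustration, not the PR's implementation: the four-value `State` enum, `project` (a stand-in for `DeriveApplyState`, whose real rules are not shown here), `OperationRow`, `ErrSiblingRead`, and `aggregate` are all hypothetical names chosen for the example.

```go
package main

import "errors"

// State is a simplified stand-in for the project's state enum (assumption).
type State string

const (
	Pending   State = "pending"
	Running   State = "running"
	Completed State = "completed"
	Failed    State = "failed"
)

// terminal reports whether no further progress is expected from this state.
func (s State) terminal() bool { return s == Completed || s == Failed }

// ErrSiblingRead is a hypothetical error for an unreadable sibling set.
var ErrSiblingRead = errors.New("sibling operation rows unavailable")

// OperationRow is a hypothetical view of one apply_operations row.
type OperationRow struct {
	ID    int64
	State State
}

// project is a simplified rollout projection, not the real DeriveApplyState.
// It round-trips its own outputs: project([]State{x}) == x.
func project(states []State) State {
	allPending, allTerminal, anyFailed := true, true, false
	for _, s := range states {
		allPending = allPending && s == Pending
		allTerminal = allTerminal && s.terminal()
		anyFailed = anyFailed || s == Failed
	}
	switch {
	case allPending:
		return Pending
	case !allTerminal:
		return Running
	case anyFailed:
		return Failed
	default:
		return Completed
	}
}

// aggregate projects over all rows with the current op's fresh state folded
// in. The bool is false when the projection is undetermined; the caller then
// leaves the stored applies.state untouched for the next poll.
func aggregate(list func() ([]OperationRow, error), currentID int64, current State) (State, bool) {
	rows, err := list()
	if err != nil {
		// Degraded path: fall back only to a non-terminal state.
		return current, !current.terminal()
	}
	states := make([]State, 0, len(rows))
	found := false
	for _, r := range rows {
		if r.ID == currentID {
			r.State = current // prefer the freshly derived state
			found = true
		}
		states = append(states, r.State)
	}
	if !found {
		// Current op missing from the sibling set: also treated as degraded.
		return current, !current.terminal()
	}
	return project(states), true
}
```

With a single row the result equals the current op's state, which is the "behaviour-identical at one operation per apply" property; with a pending sibling, a `completed` current op projects to a non-terminal aggregate instead of terminalizing the apply.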
Roadmap / upcoming PRs
This is the apply-state / continuation projection track. Follow-ups, each independently reviewable:
- `applies.state` from all operation rows (no sibling clobber)
- `continue` projection: hold apply active until all siblings settle, incl. stamping `CompletedAt` from the rollout aggregate, not from a single finished op
- `stop` halts remaining siblings under `continue`, incl. surfacing `ErrorMessage` from the aggregate (not the last op) and a conditional/CAS `applies.state` write to drop last-write-wins on concurrent sibling drives
- `running_degraded` state for active-with-known-failure (fast follow)

Merge note
This branch merges `main`. The diff includes one unrelated line in `pkg/storage/mysqlstore/apply_operations_test.go`: a fix for a `FindNextApplyOperation` call that was missing its `owner` argument on `main` and broke the integration build. Pure test-arity fix; no production behavior change.