Skip to content

refactor(operator): derive applies.state from all operation rows#318

Merged
Kiran01bm merged 5 commits into
mainfrom
kiran01bm/apply-state-aggregate-from-operations
Jun 13, 2026
Merged

refactor(operator): derive applies.state from all operation rows#318
Kiran01bm merged 5 commits into
mainfrom
kiran01bm/apply-state-aggregate-from-operations

Conversation

@Kiran01bm

@Kiran01bm Kiran01bm commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

What and Why ?

The per-deployment drive sources the aggregate applies.state from only its own task states. With one operation per apply that is correct, but once an apply owns more than one operation, one deployment's drive would overwrite the rollout-level state from just its own tasks — e.g. marking the apply completed while a sibling is still pending.

This change derives applies.state as the rollout projection over all of the apply's operation rows, folding in the current deployment's freshly-derived state, via a new deriveAggregateApplyState helper used at both drive sites.

before: apply.State = DeriveApplyState(taskStates(thisOpTasks))
after: apply.State = DeriveApplyState(state of every operation row, this op folded in)

Behaviour-identical at one operation per apply (the sibling set is the single current op, and DeriveApplyState round-trips its own outputs). When the sibling rows cannot be read, the projection falls back to the current op's derived state only if that state is non-terminal; a terminal derivation on incomplete sibling information is refused (the helper reports the projection as undetermined and the caller leaves the stored value for the next poll to reconcile), so a transient read failure on one deployment cannot terminalize the whole apply while siblings are still in flight. Once the multi-deployment fan-out lands, a non-terminal sibling keeps the apply non-terminal, eliminating the clobber that would otherwise re-terminalize the parent mid-rollout.

Dormant until the fan-out is wired (one operation per apply today).

Roadmap / upcoming PRs

This is the apply-state / continuation projection track. Follow-ups, each independently reviewable:

Ref Upcoming change Status PR
B1 Derive applies.state from all operation rows (no sibling clobber) This PR #318
B2 continue projection: hold apply active until all siblings settle — incl. stamping CompletedAt from the rollout aggregate, not from a single finished op Planned TBD
B3 Eager failure surfacing + stop halts remaining siblings under continue — incl. surfacing ErrorMessage from the aggregate (not the last op) and a conditional/CAS applies.state write to drop last-write-wins on concurrent sibling drives Planned TBD
B4 running_degraded state for active-with-known-failure (fast follow) Future TBD

Merge note

This branch merges main. The diff includes one unrelated line in pkg/storage/mysqlstore/apply_operations_test.go — fixing a FindNextApplyOperation call that was missing its owner argument on main and broke the integration build. Pure test-arity fix; no production behavior change.

The per-deployment drive sourced the aggregate applies.state from only its
own task states, which would clobber sibling state once an apply owns more
than one operation. Derive it as the rollout projection over the apply's
operation rows instead, folding in the current deployment's state.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors how applies.state is derived so it reflects the aggregate rollout state across all apply_operations rows for an apply (with the current deployment’s freshly-derived per-operation state folded in), preventing one deployment’s drive loop from incorrectly “clobbering” the parent apply state once multi-deployment fan-out is enabled.

Changes:

  • Introduce LocalClient.deriveAggregateApplyState to derive apply state from sibling apply_operations rows (plus current operation’s derived state).
  • Update both LocalClient drive/progress sites to use the new aggregate derivation instead of task-only derivation.
  • Add unit tests to validate behavior for single-op, pending-sibling, and list-error fallback scenarios.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
pkg/tern/local_client.go Switch apply-state update during progress to use aggregate apply-operation projection.
pkg/tern/local_client_test.go Add unit tests + lightweight store stub for aggregate apply-state derivation.
pkg/tern/local_apply_grouped.go Add deriveAggregateApplyState helper and use it in grouped/atomic progress ticking.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/tern/local_apply_grouped.go Outdated
…ssing op

deriveAggregateApplyState could panic when the apply operation store is not
configured, and silently dropped the current deployment's state when its row
was absent from the sibling set. Both now fall back to the current
deployment's derived state, consistent with the read-error path.
@Kiran01bm Kiran01bm marked this pull request as ready for review June 12, 2026 12:08
@Kiran01bm Kiran01bm requested review from aparajon and morgo as code owners June 12, 2026 12:08
@chatgpt-codex-connector

Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@aparajon

Copy link
Copy Markdown
Collaborator

The direction here is exactly right — both drive sites projecting applies.state over the full sibling set is the correct inversion, and the fresh-reload terminal checks preserve "started applies remain authoritative." Three items before merge, one of them blocking-level for the fan-out future this PR exists for:

  1. deriveAggregateApplyState degraded paths can terminalize from one operation — every fallback (ListByApply error, nil store, empty ops, missing current row) derives from the current op alone, which can return a terminal state while sibling states are unknown. Once fan-out lands, a transient storage read failure on the last-finishing deployment marks the whole apply completed with siblings still pending — storage uncertainty converted into a passing aggregate. Suggest the fallback refuse to terminalize: either return an error so the caller skips the state write (keeping the stored value), or only fall back when the single-op derivation is non-terminal.

  2. CompletedAt/ErrorMessage still stamped from the current op's engine result — under fan-out, a finished op with pending siblings stamps CompletedAt on a running apply. Fine to defer, but worth tracking alongside this work.

  3. PR description — the "Ref" section links an internal repository. Per repo policy, internal references should be removed from public PR descriptions before merge.

Minor: the list-then-write isn't atomic, so concurrent sibling drives will last-write-wins from possibly stale reads. It converges on the next poll, but a comment (or conditional update) before concurrent drives become real would help.

Since this PR already has an approval, suggest addressing these as new commits rather than a force-push.


🤖 This review was generated by Claude Code (claude-fable-5) with maintainer approval.

A transient sibling-read failure on the last-finishing deployment must
not let one operation's terminal derivation become the whole apply's
state while other operations are still in flight. Degraded paths now
refuse to terminalize and leave the stored state for the next poll.
@Kiran01bm

Copy link
Copy Markdown
Collaborator Author

for posterity - flow visualization for each issue

Minor: the list-then-write isn't atomic, so concurrent sibling drives will last-write-wins from possibly stale reads. It converges on the next poll, but a comment (or conditional update) before concurrent drives become real would help.

list-then-write race (stale-read clobber)
       op A drive                         op B drive
           │                                    │
 t0  read ops {A:completed,                     │    
              B:completed}                      │
     derive "completed"                         │
     (snapshot held)                            │
           │                            t1  op B → failed
           │                                write applies.state = failed
           │                                    │
 t2  WRITE applies.state = completed ──────────-┘
     (using stale t0 snapshot)
           │
           ▼
   applies.state = completed   ◀── overwrites B's "failed"
           │
           ▼
   merge gate sampling now ──▶ 🟢 (failed rollout looks green)
           │
           ▼
   next poll: reader sees B:failed ──▶ converges, but clobber already happened

deriveAggregateApplyState degraded paths can terminalize from one operation — every fallback (ListByApply error, nil store, empty ops, missing current row) derives from the current op alone, which can return a terminal state while sibling states are unknown. Once fan-out lands, a transient storage read failure on the last-finishing deployment marks the whole apply completed with siblings still pending — storage uncertainty converted into a passing aggregate. Suggest the fallback refuse to terminalize: either return an error so the caller skips the state write (keeping the stored value), or only fall back when the single-op derivation is non-terminal.

degraded fallback terminalizes from one op

Apply #7 ── owns ──┬── op A (us-east)  task: copying ──▶ row = running
                   └── op B (us-west)  tasks: all done

  B's drive tick wants to update applies.state
                   │
                   ▼
        deriveAggregateApplyState
                   │
          ListByApply(apply) ──▶ ✗ transient timeout / deadlock
                   │
                   ▼
        FALLBACK: derive from B's tasks only ──▶ "completed"
                   │
                   ▼
   reload guard: stored state terminal?  ── no (it's "running") ──▶ WRITE
                   │
                   ▼
        applies.state = completed   ◀── but op A still copying
                   │
                   ▼
   check-run / merge gate reads "completed" ──▶ 🟢 false "all done"

@Kiran01bm

Kiran01bm commented Jun 13, 2026

Copy link
Copy Markdown
Collaborator Author
  1. deriveAggregateApplyState degraded paths can terminalize from one operation — every fallback (ListByApply error, nil store, empty ops, missing current row) derives from the current op alone, which can return a terminal state while sibling states are unknown. Once fan-out lands, a transient storage read failure on the last-finishing deployment marks the whole apply completed with siblings still pending — storage uncertainty converted into a passing aggregate. Suggest the fallback refuse to terminalize: either return an error so the caller skips the state write (keeping the stored value), or only fall back when the single-op derivation is non-terminal.

  2. CompletedAt/ErrorMessage still stamped from the current op's engine result — under fan-out, a finished op with pending siblings stamps CompletedAt on a running apply. Fine to defer, but worth tracking alongside this work.

  3. PR description — the "Ref" section links an internal repository. Per repo policy, internal references should be removed from public PR descriptions before merge.

  4. Minor: the list-then-write isn't atomic, so concurrent sibling drives will last-write-wins from possibly stale reads. It converges on the next poll, but a comment (or conditional update) before concurrent drives become real would help.

all of these are good findings - thank you - fix plan below:

1,3 and 4 (one part of it) are done in this pr - 2 and the other part of 4 will be dealt in the upcoming pr -

item 2 (CompletedAt/ErrorMessage from current op)
   ├─ CompletedAt stamped on a running apply ──────────▶ B2 (hold active until siblings settle)
   └─ ErrorMessage = last op's error, first lost ──────▶ B3 (eager failure surfacing)

item 4 (non-atomic list-then-write, last-write-wins)
   └─ conditional/CAS applies.state write ─────────────▶ B3 (before concurrent sibling 

…-aggregate-from-operations

* origin/main: (21 commits)
  fix(operator): make PlanetScale setup-phase applies recoverable after crash (#302)
  test(cli): add onboard integration coverage (#332)
  fix(github): run all webhook-spawned goroutines with panic recovery (#329)
  fix(github): retain locks and check state when a PR closes with an in-flight apply (#326)
  fix(github): support PR comment start command (#324)
  feat(cli): add `onboard` and `pull` commands (#323)
  feat(api): expose routing tern client (#322)
  feat(tern): add routing client (#316)
  feat(api): add live schema pull primitive (#313)
  fix(operator): stop conflict check from failing stopped or retryable tasks (#271)
  fix(operator): derive apply state from tasks when stop finds no active work (#262)
  fix(github): fail closed on staging-first gate when target env is unlisted (#301)
  fix(operator): stop the engine for all active states before recording stop (#269)
  fix(config): validate revert_window_duration and token references (#275)
  fix(storage): make rows-affected checks correct under production changed-rows semantics (#307)
  fix(operator): write apply_operations row when creating applies via the Tern client (#268)
  fix(github): enforce actor authorization for rollback and unlock commands (#264)
  bump spirit (#321)
  fix(api): reject null schema_files namespaces instead of panicking (#278)
  fix(spirit): only infer namespace when a single schema namespace exists (#277)
  ...

# Conflicts:
#	pkg/tern/local_client_test.go
…s are in use

The undetermined-aggregate guard fired on every degraded path, including
applies that do not use the operation model (tasks without an
apply_operation_id, or no operation store). Those have no siblings, so a
terminal per-task derivation is authoritative; failing closed stranded
single-writer/legacy applies at running. Fail closed only when the tasks
carry an apply_operation_id but the sibling rows cannot be read.
@Kiran01bm Kiran01bm merged commit e9e777b into main Jun 13, 2026
27 checks passed
@Kiran01bm Kiran01bm deleted the kiran01bm/apply-state-aggregate-from-operations branch June 13, 2026 06:42
Kiran01bm added a commit that referenced this pull request Jun 14, 2026
…enum

* origin/main: (22 commits)
  refactor(operator): derive applies.state from all operation rows (#318)
  fix(operator): make PlanetScale setup-phase applies recoverable after crash (#302)
  test(cli): add onboard integration coverage (#332)
  fix(github): run all webhook-spawned goroutines with panic recovery (#329)
  fix(github): retain locks and check state when a PR closes with an in-flight apply (#326)
  fix(github): support PR comment start command (#324)
  feat(cli): add `onboard` and `pull` commands (#323)
  feat(api): expose routing tern client (#322)
  feat(tern): add routing client (#316)
  feat(api): add live schema pull primitive (#313)
  fix(operator): stop conflict check from failing stopped or retryable tasks (#271)
  fix(operator): derive apply state from tasks when stop finds no active work (#262)
  fix(github): fail closed on staging-first gate when target env is unlisted (#301)
  fix(operator): stop the engine for all active states before recording stop (#269)
  fix(config): validate revert_window_duration and token references (#275)
  fix(storage): make rows-affected checks correct under production changed-rows semantics (#307)
  fix(operator): write apply_operations row when creating applies via the Tern client (#268)
  fix(github): enforce actor authorization for rollback and unlock commands (#264)
  bump spirit (#321)
  fix(api): reject null schema_files namespaces instead of panicking (#278)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants