refactor(operator): derive applies.state from all operation rows by Kiran01bm · Pull Request #318 · block/schemabot

Kiran01bm · 2026-06-12T04:11:02Z

What and Why ?

The per-deployment drive sources the aggregate applies.state from only its own task states. With one operation per apply that is correct, but once an apply owns more than one operation, one deployment's drive would overwrite the rollout-level state from just its own tasks — e.g. marking the apply completed while a sibling is still pending.

This change derives applies.state as the rollout projection over all of the apply's operation rows, folding in the current deployment's freshly-derived state, via a new deriveAggregateApplyState helper used at both drive sites.

before: apply.State = DeriveApplyState(taskStates(thisOpTasks))
after: apply.State = DeriveApplyState(state of every operation row, this op folded in)

Behaviour-identical at one operation per apply (the sibling set is the single current op, and DeriveApplyState round-trips its own outputs). When the sibling rows cannot be read, the projection falls back to the current op's derived state only if that state is non-terminal; a terminal derivation on incomplete sibling information is refused (the helper reports the projection as undetermined and the caller leaves the stored value for the next poll to reconcile), so a transient read failure on one deployment cannot terminalize the whole apply while siblings are still in flight. Once the multi-deployment fan-out lands, a non-terminal sibling keeps the apply non-terminal, eliminating the clobber that would otherwise re-terminalize the parent mid-rollout.

Dormant until the fan-out is wired (one operation per apply today).

Roadmap / upcoming PRs

This is the apply-state / continuation projection track. Follow-ups, each independently reviewable:

Ref	Upcoming change	Status	PR
B1	Derive `applies.state` from all operation rows (no sibling clobber)	This PR	#318
B2	`continue` projection: hold apply active until all siblings settle — incl. stamping `CompletedAt` from the rollout aggregate, not from a single finished op	Planned	TBD
B3	Eager failure surfacing + `stop` halts remaining siblings under `continue` — incl. surfacing `ErrorMessage` from the aggregate (not the last op) and a conditional/CAS `applies.state` write to drop last-write-wins on concurrent sibling drives	Planned	TBD
B4	`running_degraded` state for active-with-known-failure (fast follow)	Future	TBD

Merge note

This branch merges main. The diff includes one unrelated line in pkg/storage/mysqlstore/apply_operations_test.go — fixing a FindNextApplyOperation call that was missing its owner argument on main and broke the integration build. Pure test-arity fix; no production behavior change.

The per-deployment drive sourced the aggregate applies.state from only its own task states, which would clobber sibling state once an apply owns more than one operation. Derive it as the rollout projection over the apply's operation rows instead, folding in the current deployment's state.

Copilot

Pull request overview

This PR refactors how applies.state is derived so it reflects the aggregate rollout state across all apply_operations rows for an apply (with the current deployment’s freshly-derived per-operation state folded in), preventing one deployment’s drive loop from incorrectly “clobbering” the parent apply state once multi-deployment fan-out is enabled.

Changes:

Introduce LocalClient.deriveAggregateApplyState to derive apply state from sibling apply_operations rows (plus current operation’s derived state).
Update both LocalClient drive/progress sites to use the new aggregate derivation instead of task-only derivation.
Add unit tests to validate behavior for single-op, pending-sibling, and list-error fallback scenarios.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File	Description
pkg/tern/local_client.go	Switch apply-state update during progress to use aggregate apply-operation projection.
pkg/tern/local_client_test.go	Add unit tests + lightweight store stub for aggregate apply-state derivation.
pkg/tern/local_apply_grouped.go	Add `deriveAggregateApplyState` helper and use it in grouped/atomic progress ticking.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…ssing op deriveAggregateApplyState could panic when the apply operation store is not configured, and silently dropped the current deployment's state when its row was absent from the sibling set. Both now fall back to the current deployment's derived state, consistent with the read-error path.

chatgpt-codex-connector · 2026-06-12T12:08:11Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

aparajon · 2026-06-12T16:45:01Z

The direction here is exactly right — both drive sites projecting applies.state over the full sibling set is the correct inversion, and the fresh-reload terminal checks preserve "started applies remain authoritative." Three items before merge, one of them blocking-level for the fan-out future this PR exists for:

deriveAggregateApplyState degraded paths can terminalize from one operation — every fallback (ListByApply error, nil store, empty ops, missing current row) derives from the current op alone, which can return a terminal state while sibling states are unknown. Once fan-out lands, a transient storage read failure on the last-finishing deployment marks the whole apply completed with siblings still pending — storage uncertainty converted into a passing aggregate. Suggest the fallback refuse to terminalize: either return an error so the caller skips the state write (keeping the stored value), or only fall back when the single-op derivation is non-terminal.
CompletedAt/ErrorMessage still stamped from the current op's engine result — under fan-out, a finished op with pending siblings stamps CompletedAt on a running apply. Fine to defer, but worth tracking alongside this work.
PR description — the "Ref" section links an internal repository. Per repo policy, internal references should be removed from public PR descriptions before merge.

Minor: the list-then-write isn't atomic, so concurrent sibling drives will last-write-wins from possibly stale reads. It converges on the next poll, but a comment (or conditional update) before concurrent drives become real would help.

Since this PR already has an approval, suggest addressing these as new commits rather than a force-push.

🤖 This review was generated by Claude Code (claude-fable-5) with maintainer approval.

A transient sibling-read failure on the last-finishing deployment must not let one operation's terminal derivation become the whole apply's state while other operations are still in flight. Degraded paths now refuse to terminalize and leave the stored state for the next poll.

Kiran01bm · 2026-06-13T05:21:14Z

for posterity - flow visualization for each issue

Minor: the list-then-write isn't atomic, so concurrent sibling drives will last-write-wins from possibly stale reads. It converges on the next poll, but a comment (or conditional update) before concurrent drives become real would help.

list-then-write race (stale-read clobber)
       op A drive                         op B drive
           │                                    │
 t0  read ops {A:completed,                     │    
              B:completed}                      │
     derive "completed"                         │
     (snapshot held)                            │
           │                            t1  op B → failed
           │                                write applies.state = failed
           │                                    │
 t2  WRITE applies.state = completed ──────────-┘
     (using stale t0 snapshot)
           │
           ▼
   applies.state = completed   ◀── overwrites B's "failed"
           │
           ▼
   merge gate sampling now ──▶ 🟢 (failed rollout looks green)
           │
           ▼
   next poll: reader sees B:failed ──▶ converges, but clobber already happened

deriveAggregateApplyState degraded paths can terminalize from one operation — every fallback (ListByApply error, nil store, empty ops, missing current row) derives from the current op alone, which can return a terminal state while sibling states are unknown. Once fan-out lands, a transient storage read failure on the last-finishing deployment marks the whole apply completed with siblings still pending — storage uncertainty converted into a passing aggregate. Suggest the fallback refuse to terminalize: either return an error so the caller skips the state write (keeping the stored value), or only fall back when the single-op derivation is non-terminal.

degraded fallback terminalizes from one op

Apply #7 ── owns ──┬── op A (us-east)  task: copying ──▶ row = running
                   └── op B (us-west)  tasks: all done

  B's drive tick wants to update applies.state
                   │
                   ▼
        deriveAggregateApplyState
                   │
          ListByApply(apply) ──▶ ✗ transient timeout / deadlock
                   │
                   ▼
        FALLBACK: derive from B's tasks only ──▶ "completed"
                   │
                   ▼
   reload guard: stored state terminal?  ── no (it's "running") ──▶ WRITE
                   │
                   ▼
        applies.state = completed   ◀── but op A still copying
                   │
                   ▼
   check-run / merge gate reads "completed" ──▶ 🟢 false "all done"

Kiran01bm · 2026-06-13T05:28:18Z

deriveAggregateApplyState degraded paths can terminalize from one operation — every fallback (ListByApply error, nil store, empty ops, missing current row) derives from the current op alone, which can return a terminal state while sibling states are unknown. Once fan-out lands, a transient storage read failure on the last-finishing deployment marks the whole apply completed with siblings still pending — storage uncertainty converted into a passing aggregate. Suggest the fallback refuse to terminalize: either return an error so the caller skips the state write (keeping the stored value), or only fall back when the single-op derivation is non-terminal.

CompletedAt/ErrorMessage still stamped from the current op's engine result — under fan-out, a finished op with pending siblings stamps CompletedAt on a running apply. Fine to defer, but worth tracking alongside this work.

PR description — the "Ref" section links an internal repository. Per repo policy, internal references should be removed from public PR descriptions before merge.

Minor: the list-then-write isn't atomic, so concurrent sibling drives will last-write-wins from possibly stale reads. It converges on the next poll, but a comment (or conditional update) before concurrent drives become real would help.

all of these are good findings - thank you - fix plan below:

1,3 and 4 (one part of it) are done in this pr - 2 and the other part of 4 will be dealt in the upcoming pr -

item 2 (CompletedAt/ErrorMessage from current op)
   ├─ CompletedAt stamped on a running apply ──────────▶ B2 (hold active until siblings settle)
   └─ ErrorMessage = last op's error, first lost ──────▶ B3 (eager failure surfacing)

item 4 (non-atomic list-then-write, last-write-wins)
   └─ conditional/CAS applies.state write ─────────────▶ B3 (before concurrent sibling

…-aggregate-from-operations * origin/main: (21 commits) fix(operator): make PlanetScale setup-phase applies recoverable after crash (#302) test(cli): add onboard integration coverage (#332) fix(github): run all webhook-spawned goroutines with panic recovery (#329) fix(github): retain locks and check state when a PR closes with an in-flight apply (#326) fix(github): support PR comment start command (#324) feat(cli): add `onboard` and `pull` commands (#323) feat(api): expose routing tern client (#322) feat(tern): add routing client (#316) feat(api): add live schema pull primitive (#313) fix(operator): stop conflict check from failing stopped or retryable tasks (#271) fix(operator): derive apply state from tasks when stop finds no active work (#262) fix(github): fail closed on staging-first gate when target env is unlisted (#301) fix(operator): stop the engine for all active states before recording stop (#269) fix(config): validate revert_window_duration and token references (#275) fix(storage): make rows-affected checks correct under production changed-rows semantics (#307) fix(operator): write apply_operations row when creating applies via the Tern client (#268) fix(github): enforce actor authorization for rollback and unlock commands (#264) bump spirit (#321) fix(api): reject null schema_files namespaces instead of panicking (#278) fix(spirit): only infer namespace when a single schema namespace exists (#277) ... # Conflicts: # pkg/tern/local_client_test.go

…s are in use The undetermined-aggregate guard fired on every degraded path, including applies that do not use the operation model (tasks without an apply_operation_id, or no operation store). Those have no siblings, so a terminal per-task derivation is authoritative; failing closed stranded single-writer/legacy applies at running. Fail closed only when the tasks carry an apply_operation_id but the sibling rows cannot be read.

…enum * origin/main: (22 commits) refactor(operator): derive applies.state from all operation rows (#318) fix(operator): make PlanetScale setup-phase applies recoverable after crash (#302) test(cli): add onboard integration coverage (#332) fix(github): run all webhook-spawned goroutines with panic recovery (#329) fix(github): retain locks and check state when a PR closes with an in-flight apply (#326) fix(github): support PR comment start command (#324) feat(cli): add `onboard` and `pull` commands (#323) feat(api): expose routing tern client (#322) feat(tern): add routing client (#316) feat(api): add live schema pull primitive (#313) fix(operator): stop conflict check from failing stopped or retryable tasks (#271) fix(operator): derive apply state from tasks when stop finds no active work (#262) fix(github): fail closed on staging-first gate when target env is unlisted (#301) fix(operator): stop the engine for all active states before recording stop (#269) fix(config): validate revert_window_duration and token references (#275) fix(storage): make rows-affected checks correct under production changed-rows semantics (#307) fix(operator): write apply_operations row when creating applies via the Tern client (#268) fix(github): enforce actor authorization for rollback and unlock commands (#264) bump spirit (#321) fix(api): reject null schema_files namespaces instead of panicking (#278) ...

Copilot AI review requested due to automatic review settings June 12, 2026 04:11

Copilot started reviewing on behalf of Kiran01bm June 12, 2026 04:11 View session

Kiran01bm mentioned this pull request Jun 12, 2026

feat(config): add on_failure enum for rollout continuation #317

Merged

Copilot AI reviewed Jun 12, 2026

View reviewed changes

Comment thread pkg/tern/local_apply_grouped.go Outdated

Kiran01bm marked this pull request as ready for review June 12, 2026 12:08

Kiran01bm requested review from aparajon and morgo as code owners June 12, 2026 12:08

aparajon approved these changes Jun 12, 2026

View reviewed changes

Kiran01bm added 2 commits June 13, 2026 15:45

Kiran01bm merged commit e9e777b into main Jun 13, 2026
27 checks passed

Kiran01bm deleted the kiran01bm/apply-state-aggregate-from-operations branch June 13, 2026 06:42

Kiran01bm mentioned this pull request Jun 14, 2026

feat(operator): project apply state with on_failure continue policy #337

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(operator): derive applies.state from all operation rows#318

refactor(operator): derive applies.state from all operation rows#318
Kiran01bm merged 5 commits into
mainfrom
kiran01bm/apply-state-aggregate-from-operations

Kiran01bm commented Jun 12, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

chatgpt-codex-connector Bot commented Jun 12, 2026

Uh oh!

aparajon commented Jun 12, 2026

Uh oh!

Kiran01bm commented Jun 13, 2026

Uh oh!

Kiran01bm commented Jun 13, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Kiran01bm commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What and Why ?

Roadmap / upcoming PRs

Merge note

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

chatgpt-codex-connector Bot commented Jun 12, 2026

Uh oh!

aparajon commented Jun 12, 2026

Uh oh!

Kiran01bm commented Jun 13, 2026

Uh oh!

Kiran01bm commented Jun 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Kiran01bm commented Jun 12, 2026 •

edited

Loading

Kiran01bm commented Jun 13, 2026 •

edited

Loading