fix(operator): make PlanetScale setup-phase applies recoverable after crash by morgo · Pull Request #302 · block/schemabot

morgo · 2026-06-11T16:32:51Z

A PlanetScale apply moves through engine setup-phase states — preparing_branch, applying_branch_changes, validating_branch, creating_deploy_request, validating_deploy_request — before per-table work begins. These states matched no clause of the operator claim queries: not pending, not in the claimable set, not terminal. So a worker crash mid-setup left the apply non-terminal and permanently unclaimable, with no operator able to recover it.

This adds the setup-phase states to claimableApplyStates, so a stale-heartbeat apply stuck in setup is reclaimed for recovery just like the other in-flight states. A healthy worker keeps its heartbeat fresh, so an in-progress setup is never claimed out from under it; terminal states stay excluded. Both the apply-level (FindNextApply, ClaimApplyByID) and operation-level (FindNextApplyOperation) claim paths share this set and now treat setup-phase applies consistently.

worker crash during setup
  apply.state = applying_branch_changes (non-terminal)
  heartbeat goes stale
    before: matches no claim clause -> wedged forever
    after:  matches stale-active clause -> operator reclaims and resumes

Merge order: This should land after #265 (resume with persisted engine state). On main today, a claimed setup-phase Vitess apply resumes through the path that re-runs engine setup without persisted state, which can duplicate branch/deploy-request work. Merging this before #265 converts "wedged after crash" into "recovered wrongly," which is worse. The deleted negative test is correct only once #265 is in.

🤖 Generated with Claude Code

… crash An apply driven by the PlanetScale engine transitions through setup-phase states (preparing_branch, applying_branch_changes, validating_branch, creating_deploy_request, validating_deploy_request) before per-table work begins. These states matched no clause of the operator claim queries: not pending, not in the claimable set, not terminal. A worker crash mid-setup left the apply non-terminal and permanently unclaimable, so no operator could ever recover it. Add the setup-phase states to claimableApplyStates so a stale-heartbeat apply stuck in setup is reclaimable for recovery, exactly like the other in-flight states. A healthy worker keeps its heartbeat fresh, so it is never claimed out from under itself. Both the apply-level (FindNextApply, ClaimApplyByID) and operation-level (FindNextApplyOperation) claim paths share this set and now treat setup-phase applies consistently. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Copilot

Pull request overview

This PR fixes a recovery gap in the operator claim logic where PlanetScale engine setup-phase applies (e.g., preparing_branch, applying_branch_changes, validating_branch, creating_deploy_request, validating_deploy_request) could become permanently unclaimable after a worker crash, leaving them stuck non-terminal without any operator being able to resume them.

Changes:

Adds PlanetScale setup-phase states to the shared claimableApplyStates set so stale-heartbeat setup applies can be reclaimed.
Adds unit test coverage for reclaiming stale setup-phase applies via both apply-level and operation-level claim paths.
Updates integration claimability expectations to treat setup-phase PlanetScale states as claimable (when stale) and removes the prior “not claimed” integration test.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File	Description
pkg/storage/mysqlstore/applies.go	Expands `claimableApplyStates()` to include PlanetScale setup-phase states and updates rationale comment.
pkg/storage/mysqlstore/applies_test.go	Adds tests ensuring stale setup-phase applies are reclaimable via `FindNextApply` and `ClaimApplyByID`.
pkg/storage/mysqlstore/apply_operations_test.go	Adds an operation-claim-path test verifying recovery can lease a stale operation and then claim its setup-phase parent apply.
integration/operator_test.go	Updates claimability matrix to expect setup-phase PlanetScale states to be claimable; removes obsolete “not claimed” test.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…ngine The setup-phase states (preparing_branch through validating_deploy_request) only occur for the PlanetScale engine, so the recovery tests now build their fixtures on a Vitess lock. createTestApplyWithStateAndEnv derives the apply engine from the lock's database type via storage.EngineForType, so a Vitess lock yields a planetscale apply with db_type=vitess — matching the real recovery path instead of a misleading mysql/spirit pairing. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

aparajon · 2026-06-12T16:46:31Z

The state set is exactly the five setup-phase states (terminal, failed_retryable, and stopped correctly excluded — each has its own path), all three claim paths share the same predicate, and the heartbeat-before-engine-apply ordering means a healthy worker can't be claimed out from under during long branch waits. The fixture engine/db-type mismatch from earlier review was properly fixed.

One sequencing concern that should gate the merge: this should land after #265 (resume with persisted engine state). On main today, a claimed setup-phase Vitess apply resumes through the path that re-runs engine setup without persisted state — which can duplicate branch/deploy-request work. The deleted negative test was the guard for exactly that scenario; deleting it is correct once #265 is in, but merging this first converts "wedged after crash" into "recovered wrongly," which is worse. Suggest noting the dependency in the PR body (or stacking the branches) so merge order is explicit.

Smaller: the tests prove claimability at both levels, but nothing exercises the full claim → resume → engine-reattach flow for a setup-phase crash (branch exists, no deploy request yet). With #265's tests adjacent this is acceptable; a combined integration test would close the loop.

Verdict: LGTM contingent on landing after #265.

🤖 This review was generated by Claude Code (claude-fable-5) with maintainer approval.

aparajon

Approving with an explicit sequencing condition: land this only after #265 (resume with persisted engine state) merges with its save-error fix. On main today, a claimed setup-phase Vitess apply would resume through the path that can duplicate engine work — the deleted negative test was the guard for exactly that. The claimable state set itself is verified correct.

🤖 Approval issued by Claude Code (claude-fable-5) with maintainer approval.

…-aggregate-from-operations * origin/main: (21 commits) fix(operator): make PlanetScale setup-phase applies recoverable after crash (#302) test(cli): add onboard integration coverage (#332) fix(github): run all webhook-spawned goroutines with panic recovery (#329) fix(github): retain locks and check state when a PR closes with an in-flight apply (#326) fix(github): support PR comment start command (#324) feat(cli): add `onboard` and `pull` commands (#323) feat(api): expose routing tern client (#322) feat(tern): add routing client (#316) feat(api): add live schema pull primitive (#313) fix(operator): stop conflict check from failing stopped or retryable tasks (#271) fix(operator): derive apply state from tasks when stop finds no active work (#262) fix(github): fail closed on staging-first gate when target env is unlisted (#301) fix(operator): stop the engine for all active states before recording stop (#269) fix(config): validate revert_window_duration and token references (#275) fix(storage): make rows-affected checks correct under production changed-rows semantics (#307) fix(operator): write apply_operations row when creating applies via the Tern client (#268) fix(github): enforce actor authorization for rollback and unlock commands (#264) bump spirit (#321) fix(api): reject null schema_files namespaces instead of panicking (#278) fix(spirit): only infer namespace when a single schema namespace exists (#277) ... # Conflicts: # pkg/tern/local_client_test.go

…enum * origin/main: (22 commits) refactor(operator): derive applies.state from all operation rows (#318) fix(operator): make PlanetScale setup-phase applies recoverable after crash (#302) test(cli): add onboard integration coverage (#332) fix(github): run all webhook-spawned goroutines with panic recovery (#329) fix(github): retain locks and check state when a PR closes with an in-flight apply (#326) fix(github): support PR comment start command (#324) feat(cli): add `onboard` and `pull` commands (#323) feat(api): expose routing tern client (#322) feat(tern): add routing client (#316) feat(api): add live schema pull primitive (#313) fix(operator): stop conflict check from failing stopped or retryable tasks (#271) fix(operator): derive apply state from tasks when stop finds no active work (#262) fix(github): fail closed on staging-first gate when target env is unlisted (#301) fix(operator): stop the engine for all active states before recording stop (#269) fix(config): validate revert_window_duration and token references (#275) fix(storage): make rows-affected checks correct under production changed-rows semantics (#307) fix(operator): write apply_operations row when creating applies via the Tern client (#268) fix(github): enforce actor authorization for rollback and unlock commands (#264) bump spirit (#321) fix(api): reject null schema_files namespaces instead of panicking (#278) ...

Copilot AI review requested due to automatic review settings June 11, 2026 16:32

Copilot started reviewing on behalf of morgo June 11, 2026 16:33 View session

Copilot AI reviewed Jun 11, 2026

View reviewed changes

Comment thread pkg/storage/mysqlstore/applies_test.go Outdated

Comment thread pkg/storage/mysqlstore/applies_test.go Outdated

Comment thread pkg/storage/mysqlstore/apply_operations_test.go Outdated

morgo marked this pull request as ready for review June 12, 2026 14:32

morgo requested review from Kiran01bm and aparajon as code owners June 12, 2026 14:32

aparajon approved these changes Jun 12, 2026

View reviewed changes

morgo merged commit 57ad251 into main Jun 13, 2026
27 checks passed

morgo deleted the fix/claimable-setup-phase-states branch June 13, 2026 03:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(operator): make PlanetScale setup-phase applies recoverable after crash#302

fix(operator): make PlanetScale setup-phase applies recoverable after crash#302
morgo merged 2 commits into
mainfrom
fix/claimable-setup-phase-states

morgo commented Jun 11, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

aparajon commented Jun 12, 2026

Uh oh!

aparajon left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

morgo commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

aparajon commented Jun 12, 2026

Uh oh!

aparajon left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

morgo commented Jun 11, 2026 •

edited

Loading