fix(operator): make PlanetScale setup-phase applies recoverable after crash#302
Conversation
… crash An apply driven by the PlanetScale engine transitions through setup-phase states (preparing_branch, applying_branch_changes, validating_branch, creating_deploy_request, validating_deploy_request) before per-table work begins. These states matched no clause of the operator claim queries: not pending, not in the claimable set, not terminal. A worker crash mid-setup left the apply non-terminal and permanently unclaimable, so no operator could ever recover it. Add the setup-phase states to claimableApplyStates so a stale-heartbeat apply stuck in setup is reclaimable for recovery, exactly like the other in-flight states. A healthy worker keeps its heartbeat fresh, so it is never claimed out from under itself. Both the apply-level (FindNextApply, ClaimApplyByID) and operation-level (FindNextApplyOperation) claim paths share this set and now treat setup-phase applies consistently. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR fixes a recovery gap in the operator claim logic where PlanetScale engine setup-phase applies (e.g., preparing_branch, applying_branch_changes, validating_branch, creating_deploy_request, validating_deploy_request) could become permanently unclaimable after a worker crash, leaving them stuck non-terminal without any operator being able to resume them.
Changes:
- Adds PlanetScale setup-phase states to the shared
claimableApplyStatesset so stale-heartbeat setup applies can be reclaimed. - Adds unit test coverage for reclaiming stale setup-phase applies via both apply-level and operation-level claim paths.
- Updates integration claimability expectations to treat setup-phase PlanetScale states as claimable (when stale) and removes the prior “not claimed” integration test.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| pkg/storage/mysqlstore/applies.go | Expands claimableApplyStates() to include PlanetScale setup-phase states and updates rationale comment. |
| pkg/storage/mysqlstore/applies_test.go | Adds tests ensuring stale setup-phase applies are reclaimable via FindNextApply and ClaimApplyByID. |
| pkg/storage/mysqlstore/apply_operations_test.go | Adds an operation-claim-path test verifying recovery can lease a stale operation and then claim its setup-phase parent apply. |
| integration/operator_test.go | Updates claimability matrix to expect setup-phase PlanetScale states to be claimable; removes obsolete “not claimed” test. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…ngine The setup-phase states (preparing_branch through validating_deploy_request) only occur for the PlanetScale engine, so the recovery tests now build their fixtures on a Vitess lock. createTestApplyWithStateAndEnv derives the apply engine from the lock's database type via storage.EngineForType, so a Vitess lock yields a planetscale apply with db_type=vitess — matching the real recovery path instead of a misleading mysql/spirit pairing. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
The state set is exactly the five setup-phase states (terminal, One sequencing concern that should gate the merge: this should land after #265 (resume with persisted engine state). On main today, a claimed setup-phase Vitess apply resumes through the path that re-runs engine setup without persisted state — which can duplicate branch/deploy-request work. The deleted negative test was the guard for exactly that scenario; deleting it is correct once #265 is in, but merging this first converts "wedged after crash" into "recovered wrongly," which is worse. Suggest noting the dependency in the PR body (or stacking the branches) so merge order is explicit. Smaller: the tests prove claimability at both levels, but nothing exercises the full claim → resume → engine-reattach flow for a setup-phase crash (branch exists, no deploy request yet). With #265's tests adjacent this is acceptable; a combined integration test would close the loop. Verdict: LGTM contingent on landing after #265. 🤖 This review was generated by Claude Code (claude-fable-5) with maintainer approval. |
aparajon
left a comment
There was a problem hiding this comment.
Approving with an explicit sequencing condition: land this only after #265 (resume with persisted engine state) merges with its save-error fix. On main today, a claimed setup-phase Vitess apply would resume through the path that can duplicate engine work — the deleted negative test was the guard for exactly that. The claimable state set itself is verified correct.
🤖 Approval issued by Claude Code (claude-fable-5) with maintainer approval.
…-aggregate-from-operations * origin/main: (21 commits) fix(operator): make PlanetScale setup-phase applies recoverable after crash (#302) test(cli): add onboard integration coverage (#332) fix(github): run all webhook-spawned goroutines with panic recovery (#329) fix(github): retain locks and check state when a PR closes with an in-flight apply (#326) fix(github): support PR comment start command (#324) feat(cli): add `onboard` and `pull` commands (#323) feat(api): expose routing tern client (#322) feat(tern): add routing client (#316) feat(api): add live schema pull primitive (#313) fix(operator): stop conflict check from failing stopped or retryable tasks (#271) fix(operator): derive apply state from tasks when stop finds no active work (#262) fix(github): fail closed on staging-first gate when target env is unlisted (#301) fix(operator): stop the engine for all active states before recording stop (#269) fix(config): validate revert_window_duration and token references (#275) fix(storage): make rows-affected checks correct under production changed-rows semantics (#307) fix(operator): write apply_operations row when creating applies via the Tern client (#268) fix(github): enforce actor authorization for rollback and unlock commands (#264) bump spirit (#321) fix(api): reject null schema_files namespaces instead of panicking (#278) fix(spirit): only infer namespace when a single schema namespace exists (#277) ... # Conflicts: # pkg/tern/local_client_test.go
…enum * origin/main: (22 commits) refactor(operator): derive applies.state from all operation rows (#318) fix(operator): make PlanetScale setup-phase applies recoverable after crash (#302) test(cli): add onboard integration coverage (#332) fix(github): run all webhook-spawned goroutines with panic recovery (#329) fix(github): retain locks and check state when a PR closes with an in-flight apply (#326) fix(github): support PR comment start command (#324) feat(cli): add `onboard` and `pull` commands (#323) feat(api): expose routing tern client (#322) feat(tern): add routing client (#316) feat(api): add live schema pull primitive (#313) fix(operator): stop conflict check from failing stopped or retryable tasks (#271) fix(operator): derive apply state from tasks when stop finds no active work (#262) fix(github): fail closed on staging-first gate when target env is unlisted (#301) fix(operator): stop the engine for all active states before recording stop (#269) fix(config): validate revert_window_duration and token references (#275) fix(storage): make rows-affected checks correct under production changed-rows semantics (#307) fix(operator): write apply_operations row when creating applies via the Tern client (#268) fix(github): enforce actor authorization for rollback and unlock commands (#264) bump spirit (#321) fix(api): reject null schema_files namespaces instead of panicking (#278) ...
A PlanetScale apply moves through engine setup-phase states —
preparing_branch,applying_branch_changes,validating_branch,creating_deploy_request,validating_deploy_request— before per-table work begins. These states matched no clause of the operator claim queries: notpending, not in the claimable set, not terminal. So a worker crash mid-setup left the apply non-terminal and permanently unclaimable, with no operator able to recover it.This adds the setup-phase states to
claimableApplyStates, so a stale-heartbeat apply stuck in setup is reclaimed for recovery just like the other in-flight states. A healthy worker keeps its heartbeat fresh, so an in-progress setup is never claimed out from under it; terminal states stay excluded. Both the apply-level (FindNextApply,ClaimApplyByID) and operation-level (FindNextApplyOperation) claim paths share this set and now treat setup-phase applies consistently.Merge order: This should land after #265 (resume with persisted engine state). On
maintoday, a claimed setup-phase Vitess apply resumes through the path that re-runs engine setup without persisted state, which can duplicate branch/deploy-request work. Merging this before #265 converts "wedged after crash" into "recovered wrongly," which is worse. The deleted negative test is correct only once #265 is in.🤖 Generated with Claude Code