Skip to content

fix(operator): make PlanetScale setup-phase applies recoverable after crash#302

Merged
morgo merged 2 commits into
mainfrom
fix/claimable-setup-phase-states
Jun 13, 2026
Merged

fix(operator): make PlanetScale setup-phase applies recoverable after crash#302
morgo merged 2 commits into
mainfrom
fix/claimable-setup-phase-states

Conversation

@morgo

@morgo morgo commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator

A PlanetScale apply moves through engine setup-phase states — preparing_branch, applying_branch_changes, validating_branch, creating_deploy_request, validating_deploy_request — before per-table work begins. These states matched no clause of the operator claim queries: not pending, not in the claimable set, not terminal. So a worker crash mid-setup left the apply non-terminal and permanently unclaimable, with no operator able to recover it.

This adds the setup-phase states to claimableApplyStates, so a stale-heartbeat apply stuck in setup is reclaimed for recovery just like the other in-flight states. A healthy worker keeps its heartbeat fresh, so an in-progress setup is never claimed out from under it; terminal states stay excluded. Both the apply-level (FindNextApply, ClaimApplyByID) and operation-level (FindNextApplyOperation) claim paths share this set and now treat setup-phase applies consistently.

worker crash during setup
  apply.state = applying_branch_changes (non-terminal)
  heartbeat goes stale
    before: matches no claim clause -> wedged forever
    after:  matches stale-active clause -> operator reclaims and resumes

Merge order: This should land after #265 (resume with persisted engine state). On main today, a claimed setup-phase Vitess apply resumes through the path that re-runs engine setup without persisted state, which can duplicate branch/deploy-request work. Merging this before #265 converts "wedged after crash" into "recovered wrongly," which is worse. The deleted negative test is correct only once #265 is in.

🤖 Generated with Claude Code

… crash

An apply driven by the PlanetScale engine transitions through setup-phase
states (preparing_branch, applying_branch_changes, validating_branch,
creating_deploy_request, validating_deploy_request) before per-table work
begins. These states matched no clause of the operator claim queries: not
pending, not in the claimable set, not terminal. A worker crash mid-setup left
the apply non-terminal and permanently unclaimable, so no operator could ever
recover it.

Add the setup-phase states to claimableApplyStates so a stale-heartbeat apply
stuck in setup is reclaimable for recovery, exactly like the other in-flight
states. A healthy worker keeps its heartbeat fresh, so it is never claimed out
from under itself. Both the apply-level (FindNextApply, ClaimApplyByID) and
operation-level (FindNextApplyOperation) claim paths share this set and now
treat setup-phase applies consistently.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 11, 2026 16:32

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a recovery gap in the operator claim logic where PlanetScale engine setup-phase applies (e.g., preparing_branch, applying_branch_changes, validating_branch, creating_deploy_request, validating_deploy_request) could become permanently unclaimable after a worker crash, leaving them stuck non-terminal without any operator being able to resume them.

Changes:

  • Adds PlanetScale setup-phase states to the shared claimableApplyStates set so stale-heartbeat setup applies can be reclaimed.
  • Adds unit test coverage for reclaiming stale setup-phase applies via both apply-level and operation-level claim paths.
  • Updates integration claimability expectations to treat setup-phase PlanetScale states as claimable (when stale) and removes the prior “not claimed” integration test.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
pkg/storage/mysqlstore/applies.go Expands claimableApplyStates() to include PlanetScale setup-phase states and updates rationale comment.
pkg/storage/mysqlstore/applies_test.go Adds tests ensuring stale setup-phase applies are reclaimable via FindNextApply and ClaimApplyByID.
pkg/storage/mysqlstore/apply_operations_test.go Adds an operation-claim-path test verifying recovery can lease a stale operation and then claim its setup-phase parent apply.
integration/operator_test.go Updates claimability matrix to expect setup-phase PlanetScale states to be claimable; removes obsolete “not claimed” test.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/storage/mysqlstore/applies_test.go Outdated
Comment thread pkg/storage/mysqlstore/applies_test.go Outdated
Comment thread pkg/storage/mysqlstore/apply_operations_test.go Outdated
…ngine

The setup-phase states (preparing_branch through validating_deploy_request)
only occur for the PlanetScale engine, so the recovery tests now build their
fixtures on a Vitess lock. createTestApplyWithStateAndEnv derives the apply
engine from the lock's database type via storage.EngineForType, so a Vitess
lock yields a planetscale apply with db_type=vitess — matching the real
recovery path instead of a misleading mysql/spirit pairing.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@morgo morgo marked this pull request as ready for review June 12, 2026 14:32
@morgo morgo requested review from Kiran01bm and aparajon as code owners June 12, 2026 14:32
@aparajon

Copy link
Copy Markdown
Collaborator

The state set is exactly the five setup-phase states (terminal, failed_retryable, and stopped correctly excluded — each has its own path), all three claim paths share the same predicate, and the heartbeat-before-engine-apply ordering means a healthy worker can't be claimed out from under during long branch waits. The fixture engine/db-type mismatch from earlier review was properly fixed.

One sequencing concern that should gate the merge: this should land after #265 (resume with persisted engine state). On main today, a claimed setup-phase Vitess apply resumes through the path that re-runs engine setup without persisted state — which can duplicate branch/deploy-request work. The deleted negative test was the guard for exactly that scenario; deleting it is correct once #265 is in, but merging this first converts "wedged after crash" into "recovered wrongly," which is worse. Suggest noting the dependency in the PR body (or stacking the branches) so merge order is explicit.

Smaller: the tests prove claimability at both levels, but nothing exercises the full claim → resume → engine-reattach flow for a setup-phase crash (branch exists, no deploy request yet). With #265's tests adjacent this is acceptable; a combined integration test would close the loop.

Verdict: LGTM contingent on landing after #265.


🤖 This review was generated by Claude Code (claude-fable-5) with maintainer approval.

@aparajon aparajon left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving with an explicit sequencing condition: land this only after #265 (resume with persisted engine state) merges with its save-error fix. On main today, a claimed setup-phase Vitess apply would resume through the path that can duplicate engine work — the deleted negative test was the guard for exactly that. The claimable state set itself is verified correct.

🤖 Approval issued by Claude Code (claude-fable-5) with maintainer approval.

@morgo morgo merged commit 57ad251 into main Jun 13, 2026
27 checks passed
@morgo morgo deleted the fix/claimable-setup-phase-states branch June 13, 2026 03:46
Kiran01bm added a commit that referenced this pull request Jun 13, 2026
…-aggregate-from-operations

* origin/main: (21 commits)
  fix(operator): make PlanetScale setup-phase applies recoverable after crash (#302)
  test(cli): add onboard integration coverage (#332)
  fix(github): run all webhook-spawned goroutines with panic recovery (#329)
  fix(github): retain locks and check state when a PR closes with an in-flight apply (#326)
  fix(github): support PR comment start command (#324)
  feat(cli): add `onboard` and `pull` commands (#323)
  feat(api): expose routing tern client (#322)
  feat(tern): add routing client (#316)
  feat(api): add live schema pull primitive (#313)
  fix(operator): stop conflict check from failing stopped or retryable tasks (#271)
  fix(operator): derive apply state from tasks when stop finds no active work (#262)
  fix(github): fail closed on staging-first gate when target env is unlisted (#301)
  fix(operator): stop the engine for all active states before recording stop (#269)
  fix(config): validate revert_window_duration and token references (#275)
  fix(storage): make rows-affected checks correct under production changed-rows semantics (#307)
  fix(operator): write apply_operations row when creating applies via the Tern client (#268)
  fix(github): enforce actor authorization for rollback and unlock commands (#264)
  bump spirit (#321)
  fix(api): reject null schema_files namespaces instead of panicking (#278)
  fix(spirit): only infer namespace when a single schema namespace exists (#277)
  ...

# Conflicts:
#	pkg/tern/local_client_test.go
Kiran01bm added a commit that referenced this pull request Jun 14, 2026
…enum

* origin/main: (22 commits)
  refactor(operator): derive applies.state from all operation rows (#318)
  fix(operator): make PlanetScale setup-phase applies recoverable after crash (#302)
  test(cli): add onboard integration coverage (#332)
  fix(github): run all webhook-spawned goroutines with panic recovery (#329)
  fix(github): retain locks and check state when a PR closes with an in-flight apply (#326)
  fix(github): support PR comment start command (#324)
  feat(cli): add `onboard` and `pull` commands (#323)
  feat(api): expose routing tern client (#322)
  feat(tern): add routing client (#316)
  feat(api): add live schema pull primitive (#313)
  fix(operator): stop conflict check from failing stopped or retryable tasks (#271)
  fix(operator): derive apply state from tasks when stop finds no active work (#262)
  fix(github): fail closed on staging-first gate when target env is unlisted (#301)
  fix(operator): stop the engine for all active states before recording stop (#269)
  fix(config): validate revert_window_duration and token references (#275)
  fix(storage): make rows-affected checks correct under production changed-rows semantics (#307)
  fix(operator): write apply_operations row when creating applies via the Tern client (#268)
  fix(github): enforce actor authorization for rollback and unlock commands (#264)
  bump spirit (#321)
  fix(api): reject null schema_files namespaces instead of panicking (#278)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants