fix(operator): resume recovered applies with persisted engine state by morgo · Pull Request #265 · block/schemabot

morgo · 2026-06-10T15:12:28Z

Restart recovery of grouped applies ignored the engine resume state SchemaBot had persisted: the grouped resume path called the engine with a context-only ResumeState, polled progress with nil resume state, sent bare DDL strings without namespace/table identity, and ResumeApply only used the grouped path for defer-cutover applies. For Vitess this meant a recovered apply opened a new branch and deploy request — duplicating the still-in-flight one — and then polled "no active schema change" forever while its heartbeat kept the lease alive.

The grouped resume now loads the persisted per-operation engine resume state and hands it to Apply, so the engine reattaches to its existing work. Absent state is expected for Spirit (it reattaches via durable checkpoints keyed by the schema change context); a storage read failure aborts the recovery attempt without writing terminal state so a later owner retries against intact storage.
The engine's returned resume state is persisted and threaded into progress polling, so the recovered apply can observe the deploy request and keep stored state current.
Resume changes are rebuilt from the stored tasks' namespace/table/DDL, restoring per-table progress matching. VSchema tasks carry no DDL, so they are kept out of the table changes and instead flag their namespace with the vschema_changed metadata, matching the fresh apply path so the engine re-applies vschema.json on resume instead of executing an empty statement.
ResumeApply dispatches on the same grouped-apply predicate as initial dispatch, so non-defer-cutover Vitess applies recover through one grouped engine apply instead of one fresh apply per task.

Before (restart recovery, Vitess):
  ResumeApply ──> eng.Apply(context-only state) ──> new branch + new deploy request (duplicate)
              └─> poll(resume state = nil)      ──> "no active schema change" forever (wedged)

After:
  ResumeApply ──> load persisted resume state ──> eng.Apply(context + metadata)
              │                                    └─> reattaches to existing deploy request
              └─> poll(engine-returned resume state) ──> live progress, updates persisted

🤖 Generated with Claude Code

Copilot

Pull request overview

This PR fixes restart recovery for grouped applies so SchemaBot resumes in-flight engine work using the persisted engine resume state, rather than accidentally starting duplicate schema changes (notably for Vitess/PlanetScale deploy requests). It also improves progress polling by threading the engine-provided resume state through polling and persisting updates as they arrive.

Changes:

Load and pass persisted per-apply-operation engine ResumeState into grouped resume Apply, and persist any updated resume state returned by the engine.
Rebuild grouped resume Changes from stored tasks (namespace/table/DDL) so per-table progress mapping is consistent after recovery.
Resume Vitess applies via the grouped-apply path whenever usesGroupedApply(apply) is true (not only defer-cutover).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File	Description
pkg/tern/local_control_resume.go	Loads persisted engine resume state for grouped recovery, rebuilds changes from tasks, and threads/persists resume state through Apply + progress polling.
pkg/tern/local_apply_grouped.go	Factors out `loadEngineResumeStateForOperation` for reuse by grouped recovery logic.
pkg/tern/local_client_test.go	Adds unit tests for grouped resume state loading behavior and rebuilt change grouping.
pkg/tern/local_client_integration_test.go	Adds integration coverage asserting Vitess grouped recovery reattaches with persisted state and polling persists updated state.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

aparajon · 2026-06-12T16:46:29Z

The overall direction is a major safety improvement — persisted-state resume with engine-side validation (re-fetching the deploy request, falling back on deleted/errored ones, diffing the branch rather than replaying DDL) is the right model, and the tests genuinely prove the crash/restart round-trip. One fix needed before merge, one worth considering:

Save failure after a successful reattach terminally fails an in-flight apply — in the post-Accepted path, a saveEngineResumeState error returns a plain error, so the caller's errors.Is(err, errGroupedResumeStateUnavailable) misses it and routes to the failure handler; since the engine-error retry classification doesn't apply to Vitess, the apply is marked terminally failed while the reattached deploy request keeps running on the provider. Storage uncertainty must not become terminal state on an in-flight apply. Suggest wrapping this error (and the empty-Metadata invariant error, which takes the same caller path) in errGroupedResumeStateUnavailable or an equivalent non-terminal exit so the apply stays recoverable.
Vitess fallback on ErrEngineResumeStateNotFound — falling back to the fresh path is mostly safe (branch names are deterministic and branch creation is reuse-on-exists), but state-persistence failures are warn-swallowed in both paths, so "not found" can also mean "persist failed after a deploy request was created," and deploy-request creation has no existing-DR dedupe. Consider failing closed when the apply's state shows setup progressed past branch preparation yet no persisted state exists.

Verdict: needs changes for item 1; happy to re-review quickly.

🤖 This review was generated by Claude Code (claude-fable-5) with maintainer approval.

Grouped resume now loads the persisted engine resume state and rebuilds changes from the stored tasks' namespace/table identity, so recovery reattaches to in-flight engine work instead of launching a duplicate schema change and polling without state forever. ResumeApply also dispatches on the same grouped-apply predicate as initial dispatch, so non-defer-cutover Vitess applies recover through one grouped engine apply rather than one fresh apply per task. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

VSchema tasks have no DDL; their work is applying the namespace's vschema.json. When rebuilding engine changes for a grouped resume, exclude vschema tasks from TableChanges and instead flag their namespace with vschema_changed metadata, mirroring the fresh apply path. This prevents the engine from executing an empty statement and ensures vschema.json is re-applied on resume for VSchema-only and VSchema+DDL applies. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…e fails After the engine accepts a grouped resume, a deploy request is running on the provider. A failure to persist the returned resume state, or an accepted resume that returns no progress-tracking metadata, is storage/tracking uncertainty — not a failed schema change. Both post-accept exits now wrap errGroupedResumeStateUnavailable so the caller routes them as recoverable and the apply owner exits without writing terminal failure, leaving the apply for operator retry against the in-flight work instead of abandoning it. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings June 10, 2026 15:12

Copilot started reviewing on behalf of morgo June 10, 2026 15:12 View session

Copilot AI reviewed Jun 10, 2026

View reviewed changes

Comment thread pkg/tern/local_control_resume.go Outdated

morgo marked this pull request as ready for review June 12, 2026 13:33

morgo requested review from Kiran01bm and aparajon as code owners June 12, 2026 13:33

aparajon mentioned this pull request Jun 12, 2026

fix(operator): make PlanetScale setup-phase applies recoverable after crash #302

Merged

morgo and others added 3 commits June 12, 2026 11:01

morgo force-pushed the fix/resume-persisted-engine-state branch from 495e8f6 to cc950f5 Compare June 12, 2026 17:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(operator): resume recovered applies with persisted engine state#265

fix(operator): resume recovered applies with persisted engine state#265
morgo wants to merge 3 commits into
mainfrom
fix/resume-persisted-engine-state

morgo commented Jun 10, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

aparajon commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

morgo commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

aparajon commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

morgo commented Jun 10, 2026 •

edited

Loading