Skip to content

fix(operator): resume recovered applies with persisted engine state#265

Open
morgo wants to merge 3 commits into
mainfrom
fix/resume-persisted-engine-state
Open

fix(operator): resume recovered applies with persisted engine state#265
morgo wants to merge 3 commits into
mainfrom
fix/resume-persisted-engine-state

Conversation

@morgo

@morgo morgo commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Restart recovery of grouped applies ignored the engine resume state SchemaBot had persisted: the grouped resume path called the engine with a context-only ResumeState, polled progress with nil resume state, sent bare DDL strings without namespace/table identity, and ResumeApply only used the grouped path for defer-cutover applies. For Vitess this meant a recovered apply opened a new branch and deploy request — duplicating the still-in-flight one — and then polled "no active schema change" forever while its heartbeat kept the lease alive.

  • The grouped resume now loads the persisted per-operation engine resume state and hands it to Apply, so the engine reattaches to its existing work. Absent state is expected for Spirit (it reattaches via durable checkpoints keyed by the schema change context); a storage read failure aborts the recovery attempt without writing terminal state so a later owner retries against intact storage.
  • The engine's returned resume state is persisted and threaded into progress polling, so the recovered apply can observe the deploy request and keep stored state current.
  • Resume changes are rebuilt from the stored tasks' namespace/table/DDL, restoring per-table progress matching. VSchema tasks carry no DDL, so they are kept out of the table changes and instead flag their namespace with the vschema_changed metadata, matching the fresh apply path so the engine re-applies vschema.json on resume instead of executing an empty statement.
  • ResumeApply dispatches on the same grouped-apply predicate as initial dispatch, so non-defer-cutover Vitess applies recover through one grouped engine apply instead of one fresh apply per task.
Before (restart recovery, Vitess):
  ResumeApply ──> eng.Apply(context-only state) ──> new branch + new deploy request (duplicate)
              └─> poll(resume state = nil)      ──> "no active schema change" forever (wedged)

After:
  ResumeApply ──> load persisted resume state ──> eng.Apply(context + metadata)
              │                                    └─> reattaches to existing deploy request
              └─> poll(engine-returned resume state) ──> live progress, updates persisted

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings June 10, 2026 15:12

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes restart recovery for grouped applies so SchemaBot resumes in-flight engine work using the persisted engine resume state, rather than accidentally starting duplicate schema changes (notably for Vitess/PlanetScale deploy requests). It also improves progress polling by threading the engine-provided resume state through polling and persisting updates as they arrive.

Changes:

  • Load and pass persisted per-apply-operation engine ResumeState into grouped resume Apply, and persist any updated resume state returned by the engine.
  • Rebuild grouped resume Changes from stored tasks (namespace/table/DDL) so per-table progress mapping is consistent after recovery.
  • Resume Vitess applies via the grouped-apply path whenever usesGroupedApply(apply) is true (not only defer-cutover).

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
pkg/tern/local_control_resume.go Loads persisted engine resume state for grouped recovery, rebuilds changes from tasks, and threads/persists resume state through Apply + progress polling.
pkg/tern/local_apply_grouped.go Factors out loadEngineResumeStateForOperation for reuse by grouped recovery logic.
pkg/tern/local_client_test.go Adds unit tests for grouped resume state loading behavior and rebuilt change grouping.
pkg/tern/local_client_integration_test.go Adds integration coverage asserting Vitess grouped recovery reattaches with persisted state and polling persists updated state.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/tern/local_control_resume.go Outdated
@morgo morgo marked this pull request as ready for review June 12, 2026 13:33
@morgo morgo requested review from Kiran01bm and aparajon as code owners June 12, 2026 13:33
@aparajon

Copy link
Copy Markdown
Collaborator

The overall direction is a major safety improvement — persisted-state resume with engine-side validation (re-fetching the deploy request, falling back on deleted/errored ones, diffing the branch rather than replaying DDL) is the right model, and the tests genuinely prove the crash/restart round-trip. One fix needed before merge, one worth considering:

  1. Save failure after a successful reattach terminally fails an in-flight apply — in the post-Accepted path, a saveEngineResumeState error returns a plain error, so the caller's errors.Is(err, errGroupedResumeStateUnavailable) misses it and routes to the failure handler; since the engine-error retry classification doesn't apply to Vitess, the apply is marked terminally failed while the reattached deploy request keeps running on the provider. Storage uncertainty must not become terminal state on an in-flight apply. Suggest wrapping this error (and the empty-Metadata invariant error, which takes the same caller path) in errGroupedResumeStateUnavailable or an equivalent non-terminal exit so the apply stays recoverable.

  2. Vitess fallback on ErrEngineResumeStateNotFound — falling back to the fresh path is mostly safe (branch names are deterministic and branch creation is reuse-on-exists), but state-persistence failures are warn-swallowed in both paths, so "not found" can also mean "persist failed after a deploy request was created," and deploy-request creation has no existing-DR dedupe. Consider failing closed when the apply's state shows setup progressed past branch preparation yet no persisted state exists.

Verdict: needs changes for item 1; happy to re-review quickly.


🤖 This review was generated by Claude Code (claude-fable-5) with maintainer approval.

morgo and others added 3 commits June 12, 2026 11:01
Grouped resume now loads the persisted engine resume state and rebuilds
changes from the stored tasks' namespace/table identity, so recovery
reattaches to in-flight engine work instead of launching a duplicate
schema change and polling without state forever. ResumeApply also
dispatches on the same grouped-apply predicate as initial dispatch, so
non-defer-cutover Vitess applies recover through one grouped engine
apply rather than one fresh apply per task.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
VSchema tasks have no DDL; their work is applying the namespace's
vschema.json. When rebuilding engine changes for a grouped resume,
exclude vschema tasks from TableChanges and instead flag their namespace
with vschema_changed metadata, mirroring the fresh apply path. This
prevents the engine from executing an empty statement and ensures
vschema.json is re-applied on resume for VSchema-only and VSchema+DDL
applies.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…e fails

After the engine accepts a grouped resume, a deploy request is running on
the provider. A failure to persist the returned resume state, or an accepted
resume that returns no progress-tracking metadata, is storage/tracking
uncertainty — not a failed schema change. Both post-accept exits now wrap
errGroupedResumeStateUnavailable so the caller routes them as recoverable and
the apply owner exits without writing terminal failure, leaving the apply for
operator retry against the in-flight work instead of abandoning it.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@morgo morgo force-pushed the fix/resume-persisted-engine-state branch from 495e8f6 to cc950f5 Compare June 12, 2026 17:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants