fix(operator): resume recovered applies with persisted engine state#265
fix(operator): resume recovered applies with persisted engine state#265morgo wants to merge 3 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes restart recovery for grouped applies so SchemaBot resumes in-flight engine work using the persisted engine resume state, rather than accidentally starting duplicate schema changes (notably for Vitess/PlanetScale deploy requests). It also improves progress polling by threading the engine-provided resume state through polling and persisting updates as they arrive.
Changes:
- Load and pass persisted per-apply-operation engine
ResumeStateinto grouped resumeApply, and persist any updated resume state returned by the engine. - Rebuild grouped resume
Changesfrom stored tasks (namespace/table/DDL) so per-table progress mapping is consistent after recovery. - Resume Vitess applies via the grouped-apply path whenever
usesGroupedApply(apply)is true (not only defer-cutover).
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| pkg/tern/local_control_resume.go | Loads persisted engine resume state for grouped recovery, rebuilds changes from tasks, and threads/persists resume state through Apply + progress polling. |
| pkg/tern/local_apply_grouped.go | Factors out loadEngineResumeStateForOperation for reuse by grouped recovery logic. |
| pkg/tern/local_client_test.go | Adds unit tests for grouped resume state loading behavior and rebuilt change grouping. |
| pkg/tern/local_client_integration_test.go | Adds integration coverage asserting Vitess grouped recovery reattaches with persisted state and polling persists updated state. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
The overall direction is a major safety improvement — persisted-state resume with engine-side validation (re-fetching the deploy request, falling back on deleted/errored ones, diffing the branch rather than replaying DDL) is the right model, and the tests genuinely prove the crash/restart round-trip. One fix needed before merge, one worth considering:
Verdict: needs changes for item 1; happy to re-review quickly. 🤖 This review was generated by Claude Code (claude-fable-5) with maintainer approval. |
Grouped resume now loads the persisted engine resume state and rebuilds changes from the stored tasks' namespace/table identity, so recovery reattaches to in-flight engine work instead of launching a duplicate schema change and polling without state forever. ResumeApply also dispatches on the same grouped-apply predicate as initial dispatch, so non-defer-cutover Vitess applies recover through one grouped engine apply rather than one fresh apply per task. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
VSchema tasks have no DDL; their work is applying the namespace's vschema.json. When rebuilding engine changes for a grouped resume, exclude vschema tasks from TableChanges and instead flag their namespace with vschema_changed metadata, mirroring the fresh apply path. This prevents the engine from executing an empty statement and ensures vschema.json is re-applied on resume for VSchema-only and VSchema+DDL applies. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…e fails After the engine accepts a grouped resume, a deploy request is running on the provider. A failure to persist the returned resume state, or an accepted resume that returns no progress-tracking metadata, is storage/tracking uncertainty — not a failed schema change. Both post-accept exits now wrap errGroupedResumeStateUnavailable so the caller routes them as recoverable and the apply owner exits without writing terminal failure, leaving the apply for operator retry against the in-flight work instead of abandoning it. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
495e8f6 to
cc950f5
Compare
Restart recovery of grouped applies ignored the engine resume state SchemaBot had persisted: the grouped resume path called the engine with a context-only
ResumeState, polled progress with nil resume state, sent bare DDL strings without namespace/table identity, andResumeApplyonly used the grouped path for defer-cutover applies. For Vitess this meant a recovered apply opened a new branch and deploy request — duplicating the still-in-flight one — and then polled "no active schema change" forever while its heartbeat kept the lease alive.Apply, so the engine reattaches to its existing work. Absent state is expected for Spirit (it reattaches via durable checkpoints keyed by the schema change context); a storage read failure aborts the recovery attempt without writing terminal state so a later owner retries against intact storage.vschema_changedmetadata, matching the fresh apply path so the engine re-appliesvschema.jsonon resume instead of executing an empty statement.ResumeApplydispatches on the same grouped-apply predicate as initial dispatch, so non-defer-cutover Vitess applies recover through one grouped engine apply instead of one fresh apply per task.🤖 Generated with Claude Code