fix(engine): deploy and rediscover schema change context on PlanetScale resume#306
morgo wants to merge 3 commits into
Pull request overview
This PR fixes crash-recovery correctness in the PlanetScale engine by ensuring resumed applies (a) actually start a recovered deploy request that was created but never deployed, and (b) re-run Vitess migration_context discovery so per-shard progress can work after resume.
Changes:
- Extracted a shared `migration_context` discovery retry helper that honors context cancellation between polls.
- Updated the resume path to deploy recovered non-deferred deploy requests stuck in `ready`, and to re-capture/discover Vitess `migration_context` around resume deploys.
- Added unit tests for resume deploy behavior and the deploy gating predicate.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| pkg/engine/planetscale/progress.go | Adds a retry helper for Vitess migration_context discovery with bounded polling and ctx cancellation. |
| pkg/engine/planetscale/apply.go | Uses the new discovery helper, fixes resume behavior to deploy recovered requests when needed, and resolves migration_context after resume deploys. |
| pkg/engine/planetscale/apply_test.go | Adds tests covering resume deploy gating and resume behavior for several recovered deploy-request states. |
Comments suppressed due to low confidence (1)
pkg/engine/planetscale/apply.go:439
- After discovering the Vitess migration_context, the code returns it in the ApplyResult but never persists it via OnStateChange. If the worker crashes after discovery (or immediately after deploy) but before Apply returns, the stored ResumeState will still contain the tern apply identifier, and subsequent resume/progress will have empty per-shard Vitess progress indefinitely.
```go
migrationContext := e.discoverMigrationContextWithRetry(ctx, client, req.Database, req.Credentials, existingContexts)
meta, err := encodePSMetadata(&psMetadata{
	BranchName:       branchName,
	DeployRequestID:  dr.Number,
	DeployRequestURL: dr.HtmlURL,
	IsInstant:        useInstant,
})
if err != nil {
	return nil, fmt.Errorf("encode metadata for deploy request #%d: %w", dr.Number, err)
}
return &engine.ApplyResult{
	Accepted: true,
	Message:  fmt.Sprintf("Deploy request #%d created", dr.Number),
	ResumeState: &engine.ResumeState{
		MigrationContext: migrationContext,
		Metadata:         meta,
	},
```
Both fixes address real resume gaps, the deploy-gating predicate (
Nits: the
Verdict: needs changes for items 1–3.
🤖 This review was generated by Claude Code (claude-fable-5) with maintainer approval.
…esume

When a worker crashes between creating a PlanetScale deploy request and starting it, the recovered deploy request sits in "ready" forever: Progress maps "ready" to pending, the deferred-deploy promotion never applies for a non-deferred apply, and the tern auto-deploy trigger only fires on waiting_for_deploy. Resume now starts the deploy for a recovered ready, non-deferred, never-deployed request so the schema change actually runs.

Resume also rediscovers the Vitess migration_context after deploying, instead of carrying the tern apply identifier forward. The apply identifier never matches SHOW VITESS_MIGRATIONS, so per-shard progress was empty for the life of a resumed apply. Discovery captures a baseline before deploy and polls for the new context after, preserving an already-real stored context and never clobbering it with an empty result.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
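The gating rule in this commit message (deploy only a recovered request that is ready, non-deferred, and never deployed) can be sketched as a small predicate. The type and field names below are hypothetical stand-ins, not the engine's actual types.

```go
package main

// recoveredDeployRequest is a hypothetical, trimmed-down view of a deploy
// request found during resume. The real PlanetScale client type differs.
type recoveredDeployRequest struct {
	State    string // deploy request state, e.g. "ready"
	Deployed bool   // true once a deploy has ever been started
}

// shouldDeployOnResume reports whether resume should start the deploy itself.
// A deferred apply is left for its own promotion path, and a request that is
// already in flight (or past "ready") is left alone.
func shouldDeployOnResume(dr recoveredDeployRequest, deferred bool) bool {
	return dr.State == "ready" && !deferred && !dr.Deployed
}
```

Keeping the decision in a pure function like this makes each recovered state easy to cover with table-driven unit tests.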
…sume

On resume, the freshly discovered Vitess migration_context was only returned in the ApplyResult, never persisted durably. A crash after the deploy started but before the resume function returned left stored ResumeState holding the tern apply identifier, so the next resume had no per-shard Vitess progress.

Persist the resolved context via OnStateChange as soon as it is known (in both deploy-and-discover resume paths), but only when it is a real Vitess context (a "<system>:<uuid>" form), never the apply identifier, so a previously persisted real context is never clobbered. On the reattach-only path, if the stored value is still the apply identifier, rediscover the context from the current migrations and persist it; a stored real context is kept as-is.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
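A rough sketch of the "persist only a real context" rule follows. The exact UUID layout inside "<system>:<uuid>" is not given here, so the pattern is an assumption: it accepts hex groups separated by hyphens or underscores. `isRealContext` and `chooseContext` are hypothetical names, not the PR's helpers.

```go
package main

import "regexp"

// contextPattern matches "<system>:<uuid>". The UUID separator style is an
// assumption, so both "-" and "_" are accepted between the hex groups.
var contextPattern = regexp.MustCompile(
	`^[A-Za-z0-9_.-]+:[0-9a-fA-F]{8}([-_][0-9a-fA-F]{4}){3}[-_][0-9a-fA-F]{12}$`)

// isRealContext reports whether s looks like a Vitess context and not an
// apply identifier or an empty discovery result.
func isRealContext(s string) bool {
	return contextPattern.MatchString(s)
}

// chooseContext returns the value to keep and whether it should be persisted.
// A stored real context always wins. Otherwise a real discovered context
// replaces the stored value. An empty or non-matching discovery changes nothing.
func chooseContext(stored, discovered string) (value string, persist bool) {
	if isRealContext(stored) {
		return stored, false
	}
	if isRealContext(discovered) {
		return discovered, true
	}
	return stored, false
}
```

In this sketch the caller would invoke its state-change callback only when `persist` is true, which keeps an empty discovery from overwriting anything.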
…y requests

Restrict schema change context rediscovery to non-terminal candidates so an empty-baseline reattach never attaches progress to an old, completed context that SHOW VITESS_MIGRATIONS still retains; when zero or multiple candidates remain, keep the stored identifier and warn rather than render misleading progress.

Only a genuine not-found from the recovered deploy request starts a fresh apply; transient API errors now propagate so resume retries against the same deploy request instead of forking a duplicate branch and deploy.

Rename the resume helpers to schema change context wording per the repo terminology rule.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
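The candidate filtering described in this commit message might look like the sketch below. The `migrationRow` type, the function name, and the set of terminal statuses ("complete", "failed", "cancelled") are assumptions for illustration; the PR's actual status handling is not shown here.

```go
package main

// migrationRow is a hypothetical projection of one SHOW VITESS_MIGRATIONS row.
type migrationRow struct {
	Context string // migration_context column
	Status  string // migration status column
}

// terminalStatuses is an assumed set of statuses that mean "finished".
var terminalStatuses = map[string]bool{
	"complete":  true,
	"failed":    true,
	"cancelled": true,
}

// rediscoverContext keeps the stored identifier unless exactly one distinct
// non-terminal context exists. ok is false when the caller should warn and
// keep the stored value (zero or multiple candidates).
func rediscoverContext(stored string, rows []migrationRow) (result string, ok bool) {
	seen := map[string]bool{}
	var candidates []string
	for _, r := range rows {
		if r.Context == "" || terminalStatuses[r.Status] || seen[r.Context] {
			continue
		}
		seen[r.Context] = true
		candidates = append(candidates, r.Context)
	}
	if len(candidates) != 1 {
		return stored, false
	}
	return candidates[0], true
}
```

Refusing to guess when the result is ambiguous trades some missing progress detail for never showing progress that belongs to a different schema change.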
a77641f to 81196f9
Fixes two correctness bugs in the PlanetScale engine's crash-recovery resume path.

Crash between create and deploy never deployed. When a worker crashed after creating a deploy request but before starting it, resume reattached to the recovered request and returned current state without ever deploying. For a non-deferred apply the request sits in `ready` forever: Progress maps `ready` to pending, the `waiting_for_deploy` promotion only applies to deferred deploys, and the auto-deploy trigger only fires on `waiting_for_deploy`, so the schema change never ran. Resume now starts the deploy for a recovered request that is `ready`, non-deferred, and not yet deployed, mirroring the fresh apply path. A deferred or already-in-flight request is left alone.

Resume never discovered the Vitess migration context. The fresh apply path captures a context baseline before deploying and discovers the new `migration_context` after; the resume path carried the tern apply identifier forward instead. That identifier never matches `SHOW VITESS_MIGRATIONS`, so per-shard progress was empty for the life of a resumed apply. Resume now captures a baseline before the resume deploy and discovers the real context after, preserving an already-real stored context and never clobbering it with an empty result.

The shared discovery retry loop is extracted into one helper used by both paths and now honors context cancellation between polls.
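
The baseline-then-discover step described above can be pictured as a set difference: record the contexts that exist before the deploy, then look for one that was not there before. This is only a sketch with hypothetical function names, and it assumes the first unseen context is the one of interest.

```go
package main

// captureBaseline records the contexts visible before the deploy starts.
func captureBaseline(contexts []string) map[string]struct{} {
	base := make(map[string]struct{}, len(contexts))
	for _, c := range contexts {
		base[c] = struct{}{}
	}
	return base
}

// findNewContext returns the first non-empty context that is absent from the
// baseline, or "" when nothing new has appeared yet (the caller would poll
// again in that case).
func findNewContext(base map[string]struct{}, current []string) string {
	for _, c := range current {
		if c == "" {
			continue
		}
		if _, existed := base[c]; !existed {
			return c
		}
	}
	return ""
}
```

Taking the baseline before the deploy is what distinguishes the new context from older ones that the migrations listing still retains.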
🤖 Generated with Claude Code