fix(engine): deploy and rediscover schema change context on PlanetScale resume by morgo · Pull Request #306 · block/schemabot

morgo · 2026-06-11T17:10:49Z

Fixes two correctness bugs in the PlanetScale engine's crash-recovery resume path.

Crash between create and deploy never deployed. When a worker crashed after creating a deploy request but before starting it, resume reattached to the recovered request and returned current state without ever deploying. For a non-deferred apply the request sits in ready forever — Progress maps ready to pending, the waiting_for_deploy promotion only applies to deferred deploys, and the auto-deploy trigger only fires on waiting_for_deploy — so the schema change never ran. Resume now starts the deploy for a recovered request that is ready, non-deferred, and not yet deployed, mirroring the fresh apply path. A deferred or already-in-flight request is left alone.

Resume never discovered the Vitess migration context. The fresh apply path captures a context baseline before deploying and discovers the new migration_context after; the resume path carried the tern apply identifier forward instead. That identifier never matches SHOW VITESS_MIGRATIONS, so per-shard progress was empty for the life of a resumed apply. Resume now captures a baseline before the resume deploy and discovers the real context after, preserving an already-real stored context and never clobbering it with an empty result.

worker crash after create, before deploy
      |
      v
 resume -> deploy request "ready", non-deferred, DeployedAt == nil
      |                              |
      | (before: returned pending)   v  (now: starts deploy)
      x  never deploys          DeployDeployRequest + rediscover context

The shared discovery retry loop is extracted into one helper used by both paths and now honors context cancellation between polls.

🤖 Generated with Claude Code

Copilot

Pull request overview

This PR fixes crash-recovery correctness in the PlanetScale engine by ensuring resumed applies (a) actually start a recovered deploy request that was created but never deployed, and (b) re-run Vitess migration_context discovery so per-shard progress can work after resume.

Changes:

Extracted a shared migration_context discovery retry helper that honors context cancellation between polls.
Updated the resume path to deploy recovered non-deferred deploy requests stuck in ready, and to re-capture/discover Vitess migration_context around resume deploys.
Added unit tests for resume deploy behavior and the deploy gating predicate.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
pkg/engine/planetscale/progress.go	Adds a retry helper for Vitess `migration_context` discovery with bounded polling and ctx cancellation.
pkg/engine/planetscale/apply.go	Uses the new discovery helper, fixes resume behavior to deploy recovered requests when needed, and resolves `migration_context` after resume deploys.
pkg/engine/planetscale/apply_test.go	Adds tests covering resume deploy gating and resume behavior for several recovered deploy-request states.

Comments suppressed due to low confidence (1)

pkg/engine/planetscale/apply.go:439

After discovering the Vitess migration_context, the code returns it in the ApplyResult but never persists it via OnStateChange. If the worker crashes after discovery (or immediately after deploy) but before Apply returns, the stored ResumeState will still contain the tern apply identifier, and subsequent resume/progress will have empty per-shard Vitess progress indefinitely.

	migrationContext := e.discoverMigrationContextWithRetry(ctx, client, req.Database, req.Credentials, existingContexts)

	meta, err := encodePSMetadata(&psMetadata{
		BranchName:       branchName,
		DeployRequestID:  dr.Number,
		DeployRequestURL: dr.HtmlURL,
		IsInstant:        useInstant,
	})
	if err != nil {
		return nil, fmt.Errorf("encode metadata for deploy request #%d: %w", dr.Number, err)
	}

	return &engine.ApplyResult{
		Accepted: true,
		Message:  fmt.Sprintf("Deploy request #%d created", dr.Number),
		ResumeState: &engine.ResumeState{
			MigrationContext: migrationContext,
			Metadata:         meta,
		},

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

aparajon · 2026-06-12T16:48:42Z

Both fixes address real resume gaps, the deploy-gating predicate (DeployedAt == nil + ready + non-deferred) is correct with good tests, and the earlier durability comments were addressed properly. Three changes needed, two nits:

Empty-baseline rediscovery can attach to the wrong context (reattach path in apply.go) — discovery returns the first row not in the baseline, and SHOW VITESS_MIGRATIONS retains completed historical rows. With an empty baseline, any database with prior schema-change history will most likely match an old, completed context from an unrelated change — rendering finished progress while the real deploy is mid-flight. Misleading progress during an incident is worse than no progress. Suggest restricting candidates to non-terminal rows (or rows started after this apply began), and when zero or multiple candidates remain, keeping the stored identifier and warning — empty progress is the safe degraded mode. A test with historical rows present would pin this.
Transient errors fork a second deploy request — the getDeployRequest error branch treats any error (including transient API failures) as "deploy request cleaned up → start fresh," which creates a new branch and deploy request while the original may still be actively deploying. Progress already distinguishes not-found via errors.As — mirror that here, and return transient errors so resume retries instead of forking.
Terminology — repo rule: "schema change", never "migration". Quoting the literal Vitess identifiers (migration_context, SHOW VITESS_MIGRATIONS) is fine, but the PR title and the new helper names (discoverMigrationContextWithRetry, resolveResumeMigrationContext, persistResumeMigrationContext) are this PR's additions — the new log strings already say "schema change context," so suggest aligning the title and helpers.

Nits: the OnStateChange == nil early return in the persist helper is silent (the sibling branch logs Debug — match it); and a one-line comment on the deploy gate noting that a concurrent deploy between the read and the call is self-correcting (the provider rejects a second deploy) would document the TOCTOU reasoning. An integration-level crash-then-resume test with an existing deploy request would also strengthen this beyond the current mocks.

Verdict: needs changes for items 1–3.

🤖 This review was generated by Claude Code (claude-fable-5) with maintainer approval.

…esume When a worker crashes between creating a PlanetScale deploy request and starting it, the recovered deploy request sits in "ready" forever: Progress maps "ready" to pending, the deferred-deploy promotion never applies for a non-deferred apply, and the tern auto-deploy trigger only fires on waiting_for_deploy. Resume now starts the deploy for a recovered ready, non-deferred, never-deployed request so the schema change actually runs. Resume also rediscovers the Vitess migration_context after deploying, instead of carrying the tern apply identifier forward. The apply identifier never matches SHOW VITESS_MIGRATIONS, so per-shard progress was empty for the life of a resumed apply. Discovery captures a baseline before deploy and polls for the new context after, preserving an already-real stored context and never clobbering it with an empty result. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…sume On resume, the freshly discovered Vitess migration_context was only returned in the ApplyResult, never persisted durably. A crash after the deploy started but before the resume function returned left stored ResumeState holding the tern apply identifier, so the next resume had no per-shard Vitess progress. Persist the resolved context via OnStateChange as soon as it is known (in both deploy-and-discover resume paths), but only when it is a real Vitess context (a "<system>:<uuid>" form), never the apply identifier — so a previously persisted real context is never clobbered. On the reattach-only path, if the stored value is still the apply identifier, rediscover the context from the current migrations and persist it; a stored real context is kept as-is. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…y requests Restrict schema change context rediscovery to non-terminal candidates so an empty-baseline reattach never attaches progress to an old, completed context that SHOW VITESS_MIGRATIONS still retains; when zero or multiple candidates remain, keep the stored identifier and warn rather than render misleading progress. Only a genuine not-found from the recovered deploy request starts a fresh apply; transient API errors now propagate so resume retries against the same deploy request instead of forking a duplicate branch and deploy. Rename the resume helpers to schema change context wording per the repo terminology rule. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings June 11, 2026 17:10

Copilot started reviewing on behalf of morgo June 11, 2026 17:11 View session

Copilot AI reviewed Jun 11, 2026

View reviewed changes

Comment thread pkg/engine/planetscale/apply.go Outdated

Comment thread pkg/engine/planetscale/apply.go Outdated

Comment thread pkg/engine/planetscale/apply.go

morgo marked this pull request as ready for review June 12, 2026 14:31

morgo requested review from Kiran01bm and aparajon as code owners June 12, 2026 14:31

morgo and others added 3 commits June 12, 2026 11:02

morgo force-pushed the fix/planetscale-resume-deploy-and-context branch from a77641f to 81196f9 Compare June 12, 2026 17:12

morgo changed the title ~~fix(engine): deploy and rediscover migration context on PlanetScale resume~~ fix(engine): deploy and rediscover schema change context on PlanetScale resume Jun 12, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(engine): deploy and rediscover schema change context on PlanetScale resume#306

fix(engine): deploy and rediscover schema change context on PlanetScale resume#306
morgo wants to merge 3 commits into
mainfrom
fix/planetscale-resume-deploy-and-context

morgo commented Jun 11, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

aparajon commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

morgo commented Jun 11, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

aparajon commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants