Skip to content

fix(engine): deploy and rediscover schema change context on PlanetScale resume#306

Open
morgo wants to merge 3 commits into
mainfrom
fix/planetscale-resume-deploy-and-context
Open

fix(engine): deploy and rediscover schema change context on PlanetScale resume#306
morgo wants to merge 3 commits into
mainfrom
fix/planetscale-resume-deploy-and-context

Conversation

@morgo

@morgo morgo commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator

Fixes two correctness bugs in the PlanetScale engine's crash-recovery resume path.

Crash between create and deploy never deployed. When a worker crashed after creating a deploy request but before starting it, resume reattached to the recovered request and returned current state without ever deploying. For a non-deferred apply the request sits in ready forever — Progress maps ready to pending, the waiting_for_deploy promotion only applies to deferred deploys, and the auto-deploy trigger only fires on waiting_for_deploy — so the schema change never ran. Resume now starts the deploy for a recovered request that is ready, non-deferred, and not yet deployed, mirroring the fresh apply path. A deferred or already-in-flight request is left alone.

Resume never discovered the Vitess migration context. The fresh apply path captures a context baseline before deploying and discovers the new migration_context after; the resume path carried the tern apply identifier forward instead. That identifier never matches SHOW VITESS_MIGRATIONS, so per-shard progress was empty for the life of a resumed apply. Resume now captures a baseline before the resume deploy and discovers the real context after, preserving an already-real stored context and never clobbering it with an empty result.

worker crash after create, before deploy
      |
      v
 resume -> deploy request "ready", non-deferred, DeployedAt == nil
      |                              |
      | (before: returned pending)   v  (now: starts deploy)
      x  never deploys          DeployDeployRequest + rediscover context

The shared discovery retry loop is extracted into one helper used by both paths and now honors context cancellation between polls.

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings June 11, 2026 17:10

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes crash-recovery correctness in the PlanetScale engine by ensuring resumed applies (a) actually start a recovered deploy request that was created but never deployed, and (b) re-run Vitess migration_context discovery so per-shard progress can work after resume.

Changes:

  • Extracted a shared migration_context discovery retry helper that honors context cancellation between polls.
  • Updated the resume path to deploy recovered non-deferred deploy requests stuck in ready, and to re-capture/discover Vitess migration_context around resume deploys.
  • Added unit tests for resume deploy behavior and the deploy gating predicate.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
pkg/engine/planetscale/progress.go Adds a retry helper for Vitess migration_context discovery with bounded polling and ctx cancellation.
pkg/engine/planetscale/apply.go Uses the new discovery helper, fixes resume behavior to deploy recovered requests when needed, and resolves migration_context after resume deploys.
pkg/engine/planetscale/apply_test.go Adds tests covering resume deploy gating and resume behavior for several recovered deploy-request states.
Comments suppressed due to low confidence (1)

pkg/engine/planetscale/apply.go:439

  • After discovering the Vitess migration_context, the code returns it in the ApplyResult but never persists it via OnStateChange. If the worker crashes after discovery (or immediately after deploy) but before Apply returns, the stored ResumeState will still contain the tern apply identifier, and subsequent resume/progress will have empty per-shard Vitess progress indefinitely.
	migrationContext := e.discoverMigrationContextWithRetry(ctx, client, req.Database, req.Credentials, existingContexts)

	meta, err := encodePSMetadata(&psMetadata{
		BranchName:       branchName,
		DeployRequestID:  dr.Number,
		DeployRequestURL: dr.HtmlURL,
		IsInstant:        useInstant,
	})
	if err != nil {
		return nil, fmt.Errorf("encode metadata for deploy request #%d: %w", dr.Number, err)
	}

	return &engine.ApplyResult{
		Accepted: true,
		Message:  fmt.Sprintf("Deploy request #%d created", dr.Number),
		ResumeState: &engine.ResumeState{
			MigrationContext: migrationContext,
			Metadata:         meta,
		},

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/engine/planetscale/apply.go Outdated
Comment thread pkg/engine/planetscale/apply.go Outdated
Comment thread pkg/engine/planetscale/apply.go
@morgo morgo marked this pull request as ready for review June 12, 2026 14:31
@morgo morgo requested review from Kiran01bm and aparajon as code owners June 12, 2026 14:31
@aparajon

Copy link
Copy Markdown
Collaborator

Both fixes address real resume gaps, the deploy-gating predicate (DeployedAt == nil + ready + non-deferred) is correct with good tests, and the earlier durability comments were addressed properly. Three changes needed, two nits:

  1. Empty-baseline rediscovery can attach to the wrong context (reattach path in apply.go) — discovery returns the first row not in the baseline, and SHOW VITESS_MIGRATIONS retains completed historical rows. With an empty baseline, any database with prior schema-change history will most likely match an old, completed context from an unrelated change — rendering finished progress while the real deploy is mid-flight. Misleading progress during an incident is worse than no progress. Suggest restricting candidates to non-terminal rows (or rows started after this apply began), and when zero or multiple candidates remain, keeping the stored identifier and warning — empty progress is the safe degraded mode. A test with historical rows present would pin this.

  2. Transient errors fork a second deploy request — the getDeployRequest error branch treats any error (including transient API failures) as "deploy request cleaned up → start fresh," which creates a new branch and deploy request while the original may still be actively deploying. Progress already distinguishes not-found via errors.As — mirror that here, and return transient errors so resume retries instead of forking.

  3. Terminology — repo rule: "schema change", never "migration". Quoting the literal Vitess identifiers (migration_context, SHOW VITESS_MIGRATIONS) is fine, but the PR title and the new helper names (discoverMigrationContextWithRetry, resolveResumeMigrationContext, persistResumeMigrationContext) are this PR's additions — the new log strings already say "schema change context," so suggest aligning the title and helpers.

Nits: the OnStateChange == nil early return in the persist helper is silent (the sibling branch logs Debug — match it); and a one-line comment on the deploy gate noting that a concurrent deploy between the read and the call is self-correcting (the provider rejects a second deploy) would document the TOCTOU reasoning. An integration-level crash-then-resume test with an existing deploy request would also strengthen this beyond the current mocks.

Verdict: needs changes for items 1–3.


🤖 This review was generated by Claude Code (claude-fable-5) with maintainer approval.

morgo and others added 3 commits June 12, 2026 11:02
…esume

When a worker crashes between creating a PlanetScale deploy request and
starting it, the recovered deploy request sits in "ready" forever: Progress
maps "ready" to pending, the deferred-deploy promotion never applies for a
non-deferred apply, and the tern auto-deploy trigger only fires on
waiting_for_deploy. Resume now starts the deploy for a recovered ready,
non-deferred, never-deployed request so the schema change actually runs.

Resume also rediscovers the Vitess migration_context after deploying, instead
of carrying the tern apply identifier forward. The apply identifier never
matches SHOW VITESS_MIGRATIONS, so per-shard progress was empty for the life of
a resumed apply. Discovery captures a baseline before deploy and polls for the
new context after, preserving an already-real stored context and never
clobbering it with an empty result.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…sume

On resume, the freshly discovered Vitess migration_context was only returned
in the ApplyResult, never persisted durably. A crash after the deploy started
but before the resume function returned left stored ResumeState holding the
tern apply identifier, so the next resume had no per-shard Vitess progress.

Persist the resolved context via OnStateChange as soon as it is known (in both
deploy-and-discover resume paths), but only when it is a real Vitess context
(a "<system>:<uuid>" form), never the apply identifier — so a previously
persisted real context is never clobbered. On the reattach-only path, if the
stored value is still the apply identifier, rediscover the context from the
current migrations and persist it; a stored real context is kept as-is.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…y requests

Restrict schema change context rediscovery to non-terminal candidates so an
empty-baseline reattach never attaches progress to an old, completed context
that SHOW VITESS_MIGRATIONS still retains; when zero or multiple candidates
remain, keep the stored identifier and warn rather than render misleading
progress.

Only a genuine not-found from the recovered deploy request starts a fresh
apply; transient API errors now propagate so resume retries against the same
deploy request instead of forking a duplicate branch and deploy.

Rename the resume helpers to schema change context wording per the repo
terminology rule.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@morgo morgo force-pushed the fix/planetscale-resume-deploy-and-context branch from a77641f to 81196f9 Compare June 12, 2026 17:12
@morgo morgo changed the title fix(engine): deploy and rediscover migration context on PlanetScale resume fix(engine): deploy and rediscover schema change context on PlanetScale resume Jun 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants