Skip to content

fix(vitess): recover late migration context#232

Draft
aparajon wants to merge 3 commits into
mainfrom
armand/planetscale-progress-context
Draft

fix(vitess): recover late migration context#232
aparajon wants to merge 3 commits into
mainfrom
armand/planetscale-progress-context

Conversation

@aparajon

@aparajon aparajon commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator

Why

SchemaBot can sometimes create/start a PlanetScale deploy request before Vitess exposes the deploy's migration_context in SHOW VITESS_MIGRATIONS.

When that happens, SchemaBot still knows the deploy request ID, but it cannot show the richer Vitess progress view: per-table/per-shard progress, row counts, ETA, and related migration details. Without a recovery path, progress can stay limited to coarse deploy-request status even after the Vitess rows appear.

What

  • Save the set of Vitess migration contexts that existed before the deploy starts.
  • Persist that baseline on vitess_apply_data so it survives normal progress polling and restarts.
  • If progress polling later sees deploy metadata but no migration_context, query SHOW VITESS_MIGRATIONS again and recover a context from current rows.
  • Select recovered contexts deterministically: ignore baseline contexts, prefer the earliest requested_timestamp at or after the deploy request creation time, and avoid guessing when multiple untimed new contexts exist.
  • Persist a recovered context from both the active apply worker loop and read/API progress path so recovery survives worker restarts and scheduler owner handoff.
  • Store Vitess row timestamps with each baseline context for operator debugging.

Risk Assessment

Low. This only improves PlanetScale progress enrichment when resume state is missing migration_context; deploy-request state remains the source of truth for apply state.

The storage change is additive: a nullable JSON column on vitess_apply_data.

Generated with Amp

Copilot AI review requested due to automatic review settings June 4, 2026 19:42

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves PlanetScale/Vitess progress enrichment by re-discovering a missing Vitess migration_context during Progress() polling (when the persisted resume state lacks it), using deploy/request timestamps to select the most relevant context and adding regression tests for the selection logic.

Changes:

  • Add fallback migration context discovery in Engine.Progress() when ResumeState.MigrationContext is empty.
  • Introduce timestamp-based context selection helpers (migrationContextDiscoveryAfter, latestMigrationContextAfter) and parse requested_timestamp from SHOW VITESS_MIGRATIONS.
  • Add tests covering context cutoff selection and discovery “after” timestamp behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
pkg/engine/planetscale/progress.go Adds “late” migration context discovery during polling and implements timestamp-based context selection.
pkg/engine/planetscale/progress_test.go Adds regression tests for context cutoff selection and discovery-after timestamp computation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/engine/planetscale/progress.go Outdated
Comment thread pkg/engine/planetscale/progress_test.go Outdated
@aparajon aparajon force-pushed the armand/planetscale-progress-context branch 5 times, most recently from 317afd5 to fd31453 Compare June 4, 2026 22:01
Co-authored-by: Amp <amp@ampcode.com>
@aparajon aparajon force-pushed the armand/planetscale-progress-context branch from fd31453 to ca91b28 Compare June 4, 2026 22:31
@aparajon aparajon marked this pull request as ready for review June 4, 2026 22:37
@aparajon aparajon requested review from Kiran01bm and morgo as code owners June 4, 2026 22:37
@aparajon aparajon changed the title fix(planetscale): recover late migration context fix(vitess): recover late migration context Jun 4, 2026
Comment thread pkg/engine/planetscale/progress.go Outdated
Comment on lines +86 to +89
req.ResumeState = &engine.ResumeState{
MigrationContext: migrationContext,
Metadata: req.ResumeState.Metadata,
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps we better persist the recovered number somewhere - ex: in a column that exists (migration_context) so that it turns the per-poll 9-query sweep into a one-time cost ?

example impact summary from my agent session:

Vitess DB with 8 keyspaces; Apply Z hit the late-context bug → stored migration_context = ""

Every CLI `schemabot status` / UI poll for Apply Z:
   BuildResumeState(vad)  →  migration_context still ""  (never persisted)
   → discoverMigrationContext:
        ListKeyspaces            (1 PlanetScale API call)
        SHOW VITESS_MIGRATIONS   × 8 keyspaces  (8 vtgate queries)
   → result thrown away, not saved
   → next poll repeats the same 9 queries

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call — this is resolved now in 7db975b2. LocalClient.Progress persists a recovered ResumeState.MigrationContext back into vitess_apply_data.migration_context when the stored value is missing/different, so the expensive context discovery sweep becomes a one-time recovery path instead of repeating on every poll. I also added a regression test for that path.

Co-authored-by: Amp <amp@ampcode.com>
@aparajon aparajon force-pushed the armand/planetscale-progress-context branch from 384004d to 7db975b Compare June 5, 2026 02:27
@aparajon

aparajon commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator Author

I checked compatibility with #242 by simulating a local merge of this branch into the opaque-resume-state branch. The two PRs are conceptually compatible, but this PR is not merge-compatible as-is.

The merge currently conflicts in:

  • pkg/storage/types.go
  • pkg/tern/local_apply_grouped.go
  • pkg/tern/local_client.go
  • pkg/tern/local_client_test.go
  • pkg/tern/local_control.go

The main semantic overlap is the persistence path:

  • This PR recovers a late Vitess migration_context and persists that recovery through vitess_apply_data.
  • feat(engine): persist opaque resume state #242 makes engine_resume_states the durable source of truth for opaque engine state, and stops rebuilding engine state from vitess_apply_data for control/progress calls.

I think the engine-side idea in this PR is still the right follow-up, but the tern/storage side should be adapted after #242:

  1. Keep the PlanetScale engine changes:

    • include the pre-deploy existing_migration_contexts baseline in ResumeState.Metadata
    • use deploy-request creation time to select the right newly observed Vitess context
    • return an updated ResumeState from Progress when the context is recovered
  2. Drop or rewrite the vitess_apply_data durability changes:

    • no need for existing_migration_contexts in vitess_apply_data as an engine input
    • no need for persistRecoveredVitessMigrationContext(...)
    • feat(engine): persist opaque resume state #242 already persists updated result.ResumeState from progress polling into engine_resume_states
  3. Keep vitess_apply_data projection-only if we still want operator/UI metadata there, but it should stay non-authoritative for engine calls.

Recommended order: merge #242 first, then rebase/rework this PR on top of it. That gives this PR a cleaner golden path:

PlanetScale progress recovers migration_context
        │
        ▼
returns updated opaque ResumeState
        │
        ▼
SchemaBot persists it in engine_resume_states
        │
        ▼
future progress/control calls load engine_resume_states directly

@aparajon aparajon marked this pull request as draft June 8, 2026 17:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants