# Schema/index migration support: keep SearchSchema and the search index from drifting during a deploy (#531)

## Problem

`@lde/search` couples two artefacts that change together but deploy independently: the `SearchSchema` (which drives the derived GraphQL SDL via `@lde/search-api-graphql`) and the physical search index (the Typesense collection built by the engine). A config change ships the new SDL the instant the API binary deploys, but the index only reflects the change once a reindex completes. During that window `SearchSchema` and the live index drift, and there is currently no shared guidance or primitive for doing the change safely — every consumer (Dataset Register first) has to reinvent it.

The drift is asymmetric and direction-dependent:

- **Adding** a field — the index must lead the SDL. An API that exposes a field the index lacks is the dangerous direction (queries fail or return nulls).
- **Removing** a field — the SDL must lead. Stop exposing it, then drop it from the index.
- **In-place upsert** (where supported) makes the drift worse than under blue/green: it produces a long-lived mixed population where some documents carry the new field and some don’t, with no clean cutover instant. Blue/green keeps each collection internally consistent (wholly vN+1) and gives a single alias-swap cutover point.
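
The two directions above can be written down as ordered step lists. This is only an illustrative sketch: the step names, `cutoverPlan` and `indexLeads` are hypothetical and are not part of `@lde/search`.

```typescript
// Hypothetical names throughout; nothing here is an existing @lde/search API.
type FieldChange = 'add' | 'remove';

type Step =
  | 'rebuild-index-with-field'
  | 'rebuild-index-without-field'
  | 'swap-alias'
  | 'deploy-sdl-with-field'
  | 'deploy-sdl-without-field';

// Adding: the index leads the SDL. Removing: the SDL leads the index.
const CUTOVER_PLAN: Record<FieldChange, readonly Step[]> = {
  add: ['rebuild-index-with-field', 'swap-alias', 'deploy-sdl-with-field'],
  remove: ['deploy-sdl-without-field', 'rebuild-index-without-field', 'swap-alias'],
};

function cutoverPlan(change: FieldChange): readonly Step[] {
  return CUTOVER_PLAN[change];
}

// True when the alias swap happens before the SDL deploy in the given plan.
function indexLeads(plan: readonly Step[]): boolean {
  const swap = plan.indexOf('swap-alias');
  let deploy = -1;
  for (let i = 0; i < plan.length; i++) {
    const step = plan[i];
    if (step === 'deploy-sdl-with-field' || step === 'deploy-sdl-without-field') {
      deploy = i;
      break;
    }
  }
  return swap !== -1 && deploy !== -1 && swap < deploy;
}
```

Encoding the order as data rather than prose would let a deploy script or review checklist assert it instead of relying on memory.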

## Framing: which coupling to keep, which to relax

Two different kinds of coupling are bundled here, and the design should treat them oppositely:

- **Derivation coupling** — one `SearchSchema` definition projects into *both* the GraphQL non-null flag (`build-schema.ts`) and the Typesense `optional` flag (`collection-schema.ts`). The same `required` bit drives both. **Keep this.** It is the single source of truth that prevents *semantic* drift; hand-maintaining a GraphQL schema and a Typesense schema separately guarantees they diverge (e.g. a field marked non-null in the API but not required in the index → runtime null-violation). This is a static, build-time coupling — the benign kind.
- **Temporal coupling** — today all projections must flip at the same instant, with only one schema version ever live (*simultaneity* coupling). This is what blocks migrations. The goal is not to eliminate temporal coupling (a rename inherently requires “backfill before cutover before retire”) but to **relax rigid simultaneity into explicit ordered sequencing** (connascence of execution order). The cutover-ordering contract below *is* a specification of that ordered temporal coupling.

One-line statement of intent: **keep the static derivation coupling; relax the rigid simultaneous temporal coupling into explicit ordered temporal coupling, with a transitional window where two versions coexist.**
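
To make the derivation coupling concrete, here is a minimal sketch of one `required` bit projecting into both sides. `FieldDef`, `toTypesenseField` and `toGraphQLFieldSdl` are hypothetical stand-ins, not the actual code in `collection-schema.ts` or `build-schema.ts`, and the sketch covers only scalar string and integer fields.

```typescript
// Hypothetical, simplified field definition; the real SearchSchema type is richer.
interface FieldDef {
  name: string;
  type: 'string' | 'int32';
  required: boolean;
}

interface TypesenseField {
  name: string;
  type: string;
  optional: boolean;
}

// Index projection: `required` becomes Typesense `optional: false`.
function toTypesenseField(field: FieldDef): TypesenseField {
  return { name: field.name, type: field.type, optional: !field.required };
}

const GRAPHQL_SCALARS: Record<FieldDef['type'], string> = {
  string: 'String',
  int32: 'Int',
};

// API projection: the same `required` bit becomes the GraphQL non-null marker.
function toGraphQLFieldSdl(field: FieldDef): string {
  return field.name + ': ' + GRAPHQL_SCALARS[field.type] + (field.required ? '!' : '');
}
```

Because both functions read the same `required` flag, a field cannot end up non-null in the API while optional in the index, which is the semantic drift the derivation coupling is meant to prevent.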

## Proposal

Treat a `SearchSchema` change like a database migration. Provide, in `@lde/search`, the model and helpers that make this the obvious path:

1. **Classify a schema change** as additive-nullable (safe to ship the SDL immediately) vs. breaking (rename, retype, remove, tighten-to-required, changed match/filter/sort semantics — must span ≥2 releases via expand/contract). Classify at the **`SearchSchema` level**, not the GraphQL level: that single diff is what derives *both* projections, so it is the only place that can reason about API *and* index consequences together and emit a cutover order (`index-first` / `api-first` / `multi-release`).
   - graphql-js already ships `findBreakingChanges` / `findDangerousChanges`. Running them over `buildSearchSchema(prev)` vs `buildSearchSchema(next)` gives the **GraphQL-contract half almost for free**. Wire it in as a belt-and-suspenders cross-check on the GraphQL projection; it also catches a future `buildSearchSchema` regression that silently alters the SDL, a job it shares with the existing `printSearchSchema` snapshot test.
   - Note the limits: `printSchema` is a *printer, not a differ* — the snapshot test is a tripwire that detects-but-doesn’t-classify. And `findBreakingChanges` reasons only in GraphQL-consumer terms; it knows nothing about the **index half** (Typesense `optional: false` rejecting documents, the data-completeness gate, the `sh:minCount` precondition). That index half has no graphql-js equivalent and is the net-new piece to write.
2. **New fields are `optional`** in both the collection schema and the GraphQL type — never mocked with empty strings (indistinguishable from real empty values; corrupts facet counts and sorts). This is also the concrete fix for the engine erroring when a sort/facet field is absent on some documents.
3. **Tightening an optional field to `required`** is a breaking change, not an additive one — it is the search-index equivalent of adding `NOT NULL`. It is gated on *data completeness*, not on “a reindex finished”: marking `required` sets Typesense `optional: false` (the next rebuild rejects any document missing the value) and wraps the GraphQL field in `GraphQLNonNull` (query-time null-violation for any document missing it). Safe path: ship the field `optional`, complete a rebuild, verify zero nulls across the whole collection — ideally backed by the source SHACL shape enforcing `sh:minCount ≥ 1`, which is the real “if applicable” gate — *then* flip to `required`. (`required` is moot for arrays/booleans/`id`, which are non-null regardless.)
4. **Document the cutover-ordering contract** (index-leads-for-add, SDL-leads-for-remove) and the recommendation to reindex into the new collection, then cut over API + alias together — collapsing the drift window to ~zero for fast rebuilds.
5. **Keep the alias swap as the single cutover point** so a later async-reindex orchestrator (background reindex, queue the schema change, go live on completion) can be layered on without redesign. That orchestrator is out of scope here — it only earns its keep once a rebuild is too long to block a deploy and the additive-nullable drift window becomes unacceptable.
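
A classifier along the lines of step 1 could look roughly like the following. This is a sketch under stated assumptions: `Schema`, `FieldDef`, `ChangeKind` and `classify` are hypothetical, the schema is reduced to a name-to-field map, and the mapping from change kinds to a cutover order is one plausible reading of the contract rather than a settled design. A rename shows up here as a `remove` plus an `add-optional`, changed match/filter/sort semantics are not modelled, and treating `relax-to-optional` as breaking is a conservative assumption of the sketch, not something this proposal specifies.

```typescript
// Hypothetical, simplified model of a SearchSchema: field name -> definition.
interface FieldDef {
  type: string;
  required: boolean;
}
type Schema = { [name: string]: FieldDef };

type ChangeKind =
  | 'add-optional'
  | 'add-required'
  | 'remove'
  | 'retype'
  | 'tighten-to-required'
  | 'relax-to-optional';
type CutoverOrder = 'index-first' | 'api-first' | 'multi-release';

interface Change {
  field: string;
  kind: ChangeKind;
}
interface Classification {
  changes: Change[];
  breaking: boolean;
  order: CutoverOrder | null; // null when the schemas are identical
}

function has(schema: Schema, name: string): boolean {
  return Object.prototype.hasOwnProperty.call(schema, name);
}

// Pure function over two schemas: no I/O, no engine access.
function classify(prev: Schema, next: Schema): Classification {
  const changes: Change[] = [];
  for (const name of Object.keys(next)) {
    const after = next[name];
    if (!has(prev, name)) {
      changes.push({ field: name, kind: after.required ? 'add-required' : 'add-optional' });
      continue;
    }
    const before = prev[name];
    if (before.type !== after.type) {
      changes.push({ field: name, kind: 'retype' });
    } else if (!before.required && after.required) {
      changes.push({ field: name, kind: 'tighten-to-required' });
    } else if (before.required && !after.required) {
      changes.push({ field: name, kind: 'relax-to-optional' });
    }
  }
  for (const name of Object.keys(prev)) {
    if (!has(next, name)) changes.push({ field: name, kind: 'remove' });
  }

  // Only additive-nullable changes count as non-breaking.
  const breaking = changes.some((c) => c.kind !== 'add-optional');
  let order: CutoverOrder | null = null;
  if (changes.length > 0) {
    if (changes.every((c) => c.kind === 'add-optional')) order = 'index-first';
    else if (changes.every((c) => c.kind === 'remove')) order = 'api-first';
    else order = 'multi-release';
  }
  return { changes, breaking, order };
}
```

The graphql-js cross-check and the data-completeness gate from step 3 would sit alongside this function rather than inside it, since the first needs the built GraphQL schema and the second needs access to the live collection.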

## Scope: build the classifier, not transform hooks

Distinguish **classify/verify functions** (pure, cheap, valuable now) from **transform/execute hooks** (stateful, premature):

- Build the **classifier** now. It is a pure function over two `SearchSchema`s; it encodes this contract as executable code rather than prose, gates CI (a breaking diff without the version bump / multi-release plan fails the build), and is the substrate every later capability builds on. The classifier is the one piece worth front-running, because it is a contract specification, not an abstraction over instances.
- Do **not** add Rails-style `up`/`down` transform hooks yet. Under blue/green full rebuild there is no data to migrate — every document is re-derived from source through the new schema and the old collection is discarded, so a transform hook would be dead code. Even a new *derived* field needs none (derivation already lives in the schema; a rebuild computes it for free). A transform hook only has a job under **in-place upserts**.
- General rule: don’t generalise migration machinery until there are ≥2–3 real migrations to generalise *from* — otherwise it abstracts a pattern of one.
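
The CI gate mentioned above could be a thin layer over the classifier's output. In this sketch `ClassifiedChange`, `GateResult` and `gateSchemaChange` are hypothetical names, and how the acknowledgement (a version bump or a recorded multi-release plan) is detected is deliberately left open.

```typescript
// Hypothetical shape of one classified change, as a classifier might emit it.
interface ClassifiedChange {
  field: string;
  breaking: boolean;
}

interface GateResult {
  ok: boolean;
  failures: string[];
}

// `acknowledged` stands for "a version bump or multi-release plan accompanies this diff".
function gateSchemaChange(changes: ClassifiedChange[], acknowledged: boolean): GateResult {
  const failures: string[] = acknowledged
    ? []
    : changes
        .filter((c) => c.breaking)
        .map((c) => 'Breaking change to "' + c.field + '" needs a version bump or multi-release plan');
  return { ok: failures.length === 0, failures };
}
```

A CI script would call this with the diff between the schema on the main branch and the schema in the pull request, and fail the build when `ok` is false.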

### When the heavier machinery starts paying off

Today (single consumer, full blue/green rebuilds) even a breaking change is just “deploy new schema → full rebuild → swap alias,” with the only exposure being rebuild duration. The transitional-superset / async-orchestrator / transform-hook investment earns its keep only when one of these fires:

- **Multiple API consumers** that can’t redeploy in lockstep (the GraphQL contract then needs a real deprecation window).
- **Incremental / in-place upserts** replace full rebuilds (the mixed-population problem above, plus an actual transform step to hook).
- **Rebuild time becomes a real outage** (large index → the swap window stops being negligible).

## Notes

- Dataset Register already runs exclusively as blue/green full rebuild (fresh `${alias}_${timestamp}` collection, atomic alias swap, drop previous; concurrent triggers skipped, not queued). It is the first consumer that needs this and a good proving ground.
- A migration runner must confirm the rebuild it depends on actually completed (alias points at the new timestamped collection) before flipping the API live: a trigger arriving mid-rebuild is silently skipped, so the rebuild that eventually completes may have started before the schema change.
- Relates to the GraphQL surface in #529.
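
The completion check a migration runner needs can be sketched as a pure function over the alias's current target. The sketch assumes the timestamp suffix in `${alias}_${timestamp}` is epoch milliseconds, which this issue does not confirm; `collectionTimestamp` and `rebuildCompletedSince` are hypothetical helpers, and resolving the alias against the engine is left to the caller.

```typescript
// Assumption: collections are named `${alias}_${epochMillis}`; the real format may differ.
function collectionTimestamp(alias: string, collectionName: string): number | null {
  const prefix = alias + '_';
  if (collectionName.indexOf(prefix) !== 0) return null;
  const suffix = collectionName.slice(prefix.length);
  if (!/^\d+$/.test(suffix)) return null;
  return Number(suffix);
}

// True only when the alias points at a collection created at or after the trigger time.
// A rebuild that was already in flight when the trigger was skipped yields an older
// timestamp, so the runner keeps waiting instead of flipping the API live.
function rebuildCompletedSince(
  alias: string,
  currentTarget: string,
  triggeredAtMillis: number,
): boolean {
  const createdAt = collectionTimestamp(alias, currentTarget);
  return createdAt !== null && createdAt >= triggeredAtMillis;
}
```

Comparing timestamps assumes the runner and the rebuild process have reasonably synchronised clocks; recording a schema version on the collection would avoid that dependency, at the cost of a small extension to the rebuild.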

