diff --git a/docs/src/format/table/.pages b/docs/src/format/table/.pages index 16c20058608..5b0cb0e95e6 100644 --- a/docs/src/format/table/.pages +++ b/docs/src/format/table/.pages @@ -6,4 +6,5 @@ nav: - Layout: layout.md - Branch & Tag: branch_tag.md - Row ID & Lineage: row_id_lineage.md + - Data Overlay Files: data_overlay_file.md - MemTable & WAL: mem_wal.md diff --git a/docs/src/format/table/data_overlay_file.md b/docs/src/format/table/data_overlay_file.md new file mode 100644 index 00000000000..5540f860018 --- /dev/null +++ b/docs/src/format/table/data_overlay_file.md @@ -0,0 +1,390 @@ +# Data Overlay Files + +!!! note "Overlay files require feature flag 64 (data overlay files)" + + A reader or writer that does not understand overlay files must refuse a + dataset that uses them. Silently ignoring an overlay would return stale base + values, which is a correctness bug rather than a degraded experience. + +Overlay files supply new values for a subset of `(row offset, field)` cells +within a fragment **without rewriting the fragment's base data files**. They make +updates cheap when only a small fraction of rows and/or columns change: instead +of rewriting whole columns or moving rows to a new fragment, a writer appends a +small file carrying just the changed cells. + +This is Lance's third mechanism for changing data in place, alongside +[deletion files](index.md#deletion-files) (which remove rows) and +[data evolution](index.md#data-evolution) (which adds or rewrites whole columns). +An overlay changes individual cells. + +## Concepts + +### Coverage and resolution + +Each overlay declares which cells it provides through a **coverage** bitmap (or, +for sparse overlays, one bitmap per field). The bitmaps index **physical row +offsets** — positions in the base data files, counting deleted rows — so they are +stable across deletions, exactly like deletion vectors. + +To resolve a cell `(offset, field)` on read, walk the fragment's overlays from +**newest to oldest**. The first overlay that covers `(offset, field)` wins; its +value is used. If no overlay covers the cell, the value falls through to the base +data file (or is `NULL` if no base data file holds that field). + +Precedence among overlays is determined by: + +1. `committed_version` — higher wins (see [Versioning](#versioning-and-ordering)). +2. Position in `DataFragment.overlays` as a tiebreaker — a later entry is newer. + +A covered offset whose value is `NULL` overrides the cell **to** `NULL`. This is +distinct from an offset that is simply absent from the bitmap, which falls +through to the base. Coverage, not value-nullness, decides whether an overlay +applies. + +### Interaction with deletions + +Deletions take precedence over overlays. If a row offset is marked deleted in the +fragment's deletion file, any overlay value for that offset is dead and is +ignored, regardless of commit order. This keeps the invariant simple: a deletion +is the final word on a row, so a concurrent overlay against a row that was +deleted needs no special conflict handling — its values are merely inert. + +### Physical layout + +An overlay's data file stores **one value column per field**, in the order of +`data_file.fields`. It does **not** store a row-offset key column. The position of +a covered offset's value within its column is the **rank** of that offset in the +field's coverage bitmap — the number of set bits below it. For a Roaring bitmap +this is an O(1) operation, so random access to any cell is a rank computation +followed by a single value fetch, with no offset column to read and no binary +search. + +Because different fields may cover different offset sets, the value columns of a +single sparse overlay may have **different lengths**. The Lance file format +permits columns of differing item counts within one file, so a sparse overlay is +representable as a single file. (See [Writer support](#writer-support) for the +current implementation status.) + +### Dense vs. sparse overlays + +A single overlay is one of two shapes: + +- **Dense (rectangular).** One `shared_offset_bitmap` applies to every field. Every + covered offset has a value for every field. This is the common case for a plain + `UPDATE`, where one `SET` list is applied to one set of rows. +- **Sparse.** A `FieldCoverage` carries one bitmap per field, used when different + fields cover different offset sets — for example a `MERGE` with multiple + `WHEN MATCHED` branches, where different rows update different columns. A dense + overlay would have to widen to the bounding rectangle and fill the untouched + cells with their current values (post-images), which for wide columns such as + embeddings means re-storing data that did not change. A sparse overlay stores + exactly the changed cells. + +A writer may always express a non-rectangular update as **multiple dense overlays +in one transaction** (one per coverage group) instead of a single sparse overlay. + +## Protobuf + +
+DataOverlayFile protobuf message + +```protobuf +%%% proto.message.DataOverlayFile %%% +``` + +
+ +
+FieldCoverage protobuf message + +```protobuf +%%% proto.message.FieldCoverage %%% +``` + +
+ +## Versioning and ordering + +Overlays reuse the dataset version as their ordering clock rather than +introducing a separate generation counter. + +`committed_version` is the dataset version at which an overlay **became +effective** — the version of the commit that introduced it, **not** the version +it was read from. It is stamped at commit time and re-stamped if the commit is +retried, in the same way as the created-at / last-updated-at version sequences. + +This single value drives every ordering decision: + +- **Overlay vs. overlay** (read precedence): higher `committed_version` wins. +- **Overlay vs. index** (query correctness): an index records the + `dataset_version` it was built from. An index whose `dataset_version >= + committed_version` already incorporates the overlay. An overlay whose + `committed_version > index.dataset_version` is newer than the index and its + cells must be excluded from index results and re-evaluated. +- **Scheduler signal**: the gap between an overlay's `committed_version` and an + index's `dataset_version`, or between an overlay and the base, is a staleness + measure the compaction scheduler can use. + +!!! note "Why effective version, not read version" + + Suppose an overlay reads version 5 and commits at version 6, while an index + is built reading version 5 (before the overlay) and commits at version 7 with + `dataset_version = 5`. If the overlay stored its *read* version (5), the test + `5 > 5` is false, the row would not be excluded, and the index — which never + saw the overlay — would return a stale result. Storing the *effective* + version (6) makes `6 > 5` true, the cell is excluded and re-evaluated, and the + result is correct. + +## Index integration + +Building an index over a fragment that has overlays does **not** require dropping +the fragment from the index's coverage. The fragment stays indexed, and the query +path reconciles overlays at query time using an **exclusion set**. + +The exclusion set for an index on field `F` is the union of the coverage bitmaps, +restricted to field `F`, of every overlay whose `committed_version > +index.dataset_version`. The exclusion is **field-aware**: an overlay that touches +only unrelated columns does not exclude anything from the index on `F`. + +The query then proceeds as: + +1. Run the index search as usual, producing candidate rows. +2. Remove any candidate in the exclusion set. (Its indexed value may be stale.) +3. **Re-evaluate** the excluded rows against their current values — the same flat + path already used for the unindexed tail of fragments. For a scalar predicate + this re-applies the filter; for a vector query it re-scores the row's current + vector. Rows that still match are added back to the result. + +Step 3 is what makes exclusion correct rather than merely safe: removing a row +from index candidates without re-evaluating it would silently drop a row that +should match under its new value. + +### Correctness invariant + +> For every indexed field `F` and every row offset `o` in a fragment the index +> covers, the index's entry for `(o, F)` is trusted unless `o` is excluded. +> `o` is excluded iff some overlay with `committed_version > index.dataset_version` +> covers `(o, F)`. + +The write and compaction paths together preserve this: + +- **Writes** change a cell only by adding an overlay, and that overlay's + `committed_version` exceeds the version of any pre-existing index — so the + change is always covered by an exclusion. +- **Compaction** may remove an overlay only if the index no longer relies on it + (see below). + +## Compaction + +Overlays accumulate read cost — every overlay is a bitmap to test and a possible +file to open. Compaction bounds that cost in two modes: + +- **Overlay → overlay.** Merge several overlays into fewer, computing the + post-image per `(offset, field)` by walking the merged overlays newest-first. + The merged overlay takes the **maximum** `committed_version` of its inputs, so + the exclusion semantics are preserved. Indexes are unaffected. This is cheap and + does not touch the base. +- **Overlay → base.** Fold overlays into a fresh base data file, computing the + post-image for every covered cell, then clear the overlays. The base is + complete, so every post-image is well defined. Overlay offsets are physical, so + they cannot survive a rewrite that reorders rows; folding therefore materializes + values rather than carrying overlays forward. + +!!! warning "Folding an indexed field must update its index" + + An overlay→base fold removes the overlay, which removes the exclusion signal + that kept an index correct. Folding an overlay that covers an indexed field + `F` is therefore equivalent to a column rewrite of `F` and must, in the same + commit, either rebuild the index to a `dataset_version` at least the folded + overlay's `committed_version`, or remove the fragment from the index's + coverage so the rows fall to the flat path. Otherwise the index would serve + stale values with no overlay to exclude them. This is the same rule that + already governs rewriting a column that an index is built on. + +When a fragment with overlays is compacted by a row-rewriting operation +(`RewriteRows`, which produces new fragments with new row addresses), the +overlays are folded into the new base as part of the rewrite, and existing +[fragment-reuse remapping](row_id_lineage.md) handles the row-address changes as +it does today. + +## Row lineage + +An overlay write updates the `last_updated_at_version` of every covered row, so +change-data-feed and time-travel queries observe the update. Because overlays are +addressed by physical offset, they do **not** require stable row IDs to be +enabled; lineage updates apply only when those features are on. + +## Worked example + +A table `users` with stable row IDs enabled and these fields: + +| field id | name | type | +|----------|-----------|-------------------------| +| 1 | id | `int32` (primary key) | +| 2 | name | `utf8` | +| 3 | age | `int32` | +| 4 | embedding | `fixed_size_list`| + +Created at version 1 as a single fragment `0` with one base data file +`data/file0.lance` holding all four columns. `physical_rows = 4`: + +| offset | id | name | age | embedding | +|--------|----|-------|-----|------------------| +| 0 | 1 | Alice | 30 | … | +| 1 | 2 | Bob | 25 | … | +| 2 | 3 | Carol | 40 | … | +| 3 | 4 | Dave | 22 | … | + +A BTree scalar index on `age` is built at version 1, covering fragment `0` +(`dataset_version = 1`). + +### Step 1 — write an overlay + +```sql +UPDATE users SET age = 26 WHERE id = 2; -- Bob, offset 1 +``` + +This touches one field (`age`) for one row, so the writer emits a dense overlay +and commits it as version 2. Fragment `0` gains: + +```text +DataOverlayFile { + data_file: { path: "data/overlay-.lance", fields: [3], column_indices: [0] } + coverage: shared_offset_bitmap = {1} + committed_version: 2 +} +``` + +The overlay file stores a single `age` column with one value, `[26]`, at +rank `{1}.rank(1) = 0`. `last_updated_at_version[1]` is set to 2. + +### Step 2 — read + +`SELECT id, age FROM users` reads base ages `[30, 25, 40, 22]`. For `age` +(field 3), the overlay covers offset 1, so `age[1]` is replaced with the overlay +value at position `{1}.rank(1) = 0` → `26`. Result ages: `[30, 26, 40, 22]`. + +### Step 3 — index query + +```sql +SELECT * FROM users WHERE age = 26; +``` + +The `age` index was built at `dataset_version = 1`; the overlay's +`committed_version` is 2. Since `2 > 1`, the overlay's coverage for `age`, `{1}`, +is the exclusion set for this query. + +- The index (built at v1) holds Bob's *old* `age = 25`, so a lookup for `26` + returns nothing from the index. +- Offset 1 is in the exclusion set, so it is re-evaluated on the flat path. Its + current `age` (26, via the overlay) matches `age = 26`, so Bob is returned. + +The mirror case `WHERE age = 25` shows exclusion preventing a stale hit: the index +returns offset 1 (stale `25`), but offset 1 is excluded, re-evaluated to `26`, and +correctly dropped. + +### Step 4 — a second, non-rectangular write + +```sql +MERGE INTO users USING staged ON users.id = staged.id +WHEN MATCHED AND staged.kind = 'rename' THEN UPDATE SET name = staged.name -- Carol(2), Dave(3) +WHEN MATCHED AND staged.kind = 'revec' THEN UPDATE SET embedding = staged.embedding -- Bob(1) +``` + +`name` is updated for offsets `{2, 3}` and `embedding` for offset `{1}` — different +fields over different rows. This is a sparse overlay, committed as version 3: + +```text +DataOverlayFile { + data_file: { path: "data/overlay-.lance", fields: [2, 4], column_indices: [0, 1] } + coverage: field_coverage { offset_bitmaps: [ {2,3}, {1} ] } + // name (field 2) ^ ^ embedding (field 4) + committed_version: 3 +} +``` + +The file's `name` column has **two** values (`["Caroline", "David"]`, at +ranks 0 and 1 of `{2,3}`) and its `embedding` column has **one** value (at rank 0 +of `{1}`) — columns of different lengths in one file. + +### Step 5 — read after the second write + +`SELECT name, age, embedding FROM users` resolves each field independently, +newest overlay first: + +- `name`: the v3 overlay covers `{2,3}` → `["Alice", "Bob", "Caroline", "David"]`. +- `age`: the v3 overlay does not cover `age`; the v2 overlay still applies at + offset 1 → `[30, 26, 40, 22]`. +- `embedding`: the v3 overlay covers `{1}` → Bob's vector is the new one, others + from base. + +Overlays from different versions coexist and apply per field. + +### Step 6 — compaction (overlay → base) + +The scheduler folds both overlays into fragment `0` at version 4, computing +post-images for `age`, `name`, and `embedding`, and writing a new base data file +`data/file1.lance` with those columns. In the old file, fields 2, 3, and 4 are +tombstoned (`-2`); field 1 (`id`) remains. The fragment's `overlays` list is +cleared. Row addresses are preserved (a column rewrite, not a row rewrite), so +stable row IDs and the deletion vector are untouched. + +Because the fold removed the overlay that was excluding offset 1 from the `age` +index, the same commit must reconcile that index: either rebuild it at +`dataset_version >= 2`, or drop fragment `0` from its coverage so `age` queries +fall to the flat path. After a rebuild at version 4, no overlay remains and the +`age` index directly returns `26` for Bob with no exclusion needed. + +## Guidance + +!!! note "This section is a stub." + + The following are implementation considerations, not part of the on-disk + specification. + +### When to overlay vs. rewrite a column vs. move rows + +*(To be expanded.)* The choice between appending an overlay, rewriting a full +column (data evolution), and moving updated rows to a new fragment depends on the +fraction of rows changed, the fraction of columns changed, column width, the +presence of indexes on the changed columns, and the accumulated overlay read +cost. Roughly: few rows changed favors overlays; most rows in a few columns +favors a column rewrite; most columns changed favors moving rows to a new +fragment. + +### Writer support + +*(To be expanded.)* Dense (rectangular) overlays write with the existing +equal-length file writer today. Sparse overlays stored as a **single** file +require the writer to emit columns of independent lengths, which the current v2 +writer does not yet do (it advances all columns from one global row counter). +Until that support lands, a writer can express a sparse update as multiple dense +overlays in one transaction. + +### Scheduling compaction + +*(To be expanded.)* The overlay→overlay and overlay→base modes have very +different costs; a cost/benefit scheduler decides when each is worthwhile, using +the version gap as a staleness signal. + +### Open questions + +*(To be resolved.)* + +- **Per-fragment vs. per-table overlays.** Overlays are attached per fragment. + Should there be a table-level overlay concept, and how would it interact with + fragment-level row addressing? +- **Relationship to LSM.** Overlays plus compaction resemble an LSM tree (newest + layer wins, periodic merge). How far should that analogy be taken, and what do + we deliberately do differently given Lance's random-access requirements? +- **Coverage bitmap spill.** Coverage bitmaps live inline in the manifest. Very + large coverage (an overlay touching many rows) may warrant external spill, as + the row-ID and last-updated-at sequences already do above a size threshold. + +## Related specifications + +- [Table format overview](index.md) +- [Transactions](transaction.md) +- [Row ID & Lineage](row_id_lineage.md) +- [Index Formats](../index/index.md) +- [Format Versioning](versioning.md) diff --git a/docs/src/format/table/index.md b/docs/src/format/table/index.md index 94ea4b90dc9..ce4d0b26613 100644 --- a/docs/src/format/table/index.md +++ b/docs/src/format/table/index.md @@ -168,6 +168,35 @@ However, this invalidates row addresses and requires rebuilding indices, which c +## Data Overlay Files + +!!! note "Overlay files require feature flag 64 (data overlay files)" + +Overlay files supply new values for a subset of `(row offset, field)` cells within +a fragment without rewriting the base data files. They make updates cheap when only +a small percentage of rows and/or columns change: a writer appends a small file +carrying just the changed cells instead of rewriting whole columns or moving rows +to a new fragment. + +On read, each cell is resolved by consulting the fragment's overlays from newest to +oldest; the first overlay covering that `(offset, field)` wins, otherwise the value +falls through to the base data file. Indices keep covering the fragment and reconcile +overlays at query time through a field-aware exclusion set. + +For the full specification — coverage and resolution rules, dense vs. sparse layout, +versioning, index integration, compaction, and a worked example — see the +[Data Overlay Files Specification](data_overlay_file.md). + +
+DataOverlayFile protobuf message + +```protobuf +%%% proto.message.DataOverlayFile %%% +``` + +
+ + ## Related Specifications ### Storage Layout diff --git a/protos/table.proto b/protos/table.proto index d298809d5d8..cc8b477a6a6 100644 --- a/protos/table.proto +++ b/protos/table.proto @@ -113,6 +113,11 @@ message Manifest { // * 2: row ids are stable and stored as part of the fragment metadata. // * 4: use v2 format (deprecated) // * 8: table config is present + // * 16: data files use multiple base paths (shallow clone / multi-base) + // * 32: the transaction file under _transactions is not written (inline only) + // * 64: data overlay files are present (see DataOverlayFile). Readers that do + // not understand overlays must refuse the dataset, since ignoring an overlay + // would silently return stale base values. uint64 reader_feature_flags = 9; // Feature flags for writers. @@ -311,6 +316,15 @@ message DataFragment { repeated DataFile files = 2; + // Optional overlay files for this fragment, which supply new values for a + // subset of cells without rewriting the base data files. This MUST be empty + // if the data overlay files feature flag (64) is not set in the manifest. + // + // Order is significant: a later entry is newer than an earlier one. When two + // overlays cover the same (offset, field) and share a `committed_version`, the + // later entry wins. See DataOverlayFile for the full resolution rules. + repeated DataOverlayFile overlays = 11; + // File that indicates which rows, if any, should be considered deleted. DeletionFile deletion_file = 3; @@ -433,6 +447,66 @@ message DataFile { optional uint32 base_id = 7; } // DataFile +// An overlay file supplies new values for a subset of (row offset, field) cells +// within a fragment, without rewriting the fragment's base data files. It is +// used for efficient updates when only a small fraction of rows and/or columns +// change. +// +// On read, a cell is resolved by consulting the fragment's overlays from newest +// to oldest: the first overlay that covers that (offset, field) wins; if none +// cover it, the value falls through to the base data file. Because deletions +// take precedence over overlays, an overlay value for an offset that is also +// marked deleted is dead and is ignored. +// +// The overlay's data file does NOT store a row-offset key column. Within a value +// column, the position of a covered offset's value is the rank (0-based count of +// set bits below it) of that offset within the field's coverage bitmap. Because +// fields may cover different offset sets, the value columns of a single overlay +// data file may have different lengths (which the Lance file format permits). +message DataOverlayFile { + // The data file storing the overlay's new cell values, one value column per + // field in `data_file.fields`. No row-offset key column is stored. + DataFile data_file = 1; + + // Which (offset, field) cells this overlay provides values for. + oneof coverage { + // A single 32-bit Roaring bitmap of physical row offsets that applies to + // every field in `data_file.fields` (a "dense" / rectangular overlay). + // Every covered offset has a value for every field. This is the common case + // for a plain UPDATE, where one SET list is applied to one set of rows. + bytes shared_offset_bitmap = 2; + // Per-field coverage for a "sparse" overlay, used when different fields cover + // different offset sets (e.g. a MERGE with multiple WHEN MATCHED branches). + FieldCoverage field_coverage = 4; + } + + // The dataset version at which this overlay became effective: the version of + // the commit that introduced it, NOT the version it was read from. It is + // stamped at commit time and re-stamped if the commit is retried, in the same + // way as the created-at / last-updated-at version sequences. + // + // This drives two orderings: + // * Versus index builds: an index whose `dataset_version` >= this value + // already incorporates this overlay. Otherwise the overlay's covered cells + // are excluded from index results for the affected fields and re-evaluated + // against their current values (see the Data Overlay Files specification). + // * Versus other overlays: when two overlays cover the same (offset, field), + // the one with the higher `committed_version` wins. Overlays that share a + // `committed_version` are ordered by their position in + // `DataFragment.overlays`, where a later entry is newer and wins. + uint64 committed_version = 3; +} + +// Per-field coverage for a sparse overlay. +message FieldCoverage { + // One entry per field in the overlay's `data_file.fields`, in the same order. + // Each is a 32-bit Roaring bitmap of the physical row offsets covered for that + // field. An offset present in a field's bitmap but mapped to a NULL value + // means the cell is overridden to NULL (distinct from an offset that is absent, + // which falls through to the base data file). + repeated bytes offset_bitmaps = 1; +} + // Deletion File // // The path of the deletion file is constructed as: