diff --git a/docs/APPEND_ONLY_RAFT_DOVETAIL.md b/docs/APPEND_ONLY_RAFT_DOVETAIL.md index 3c632574..89dcfcf1 100644 --- a/docs/APPEND_ONLY_RAFT_DOVETAIL.md +++ b/docs/APPEND_ONLY_RAFT_DOVETAIL.md @@ -12,6 +12,24 @@ or any other model where storage mutates state in place. This doc names the property explicitly + lists the operational consequences so adopters can choose the right deployment shape. +> ### Scope: external architecture pattern, not a built-in lance-graph feature +> +> Post-merge correction (paralleling peer review on the companion doc +> PR #453): the deployment shape described below ("peer-Raft + +> Lance-local-per-node") is an EXTERNAL ARCHITECTURE PATTERN adopters +> can build on top of lance-graph, NOT a built-in lance-graph +> capability. Lance-graph provides the append-only columnar storage + +> the DataFusion query path + the encoding crates; the Raft layer + +> the substrate binary + the consensus-replication path are +> downstream consumer code (e.g. `openraft` or `surreal-cluster`). +> Adopters who only consume lance-graph's columnar + DataFusion path +> should NOT assume their data is automatically replicated. +> +> The doc documents WHY this pattern works well WHEN built on +> lance-graph — the storage-append/consensus-append dovetail property +> — not a feature lance-graph itself ships. + + ## The two write shapes that have to align A distributed Lance deployment has two write paths: @@ -35,15 +53,16 @@ Compare this with the conventional alternatives: |---|---|---| | **LSM-tree (Cassandra)** | Paxos-light / gossip | Storage AND consensus both have their own append-then-mutate cycles. Compaction in storage interacts with hinted handoff in consensus. Coordination headaches. | | **B-tree (PostgreSQL)** | 2PC (citus-like) | Storage in-place updates fight with 2PC's append-log. Vacuum interacts with commit-log replay. More headaches. | -| **Append-only Lance** | Append-only Raft | One write shape. Storage commit = consensus log entry. No interaction problems. | +| **Append-only Lance** | Append-only Raft | One write shape. Storage commit = consensus log entry. (`DatasetOptimizer.compact_files` produces a new manifest version — that output replicates through the same Raft log as a normal write. The operation runs in one place; the result replicates. Per codex P2 PR #454 — see Operational consequence #1 for the honest framing.) | ## Operational consequences -### 1. No compaction storms +### 1. Compaction is qualitatively different, not absent -Cassandra clusters periodically run compaction (rewrite SSTables to -reclaim space + maintain read performance). Each node compacts on -its own schedule. During compaction: +Cassandra clusters periodically run LSM compaction (rewrite SSTables +to reclaim tombstones + merge sorted runs to maintain read +performance). Each node compacts on its own schedule. During +compaction: - The compacting node's CPU spikes - Its disk write bandwidth spikes @@ -56,15 +75,54 @@ The cluster operator's job is partly to schedule compactions across nodes so that not too many compact simultaneously. This is a significant operational burden. -Lance has no compaction. The version log IS the truth; old fragments -can be reclaimed by version-based GC (a much simpler operation than -SSTable compaction) but the GC is local-only, doesn't interact with -replication, and doesn't reorder anything. +Lance has compaction TOO, but of a qualitatively different shape: +`DatasetOptimizer.compact_files` merges small fragments into larger +ones to optimize query layout (many small appends produce many small +fragments which slow scans). For datasets that use deletes, updates, +or dropped columns, the SAME compaction also performs reclamation — +deletion vectors get materialized away (removing rows logically +marked for delete) and dropped columns are physically removed by +default. So there IS a reclamation role on those datasets; it is +just NOT the LSM tombstone-reclaim mechanism (Lance has no +tombstones at the version level — deletions are tracked by +deletion-vectors against append-only fragments). For append-only +write workloads with no deletes/updates/drops, the layout-only +framing applies; for mixed-write workloads, the reclamation +component is also present. Either way the operation produces new +append-only fragments at a better layout. + +Operationally: + +- The compaction OPERATION runs locally on whichever node has the + current leader role (or on a node permitted to run a maintenance + task in the chosen deployment shape); it does not block normal + write flow at the application layer +- The OUTPUT of compaction — the new manifest version + the new + set of fragments — flows through the Raft log like any other + Lance commit. Peers see the new manifest version after consensus + commits, and anti-entropy converges replicas to the post-compaction + state. So the result REPLICATES; the work that produced the result + is what runs in one place +- Per-node SCHEDULING choices do not stack into coordination + headaches the way Cassandra LSM scheduling does, because each + compaction's product is a single committed version (not a + per-replica concurrent rewrite that has to be reconciled). At most + one node should run a given compaction at a time to avoid wasted + work; this is a coordination choice (lock or leader-only), not a + coordination headache +- The failure modes are smaller: a partial compaction is recoverable + via Raft's standard log replay; no in-flight LSM tombstones to + lose; correctness is unaffected A peer-Raft + Lance deployment therefore has uniform per-node -behavior. Each node is doing the same work at the same time, with -the same shape. The operations runbook is simpler because the -failure modes are simpler. +behavior under consensus. Compaction is a maintenance operation +that produces a normal commit; operators plan for it (it consumes +CPU + IO when it runs) but the cluster-wide coordination model is +simpler than Cassandra's per-node-independent LSM compaction +scheduling. (Per post-merge correction on PR #452; sharpened per +codex P2 review on PR #454 — the prior framing said 'independent +of consensus' which overclaimed; the operation is local but the +output replicates.) ### 2. Anti-entropy is a hash compare, not a Merkle-tree walk @@ -109,19 +167,30 @@ footprint depends on which columns mutated, whether the row was new or updated, whether the column had a previous value. Cross-DC replication budget is harder to plan. -### 5. The consensus tax lands once, not twice - -This is the unifying point: with non-append-only storage, an -application that wants linearizable writes pays the consensus tax -TWICE. Once for the consensus protocol shipping operations to -replicas. Once for the storage layer doing per-node compaction + -mutation bookkeeping. The two taxes interact (a compaction storm -delays consensus catch-up; a Raft snapshot has to materialize the -LSM-tree state). - -With Lance + Raft, the consensus tax and the storage tax are the -SAME tax. The append IS both the consensus log entry and the storage -commit. You pay it once. +### 5. The consensus tax and the storage-COMMIT tax are the same tax + +This is the unifying point: with LSM-tree storage, an application +that wants linearizable writes pays the consensus tax TWICE. Once +for the consensus protocol shipping operations to replicas. Once for +the storage layer doing per-node tombstone-reclaim + run-merge +compaction. The two taxes interact (a compaction storm delays +consensus catch-up; a Raft snapshot has to materialize the LSM-tree +state). + +With Lance + Raft, the consensus tax and the storage-COMMIT tax are +the SAME tax — the append IS both the consensus log entry and the +storage commit; you pay it once. Lance does have its own file- +compaction cycle (`DatasetOptimizer.compact_files`), which produces +a NEW manifest version — and that new version flows through the +same Raft log as any other write, replicating to peers via the +normal consensus + anti-entropy path. So compaction's OUTPUT +counts as a consensus event (one more append). What it does NOT +add is a SECOND tax of the LSM-tree kind (per-node tombstone- +reclaim + run-merge bookkeeping that runs on every replica +independently and creates coordination headaches with replication). +The LAYOUT-OPTIMIZATION cycle exists; it pays the SAME consensus +tax as a regular write (one commit), and does NOT layer a separate +per-replica storage tax on top. ## What this implies for deployment shape diff --git a/docs/CLUSTER_ASYMMETRY.md b/docs/CLUSTER_ASYMMETRY.md index 47483f0f..4f7c5d5b 100644 --- a/docs/CLUSTER_ASYMMETRY.md +++ b/docs/CLUSTER_ASYMMETRY.md @@ -95,12 +95,29 @@ by 1-3 orders of magnitude vs LSM-tree wide-column representations: reduces ANN scan space by ~16× under empirical intra-family locality (98.6% per the `lance-graph` PR #444 probe). -- **vort/vart adaptive radix trie**: structural deduplication of - shared HHTL prefixes. The heel + hip nibbles common across many - entities are stored ONCE per path segment, not N times. Adaptive - Radix Tree shape: O(k) lookup, prefix-sharing storage. The same - structure also serves as the time-axis index over Lance - `versions()` for cold-path queries. +- **`lance-graph-contract::hhtl::NiblePath` (shipped) + Lance + `versions()` (shipped)**: HHTL identity is a 16ⁿ nibble path packed + into a `u64` (`FAN_OUT = 16`, `MAX_DEPTH = 16`). Adopters who want + to dedupe shared HHTL prefixes in memory typically derive an + adaptive radix-trie index over `NiblePath` addresses — heel + hip + nibbles common across many entities can be stored once per path + segment via consumer-side structures (O(k) lookup, prefix-sharing). + **The dedup-by-prefix data structure itself is consumer code, not + a built-in lance-graph crate.** Lance's `versions()` returns + `Vec` — the time-axis is the + version-snapshot LOG (each version is a tagged snapshot of the + dataset; the log is append-only, ordered, and queryable). It does + NOT itself identify which identities changed in each snapshot; + adopters who need a change-set derive it by comparing snapshots + (or by maintaining a separate change-index per their workload's + needs). Codex P2 review on PR #454 caught the prior overclaim + that `versions()` was a 'changed-position index'; corrected: + it's the snapshot LOG, and the change-set derivation is consumer + code. An earlier version of this doc cited `vort/vart` as if it + were a shipped crate (a separate fix). The two shipped surfaces + for the bullet are `NiblePath` (identity) and `versions()` + (snapshot log); the radix-shaped consumer trie and the + change-set index are both consumer code. Concrete example: Wikidata (~115M entities). In Cassandra+JG, the indexed graph form is multi-TB with replication factor 3 → multi-TB