From d854974c6426f65a676c19015ac064eaf799e132 Mon Sep 17 00:00:00 2001 From: "jan (via bardioc)" Date: Wed, 3 Jun 2026 11:53:50 +0000 Subject: [PATCH 1/2] docs: post-merge corrections on PR #452 + #453 architecture docs Two follow-up doc fixes after PR #452 and #453 merged. Reviewers flagged real bugs the doc-only PRs landed with; these corrections bring main in line with the post-review state. == CLUSTER_ASYMMETRY.md == Drop the vort/vart adaptive-radix-trie citation; cite shipped NiblePath + Lance versions() instead. A workspace scan across all Cargo.toml files confirms vart is NOT a dependency in this repository; le-domino's own seam-map documents it as 'doc-prose only'. Citing a non-existent crate in a public architecture doc misleads adopters into looking for a crate that they cannot find. The role the bullet described (HHTL-prefix dedup) is actually: - lance-graph-contract::hhtl::NiblePath (shipped) for the identity primitive (16-fan-out nibble path packed into a u64) - Lance versions() (shipped) for the time-axis (cross-session index of which identity positions changed when) Adopters can derive an adaptive radix-trie index over NiblePath addresses themselves; that data structure is consumer code, not a lance-graph dep. The corrected bullet cites the two shipped surfaces and flags the radix-shaped consumer pattern as proposed. == APPEND_ONLY_RAFT_DOVETAIL.md == Apply the same critique class codex raised on PR #453's companion doc: - Scope caveat: peer-Raft + Lance-local is an EXTERNAL architecture pattern (bardioc B1 substrate-b), NOT a built-in lance-graph feature. Adopters provide the Raft layer themselves (openraft / surreal-cluster / external TiKV). Lance-graph contributes the storage-append/consensus-append dovetail property that MAKES the pattern cheap; not the pattern itself. Added a scope banner immediately after the TL;DR. - Compaction honesty: Lance has compaction TOO via DatasetOptimizer.compact_files for fragment layout. The doc previously said 'Lance has no compaction' which is wrong for append-heavy deployments. Rewrote section 1 to distinguish LSM tombstone-reclaim+run-merge compaction from Lance file-layout compaction. Both exist; only LSM coordinates with replication. - Consensus-tax-lands-once section also updated to acknowledge that Lance file compaction runs INDEPENDENTLY of consensus (the storage commit tax and the consensus tax are the same write; the LAYOUT OPTIMIZATION cycle is a separate concern that does not couple). Both files were merged earlier (PR #452, #453); these corrections land in main as a follow-up so adopters reading the docs today see the honest scope + correct citations. No code changes; doc-only. --- docs/APPEND_ONLY_RAFT_DOVETAIL.md | 91 ++++++++++++++++++++++--------- docs/CLUSTER_ASYMMETRY.md | 21 +++++-- 2 files changed, 81 insertions(+), 31 deletions(-) diff --git a/docs/APPEND_ONLY_RAFT_DOVETAIL.md b/docs/APPEND_ONLY_RAFT_DOVETAIL.md index 3c632574..67ffd836 100644 --- a/docs/APPEND_ONLY_RAFT_DOVETAIL.md +++ b/docs/APPEND_ONLY_RAFT_DOVETAIL.md @@ -12,6 +12,24 @@ or any other model where storage mutates state in place. This doc names the property explicitly + lists the operational consequences so adopters can choose the right deployment shape. +> ### Scope: external architecture pattern, not a built-in lance-graph feature +> +> Post-merge correction (paralleling peer review on the companion doc +> PR #453): the deployment shape described below ("peer-Raft + +> Lance-local-per-node") is an EXTERNAL ARCHITECTURE PATTERN adopters +> can build on top of lance-graph, NOT a built-in lance-graph +> capability. Lance-graph provides the append-only columnar storage + +> the DataFusion query path + the encoding crates; the Raft layer + +> the substrate binary + the consensus-replication path are +> downstream consumer code (e.g. `openraft` or `surreal-cluster`). +> Adopters who only consume lance-graph's columnar + DataFusion path +> should NOT assume their data is automatically replicated. +> +> The doc documents WHY this pattern works well WHEN built on +> lance-graph — the storage-append/consensus-append dovetail property +> — not a feature lance-graph itself ships. + + ## The two write shapes that have to align A distributed Lance deployment has two write paths: @@ -35,15 +53,16 @@ Compare this with the conventional alternatives: |---|---|---| | **LSM-tree (Cassandra)** | Paxos-light / gossip | Storage AND consensus both have their own append-then-mutate cycles. Compaction in storage interacts with hinted handoff in consensus. Coordination headaches. | | **B-tree (PostgreSQL)** | 2PC (citus-like) | Storage in-place updates fight with 2PC's append-log. Vacuum interacts with commit-log replay. More headaches. | -| **Append-only Lance** | Append-only Raft | One write shape. Storage commit = consensus log entry. No interaction problems. | +| **Append-only Lance** | Append-only Raft | One write shape. Storage commit = consensus log entry. No interaction problems. (Lance does have its own file-layout compaction via `DatasetOptimizer.compact_files`, but it runs INDEPENDENTLY of consensus — see Operational consequence #1.) | ## Operational consequences -### 1. No compaction storms +### 1. Compaction is qualitatively different, not absent -Cassandra clusters periodically run compaction (rewrite SSTables to -reclaim space + maintain read performance). Each node compacts on -its own schedule. During compaction: +Cassandra clusters periodically run LSM compaction (rewrite SSTables +to reclaim tombstones + merge sorted runs to maintain read +performance). Each node compacts on its own schedule. During +compaction: - The compacting node's CPU spikes - Its disk write bandwidth spikes @@ -56,15 +75,31 @@ The cluster operator's job is partly to schedule compactions across nodes so that not too many compact simultaneously. This is a significant operational burden. -Lance has no compaction. The version log IS the truth; old fragments -can be reclaimed by version-based GC (a much simpler operation than -SSTable compaction) but the GC is local-only, doesn't interact with -replication, and doesn't reorder anything. +Lance has compaction TOO, but of a qualitatively different shape: +`DatasetOptimizer.compact_files` merges small fragments into larger +ones to optimize query layout (many small appends produce many small +fragments which slow scans). It is NOT a tombstone-reclaim cycle — +Lance is append-only at the version level, so there are no tombstones +to reclaim in the LSM sense. File compaction takes existing +append-only fragments and produces new append-only fragments at a +better layout. + +Operationally: + +- Compaction runs INDEPENDENTLY of consensus replication — file + compaction does not block writes, does not affect Raft log shipping, + does not coordinate across nodes +- Per-node compaction is local-only; each node compacts its own + Lance dataset on its own schedule without affecting peers +- The failure modes are smaller (a partial compaction is recoverable; + no in-flight tombstones to lose; correctness is unaffected) A peer-Raft + Lance deployment therefore has uniform per-node -behavior. Each node is doing the same work at the same time, with -the same shape. The operations runbook is simpler because the -failure modes are simpler. +behavior under both consensus and compaction. Operators still plan +for file compaction (it consumes CPU + IO when it runs), but the +cluster-wide coordination burden is significantly lower than +Cassandra's LSM compaction scheduling. (Per post-merge correction +on PR #452.) ### 2. Anti-entropy is a hash compare, not a Merkle-tree walk @@ -109,19 +144,25 @@ footprint depends on which columns mutated, whether the row was new or updated, whether the column had a previous value. Cross-DC replication budget is harder to plan. -### 5. The consensus tax lands once, not twice - -This is the unifying point: with non-append-only storage, an -application that wants linearizable writes pays the consensus tax -TWICE. Once for the consensus protocol shipping operations to -replicas. Once for the storage layer doing per-node compaction + -mutation bookkeeping. The two taxes interact (a compaction storm -delays consensus catch-up; a Raft snapshot has to materialize the -LSM-tree state). - -With Lance + Raft, the consensus tax and the storage tax are the -SAME tax. The append IS both the consensus log entry and the storage -commit. You pay it once. +### 5. The consensus tax and the storage-COMMIT tax are the same tax + +This is the unifying point: with LSM-tree storage, an application +that wants linearizable writes pays the consensus tax TWICE. Once +for the consensus protocol shipping operations to replicas. Once for +the storage layer doing per-node tombstone-reclaim + run-merge +compaction. The two taxes interact (a compaction storm delays +consensus catch-up; a Raft snapshot has to materialize the LSM-tree +state). + +With Lance + Raft, the consensus tax and the storage-COMMIT tax are +the SAME tax — the append IS both the consensus log entry and the +storage commit; you pay it once. Lance does have its own file- +compaction cycle (`DatasetOptimizer.compact_files`), but it runs +INDEPENDENTLY of consensus — file compaction takes existing +append-only fragments and rewrites them into bigger append-only +fragments without tombstones to reclaim and without coordinating +with replication. So the LAYOUT-OPTIMIZATION cycle exists but does +not interact with the consensus tax. ## What this implies for deployment shape diff --git a/docs/CLUSTER_ASYMMETRY.md b/docs/CLUSTER_ASYMMETRY.md index 47483f0f..59462dc1 100644 --- a/docs/CLUSTER_ASYMMETRY.md +++ b/docs/CLUSTER_ASYMMETRY.md @@ -95,12 +95,21 @@ by 1-3 orders of magnitude vs LSM-tree wide-column representations: reduces ANN scan space by ~16× under empirical intra-family locality (98.6% per the `lance-graph` PR #444 probe). -- **vort/vart adaptive radix trie**: structural deduplication of - shared HHTL prefixes. The heel + hip nibbles common across many - entities are stored ONCE per path segment, not N times. Adaptive - Radix Tree shape: O(k) lookup, prefix-sharing storage. The same - structure also serves as the time-axis index over Lance - `versions()` for cold-path queries. +- **`lance-graph-contract::hhtl::NiblePath` (shipped) + Lance + `versions()` (shipped)**: HHTL identity is a 16ⁿ nibble path packed + into a `u64` (`FAN_OUT = 16`, `MAX_DEPTH = 16`). Adopters who want + to dedupe shared HHTL prefixes in memory typically derive an + adaptive radix-trie index over `NiblePath` addresses — heel + hip + nibbles common across many entities can be stored once per path + segment via consumer-side structures (O(k) lookup, prefix-sharing). + **The dedup-by-prefix data structure itself is consumer code, not + a built-in lance-graph crate.** Lance's own `versions()` log is the + time-axis (cross-session index of which identity positions changed + when). An earlier version of this doc cited `vort/vart` as if it + were a shipped crate; corrected per peer review — the radix-shaped + trie at the cognitive layer is a proposed pattern (no shipped crate + name) and the identity primitive + the time-axis are the two + shipped surfaces this bullet should have cited from the start. Concrete example: Wikidata (~115M entities). In Cassandra+JG, the indexed graph form is multi-TB with replication factor 3 → multi-TB From 4bd38f46ec634501a12f91aab839603f70623d8e Mon Sep 17 00:00:00 2001 From: "jan (via bardioc)" Date: Wed, 3 Jun 2026 12:18:44 +0000 Subject: [PATCH 2/2] fix(codex): compaction-not-local-only + compaction-DOES-reclaim + versions-is-not-a-change-index Three P2 findings on PR #454 walked back, all real overclaims: P2 1 (APPEND_ONLY_RAFT_DOVETAIL.md): Walked back 'compaction runs INDEPENDENTLY of consensus replication'. Under peer-Raft, compaction COMMITS a new manifest version which IS part of the consensus log; the compaction OPERATION runs locally but its OUTPUT (new fragments + new manifest delta) flows through Raft like any other write. So peers see it. The independence claim should have been narrower: the SCHEDULING of the operation is local; the OUTCOME replicates. P2 2 (APPEND_ONLY_RAFT_DOVETAIL.md): Walked back 'no tombstones to reclaim in the LSM sense; layout optimization only'. Lance compact_files DOES reclaim deleted rows and dropped columns by default. Different mechanism from LSM tombstones (Lance has no tombstones at the version level; rows are deleted via deletion vectors which compaction materializes away) but the functional role of reclamation IS present for datasets that use deletes, updates, or dropped columns. The doc now distinguishes 'no LSM-style tombstone reclaim' from 'has deletion-vector reclaim and dropped-column reclaim'. P2 3 (CLUSTER_ASYMMETRY.md): Walked back the claim that Lance versions() is 'the time-axis index of which identity positions changed when'. versions() returns Vec metadata: snapshot tags + timestamps, not a change-set index. To find which identities changed between versions, adopters compare snapshots OR maintain a separate index. The corrected bullet describes versions() as the version-snapshot log (which it is) and notes that the change-set derivation is consumer code. Provenance: Codex P2 review on PR #454 commit 16f879ba587fb. --- docs/APPEND_ONLY_RAFT_DOVETAIL.md | 76 +++++++++++++++++++++---------- docs/CLUSTER_ASYMMETRY.md | 22 ++++++--- 2 files changed, 67 insertions(+), 31 deletions(-) diff --git a/docs/APPEND_ONLY_RAFT_DOVETAIL.md b/docs/APPEND_ONLY_RAFT_DOVETAIL.md index 67ffd836..89dcfcf1 100644 --- a/docs/APPEND_ONLY_RAFT_DOVETAIL.md +++ b/docs/APPEND_ONLY_RAFT_DOVETAIL.md @@ -53,7 +53,7 @@ Compare this with the conventional alternatives: |---|---|---| | **LSM-tree (Cassandra)** | Paxos-light / gossip | Storage AND consensus both have their own append-then-mutate cycles. Compaction in storage interacts with hinted handoff in consensus. Coordination headaches. | | **B-tree (PostgreSQL)** | 2PC (citus-like) | Storage in-place updates fight with 2PC's append-log. Vacuum interacts with commit-log replay. More headaches. | -| **Append-only Lance** | Append-only Raft | One write shape. Storage commit = consensus log entry. No interaction problems. (Lance does have its own file-layout compaction via `DatasetOptimizer.compact_files`, but it runs INDEPENDENTLY of consensus — see Operational consequence #1.) | +| **Append-only Lance** | Append-only Raft | One write shape. Storage commit = consensus log entry. (`DatasetOptimizer.compact_files` produces a new manifest version — that output replicates through the same Raft log as a normal write. The operation runs in one place; the result replicates. Per codex P2 PR #454 — see Operational consequence #1 for the honest framing.) | ## Operational consequences @@ -78,28 +78,51 @@ significant operational burden. Lance has compaction TOO, but of a qualitatively different shape: `DatasetOptimizer.compact_files` merges small fragments into larger ones to optimize query layout (many small appends produce many small -fragments which slow scans). It is NOT a tombstone-reclaim cycle — -Lance is append-only at the version level, so there are no tombstones -to reclaim in the LSM sense. File compaction takes existing -append-only fragments and produces new append-only fragments at a -better layout. +fragments which slow scans). For datasets that use deletes, updates, +or dropped columns, the SAME compaction also performs reclamation — +deletion vectors get materialized away (removing rows logically +marked for delete) and dropped columns are physically removed by +default. So there IS a reclamation role on those datasets; it is +just NOT the LSM tombstone-reclaim mechanism (Lance has no +tombstones at the version level — deletions are tracked by +deletion-vectors against append-only fragments). For append-only +write workloads with no deletes/updates/drops, the layout-only +framing applies; for mixed-write workloads, the reclamation +component is also present. Either way the operation produces new +append-only fragments at a better layout. Operationally: -- Compaction runs INDEPENDENTLY of consensus replication — file - compaction does not block writes, does not affect Raft log shipping, - does not coordinate across nodes -- Per-node compaction is local-only; each node compacts its own - Lance dataset on its own schedule without affecting peers -- The failure modes are smaller (a partial compaction is recoverable; - no in-flight tombstones to lose; correctness is unaffected) +- The compaction OPERATION runs locally on whichever node has the + current leader role (or on a node permitted to run a maintenance + task in the chosen deployment shape); it does not block normal + write flow at the application layer +- The OUTPUT of compaction — the new manifest version + the new + set of fragments — flows through the Raft log like any other + Lance commit. Peers see the new manifest version after consensus + commits, and anti-entropy converges replicas to the post-compaction + state. So the result REPLICATES; the work that produced the result + is what runs in one place +- Per-node SCHEDULING choices do not stack into coordination + headaches the way Cassandra LSM scheduling does, because each + compaction's product is a single committed version (not a + per-replica concurrent rewrite that has to be reconciled). At most + one node should run a given compaction at a time to avoid wasted + work; this is a coordination choice (lock or leader-only), not a + coordination headache +- The failure modes are smaller: a partial compaction is recoverable + via Raft's standard log replay; no in-flight LSM tombstones to + lose; correctness is unaffected A peer-Raft + Lance deployment therefore has uniform per-node -behavior under both consensus and compaction. Operators still plan -for file compaction (it consumes CPU + IO when it runs), but the -cluster-wide coordination burden is significantly lower than -Cassandra's LSM compaction scheduling. (Per post-merge correction -on PR #452.) +behavior under consensus. Compaction is a maintenance operation +that produces a normal commit; operators plan for it (it consumes +CPU + IO when it runs) but the cluster-wide coordination model is +simpler than Cassandra's per-node-independent LSM compaction +scheduling. (Per post-merge correction on PR #452; sharpened per +codex P2 review on PR #454 — the prior framing said 'independent +of consensus' which overclaimed; the operation is local but the +output replicates.) ### 2. Anti-entropy is a hash compare, not a Merkle-tree walk @@ -157,12 +180,17 @@ state). With Lance + Raft, the consensus tax and the storage-COMMIT tax are the SAME tax — the append IS both the consensus log entry and the storage commit; you pay it once. Lance does have its own file- -compaction cycle (`DatasetOptimizer.compact_files`), but it runs -INDEPENDENTLY of consensus — file compaction takes existing -append-only fragments and rewrites them into bigger append-only -fragments without tombstones to reclaim and without coordinating -with replication. So the LAYOUT-OPTIMIZATION cycle exists but does -not interact with the consensus tax. +compaction cycle (`DatasetOptimizer.compact_files`), which produces +a NEW manifest version — and that new version flows through the +same Raft log as any other write, replicating to peers via the +normal consensus + anti-entropy path. So compaction's OUTPUT +counts as a consensus event (one more append). What it does NOT +add is a SECOND tax of the LSM-tree kind (per-node tombstone- +reclaim + run-merge bookkeeping that runs on every replica +independently and creates coordination headaches with replication). +The LAYOUT-OPTIMIZATION cycle exists; it pays the SAME consensus +tax as a regular write (one commit), and does NOT layer a separate +per-replica storage tax on top. ## What this implies for deployment shape diff --git a/docs/CLUSTER_ASYMMETRY.md b/docs/CLUSTER_ASYMMETRY.md index 59462dc1..4f7c5d5b 100644 --- a/docs/CLUSTER_ASYMMETRY.md +++ b/docs/CLUSTER_ASYMMETRY.md @@ -103,13 +103,21 @@ by 1-3 orders of magnitude vs LSM-tree wide-column representations: nibbles common across many entities can be stored once per path segment via consumer-side structures (O(k) lookup, prefix-sharing). **The dedup-by-prefix data structure itself is consumer code, not - a built-in lance-graph crate.** Lance's own `versions()` log is the - time-axis (cross-session index of which identity positions changed - when). An earlier version of this doc cited `vort/vart` as if it - were a shipped crate; corrected per peer review — the radix-shaped - trie at the cognitive layer is a proposed pattern (no shipped crate - name) and the identity primitive + the time-axis are the two - shipped surfaces this bullet should have cited from the start. + a built-in lance-graph crate.** Lance's `versions()` returns + `Vec` — the time-axis is the + version-snapshot LOG (each version is a tagged snapshot of the + dataset; the log is append-only, ordered, and queryable). It does + NOT itself identify which identities changed in each snapshot; + adopters who need a change-set derive it by comparing snapshots + (or by maintaining a separate change-index per their workload's + needs). Codex P2 review on PR #454 caught the prior overclaim + that `versions()` was a 'changed-position index'; corrected: + it's the snapshot LOG, and the change-set derivation is consumer + code. An earlier version of this doc cited `vort/vart` as if it + were a shipped crate (a separate fix). The two shipped surfaces + for the bullet are `NiblePath` (identity) and `versions()` + (snapshot log); the radix-shaped consumer trie and the + change-set index are both consumer code. Concrete example: Wikidata (~115M entities). In Cassandra+JG, the indexed graph form is multi-TB with replication factor 3 → multi-TB