Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
119 changes: 94 additions & 25 deletions docs/APPEND_ONLY_RAFT_DOVETAIL.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,24 @@ or any other model where storage mutates state in place.
This doc names the property explicitly + lists the operational
consequences so adopters can choose the right deployment shape.

> ### Scope: external architecture pattern, not a built-in lance-graph feature
>
> Post-merge correction (paralleling peer review on the companion doc
> PR #453): the deployment shape described below ("peer-Raft +
> Lance-local-per-node") is an EXTERNAL ARCHITECTURE PATTERN adopters
> can build on top of lance-graph, NOT a built-in lance-graph
> capability. Lance-graph provides the append-only columnar storage +
> the DataFusion query path + the encoding crates; the Raft layer +
> the substrate binary + the consensus-replication path are
> downstream consumer code (e.g. `openraft` or `surreal-cluster`).
> Adopters who only consume lance-graph's columnar + DataFusion path
> should NOT assume their data is automatically replicated.
>
> The doc documents WHY this pattern works well WHEN built on
> lance-graph — the storage-append/consensus-append dovetail property
> — not a feature lance-graph itself ships.


## The two write shapes that have to align

A distributed Lance deployment has two write paths:
Expand All @@ -35,15 +53,16 @@ Compare this with the conventional alternatives:
|---|---|---|
| **LSM-tree (Cassandra)** | Paxos-light / gossip | Storage AND consensus both have their own append-then-mutate cycles. Compaction in storage interacts with hinted handoff in consensus. Coordination headaches. |
| **B-tree (PostgreSQL)** | 2PC (citus-like) | Storage in-place updates fight with 2PC's append-log. Vacuum interacts with commit-log replay. More headaches. |
| **Append-only Lance** | Append-only Raft | One write shape. Storage commit = consensus log entry. No interaction problems. |
| **Append-only Lance** | Append-only Raft | One write shape. Storage commit = consensus log entry. (`DatasetOptimizer.compact_files` produces a new manifest version — that output replicates through the same Raft log as a normal write. The operation runs in one place; the result replicates. Per codex P2 PR #454 — see Operational consequence #1 for the honest framing.) |

## Operational consequences

### 1. No compaction storms
### 1. Compaction is qualitatively different, not absent

Cassandra clusters periodically run compaction (rewrite SSTables to
reclaim space + maintain read performance). Each node compacts on
its own schedule. During compaction:
Cassandra clusters periodically run LSM compaction (rewrite SSTables
to reclaim tombstones + merge sorted runs to maintain read
performance). Each node compacts on its own schedule. During
compaction:

- The compacting node's CPU spikes
- Its disk write bandwidth spikes
Expand All @@ -56,15 +75,54 @@ The cluster operator's job is partly to schedule compactions across
nodes so that not too many compact simultaneously. This is a
significant operational burden.

Lance has no compaction. The version log IS the truth; old fragments
can be reclaimed by version-based GC (a much simpler operation than
SSTable compaction) but the GC is local-only, doesn't interact with
replication, and doesn't reorder anything.
Lance has compaction TOO, but of a qualitatively different shape:
`DatasetOptimizer.compact_files` merges small fragments into larger
ones to optimize query layout (many small appends produce many small
fragments which slow scans). For datasets that use deletes, updates,
or dropped columns, the SAME compaction also performs reclamation —
deletion vectors get materialized away (removing rows logically
marked for delete) and dropped columns are physically removed by
default. So there IS a reclamation role on those datasets; it is
just NOT the LSM tombstone-reclaim mechanism (Lance has no
tombstones at the version level — deletions are tracked by
deletion-vectors against append-only fragments). For append-only
write workloads with no deletes/updates/drops, the layout-only
framing applies; for mixed-write workloads, the reclamation
component is also present. Either way the operation produces new
append-only fragments at a better layout.

Operationally:

- The compaction OPERATION runs locally on whichever node has the
current leader role (or on a node permitted to run a maintenance
task in the chosen deployment shape); it does not block normal
write flow at the application layer
- The OUTPUT of compaction — the new manifest version + the new
set of fragments — flows through the Raft log like any other
Lance commit. Peers see the new manifest version after consensus
commits, and anti-entropy converges replicas to the post-compaction
state. So the result REPLICATES; the work that produced the result
is what runs in one place
- Per-node SCHEDULING choices do not stack into coordination
headaches the way Cassandra LSM scheduling does, because each
compaction's product is a single committed version (not a
per-replica concurrent rewrite that has to be reconciled). At most
one node should run a given compaction at a time to avoid wasted
work; this is a coordination choice (lock or leader-only), not a
coordination headache
- The failure modes are smaller: a partial compaction is recoverable
via Raft's standard log replay; no in-flight LSM tombstones to
lose; correctness is unaffected

A peer-Raft + Lance deployment therefore has uniform per-node
behavior. Each node is doing the same work at the same time, with
the same shape. The operations runbook is simpler because the
failure modes are simpler.
behavior under consensus. Compaction is a maintenance operation
that produces a normal commit; operators plan for it (it consumes
CPU + IO when it runs) but the cluster-wide coordination model is
simpler than Cassandra's per-node-independent LSM compaction
scheduling. (Per post-merge correction on PR #452; sharpened per
codex P2 review on PR #454 — the prior framing said 'independent
of consensus' which overclaimed; the operation is local but the
output replicates.)

### 2. Anti-entropy is a hash compare, not a Merkle-tree walk

Expand Down Expand Up @@ -109,19 +167,30 @@ footprint depends on which columns mutated, whether the row was new
or updated, whether the column had a previous value. Cross-DC
replication budget is harder to plan.

### 5. The consensus tax lands once, not twice

This is the unifying point: with non-append-only storage, an
application that wants linearizable writes pays the consensus tax
TWICE. Once for the consensus protocol shipping operations to
replicas. Once for the storage layer doing per-node compaction +
mutation bookkeeping. The two taxes interact (a compaction storm
delays consensus catch-up; a Raft snapshot has to materialize the
LSM-tree state).

With Lance + Raft, the consensus tax and the storage tax are the
SAME tax. The append IS both the consensus log entry and the storage
commit. You pay it once.
### 5. The consensus tax and the storage-COMMIT tax are the same tax

This is the unifying point: with LSM-tree storage, an application
that wants linearizable writes pays the consensus tax TWICE. Once
for the consensus protocol shipping operations to replicas. Once for
the storage layer doing per-node tombstone-reclaim + run-merge
compaction. The two taxes interact (a compaction storm delays
consensus catch-up; a Raft snapshot has to materialize the LSM-tree
state).

With Lance + Raft, the consensus tax and the storage-COMMIT tax are
the SAME tax — the append IS both the consensus log entry and the
storage commit; you pay it once. Lance does have its own file-
compaction cycle (`DatasetOptimizer.compact_files`), which produces
a NEW manifest version — and that new version flows through the
same Raft log as any other write, replicating to peers via the
normal consensus + anti-entropy path. So compaction's OUTPUT
counts as a consensus event (one more append). What it does NOT
add is a SECOND tax of the LSM-tree kind (per-node tombstone-
reclaim + run-merge bookkeeping that runs on every replica
independently and creates coordination headaches with replication).
The LAYOUT-OPTIMIZATION cycle exists; it pays the SAME consensus
tax as a regular write (one commit), and does NOT layer a separate
per-replica storage tax on top.

## What this implies for deployment shape

Expand Down
29 changes: 23 additions & 6 deletions docs/CLUSTER_ASYMMETRY.md
Original file line number Diff line number Diff line change
Expand Up @@ -95,12 +95,29 @@ by 1-3 orders of magnitude vs LSM-tree wide-column representations:
reduces ANN scan space by ~16× under empirical intra-family
locality (98.6% per the `lance-graph` PR #444 probe).

- **vort/vart adaptive radix trie**: structural deduplication of
shared HHTL prefixes. The heel + hip nibbles common across many
entities are stored ONCE per path segment, not N times. Adaptive
Radix Tree shape: O(k) lookup, prefix-sharing storage. The same
structure also serves as the time-axis index over Lance
`versions()` for cold-path queries.
- **`lance-graph-contract::hhtl::NiblePath` (shipped) + Lance
`versions()` (shipped)**: HHTL identity is a 16ⁿ nibble path packed
into a `u64` (`FAN_OUT = 16`, `MAX_DEPTH = 16`). Adopters who want
to dedupe shared HHTL prefixes in memory typically derive an
adaptive radix-trie index over `NiblePath` addresses — heel + hip
nibbles common across many entities can be stored once per path
segment via consumer-side structures (O(k) lookup, prefix-sharing).
**The dedup-by-prefix data structure itself is consumer code, not
a built-in lance-graph crate.** Lance's `versions()` returns
`Vec<lance::dataset::Version>` — the time-axis is the
version-snapshot LOG (each version is a tagged snapshot of the
dataset; the log is append-only, ordered, and queryable). It does
NOT itself identify which identities changed in each snapshot;
adopters who need a change-set derive it by comparing snapshots
(or by maintaining a separate change-index per their workload's
needs). Codex P2 review on PR #454 caught the prior overclaim
that `versions()` was a 'changed-position index'; corrected:
it's the snapshot LOG, and the change-set derivation is consumer
code. An earlier version of this doc cited `vort/vart` as if it
were a shipped crate (a separate fix). The two shipped surfaces
for the bullet are `NiblePath` (identity) and `versions()`
(snapshot log); the radix-shaped consumer trie and the
change-set index are both consumer code.

Concrete example: Wikidata (~115M entities). In Cassandra+JG, the
indexed graph form is multi-TB with replication factor 3 → multi-TB
Expand Down