From ef93e287a7fa313809e135d3476137027e53a3b9 Mon Sep 17 00:00:00 2001 From: "jan (via bardioc)" Date: Wed, 3 Jun 2026 11:27:33 +0000 Subject: [PATCH 1/3] =?UTF-8?q?docs:=20cluster=20asymmetry=20=E2=80=94=20c?= =?UTF-8?q?apacity-forced=20vs=20availability-chosen?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Names why lance-graph consumers cluster qualitatively differently than Cassandra-era stacks: data fits per-node (compression cascade + vort/vart trie); peer-Raft replicates the FULL dataset to each node; three-node clusters are production, not toy; no compaction; no rebalancing; no cross-node fan-out for reads. Pairs with the append-only-raft-dovetail PR. --- docs/CLUSTER_ASYMMETRY.md | 222 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 222 insertions(+) create mode 100644 docs/CLUSTER_ASYMMETRY.md diff --git a/docs/CLUSTER_ASYMMETRY.md b/docs/CLUSTER_ASYMMETRY.md new file mode 100644 index 00000000..9b0fc81e --- /dev/null +++ b/docs/CLUSTER_ASYMMETRY.md @@ -0,0 +1,222 @@ +# Cluster ≠ cluster: how lance-graph clusters differ from Cassandra-era clusters + +## TL;DR + +There are two distinct reasons to cluster a distributed system: + +1. **Capacity-forced clustering**: the data does not fit on one node. + You shard. Cross-node queries are required. (Cassandra, JanusGraph, + classic Elasticsearch deployments.) + +2. **Availability-chosen clustering**: the data fits on one node. + You replicate it across N nodes for HA + geo + load distribution. + Each node holds the full dataset. Cross-node queries are NOT + required for the hot path. (CockroachDB, etcd, peer-Raft Lance.) + +Lance-graph consumers are virtually never capacity-forced (the +encoding cascade + columnar layout + radix-trie deduplication +collapse storage by 1-3 orders of magnitude vs LSM-tree wide-column +stores). Clustering is always availability-chosen. + +Don't import the Cassandra cluster operations playbook into a +lance-graph deployment. The failure modes, the operational rhythms, +the budgeting assumptions are all qualitatively different. + +## The two cluster shapes, side by side + +| Concern | Capacity-forced (Cassandra+JG) | Availability-chosen (peer-Raft Lance) | +|---|---|---| +| Reason to cluster | Data does not fit on one node | Data fits on one node; replicate for HA + geo | +| Each node holds | 1/N of the data (consistent-hash shards) | 100% of the data | +| Hot-path reads | Coordinator pattern; fan-out to shard owners | Local; no cross-node hop | +| Replication factor | 3-5 (storage AMPLIFICATION on top of sharding) | 3 (just for HA; no amplification) | +| Effective node count | 3N to 5N where N is shards needed for capacity | 3 (or more if geo distribution is required) | +| Compaction | Yes (each node compacts SSTables on its own schedule; replication lag spikes during) | None (Lance is append-only; version log IS the truth) | +| Cross-shard transactions | Hard; needs 2PC or careful avoidance | Not applicable (no shards) | +| Anti-entropy | Merkle-tree comparison; expensive | Manifest hash compare; O(1) decision | +| Read latency | Depends on coordinator + slowest shard | Local-replica latency; predictable | +| Operator burden | High (capacity planning, compaction scheduling, rebalancing) | Low (HA discipline only; no shard management) | + +## Why lance-graph consumers fit on one node + +The encoding stack typical of lance-graph consumers compresses data +by 1-3 orders of magnitude vs LSM-tree wide-column representations: + +- **highheelbgz SpiralAddress**: 3 integers (12 bytes or 6 bytes + for u16 variants) representing a φ-spiral walking address — NOT + a copy of the weight vector, an address into a deterministic + spiral. Per-row footprint approaches the information-theoretic + minimum for the addressing dimension. + +- **bgz-hhtl-d Slot D / Slot V**: 4 bytes total per row — 2-byte + Slot D (HEEL basin 2 bits / HIP family 4 bits / TWIG centroid 8 + bits / polarity 1 bit / reserved 1 bit) + 2-byte Slot V (BF16 + residual magnitude). The hierarchical addressing is in the bit + layout; the residual captures the per-row delta from the centroid. + +- **bgz17 palette256**: 256-archetype compose table for multi-hop + semantic relations. 8 bits per archetype reference; multi-hop + composition in O(1) via the compose table. + +- **CAM-PQ leaf vectors**: 6 bytes per row for the leaf-exact-match + vector projection. HHTL-banked variant (16 family subcodebooks) + reduces ANN scan space by ~16× under empirical intra-family + locality (98.6% per the `lance-graph` PR #444 probe). + +- **vort/vart adaptive radix trie**: structural deduplication of + shared HHTL prefixes. The heel + hip nibbles common across many + entities are stored ONCE per path segment, not N times. Adaptive + Radix Tree shape: O(k) lookup, prefix-sharing storage. The same + structure also serves as the time-axis index over Lance + `versions()` for cold-path queries. + +Concrete example: Wikidata (~115M entities). In Cassandra+JG, the +indexed graph form is multi-TB with replication factor 3 → multi-TB +× 3 across the cluster. In lance-graph with the above encoding stack, +the same corpus compresses to low single-digit GB total (including +indexes and the Lance version log). Fits on a modest single node. +Each peer replica holds the FULL dataset. + +This is not an academic claim; it's the deployment shape proven in +the `AdaWorldAPI/bardioc` B1 substrate-b reference implementation. + +## Knock-on consequences of availability-chosen clustering + +### 1. Three-node deployment is the starting recommendation, not the toy + +For Cassandra-era thinkers, three-node clusters are toys. Production +Cassandra deployments often run 12-24 nodes. The reasoning: each node +holds 1/N of the data; you need many N to fit the corpus; replication +factor 3 multiplies again. + +For peer-Raft Lance: three-node clusters are production. Each node +holds the full dataset. Three replicas give you majority quorum (one +failure tolerated). More replicas are added for geographic distribution +or read-load fanout, not for capacity. + +Default starting deployment: 3 substrate-b instances, one per AZ in +the same region, peer-Raft replicated. + +### 2. No coordinator pattern; no fan-out lag + +Cassandra reads route through a coordinator node which fans out to +the shard owners and aggregates the response. The query latency is +bounded below by the slowest shard's response time. Hot shards drag +the cluster. + +Peer-Raft Lance reads are LOCAL. The client connects to any node; +the node has the full dataset; it returns the answer without contacting +peers (for eventually-consistent reads) or with a single Raft read-index +round (for linearizable reads). No fan-out, no aggregation, no +slowest-shard lag. + +### 3. No compaction scheduling + +Cassandra clusters need their compaction schedule managed across nodes. +Compaction is CPU + IO heavy and creates replication lag spikes; if +too many nodes compact simultaneously, the cluster's effective +replication factor temporarily drops. + +Lance has no compaction. The version log IS the truth. Garbage collection +of old versions is local-only, doesn't interact with replication, doesn't +require cluster-wide scheduling. + +### 4. Anti-entropy is cheap + +Cassandra anti-entropy (catching up a lagging replica) compares SSTable +Merkle trees node-by-node and streams the diffs. This is expensive and +creates load on both sides. + +Lance peer-Raft anti-entropy: compare the manifest hash between nodes. +If equal, sync. If not, ship missing fragments + the new manifest. The +IDENTIFICATION step is O(1). The streaming step is bounded by the actual +divergence, not by the dataset size. + +### 5. Rebalancing is not a thing + +Cassandra rebalancing (when adding or removing nodes) requires data +movement. Token ranges shift; data streams from old owners to new +owners; the cluster operates in a degraded state during the rebalance. + +Peer-Raft Lance: adding a node = bring up a new substrate-b instance, +let it catch up via Raft, mark it a voter. No data movement needed +beyond the catch-up (and the catch-up is the same wire pattern as +the per-write replication — no special "rebalance" mode). Removing +a node: stop the substrate-b, mark it non-voter, retire it. No data +movement. + +## Knock-on consequences for the Raft consensus tax + +Raft consensus IS on the per-request budget for writes and linearizable +reads. This is irreducible for the distributed-OLTP property. (See +the companion doc `append-only-raft-dovetail.md` for why this lands +lighter in Lance than in LSM-based systems.) + +In an availability-chosen cluster, the consensus tax lands EVEN lighter +than in a capacity-forced cluster: + +- **Smaller per-node datasets**: Raft logs are smaller; commits propagate + faster +- **Fewer replicas needed**: 3 substrate-b instances vs 12-24+ + Cassandra+JG nodes → less coordination overhead per write +- **No cross-node fan-out**: linearizable read-index can be served by + the local leader of the relevant Raft group; doesn't require remote + round-trip when the local instance IS the leader + +## When you actually DO need capacity-forced sharding + +Rare for lance-graph consumers, but possible: + +- Corpora dramatically larger than Wikidata (multi-billion entities or + multi-PB raw data) +- Workloads with hot-key distributions that exceed a single node's + IO bandwidth +- Specific compliance requirements that mandate physical data isolation + per tenant + +In those cases, the appropriate response is application-level sharding +or tenant-level partitioning — NOT a Cassandra-style consistent-hash +ring. Each shard or tenant gets its own peer-Raft + Lance cluster. +The capacity dimension is solved by horizontal application-layer +partitioning; each underlying cluster remains availability-chosen. + +## What this doc does NOT claim + +- **Single-node deployments are sufficient for production.** They're + not. HA requires multiple replicas (or accepted downtime during + failures). Single-node is a development / staging shape. + +- **Three-node is universally optimal.** Geographic distribution + may require more (one replica per region). Specific availability + targets may justify more replicas. + +- **All Lance + Raft stacks ship the same compression.** The specific + encoding cascade described above is from the `bardioc` B1 reference + consumer; other consumers will have different per-row footprints. + The qualitative property (orders-of-magnitude vs LSM wide-column) + remains, but specific numbers vary. + +- **Cassandra+JG choose their shape wrongly.** The Cassandra design + is correct FOR THE STORAGE MODEL IT HAS. The architectural choice + that produces a different deployment shape is the choice to use + Lance + Raft, with append-only storage and columnar compression. + The doc names the consequence, not the wrongness of the alternative. + +## Recommended deployment pattern (reference) + +See [reference consumer implementation in bardioc B1 substrate-b] +(separate proposed doc) for a worked example. Briefly: + +- Three substrate-b instances, one per availability zone within a region +- Each substrate-b is one Rust binary AND one Raft node +- Lance dataset local to each instance (full dataset, not a shard) +- Reads serve from local Lance with no cross-node coordination + (eventually-consistent) or from a Raft read-index round (linearizable) +- Writes serve via Raft quorum to the local leader; replicated as Lance + fragment appends +- Add a fourth+ instance only for geographic distribution or read-load + fanout, not for capacity + +This is the shape proven against Wikidata-scale workloads in the +bardioc reference consumer. Operational complexity is significantly +lower than the Cassandra-era equivalent at the same availability target. \ No newline at end of file From 0007c83a52542d114268a5c6961c55d93e500175 Mon Sep 17 00:00:00 2001 From: "jan (via bardioc)" Date: Wed, 3 Jun 2026 11:35:45 +0000 Subject: [PATCH 2/3] fix(codex-review): scope caveat + compaction honesty + amplification honesty Three codex findings on PR #453: P1: scope caveat - peer-Raft + Lance-local is an EXTERNAL architecture pattern (bardioc B1 substrate-b), NOT a built-in lance-graph feature. Adopters provide the Raft layer themselves (openraft / surreal-cluster / external TiKV); lance-graph provides the columnar storage + DataFusion + encoding crates that MAKE the pattern cheap, not the pattern itself. Added a scope banner after the TL;DR and a per-line reminder on the Recommended deployment pattern section. P2 compaction: Lance has compaction TOO (DatasetOptimizer.compact_files for fragment layout optimization), just qualitatively different from LSM tombstone-reclaim+run-merge. Rewrote section 3 to distinguish LSM compaction from Lance file compaction; operators should plan for the latter. P2 replication amplification: 3 replicas of a 5GB dataset is 15GB of total disk regardless of distribution shape. Replication amplification does NOT disappear in availability-chosen clustering; it does not compound with shard count (the architectural advantage). Fixed the comparison table row and added an honesty paragraph. --- docs/CLUSTER_ASYMMETRY.md | 86 +++++++++++++++++++++++++++++++++------ 1 file changed, 74 insertions(+), 12 deletions(-) diff --git a/docs/CLUSTER_ASYMMETRY.md b/docs/CLUSTER_ASYMMETRY.md index 9b0fc81e..d36eb450 100644 --- a/docs/CLUSTER_ASYMMETRY.md +++ b/docs/CLUSTER_ASYMMETRY.md @@ -22,6 +22,27 @@ Don't import the Cassandra cluster operations playbook into a lance-graph deployment. The failure modes, the operational rhythms, the budgeting assumptions are all qualitatively different. +> ### Scope: external architecture pattern, not a built-in lance-graph feature +> +> Per codex P1 review on PR #453: the deployment shapes described +> below — particularly "peer-Raft + Lance-local-per-node" — describe +> an EXTERNAL ARCHITECTURE PATTERN that adopters can build on top of +> lance-graph, NOT a built-in lance-graph capability. Lance-graph +> itself provides the columnar storage, the DataFusion query path, +> the encoding crates, and the Rust API surface. The Raft layer, the +> substrate binary, and the consensus-replication path are +> downstream consumer code. Adopters implementing this pattern reach +> for `openraft` (or `surreal-cluster` if their stack is +> surrealdb-shaped) to provide the Raft layer; they would NOT +> inherit one from lance-graph. +> +> The doc documents WHY this pattern works well WHEN built on +> lance-graph (the append-only storage model, cheap anti-entropy, +> per-node fit), not a feature lance-graph itself ships. Adopters +> who only consume lance-graph's columnar + DataFusion path should +> NOT assume their data is automatically replicated. + + ## The two cluster shapes, side by side | Concern | Capacity-forced (Cassandra+JG) | Availability-chosen (peer-Raft Lance) | @@ -29,7 +50,7 @@ the budgeting assumptions are all qualitatively different. | Reason to cluster | Data does not fit on one node | Data fits on one node; replicate for HA + geo | | Each node holds | 1/N of the data (consistent-hash shards) | 100% of the data | | Hot-path reads | Coordinator pattern; fan-out to shard owners | Local; no cross-node hop | -| Replication factor | 3-5 (storage AMPLIFICATION on top of sharding) | 3 (just for HA; no amplification) | +| Replication factor + storage amplification | R=3-5 per replica; total cluster storage = N_shards × R (sharding COMPOUNDS replication amplification) | R=3 per replica; total cluster storage = R × dataset_size (same per-replica amplification; no sharding compound) | | Effective node count | 3N to 5N where N is shards needed for capacity | 3 (or more if geo distribution is required) | | Compaction | Yes (each node compacts SSTables on its own schedule; replication lag spikes during) | None (Lance is append-only; version log IS the truth) | | Cross-shard transactions | Hard; needs 2PC or careful avoidance | Not applicable (no shards) | @@ -37,6 +58,17 @@ the budgeting assumptions are all qualitatively different. | Read latency | Depends on coordinator + slowest shard | Local-replica latency; predictable | | Operator burden | High (capacity planning, compaction scheduling, rebalancing) | Low (HA discipline only; no shard management) | + +**Storage amplification honesty (per codex P2 review on PR #453):** +Both shapes have replication storage amplification — three replicas +of a 5GB dataset is 15GB of total cluster disk regardless of +distribution shape. The architectural advantage of availability-chosen +clustering is NOT that amplification disappears (it doesn't); it's +that amplification does NOT compound with shard count. Cassandra at +RF=3 with 10 capacity shards consumes ~30× the single-replica +storage; peer-Raft at R=3 consumes exactly 3×. Same per-replica +amplification, no shard multiplier on top. + ## Why lance-graph consumers fit on one node The encoding stack typical of lance-graph consumers compresses data @@ -110,16 +142,38 @@ peers (for eventually-consistent reads) or with a single Raft read-index round (for linearizable reads). No fan-out, no aggregation, no slowest-shard lag. -### 3. No compaction scheduling - -Cassandra clusters need their compaction schedule managed across nodes. -Compaction is CPU + IO heavy and creates replication lag spikes; if -too many nodes compact simultaneously, the cluster's effective -replication factor temporarily drops. - -Lance has no compaction. The version log IS the truth. Garbage collection -of old versions is local-only, doesn't interact with replication, doesn't -require cluster-wide scheduling. +### 3. Compaction is qualitatively different (lighter coordination) + +Cassandra-style LSM compaction rewrites SSTables to reclaim tombstones +and merge sorted runs. It is CPU + IO heavy and creates replication +lag spikes; if too many nodes compact simultaneously, the cluster's +effective replication factor temporarily drops. Operators schedule +compactions across nodes to avoid that. + +Lance has compaction too, but of a different shape: +`DatasetOptimizer.compact_files` merges small fragments into larger +ones for query layout optimization (many small appends produce many +small fragments which slow scans; periodic file compaction restores +good layout). It is NOT a tombstone-reclaim cycle — Lance is +append-only at the version level, so there are no tombstones to +reclaim in the LSM sense. + +The qualitative difference: + +- **LSM compaction** is a CORRECTNESS + SPACE concern (tombstones must + be reclaimed to bound storage; runs must merge for read performance); + coordination with replication is unavoidable. +- **Lance file compaction** is a LAYOUT OPTIMIZATION concern (queries + get faster when fragments are larger; correctness is unaffected); + it can be scheduled independently per node, produces new fragments + that are themselves append-only, and does NOT block or interact + with consensus replication. + +So Lance compaction exists and operators should plan for it (Lance's +table-maintenance docs describe `DatasetOptimizer.compact_files`). +The operational burden is lower than Cassandra LSM compaction because +the coordination requirements are weaker — but it is not zero. (Per +codex P2 review on PR #453.) ### 4. Anti-entropy is cheap @@ -204,11 +258,19 @@ partitioning; each underlying cluster remains availability-chosen. ## Recommended deployment pattern (reference) +> **Reminder (per codex P1):** This pattern is the bardioc B1 +> substrate-b reference architecture, NOT a built-in lance-graph +> feature. Adopters provide the Raft layer themselves (openraft / +> surreal-cluster / external TiKV). Lance-graph contributes the +> columnar storage + DataFusion + encoding crates that MAKE this +> pattern cheap, not the pattern itself. + See [reference consumer implementation in bardioc B1 substrate-b] (separate proposed doc) for a worked example. Briefly: - Three substrate-b instances, one per availability zone within a region -- Each substrate-b is one Rust binary AND one Raft node +- Each substrate-b is one Rust binary AND one Raft node (Raft impl + from openraft or surreal-cluster — NOT inherited from lance-graph) - Lance dataset local to each instance (full dataset, not a shard) - Reads serve from local Lance with no cross-node coordination (eventually-consistent) or from a Raft read-index round (linearizable) From 8d8ad005e7508c89ae4bb6d4c3456aeae9f10763 Mon Sep 17 00:00:00 2001 From: "jan (via bardioc)" Date: Wed, 3 Jun 2026 11:37:18 +0000 Subject: [PATCH 3/3] fix(codex-review followup): also fix compaction overclaim in comparison table The previous codex-review fix updated section 3 of operational consequences but missed the parallel overclaim in the side-by-side Capacity-forced vs Availability-chosen table row. Now both occurrences distinguish LSM compaction from Lance file compaction consistently. --- docs/CLUSTER_ASYMMETRY.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/CLUSTER_ASYMMETRY.md b/docs/CLUSTER_ASYMMETRY.md index d36eb450..47483f0f 100644 --- a/docs/CLUSTER_ASYMMETRY.md +++ b/docs/CLUSTER_ASYMMETRY.md @@ -52,7 +52,7 @@ the budgeting assumptions are all qualitatively different. | Hot-path reads | Coordinator pattern; fan-out to shard owners | Local; no cross-node hop | | Replication factor + storage amplification | R=3-5 per replica; total cluster storage = N_shards × R (sharding COMPOUNDS replication amplification) | R=3 per replica; total cluster storage = R × dataset_size (same per-replica amplification; no sharding compound) | | Effective node count | 3N to 5N where N is shards needed for capacity | 3 (or more if geo distribution is required) | -| Compaction | Yes (each node compacts SSTables on its own schedule; replication lag spikes during) | None (Lance is append-only; version log IS the truth) | +| Compaction shape | LSM SSTable: tombstone-reclaim + run-merge; coordinates with replication; lag spikes during | Lance file (`DatasetOptimizer.compact_files`): merge small fragments for layout; independent of consensus; no replication-lag interaction | | Cross-shard transactions | Hard; needs 2PC or careful avoidance | Not applicable (no shards) | | Anti-entropy | Merkle-tree comparison; expensive | Manifest hash compare; O(1) decision | | Read latency | Depends on coordinator + slowest shard | Local-replica latency; predictable |