From ef93e287a7fa313809e135d3476137027e53a3b9 Mon Sep 17 00:00:00 2001
From: "jan (via bardioc)" <jan@exo.red>
Date: Wed, 3 Jun 2026 11:27:33 +0000
Subject: [PATCH 1/3] =?UTF-8?q?docs:=20cluster=20asymmetry=20=E2=80=94=20c?=
 =?UTF-8?q?apacity-forced=20vs=20availability-chosen?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Names why lance-graph consumers cluster qualitatively differently than
Cassandra-era stacks: data fits per-node (compression cascade +
vort/vart trie); peer-Raft replicates the FULL dataset to each node;
three-node clusters are production, not toy; no compaction; no
rebalancing; no cross-node fan-out for reads. Pairs with the
append-only-raft-dovetail PR.
---
 docs/CLUSTER_ASYMMETRY.md | 222 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 222 insertions(+)
 create mode 100644 docs/CLUSTER_ASYMMETRY.md

diff --git a/docs/CLUSTER_ASYMMETRY.md b/docs/CLUSTER_ASYMMETRY.md
new file mode 100644
index 00000000..9b0fc81e
--- /dev/null
+++ b/docs/CLUSTER_ASYMMETRY.md
@@ -0,0 +1,222 @@
+# Cluster ≠ cluster: how lance-graph clusters differ from Cassandra-era clusters
+
+## TL;DR
+
+There are two distinct reasons to cluster a distributed system:
+
+1. **Capacity-forced clustering**: the data does not fit on one node.
+   You shard. Cross-node queries are required. (Cassandra, JanusGraph,
+   classic Elasticsearch deployments.)
+
+2. **Availability-chosen clustering**: the data fits on one node.
+   You replicate it across N nodes for HA + geo + load distribution.
+   Each node holds the full dataset. Cross-node queries are NOT
+   required for the hot path. (CockroachDB, etcd, peer-Raft Lance.)
+
+Lance-graph consumers are virtually never capacity-forced (the
+encoding cascade + columnar layout + radix-trie deduplication
+collapse storage by 1-3 orders of magnitude vs LSM-tree wide-column
+stores). Clustering is always availability-chosen.
+
+Don't import the Cassandra cluster operations playbook into a
+lance-graph deployment. The failure modes, the operational rhythms,
+the budgeting assumptions are all qualitatively different.
+
+## The two cluster shapes, side by side
+
+| Concern | Capacity-forced (Cassandra+JG) | Availability-chosen (peer-Raft Lance) |
+|---|---|---|
+| Reason to cluster | Data does not fit on one node | Data fits on one node; replicate for HA + geo |
+| Each node holds | 1/N of the data (consistent-hash shards) | 100% of the data |
+| Hot-path reads | Coordinator pattern; fan-out to shard owners | Local; no cross-node hop |
+| Replication factor | 3-5 (storage AMPLIFICATION on top of sharding) | 3 (just for HA; no amplification) |
+| Effective node count | 3N to 5N where N is shards needed for capacity | 3 (or more if geo distribution is required) |
+| Compaction | Yes (each node compacts SSTables on its own schedule; replication lag spikes during) | None (Lance is append-only; version log IS the truth) |
+| Cross-shard transactions | Hard; needs 2PC or careful avoidance | Not applicable (no shards) |
+| Anti-entropy | Merkle-tree comparison; expensive | Manifest hash compare; O(1) decision |
+| Read latency | Depends on coordinator + slowest shard | Local-replica latency; predictable |
+| Operator burden | High (capacity planning, compaction scheduling, rebalancing) | Low (HA discipline only; no shard management) |
+
+## Why lance-graph consumers fit on one node
+
+The encoding stack typical of lance-graph consumers compresses data
+by 1-3 orders of magnitude vs LSM-tree wide-column representations:
+
+- **highheelbgz SpiralAddress**: 3 integers (12 bytes or 6 bytes
+  for u16 variants) representing a φ-spiral walking address — NOT
+  a copy of the weight vector, an address into a deterministic
+  spiral. Per-row footprint approaches the information-theoretic
+  minimum for the addressing dimension.
+
+- **bgz-hhtl-d Slot D / Slot V**: 4 bytes total per row — 2-byte
+  Slot D (HEEL basin 2 bits / HIP family 4 bits / TWIG centroid 8
+  bits / polarity 1 bit / reserved 1 bit) + 2-byte Slot V (BF16
+  residual magnitude). The hierarchical addressing is in the bit
+  layout; the residual captures the per-row delta from the centroid.
+
+- **bgz17 palette256**: 256-archetype compose table for multi-hop
+  semantic relations. 8 bits per archetype reference; multi-hop
+  composition in O(1) via the compose table.
+
+- **CAM-PQ leaf vectors**: 6 bytes per row for the leaf-exact-match
+  vector projection. HHTL-banked variant (16 family subcodebooks)
+  reduces ANN scan space by ~16× under empirical intra-family
+  locality (98.6% per the `lance-graph` PR #444 probe).
+
+- **vort/vart adaptive radix trie**: structural deduplication of
+  shared HHTL prefixes. The heel + hip nibbles common across many
+  entities are stored ONCE per path segment, not N times. Adaptive
+  Radix Tree shape: O(k) lookup, prefix-sharing storage. The same
+  structure also serves as the time-axis index over Lance
+  `versions()` for cold-path queries.
+
+Concrete example: Wikidata (~115M entities). In Cassandra+JG, the
+indexed graph form is multi-TB with replication factor 3 → multi-TB
+× 3 across the cluster. In lance-graph with the above encoding stack,
+the same corpus compresses to low single-digit GB total (including
+indexes and the Lance version log). Fits on a modest single node.
+Each peer replica holds the FULL dataset.
+
+This is not an academic claim; it's the deployment shape proven in
+the `AdaWorldAPI/bardioc` B1 substrate-b reference implementation.
+
+## Knock-on consequences of availability-chosen clustering
+
+### 1. Three-node deployment is the starting recommendation, not the toy
+
+For Cassandra-era thinkers, three-node clusters are toys. Production
+Cassandra deployments often run 12-24 nodes. The reasoning: each node
+holds 1/N of the data; you need many N to fit the corpus; replication
+factor 3 multiplies again.
+
+For peer-Raft Lance: three-node clusters are production. Each node
+holds the full dataset. Three replicas give you majority quorum (one
+failure tolerated). More replicas are added for geographic distribution
+or read-load fanout, not for capacity.
+
+Default starting deployment: 3 substrate-b instances, one per AZ in
+the same region, peer-Raft replicated.
+
+### 2. No coordinator pattern; no fan-out lag
+
+Cassandra reads route through a coordinator node which fans out to
+the shard owners and aggregates the response. The query latency is
+bounded below by the slowest shard's response time. Hot shards drag
+the cluster.
+
+Peer-Raft Lance reads are LOCAL. The client connects to any node;
+the node has the full dataset; it returns the answer without contacting
+peers (for eventually-consistent reads) or with a single Raft read-index
+round (for linearizable reads). No fan-out, no aggregation, no
+slowest-shard lag.
+
+### 3. No compaction scheduling
+
+Cassandra clusters need their compaction schedule managed across nodes.
+Compaction is CPU + IO heavy and creates replication lag spikes; if
+too many nodes compact simultaneously, the cluster's effective
+replication factor temporarily drops.
+
+Lance has no compaction. The version log IS the truth. Garbage collection
+of old versions is local-only, doesn't interact with replication, doesn't
+require cluster-wide scheduling.
+
+### 4. Anti-entropy is cheap
+
+Cassandra anti-entropy (catching up a lagging replica) compares SSTable
+Merkle trees node-by-node and streams the diffs. This is expensive and
+creates load on both sides.
+
+Lance peer-Raft anti-entropy: compare the manifest hash between nodes.
+If equal, sync. If not, ship missing fragments + the new manifest. The
+IDENTIFICATION step is O(1). The streaming step is bounded by the actual
+divergence, not by the dataset size.
+
+### 5. Rebalancing is not a thing
+
+Cassandra rebalancing (when adding or removing nodes) requires data
+movement. Token ranges shift; data streams from old owners to new
+owners; the cluster operates in a degraded state during the rebalance.
+
+Peer-Raft Lance: adding a node = bring up a new substrate-b instance,
+let it catch up via Raft, mark it a voter. No data movement needed
+beyond the catch-up (and the catch-up is the same wire pattern as
+the per-write replication — no special "rebalance" mode). Removing
+a node: stop the substrate-b, mark it non-voter, retire it. No data
+movement.
+
+## Knock-on consequences for the Raft consensus tax
+
+Raft consensus IS on the per-request budget for writes and linearizable
+reads. This is irreducible for the distributed-OLTP property. (See
+the companion doc `append-only-raft-dovetail.md` for why this lands
+lighter in Lance than in LSM-based systems.)
+
+In an availability-chosen cluster, the consensus tax lands EVEN lighter
+than in a capacity-forced cluster:
+
+- **Smaller per-node datasets**: Raft logs are smaller; commits propagate
+  faster
+- **Fewer replicas needed**: 3 substrate-b instances vs 12-24+
+  Cassandra+JG nodes → less coordination overhead per write
+- **No cross-node fan-out**: linearizable read-index can be served by
+  the local leader of the relevant Raft group; doesn't require remote
+  round-trip when the local instance IS the leader
+
+## When you actually DO need capacity-forced sharding
+
+Rare for lance-graph consumers, but possible:
+
+- Corpora dramatically larger than Wikidata (multi-billion entities or
+  multi-PB raw data)
+- Workloads with hot-key distributions that exceed a single node's
+  IO bandwidth
+- Specific compliance requirements that mandate physical data isolation
+  per tenant
+
+In those cases, the appropriate response is application-level sharding
+or tenant-level partitioning — NOT a Cassandra-style consistent-hash
+ring. Each shard or tenant gets its own peer-Raft + Lance cluster.
+The capacity dimension is solved by horizontal application-layer
+partitioning; each underlying cluster remains availability-chosen.
+
+## What this doc does NOT claim
+
+- **Single-node deployments are sufficient for production.** They're
+  not. HA requires multiple replicas (or accepted downtime during
+  failures). Single-node is a development / staging shape.
+
+- **Three-node is universally optimal.** Geographic distribution
+  may require more (one replica per region). Specific availability
+  targets may justify more replicas.
+
+- **All Lance + Raft stacks ship the same compression.** The specific
+  encoding cascade described above is from the `bardioc` B1 reference
+  consumer; other consumers will have different per-row footprints.
+  The qualitative property (orders-of-magnitude vs LSM wide-column)
+  remains, but specific numbers vary.
+
+- **Cassandra+JG choose their shape wrongly.** The Cassandra design
+  is correct FOR THE STORAGE MODEL IT HAS. The architectural choice
+  that produces a different deployment shape is the choice to use
+  Lance + Raft, with append-only storage and columnar compression.
+  The doc names the consequence, not the wrongness of the alternative.
+
+## Recommended deployment pattern (reference)
+
+See [reference consumer implementation in bardioc B1 substrate-b]
+(separate proposed doc) for a worked example. Briefly:
+
+- Three substrate-b instances, one per availability zone within a region
+- Each substrate-b is one Rust binary AND one Raft node
+- Lance dataset local to each instance (full dataset, not a shard)
+- Reads serve from local Lance with no cross-node coordination
+  (eventually-consistent) or from a Raft read-index round (linearizable)
+- Writes serve via Raft quorum to the local leader; replicated as Lance
+  fragment appends
+- Add a fourth+ instance only for geographic distribution or read-load
+  fanout, not for capacity
+
+This is the shape proven against Wikidata-scale workloads in the
+bardioc reference consumer. Operational complexity is significantly
+lower than the Cassandra-era equivalent at the same availability target.
\ No newline at end of file

From 0007c83a52542d114268a5c6961c55d93e500175 Mon Sep 17 00:00:00 2001
From: "jan (via bardioc)" <jan@exo.red>
Date: Wed, 3 Jun 2026 11:35:45 +0000
Subject: [PATCH 2/3] fix(codex-review): scope caveat + compaction honesty +
 amplification honesty

Three codex findings on PR #453:

P1: scope caveat - peer-Raft + Lance-local is an EXTERNAL architecture
pattern (bardioc B1 substrate-b), NOT a built-in lance-graph feature.
Adopters provide the Raft layer themselves (openraft / surreal-cluster
/ external TiKV); lance-graph provides the columnar storage +
DataFusion + encoding crates that MAKE the pattern cheap, not the
pattern itself. Added a scope banner after the TL;DR and a per-line
reminder on the Recommended deployment pattern section.

P2 compaction: Lance has compaction TOO
(DatasetOptimizer.compact_files for fragment layout optimization),
just qualitatively different from LSM tombstone-reclaim+run-merge.
Rewrote section 3 to distinguish LSM compaction from Lance file
compaction; operators should plan for the latter.

P2 replication amplification: 3 replicas of a 5GB dataset is 15GB
of total disk regardless of distribution shape. Replication
amplification does NOT disappear in availability-chosen clustering;
it does not compound with shard count (the architectural advantage).
Fixed the comparison table row and added an honesty paragraph.
---
 docs/CLUSTER_ASYMMETRY.md | 86 +++++++++++++++++++++++++++++++++------
 1 file changed, 74 insertions(+), 12 deletions(-)

diff --git a/docs/CLUSTER_ASYMMETRY.md b/docs/CLUSTER_ASYMMETRY.md
index 9b0fc81e..d36eb450 100644
--- a/docs/CLUSTER_ASYMMETRY.md
+++ b/docs/CLUSTER_ASYMMETRY.md
@@ -22,6 +22,27 @@ Don't import the Cassandra cluster operations playbook into a
 lance-graph deployment. The failure modes, the operational rhythms,
 the budgeting assumptions are all qualitatively different.
 
+> ### Scope: external architecture pattern, not a built-in lance-graph feature
+>
+> Per codex P1 review on PR #453: the deployment shapes described
+> below — particularly "peer-Raft + Lance-local-per-node" — describe
+> an EXTERNAL ARCHITECTURE PATTERN that adopters can build on top of
+> lance-graph, NOT a built-in lance-graph capability. Lance-graph
+> itself provides the columnar storage, the DataFusion query path,
+> the encoding crates, and the Rust API surface. The Raft layer, the
+> substrate binary, and the consensus-replication path are
+> downstream consumer code. Adopters implementing this pattern reach
+> for `openraft` (or `surreal-cluster` if their stack is
+> surrealdb-shaped) to provide the Raft layer; they would NOT
+> inherit one from lance-graph.
+>
+> The doc documents WHY this pattern works well WHEN built on
+> lance-graph (the append-only storage model, cheap anti-entropy,
+> per-node fit), not a feature lance-graph itself ships. Adopters
+> who only consume lance-graph's columnar + DataFusion path should
+> NOT assume their data is automatically replicated.
+
+
 ## The two cluster shapes, side by side
 
 | Concern | Capacity-forced (Cassandra+JG) | Availability-chosen (peer-Raft Lance) |
@@ -29,7 +50,7 @@ the budgeting assumptions are all qualitatively different.
 | Reason to cluster | Data does not fit on one node | Data fits on one node; replicate for HA + geo |
 | Each node holds | 1/N of the data (consistent-hash shards) | 100% of the data |
 | Hot-path reads | Coordinator pattern; fan-out to shard owners | Local; no cross-node hop |
-| Replication factor | 3-5 (storage AMPLIFICATION on top of sharding) | 3 (just for HA; no amplification) |
+| Replication factor + storage amplification | R=3-5 per replica; total cluster storage = N_shards × R (sharding COMPOUNDS replication amplification) | R=3 per replica; total cluster storage = R × dataset_size (same per-replica amplification; no sharding compound) |
 | Effective node count | 3N to 5N where N is shards needed for capacity | 3 (or more if geo distribution is required) |
 | Compaction | Yes (each node compacts SSTables on its own schedule; replication lag spikes during) | None (Lance is append-only; version log IS the truth) |
 | Cross-shard transactions | Hard; needs 2PC or careful avoidance | Not applicable (no shards) |
@@ -37,6 +58,17 @@ the budgeting assumptions are all qualitatively different.
 | Read latency | Depends on coordinator + slowest shard | Local-replica latency; predictable |
 | Operator burden | High (capacity planning, compaction scheduling, rebalancing) | Low (HA discipline only; no shard management) |
 
+
+**Storage amplification honesty (per codex P2 review on PR #453):**
+Both shapes have replication storage amplification — three replicas
+of a 5GB dataset is 15GB of total cluster disk regardless of
+distribution shape. The architectural advantage of availability-chosen
+clustering is NOT that amplification disappears (it doesn't); it's
+that amplification does NOT compound with shard count. Cassandra at
+RF=3 with 10 capacity shards consumes ~30× the single-replica
+storage; peer-Raft at R=3 consumes exactly 3×. Same per-replica
+amplification, no shard multiplier on top.
+
 ## Why lance-graph consumers fit on one node
 
 The encoding stack typical of lance-graph consumers compresses data
@@ -110,16 +142,38 @@ peers (for eventually-consistent reads) or with a single Raft read-index
 round (for linearizable reads). No fan-out, no aggregation, no
 slowest-shard lag.
 
-### 3. No compaction scheduling
-
-Cassandra clusters need their compaction schedule managed across nodes.
-Compaction is CPU + IO heavy and creates replication lag spikes; if
-too many nodes compact simultaneously, the cluster's effective
-replication factor temporarily drops.
-
-Lance has no compaction. The version log IS the truth. Garbage collection
-of old versions is local-only, doesn't interact with replication, doesn't
-require cluster-wide scheduling.
+### 3. Compaction is qualitatively different (lighter coordination)
+
+Cassandra-style LSM compaction rewrites SSTables to reclaim tombstones
+and merge sorted runs. It is CPU + IO heavy and creates replication
+lag spikes; if too many nodes compact simultaneously, the cluster's
+effective replication factor temporarily drops. Operators schedule
+compactions across nodes to avoid that.
+
+Lance has compaction too, but of a different shape:
+`DatasetOptimizer.compact_files` merges small fragments into larger
+ones for query layout optimization (many small appends produce many
+small fragments which slow scans; periodic file compaction restores
+good layout). It is NOT a tombstone-reclaim cycle — Lance is
+append-only at the version level, so there are no tombstones to
+reclaim in the LSM sense.
+
+The qualitative difference:
+
+- **LSM compaction** is a CORRECTNESS + SPACE concern (tombstones must
+  be reclaimed to bound storage; runs must merge for read performance);
+  coordination with replication is unavoidable.
+- **Lance file compaction** is a LAYOUT OPTIMIZATION concern (queries
+  get faster when fragments are larger; correctness is unaffected);
+  it can be scheduled independently per node, produces new fragments
+  that are themselves append-only, and does NOT block or interact
+  with consensus replication.
+
+So Lance compaction exists and operators should plan for it (Lance's
+table-maintenance docs describe `DatasetOptimizer.compact_files`).
+The operational burden is lower than Cassandra LSM compaction because
+the coordination requirements are weaker — but it is not zero. (Per
+codex P2 review on PR #453.)
 
 ### 4. Anti-entropy is cheap
 
@@ -204,11 +258,19 @@ partitioning; each underlying cluster remains availability-chosen.
 
 ## Recommended deployment pattern (reference)
 
+> **Reminder (per codex P1):** This pattern is the bardioc B1
+> substrate-b reference architecture, NOT a built-in lance-graph
+> feature. Adopters provide the Raft layer themselves (openraft /
+> surreal-cluster / external TiKV). Lance-graph contributes the
+> columnar storage + DataFusion + encoding crates that MAKE this
+> pattern cheap, not the pattern itself.
+
 See [reference consumer implementation in bardioc B1 substrate-b]
 (separate proposed doc) for a worked example. Briefly:
 
 - Three substrate-b instances, one per availability zone within a region
-- Each substrate-b is one Rust binary AND one Raft node
+- Each substrate-b is one Rust binary AND one Raft node (Raft impl
+  from openraft or surreal-cluster — NOT inherited from lance-graph)
 - Lance dataset local to each instance (full dataset, not a shard)
 - Reads serve from local Lance with no cross-node coordination
   (eventually-consistent) or from a Raft read-index round (linearizable)

From 8d8ad005e7508c89ae4bb6d4c3456aeae9f10763 Mon Sep 17 00:00:00 2001
From: "jan (via bardioc)" <jan@exo.red>
Date: Wed, 3 Jun 2026 11:37:18 +0000
Subject: [PATCH 3/3] fix(codex-review followup): also fix compaction overclaim
 in comparison table

The previous codex-review fix updated section 3 of operational
consequences but missed the parallel overclaim in the side-by-side
Capacity-forced vs Availability-chosen table row. Now both
occurrences distinguish LSM compaction from Lance file compaction
consistently.
---
 docs/CLUSTER_ASYMMETRY.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/CLUSTER_ASYMMETRY.md b/docs/CLUSTER_ASYMMETRY.md
index d36eb450..47483f0f 100644
--- a/docs/CLUSTER_ASYMMETRY.md
+++ b/docs/CLUSTER_ASYMMETRY.md
@@ -52,7 +52,7 @@ the budgeting assumptions are all qualitatively different.
 | Hot-path reads | Coordinator pattern; fan-out to shard owners | Local; no cross-node hop |
 | Replication factor + storage amplification | R=3-5 per replica; total cluster storage = N_shards × R (sharding COMPOUNDS replication amplification) | R=3 per replica; total cluster storage = R × dataset_size (same per-replica amplification; no sharding compound) |
 | Effective node count | 3N to 5N where N is shards needed for capacity | 3 (or more if geo distribution is required) |
-| Compaction | Yes (each node compacts SSTables on its own schedule; replication lag spikes during) | None (Lance is append-only; version log IS the truth) |
+| Compaction shape | LSM SSTable: tombstone-reclaim + run-merge; coordinates with replication; lag spikes during | Lance file (`DatasetOptimizer.compact_files`): merge small fragments for layout; independent of consensus; no replication-lag interaction |
 | Cross-shard transactions | Hard; needs 2PC or careful avoidance | Not applicable (no shards) |
 | Anti-entropy | Merkle-tree comparison; expensive | Manifest hash compare; O(1) decision |
 | Read latency | Depends on coordinator + slowest shard | Local-replica latency; predictable |