forked from lance-format/lance-graph
-
Notifications
You must be signed in to change notification settings - Fork 0
docs: cluster asymmetry — capacity-forced vs availability-chosen clustering #453
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
3 commits
Select commit
Hold shift + click to select a range
ef93e28
docs: cluster asymmetry — capacity-forced vs availability-chosen
AdaWorldAPI 0007c83
fix(codex-review): scope caveat + compaction honesty + amplification …
AdaWorldAPI 8d8ad00
fix(codex-review followup): also fix compaction overclaim in comparis…
AdaWorldAPI File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,284 @@ | ||
| # Cluster ≠ cluster: how lance-graph clusters differ from Cassandra-era clusters | ||
|
|
||
| ## TL;DR | ||
|
|
||
| There are two distinct reasons to cluster a distributed system: | ||
|
|
||
| 1. **Capacity-forced clustering**: the data does not fit on one node. | ||
| You shard. Cross-node queries are required. (Cassandra, JanusGraph, | ||
| classic Elasticsearch deployments.) | ||
|
|
||
| 2. **Availability-chosen clustering**: the data fits on one node. | ||
| You replicate it across N nodes for HA + geo + load distribution. | ||
| Each node holds the full dataset. Cross-node queries are NOT | ||
| required for the hot path. (CockroachDB, etcd, peer-Raft Lance.) | ||
|
|
||
| Lance-graph consumers are virtually never capacity-forced (the | ||
| encoding cascade + columnar layout + radix-trie deduplication | ||
| collapse storage by 1-3 orders of magnitude vs LSM-tree wide-column | ||
| stores). Clustering is always availability-chosen. | ||
|
|
||
| Don't import the Cassandra cluster operations playbook into a | ||
| lance-graph deployment. The failure modes, the operational rhythms, | ||
| the budgeting assumptions are all qualitatively different. | ||
|
|
||
| > ### Scope: external architecture pattern, not a built-in lance-graph feature | ||
| > | ||
| > Per codex P1 review on PR #453: the deployment shapes described | ||
| > below — particularly "peer-Raft + Lance-local-per-node" — describe | ||
| > an EXTERNAL ARCHITECTURE PATTERN that adopters can build on top of | ||
| > lance-graph, NOT a built-in lance-graph capability. Lance-graph | ||
| > itself provides the columnar storage, the DataFusion query path, | ||
| > the encoding crates, and the Rust API surface. The Raft layer, the | ||
| > substrate binary, and the consensus-replication path are | ||
| > downstream consumer code. Adopters implementing this pattern reach | ||
| > for `openraft` (or `surreal-cluster` if their stack is | ||
| > surrealdb-shaped) to provide the Raft layer; they would NOT | ||
| > inherit one from lance-graph. | ||
| > | ||
| > The doc documents WHY this pattern works well WHEN built on | ||
| > lance-graph (the append-only storage model, cheap anti-entropy, | ||
| > per-node fit), not a feature lance-graph itself ships. Adopters | ||
| > who only consume lance-graph's columnar + DataFusion path should | ||
| > NOT assume their data is automatically replicated. | ||
|
|
||
|
|
||
| ## The two cluster shapes, side by side | ||
|
|
||
| | Concern | Capacity-forced (Cassandra+JG) | Availability-chosen (peer-Raft Lance) | | ||
| |---|---|---| | ||
| | Reason to cluster | Data does not fit on one node | Data fits on one node; replicate for HA + geo | | ||
| | Each node holds | 1/N of the data (consistent-hash shards) | 100% of the data | | ||
| | Hot-path reads | Coordinator pattern; fan-out to shard owners | Local; no cross-node hop | | ||
| | Replication factor + storage amplification | R=3-5 per replica; total cluster storage = N_shards × R (sharding COMPOUNDS replication amplification) | R=3 per replica; total cluster storage = R × dataset_size (same per-replica amplification; no sharding compound) | | ||
| | Effective node count | 3N to 5N where N is shards needed for capacity | 3 (or more if geo distribution is required) | | ||
| | Compaction shape | LSM SSTable: tombstone-reclaim + run-merge; coordinates with replication; lag spikes during | Lance file (`DatasetOptimizer.compact_files`): merge small fragments for layout; independent of consensus; no replication-lag interaction | | ||
| | Cross-shard transactions | Hard; needs 2PC or careful avoidance | Not applicable (no shards) | | ||
| | Anti-entropy | Merkle-tree comparison; expensive | Manifest hash compare; O(1) decision | | ||
| | Read latency | Depends on coordinator + slowest shard | Local-replica latency; predictable | | ||
| | Operator burden | High (capacity planning, compaction scheduling, rebalancing) | Low (HA discipline only; no shard management) | | ||
|
|
||
|
|
||
| **Storage amplification honesty (per codex P2 review on PR #453):** | ||
| Both shapes have replication storage amplification — three replicas | ||
| of a 5GB dataset is 15GB of total cluster disk regardless of | ||
| distribution shape. The architectural advantage of availability-chosen | ||
| clustering is NOT that amplification disappears (it doesn't); it's | ||
| that amplification does NOT compound with shard count. Cassandra at | ||
| RF=3 with 10 capacity shards consumes ~30× the single-replica | ||
| storage; peer-Raft at R=3 consumes exactly 3×. Same per-replica | ||
| amplification, no shard multiplier on top. | ||
|
|
||
| ## Why lance-graph consumers fit on one node | ||
|
|
||
| The encoding stack typical of lance-graph consumers compresses data | ||
| by 1-3 orders of magnitude vs LSM-tree wide-column representations: | ||
|
|
||
| - **highheelbgz SpiralAddress**: 3 integers (12 bytes or 6 bytes | ||
| for u16 variants) representing a φ-spiral walking address — NOT | ||
| a copy of the weight vector, an address into a deterministic | ||
| spiral. Per-row footprint approaches the information-theoretic | ||
| minimum for the addressing dimension. | ||
|
|
||
| - **bgz-hhtl-d Slot D / Slot V**: 4 bytes total per row — 2-byte | ||
| Slot D (HEEL basin 2 bits / HIP family 4 bits / TWIG centroid 8 | ||
| bits / polarity 1 bit / reserved 1 bit) + 2-byte Slot V (BF16 | ||
| residual magnitude). The hierarchical addressing is in the bit | ||
| layout; the residual captures the per-row delta from the centroid. | ||
|
|
||
| - **bgz17 palette256**: 256-archetype compose table for multi-hop | ||
| semantic relations. 8 bits per archetype reference; multi-hop | ||
| composition in O(1) via the compose table. | ||
|
|
||
| - **CAM-PQ leaf vectors**: 6 bytes per row for the leaf-exact-match | ||
| vector projection. HHTL-banked variant (16 family subcodebooks) | ||
| reduces ANN scan space by ~16× under empirical intra-family | ||
| locality (98.6% per the `lance-graph` PR #444 probe). | ||
|
|
||
| - **vort/vart adaptive radix trie**: structural deduplication of | ||
| shared HHTL prefixes. The heel + hip nibbles common across many | ||
| entities are stored ONCE per path segment, not N times. Adaptive | ||
| Radix Tree shape: O(k) lookup, prefix-sharing storage. The same | ||
| structure also serves as the time-axis index over Lance | ||
| `versions()` for cold-path queries. | ||
|
|
||
| Concrete example: Wikidata (~115M entities). In Cassandra+JG, the | ||
| indexed graph form is multi-TB with replication factor 3 → multi-TB | ||
| × 3 across the cluster. In lance-graph with the above encoding stack, | ||
| the same corpus compresses to low single-digit GB total (including | ||
| indexes and the Lance version log). Fits on a modest single node. | ||
| Each peer replica holds the FULL dataset. | ||
|
|
||
| This is not an academic claim; it's the deployment shape proven in | ||
| the `AdaWorldAPI/bardioc` B1 substrate-b reference implementation. | ||
|
|
||
| ## Knock-on consequences of availability-chosen clustering | ||
|
|
||
| ### 1. Three-node deployment is the starting recommendation, not the toy | ||
|
|
||
| For Cassandra-era thinkers, three-node clusters are toys. Production | ||
| Cassandra deployments often run 12-24 nodes. The reasoning: each node | ||
| holds 1/N of the data; you need many N to fit the corpus; replication | ||
| factor 3 multiplies again. | ||
|
|
||
| For peer-Raft Lance: three-node clusters are production. Each node | ||
| holds the full dataset. Three replicas give you majority quorum (one | ||
| failure tolerated). More replicas are added for geographic distribution | ||
| or read-load fanout, not for capacity. | ||
|
|
||
| Default starting deployment: 3 substrate-b instances, one per AZ in | ||
| the same region, peer-Raft replicated. | ||
|
|
||
| ### 2. No coordinator pattern; no fan-out lag | ||
|
|
||
| Cassandra reads route through a coordinator node which fans out to | ||
| the shard owners and aggregates the response. The query latency is | ||
| bounded below by the slowest shard's response time. Hot shards drag | ||
| the cluster. | ||
|
|
||
| Peer-Raft Lance reads are LOCAL. The client connects to any node; | ||
| the node has the full dataset; it returns the answer without contacting | ||
| peers (for eventually-consistent reads) or with a single Raft read-index | ||
| round (for linearizable reads). No fan-out, no aggregation, no | ||
| slowest-shard lag. | ||
|
|
||
| ### 3. Compaction is qualitatively different (lighter coordination) | ||
|
|
||
| Cassandra-style LSM compaction rewrites SSTables to reclaim tombstones | ||
| and merge sorted runs. It is CPU + IO heavy and creates replication | ||
| lag spikes; if too many nodes compact simultaneously, the cluster's | ||
| effective replication factor temporarily drops. Operators schedule | ||
| compactions across nodes to avoid that. | ||
|
|
||
| Lance has compaction too, but of a different shape: | ||
| `DatasetOptimizer.compact_files` merges small fragments into larger | ||
| ones for query layout optimization (many small appends produce many | ||
| small fragments which slow scans; periodic file compaction restores | ||
| good layout). It is NOT a tombstone-reclaim cycle — Lance is | ||
| append-only at the version level, so there are no tombstones to | ||
| reclaim in the LSM sense. | ||
|
|
||
| The qualitative difference: | ||
|
|
||
| - **LSM compaction** is a CORRECTNESS + SPACE concern (tombstones must | ||
| be reclaimed to bound storage; runs must merge for read performance); | ||
| coordination with replication is unavoidable. | ||
| - **Lance file compaction** is a LAYOUT OPTIMIZATION concern (queries | ||
| get faster when fragments are larger; correctness is unaffected); | ||
| it can be scheduled independently per node, produces new fragments | ||
| that are themselves append-only, and does NOT block or interact | ||
| with consensus replication. | ||
|
|
||
| So Lance compaction exists and operators should plan for it (Lance's | ||
| table-maintenance docs describe `DatasetOptimizer.compact_files`). | ||
| The operational burden is lower than Cassandra LSM compaction because | ||
| the coordination requirements are weaker — but it is not zero. (Per | ||
| codex P2 review on PR #453.) | ||
|
|
||
| ### 4. Anti-entropy is cheap | ||
|
|
||
| Cassandra anti-entropy (catching up a lagging replica) compares SSTable | ||
| Merkle trees node-by-node and streams the diffs. This is expensive and | ||
| creates load on both sides. | ||
|
|
||
| Lance peer-Raft anti-entropy: compare the manifest hash between nodes. | ||
| If equal, sync. If not, ship missing fragments + the new manifest. The | ||
| IDENTIFICATION step is O(1). The streaming step is bounded by the actual | ||
| divergence, not by the dataset size. | ||
|
|
||
| ### 5. Rebalancing is not a thing | ||
|
|
||
| Cassandra rebalancing (when adding or removing nodes) requires data | ||
| movement. Token ranges shift; data streams from old owners to new | ||
| owners; the cluster operates in a degraded state during the rebalance. | ||
|
|
||
| Peer-Raft Lance: adding a node = bring up a new substrate-b instance, | ||
| let it catch up via Raft, mark it a voter. No data movement needed | ||
| beyond the catch-up (and the catch-up is the same wire pattern as | ||
| the per-write replication — no special "rebalance" mode). Removing | ||
| a node: stop the substrate-b, mark it non-voter, retire it. No data | ||
| movement. | ||
|
|
||
| ## Knock-on consequences for the Raft consensus tax | ||
|
|
||
| Raft consensus IS on the per-request budget for writes and linearizable | ||
| reads. This is irreducible for the distributed-OLTP property. (See | ||
| the companion doc `append-only-raft-dovetail.md` for why this lands | ||
| lighter in Lance than in LSM-based systems.) | ||
|
|
||
| In an availability-chosen cluster, the consensus tax lands EVEN lighter | ||
| than in a capacity-forced cluster: | ||
|
|
||
| - **Smaller per-node datasets**: Raft logs are smaller; commits propagate | ||
| faster | ||
| - **Fewer replicas needed**: 3 substrate-b instances vs 12-24+ | ||
| Cassandra+JG nodes → less coordination overhead per write | ||
| - **No cross-node fan-out**: linearizable read-index can be served by | ||
| the local leader of the relevant Raft group; doesn't require remote | ||
| round-trip when the local instance IS the leader | ||
|
|
||
| ## When you actually DO need capacity-forced sharding | ||
|
|
||
| Rare for lance-graph consumers, but possible: | ||
|
|
||
| - Corpora dramatically larger than Wikidata (multi-billion entities or | ||
| multi-PB raw data) | ||
| - Workloads with hot-key distributions that exceed a single node's | ||
| IO bandwidth | ||
| - Specific compliance requirements that mandate physical data isolation | ||
| per tenant | ||
|
|
||
| In those cases, the appropriate response is application-level sharding | ||
| or tenant-level partitioning — NOT a Cassandra-style consistent-hash | ||
| ring. Each shard or tenant gets its own peer-Raft + Lance cluster. | ||
| The capacity dimension is solved by horizontal application-layer | ||
| partitioning; each underlying cluster remains availability-chosen. | ||
|
|
||
| ## What this doc does NOT claim | ||
|
|
||
| - **Single-node deployments are sufficient for production.** They're | ||
| not. HA requires multiple replicas (or accepted downtime during | ||
| failures). Single-node is a development / staging shape. | ||
|
|
||
| - **Three-node is universally optimal.** Geographic distribution | ||
| may require more (one replica per region). Specific availability | ||
| targets may justify more replicas. | ||
|
|
||
| - **All Lance + Raft stacks ship the same compression.** The specific | ||
| encoding cascade described above is from the `bardioc` B1 reference | ||
| consumer; other consumers will have different per-row footprints. | ||
| The qualitative property (orders-of-magnitude vs LSM wide-column) | ||
| remains, but specific numbers vary. | ||
|
|
||
| - **Cassandra+JG choose their shape wrongly.** The Cassandra design | ||
| is correct FOR THE STORAGE MODEL IT HAS. The architectural choice | ||
| that produces a different deployment shape is the choice to use | ||
| Lance + Raft, with append-only storage and columnar compression. | ||
| The doc names the consequence, not the wrongness of the alternative. | ||
|
|
||
| ## Recommended deployment pattern (reference) | ||
|
|
||
| > **Reminder (per codex P1):** This pattern is the bardioc B1 | ||
| > substrate-b reference architecture, NOT a built-in lance-graph | ||
| > feature. Adopters provide the Raft layer themselves (openraft / | ||
| > surreal-cluster / external TiKV). Lance-graph contributes the | ||
| > columnar storage + DataFusion + encoding crates that MAKE this | ||
| > pattern cheap, not the pattern itself. | ||
|
|
||
| See [reference consumer implementation in bardioc B1 substrate-b] | ||
| (separate proposed doc) for a worked example. Briefly: | ||
|
|
||
| - Three substrate-b instances, one per availability zone within a region | ||
| - Each substrate-b is one Rust binary AND one Raft node (Raft impl | ||
| from openraft or surreal-cluster — NOT inherited from lance-graph) | ||
| - Lance dataset local to each instance (full dataset, not a shard) | ||
| - Reads serve from local Lance with no cross-node coordination | ||
| (eventually-consistent) or from a Raft read-index round (linearizable) | ||
| - Writes serve via Raft quorum to the local leader; replicated as Lance | ||
| fragment appends | ||
| - Add a fourth+ instance only for geographic distribution or read-load | ||
| fanout, not for capacity | ||
|
|
||
| This is the shape proven against Wikidata-scale workloads in the | ||
| bardioc reference consumer. Operational complexity is significantly | ||
| lower than the Cassandra-era equivalent at the same availability target. | ||
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For users of this repository, this replication guarantee is not implementable: the current Lance integration only resolves and opens datasets in
crates/lance-graph/src/query.rs:779-852, and a repo-wide search finds no Raft dependency, substrate-b binary, voter configuration, or write-replication path. Presenting quorum-replicated writes as the recommended lance-graph deployment can cause adopters to assume their data is highly available when lance-graph itself provides no such capability; mark this as an external/proposed bardioc architecture rather than a supported deployment pattern.Useful? React with 👍 / 👎.