docs: cluster asymmetry — capacity-forced vs availability-chosen clustering #453
Names why lance-graph consumers cluster qualitatively differently than Cassandra-era stacks: data fits per-node (compression cascade + vort/vart trie); peer-Raft replicates the FULL dataset to each node; three-node clusters are production, not toy; no compaction; no rebalancing; no cross-node fan-out for reads. Pairs with the append-only-raft-dovetail PR.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ef93e287a7
> - Writes serve via Raft quorum to the local leader; replicated as Lance
>   fragment appends
Do not document an unimplemented Raft deployment
For users of this repository, this replication guarantee is not implementable: the current Lance integration only resolves and opens datasets in crates/lance-graph/src/query.rs:779-852, and a repo-wide search finds no Raft dependency, substrate-b binary, voter configuration, or write-replication path. Presenting quorum-replicated writes as the recommended lance-graph deployment can cause adopters to assume their data is highly available when lance-graph itself provides no such capability; mark this as an external/proposed bardioc architecture rather than a supported deployment pattern.
> too many nodes compact simultaneously, the cluster's effective
> replication factor temporarily drops.
>
> Lance has no compaction. The version log IS the truth. Garbage collection
Do not tell operators that Lance has no compaction
For append-heavy deployments—especially the fragment appends recommended later in this document—this is false. Lance provides DatasetOptimizer.compact_files because many small appends create small fragments and a poor layout that slows queries; the official table-maintenance documentation explicitly describes this operation. Telling operators that compaction does not exist can leave production datasets unoptimized, so distinguish Lance file compaction from LSM-tree compaction instead of saying there is none.
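To make the small-fragment concern concrete, here is a minimal Python sketch of what file compaction does conceptually: it merges many small append fragments into fewer, larger ones. It illustrates the idea only and is not Lance's actual algorithm. `compact_fragments` and `target_rows` are hypothetical names introduced here, and fragments are modelled as bare row counts.

```python
def compact_fragments(fragment_rows, target_rows):
    """Greedily merge adjacent fragments until each merged one reaches target_rows.

    fragment_rows: list of row counts, one per small append fragment (assumed model).
    target_rows:   desired minimum rows per compacted fragment (assumed knob).
    """
    compacted = []
    current = 0
    for rows in fragment_rows:
        current += rows
        if current >= target_rows:
            # The running group is large enough: emit it as one fragment.
            compacted.append(current)
            current = 0
    if current:
        # Leftover rows form a final, smaller fragment.
        compacted.append(current)
    return compacted


if __name__ == "__main__":
    # 100 small appends of 10 rows each, compacted toward 250-row fragments.
    small_appends = [10] * 100
    print(len(small_appends), "fragments before")
    print(len(compact_fragments(small_appends, 250)), "fragments after")
```

No tombstones are reclaimed and no sorted runs are merged here, which is the qualitative difference from LSM-tree compaction that the comment asks the doc to spell out.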
| | Capacity-forced | Availability-chosen |
|---|---|---|
| Reason to cluster | Data does not fit on one node | Data fits on one node; replicate for HA + geo |
| Each node holds | 1/N of the data (consistent-hash shards) | 100% of the data |
| Hot-path reads | Coordinator pattern; fan-out to shard owners | Local; no cross-node hop |
| Replication factor | 3-5 (storage AMPLIFICATION on top of sharding) | 3 (just for HA; no amplification) |
Account for full-replica storage amplification
When each of three nodes holds 100% of the dataset, the cluster consumes roughly three times the single-dataset storage, so describing replication factor 3 as having “no amplification” gives readers an incorrect disk-capacity model. The availability-chosen design may avoid additional shard-management overhead, but it still has replication storage amplification and should be budgeted as such.
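The disk-capacity model behind this comment can be written down as a short Python sketch. The function names are hypothetical, and the even spread of replicated shards in the sharded case is a simplifying assumption rather than a description of any particular system.

```python
def total_disk_gb(dataset_gb, replication_factor):
    """Cluster-wide disk: every logical byte is stored replication_factor times."""
    return dataset_gb * replication_factor


def per_node_disk_gb(dataset_gb, replication_factor, nodes, full_replica):
    """Disk needed on each node under the two clustering shapes."""
    if full_replica:
        # Availability-chosen: each node holds 100% of the dataset.
        return dataset_gb
    # Capacity-forced: replicated shards spread evenly over the nodes (assumption).
    return dataset_gb * replication_factor / nodes


if __name__ == "__main__":
    # Three full replicas of a 5 GB dataset.
    print(total_disk_gb(5, 3), "GB cluster-wide")
    print(per_node_disk_gb(5, 3, 3, full_replica=True), "GB per node")
```

For three full replicas of a 5 GB dataset, `total_disk_gb(5, 3)` gives 15: the replication amplification is still there and has to be budgeted, even though it is not multiplied by shard-management overhead.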
🧹 Nitpick comments (2)
docs/CLUSTER_ASYMMETRY.md (2)
130-133: ⚡ Quick win: Clarify the O(1) anti-entropy claim.
The claim that anti-entropy identification is "O(1)" depends on the manifest hash comparison being constant-time. This is accurate if the manifest itself is a bounded-size data structure (which it should be), but it may be worth clarifying that the O(1) refers to the identification decision, not necessarily the streaming cost of catching up (which is mentioned separately as "bounded by the actual divergence").
Consider adding a brief clarification that O(1) refers specifically to the "is-in-sync?" decision, independent of dataset size.
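The distinction between the identification step and the streaming step can be sketched in Python as follows. The manifest layout (a dict with `version` and `fragments`) and all function names are assumptions made for illustration; they are not taken from Lance or from the doc under review.

```python
import hashlib
import json


def manifest_digest(manifest):
    """Hash a canonical JSON encoding of the manifest (hypothetical layout)."""
    encoded = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def in_sync(local_digest, peer_digest):
    """Identification step: one comparison of two fixed-size digests."""
    return local_digest == peer_digest


def missing_fragments(local, peer):
    """Streaming step: fragments the peer has that the local node lacks."""
    have = set(local["fragments"])
    return [frag for frag in peer["fragments"] if frag not in have]


if __name__ == "__main__":
    leader = {"version": 7, "fragments": ["f1", "f2", "f3"]}
    follower = {"version": 6, "fragments": ["f1", "f2"]}
    if not in_sync(manifest_digest(follower), manifest_digest(leader)):
        # Only the divergence is shipped, not the whole dataset.
        print("ship:", missing_fragments(follower, leader))
```

In this sketch, computing a digest costs time proportional to the manifest size, so the constant-time reading holds only if each node keeps its digest precomputed and the manifest stays bounded, which matches the caveat in the comment above.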
📝 Proposed clarification
     Lance peer-Raft anti-entropy: compare the manifest hash between nodes. If
     equal, sync. If not, ship missing fragments + the new manifest. The
    -IDENTIFICATION step is O(1). The streaming step is bounded by the actual
    -divergence, not by the dataset size.
    +IDENTIFICATION step is O(1) — a single hash comparison, independent of
    +dataset size. The streaming step is bounded by the actual divergence,
    +not by the total dataset size.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/CLUSTER_ASYMMETRY.md` around lines 130 - 133, Clarify that the "O(1)" remark refers specifically to the IDENTIFICATION step (the constant-time manifest hash comparison) rather than the overall anti-entropy process; update the paragraph mentioning "manifest hash" and "IDENTIFICATION" to state explicitly that O(1) describes the is-in-sync decision cost assuming a bounded-size manifest, and that streaming/catch-up costs remain proportional to the actual divergence (not the dataset size).
193-197: ⚡ Quick win: Consider strengthening the caveat about varying compression ratios.
This non-claim appropriately acknowledges that "specific numbers vary" across different consumers. However, the earlier sections make very specific claims about byte sizes and compression ratios that appear universal. Consider whether the doc should more explicitly caveat the specific numbers throughout (e.g., "in the bardioc B1 reference implementation" or "typical values") or strengthen this non-claim to be a more prominent disclaimer earlier in the document.
This is consistent with the PR objectives question about "whether to keep specific numerical compression claims or soften them pending benchmarks."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/CLUSTER_ASYMMETRY.md` around lines 193 - 197, Strengthen the caveat about compression variability by updating the paragraph that currently starts "All Lance + Raft stacks ship the same compression" to explicitly flag that the byte-size and compression-ratio examples are from the bardioc B1 reference implementation and may not be representative; reword to something like "In the bardioc B1 reference implementation (typical values)" or prepend a prominent disclaimer earlier in the doc (intro/summary) noting that specific numbers are implementation-dependent and subject to benchmark variation—ensure the existing mention of "bardioc B1 reference consumer" remains and is made more prominent so readers understand the numbers are not universal.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: 5c87ec78-b09b-40c7-b6af-d9e64a922253
📒 Files selected for processing (1)
docs/CLUSTER_ASYMMETRY.md
…honesty

Three codex findings on PR #453:

P1: scope caveat - peer-Raft + Lance-local is an EXTERNAL architecture pattern (bardioc B1 substrate-b), NOT a built-in lance-graph feature. Adopters provide the Raft layer themselves (openraft / surreal-cluster / external TiKV); lance-graph provides the columnar storage + DataFusion + encoding crates that MAKE the pattern cheap, not the pattern itself. Added a scope banner after the TL;DR and a per-line reminder on the Recommended deployment pattern section.

P2 compaction: Lance has compaction TOO (DatasetOptimizer.compact_files for fragment layout optimization), just qualitatively different from LSM tombstone-reclaim+run-merge. Rewrote section 3 to distinguish LSM compaction from Lance file compaction; operators should plan for the latter.

P2 replication amplification: 3 replicas of a 5GB dataset is 15GB of total disk regardless of distribution shape. Replication amplification does NOT disappear in availability-chosen clustering; it does not compound with shard count (the architectural advantage). Fixed the comparison table row and added an honesty paragraph.
…on table

The previous codex-review fix updated section 3 of operational consequences but missed the parallel overclaim in the side-by-side Capacity-forced vs Availability-chosen table row. Now both occurrences distinguish LSM compaction from Lance file compaction consistently.
Summary
Adopters of lance-graph routinely import the Cassandra/JanusGraph operations playbook when planning a replicated deployment. They expect to need a large cluster (5-10+ nodes), to spread the data via consistent hashing, to schedule compactions, to budget for cross-node coordinator queries, and to monitor anti-entropy. As a result, they overprovision and over-operate.
This doc proposes capturing the architectural asymmetry that makes lance-graph cluster operations qualitatively different: OLD stacks (Cassandra+JG, ElasticSearch, etc.) cluster because they have to — the data does not fit on one node, so they shard. Lance-graph consumers cluster because they choose to — the data does fit on one node, and clustering serves availability, geography, and load distribution.
Same word ("cluster"), opposite cost structure. Naming the asymmetry upstream prevents costly mis-deployments.
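The distinction can be stated as a small decision rule. The Python sketch below is illustrative only: the function names and the single `node_capacity_gb` threshold are assumptions, and real sizing would also account for headroom, indexes, and growth.

```python
def clustering_mode(dataset_gb, node_capacity_gb):
    """Classify why a deployment clusters, based on whether the data fits on one node."""
    if dataset_gb <= node_capacity_gb:
        # Data fits per node: clustering is a choice (HA, geography, load).
        return "availability-chosen"
    # Data does not fit: sharding across nodes is forced.
    return "capacity-forced"


def fraction_per_node(mode, nodes):
    """Share of the dataset each node holds, before replication is counted."""
    if mode == "availability-chosen":
        return 1.0  # every node holds 100% of the data
    return 1.0 / nodes  # consistent-hash shards: 1/N per node


if __name__ == "__main__":
    # Hypothetical sizes: a 5 GB dataset on nodes with 64 GB of usable disk.
    mode = clustering_mode(5, 64)
    print(mode, fraction_per_node(mode, 3))
```

A node count of three is sufficient in the first branch because no node depends on another to answer a read; in the second branch the node count is dictated by the data volume.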
Doc location
This PR places the doc at `docs/CLUSTER_ASYMMETRY.md` to match the existing ALL_CAPS convention in `docs/`. Happy to relocate to `.claude/knowledge/` or `docs/architecture/<lowercase-kebab>.md` per maintainer preference; see Asks below.

Provenance

- `AdaWorldAPI/bardioc` PR docs: add hot/cold path architecture and documentation drift audit #15 conversation thread, 2026-06-03. User insight: "clustering in Cassandra / Janusgraph is most probably due to poor performance and data; if you compare that we hold Wikidata locally and Cassadra Janusgraph would need multiple instances".
- `lance-graph-05-lance-append-raft-dovetail.md` (same conversation thread) captures the storage-vs-consensus dovetail property; both docs target the same audience and are best read together.
- `ROADMAP_RUST_PRIMARY_HEADSTONE.md` § "Why OLD stack clusters vs why NEW stack clusters (cluster ≠ cluster)": the in-bardioc canonical statement of this asymmetry.
- `lance-graph/crates/{bgz17, highheelbgz, bgz-tensor/src/hhtl_d.rs}`: the encoding cascade that the doc cites. The numerical compression claims are derived from the existing per-row footprints documented in those crates' moduledocs.
- `AdaWorldAPI/Lance-graph` PR docs+probe: the agnostic lazy world-spine — addressing vision, locality probe (PASS), markov_soa→AriGraph, EW64-as-AriGraph #444: the 98.6% intra-family locality finding used in the HHTL-banked CAM-PQ paragraph.
- `AdaWorldAPI/bardioc` `.claude/TECH_DEBT.md` collapse Module 6: #[track_caller] error macros for zero-cost location capture #2, where the radix-trie was first established as the time-axis above HHTL identity space.

Asks

- `lance-graph-05-lance-append-raft-dovetail.md`. Both ship independently; either can be accepted without the other.
- `docs/architecture/cluster-asymmetry.md` (project-public) or `.claude/knowledge/cluster-asymmetry.md` (Claude-Code knowledge layer): same recommendation as the sister doc (project-public so non-Claude-Code adopters see it).
- `append-only-raft-dovetail.md` (companion PR) before this one because the consensus-tax discussion here references that doc. Or land both in the same review pass.

This PR is doc-only. Zero code changes, zero tests, no API impact. Review focus is on (a) whether the architectural property described is welcome content for `Lance-graph`, (b) the doc location, and (c) the honest-scope sections.

Summary by CodeRabbit