diff --git a/.claude/board/AGENT_LOG.md b/.claude/board/AGENT_LOG.md index 77002a15..bd12da88 100644 --- a/.claude/board/AGENT_LOG.md +++ b/.claude/board/AGENT_LOG.md @@ -1,3 +1,19 @@ +## [Main thread / Opus + W1/W2 wave] world-spine vision + probe wave + markov_soa SoC + EW64-as-AriGraph + +**Branch:** claude/jolly-cori-clnf9-worldspine (local, 21 commits ahead of origin/main) | **Spans:** the agnostic-lazy-world-spine + delta-card integration map vision docs; the W1+W2 autoattended wave; the markov_soa SoC re-home; the EW64-as-AriGraph note; the locality probe RUN. + +**Cargo:** locality probe RUN on real ontologies → **PASS** (locality 98.6%, max fan-out 3 ≤16, Q=0.325); jc 60/60 tests green, probe clippy-clean (pre-existing jc lints elsewhere untouched); deepnsm 89/4/8/1 green after markov_soa removal; contract soa_view 3/3 green. AriGraph `markov_soa` = **unverified-offline** (lance-graph core's lance/datafusion/arrow don't fetch in the sandbox). + +**Outcome (autoattended, auto-resolved):** +- **Vision docs** (knowledge/): `agnostic-lazy-world-spine.md` + `delta-card-addressing-integration-map.md` + `owl-dolce-hhtl-compartments-aerial-fed.md` + `splat-codebook-aerial-wikidata-compression.md` — the converged "inherited nothingness" addressing design (partition-as-address, 27-bit floor, sparse radix, I/P/B-over-Lance, RISC compose-not-materialize, frozen-ISA). +- **W1 (Plan wave worker):** `.claude/plans/wikidata-lazy-spine-hydration-v1.md` (9 D-LWS D-ids); flagged R1 (EW64 not a code symbol), R2 (Lance versioning is dataset-level VersionedGraph not fragment), R3 (CLAM is a probe not a clusterer) — all reconciled in the findings. +- **W2 (probe wave worker):** `jc/examples/ontology_locality_probe.rs` (941 LOC, hand-rolled TTL scan, reuses splat_louvain machinery) — harvested + RUN: **the addressing-locality CONJECTURE → FINDING on real ontologies** (DOLCE-Ultralite/schema.org/Odoo/PROV-O/QUDT/OWL-Time; ~10³ classes, NOT Wikidata). 
+- **markov_soa SoC arc:** authored in deepnsm (e0a5049), then **moved to AriGraph** (`lance-graph::graph::arigraph::markov_soa`, 9a5f54c) + made **vocabulary-agnostic** (opaque `SpoRanks{u16}`, injected `Fn(u16,u16)->u8` = AriGraph's own cam_pq) + corrected framing (cc24f02: markov_soa IS AriGraph cold→hot; language/COCA stays UPSTREAM in deepnsm, never reaches the hot graph — the GoBD-with-Rumi error). deepnsm copy deleted. +- **EW64 note** (679e61e): `MailboxSoaView` doc — EpisodicWitness64 = AriGraph in the mailbox SoA view (the particle, cold→hot); deferred accessor, EW64 still 0 code symbols. +- **3 governing findings** on the board: the three-Markovs taxonomy (#1 chain / #2 hybrid-dark-horse / #3 pray) + P1→P2→P3 ordering; the VSA substrate decision (32k SPO-W = substrate, VSA = fuzzy proposer/priming); the EW64 reactive-seam (Lance-update=witness-pointer=Surreal-kanban-subscription). NOT pushed — awaiting push/PR decision (autoattended consolidation done). + +--- + ## [Main thread / Opus] D-ARM-14 Phase 2 — rebased onto post-#442 main + swapped inline nibble → real contract::hhtl::NiblePath **Branch:** claude/jolly-cori-clnf9-darm14-p2 (rebased onto main 415971a, #442 merged) | **Files:** `tests/wikidata_landing.rs` (inline `np_*` helpers + inline FieldMask union → real `NiblePath::{root,child,basin,is_ancestor_of,depth,packed}` + `FieldMask::inherit`), STATUS_BOARD (D-ARM-14 row: swap done). diff --git a/.claude/board/EPIPHANIES.md b/.claude/board/EPIPHANIES.md index 0932055e..75ba4f3f 100644 --- a/.claude/board/EPIPHANIES.md +++ b/.claude/board/EPIPHANIES.md @@ -1,3 +1,111 @@ +## 2026-05-31 — FINDING (PROBE RESULT, measured): ontology partition-locality SURVIVES on real ontologies — locality 98.6%, max fan-out 3 (<=16), Q=0.325 ⇒ 16-bit local refs + <=16 family frontier are REAL (on real data, NOT yet Wikidata) + +**Status:** FINDING (measured, not asserted). 
Probe `crates/jc/examples/ontology_locality_probe.rs` run on the on-disk ontologies (DOLCE-Ultralite, schema.org, Odoo, PROV-O, QUDT, OWL-Time) — the falsifier for the delta-card/inherited-nothingness addressing claim (probe #1 of `delta-card-addressing-integration-map.md`). PASS. + +**Measured numbers (1170 classes, 1224 subClassOf edges, 33 top-basins):** +- **LOCALITY = 98.61%** (1207/1224 edges intra-basin) — the map's "~90% local" claim survives, and the measured value exceeds it. +- **FAN-OUT max = 3** (≤16 ✓); histogram: 1121 classes have exactly 1 parent-basin, 15 have 2, 1 has 3, 33 are roots. ⇒ no class needs more than 3 distinct family pointers; the ≤16 frontier has huge headroom. +- **MODULARITY Q = 0.3246** (>0.3 = clear community structure; Newman modularity of the basin partition). + +**What it proves / doesn't:** on REAL frozen-ISA ontology structure, the 16-bit LOCAL reference + the ≤16 family-cohort frontier are real — most subClassOf references stay inside one top-basin, partition locality is genuine. HONEST CAVEAT (in the probe's own verdict): measured on real ontologies (~10³ classes), **NOT Wikidata** (~10⁸); same KIND of structure, smaller scale. The Wikidata P279 run remains the open probe (gated on a real dump, not on disk). Promotes the addressing-locality CONJECTURE to FINDING *on real ontologies*; the Wikidata-scale claim stays CONJECTURE. + +**Falsifies a worry:** had locality been low or fan-out high, the cheap-local-reference + inherited-nothingness scheme would degrade to mostly-far pointers (the scheme's main risk, flagged in the #442 review + the integration map). It didn't — the partition is real. Cross-ref: `delta-card-addressing-integration-map.md` (probe #1), `agnostic-lazy-world-spine.md`, `jc/examples/splat_louvain_modularity.rs` (the modularity machinery reused), `wikidata-lazy-spine-hydration-v1.md` (D-LWS-8 probe harness). 
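
The three numbers reported above (locality, max fan-out, modularity Q) can be illustrated with a minimal sketch. This is not the probe's code: `LocalityReport` and `measure` are hypothetical names, the basin assignment is taken as given, and the modularity formula assumed here is the standard Newman form over undirected edges, Q = Σ_c (e_c/m − (d_c/2m)²). The probe's exact definitions may differ.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical report type for this sketch (not the probe's own struct).
pub struct LocalityReport {
    /// Fraction of subClassOf edges whose endpoints share a basin.
    pub locality: f64,
    /// Largest number of distinct parent basins seen for any one class.
    pub max_fan_out: usize,
    /// Newman modularity of the basin partition (edges treated as undirected).
    pub modularity: f64,
}

/// `edges` are (child, parent) subClassOf pairs; `basin[i]` is the basin of class `i`.
pub fn measure(edges: &[(usize, usize)], basin: &[u32]) -> LocalityReport {
    let m = edges.len() as f64;
    let mut intra = 0usize;
    let mut parent_basins: HashMap<usize, HashSet<u32>> = HashMap::new();
    let mut intra_per_basin: HashMap<u32, f64> = HashMap::new();
    let mut degree_per_basin: HashMap<u32, f64> = HashMap::new();

    for &(child, parent) in edges {
        let (bc, bp) = (basin[child], basin[parent]);
        // Fan-out: which basins does this class point into through its parents?
        parent_basins.entry(child).or_default().insert(bp);
        // Each edge contributes one degree unit to the basin of each endpoint.
        *degree_per_basin.entry(bc).or_insert(0.0) += 1.0;
        *degree_per_basin.entry(bp).or_insert(0.0) += 1.0;
        if bc == bp {
            intra += 1;
            *intra_per_basin.entry(bc).or_insert(0.0) += 1.0;
        }
    }

    let max_fan_out = parent_basins.values().map(|s| s.len()).max().unwrap_or(0);
    let locality = if m == 0.0 { 0.0 } else { intra as f64 / m };
    let modularity: f64 = if m == 0.0 {
        0.0
    } else {
        degree_per_basin
            .iter()
            .map(|(b, d)| {
                let e = intra_per_basin.get(b).copied().unwrap_or(0.0);
                e / m - (*d / (2.0 * m)).powi(2)
            })
            .sum()
    };

    LocalityReport { locality, max_fan_out, modularity }
}
```

On a toy graph with two basins of three classes each and a single cross-basin edge, `measure` reports a locality of 4/5 and a fan-out of 2 for the class that has parents in both basins. The reported 98.61% is the same ratio taken over 1207 of 1224 edges.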
+ +## 2026-05-31 — FINDING (SoC correction): markov_soa IS AriGraph (the cold-path Markov chain promoted to the hot-path SoA); AriGraph is agnostic & NOT necessarily English — the language layer (DeepNSM/COCA) stays UPSTREAM and never reaches into the hot graph + +**Status:** FINDING + done (move shipped; AriGraph version unverified-offline — core does not build in the sandbox). Corrects the premature `deepnsm::markov_soa` placement (`e0a5049`, now deleted) AND its own first framing (the "inject a COCA distance as an alternative" error — that would be the GoBD-with-Rumi mistake). + +**markov_soa IS AriGraph — not "a projector that lives in AriGraph."** AriGraph is a Markov chain in the **cold path**; `markov_soa` is that same chain **promoted to the hot path** (the per-mailbox SoA). Same object, agnostic nature, hot instead of cold. Particle/wave: EW64 / the `CausalEdge64` W-slot → witness arc = the **particle** (discrete, addressable, exact); the windowed projection = the **wave** (accumulated resonance). Both ARE AriGraph. Now `crates/lance-graph/src/graph/arigraph/markov_soa.rs`. + +**AriGraph is agnostic AND NOT necessarily English — the deeper SoC step.** AriGraph holds SPO from ANY source (business, GoBD, Wikidata, English text); its agnosticism is structural — the SoA row is three **opaque `u16` ranks** carrying no language. The match metric is **AriGraph's OWN `cam_pq::DistanceTables`** (the graph's native semantic distance), injected as `Fn(u16,u16)->u8` so the projector names no encoding. **The language layer stays UPSTREAM in DeepNSM and never reaches into the hot graph:** DeepNSM / COCA-4096 / grammar templates are the *English-language input sensor* — they scan flat data (usually English), parse it, EMIT SPO into AriGraph, and **MUST stay English** (the grammar templates get messy the instant they aren't). 
**Injecting a COCA/language distance into the hot-path graph would be the GoBD-with-Rumi error** — running a *language* lens over an *agnostic* graph. The injected distance is AriGraph's cam_pq, NOT a language table. SPO *can* be English (when DeepNSM produced it) but the SoA/AriGraph mailbox-view is never *forced* into a language. Reuse DeepNSM by it FEEDING AriGraph upstream, never by core calling into it (core has 0 deepnsm dep — the dep graph enforces this). + +**Status of the code:** `SpoRanks{s,p,o:u16}` (opaque) + `SoaWavePrimer` + `WaveProjection::best_guess_match(injected dist)` — 4 tests written (determinism, clamp+proximity, injected-distance match, empty=0); **unverified-offline** (lance-graph core's lance/datafusion/arrow deps don't fetch in the sandbox — compile-verify on a full checkout). Truly-correct home = inside the EW64-in-SoA seam (P1+P2); this is the agnostic wave-projector that seam will host. Cross-ref: three-Markovs FINDING (#2 hybrid), EW64-reactive-seam, `witness_corpus.rs` (AriGraph native cam_pq), `soa_view.rs::MailboxSoaView`. + +## 2026-05-31 — FINDING (taxonomy, standing definition): the THREE Markovs — one word, three ranked uses; the deterministic CE64→EW64 chain is the line between grounded and praying + +**Status:** FINDING (standing definition, user-stated 2026-05-31). The canonical disambiguation of "Markov" in this stack. Anchors `markov_soa` (#2), the EW64 reactive-seam (#1 plumbing), and the deprecated VSA-substrate (#3). Ranked by epistemic grounding. + +**The three Markovs:** + +1. **Context-chain building (THE substrate, deterministic).** Mailbox chaining through the `CausalEdge64` W-slot → `EpisodicWitness64` arc — walk the witness references. Fully deterministic, exact, addressable; the arc IS the chain, no bundle. This is *the* Markov — reasoning traverses it; it is truth. (= the EW64 reactive seam; P1+P2 plumbing below.) + +2. 
**Hybrid+ autocomplete (deterministic spine + leashed fuzz).** #1's deterministic chain PLUS a fuzzy accumulated witness-bundle as **speculative autocomplete** — the bundle *proposes* the next mailbox, the chain *confirms or refutes*. Deterministic spine + fuzzy proposer on top; a wrong guess is cheap (cheap reprioritization, never a wrong answer). (= `deepnsm::markov_soa`, shipped `e0a5049`, + the grail-fold experiment, P3 below.) **Invariant: #2 is only ever #2 while its fuzz stays leashed to #1's chain — an UNLEASHED bundle degrades into #3 by definition.** + +3. **"Sink in and pray" (the error).** Old VSA bundle-as-Markov: ceiling-bound superposition, opaque vector, hope-based readout — **NOT deterministically grounded** like #1. The black box the whole thread rejected; deprecated for reasoning (the "every GGUF would already be VSA-fp32" disproof: 30 years, planetary compute, nobody adopted it as a substrate ⇒ it lost). If materialized at all, signed base-5 packed, never fp32, and never as #3. + +**The line (one sentence):** **#1 is the chain. #2 is the chain PLUS a guess it must confirm. #3 is the guess WITHOUT a chain to confirm it.** The presence/absence of the deterministic CE64→EW64 chain underneath is the entire distinction. The firewall is NOT fuzzy-vs-exact — it is **"fuzz a chain confirms (#2, legit) vs fuzz nothing confirms (#3, error)."** + +**Dependency ordering (gate before grail — wire first, then experiment):** +- **P1 — AriGraph → SoA.** Implement the `HotWitness` `todo!()` scaffold (`witness_tombstone.rs`, D-ATOM-5: calcify/from_hot/tombstone-persist/WitnessLink-verify); episodic/semantic edges become SoA-resident. (`E-ARIGRAPH-IS-AN-ISLAND`: "Ee→EW64(hot)+WitnessCorpus(cold)"). +- **P2 — EW64 in `MailboxSoaView`.** Define `EpisodicWitness64` + add `fn episodic_witness(&self) -> &[EpisodicWitness64]` to the view — following the EXISTING deferred-accessor pattern (`soa_view.rs:71` "add `fn qualia()` when the first consumer needs it"). 
The CE64→EW64 arc becomes a readable, addressable SoA column. +- **P3 — the grail experiment (CONJECTURE, gated, Jirak-baselined).** ONLY after P1+P2: fold CE64 or EW64 **deterministically** into a VSA projection; **measure** recoverable best-guess signal vs the black-box baseline (Jirak floor, `I-NOISE-FLOOR-JIRAK`). PASS ⇒ "deterministic-arc-fold autocomplete" (the grail) — promote; FAIL ⇒ stays a #2 proposer. You cannot fold what isn't wired (P1+P2), and you cannot tell signal from prayer without the baseline. P3 is a NEW deliverable DOWNSTREAM of the EW64 seam spec — not part of it (no scope creep). + +**Cross-ref:** `markov_soa.rs` (#2, COCA+CAM-PQ, no cosine), EW64-reactive-seam FINDING (#1 plumbing), substrate-decision FINDING (#3 deprecation + signed-base-5 carrier note), `witness_tombstone.rs::HotWitness` (D-ATOM-5), `soa_view.rs:71` (the accessor pattern), `I-VSA-IDENTITIES`, `I-NOISE-FLOOR-JIRAK`. + +## 2026-05-31 — FINDING (substrate decision, CONVERGED): explicit 32k SPO-W IS the substrate; VSA16k is a strictly-fuzzy PROPOSER (cognitive priming) via COCA+CAM-PQ — never cosine, never truth + +**Status:** FINDING (user-stated + judged 2026-05-31, decisive). Reasoning-substrate decision FIRM; the proposer role is the legitimate home (CONJECTURE on whether it carries signal above noise). This entry CONVERGED across several refinements this session — it supersedes two earlier framings of mine: (a) "VSA = per-cycle experience/soul-print vector" (wrong scope), (b) "keep DeepNSM as a parallel universe" (DeepNSM migrates too). Shipped artifact: `crates/deepnsm/src/markov_soa.rs` (commit e0a5049). + +**The truth path is deterministic, full stop.** A whole book = **~32k exact SPO-W triplets in context** (SPO + the `CausalEdge64` W-slot witness), via the mailbox reference-pointer table (`TripletGraph` → `SpoStore` L2 cold columnar, `spo_bridge.rs`) + per-mailbox NARS awareness. 
Exact, deterministic, CAM-addressed, **zero hallucination, zero embedding, zero bundle**. Holding the explicit 32k is worth **categorically more** than any fuzzy bundle: the explicit form is addressable (each triplet retrievable by CAM), lossless, reasoning-capable (CE64/EW64 traverse, NARS-revise, counterfactual-test), provenance-bearing (the W) — a bundle is none of these. Capacity math seals it: ~√d/4 ≈ 32 recoverable items at d=16384; a book is ~32k = 1000× over → a whole-book bundle is recovery-noise. "String theory": the 32k are the full configuration; a bundle is one projected shadow — you can't do physics in the shadow. + +**VSA16k's legitimate role = a strictly-fuzzy PROPOSER (cognitive priming), firewall-gated.** The fundamental ERROR was never the fuzziness — it was the *posture*: a black box **praying for meaning** (opaque vector, cosine, hope). The irony: the SAME fuzziness is *correct* one layer over, in the discovery/proposer layer (faiss-homology / `I-VSA-IDENTITIES`: similarity lives ONLY there, never in addressing/reasoning). As a proposer it **proposes where-to-look / what-this-resembles** ("this feels like a Sicilian with a pinch of death trap"), the exact 32k SPO-W **always confirms**, and a wrong guess = cheap reprioritization, never a wrong answer (honest approximation, not praying). Test that separates sin from virtue: "if this number is wrong, what breaks?" — reasoning: the answer (catastrophe); discovery: you prefetched the wrong region and exact-confirm corrects you (self-healing). = **cognitive priming**: System-1 prime (VSA proposer) → System-2 calculate (32k SPO-W). The free-energy loop the stack already runs: prior = the prime, evidence = the triplets. + +**The match is DeepNSM's OWN machinery — NOT cosine.** COCA-4096 vocabulary + the CAM-PQ 4096² u8 word-distance matrix via `SimilarityTable::lookup_u8` + grammar heuristics. 
`markov_soa.rs` makes this explicit: a SoA ±window → the addressable list of COCA-rank SPO triplets + full provenance (which rows, what proximity); `best_guess_match` = nearest-triplet CAM-PQ similarity. The triplets stay addressable (no superposition kills the register). Zero new dep (consumes `contract::soa_view::MailboxSoaView` through DeepNSM's existing contract dep). 5 tests incl. `best_guess_match_uses_cam_pq_not_cosine`. + +**DeepNSM migrates too (not a standing parallel universe):** its NSM→SPO FSM *produces triplets* → folds INTO the deterministic SPO substrate; its `markov_bundle` (512-bit XOR/majority) is the same projection idea → deferred-research bucket. Nothing remains as a parallel reasoning substrate. + +**Aerial synergy (grounded seam):** `markov_soa` is the **within-cohort** proposer (one book's ±window); aerial is the **cross-cohort** proposer (mines X→Y over many observations). They share one firewall + one distance idea — DeepNSM's CAM-PQ u8 word-distance is exactly aerial's `CodebookDistance::distance(a,b)->u32` shape, so a ~20-line adapter lets aerial probe DeepNSM's COCA semantic space. Both feed the EW64 "fire together → wire together" prefetch from opposite ends. Queued (not built): the `markov_soa`→`CodebookDistance` adapter D-id (crosses deepnsm↔arm-discovery; own slice). + +**CONJECTURE + probe:** does the CAM-PQ match over windowed rank-triplets carry recoverable best-guess signal above the Jirak noise floor (`I-NOISE-FLOOR-JIRAK`)? PASS ⇒ promote as an Aerial discovery-lens; FAIL ⇒ research-only. **Reconciliation note:** CLAUDE.md "The Click" + `I-SUBSTRATE-MARKOV` still frame the VSA bundle AS the reasoning substrate — superseded (reasoning = deterministic SPO-W; VSA = fuzzy proposer); CLAUDE.md edit deferred to a deliberate doc pass, this board finding is the authoritative ledger meanwhile. 
Cross-ref: `delta-card-addressing-integration-map.md`, EW64-reactive-seam finding, `I-VSA-IDENTITIES`, `markov_soa.rs`, aerial `CodebookDistance`. + +## 2026-05-31 — FINDING (integration gap, SHOCK): EW64 is the unbuilt REACTIVE SEAM — Markov is the basis, predictive-prefetch is the Meta, "fire together → wire together"; every link shipped, the chain is open at the joints + +**Status:** FINDING (named integration gap + behavioral spec; user-stated 2026-05-31). Refines `E-EW64-IS-PREDICTIVE-PREFETCH` + `E-AERIAL-FEEDS-EW64-PREFETCH` + `E-ARIGRAPH-IS-AN-ISLAND` (2026-05-30). Decision: spec the WHOLE reactive seam (one spec), spawn AFTER the running probe wave consolidates (second wave). + +**The layering (corrected — I had it inverted):** **Markov (the `CausalEdge64` W-slot → EW64 witness arc) is the BASIS** — the substrate fact of which witnesses fired in sequence ("fire together"). **Predictive-prefetch is the META** — the emergent behavior on the Markov basis: because they fired together, prefetch the next before it's asked ("wire together"). So EW64 is not an optimization layered onto reasoning — **the prefetch IS the wiring IS the learning** (Hebbian, literally): aerial mines co-occurrence offline ("fire together"), EW64 prefetches it online ("wire together"), the surviving arc is the learned structure. One mechanism, three names by timescale. + +**The reactive spine (the keystone, previously missed): `Lance update = the witness pointer = the SurrealDB kanban subscription trigger`.** A witness fires → Lance fragment append (the update IS the witness pointer materializing) → SurrealDB LIVE subscription on that table fires → the kanban (`KanbanMove`, #437) advances a mailbox phase → EW64 prefetches the aerial-predicted next arc into the SoA → the shader finds it already resident. The update, the pointer, and the trigger are the **same event** — the "wire together" propagates THROUGH the storage layer as the prefetch signal. 
This is why EW64 shares CE64 low-40 bits (co-address), why kanban is in `contract`, why surreal_container is a transparent SoA view — all built to be links of ONE chain, and the chain is the thinking. + +**The SHOCK (the diagnosis):** every individual link exists and tests green, but **the chain is open at the joints** — the island-archipelago failure (`E-ARIGRAPH-IS-AN-ISLAND` verbatim: "Ee→EW64(hot prefetch)+WitnessCorpus(cold)" is the unwired task). Shipped: `CausalEdge64`, `WitnessTable<64>`, `ReasoningWitness64` (splat.rs:78), `KanbanMove`+SoA view (#437), aerial X→Y (#436/#438/#443). Scaffold-only: `HotWitness` (witness_tombstone.rs:70, `todo!()` bodies). **Unbuilt: `EpisodicWitness64`/`SpoWitness64` (arc `pr-ce64-mb-4`) — 0 code symbols — AND the Lance-LIVE→Surreal→kanban subscription.** It's the most expensive kind of gap: invisible in green suites (every crate passes; the system doesn't *do* the thing) because the **integrating seam was never built**. EW64 is not a type to add — it's the seam that closes the reactive loop: contract-atom (shares CE64 bits) + the Lance→Surreal→kanban subscription + materializing `HotWitness`'s `todo!()`s. Three links, one chain. + +**Next (queued, second wave):** one spec `.claude/specs/episodic-witness64-ce64-prefetch.md` covering the whole seam (contract EW64 atom CE64-mirrored, shares low-40 SPO bits, `SpoWitness64` alias `pr-ce64-mb-4`; the Lance-LIVE→Surreal-subscribe→kanban wiring contract; `HotWitness` materialization), impl phased/gated (surrealdb+ractor are heavy/cross-crate; firewall + D-ARM-7 hold). Also LE-1: EW64's second role = syntactic-coreference pointer (relative pronoun → antecedent pointer, not a bundle; register-laziness). 
Cross-ref: `E-EW64-IS-PREDICTIVE-PREFETCH`, `E-AERIAL-FEEDS-EW64-PREFETCH`, `E-ARIGRAPH-IS-AN-ISLAND`, `E-ARIGRAPH-PAPER-GROUNDS-CE64-EW64` (LE-1); `delta-card-addressing-integration-map.md`; #437 kanban; `splat.rs::ReasoningWitness64`, `witness_table.rs`, `witness_tombstone.rs::HotWitness`, `arigraph/{episodic,witness_corpus}.rs`. + +## 2026-05-31 — FINDING (capstone synthesis): the DELTA-CARD world-spine — card = surprise, deck = expectation; key and value compress by the same delta-over-frozen-archetype move + +**Status:** CONVERGED VISION (8-turn design synthesis; primitives shipped, consolidation + delta-card value model NEW, claims labelled + probed). Full map: `delta-card-addressing-integration-map.md`. Supersedes the scattered addressing fragments in `agnostic-lazy-world-spine.md`. + +**The one idea:** a card stores the *surprise*, the deck stores the *expectation*; **meaning = deck ⊗ delta**. Everything — recipe, Wikidata entity, address, sentence-mailbox — is a small delta over an inherited frozen archetype, reconstructed on demand. This IS the free-energy framing (`CLAUDE.md` F = (1−likelihood)+kl): archetype = prior, delta = prediction-error, **bit-width = residual surprise**. It applies to BOTH halves of a row — the **key** (address) and the **value** (content) compress identically. + +**Cookbook (value side):** `recipe = inherited(region×season×persona) + 8–16 delta bits` (texture/sweet/sour/salty/veg-axis, 2b each). The 16-bit card is meaningless until resolved against its deck. Boundary: the delta carries the *compressible profile*; irreducible specifics (quantities, novel steps, fusion) are stored values / forks (generator-vs-derivable split). 
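
A minimal sketch of the cookbook value side, under loud assumptions: the text does not define the `⊗` operator, so XOR over packed 2-bit axes is swapped in here because it makes the modal member an empty (all-zero) card and is its own inverse. The names `pack`, `unpack`, `card_for` and `reconstruct` are hypothetical and do not refer to repo symbols; the axis order (texture, sweet, sour, salty, veg) follows the list in the paragraph above.

```rust
/// Pack five 2-bit axis values (texture, sweet, sour, salty, veg; each 0..=3)
/// into the low 10 bits of a u16. Values above 3 are masked to 2 bits.
pub fn pack(values: [u8; 5]) -> u16 {
    values
        .iter()
        .enumerate()
        .fold(0u16, |acc, (i, &v)| acc | (u16::from(v & 0b11) << (2 * i)))
}

/// Inverse of `pack`: split the low 10 bits back into five 2-bit axis values.
pub fn unpack(bits: u16) -> [u8; 5] {
    let mut out = [0u8; 5];
    for (i, slot) in out.iter_mut().enumerate() {
        *slot = ((bits >> (2 * i)) & 0b11) as u8;
    }
    out
}

/// Card = surprise: the bits where the entity's profile departs from the deck archetype.
/// ASSUMPTION: XOR stands in for the undefined "deck (x) delta" combination.
pub fn card_for(deck: u16, profile: u16) -> u16 {
    deck ^ profile
}

/// Meaning = deck combined with delta. With XOR, an all-zero card returns the archetype.
pub fn reconstruct(deck: u16, card: u16) -> u16 {
    deck ^ card
}
```

With this choice, a recipe that matches its region×season×persona archetype stores a zero card, and a recipe that differs on one axis stores a card whose only nonzero field is that axis. The card alone says nothing until it is resolved against its deck, which is the point the paragraph makes.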
+ +**Addressing chain (key side):** (1) **partition-as-address, schema-as-deck** — the address is *location not a stored column* (Quartettkarten: the card is *in* the Auto box, doesn't carry category=Auto); the 256-ary OWL/DOLCE subClassOf directory encodes upper bits in the path, OGIT holds the lookup once, the row stores ~0 address/schema/label bits. (2) **27-bit truthful floor** — 113M → ⌈log₂⌉ = 27 bits irreducible (the QID already ≈ 2²⁷; classes can't shrink *identity*) — but partition-as-address makes the 27 bits FREE per-row (`(path<`, 6-bit W-slot, walks inside the cohort). **Pointer-width = corpus-size identity:** 6-bit W-slot = 64 (the cohort) ⊂ **16-bit (in EpisodicWitness64) = 65,536 ≈ 64K SPO = one BOOK** (Bible ~32k sentences = half; novel ~4–5k = ~7%) ⊂ 32-bit `mailbox_ref` = 4.3B (the world-spine, Wikidata ~115M). 64K is exactly the documented mailbox-envelope lower bound (witness_table.rs "64K–256K", plan §10). Pick the pointer width, you've picked the horizon: cohort ⊂ book ⊂ world. **Address vs hot working-set (the 256K payoff):** Wikidata (~115M) is 32-bit-ADDRESSED (cold spine, lazy, never resident); the documented **256K (2¹⁸) is the concurrent HOT mailbox envelope**. You foveate Wikidata, so 256K holds whole corpora + a hydrated Wikidata slice at once: Bible (~32k) + LOTR (~28-30k) = ~62k ≈ one 16-bit corpus, leaving ~190k (3× headroom) for the Wikidata reasoning window. ⇒ cross-corpus grounded reasoning ("Frodo ↔ biblical archetype, grounded in Wikidata") fits in one hot context BECAUSE the spine stays cold. Bounded hot context, unbounded cold spine. Full nesting: 6-bit cohort(64) ⊂ 16-bit book(64K) ⊂ 18-bit hot envelope(256K) ⊂ 32-bit world(4.3B, cold). + +**Bit budget + addressing:** the resident agnostic row shrinks 16384 → ~4096 bits (HHTL address carries class+label inheritance; qualia-i4-16D 64 + thinking-i4-32D 128 + CausalEdge64-with-W-slot + EpisodicWitness64 + presence/class fit). 
The address can be brutal: **byte-aligned 256⁴ = 2³² ≈ 4.3 B** — the 4-byte CAM-PQ code IS the address = class+label key = palette-distance key (vs 64K² = 4.3 B shallow, 4096³ = 69 B headroom). Fan-out frozen append-only once chosen. + +**The one missing runtime piece:** a `NiblePath`-keyed **tiered hydration manager** (hot mailbox-SoA ↔ cold Lance, foveated `RouteAction` prefetch, perm/temp eviction, late labels). CONJECTURE to probe: the Poincaré φ-spiral leaf encoding. Gate: D-ARM-7 (`jc::jirak`) before any hydrated rule writes a live store. Cross-ref: `agnostic-lazy-world-spine.md`, `owl-dolce-hhtl-compartments-aerial-fed.md`, `wikidata-hhtl-load.md`, #437/#441/#442/#443, `crates/jc`. + ## 2026-05-31 — FINDING: D-CLS (#441) ↔ D-ARM-14 (#438) converge on Wikidata-HHTL — the second-domain falsifier reuses the class-meta-DTO 1:1 (cross-session synthesis) **Status:** FINDING (cross-session reconciliation, confirmed by the D-ARM-14 session). Anchors the Wikidata-HHTL arc on the merged D-CLS machinery; no parallel layer grows. diff --git a/.claude/board/STATUS_BOARD.md b/.claude/board/STATUS_BOARD.md index 3e6b17ee..2b8dda7b 100644 --- a/.claude/board/STATUS_BOARD.md +++ b/.claude/board/STATUS_BOARD.md @@ -671,6 +671,31 @@ The bounded-weekend fix `cognitive-risc-classes.md:56-57` prescribes (discrimina --- +## wikidata-lazy-spine-hydration-v1 — the NiblePath-keyed tiered hydration manager + addressing (the "agnostic lazy world-spine" runtime) + +The one missing runtime piece behind the converged delta-card / world-spine vision (`delta-card-addressing-integration-map.md`, `agnostic-lazy-world-spine.md`). Plan: `.claude/plans/wikidata-lazy-spine-hydration-v1.md` (9 D-ids, authored by the W1 wave worker). All gated on D-ARM-7 (Jirak floor) before any hydrated rule writes a live store; firewall (aerial = zero-dep proposer, hub owns contract/ontology) preserved. 
+ +| D-id | Deliverable | Crate(s) | LOC | Conf | Status | Notes | +|---|---|---|---|---|---|---| +| D-LWS-1 | Sparse radix range-delegation register (path-compressed trie over the frozen ontology; occupied branch points only; reuses `NiblePath` as the address — never re-encodes identity) | lance-graph-contract / -ontology | ~? | MED | **Queued** | partition-as-address; 27-bit floor with ~0-bit row | +| D-LWS-2 | Delta-card value model (`reconstruct = deck ⊗ delta`; per-entity surprise as a `FieldMask` delta over the inherited archetype; modal member = empty card) | lance-graph-contract | ~? | MED | **Queued** | built on `FieldMask::inherit` | +| D-LWS-3 | RISC compose-cache + per-predicate composability flag (store generators, compose ≤7-hop closure via `ComposeTable`/`mxm`; dissolves the hub problem) | lance-graph + bgz-tensor | ~? | MED | **Queued** | generators=continuant/cold, composed=occurrent/evictable | +| D-LWS-4 | I/P/B frame model over Lance versioning (I=frozen radix+base, P=append, B=compose-cache, GOP=compaction) | lance-graph | ~? | MED | **Queued (spike)** | R2: repo wires dataset-level `VersionedGraph`, not fragment-level — fragment GOP is a NEW spike | +| D-LWS-5 | **The `NiblePath`-keyed tiered hydration manager** (THE missing piece): hot `MailboxSoaView` ↔ cold `VersionedGraph`, address-not-join, agnostic SoA, carries CE64+witness arc; write-refusal until D-ARM-7 | lance-graph | ~? | MED | **Queued** | centerpiece; D-ARM-7 write-refusal acceptance test | +| D-LWS-6 | Foveated prefetch cascade (`HhtlCache::route` Skip/Attend/Compose/Escalate decides periphery prefetch into the 256K envelope) | lance-graph + bgz-tensor | ~? | MED | **Queued** | the Google-Maps tile prefetch | +| D-LWS-7 | Eviction on the DOLCE continuant/occurrent 1-bit (`dolce_id==PERDURANT` ⇒ occurrent ⇒ evictable; 4-facet axis preserved, residence bit derived) | lance-graph | ~? 
| MED | **Queued** | the perm/temp residence policy | +| D-LWS-8 | Probe harness — runs the 3 falsifiers (Louvain-CLAM locality, delta-card residual, compose hit-rate) on real `data/ontologies/*.ttl` + fixtures; PRODUCES the gates | crates/jc + lance-graph | ~941 | HIGH | **Probe-1 SHIPPED** | `jc/examples/ontology_locality_probe.rs` RUN: **locality 98.6%, max fan-out 3 (≤16), Q=0.325 → PASS** on real ontologies (not yet Wikidata). Probes 2-3 queued. | +| D-LWS-9 | DEFERRED full Wikidata 115M load (skeleton+basins+CAM-dedup+thin rows) | wikidata loader | ~? | LOW | **Deferred** | gated on all 3 probes PASSED + D-ARM-7; CONJECTURE (no dump on disk) | + +## Markov substrate clarification (markov_soa / EW64) — three-Markovs taxonomy + +| D-id | Deliverable | Crate(s) | LOC | Conf | Status | Notes | +|---|---|---|---|---|---|---| +| D-MKV-SOA | `arigraph::markov_soa` — the Markov *wave* (AriGraph cold-path chain promoted to hot-path SoA); vocabulary-agnostic `SpoRanks{u16}` + `SoaWavePrimer` + `WaveProjection::best_guess_match(injected dist)`; the "hybrid+ autocomplete" #2 proposer (dark-horse) | lance-graph::graph::arigraph | ~230 | MED | **Shipped (branch, unverified-offline)** | moved out of deepnsm (SoC fix); match = AriGraph's own cam_pq, language stays upstream; 4 tests written, core doesn't build in sandbox → verify on full checkout. 
Findings: three-Markovs, markov_soa-IS-AriGraph |
+| D-EW64-NOTE | `MailboxSoaView` doc note: `EpisodicWitness64` = AriGraph in the mailbox SoA view (the particle; cold→hot); deferred accessor (qualia-pattern) | lance-graph-contract::soa_view | ~20 | HIGH | **Shipped (branch)** | verified (contract builds, 3/3 soa_view tests); EW64 not yet a code symbol — P2 of three-Markovs ordering |
+
+---
+
 ## Update protocol
 
 When a deliverable ships:
diff --git a/.claude/knowledge/agnostic-lazy-world-spine.md b/.claude/knowledge/agnostic-lazy-world-spine.md
new file mode 100644
index 00000000..0a5c72c0
--- /dev/null
+++ b/.claude/knowledge/agnostic-lazy-world-spine.md
@@ -0,0 +1,211 @@
+
+
+# KNOWLEDGE: The agnostic lazy world-spine — Wikidata as a foveated, tiered, address-unified substrate
+
+## READ BY:
+- Anyone building the `NiblePath`-keyed hydration manager (the one missing runtime piece)
+- Anyone touching the AriGraph SPO ↔ mailbox-SoA ↔ OGIT/DOLCE-cache boundary, the GraphRouter cold path, or the lazy-loading spine
+- `truth-architect`, `integration-lead`, `palette-engineer`
+
+> **Status: NORTH-STAR VISION (living).** The *addressing + compression + cheap
+> late-resolution* primitives are built; the *runtime tiered-hydration* layer is
+> not. CONJECTURE items are labelled. This is the goal the D-ARM-13/14 + D-CLS +
+> Wikidata-HHTL arc serves — not a shipped system.
+>
+> **➤ The addressing design has since converged — see the consolidated
+> `delta-card-addressing-integration-map.md` (card=surprise/deck=expectation;
+> partition-as-address; 27-bit floor; sparse radix; x264 I/P/B frames over Lance;
+> RISC compose-not-materialize). That doc supersedes the scattered "Bit budget /
+> Address space / fan-out" sections below.**
+
+---
+
+## The goal
+
+Compress Wikidata well enough that it is a **lazy-loading spine**: a tiny
+always-resident skeleton, with **on-demand, foveated, blasgraph-adjacent
+hydration** that loads detail *exactly where reasoning looks* — like foveated
+rendering (sharp at the fovea, periphery coarse) and Google-Maps tile prefetch
+(the adjacent area streams into context before you pan to it). Reasoning then
+has **one unified allocation address**; the substrate stays **compartmentalized,
+cheap, and agnostic**.
+
+## The tiered substrate
+
+```
+COLD (persistent)        ADDRESS                  HOT (resident, agnostic)      SEMANTIC (late)
+Lance columnar +         NiblePath HHTL           mailbox SoA register          OGIT / DOLCE cache
+DataFusion joins   ◄──   (16ⁿ, the one key)──►    (MailboxSoaView, &[T],   ◄──  (C2: resolve, never
+(inherited upstream)          │                    label-free bytes)             store; class flies
+   transparent lazy view ─────┴── foveated hydration ──┴── late-label overlay    ABOVE the SoA)
+   residence = DOLCE 1 bit (continuant=permanent / occurrent=temporary)
+   leaf      = Poincaré golden-ratio (φ) spiral — orthogonal spatial coordinate
+```
+
+## Layer → substrate (built / new / conjecture)
+
+| Layer | Role | Substrate | Status |
+|---|---|---|---|
+| **Address** | one O(1) key = ontology position **=** memory arena **=** spatial coord | `contract::hhtl::NiblePath` (16ⁿ, bit-shift) | **built** (#442) |
+| **Cold floor (HHTL)** | address-based hydration source (NO join) | Lance columnar reads keyed by HHTL address → CAM/palette/`blasgraph` (O(1)) | **built** primitives; used as a direct-address lazy view |
+| **Cold floor (SQL)** | business ground-truth queries only — **slow, off the HHTL path** | DataFusion rows/cols joins (inherited upstream) | **built**; reserved for relational ground truth, NOT spine hydration |
+| **Hot carrier** | resident, **agnostic** structural bytes (class_id, NiblePath, presence `FieldMask`, perm/temp bit) — **no labels** | mailbox SoA `MailboxSoaView`/`MailboxSoaOwner` | **built** (#437) |
+| **Semantic overlay** | labels / class shape / DOLCE resolved **late**, per address | OGIT TTL cache + `ClassView` + DOLCE-from-cache (`dolce_id`) | **built** (#441) — C2 resolve-not-store |
+| **Discovery feed** | what lands where, from runtime data | aerial proposer + splat `CodebookDistance` | **built** (#438/#443) |
+| **Residence policy** | keep vs evict / persist vs ephemeral | **DOLCE 1 bit**: continuant (Endurant/Quality/Abstract = permanent) vs occurrent (Perdurant = temporary) | **NEW** (design) |
+| **Hydration manager** | lazy-load a basin's cold rows + blasgraph adjacency on first touch; foveated adjacency prefetch (`RouteAction` cascade); evict cold/occurrent arenas | hot mailbox-SoA ↔ cold Lance, keyed by `NiblePath` | **NEW — the one missing runtime piece** |
+| **Leaf encoding** | fine orthogonal spatial code within a class | Poincaré-disk φ-spiral (golden angle) | **CONJECTURE** (φ-spiral prior art + hyperbolic-tree geometry) |
+
+## The three reframings that complete it
+
+1. **lance-graph's cold path splits in two — and the join is NOT on the HHTL path.** DataFusion rows/cols joins are *slow*; they serve **business-SQL ground truth** only. The HHTL spine hydrates by **address**, not join: `NiblePath` → Lance columnar read → CAM/palette/`blasgraph`, O(1). The `GraphRouter` routes HHTL to the fast address backends and SQL to DataFusion — same store, two access paths, only one on the hot path.
+2. **DOLCE = a 1-bit permanent/temporary residence policy.** Endurant (continuant — wholly present at each moment, persists) vs Perdurant (occurrent — temporal parts, happens-then-ends). The ontology's own top split *is* the cache policy: permanent ⇒ cold-persist/resident; temporary ⇒ ephemeral/evictable (the Baton/event traffic, `KanbanMove` Libet-temporal #437). `dolce_id 0..3` stays cache-resolvable; eviction keys on the derived 1 bit.
+3. **AriGraph SPO + labels → agnostic SoA + late labels (C2 wholesale).** The SoA holds only structure + address; labels/classes/DOLCE resolve late from the cache. AriGraph becomes a *view*: structure hot + agnostic, semantics a cache overlay. ⇒ representation compartmentalized (basins), cheap (resolve-not-store + lazy), agnostic (register is meaning-free).
+
+## Bit budget — the agnostic row is ~4096 bits (NO VSA)
+
+**There is NO VSA in this design — no 16384-bit bundle, no fingerprint
+superposition.** The Markov is the **`CausalEdge64` W-slot → `EpisodicWitness64`
+arc** (`witness_table.rs`: "the chain of W-references across edges forms a
+Markov-style belief-update arc through episodic-reference vectors"). Traversal
+walks the W-references backward (most-recent → oldest witness) **without
+dereferencing the full SPO store per hop** — a native graph walk: integer, exact,
+cheap. `EpisodicWitness64` is **the new AriGraph, migrated INTO the SoA per
+ractor-mailbox** — cohort-local episodic memory as a SoA column, not an external
+graph; it generalises the shipped 6-bit-W-slot `WitnessTable<64>`/`WitnessEntry`
+(**NEW build target — see status**). The resident row carries the **CE64 + the
+EpisodicWitness arc + the address**. The HHTL address does class + label
+inheritance for free (the path IS the class; labels resolve late). A plausible
+~4096-bit budget (64-bit lanes):
+
+| field | bits | role |
+|---|---|---|
+| HHTL address (NiblePath / CAM-PQ code) | 16–32 | position **+** class **+** inherited-label key |
+| i4-16D qualia | 64 | angle (packed `mul::i4`) |
+| i4-32D thinking | 128 | style/`MetaWord` |
+| `CausalEdge64` | 64 | the planner edge **+ W-slot = the Markov arc pointer** |
+| `EpisodicWitness64` (AriGraph-in-SoA) | 64 | the episodic witness the W-slot resolves to |
+| presence `FieldMask` + `class_id` + perm/temp | ~96 | structure |
+| headroom | rest | append-only spare |
+
+…all fitting comfortably in 4096 bits. **Reasoning = traversing the
+CE64→EpisodicWitness arc + SPO** — a native graph walk, the row carries
+everything a hop needs. The discovery layer (aerial/splat) uses a `palette256`/
+CAM-PQ distance hydrated transiently if at all, then dropped — never on the
+reasoning hot path, and not a bundle. (CONJECTURE — settle the exact budget +
+the `EpisodicWitness64`/SoA column layout before the loader.)
+
+## Reading a text = holding SPO + CE64 + EW64 in context
+
+Because the CE64→EW64 arc traversal is native and cheap, **reading is just
+accumulating SPO mailboxes with their causal-edge + witness arc** — no embedding,
+no bundle, no model forward pass. Each sentence ≈ one SPO mailbox (S/P/O + a
+`CausalEdge64` linking it to the prior state via the W-slot + an
+`EpisodicWitness64`). Ambiguity is resolved by **counterfactual testing**
+(`recipe_kernels`: `world' = world ⊗ factual ⊗ counterfactual`, divergence =
+popcount) on the scenario-only `SplatChannel::Counterfactual` that must NOT
+promote facts — a little overhead per ambiguous edge.
+
+**Scale (rule of thumb):** a 250-page book ≈ 75,000 words ÷ ~17 words/sentence ≈
+**4,000–5,000 sentences ≈ ~4096 SPO mailboxes** + a little counterfactual
+overhead. The whole book is then a bounded cohort of ~4096 mailboxes — and the
+`WitnessTable<64>` is *per-cohort* (6-bit W-slot), so the arc is walkable inside
+the cohort without touching the global store. **A book is a cohort; the world-spine
+is the union of cohorts.**
+
+## The pointer-width = corpus-size identity
+
+A witness pointer's bit-width *is* the corpus it can address — one identity:
+
+| pointer width | reach (2ⁿ) SPO mailboxes | corpus it spans |
+|---|---|---|
+| 6-bit W-slot (`CausalEdge64`) | 64 | the immediate cohort (intra-`WitnessTable`) |
+| **16-bit** (inside `EpisodicWitness64`) | **65,536 ≈ 64K SPO** | **a whole book** (Bible ≈ 32k sentences = half; a novel ≈ 4–5k = ~7%) |
+| 32-bit (`mailbox_ref`, the workspace envelope) | 4.3 B | the full world-spine (Wikidata ≈ 115 M) |
+
+So a **16-bit pointer ≈ 64K SPO ≈ one book** — and 64K is exactly the documented
+**mailbox-envelope lower bound** (`witness_table.rs`: "64K–256K mailbox envelope",
+plan §10). The `EpisodicWitness64` therefore has room to spare: a 16-bit
+intra-corpus slot addresses any sentence in a book, leaving the other 48 bits for
+cohort id + channel + flags. The Bible (~32k sentences) sits at half a 16-bit
+space; a 250-page novel (~4–5k) at ~7%. **One book = one 64K-addressable witness
+corpus; the world-spine = the 32-bit envelope over all of them.** The widths
+nest: 6-bit cohort ⊂ 16-bit book ⊂ 32-bit world — pick the pointer, you've picked
+the horizon. (CONJECTURE — exact `EpisodicWitness64` sub-field layout TBD; the
+6-bit and 32-bit ends are in code, the 16-bit book tier is the proposed middle.)
+
+## Address space vs hot working set — the 256K payoff
+
+**Two different 256Ks; don't conflate them.** Wikidata (~115 M) is *addressed*
+by the 32-bit `mailbox_ref` (4.3 B) — it is the **cold spine**, lazy, never fully
+resident. The **256K is the concurrent hot mailbox envelope** (the documented
+`64K–256K` envelope, 2¹⁸) — how many mailboxes are *live at once*. The power is
+that you never need Wikidata resident: you **foveate** it, so the 256K hot window
+holds whole corpora **plus** a hydrated Wikidata slice, simultaneously:
+
+| resident in the 256K hot envelope | mailboxes |
+|---|---|
+| Bible (≈ 31k verses) | ~32k |
+| LOTR trilogy (≈ 480k words ÷ 17) | ~28–30k |
+| **both books fully resident** | **~62k ≈ one 16-bit corpus** |
+| foveated Wikidata reasoning window | **~190k left (≈ 3× headroom)** |
+
+So **both books together ≈ 62k ≈ just under one 64K space**, and 256K = 4× that —
+enough to hold **both books fully resident + a large hydrated Wikidata slice at
+once**, which is exactly what cross-corpus grounded reasoning needs (e.g. "relate
+Frodo to a biblical archetype, grounded in Wikidata facts" → all three in one hot
+context). The precise statement: **bounded hot context (256K concurrent),
+unbounded cold spine (32-bit Wikidata, lazy)** — 256K is enough for multi-book +
+grounded reasoning *precisely because* Wikidata stays foveated; you never pay for
+the 99.99 % you are not looking at. Full nesting:
+
+```text
+  6-bit cohort (64) ⊂ 16-bit book (64K) ⊂ 18-bit HOT envelope (256K = ~4 books,
+                                              or 2 books + a Wikidata window)
+                                         ⊂ 32-bit world (4.3B Wikidata, COLD/lazy)
+```
+
+## Addressing — fan-out × depth (the brutal version)
+
+The HHTL address can be far coarser/cheaper than the 16-way `NiblePath`. For
+~4 billion addressable (Wikidata ≈ 115 M = 2²⁷, so 2³² is ~37× headroom):
+
+| scheme | levels × bits | reach | addr | natural fit |
+|---|---|---|---|---|
+| **256⁴** | 4 × 8-bit (byte) | 2³² ≈ 4.3 B | 32 b / **4 B** | **palette256 + CAM-PQ code IS the address**; byte-aligned; OGIT byte-basins |
+| 64K² | 2 × 16-bit | 2³² ≈ 4.3 B | 32 b (2 hops) | shallowest (2 hops); `n×16-bit` cache levels |
+| 4096³ | 3 × 12-bit | 2³⁶ ≈ 69 B | 36 b | 4096-codebook / 4096-COCA native; big headroom |
+| 16¹⁶ (current `NiblePath`) | 16 × 4-bit | 2⁶⁴ | ≤64 b | deep/fine, but up to 16 hops |
+
+**Recommendation: byte-aligned 256⁴.** The 4-byte address *is* a CAM-PQ code, so
+**addressing, class+label inheritance, and the `palette256` similarity-key are the
+same 4 bytes** — one token does ontology-position + class + label + distance-key.
+That is the brutal compression: `n × 16-bit` per cache level, two levels reach 4 B.
+(CONJECTURE; the current `NiblePath` is 16-way, so this is a re-parameterization,
+and the fan-out must be frozen append-only once chosen — the ISA-freeze the #442
+review flagged.) Caveat: 256⁴/64K² cap at ~4 B (Wikidata fits); a multi-domain
+super-graph that needs 69 B wants 4096³.
+
+## Invariants this must NOT break
+
+- **CAM-exact; similarity only in discovery.** `NiblePath` + Lance rows are exact retrieval. Similarity (aerial/splat) stays in the proposer/discovery layer — never in the view or the address (`faiss-homology-cam-pq` iron rule, `I-VSA-IDENTITIES`). The φ-spiral leaf is a *coordinate*, not a fuzzy index.
+- **1-bit vs 2-bit DOLCE.** Keep `dolce_id 0..3` in the cache; the residence bit is *derived* (occurrent vs continuant), not a replacement — don't drop the 4-facet axis.
+- **The SoA stays agnostic, forever.** Never cache a label in the register "for speed" (core inv #1 / C2 — register-loss + coupling). Labels live only in the cache; the SoA holds the address that fetches them.
+
+## Why it's cheap
+Nothing semantic is stored hot (resolve-not-store); structurally-identical classes collapse to one shape-family (CAM-dedup, the N4 collapse); the address is integer bit-shift; only the foveal region is hydrated; permanent/temporary eviction frees occurrent arenas. The OGIT cache makes class/DOLCE hydration a lookup.
+
+## Status & next
+- **Built:** address (`NiblePath`), cold floor (Lance/DataFusion/GraphRouter), hot carrier (mailbox SoA), semantic overlay (OGIT/DOLCE cache, C2), discovery feed (aerial).
+- **The one missing runtime piece:** the `NiblePath`-keyed tiered **hydration manager** (foveated, perm/temp-evicting, late-label). Everything else is a seam it plugs into.
+- **NEW build target:** `EpisodicWitness64` = **AriGraph migrated into the SoA per ractor-mailbox** (cohort-local episodic memory as a SoA column). Shipped seed = `WitnessTable<64>` + `WitnessEntry` (6-bit W-slot); `EpisodicWitness64` itself is not yet a code symbol — its 64-bit sub-field layout (the 16-bit book tier) is the design surface to settle.
+- **CONJECTURE to probe:** the Poincaré φ-spiral leaf encoding (does φ-spiral placement preserve nearest-neighbour fidelity vs the splat distance?).
+- **Gate:** D-ARM-7 (Jirak floor, `jc::jirak`) before any hydrated rule writes a live store.
+
+## Cross-references
+- `contract::hhtl::NiblePath` (#442), `class_view::{FieldMask,ClassView}` (#441), `soa_view::MailboxSoaView` (#437), `lance-graph` (Lance/DataFusion/`GraphRouter`), `lance-graph-ontology` (OGIT/DOLCE cache), `lance-graph-arm-discovery` (aerial), `crates/jc` (cert + Jirak).
+- `.claude/specs/wikidata-hhtl-load.md`, `.claude/knowledge/{owl-dolce-hhtl-compartments-aerial-fed,splat-codebook-aerial-wikidata-compression,ogit-owl-dolce-ontology-compartments,phi-spiral-reconstruction,zeckendorf-spiral-proof}.md`.
+- CLAUDE.md: The Click (AriGraph as thinking tissue), the Baton (ephemeral handoffs), `I-VSA-IDENTITIES`, `I-NOISE-FLOOR-JIRAK`; `cognitive-risc-classes.md` N4.
diff --git a/.claude/knowledge/delta-card-addressing-integration-map.md b/.claude/knowledge/delta-card-addressing-integration-map.md
new file mode 100644
index 00000000..9eb1ea3d
--- /dev/null
+++ b/.claude/knowledge/delta-card-addressing-integration-map.md
@@ -0,0 +1,296 @@
+
+
+# INTEGRATION MAP: the delta-card world-spine — one idea, key and value
+
+## READ BY:
+- Anyone implementing the addressing / hydration / delta-card layer of the Wikidata-HHTL spine
+- Anyone touching the frozen ontology radix, the Lance fragment GOP, the RISC compose-cache, or the OGIT/DOLCE class deck
+- `truth-architect`, `integration-lead`, `palette-engineer`
+
+> **Status: CONVERGED VISION (living), built bottom-up over an 8-turn design
+> session. The primitives it composes are SHIPPED (NiblePath, FieldMask,
+> ClassView, CausalEdge64+WitnessTable, ComposeTable, CLAM, Lance fragments);
+> the consolidation + the delta-card value model are the NEW synthesis. Every
+> load-bearing claim is labelled and carries a probe.** Companion:
+> `agnostic-lazy-world-spine.md` (the tiered-substrate framing this refines).
+
+---
+
+## The one idea (read this, the rest is derivation)
+
+**A card stores the *surprise*; the deck stores the *expectation*. Meaning =
+deck ⊗ delta.** Everything — a recipe, a Wikidata entity, an address, a
+sentence-mailbox — is a **small delta over an inherited frozen archetype**,
+reconstructed on demand. The deck (class / region / ontology path) is frozen and
+shared; the card is a few bits of deviation. This is literally the free-energy
+framing (`CLAUDE.md`: `F = (1−likelihood) + kl`): **the archetype is the prior,
+the delta is the prediction error, and the bit-width IS the residual surprise.**
+
+It applies to **both halves of a row**: the **key** (address) and the **value**
+(content) compress by the *same* delta-over-archetype move.
+
+### The endgame — inherited nothingness
+
+The radix trie + the Quartettkarten OGIT-inherited classes drive the per-entity
+cost to its floor by splitting two things normally conflated:
+
+- **Identity** (*which one*) = **27 bits, irreducible** — owned by the radix
+  trie, which path-compresses away every single-child chain (no branch = no
+  information = nothing stored).
+- **Description** (*what it is*) = **~0 bits for the modal member of a class** —
+  inherited whole from the frozen deck. A typical human is *just* `class:human`;
+  it adds nothing; it **is** its archetype. Only the *surprising* entity
+  (Einstein, not a generic person) pays description-bits.
+
+So a typical entity **stores nothing** — it inherits everything. The price of
+the entire spine is paid **once, by the frozen ontology** (DOLCE/FIBO/GoBD/OGIT),
+shared by all 113M leaves, and amortizes to nothing per entity. At the margin
+each entity pays only its *surprise*, and the surprise of the typical is zero —
+the mode of the distribution encodes to the **empty card**, which is the common
+case. **Absence is not missing data; absence IS the inheritance.** The deck
+carries the weight; the leaf is empty; the emptiness is the compression. That is
+the endgame: the world-spine *at the price of inherited nothingness.*
+
+---
+
+## The on-ramp: a cookbook (the value side)
+
+A recipe card carries only its deltas from an inherited template:
+
+```text
+inherited (ZERO bits in the card — it is the deck / the path):
+  region   → available ingredients, fat medium, staple   (Italian → olive oil, pasta)
+  season   → what is fresh                                (autumn → squash, mushroom)
+  persona  → diet, heat tolerance, skill                  (vegan, mild)
+
+the card itself (the deltas — the only bits it stores):
+  texture 2b (crisp/soft/chewy/creamy)   sweet 2b   sour 2b (none/lemon/vinegar/ferment)
+  salty 2b   veg-axis 2b (mixed/salad/mushroom/Asian)
+  ──────────────────────────────────────────────────────────────────────────
+  ~10 free bits → a 16-bit card
+```
+
+`recipe = (inherited class path) + (8–16 delta bits)`. The 16-bit card is
+meaningless alone (*"medium-sour, crisp, mushroom"*) until resolved against
+`Italian × autumn × vegan` → reconstructs the full dish. **The box holds the
+schema; the card holds 16 bits of flavor-coordinate.**
+
+**Honest boundary (where 16 bits stops being truthful):** the delta carries the
+*compressible profile* (dish type / flavor) because region×season×persona already
+constrains it. It does NOT carry irreducible specifics — exact quantities, a
+novel signature step, or a fusion dish outside any cohort. Those are *new
+information* → a wider delta or a fork, never a 2-bit axis. This is the
+**generator-vs-derivable split**: profile derives from the template; specifics
+are stored values.
+
+---
+
+## The unification: key and value are the same trick
+
+We spent the design compressing the **address (key)**; the cookbook proves the
+**content (value)** compresses identically — same delta-over-archetype, same
+I/P/B-frame model:
+
+| | KEY side (address) | VALUE side (cookbook / entity content) |
+|---|---|---|
+| keyframe (I) | frozen ontology radix trie | the archetype (region×season×persona template) |
+| delta (P) | appended entity offset | the 8–16-bit flavor/property delta card |
+| reconstruct | path → entity identity | template ⊗ delta → full content |
+| floor | 27 bits (entropy of 113M) | residual surprise given the deck |
+
+So a row is `[ key-delta-over-frozen-path | value-delta-over-archetype ]` — tiny
+both ways, reconstructed against frozen decks held once in OGIT.
+
+---
+
+## The addressing chain (the key side, end-to-end)
+
+Derived across the session; each step grounded in a shipped primitive.
+
+### 1. Partition-as-address, schema-as-deck (the Quartettkarten move)
+The address is **location, not a stored column.** A card doesn't carry
+"category=Auto"; it's *in the Auto box*. Shard the spine into a 256-ary tree by
+nibble-pairs (the OWL/DOLCE `subClassOf` path); *which leaf a row lives in*
+encodes the upper bits — stored **once in the directory + OGIT lookup**, never
+per-row. Schema (fields/labels/DOLCE) lives in the deck (`ClassView`/`FieldMask`,
+resolve-not-store, #441). The card is **pure values + presence mask: zero address
+bits, zero schema bits, zero label bits.**
+
+### 2. The 27-bit truthful floor, with a ~0-bit row
+113M entities → ⌈log₂⌉ = **27 bits** of irreducible identity entropy (Wikidata
+QIDs already run to ~Q130M ≈ 2²⁷ — the QID is a near-optimal flat address;
+classes CANNOT make *identity* cheaper). The win is that partition-as-address
+makes the 27 bits **free per-row** — `address = (path << offset_bits) |
+row_index`, the path in the directory, the offset implicit in file position:
+```text
+  /0xA7/0x3C/leaf.lance      ← 16 path bits (4 nibbles), held by the directory
+     row 0..1724             ← 11 offset bits, implicit (position)
+  = 27-bit address, ~0 address bits stored in the row
+```
+
+### 3. Sparse radix range-delegation (don't build 256⁴ files)
+256⁴ = 4.3B virtual addresses; 113M occupied = **2.6% full**. Never materialize
+the empty 97%. The "range register" is a **path-compressed radix/Patricia trie**:
+`entry = nibble-range → {Empty | Leaf(file) | Delegate(sub-table)}`. A sparse
+DOLCE branch = one `Leaf`; a dense branch (40M scholarly articles) = `Delegate` →
+sub-table → many leaves; single-child chains collapse. The register = the
+**occupied branch points** (≈ the OWL/DOLCE class count, KB–MB), not 4.3B files.
+Skew is absorbed by 38× headroom: cohort ≠ class (giant class → many cohorts,
+tiny class → one sparse cohort; sparse cohorts cost nothing — address space is
+free, only *resident* memory costs).
+
+### 4. The frozen ISA — no rebalance
+The upper ontology (DOLCE/FIBO/GoBD/OGIT + the nibble→class lookup) is a
+**compiled constant** — standardized precisely so it can be frozen; zero runtime
+churn. Leaves are **append-only** (new entity → new offset; append ≠ move). So
+`address = [frozen-path | append-only-offset]` is stable on both halves — a
+**compiled perfect hash, not a runtime hash table** → the rebalancer is *deleted*,
+not built. A schema bump (DOLCE v1→v2) is a **version-gated, one-time, amortized**
+global upgrade carrying an ontology-version byte (the existing
+`I-LEGACY-API-FEATURE-GATED` iron rule). The only residual "move" is an
+individual *reclassification* — a one-row data correction via the QID↔address
+map, a rounding error on 113M.
+
+---
+
+## The frame model (x264/265 — the capstone)
+
+The cold floor IS a keyframe/delta store — and that is **Lance's native
+fragment-versioning**, not new machinery:
+
+| video | spine |
+|---|---|
+| **I-frame** | frozen radix trie + compacted Lance base fragment (self-decodable, exact, rare) |
+| **P-frame** | appended entities + CLAM-clustered new arrivals + corrections (cheap, references the keyframe, useless alone) |
+| **B-frame** | the RISC compose-cache — multi-hop derived paths, references multiple bases, evictable |
+| **GOP** | keyframe + accumulated deltas, periodically re-baselined by **compaction** |
+
+This **resolves the frozen-vs-adaptive tension**: CLAM is adaptive *inside the
+delta* (it clusters new arrivals, *proposes* placement as a P-frame); the
+keyframe never moves; **compaction = re-emit a fresh keyframe = the amortized
+schema upgrade**, the one deliberate version-gated moment where validated
+similarity FREEZES into structure. Tradeoff = **read amplification** (resolve =
+keyframe + N deltas overlay, the LSM/video-seek cost), bounded by GOP length
+(compaction frequency) — a dial, not a flaw. Deltas are *exact* (a P-frame is
+lossless); CLAM similarity *decides* the delta, is never stored *as* the address.
+
+---
+
+## RISC: compose, don't materialize (the edge side)
+
+Storing "every human related to every other" = 113M² ≈ 10¹⁶ edges (catastrophe).
+RISC move: **store the generators, compute the closure.** Store parent/child/
+spouse edges (~N); derive "related to Y in ≤7 hops" on demand via
+`bgz-tensor::ComposeTable` (each hop = a u8 table lookup) / blasgraph `mxm`
+matrix-power. Six-degrees ⇒ the closure is ≤7 cached hops, not a stored edge and
+not a walk.
+
+**This dissolves the hub problem:** *United States*, *human*, *Earth* never store
+their millions of inbound back-edges — they're *reached* by composing forward
+generators. Hubs were only a problem if you imagined materializing them.
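The store-generators, compute-closure move can be illustrated with a bounded search over the stored edges. This is a sketch under stated assumptions: it uses a plain breadth-first search as a stand-in and is not the `ComposeTable` lookup or the blasgraph `mxm` matrix-power named above. `Generators`, `add_symmetric`, and `hops_within` are hypothetical names introduced here.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Stored generator edges only (e.g. parent/child/spouse): roughly N entries,
/// never the N^2 closure.
pub type Generators = HashMap<u32, Vec<u32>>;

/// Store one generator edge in both directions (the only thing persisted).
pub fn add_symmetric(generators: &mut Generators, a: u32, b: u32) {
    generators.entry(a).or_default().push(b);
    generators.entry(b).or_default().push(a);
}

/// Derive "related within `max_hops`" on demand instead of storing the closure.
/// Returns the hop count if `to` is reachable from `from` within the bound.
pub fn hops_within(generators: &Generators, from: u32, to: u32, max_hops: u32) -> Option<u32> {
    if from == to {
        return Some(0);
    }
    let mut seen = HashSet::from([from]);
    let mut frontier = VecDeque::from([(from, 0u32)]);
    while let Some((node, dist)) = frontier.pop_front() {
        if dist == max_hops {
            continue; // hop budget exhausted on this branch
        }
        if let Some(neighbours) = generators.get(&node) {
            for &next in neighbours {
                if next == to {
                    return Some(dist + 1);
                }
                if seen.insert(next) {
                    frontier.push_back((next, dist + 1));
                }
            }
        }
    }
    None
}
```

A hub in this sketch holds only its own generator edges; anything "related to" it is found by composing from the other side within the hop bound. Caching the results of `hops_within` would correspond to the evictable compose-cache, which is left out here.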
+
+- **generators = `continuant` = permanent/cold** (the DOLCE 1-bit);
+- **composed multi-hop paths = `occurrent` = temporary/evictable KV** (the
+  B-frame compose-cache). **One eviction policy, derived from the ontology.**
+- New design surface: a **per-predicate composability flag** (~12k predicates) —
+  "generator (store)" vs "derivable (compose)". Non-composable facts
+  (`birth_date`, `population`) are irreducible values, always stored.
+
+---
+
+## The scale identities (why the numbers all rhyme)
+
+Everything lands on the same powers of two:
+
+```text
+  6-bit cohort      = 64            the immediate WitnessTable cohort        [in code]
+ 16-bit book        = 65,536 SPO    one book/corpus (Bible ~32k = half;      [proposed]
+                                    novel ~4-5k ≈ ~4096 SPO mailboxes)
+ 18-bit hot envelope= 262,144       the CONCURRENT mailbox working set:      [in code: 64K–256K]
+                                    both books resident + a Wikidata window
+ 32-bit world       = 4.3 B         the COLD spine (Wikidata ~115M, lazy)    [in code: mailbox_ref]
+```
+- **Reasoning = traversing the `CausalEdge64` W-slot → `EpisodicWitness64` arc +
+  SPO** — a native graph walk, no fingerprint bundling (there is **no VSA** in
+  this design). `EpisodicWitness64` is **the new AriGraph, migrated INTO the SoA
+  per-ractor-mailbox** (cohort-local episodic memory as a SoA column, not an
+  external graph) — it generalises the shipped 6-bit-W-slot `WitnessTable<64>` /
+  `WitnessEntry`. **[NEW build target — `EpisodicWitness64` is not yet a code
+  symbol; the shipped seed is `WitnessTable<64>`+`WitnessEntry`. The arc is
+  W-slot → witness entry → SPO.]**
+- **Reading a text = accumulating SPO mailboxes + their CE64/EW64 arc** (no
+  embedding, no forward pass); ambiguity resolved by counterfactual testing
+  (`recipe_kernels`: `world ⊗ factual ⊗ counterfactual`, divergence = popcount,
+  scenario-only channel).
+- **Address vs hot set:** Wikidata is 32-bit-*addressed* (cold), never resident;
+  256K is the *concurrent* envelope. You foveate the spine, so 256K holds whole
+  corpora + a hydrated Wikidata slice at once — cross-corpus grounded reasoning
+  ("Frodo ↔ biblical archetype, grounded in Wikidata") fits in one hot context
+  *because* the spine stays cold. **Bounded hot context, unbounded cold spine.**
+- The card (8–16 bit delta), the row (~4096 bit), the offset (11 bit), the book
+  (16 bit), the address (27–32 bit) are all the same shape: **small delta over a
+  frozen inherited archetype.**
+
+---
+
+## Two trees — never confuse them (the iron-rule guard)
+
+| | **frozen ontology radix** (addressing) | **CLAM/CHESS manifold tree** (discovery) |
+|---|---|---|
+| fan-out | fixed 256/nibble, compiled ISA | adaptive — radii fit to data density |
+| shape | frozen (DOLCE/FIBO) | data-derived, shifts as entities arrive |
+| role | **the address** (exact, CAM) | **proposes/validates** the partition (offline) + the delta placement |
+| rule | addressing = exact | similarity = discovery-only (faiss-homology iron rule) |
+
+CLAM's adaptive radii are *similarity* — brilliant for **deciding** the partition
+offline, but must NEVER *be* the runtime address (that would reintroduce the
+rebalancing we deleted). **Adaptive proposes (in the delta); frozen ships (the
+keyframe).** `aerial`/splat are the same discovery layer; `palette256`/CAM-PQ is
+the leaf code (the card's compressed value row).
+
+---
+
+## What's built vs new vs conjecture
+
+- **SHIPPED primitives:** `NiblePath` (#442), `FieldMask`/`ClassView` (#441),
+  `CausalEdge64` + `WitnessTable<64>`/`WitnessEntry` (the shipped seed of the NEW
+  `EpisodicWitness64` = AriGraph-in-SoA), `ComposeTable` + blasgraph
+  `mxm`, `CLAM` tree, Lance fragment-versioning, `aerial` proposer (#438/#443),
+  OGIT/DOLCE cache + DOLCE-from-cache.
+- **NEW (the synthesis / design surface):** the sparse radix range-delegation
+  register; the delta-card value model; the per-predicate composability flag +
+  RISC compose-cache; the `NiblePath`-keyed tiered hydration manager (the one
+  missing runtime piece); the I/P/B-frame mapping onto Lance fragments.
+- **CONJECTURE (each with a probe, below).**
+- **Gate:** D-ARM-7 (Jirak floor, `jc::jirak`) before any hydrated rule writes a
+  live store.
+
+## Probes (the falsifiers — measure before freezing)
+
+1. **Partition locality** — `jc/examples/splat_louvain_modularity.rs` (Louvain
+   modularity) + CLAM on the real P279+edge graph (e.g. the biology subtree).
+   Pass = high modularity ⇒ ~90% local edges ⇒ 16-bit references + the family
+   frontier are real. Also yields the natural fan-out (sizes the 4/12/16 split)
+   and which hubs to compose-not-store. `clam.rs` itself says CLAM-radii-coincide-
+   with-ontology-boundaries is "a TEST, not a fact."
+2. **Delta-card truthfulness** — reconstruct content from N delta bits vs ground
+   truth; histogram the residual per cohort. Low residual ⇒ the cohort is real &
+   8–16 bits suffice; high residual ⇒ wrong cohort or genuinely novel (needs a
+   wider delta / fork). This is the entropy the card actually needs.
+3. **Compose vs materialize** — measure the ≤7-hop reachability hit-rate +
+   compose-cache eviction churn against a stored-edge baseline; confirms the
+   N²-avoidance holds and sets the GOP/compaction cadence.
+
+## Cross-references
+`agnostic-lazy-world-spine.md` (tiered substrate), `wikidata-hhtl-load.md`
+(120→38GB structural compression), `owl-dolce-hhtl-compartments-aerial-fed.md`
+(domain compartments), `splat-codebook-aerial-wikidata-compression.md`
+(splat→aerial seam). Primitives: `contract::{hhtl::NiblePath, class_view,
+witness_table, splat}`, `causal-edge::CausalEdge64`, `bgz-tensor::{attention::
+ComposeTable, hhtl_cache::RouteAction}`, `lance-graph::graph::neighborhood::clam`,
+`crates/jc` (Louvain example, Jirak floor). Iron rules: `I-VSA-IDENTITIES`,
+`I-NOISE-FLOOR-JIRAK`, `I-LEGACY-API-FEATURE-GATED`; `cognitive-risc-classes.md`
+N4. `CLAUDE.md` The Click (free-energy = prior + prediction-error).
diff --git a/.claude/plans/wikidata-lazy-spine-hydration-v1.md b/.claude/plans/wikidata-lazy-spine-hydration-v1.md
new file mode 100644
index 00000000..92f716c2
--- /dev/null
+++ b/.claude/plans/wikidata-lazy-spine-hydration-v1.md
@@ -0,0 +1,757 @@
+
+
+# IMPLEMENTATION PLAN: Wikidata lazy-spine hydration v1 — the NiblePath-keyed tiered hydration manager + its addressing layer
+
+> **Status: QUEUED (all D-ids).** This is the implementation plan for the ONE
+> missing runtime piece named in `delta-card-addressing-integration-map.md` and
+> `agnostic-lazy-world-spine.md`: the **`NiblePath`-keyed tiered hydration
+> manager**, plus the **sparse radix range-delegation register** it rides on,
+> the **I/P/B frame model over Lance versioning**, the **RISC compose-cache**,
+> and the **delta-card value model**. Every load-bearing primitive it composes is
+> SHIPPED and grepped (see § Verified primitives); the manager itself is NEW.
+>
+> **Authored by:** W1 (autoattended wave). **Companions (the design this plans):**
+> `.claude/knowledge/delta-card-addressing-integration-map.md` (THE design),
+> `.claude/knowledge/agnostic-lazy-world-spine.md` (tiered-substrate framing).
+
+---
+
+## 0. What this plan is, and is NOT
+
+**IS:** the runtime layer that turns the frozen Wikidata-HHTL skeleton
+(`ontology::wikidata_hhtl::WikidataClass`, curated today) + the on-disk
+ontologies (`data/ontologies/*.ttl`) into a **foveated, tiered, address-unified
+substrate** — a tiny resident skeleton with on-demand hydration of cold detail
+keyed by `contract::hhtl::NiblePath`, with eviction driven by the DOLCE
+continuant/occurrent bit.
+
+**IS NOT:**
+- A Wikidata loader for the full 115M-entity dump. **There is no dump on disk**
+  (grepped: only `ontology::wikidata_hhtl` curated fixtures + the
+  `wikidata_landing` test). The full load is a deferred terminal D-id
+  (D-LWS-9), explicitly gated behind the probes. Every earlier D-id is
+  validatable on the **real on-disk ontologies** (`data/ontologies/*.ttl`:
+  `dul.ttl`, `fibo-*`, `schemaorg.ttl`, `qudt-*.ttl`, `provo.ttl`, `time.ttl`,
+  `odoo/odoo-core.ttl`, `skos`, `zugferd`) + the 6 curated `WikidataClass`
+  fixtures.
+- A change to the `aerial` proposer's dependency surface. **The firewall holds:**
+  `aerial` (`lance-graph-arm-discovery`) stays the zero-dep proposer; the hub
+  (`lance-graph` + `lance-graph-ontology` + `lance-graph-contract`) owns
+  contract/ontology and the hydration manager. This plan adds NOTHING heavy to
+  `aerial`.
+- A rebalancer. The frozen-ISA addressing (`NiblePath`, append-only offsets)
+  deletes the rebalancer by construction; this plan does not reintroduce one.
+- A replacement for `VersionedGraph` / Lance versioning. The I/P/B frame model
+  RIDES the existing versioning surface; it does not fork it.
+
+---
+
+## 1. Verified primitives (every symbol grepped on this branch before citing)
+
+| Symbol | Path (grepped) | Role in this plan | Label |
+|---|---|---|---|
+| `contract::hhtl::NiblePath` | `lance-graph-contract/src/hhtl.rs:56` (`root`/`child`/`basin`/`parent`/`depth`/`is_ancestor_of`/`packed`/`leaf`/`try_child`/`EMPTY`/`FAN_OUT=16`/`MAX_DEPTH=16`) | THE address key for every tier | **built** (#442) |
+| `contract::class_view::FieldMask` | `class_view.rs:69`; `inherit(delta)` @ 136, `from_positions`/`with`/`has`/`count`, `MAX_FIELDS=64` | delta-over-archetype presence mask (the KEY-side delta-card) | **built** (#441) |
+| `contract::class_view::ClassView` | `class_view.rs` (trait); `ClassId = u16` @ 53; `StructuralSignature` | the deck (resolve-not-store schema) | **built** (#441) |
+| `causal-edge::CausalEdge64` | `crates/causal-edge/src/edge.rs` | the resident-row edge + W-slot Markov pointer | **built** |
+| `contract::witness_table::WitnessTable` | `witness_table.rs:96` (`WitnessTable`, `WitnessEntry` @ 65, `get`/`set`) | the per-cohort W-slot arc (6-bit cohort) | **built** |
+| `contract::soa_view::MailboxSoaView` | `soa_view.rs:28` (trait) + `MailboxSoaOwner` @ 90 | the hot resident carrier (read-only `&[T]` borrow) | **built** (#437) |
+| `bgz-tensor::attention::ComposeTable` | `attention.rs:49`; `compose(a,b)` @ 206, `compose_chain(a,b,c)` @ 215, `build` | per-hop u8 compose for the RISC closure | **built** |
+| `bgz-tensor::hhtl_cache::RouteAction` | `hhtl_cache.rs:37` (`Skip`/`Attend`/`Compose`/`Escalate`); `HhtlCache::route(a,b)` @ 200; `HipCache` alias @ 510 | the foveated-prefetch decision cascade | **built** |
+| `lance-graph::graph::neighborhood::clam` | `clam.rs`; `measure_cluster_radii` @ 74, `analyze_pareto_convergence`, `ParetoAnalysis`, `RadiusObservation` | the CLAM **radius probe** (NOT a clusterer — see note) | **built (probe only)** |
+| `lance-graph::graph::versioned::VersionedGraph` | `versioned.rs:98`; `at_version(n)`, `version()`, `GraphDiff` @ 70, Merkle seals | the Lance versioning surface the I/P/B frames ride | **built** |
+| `ontology::wikidata_hhtl::WikidataClass` | `wikidata_hhtl.rs:47`; `nibble_path()`/`presence_mask()`/`signature()`/`dcls_triple()`; `curated_wikidata_classes()` @ 144; `WikidataClassView` @ 215 | the frozen skeleton fixtures (keyframe seed) | **built** |
+| `ontology::ttl_parse` | `ttl_parse.rs`; `TtlSource::from_path` @ 74, `parse_ttl_directory` @ 379, `parse_into_proposals` @ 106 | the real on-disk TTL loader (validation substrate) | **built** |
+| `ontology::class_resolver::dolce_id` | `class_resolver.rs:45` (`ENDURANT=0`/`PERDURANT=1`/`QUALITY=2`/`ABSTRACT=3`) | the DOLCE basin + the derived 1-bit eviction key | **built** |
+| `contract::splat::{SplatChannel, AwarenessPlane16K}` | `splat.rs:32`/`splat.rs:88`; `Counterfactual=3` | discovery-layer carrier (offline only; never on the hot path) | **built** |
+| `jc::jirak::prove` | `crates/jc/src/jirak.rs:124`; `pub mod jirak` @ `lib.rs:35` | the Jirak weak-dependence Berry-Esseen proof (the D-ARM-7 engine) | **built (proof)** |
+| `jc/examples/splat_louvain_modularity.rs` | grepped; imports `contract::splat::AwarenessPlane16K`; "Louvain modularity gain reduces to popcount-AND" | probe-1 driver (partition locality) | **built (example)** |
+
+**RISK — symbols I wanted to cite but could NOT verify by grep (flagged, not cited as shipped):**
+- **`EpisodicWitness64`** — cited in BOTH companion docs as a shipped type. **Zero
+  hits in `crates/`.** The actual shipped surface is
+  `WitnessTable` + `WitnessEntry` (the `witness_table.rs` doc *describes*
+  the Markov arc "through episodic-reference vectors" but ships no
+  `EpisodicWitness64` type). This plan cites only `WitnessTable`/`WitnessEntry`
+  and treats `EpisodicWitness64` as a **doc-level alias / CONJECTURE**, never as
+  a shipped API.
+- **Lance *fragment*-versioning** (fragment-level `compact`/`add_columns`) — the
+  integration map names "Lance fragment-versioning" as the I/P/B substrate.
+  Grep shows the repo wires **dataset-level** versioning (`VersionedGraph`,
+  `at_version`, `version()`), NOT Lance fragment APIs (no `FragmentMetadata` /
+  `add_columns` / `compact` usage in `crates/lance-graph/src/`). Lance *the
+  dependency* supports fragments; this repo does not yet wire them. So the
+  I/P/B-over-fragments mapping is labelled **NEW (must wire Lance fragment APIs)
+  / CONJECTURE** below, riding `VersionedGraph` as the shipped seam.
+
+> **Note on CLAM:** `neighborhood::clam` is a **measurement/probe** module
+> (`measure_cluster_radii`, `analyze_pareto_convergence`) whose own header says
+> *"This is a TEST, not a fact."* It does NOT ship a clustering engine that
+> *produces* a P-frame placement. So every "CLAM-clustered delta" claim below is
+> built on the **probe** (measure radii → decide placement offline), not on a
+> shipped clusterer. The clusterer that consumes the radii is NEW.
+
+---
+
+## 2. 
The D-id index (all Queued) + +| D-id | Title | Builds on (shipped) | Gated by | +|---|---|---|---| +| **D-LWS-1** | Sparse radix range-delegation register | `NiblePath`, `WikidataClass::nibble_path`, `ttl_parse` | Probe 1 (locality) sizes fan-out | +| **D-LWS-2** | Delta-card value model (`deck ⊗ delta`) | `FieldMask::inherit`, `ClassView`, `WikidataClass::presence_mask` | Probe 2 (residual) | +| **D-LWS-3** | RISC compose-cache + per-predicate composability flag | `ComposeTable::{compose,compose_chain}`, blasgraph `mxm` | Probe 3 (compose hit-rate) | +| **D-LWS-4** | I/P/B frame model over Lance versioning | `VersionedGraph`, `clam::measure_cluster_radii` | Probe 1 + Probe 3 (GOP cadence) | +| **D-LWS-5** | The `NiblePath`-keyed tiered hydration manager | D-LWS-1..4 + `MailboxSoaView`, `RouteAction`, `dolce_id`, `WitnessTable` | all 3 probes; **D-ARM-7** before any write | +| **D-LWS-6** | Foveated prefetch cascade (RouteAction-driven) | `HhtlCache::route`, `ComposeTable` | Probe 3 | +| **D-LWS-7** | Eviction policy on the DOLCE continuant/occurrent 1-bit | `dolce_id`, D-LWS-5 | — | +| **D-LWS-8** | Probe harness (the 3 falsifiers, on real TTL + fixtures) | `splat_louvain_modularity`, `clam`, `FieldMask` | — (this PRODUCES the gates) | +| **D-LWS-9** | DEFERRED: full Wikidata load (115M) into the spine | all above, all probes PASSED, D-ARM-7 landed | every probe + D-ARM-7 | + +**Sequencing DAG:** +``` + D-LWS-8 (probes) ──────────────────────────────┐ (gates everything) + │ │ + D-LWS-1 ───────┼──► D-LWS-4 ──┐ │ + D-LWS-2 ───────┤ ├──► D-LWS-5 ──► D-LWS-6 │ + D-LWS-3 ───────┘ │ │ │ + │ └──► D-LWS-7 │ + └─────────────────► D-LWS-9 (deferred, all gates) + ▲ + D-ARM-7 (Jirak floor) ─── hard prereq for any WRITE +``` + +--- + +## 3. Hard prerequisites — the gates (state these before any D-id ships behavior) + +Three falsifier probes and one statistical floor gate this whole arc. They are +not optional decoration; they are **kill-switches**. 
A D-id may be *built* +(types compile, fixtures pass) without its gate, but it MUST NOT graduate from +fixture to behavior-on-real-data until its gate is green. + +### Gate P1 — Partition locality (CONJECTURE → must measure) +- **Driver:** `jc/examples/splat_louvain_modularity.rs` (Louvain modularity = + popcount-AND over `contract::splat::AwarenessPlane16K` planes) + + `neighborhood::clam::measure_cluster_radii` on the real P279/subClassOf + + edge graph derived from `data/ontologies/*.ttl` (e.g. the FIBO or + schema.org subtree; biology subtree once Wikidata lands). +- **Pass:** high modularity ⇒ ≥~90% of edges are intra-cohort ⇒ 16-bit + intra-cohort references + the family frontier are real, and the natural + fan-out (the 4/12/16 split) is observed, not assumed. +- **Gates:** D-LWS-1 fan-out choice; D-LWS-4 GOP P-frame placement; D-LWS-5 + cohort residency. +- **Honest status:** `clam.rs` header literally says the radii-coincide-with- + ontology-boundaries claim "is a TEST, not a fact." Treat as **CONJECTURE**. + +### Gate P2 — Delta-card truthfulness (CONJECTURE → must measure) +- **Driver:** D-LWS-8 reconstructs content from N delta bits + (`FieldMask`/value delta over the inherited `WikidataClass` archetype) vs + ground truth; histograms the residual per cohort. +- **Pass:** low residual ⇒ the cohort is real and 8–16 delta bits suffice; + high residual ⇒ wrong cohort or genuinely novel entity (needs a wider delta + or a fork — never a 2-bit axis). +- **Gates:** D-LWS-2 (the value model only ships its bit-width claim once the + residual histogram backs it). + +### Gate P3 — Compose vs materialize (CONJECTURE → must measure) +- **Driver:** D-LWS-8 measures the ≤7-hop reachability hit-rate + + compose-cache eviction churn (via `ComposeTable::compose_chain` / blasgraph + `mxm`) against a stored-edge baseline. +- **Pass:** the N²-avoidance holds (closure is ≤7 cached hops, not a stored + edge), and the churn sets the GOP/compaction cadence. 
+- **Gates:** D-LWS-3 (compose-cache); D-LWS-4 (GOP cadence); D-LWS-6 (prefetch + cascade Compose arm). + +### Gate D-ARM-7 — the Jirak floor (HARD PREREQUISITE for any live write) +- **Status (grepped):** `STATUS_BOARD.md` D-ARM-7 row = **"Queued — HARD + PREREQUISITE"**; ISSUE `ARM-JIRAK-FLOOR` = **OPEN**. The engine + `jc::jirak::prove` exists (Jirak-Cartan Pillar 5, weak-dependence + Berry-Esseen rate `n^(p/2-1)`); the *gate function* (rule → significant?) + that derives a threshold from it does NOT yet exist. +- **Rule:** **No hydrated rule, discovered edge, or proposed reclassification + may be written to a live store (`SpoStore`, `VersionedGraph`, or any P-frame + delta that persists) until D-ARM-7 lands and the candidate passes the Jirak + weak-dependence significance floor BEFORE the classical `min_support`/ + `min_confidence` gate.** This binds D-LWS-5 (any persist), D-LWS-3 (any + derived edge promoted to a generator), and D-LWS-9 (the full load). Cites + `I-NOISE-FLOOR-JIRAK`. +- **Read-only is exempt:** hydrating cold rows into the hot SoA for *reading* + is not a write and is not gated by D-ARM-7. Only mutation of the persistent + substrate is. + +--- + +## D-LWS-1 — Sparse radix range-delegation register + +**Status: Queued. Label: NEW (composes shipped `NiblePath`).** + +### Scope +A **path-compressed radix/Patricia trie over the frozen ontology**, holding +**occupied branch points only** — the "range register" of the integration map +§3. Each entry is `nibble-range → {Empty | Leaf(file_or_arena) | Delegate(sub-table)}`: +- a sparse DOLCE branch collapses to one `Leaf`; +- a dense branch (the future 40M scholarly-articles cohort) becomes a + `Delegate` → sub-table → many leaves; +- single-child chains collapse (no branch = no information = nothing stored). + +The register's size ≈ the occupied branch count (≈ the OWL/DOLCE class count, +KB–MB), **never** the 256⁴ = 4.3B virtual address space. 
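The entry shape described above can be sketched as a small enum. This is a minimal illustration under stated assumptions, not the planned `contract::radix_register` API: `Entry`, `Register`, `lookup`, and `occupied` are hypothetical names, addresses are modelled as plain nibble slices (`&[u8]`) instead of the real `NiblePath`, ranges are assumed disjoint, and a `BTreeMap` stands in for the path-compressed trie.

```rust
use std::collections::BTreeMap;

/// Hypothetical register entry: only occupied ranges are ever stored.
#[derive(Debug, Clone, PartialEq)]
enum Entry {
    /// A sparse branch collapsed to one storage location (file or arena id).
    Leaf(u32),
    /// A dense branch that hands the remaining nibbles to a sub-table.
    Delegate(Register),
}

/// Hypothetical register: compressed nibble prefix -> entry.
#[derive(Debug, Clone, PartialEq, Default)]
struct Register {
    ranges: BTreeMap<Vec<u8>, Entry>,
}

impl Register {
    fn insert(&mut self, prefix: &[u8], entry: Entry) {
        self.ranges.insert(prefix.to_vec(), entry);
    }

    /// Resolve a nibble path to a leaf id; `None` models the `Empty` case.
    fn lookup(&self, path: &[u8]) -> Option<u32> {
        for (prefix, entry) in &self.ranges {
            if path.starts_with(prefix) {
                return match entry {
                    Entry::Leaf(id) => Some(*id),
                    Entry::Delegate(sub) => sub.lookup(&path[prefix.len()..]),
                };
            }
        }
        None
    }

    /// Number of stored entries, counted recursively through sub-tables.
    fn occupied(&self) -> usize {
        self.ranges
            .values()
            .map(|e| match e {
                Entry::Leaf(_) => 1,
                Entry::Delegate(sub) => 1 + sub.occupied(),
            })
            .sum()
    }
}

/// Builds a toy register with one sparse branch and one dense branch.
fn demo_register() -> Register {
    let mut dense = Register::default();
    dense.insert(&[0x0], Entry::Leaf(10));
    dense.insert(&[0x1], Entry::Leaf(11));

    let mut root = Register::default();
    // Sparse branch: the single-child chain 0 -> 3 -> 7 is one stored entry.
    root.insert(&[0x0, 0x3, 0x7], Entry::Leaf(1));
    // Dense branch: two leaves share the range, so it delegates.
    root.insert(&[0x1, 0x2], Entry::Delegate(dense));
    root
}
```

In this toy, the stored size tracks the four occupied entries regardless of how large the virtual nibble space is, which is the property the scope paragraph claims for the real register.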
+ +**It reuses `NiblePath` as the address — it does NOT invent a new key.** A +register lookup walks `NiblePath` nibble by nibble (`child`/`try_child`), +matching compressed ranges; `is_ancestor_of` decides delegation containment; +`basin()` extracts the DOLCE root nibble; `packed()` yields the `(u64, u8)` the +directory stores. + +### The shipped primitive it builds on +- `contract::hhtl::NiblePath` — the entire address algebra (`root`, `child`, + `try_child`, `basin`, `parent`, `depth`, `is_ancestor_of`, `packed`, + `FAN_OUT=16`, `MAX_DEPTH=16`). **The register stores ranges of NiblePaths; + it never re-encodes identity.** +- `ontology::wikidata_hhtl::WikidataClass::nibble_path()` — the seed: every + curated class already emits its `NiblePath` from `dolce_id` + subclass path. + D-LWS-1's register is the inverse index over exactly these paths. +- `ontology::ttl_parse::{parse_ttl_directory, parse_into_proposals}` — the + occupied branch points for the *first* register are the classes parsed from + `data/ontologies/*.ttl` (FIBO/DUL/schema.org/QUDT), NOT a Wikidata dump. + +### Firewall / honesty +- Lives in the **hub** (`lance-graph-contract` for the type if zero-dep clean, + else `lance-graph-ontology`). Proposed home: `contract::hhtl` sibling module + `contract::radix_register` (zero-dep: it is pure `NiblePath` + ranges + a + `Vec`-backed trie; no Lance, no Arrow). **Verify zero-dep before placing in + contract;** if it needs ontology types, place in `lance-graph-ontology`. +- `aerial` is NOT touched. The register is an addressing structure the hub + owns; the proposer never sees it. +- **Honest substrate:** built and tested on the on-disk TTL classes + the 6 + `curated_wikidata_classes()` fixtures. The 38× headroom / 2.6%-full / + 4.3B-virtual numbers are **DESIGN TARGETS**, asserted on fixtures, not + measured on 115M (that is D-LWS-9). 
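The quoted design targets can be checked with plain arithmetic. The sketch below is only a sanity check: the function names are illustrative, and the entity count is a parameter because the plan quotes both ~113M (in the 113M² estimate of D-LWS-3) and 115M (D-LWS-9).

```rust
/// Virtual address space of a 4-byte (256^4) address.
const VIRTUAL_SLOTS: u64 = 256u64.pow(4);

/// Headroom: virtual slots available per stored entity.
fn headroom(entities: u64) -> f64 {
    VIRTUAL_SLOTS as f64 / entities as f64
}

/// Occupancy: share of the virtual space that is populated, in percent.
fn occupancy_percent(entities: u64) -> f64 {
    100.0 * entities as f64 / VIRTUAL_SLOTS as f64
}
```

With 113M entities this gives roughly 38× headroom and about 2.6% occupancy, matching the quoted targets; with 115M the same formulas give roughly 37.3× and about 2.7%, so the targets move slightly with the count used.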
+ +### Which probe / gate +- **Gate P1** sizes the fan-out: the register's branching factor (4/12/16 + split, or the frozen 16-way `NiblePath` default) is a frozen-ISA choice that + P1's Louvain/CLAM measurement must back before it is frozen append-only. Until + P1 is green, D-LWS-1 ships the **16-way `NiblePath`-native** register (the + conservative, already-frozen choice) and leaves the re-parameterization + (256⁴ byte-aligned) as a documented CONJECTURE. + +### Acceptance (fixture-level) +- Round-trip: every `curated_wikidata_classes()` path inserts, looks up, and + the register reconstructs the exact `NiblePath` (CAM-exact, no similarity). +- Path compression: a single-child chain (person → human) stores ONE branch + point, not two (assert occupied-branch count < path count). +- Delegation: a synthetic dense cohort (≥2 leaves under one nibble range) + produces a `Delegate`, a sparse one a `Leaf`. +- Empty-space proof: the 97% unoccupied virtual space materializes zero + entries (assert register size ≈ occupied count, not fan-out^depth). + +--- + +## D-LWS-2 — Delta-card value model (`reconstruct = deck ⊗ delta`) + +**Status: Queued. Label: NEW (composes shipped `FieldMask::inherit` + `ClassView`).** + +### Scope +The VALUE side of the one idea: **a card stores the surprise; the deck stores +the expectation.** An entity's stored content is a **small delta over the +inherited frozen archetype** (its class deck). Reconstruct = `deck ⊗ delta`. +This D-id ships: +1. A `DeltaCard` type = `{ class_path: NiblePath, presence_delta: FieldMask, + value_bits: }` — the per-entity surprise, nothing else. The modal + member of a class is the **empty card** (`FieldMask::EMPTY` delta, zero value + bits): it *is* its archetype, stores nothing. +2. A `reconstruct(deck: &ClassView, card: &DeltaCard) -> ResolvedEntity` that + overlays the card onto the deck. 
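The two deliverables above can be sketched as follows. This is a hedged illustration, not the shipped types: `Mask` is a stand-in for `FieldMask`, `inherit` is modelled as a bitwise union (an assumption), `Deck` stands in for a `ClassView`, the class path is a plain nibble vector, and the `Vec<u8>` value payload is a placeholder because the plan leaves the value-bit width as a measured parameter. Returning `Option` to reject a card resolved against the wrong deck is also a choice made for this sketch.

```rust
/// Stand-in for `FieldMask`; the real type lives in `contract::class_view`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Mask(u64);

impl Mask {
    const EMPTY: Mask = Mask(0);

    /// ASSUMPTION: inheritance is modelled as a union of presence bits.
    fn inherit(self, delta: Mask) -> Mask {
        Mask(self.0 | delta.0)
    }
}

/// Hypothetical deck: the frozen archetype (expectation) of one class.
struct Deck {
    class_path: Vec<u8>,
    archetype: Mask,
}

/// Hypothetical card: only the per-entity surprise.
struct DeltaCard {
    class_path: Vec<u8>,
    presence_delta: Mask,
    value_bits: Vec<u8>, // placeholder payload, width not fixed by the plan
}

impl DeltaCard {
    /// The modal member of a class: stores nothing beyond its class path.
    fn empty(class_path: &[u8]) -> Self {
        DeltaCard {
            class_path: class_path.to_vec(),
            presence_delta: Mask::EMPTY,
            value_bits: Vec::new(),
        }
    }
}

#[derive(Debug, PartialEq)]
struct ResolvedEntity {
    class_path: Vec<u8>,
    presence: Mask,
    values: Vec<u8>,
}

/// Overlay the card onto the deck; `None` if the card belongs to another class.
fn reconstruct(deck: &Deck, card: &DeltaCard) -> Option<ResolvedEntity> {
    if deck.class_path != card.class_path {
        return None;
    }
    Some(ResolvedEntity {
        class_path: deck.class_path.clone(),
        presence: deck.archetype.inherit(card.presence_delta),
        values: card.value_bits.clone(),
    })
}
```

An empty card reconstructs to exactly the archetype, and a card carrying one extra presence bit reconstructs to the archetype plus that bit, which mirrors the two fixture-level cases the plan lists for this D-id.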
+ +### The shipped primitive it builds on +- `contract::class_view::FieldMask::inherit(delta)` (verified @ `class_view.rs:136`) + — **this IS the `deck ⊗ delta` operator for the presence half.** The archetype's + mask `inherit`s the card's delta mask. The KEY-side (#442 `wikidata_landing` + already proved "human ⊂ person inherits path + mask-as-delta"); D-LWS-2 + generalizes it to the VALUE side. +- `contract::class_view::ClassView` (trait) + `ClassId = u16` + + `StructuralSignature` — the deck. Resolve-not-store: the deck holds + fields/labels/DOLCE; the card holds neither (zero schema bits, zero label bits). +- `ontology::wikidata_hhtl::WikidataClass::{presence_mask, signature, dcls_triple}` + — the fixture decks. `dcls_triple()` already returns the + `(ClassId, StructuralSignature, FieldMask)` triple a card resolves against. + +### The honest boundary (carry this verbatim from the integration map) +The delta carries the **compressible profile** (the inherited-archetype +deviation), NOT irreducible specifics. Non-composable, irreducible facts +(`birth_date`, `population`, a novel signature step) are **stored values**, never +a 2-bit axis — this is the **generator-vs-derivable split** (shared with +D-LWS-3). A fusion entity outside any cohort = a wider delta or a fork. + +### Firewall / honesty +- Lives in the hub. The `DeltaCard` type is a candidate for + `lance-graph-contract` (zero-dep: `NiblePath` + `FieldMask` + a small value + payload). **Verify zero-dep;** the `reconstruct` against a live `ClassView` + may belong in `lance-graph-ontology`. +- `aerial` untouched. (`aerial` *proposes* which cohort a row joins via splat, + offline — D-LWS-2 only *reconstructs* given a chosen deck. The proposer's + similarity never enters the value model.) + +### Which probe / gate +- **Gate P2 (delta-card truthfulness)** is THIS D-id's falsifier. 
The bit-width + claim ("8–16 delta bits suffice") ships only once D-LWS-8's per-cohort + residual histogram is low on the real fixtures. Until then D-LWS-2 ships the + *mechanism* (`reconstruct`) with the bit-width left as a measured parameter, + NOT a hardcoded constant. +- **Free-energy framing (CLAUDE.md The Click):** the card's bit-width IS the + residual surprise `F = (1−likelihood) + kl`; the archetype is the prior, the + delta is the prediction error. Stated as design rationale, not a code claim. + +### Acceptance (fixture-level) +- The modal member of each `curated_wikidata_classes()` cohort reconstructs from + an EMPTY card (zero delta bits) — "absence IS the inheritance." +- A surprising member (e.g. a class with an extra presence bit vs its parent) + reconstructs from a card carrying exactly that one `FieldMask` bit, verified + via `FieldMask::inherit`. +- Round-trip exactness: `reconstruct(deck, encode(entity)) == entity` for the + presence half (CAM-exact; the value-bit half is exact up to the P2-measured + width). + +--- + +## D-LWS-3 — RISC compose-cache + per-predicate composability flag + +**Status: Queued. Label: NEW (composes shipped `ComposeTable` + blasgraph `mxm`).** + +### Scope +**Store the generators, compute the closure.** Storing "every entity related to +every other" = 113M² ≈ 10¹⁶ edges (catastrophe). Instead store +parent/child/spouse generators (~N) and **derive** "related to Y in ≤7 hops" on +demand. This D-id ships: +1. A **per-predicate composability flag** (~12k Wikidata predicates, but + seeded on the on-disk ontology predicates first): each predicate is + `Generator(store)` or `Derivable(compose)`. Non-composable facts + (`birth_date`, `population`) are `Generator` always (irreducible values). +2. A **compose-cache**: derived multi-hop edges computed via + `ComposeTable::compose_chain` (each hop = a u8 table lookup) / blasgraph + `mxm` matrix-power, cached as evictable B-frame entries (≤7 hops). 
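The flag and the cache above can be sketched together. Everything here is a toy under stated assumptions: `ToyComposeTable` is not `bgz-tensor`'s `ComposeTable` (only the idea of a per-hop u8 lookup is borrowed), the palette is invented (code `c` means "ancestor at distance `c + 1`", saturating), and `ComposeCache`, `flag_of`, and `MAX_HOPS` are hypothetical names.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Composability {
    Generator, // stored, irreducible
    Derivable, // computed on demand from generators
}

/// Toy flag table: only code 0 (the direct parent edge) is stored.
fn flag_of(code: u8) -> Composability {
    if code == 0 { Composability::Generator } else { Composability::Derivable }
}

/// Toy k x k table: cells[a * k + b] is the code of "a then b".
struct ToyComposeTable {
    k: usize,
    cells: Vec<u8>,
}

impl ToyComposeTable {
    fn build(k: usize) -> Self {
        let mut cells = vec![0u8; k * k];
        for a in 0..k {
            for b in 0..k {
                // Distances add: (a + 1) + (b + 1) maps to code a + b + 1.
                cells[a * k + b] = (a + b + 1).min(k - 1) as u8;
            }
        }
        Self { k, cells }
    }

    fn compose(&self, a: u8, b: u8) -> u8 {
        self.cells[a as usize * self.k + b as usize]
    }
}

const MAX_HOPS: usize = 7;

/// Evictable cache of derived multi-hop results.
struct ComposeCache {
    table: ToyComposeTable,
    derived: HashMap<Vec<u8>, u8>,
    computed: usize, // how many closures were actually folded
}

impl ComposeCache {
    fn new(table: ToyComposeTable) -> Self {
        Self { table, derived: HashMap::new(), computed: 0 }
    }

    /// Closure over a path = left fold of `compose`, memoized per path.
    fn resolve(&mut self, hops: &[u8]) -> Option<u8> {
        if hops.is_empty() || hops.len() > MAX_HOPS {
            return None;
        }
        if hops.iter().any(|&h| h as usize >= self.table.k) {
            return None;
        }
        if let Some(&hit) = self.derived.get(hops) {
            return Some(hit);
        }
        let composed = hops[1..]
            .iter()
            .fold(hops[0], |acc, &h| self.table.compose(acc, h));
        self.computed += 1;
        self.derived.insert(hops.to_vec(), composed);
        Some(composed)
    }

    /// Derived entries are temporary and can be dropped at any time.
    fn evict(&mut self) {
        self.derived.clear();
    }
}
```

Three parent hops fold to the "distance 3" code without any stored three-hop edge, and a repeated query is served from the cache; evicting loses nothing because the generators can rebuild the entry.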
+ +### The shipped primitive it builds on +- `bgz-tensor::attention::ComposeTable` (verified @ `attention.rs:49`): + `compose(a, b) -> u8` (one hop), `compose_chain(a, b, c) -> u8` (two hops), + `build(palette)`. **The closure is a fold of `compose` over the path — the + N²-avoidance is literally this table.** +- blasgraph `mxm` (matrix-power semiring multiply in + `lance-graph/src/graph/blasgraph/`) — the bulk alternative for dense + reachability fronts. +- The DOLCE 1-bit (`class_resolver::dolce_id`): **generators = `continuant` = + permanent/cold**; **composed multi-hop paths = `occurrent` = temporary/ + evictable** (shared eviction policy with D-LWS-7). + +### The hub problem dissolves +*United States*, *human*, *Earth* never store their millions of inbound +back-edges — they are **reached** by composing forward generators. Hubs were +only a problem if you imagined materializing them. Stated as design rationale. + +### Firewall / honesty +- `ComposeTable` lives in `bgz-tensor` (standalone, excluded crate, zero-dep). + The compose-cache + composability flag live in the hub + (`lance-graph-ontology` for the predicate flag table; `lance-graph` for the + blasgraph `mxm` driver). `aerial` untouched. +- **Honest substrate:** the predicate flag table is seeded and tested on the + on-disk ontology predicates (FIBO/schema.org/QUDT relations), NOT the 12k + Wikidata predicates. The 12k figure is a DESIGN TARGET for D-LWS-9. + +### Which probe / gate +- **Gate P3 (compose vs materialize)** is THIS D-id's falsifier: the ≤7-hop + hit-rate + eviction churn vs a stored-edge baseline. If the hit-rate is low + (closure NOT reachable in ≤7 hops) the composability flags are wrong, or the + generator set is too sparse — D-LWS-3 does not graduate from fixture to + behavior until P3 is green. +- **D-ARM-7:** if a *derived* edge is ever promoted to a stored generator (a + reclassification of the composability flag), that write passes the Jirak floor + first. 
Read-time composition is exempt. + +### Acceptance (fixture-level) +- A 3-hop derivable relation over the fixture graph reconstructs via + `compose_chain` and equals the stored-edge ground truth. +- A `Generator` predicate (e.g. a fixture `birth_date`) is never composed — the + flag forces a stored lookup. +- Eviction: a composed B-frame entry under an `occurrent` predicate evicts; a + `continuant` generator does not. + +--- + +## D-LWS-4 — I/P/B frame model over Lance versioning + +**Status: Queued. Label: NEW (rides shipped `VersionedGraph`; the fragment-level GOP is CONJECTURE — see RISK).** + +### Scope +The cold floor IS a keyframe/delta store (the x264/265 capstone). Map: + +| video | spine | shipped seam | +|---|---|---| +| **I-frame** | frozen radix trie (D-LWS-1) + compacted base (self-decodable, exact, rare) | `VersionedGraph` base version + D-LWS-1 register | +| **P-frame** | appended entities + CLAM-clustered new arrivals + corrections (cheap, references the keyframe) | a new `VersionedGraph` version (append-only write) | +| **B-frame** | the RISC compose-cache (D-LWS-3) — multi-hop derived, references multiple bases, evictable | in-memory compose-cache, never persisted | +| **GOP** | keyframe + accumulated deltas, periodically re-baselined by **compaction** | a deliberate version-gated re-emit | + +This D-id ships the **frame-classification + overlay-resolve logic**: given a +`NiblePath` + a version, resolve = base I-frame + N P-frame deltas overlaid +(the LSM/video-seek read-amplification, bounded by GOP length). + +### The shipped primitive it builds on +- `lance-graph::graph::versioned::VersionedGraph` (verified @ `versioned.rs:98`): + `at_version(n)` (time-travel = seek to a frame), `version()` (current frame + number), `GraphDiff {from_version, to_version}` (the P-frame delta between two + versions), Merkle seals (`graph_seal_check` — the keyframe integrity check). 
+ **Each write already creates a new Lance version → that IS a P-frame append.** +- `neighborhood::clam::measure_cluster_radii` — CLAM is adaptive *inside the + delta*: it **proposes** placement of new arrivals as a P-frame (offline, + similarity), the keyframe never moves. (Probe only — the clusterer that acts + on the radii is NEW; see §note on CLAM.) +- D-LWS-1 (the frozen radix) is the I-frame's address half; D-LWS-3 (compose- + cache) is the B-frame. + +### The frozen-vs-adaptive tension resolves here +CLAM is adaptive inside the delta (proposes); the keyframe never moves; +**compaction = re-emit a fresh keyframe = the amortized schema upgrade** (the +one deliberate version-gated moment, carrying the ontology-version byte per +`I-LEGACY-API-FEATURE-GATED`). Adaptive proposes (in the delta); frozen ships +(the keyframe). Deltas are *exact* (a P-frame is lossless); CLAM similarity +*decides* the delta, is never stored *as* the address (the two-trees iron-rule +guard: addressing = exact CAM; similarity = discovery-only, +`faiss-homology`/`I-VSA-IDENTITIES`). + +### Firewall / honesty / RISK +- **RISK (carried from §1):** the integration map says "Lance fragment- + versioning." Grep shows this repo wires **dataset-level** `VersionedGraph`, + NOT Lance **fragment** APIs (`add_columns`/`compact`/`FragmentMetadata` — + zero usage in `crates/lance-graph/src/`). Two honest options, both + documented, decision deferred to the integration-lead: + - **(a) Ride dataset versioning (built seam, ships now):** I=base version, + P=append version, GOP-compaction = re-emit a baseline dataset. Coarser + granularity (whole-dataset, not fragment). + - **(b) Wire Lance fragment APIs (NEW, finer GOP):** use Lance's native + `Fragment` + `compact` so a P-frame is a fragment append and GOP-compaction + is fragment compaction (the integration map's literal intent). 
This is a + NEW Lance-binding task, not a shipped seam — labelled CONJECTURE until a + spike proves the Lance version on `Cargo.lock` (`lance =6.0.0`) exposes the + needed fragment surface to this crate. +- `aerial` untouched. The frame model is a hub-side cold-floor concern. + +### Which probe / gate +- **Gate P1** backs P-frame placement (CLAM radii must coincide with cohort + boundaries for "CLAM clusters new arrivals" to be real). +- **Gate P3** sets the **GOP/compaction cadence** (eviction churn → how often to + re-baseline). +- **D-ARM-7:** a P-frame that persists a *discovered* rule/edge passes the Jirak + floor first (a P-frame append is a live write). + +### Acceptance (fixture-level) +- Resolve-by-overlay: an entity whose value lives in a P-frame delta resolves to + `base ⊗ delta` and equals ground truth (riding `at_version` / `GraphDiff` on a + fixture `VersionedGraph`). +- Read-amplification bound: resolving across K P-frames touches exactly K+1 + versions (assert the seek cost = GOP length). +- Compaction: re-emitting a keyframe collapses K P-frames into one base; a + subsequent resolve touches 1 version (assert amplification reset). + +--- + +## D-LWS-5 — The `NiblePath`-keyed tiered hydration manager (THE missing runtime piece) + +**Status: Queued. Label: NEW (the synthesis — composes ALL of D-LWS-1..4 + shipped `MailboxSoaView`/`RouteAction`/`dolce_id`/`WitnessTable`).** + +### Scope +The one missing runtime piece named in both companion docs. It is the +**hot mailbox-SoA ↔ cold Lance** manager, keyed by `NiblePath`: +- **lazy-load** a basin's cold rows on first touch (cold `VersionedGraph` read → + hot `MailboxSoaView` SoA), addressed by `NiblePath`, **NOT** by DataFusion + join (address, not join — the cold path splits in two; the join serves only + business-SQL ground truth, off the HHTL hot path); +- **foveated adjacency prefetch** via the `RouteAction` cascade (D-LWS-6); +- **evict** cold/occurrent arenas on the DOLCE 1-bit (D-LWS-7). 
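The lazy-load and evict responsibilities listed above can be sketched as a small residency manager (prefetch is left out). This is an illustration under assumptions, not the planned manager: `HydrationManager`, `touch`, and `capacity_rows` are hypothetical names, a basin nibble stands in for the full `NiblePath` key, the basin nibble is assumed to equal the `dolce_id` (so basin `1` is `PERDURANT`), the cold floor is a closure instead of a `VersionedGraph` read, and rows are plain `u64`s.

```rust
use std::collections::BTreeMap;

/// `PERDURANT = 1` as listed for `class_resolver::dolce_id`.
const PERDURANT: u8 = 1;

struct HydrationManager<F: Fn(u8) -> Vec<u64>> {
    cold_read: F,                 // stand-in for the cold columnar read
    hot: BTreeMap<u8, Vec<u64>>,  // resident arenas, keyed by basin nibble
    cold_reads: usize,            // how often the cold floor was touched
    capacity_rows: usize,         // the bounded hot envelope
}

impl<F: Fn(u8) -> Vec<u64>> HydrationManager<F> {
    fn new(cold_read: F, capacity_rows: usize) -> Self {
        Self { cold_read, hot: BTreeMap::new(), cold_reads: 0, capacity_rows }
    }

    fn is_hot(&self, basin: u8) -> bool {
        self.hot.contains_key(&basin)
    }

    fn resident_rows(&self) -> usize {
        self.hot.values().map(Vec::len).sum()
    }

    /// First touch hydrates from cold; later touches are hot hits.
    fn touch(&mut self, basin: u8) -> &[u64] {
        if !self.hot.contains_key(&basin) {
            let rows = (self.cold_read)(basin);
            self.cold_reads += 1;
            self.hot.insert(basin, rows);
            self.evict_until_bounded(basin);
        }
        &self.hot[&basin]
    }

    /// Occurrent arenas are evicted first; continuant ones only as a last resort.
    fn evict_until_bounded(&mut self, keep: u8) {
        let mut victims: Vec<u8> =
            self.hot.keys().copied().filter(|&b| b != keep).collect();
        // `false` sorts before `true`, so the PERDURANT basin comes first.
        victims.sort_by_key(|&b| b != PERDURANT);
        for b in victims {
            if self.resident_rows() <= self.capacity_rows {
                break;
            }
            self.hot.remove(&b);
        }
    }
}
```

With a six-row envelope and three rows per basin, touching basins 0, 1, and 2 in turn leaves 0 and 2 resident and evicts the occurrent basin 1, while a repeated touch of basin 0 never goes back to the cold source.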
+ +It is a **manager/coordinator**, not a store: it owns the residency decision +(what is hot), delegates addressing to D-LWS-1, value reconstruction to +D-LWS-2/D-LWS-4, and adjacency to D-LWS-3/D-LWS-6. + +### The shipped primitives it builds on +- `contract::hhtl::NiblePath` — the single allocation key. One O(1) address = + ontology position = memory arena = spatial coord. +- `contract::soa_view::MailboxSoaView` / `MailboxSoaOwner` (verified + `soa_view.rs:28/90`) — the hot resident carrier. The manager hydrates INTO a + `MailboxSoaOwner` and hands out read-only `&[T]` views (E-SOA-VIEW-IS-A-BORROW; + never copies, never caches a label — the SoA stays agnostic forever, core + inv #1 / C2). +- `lance-graph::graph::versioned::VersionedGraph` — the cold floor read + (`at_version`), via D-LWS-4's overlay-resolve. +- `contract::witness_table::WitnessTable<64>` + `WitnessEntry` — the per-cohort + (6-bit) Markov W-slot arc the resident row carries; traversal walks W-refs + backward without dereferencing the full SPO store per hop. **(NOT a 16384-bit + VSA bundle — that is retired legacy, survives only as the discovery carrier.)** +- `causal-edge::CausalEdge64` — the resident-row planner edge whose W-slot + points into the `WitnessTable`. +- `ontology::class_resolver::dolce_id` — the residence policy key (D-LWS-7). + +### The bounded-hot / unbounded-cold invariant +- Wikidata is **32-bit-addressed (cold), never resident**; the hot envelope is + the documented **64K–256K** concurrent mailbox window + (`MailboxSoaView`/`witness_table.rs` envelope). You **foveate** the spine: + 256K holds whole corpora + a hydrated Wikidata slice at once. The manager's + job is to keep the foveal region hot and let the periphery stay cold. +- The widths nest: 6-bit cohort ⊂ 16-bit book ⊂ 18-bit hot envelope (256K) ⊂ + 32-bit world. (The 16-bit book tier is CONJECTURE — see RISK on + `EpisodicWitness64`; the 6-bit cohort `WitnessTable` and 32-bit `mailbox_ref` + ends are in code.) 
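The nested widths translate into capacities by a power of two each. The constants below restate the bit widths from the list above; the names themselves are illustrative and not symbols from the codebase.

```rust
const COHORT_BITS: u32 = 6;
const BOOK_BITS: u32 = 16; // the tier the plan labels CONJECTURE
const HOT_ENVELOPE_BITS: u32 = 18;
const WORLD_BITS: u32 = 32;

/// Number of distinct addresses at a given width.
fn capacity(bits: u32) -> u64 {
    1u64 << bits
}

/// Each tier is strictly contained in the next wider one.
fn widths_nest() -> bool {
    COHORT_BITS < BOOK_BITS
        && BOOK_BITS < HOT_ENVELOPE_BITS
        && HOT_ENVELOPE_BITS < WORLD_BITS
}
```

Read this way, the 6-bit cohort holds 64 entries (consistent with `WitnessTable<64>`), the 64K and 256K ends of the hot envelope are the 16-bit and 18-bit capacities, and the 32-bit world is the same 4.3B virtual space quoted for the register.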
+ +### Firewall / honesty +- The manager lives in the **hub** — proposed home `lance-graph` (it needs + `VersionedGraph` + blasgraph, which are hub-only) with the residency policy + types in `lance-graph-contract` if zero-dep. **`aerial` is NOT a dependency + and is NOT depended upon by this manager.** The proposer feeds *discovery* + (what lands where, offline); the manager does *runtime residency*. Distinct + layers. +- **Honest substrate:** D-LWS-5 is built and tested hydrating the 6 + `curated_wikidata_classes()` fixtures + the on-disk TTL classes from a fixture + `VersionedGraph`. **No 115M load** — that is D-LWS-9, gated on all probes + + D-ARM-7. + +### Which probe / gate +- **All three probes** gate the manager's behavior-on-real-data: + P1 (cohort residency is local), P2 (hydrated cards reconstruct truthfully), + P3 (adjacency composes, doesn't materialize). +- **D-ARM-7 is a HARD PREREQUISITE for any WRITE the manager performs** (any + P-frame persist, any reclassification, any hydrated rule). Read-only + hydration is exempt. The manager MUST refuse to persist a discovered artifact + until D-ARM-7's gate function is wired and passed. + +### Acceptance (fixture-level) +- First-touch hydration: addressing a cold `NiblePath` loads exactly that + basin's rows into the `MailboxSoaOwner`, and a second touch is a hot hit (no + re-read). +- Address-not-join: the hydration path issues a `VersionedGraph` columnar read + keyed by `NiblePath`, NOT a DataFusion join (assert no join on the hot path). +- Agnostic SoA: the hot view exposes only structure + address + the + `CausalEdge64`/`WitnessTable` arc; NO label is ever stored hot (assert the + SoA carries no string). +- Bounded envelope: hydrating > the 256K envelope triggers eviction (D-LWS-7), + never unbounded growth. +- **Write-refusal:** attempting to persist a discovered rule without a passed + Jirak gate returns an error (the D-ARM-7 prerequisite is enforced in code, not + just documented). 
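The write-refusal criterion can be enforced at the type level of the persist call. The sketch below is one possible shape, with every name hypothetical (`JirakGate`, `PersistError`, `LiveStore`, `DiscoveredRule`); the real gate function does not exist yet, so the gate outcome is passed in as a plain enum.

```rust
/// Outcome of the (not yet existing) Jirak significance gate.
#[derive(Debug, Clone, Copy, PartialEq)]
enum JirakGate {
    NotWired, // D-ARM-7 has not landed
    Failed,   // candidate is below the significance floor
    Passed,
}

#[derive(Debug, PartialEq)]
enum PersistError {
    GateNotWired,
    BelowJirakFloor,
}

#[derive(Debug, Clone, PartialEq)]
struct DiscoveredRule {
    id: u32,
}

/// Stand-in for any persistent substrate the manager may write to.
#[derive(Default)]
struct LiveStore {
    persisted: Vec<DiscoveredRule>,
}

impl LiveStore {
    /// Reads are exempt from the gate.
    fn read(&self) -> &[DiscoveredRule] {
        &self.persisted
    }

    /// A write succeeds only with a passed gate; otherwise nothing is stored.
    fn persist(
        &mut self,
        rule: DiscoveredRule,
        gate: JirakGate,
    ) -> Result<(), PersistError> {
        match gate {
            JirakGate::NotWired => Err(PersistError::GateNotWired),
            JirakGate::Failed => Err(PersistError::BelowJirakFloor),
            JirakGate::Passed => {
                self.persisted.push(rule);
                Ok(())
            }
        }
    }
}
```

A stricter variant would make the pass a token type that only the gate function can construct, so a caller cannot claim `Passed` by hand; the enum form is kept here for brevity.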
+ +--- + +## D-LWS-6 — Foveated prefetch cascade (RouteAction-driven) + +**Status: Queued. Label: NEW (composes shipped `HhtlCache::route` + `ComposeTable`).** + +### Scope +Like Google-Maps tile prefetch: the adjacent area streams into the hot context +before reasoning pans to it. When the foveal `NiblePath` is hydrated (D-LWS-5), +the manager prefetches adjacency via the **`RouteAction` cascade**: for each +candidate neighbor archetype pair `(a, b)`, the route decides +`Skip | Attend | Compose | Escalate`. Only `Attend`/`Compose` neighbors are +prefetched; `Skip` (the ~60% majority) costs nothing. + +### The shipped primitive it builds on +- `bgz-tensor::hhtl_cache::RouteAction` (verified @ `hhtl_cache.rs:37`) + + `HhtlCache::route(a, b) -> RouteAction` (verified @ `:200`; `HipCache` alias + @ `:510`). The doc literally calls `route` "the prefetch decision." The + cascade's documented distribution (Skip ~60% / Attend ~35% / Compose rare / + Escalate ~5%) is exactly the foveated-periphery economics. +- `bgz-tensor::attention::ComposeTable` — the `Compose` arm resolves a + multi-hop neighbor via `compose_chain` (shared with D-LWS-3). + +### Firewall / honesty +- `RouteAction`/`HhtlCache`/`ComposeTable` all live in `bgz-tensor` (standalone, + zero-dep). The prefetch driver lives in the hub (D-LWS-5's manager). `aerial` + untouched. +- **Honest substrate:** the cascade is tested on the fixture adjacency derived + from on-disk TTL relations. The Skip/Attend percentages are bgz-tensor's + documented design figures, asserted as a sanity range on fixtures, not + measured on 115M. + +### Which probe / gate +- **Gate P3:** the prefetch hit-rate (did the prefetched periphery get used?) is + part of the compose-vs-materialize measurement; low hit-rate ⇒ the cascade is + over-fetching (re-tune the route thresholds). + +### Acceptance (fixture-level) +- A `Skip` pair is never hydrated; an `Attend` pair is; a `Compose` pair + resolves via `compose_chain` without a stored edge. 
+- Prefetch is bounded by the 256K envelope (prefetch yields to eviction). + +--- + +## D-LWS-7 — Eviction on the DOLCE continuant/occurrent 1-bit + +**Status: Queued. Label: NEW (composes shipped `dolce_id`).** + +### Scope +The ontology's own top split IS the cache policy. **DOLCE = a 1-bit +permanent/temporary residence policy:** +- **continuant** (Endurant / Quality / Abstract — wholly present, persists) ⇒ + **permanent / cold-persist / resident-priority**; +- **occurrent** (Perdurant — temporal parts, happens-then-ends) ⇒ + **ephemeral / evictable** (the Baton/event traffic; the B-frame compose-cache). + +One eviction policy, derived from the ontology. The manager evicts occurrent +arenas first under envelope pressure; continuant generators are sticky. + +### The shipped primitive it builds on +- `ontology::class_resolver::dolce_id` (verified @ `class_resolver.rs:45`): + `ENDURANT=0`, `PERDURANT=1`, `QUALITY=2`, `ABSTRACT=3`. **The derived 1-bit = + `dolce_id == PERDURANT` ⇒ occurrent ⇒ evictable; else continuant ⇒ + permanent.** The 4-facet `dolce_id 0..3` stays cache-resolvable (do NOT drop + the axis — the residence bit is *derived*, per the invariant guard); eviction + keys on the derived bit. +- `WikidataClass::dolce_id` field — every fixture class already carries it. + +### Firewall / honesty +- Lives in the hub (D-LWS-5's manager + the `dolce_id` resolver). `aerial` + untouched. +- **Invariant guard (verbatim):** keep `dolce_id 0..3` in the cache; the + residence bit is *derived*, not a replacement — never collapse the 4-facet + axis to 1 bit at rest. + +### Which probe / gate +- No probe gates eviction correctness directly; P3's eviction-churn measurement + informs the GOP cadence (D-LWS-4), and the occurrent/B-frame eviction is the + same policy as the compose-cache (D-LWS-3). + +### Acceptance (fixture-level) +- Under simulated envelope pressure, an occurrent (`PERDURANT`) arena evicts + before a continuant one. 
+- A continuant generator survives eviction (sticky). +- The 4-facet `dolce_id` is still resolvable post-eviction (the axis is not + destroyed). + +--- + +## D-LWS-8 — Probe harness (the 3 falsifiers, on real TTL + fixtures) + +**Status: Queued. Label: NEW (composes shipped `splat_louvain_modularity` + `clam` + `FieldMask`).** + +### Scope +This D-id PRODUCES the three gates. It is the falsifier harness, runnable on the +**real on-disk ontologies** (`data/ontologies/*.ttl`) + curated fixtures — NOT +on a Wikidata dump. Three probes: +1. **Partition locality (P1):** run `jc/examples/splat_louvain_modularity.rs` + (Louvain = popcount-AND over `AwarenessPlane16K`) + `clam::measure_cluster_radii` + on the FIBO/schema.org/DUL subtree; report modularity + whether CLAM radii + coincide with cohort boundaries. +2. **Delta-card residual (P2):** reconstruct each fixture entity from its + `FieldMask` delta over its archetype; histogram the residual per cohort. +3. **Compose hit-rate (P3):** measure ≤7-hop reachability + compose-cache churn + via `ComposeTable::compose_chain` / blasgraph `mxm` vs a stored-edge baseline. + +### The shipped primitives it builds on +- `jc/examples/splat_louvain_modularity.rs` (verified; Louvain-CLAM locality). +- `lance-graph::graph::neighborhood::clam::{measure_cluster_radii, + analyze_pareto_convergence, ParetoAnalysis}` (verified; the radius probe). +- `contract::class_view::FieldMask::inherit` (verified; the residual measurement). +- `ontology::ttl_parse::parse_ttl_directory` (verified; the real-data loader). + +### Firewall / honesty +- The harness lives where the examples live (`crates/jc/examples/` for the + Louvain driver; a hub-side test/bench for P2/P3). `jc` is the cert crate; the + hub owns the residual + compose measurements. `aerial` untouched. 
+- **This is the honesty backbone of the whole plan:** every CONJECTURE label in + D-LWS-1..7 is discharged (promoted to FINDING or corrected) by a D-LWS-8 probe + result recorded in the companion knowledge docs, per the CLAUDE.md insight + update cycle (Claim → Probe → Result → promote/correct). + +### Which probe / gate +- D-LWS-8 IS the gates. It does not consume a gate; it produces P1/P2/P3. + +### Acceptance +- Each probe runs to completion on real TTL + fixtures and emits a pass/fail + against its documented threshold (§3). Results recorded in + `delta-card-addressing-integration-map.md` Probes section + `EPIPHANIES.md`. + +--- + +## D-LWS-9 — DEFERRED: full Wikidata load (115M) into the spine + +**Status: Queued — DEFERRED (terminal). Label: NEW + CONJECTURE (no dump on disk).** + +### Scope +The full 115M-entity Wikidata load into the spine: the ndjson→`WikidataClass` +loader (named as "Remaining" in the D-ARM-14 STATUS_BOARD row), the dense-cohort +`Delegate` sub-tables (40M scholarly articles), the 12k-predicate composability +table, the full I/P/B GOP over the real corpus. + +### Hard prerequisites (ALL must be green) +- **Every probe (P1, P2, P3) PASSED** on the real TTL + fixtures (D-LWS-8). If + any probe fails, the design is wrong at the fixture scale and the 115M load is + premature. +- **D-ARM-7 landed** and wired: no hydrated rule / discovered edge / + reclassification writes the live store without passing the Jirak floor. +- D-LWS-1..7 shipped and behavior-validated on fixtures. + +### Honest substrate +- **There is NO 115M Wikidata dump on disk** (grepped). This D-id cannot start + until a dump is provisioned AND the gates are green. It is the only D-id that + touches real Wikidata scale; everything before it is validatable today on + `data/ontologies/*.ttl` + 6 curated classes. +- Labelled **CONJECTURE** end-to-end until the gates discharge the design. + +--- + +## 4. 
Firewall summary (the one-line contract per crate) + +| Crate | Role | This plan's rule | +|---|---|---| +| `lance-graph-arm-discovery` (`aerial`) | zero-dep PROPOSER | **untouched.** Never gains a heavy dep. Feeds discovery offline (splat → cohort proposals); never does runtime residency. | +| `lance-graph-contract` | zero-dep CONTRACT | gains zero-dep types only (`DeltaCard`, radix-register type, residency-policy enum) IF verified zero-dep; else they go to the hub. | +| `lance-graph-ontology` | ONTOLOGY (hub) | owns the radix register seed, composability flag table, `reconstruct`, `dolce_id` residence. | +| `lance-graph` | SPINE (hub) | owns the hydration manager, the I/P/B overlay over `VersionedGraph`, the blasgraph `mxm` driver, the prefetch cascade. | +| `bgz-tensor` | standalone codec | provides `ComposeTable` + `RouteAction`/`HhtlCache` (consumed, not modified). | +| `jc` | standalone cert | provides `jirak::prove` (D-ARM-7 engine) + the Louvain probe example. | + +**The firewall holds:** aerial stays the zero-dep proposer; the hub owns +contract/ontology and the entire runtime hydration layer. No D-id in this plan +proposes making aerial depend on heavy crates. + +--- + +## 5. Risk register + +| # | Risk | Mitigation | +|---|---|---| +| R1 | **`EpisodicWitness64` does not exist** (cited in both companion docs). | Plan cites only `WitnessTable<64>`/`WitnessEntry` (verified). The 16-bit "book" witness tier is CONJECTURE. Flag to integration-lead: the companion docs should be corrected or `EpisodicWitness64` shipped. | +| R2 | **Lance *fragment*-versioning not wired** (only dataset-level `VersionedGraph`). | D-LWS-4 ships option (a) dataset-versioning now; option (b) fragment APIs is a NEW spike, CONJECTURE until the `lance =6.0.0` fragment surface is confirmed reachable from this crate. 
| +| R3 | **CLAM is a probe, not a clusterer.** | Every "CLAM-clustered" claim builds on `measure_cluster_radii` (offline placement decision); the clusterer that acts on radii is NEW. P1 gates whether radii coincide with cohorts at all. | +| R4 | **All three probes are CONJECTURE.** A failing probe invalidates the design at fixture scale. | D-LWS-8 runs them on real TTL + fixtures BEFORE D-LWS-9; gates are kill-switches, not decoration. | +| R5 | **D-ARM-7 (Jirak floor) is Queued, not shipped.** | Hard prerequisite enforced in code (D-LWS-5 write-refusal acceptance test), not just documented. No live write without it. | +| R6 | **Fan-out freeze is one-shot** (frozen ISA, append-only). | D-LWS-1 ships the conservative 16-way `NiblePath`-native register; the 256⁴ re-parameterization is CONJECTURE, frozen only after P1. | +| R7 | **Zero-dep placement of new contract types unverified.** | Each new type's home is decided by a `cargo check`/`cargo tree` zero-dep verification at implementation time; the plan names both candidate homes. | + +--- + +## 6. Board hygiene (for the implementing session, NOT this planning agent) + +Per CLAUDE.md Mandatory Board-Hygiene Rule, the session that IMPLEMENTS any +D-LWS-* must, in the same commit: +- prepend `.claude/board/INTEGRATION_PLANS.md` (this plan's index entry); +- add the D-LWS-1..9 rows to `.claude/board/STATUS_BOARD.md` (Queued → … → Shipped); +- prepend `.claude/board/AGENT_LOG.md` on completion. + +**This planning agent (W1) does NOT touch any `.claude/board/*` file** — the +orchestrator owns those (per the wave iron rules). This plan file is the only +artifact W1 writes. + +--- + +## 7. Cross-references +- THE design: `.claude/knowledge/delta-card-addressing-integration-map.md`. +- Framing: `.claude/knowledge/agnostic-lazy-world-spine.md`. +- Probe-1 driver: `crates/jc/examples/splat_louvain_modularity.rs`. +- Jirak floor: `crates/jc/src/jirak.rs`; ISSUE `ARM-JIRAK-FLOOR`; STATUS_BOARD D-ARM-7. 
+- Related plans: `.claude/plans/streaming-arm-nars-discovery-v1.md` (D-ARM arc), + `.claude/specs/wikidata-hhtl-load.md` (120→38GB structural compression). +- Iron rules: `I-VSA-IDENTITIES`, `I-NOISE-FLOOR-JIRAK`, + `I-LEGACY-API-FEATURE-GATED`; `CLAUDE.md` The Click (free-energy = prior + + prediction-error). diff --git a/crates/deepnsm/Cargo.lock b/crates/deepnsm/Cargo.lock index f775e11b..529cfcab 100644 --- a/crates/deepnsm/Cargo.lock +++ b/crates/deepnsm/Cargo.lock @@ -39,9 +39,9 @@ checksum = "7c02d123df017efcdfbd739ef81735b36c5ba83ec3c59c80a9d7ecc718f92e50" [[package]] name = "arrow-array" -version = "57.3.0" +version = "58.3.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "4c8955af33b25f3b175ee10af580577280b4bd01f7e823d94c7cdef7cf8c9aef" +checksum = "cfd33d3e92f207444098c75b42de99d329562be0cf686b307b097cc52b4e999e" dependencies = [ "ahash", "arrow-buffer", @@ -57,9 +57,9 @@ dependencies = [ [[package]] name = "arrow-buffer" -version = "57.3.0" +version = "58.3.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "c697ddca96183182f35b3a18e50b9110b11e916d7b7799cbfd4d34662f2c56c2" +checksum = "0c6cd424c2693bcdbc150d843dc9d4d137dd2de4782ce6df491ad11a3a0416c0" dependencies = [ "bytes", "half", @@ -69,9 +69,9 @@ dependencies = [ [[package]] name = "arrow-data" -version = "57.3.0" +version = "58.3.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "1fdd994a9d28e6365aa78e15da3f3950c0fdcea6b963a12fa1c391afb637b304" +checksum = "3c88210023a2bfee1896af366309a3028fc3bcbd6515fa29a7990ee1baa08ee0" dependencies = [ "arrow-buffer", "arrow-schema", @@ -82,9 +82,9 @@ dependencies = [ [[package]] name = "arrow-schema" -version = "57.3.0" +version = "58.3.0" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "8c872d36b7bf2a6a6a2b40de9156265f0242910791db366a2c17476ba8330d68" +checksum = "f633dbfdf39c039ada1bf9e34c694816eb71fbb7dc78f613993b7245e078a1ed" [[package]] 
name = "autocfg" @@ -210,6 +210,12 @@ dependencies = [ "ndarray", ] +[[package]] +name = "equivalent" +version = "1.0.2" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "877a4ace8713b0bcf2a4e7eec82529c029f1d0619886d18145fea96c3ffe5c0f" + [[package]] name = "find-msvc-tools" version = "0.1.9" @@ -327,6 +333,12 @@ dependencies = [ "wasip2", ] +[[package]] +name = "glob" +version = "0.3.3" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "0cc23270f6e1808e30a928bdc84dea0b9b4136a8bc82338574f23baf47bbd280" + [[package]] name = "half" version = "2.7.1" @@ -341,9 +353,9 @@ dependencies = [ [[package]] name = "hashbrown" -version = "0.16.1" +version = "0.17.1" source = "registry+https://github.com/rust-lang/crates.io-index" -checksum = "841d1cc9bed7f9236f321df977030373f4a4163ae1a7dbfe1a51a2c1a51d9100" +checksum = "ed5909b6e89a2db4456e54cd5f673791d7eca6732202bbf2a9cc504fe2f9b84a" [[package]] name = "holograph" @@ -383,6 +395,22 @@ dependencies = [ "cc", ] +[[package]] +name = "indexmap" +version = "2.14.0" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "d466e9454f08e4a911e14806c24e16fba1b4c121d1ea474396f396069cf949d9" +dependencies = [ + "equivalent", + "hashbrown", +] + +[[package]] +name = "itoa" +version = "1.0.18" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "8f42a60cbdf9a97f5d2305f08a87dc4e09308d1276d28c869c684d7777685682" + [[package]] name = "js-sys" version = "0.3.97" @@ -407,6 +435,11 @@ dependencies = [ [[package]] name = "lance-graph-contract" version = "0.1.0" +dependencies = [ + "glob", + "serde", + "serde_yaml", +] [[package]] name = "libc" @@ -451,8 +484,7 @@ dependencies = [ "num-complex", "num-integer", "num-traits", - "p64", - "phyllotactic-manifold", + "paste", "portable-atomic", "portable-atomic-util", "rawpointer", @@ -503,15 +535,10 @@ source = "registry+https://github.com/rust-lang/crates.io-index" checksum = 
"9f7c3e4beb33f85d45ae3e3a1792185706c8e16d043238c593331cc7cd313b50" [[package]] -name = "p64" -version = "0.1.0" -dependencies = [ - "phyllotactic-manifold", -] - -[[package]] -name = "phyllotactic-manifold" -version = "0.1.0" +name = "paste" +version = "1.0.15" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "57c0d7b74b563b49d38dae00a0c37d4d6de9b432382b2892f0574ddcae73fd0a" [[package]] name = "pin-project-lite" @@ -570,6 +597,12 @@ version = "1.0.22" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "b39cdef0fa800fc44525c84ccb54a029961a8215f9619753635a9c0d2538d46d" +[[package]] +name = "ryu" +version = "1.0.23" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "9774ba4a74de5f7b1c1451ed6cd5285a32eddb5cccb8cc655a4e50009e06477f" + [[package]] name = "serde" version = "1.0.228" @@ -600,6 +633,19 @@ dependencies = [ "syn", ] +[[package]] +name = "serde_yaml" +version = "0.9.34+deprecated" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "6a8b1a1a2ebf674015cc02edccce75287f1a0130d394307b36743c2f5d504b47" +dependencies = [ + "indexmap", + "itoa", + "ryu", + "serde", + "unsafe-libyaml", +] + [[package]] name = "shlex" version = "1.3.0" @@ -658,6 +704,12 @@ version = "1.0.24" source = "registry+https://github.com/rust-lang/crates.io-index" checksum = "e6e4313cd5fcd3dad5cafa179702e2b244f760991f45397d14d4ebf38247da75" +[[package]] +name = "unsafe-libyaml" +version = "0.2.11" +source = "registry+https://github.com/rust-lang/crates.io-index" +checksum = "673aac59facbab8a9007c7f6108d11f63b603f7cabff99fabf650fea5c32b861" + [[package]] name = "version_check" version = "0.9.5" diff --git a/crates/jc/Cargo.toml b/crates/jc/Cargo.toml index 162d433f..7eeb47f6 100644 --- a/crates/jc/Cargo.toml +++ b/crates/jc/Cargo.toml @@ -57,3 +57,6 @@ name = "splat_jaccard_adamic_adar" [[example]] name = "splat_perturbationslernen" + +[[example]] +name = "ontology_locality_probe" diff 
--git a/crates/jc/examples/ontology_locality_probe.rs b/crates/jc/examples/ontology_locality_probe.rs new file mode 100644 index 00000000..cb453580 --- /dev/null +++ b/crates/jc/examples/ontology_locality_probe.rs @@ -0,0 +1,941 @@ +//! Ontology partition-locality probe — the empirical falsifier for the +//! "16 family pointers / inherited nothingness" claim in +//! `.claude/knowledge/delta-card-addressing-integration-map.md`. +//! +//! ## What this probe measures (and what it does NOT) +//! +//! The addressing map claims a frozen ontology radix gives ~0-bit-per-row +//! cost because (a) `subClassOf` edges are overwhelmingly *local* (both +//! endpoints in the same top-level DOLCE-style basin), so a reference can be +//! a 16-bit local pointer instead of a 27-bit global QID, and (b) the +//! per-class "family frontier" (distinct parent basins reachable) is small — +//! the design pencils in a 4/12/16 split, so the question is whether ≤16 +//! distinct basins per class is empirically enough. +//! +//! This probe MEASURES those two numbers — plus the modularity Q of the +//! basin partition — on **real `rdfs:subClassOf` graphs** parsed from the +//! ontology TTLs shipped in `data/ontologies/` (DOLCE-Ultralite, schema.org, +//! Odoo-core, PROV-O, QUDT-core, OWL-Time). +//! +//! ### HONEST SCOPE CAVEAT (read before quoting any number) +//! +//! These are REAL ontology `subClassOf` structures, but they are NOT the +//! full 115M-entity Wikidata `P279` graph. There is NO Wikidata dump on +//! disk. This is a *genuine but smaller* falsifier: it tests the locality +//! hypothesis on the same KIND of structure (hand-curated upper + domain +//! ontologies, exactly the "frozen ISA" the map freezes), at 10^2..10^3 +//! classes rather than 10^8. A PASS here means "the locality hypothesis +//! survives on real ontology structure"; it does NOT mean "proven on +//! Wikidata". The verdict text repeats this caveat. +//! +//! ## Definitions +//! +//! 
- **top-basin** of a class = the root ancestor reached by walking +//! `subClassOf` parents upward (the DOLCE-style top facet). A class with +//! no parent is its own basin (a root). Multi-parent classes pick a +//! deterministic representative root (smallest interned id) so the +//! partition is well-defined; cycles are broken defensively. +//! - **locality** = fraction of `subClassOf` edges `(child -> parent)` whose +//! two endpoints share a top-basin. (Edges to a different basin are the +//! "non-local" references that would need a wider pointer.) +//! - **fan-out** = per class, the number of DISTINCT parent-basins among its +//! direct `subClassOf` parents. Max + histogram answer "is ≤16 enough?". +//! - **modularity Q** = Newman modularity of the basin partition on the +//! undirected `subClassOf` graph, computed with the popcount-AND gain idea +//! reused from `splat_louvain_modularity.rs` (within-partition edge mass +//! via AND of per-basin membership bitsets). +//! +//! Zero external deps (std only) — jc stays standalone. The TTL "parser" is +//! a minimal line scanner for `rdfs:subClassOf` triples ONLY; it is NOT a +//! general Turtle parser and deliberately skips blank-node restrictions +//! (`rdfs:subClassOf [ ... ]`) since those are anonymous OWL restrictions, +//! not class-to-class edges. +//! +//! Run: +//! cargo run --manifest-path crates/jc/Cargo.toml \ +//! --example ontology_locality_probe +//! cargo run --manifest-path crates/jc/Cargo.toml \ +//! --example ontology_locality_probe -- /path/to/ontology/dir + +use std::collections::{BTreeMap, BTreeSet}; +use std::path::{Path, PathBuf}; + +// ── TTL subClassOf line scanner (zero-dep, NOT a general Turtle parser) ───── +// +// Turtle predicate-list shape we handle: +// +// ex:Child a owl:Class ; <- establishes current subject +// rdfs:subClassOf ex:Parent , <- emits (Child -> Parent) +// ex:Other ; <- comma-continued object list +// rdfs:label "..." . <- '.' 
terminates the subject +// +// Rules: +// * The current subject is the first token of a statement (the token +// before `a` / `rdf:type`, or simply the first token on a line that is +// not whitespace-led when no subject is active). It persists until a +// statement-terminating `.`. +// * `rdfs:subClassOf` / bare `subClassOf` sets the current predicate so a +// following line beginning with `,` continues the object list. +// * Objects that are named IRIs (prefixed `pfx:Local`, `:Local`, or +// `<...>`) become edges. An object that is `[` opens an anonymous OWL +// restriction — SKIPPED (it is not a class-to-class edge). +// * String literals and `#` comments are stripped first so that the word +// "subClassOf" inside an `rdfs:comment "..."` is never mistaken for a +// predicate. + +/// Remove `"..."`/`'''...'''`/`"""..."""` string literals and trailing `#` +/// comments from a single physical line, so the tokenizer never sees TTL +/// content text. Multi-line triple-quoted strings are handled by the caller +/// via the `in_long_string` flag. +fn strip_strings_and_comments(line: &str, in_long_string: &mut bool) -> String { + // Char-based scan (UTF-8 safe: ontology comments contain multi-byte chars + // like the zero-width space U+200B in the QUDT license text, so byte + // slicing would panic on a char boundary). All delimiters we look for + // ('"', '\'', '#', '\\') are single-byte ASCII. + let chars: Vec<char> = line.chars().collect(); + let mut out = String::with_capacity(line.len()); + let n = chars.len(); + let is_triple = |c: &[char], i: usize, q: char| { + i + 3 <= c.len() && c[i] == q && c[i + 1] == q && c[i + 2] == q + }; + let mut i = 0; + while i < n { + if *in_long_string { + // Look for the closing """ or ''' (treated the same). + if is_triple(&chars, i, '"') || is_triple(&chars, i, '\'') { + *in_long_string = false; + i += 3; + } else { + i += 1; + } + continue; + } + // Opening of a long ("""/''') string? 
+ if is_triple(&chars, i, '"') || is_triple(&chars, i, '\'') { + let q = chars[i]; + i += 3; + // Does it also close on this same line? + let mut closed = false; + while i < n { + if is_triple(&chars, i, q) { + i += 3; + closed = true; + break; + } + i += 1; + } + if !closed { + *in_long_string = true; + } + out.push(' '); + continue; + } + let c = chars[i]; + if c == '"' || c == '\'' { + // Single-line quoted literal: skip to matching quote, honoring \". + let quote = c; + i += 1; + while i < n { + let d = chars[i]; + if d == '\\' { + i += 2; + continue; + } + if d == quote { + i += 1; + break; + } + i += 1; + } + out.push(' '); + continue; + } + if c == '#' { + // Rest of line is a comment. + break; + } + out.push(c); + i += 1; + } + out +} + +/// True iff `tok` is a named-IRI object we accept as a subClassOf target: +/// a prefixed name (`pfx:Local` or `:Local`) or an angle-bracket IRI +/// (`<...>`). Rejects blank nodes (`[`, `_:`) and list punctuation. +/// `owl:Thing`-style roots are accepted (they ARE named classes / valid +/// basins). `rdf:type`-ish predicates never reach this function because +/// it is only ever called on object position. +fn is_named_iri(tok: &str) -> bool { + if tok.is_empty() { + return false; + } + if tok.starts_with('<') { + return tok.len() > 2; // more than a bare "<>" + } + if tok.starts_with("_:") { + return false; // explicit blank node label + } + if tok.starts_with('[') || tok.starts_with(']') { + return false; + } + // prefixed name: must contain a ':' and start with an identifier or ':' + let first = tok.chars().next().unwrap(); + (first == ':' || first.is_alphabetic()) && tok.contains(':') +} + +/// Normalize an object/subject token into a stable class key. Strips +/// surrounding `;` `,` `.` punctuation; angle-bracket IRIs keep their `<>` +/// and prefixed names are left as-is. Returns `None` for things that are +/// not class references. 
+fn normalize_iri(tok: &str) -> Option<String> { + let t = tok.trim_matches(|c| c == ';' || c == ',' || c == '.'); + if t.is_empty() { + return None; + } + if t.starts_with('<') && t.ends_with('>') && t.len() > 2 { + return Some(t.to_string()); + } + if is_named_iri(t) { + return Some(t.to_string()); + } + None +} + +/// Parse all `rdfs:subClassOf` / bare-`subClassOf` class-to-class edges from +/// a TTL document. Returns a vec of `(child, parent)` IRI-key pairs. +/// +/// Self-loops (`X subClassOf X`) and edges into blank-node restrictions are +/// dropped. This is the function the `#[cfg(test)]` parser test exercises. +pub fn parse_subclass_edges(ttl: &str) -> Vec<(String, String)> { + const SUBCLASS: &str = "subClassOf"; // matches rdfs:subClassOf AND bare subClassOf + let mut edges: Vec<(String, String)> = Vec::new(); + let mut current_subject: Option<String> = None; + let mut predicate_is_subclass = false; + let mut in_long_string = false; + // Depth of nested `[ ... ]` blank-node restrictions. While > 0 we are + // INSIDE an anonymous OWL restriction and emit no edges; the restriction + // spans multiple physical lines, so this persists across the line loop. + let mut bracket_depth: i32 = 0; + + for raw_line in ttl.lines() { + let line = strip_strings_and_comments(raw_line, &mut in_long_string); + let leading_ws = raw_line.starts_with(char::is_whitespace); + + // Split into whitespace tokens (Turtle is whitespace-delimited at this + // granularity; we already stripped strings/comments). + let toks: Vec<&str> = line.split_whitespace().collect(); + if toks.is_empty() { + // A blank physical line does not by itself end a statement. 
+ continue; + } + + let mut idx = 0; + + // A statement that begins flush-left (no leading whitespace) and whose + // first token is a named IRI / blank starts a NEW subject — UNLESS the + // line is a pure object-list continuation beginning with ',' (handled + // below) or a directive (@prefix / @base / PREFIX / BASE). 
+ let first = toks[0]; + let is_directive = first.starts_with('@') + || first.eq_ignore_ascii_case("prefix") + || first.eq_ignore_ascii_case("base"); + if is_directive { + // Directives don't carry subjects or edges; but a directive still + // can be terminated by '.', which must not clobber subject state of + // a real statement (directives are always flush-left & self + // contained), so just skip the whole line. + continue; + } + + if bracket_depth == 0 + && !leading_ws + && first != "," + && first != ";" + && !first.starts_with('[') + { + // New subject candidate (only when not inside a blank node). + if let Some(subj) = normalize_iri(first) { + current_subject = Some(subj); + } else { + current_subject = None; + } + predicate_is_subclass = false; + idx = 1; + } + + // Walk remaining tokens, tracking predicate switches and emitting + // edges while the active predicate is subClassOf AND we are at + // bracket depth 0 (outside any anonymous restriction). + while idx < toks.len() { + let tok = toks[idx]; + + // Update bracket depth from any '[' / ']' characters in the token, + // then move on if the token is pure bracket punctuation. A '[' + // opening means the CURRENT subClassOf object is an anonymous + // restriction; we suppress emission until the matching ']' but + // stay in subClassOf predicate mode so a following ',' continues + // the OUTER object list. + let opens = tok.matches('[').count() as i32; + let closes = tok.matches(']').count() as i32; + if opens > 0 || closes > 0 { + bracket_depth += opens - closes; + if bracket_depth < 0 { + bracket_depth = 0; + } + // If the token is only brackets (possibly with ',' / ';'), + // there is nothing else to interpret on it. + let stripped: String = tok + .chars() + .filter(|&c| c != '[' && c != ']' && c != ',' && c != ';') + .collect(); + if stripped.is_empty() { + idx += 1; + continue; + } + } + + // Anything inside a blank node is ignored entirely. 
+ if bracket_depth > 0 { + idx += 1; + continue; + } + + // Object-list continuation: ',' keeps the current predicate. + if tok == "," { + idx += 1; + continue; + } + // ';' ends the current predicate's object list (a new predicate + // follows on this or a later line). + if tok == ";" { + predicate_is_subclass = false; + idx += 1; + continue; + } + // '.' terminates the whole statement → no active subject. + if tok.starts_with('.') && tok.len() == 1 { + current_subject = None; + predicate_is_subclass = false; + idx += 1; + continue; + } + + // Predicate detection: rdfs:subClassOf or bare subClassOf. + let bare = tok.trim_end_matches([';', ',']); + if bare == SUBCLASS || bare.ends_with(":subClassOf") || bare == "rdfs:subClassOf" { + predicate_is_subclass = true; + idx += 1; + continue; + } + // In subClassOf object position: emit a named-IRI edge. + if predicate_is_subclass { + if let (Some(child), Some(parent)) = + (current_subject.clone(), normalize_iri(tok)) + { + if child != parent { + edges.push((child, parent)); + } + } + idx += 1; + continue; + } + + // Not in subClassOf mode: a token like `a`, `rdf:type`, + // `owl:disjointWith`, `rdfs:label` is a (non-subclass) predicate; + // it just resets predicate state. We do not need its objects. + if bare == "a" || bare.contains(':') { + predicate_is_subclass = false; + } + idx += 1; + } + } + edges +} + +// ── class graph: intern IRIs, build parent adjacency, assign top-basins ───── + +/// Interned subClassOf DAG over class IRIs. +pub struct ClassGraph { + /// id -> IRI key (for printing). + pub names: Vec<String>, + /// Direct parents of each class (deduplicated, sorted). + pub parents: Vec<Vec<usize>>, + /// All edges as interned (child, parent) id pairs. + pub edges: Vec<(usize, usize)>, +} + +impl ClassGraph { + /// Build from `(child, parent)` IRI-key edges. Every IRI appearing in any + /// position becomes a node (a parent that is never a child is a root). 
+ pub fn from_edges(iri_edges: &[(String, String)]) -> Self { + let mut id_of: BTreeMap<String, usize> = BTreeMap::new(); + let mut names: Vec<String> = Vec::new(); + let intern = |s: &str, names: &mut Vec<String>, id_of: &mut BTreeMap<String, usize>| { + if let Some(&id) = id_of.get(s) { + id + } else { + let id = names.len(); + names.push(s.to_string()); + id_of.insert(s.to_string(), id); + id + } + }; + let mut edges: Vec<(usize, usize)> = Vec::new(); + for (c, p) in iri_edges { + let ci = intern(c, &mut names, &mut id_of); + let pi = intern(p, &mut names, &mut id_of); + edges.push((ci, pi)); + } + let n = names.len(); + let mut parents: Vec<Vec<usize>> = vec![Vec::new(); n]; + for &(c, p) in &edges { + parents[c].push(p); + } + for ps in parents.iter_mut() { + ps.sort_unstable(); + ps.dedup(); + } + Self { names, parents, edges } + } + + pub fn n_classes(&self) -> usize { + self.names.len() + } + + /// Assign each class to its top-basin = the root ancestor reached by + /// walking parents upward. Multi-parent: follow the parent with the + /// SMALLEST interned id (deterministic representative). Cycles: broken by + /// a visited-set; the entry node of a cycle becomes its own basin. + /// Returns `basin[id] = root_id`. + pub fn assign_basins(&self) -> Vec<usize> { + let n = self.n_classes(); + let mut basin = vec![usize::MAX; n]; + for start in 0..n { + if basin[start] != usize::MAX { + continue; + } + // Walk up to a root, recording the path; memoize on the way back. + let mut path: Vec<usize> = Vec::new(); + let mut visiting: BTreeSet<usize> = BTreeSet::new(); + let mut cur = start; + let root; + loop { + if let Some(&memo) = basin.get(cur) { + if memo != usize::MAX { + root = memo; + break; + } + } + if visiting.contains(&cur) { + // Cycle: treat `cur` as the basin root for this SCC entry. + root = cur; + break; + } + visiting.insert(cur); + path.push(cur); + // Pick the smallest-id parent (deterministic). No parent → root. 
+ match self.parents[cur].iter().min() { + Some(&p) => cur = p, + None => { + root = cur; + break; + } + } + } + for id in path { + basin[id] = root; + } + if basin[start] == usize::MAX { + basin[start] = root; + } + } + basin + } +} + +// ── metric 1: locality ────────────────────────────────────────────────────── + +/// Fraction of edges whose child and parent share a top-basin. +/// Returns (local_edges, total_edges, fraction). Empty graph → fraction 0. +pub fn locality(edges: &[(usize, usize)], basin: &[usize]) -> (usize, usize, f64) { + let total = edges.len(); + if total == 0 { + return (0, 0, 0.0); + } + let local = edges + .iter() + .filter(|&&(c, p)| basin[c] == basin[p]) + .count(); + (local, total, local as f64 / total as f64) +} + +// ── metric 2: fan-out (distinct parent-basins per class) ──────────────────── + +/// Per-class count of DISTINCT parent-basins among its direct subClassOf +/// parents. Returns (max_fanout, histogram) where histogram[k] = #classes +/// whose fan-out == k. Classes with no parents contribute fan-out 0. +pub fn fan_out(graph: &ClassGraph, basin: &[usize]) -> (usize, BTreeMap<usize, usize>) { + let mut hist: BTreeMap<usize, usize> = BTreeMap::new(); + let mut max_fo = 0usize; + for c in 0..graph.n_classes() { + let distinct: BTreeSet<usize> = graph.parents[c].iter().map(|&p| basin[p]).collect(); + let fo = distinct.len(); + max_fo = max_fo.max(fo); + *hist.entry(fo).or_insert(0) += 1; + } + (max_fo, hist) +} + +// ── metric 3: modularity Q of the basin partition ────────────────────────── +// +// Newman modularity on the UNDIRECTED subClassOf graph (each subClassOf edge +// contributes one undirected link between child and parent): +// +// Q = Σ_c [ e_c / m - (a_c / 2m)^2 ] +// +// where m = |E|, e_c = number of edges fully inside basin c, a_c = sum of +// degrees of nodes in basin c. 
We reuse the `splat_louvain_modularity.rs` +// idea — the within-community edge mass is a popcount-AND between a node's +// neighbour bitset and the basin-membership bitset — but with dynamically +// sized `Vec<u64>` planes so the probe handles ontologies with thousands of +// classes (the contract's fixed 16,384-bit `AwarenessPlane16K` is too small +// for schema.org). Self-loops are excluded by construction (the parser drops +// `X subClassOf X`). + +/// A dynamically sized bitset (the standalone analogue of `AwarenessPlane16K`). +struct BitPlane(Vec<u64>); + +impl BitPlane { + fn zero(n_bits: usize) -> Self { + BitPlane(vec![0u64; n_bits.div_ceil(64)]) + } + #[inline] + fn set(&mut self, idx: usize) { + self.0[idx / 64] |= 1u64 << (idx % 64); + } + #[inline] + fn and_popcount(&self, other: &BitPlane) -> u32 { + self.0 + .iter() + .zip(other.0.iter()) + .map(|(a, b)| (a & b).count_ones()) + .sum() + } +} + +/// Compute Newman modularity Q of the basin partition. Returns Q in +/// [-0.5, 1.0]. Empty graph → 0.0. +pub fn modularity_q(graph: &ClassGraph, basin: &[usize]) -> f64 { + let n = graph.n_classes(); + let m = graph.edges.len(); + if m == 0 || n == 0 { + return 0.0; + } + let two_m = 2.0 * m as f64; + + // Undirected neighbour bitset per node (both directions of each edge). + let mut neigh: Vec<BitPlane> = (0..n).map(|_| BitPlane::zero(n)).collect(); + let mut degree = vec![0u32; n]; + for &(c, p) in &graph.edges { + neigh[c].set(p); + neigh[p].set(c); + degree[c] += 1; + degree[p] += 1; + } + + // Group node ids by basin; build a membership bitset per basin. + let mut members: BTreeMap<usize, Vec<usize>> = BTreeMap::new(); + for (id, &b) in basin.iter().enumerate() { + members.entry(b).or_default().push(id); + } + + let mut q = 0.0; + for ids in members.values() { + let mut plane = BitPlane::zero(n); + for &id in ids { + plane.set(id); + } + // e_c counted twice (once per endpoint) via Σ_u popcount(neigh[u] AND plane). 
+ let mut e_c_times_two = 0u32; + let mut a_c = 0.0; + for &id in ids { + e_c_times_two += neigh[id].and_popcount(&plane); + a_c += degree[id] as f64; + } + let e_c = e_c_times_two as f64 / 2.0; + q += (e_c / m as f64) - (a_c / two_m).powi(2); + } + q +} + +// ── verdict ────────────────────────────────────────────────────────────────── + +/// Verdict tier for the locality hypothesis. +#[derive(Debug, Clone, Copy, PartialEq, Eq)] +pub enum Verdict { + /// High locality AND fan-out fits the family frontier. + Pass, + /// Locality decent but borderline, or fan-out near the cap. + Marginal, + /// Locality low — local-pointer assumption does not hold. + Fail, +} + +impl Verdict { + pub fn as_str(self) -> &'static str { + match self { + Verdict::Pass => "PASS", + Verdict::Marginal => "MARGINAL", + Verdict::Fail => "FAIL", + } + } +} + +/// Decide the verdict from the measured numbers. +/// +/// Thresholds (stated, not hand-waved): +/// * locality ≥ 0.90 AND max_fanout ≤ 16 → PASS (the map's claim) +/// * locality ≥ 0.75 (or max_fanout in 17..=32) → MARGINAL +/// * otherwise → FAIL +/// +/// The "16" frontier is the design's pencilled cap; max_fanout > 16 means a +/// single class needs more than 16 distinct family pointers, breaking the +/// 4/12/16 split as stated (though a wider frontier byte would still work). +pub fn verdict(locality_frac: f64, max_fanout: usize) -> Verdict { + if locality_frac >= 0.90 && max_fanout <= 16 { + Verdict::Pass + } else if locality_frac >= 0.75 || (max_fanout > 16 && max_fanout <= 32) { + Verdict::Marginal + } else { + Verdict::Fail + } +} + +// ── load real ontology TTLs from a directory ──────────────────────────────── + +/// All parsed `(child, parent)` IRI edges plus the sorted list of TTL files +/// they came from. +type LoadedOntology = (Vec<(String, String)>, Vec<PathBuf>); + +/// Recursively collect `*.ttl` files under `dir`, parse subClassOf edges from +/// each, and return (all_edges, sorted_file_list).
I/O errors on individual +/// files are skipped with a note to stderr (the probe is best-effort over +/// whatever real ontologies are present). +fn load_dir(dir: &Path) -> std::io::Result<LoadedOntology> { + let mut edges: Vec<(String, String)> = Vec::new(); + let mut files: Vec<PathBuf> = Vec::new(); + let mut stack = vec![dir.to_path_buf()]; + while let Some(d) = stack.pop() { + let rd = match std::fs::read_dir(&d) { + Ok(rd) => rd, + Err(e) => { + eprintln!(" (skip dir {}: {})", d.display(), e); + continue; + } + }; + for entry in rd.flatten() { + let path = entry.path(); + if path.is_dir() { + stack.push(path); + } else if path.extension().map(|e| e == "ttl").unwrap_or(false) { + match std::fs::read_to_string(&path) { + Ok(text) => { + let mut e = parse_subclass_edges(&text); + edges.append(&mut e); + files.push(path); + } + Err(e) => eprintln!(" (skip {}: {})", path.display(), e), + } + } + } + } + files.sort(); + Ok((edges, files)) +} + +// ── main ───────────────────────────────────────────────────────────────────── + +fn main() { + // Data dir: arg 1, else the repo-default `data/ontologies`. + let arg = std::env::args().nth(1); + let dir = arg + .clone() + .map(PathBuf::from) + .unwrap_or_else(|| PathBuf::from("data/ontologies")); + + println!("══════════════════════════════════════════════════════════════════════"); + println!(" Ontology partition-locality probe (probe 1: partition locality)"); + println!("══════════════════════════════════════════════════════════════════════"); + println!(); + println!(" SUBSTRATE: REAL rdfs:subClassOf graphs from {}", dir.display()); + println!(" This is a GENUINE but SMALLER falsifier (10^2..10^3 classes)."); + println!(" It is NOT the full 115M-entity Wikidata P279 graph — there is no"); + println!(" Wikidata dump on disk.
A PASS means the locality hypothesis survives"); + println!(" on real ontology structure, NOT that it is proven on Wikidata."); + println!(); + + let (iri_edges, files) = match load_dir(&dir) { + Ok(r) => r, + Err(e) => { + eprintln!("FATAL: cannot read {}: {}", dir.display(), e); + eprintln!("Pass a data dir as arg 1, e.g.:"); + eprintln!(" cargo run --manifest-path crates/jc/Cargo.toml \\"); + eprintln!(" --example ontology_locality_probe -- /abs/path/to/ontologies"); + std::process::exit(1); + } + }; + + if iri_edges.is_empty() { + eprintln!("No rdfs:subClassOf edges found under {}.", dir.display()); + eprintln!("(Found {} .ttl files but no class-to-class subClassOf triples.)", files.len()); + std::process::exit(1); + } + + println!(" TTL files parsed ({}):", files.len()); + for f in &files { + // Show the file name + how many edges it alone contributes. + if let Ok(text) = std::fs::read_to_string(f) { + let n = parse_subclass_edges(&text).len(); + let name = f.file_name().and_then(|s| s.to_str()).unwrap_or("?"); + println!(" {:<28} {:>5} subClassOf edges", name, n); + } + } + println!(); + + let graph = ClassGraph::from_edges(&iri_edges); + let basin = graph.assign_basins(); + let n_basins: BTreeSet<usize> = basin.iter().copied().collect(); + + let (local, total, loc_frac) = locality(&graph.edges, &basin); + let (max_fo, fo_hist) = fan_out(&graph, &basin); + let q = modularity_q(&graph, &basin); + let v = verdict(loc_frac, max_fo); + + println!("──────────────────────────────────────────────────────────────────────"); + println!(" VERDICT TABLE (measured on real ontology subClassOf structure)"); + println!("──────────────────────────────────────────────────────────────────────"); + println!(" classes (nodes) : {}", graph.n_classes()); + println!(" subClassOf edges : {}", total); + println!(" top-basins (root facets) : {}", n_basins.len()); + println!(); + println!(" LOCALITY : {}/{} = {:.4} ({:.2}% of edges are intra-basin)", + local, total, loc_frac, loc_frac *
100.0); + println!(" (the map's '~90% local' claim — measured value above)"); + println!(); + println!(" FAN-OUT (distinct parent-basins per class)"); + println!(" max : {} (is <=16 enough? {})", + max_fo, if max_fo <= 16 { "YES" } else { "NO — exceeds the pencilled 16-frontier" }); + println!(" histogram (fanout -> #classes):"); + for (k, cnt) in &fo_hist { + let bar = "#".repeat((*cnt).min(60)); + println!(" {:>3} -> {:>5} {}", k, cnt, bar); + } + println!(); + println!(" MODULARITY Q (basin partition, Newman)"); + println!(" Q : {:.4}", q); + println!(" (Q>0.3 = clear community structure; Q->1 = near-perfectly modular)"); + println!(); + + println!("══════════════════════════════════════════════════════════════════════"); + println!(" VERDICT : {}", v.as_str()); + println!("══════════════════════════════════════════════════════════════════════"); + match v { + Verdict::Pass => { + println!(" High locality ({:.1}%) AND max fan-out {} <= 16.", loc_frac * 100.0, max_fo); + println!(" ⇒ On REAL ontology structure, 16-bit LOCAL references + a <=16"); + println!(" family frontier ARE real: the vast majority of subClassOf"); + println!(" references stay inside one top-basin, and no class needs more"); + println!(" than 16 distinct parent-basin pointers."); + } + Verdict::Marginal => { + println!(" Locality {:.1}% / max fan-out {}. The hypothesis is PARTIALLY", loc_frac * 100.0, max_fo); + println!(" supported: either locality is below the 90% target, or a few"); + println!(" classes exceed the 16-frontier (a wider frontier byte would fix"); + println!(" those). The local-pointer idea is plausible but not clean here."); + } + Verdict::Fail => { + println!(" Locality {:.1}% / max fan-out {}. 
The local-pointer assumption", loc_frac * 100.0, max_fo); + println!(" does NOT hold on this structure: too many subClassOf edges cross"); + println!(" basins, so 16-bit local references would miss their targets."); + } + } + println!(); + println!(" HONEST CAVEAT (mandatory): measured on REAL ontologies"); + println!(" (DOLCE-Ultralite, schema.org, Odoo, PROV-O, QUDT, OWL-Time), NOT on"); + println!(" Wikidata. Same KIND of frozen-ISA structure, ~10^3 classes not 10^8."); + println!(" This FALSIFIES-or-survives the claim on real data; it does NOT prove"); + println!(" it at Wikidata scale. The Wikidata P279 run remains the open probe."); + println!("══════════════════════════════════════════════════════════════════════"); +} + +// ── tests ───────────────────────────────────────────────────────────────────── + +#[cfg(test)] +mod tests { + use super::*; + + /// A tiny inline TTL exercising every parser path: prefixed object, + /// `:Local` object, comma-continued object list, a blank-node restriction + /// that MUST be skipped, the word "subClassOf" buried in a comment string + /// that MUST NOT be parsed, and a `.`-terminated statement. + const TINY_TTL: &str = r#" +@prefix ex: . +@prefix rdfs: . + +ex:Animal a owl:Class ; + rdfs:comment "Top type. Note: every Dog subClassOf Animal informally." ; + rdfs:label "Animal" . + +ex:Dog a owl:Class ; + rdfs:subClassOf ex:Animal ; + rdfs:label "Dog" . + +ex:Puppy a owl:Class ; + rdfs:subClassOf ex:Dog , + ex:Animal , + [ rdf:type owl:Restriction ; + owl:onProperty ex:hasParent ; + owl:someValuesFrom ex:Dog ] ; + rdfs:label "Puppy" . + +:LocalThing a owl:Class ; + rdfs:subClassOf :OtherLocal . 
+"#; + + #[test] + fn parser_extracts_expected_edges_only() { + let mut edges = parse_subclass_edges(TINY_TTL); + edges.sort(); + let mut expected = vec![ + ("ex:Dog".to_string(), "ex:Animal".to_string()), + ("ex:Puppy".to_string(), "ex:Dog".to_string()), + ("ex:Puppy".to_string(), "ex:Animal".to_string()), + (":LocalThing".to_string(), ":OtherLocal".to_string()), + ]; + expected.sort(); + assert_eq!(edges, expected, "parser must emit exactly the 4 named-class edges"); + } + + #[test] + fn parser_skips_comment_text_and_blank_nodes() { + let edges = parse_subclass_edges(TINY_TTL); + // The comment mentions "Dog subClassOf Animal" — must NOT produce an + // edge with a literal-word subject/object. + assert!( + !edges.iter().any(|(c, _)| c == "Dog" || c == "every"), + "comment text must not become an edge" + ); + // The blank-node restriction on Puppy must NOT add an edge to a '['. + assert!( + !edges.iter().any(|(_, p)| p.starts_with('[')), + "blank-node restriction must be skipped" + ); + // Exactly 4 edges total (Puppy has 2 named parents, not 3). + assert_eq!(edges.len(), 4); + } + + #[test] + fn parser_handles_angle_bracket_iri_objects() { + let ttl = r#" +ex:A a owl:Class ; + rdfs:subClassOf . +"#; + let edges = parse_subclass_edges(ttl); + assert_eq!( + edges, + vec![("ex:A".to_string(), "".to_string())] + ); + } + + /// Build a planted 2-basin subClassOf forest with a KNOWN number of + /// cross-basin edges, then assert the locality fraction is exactly the + /// hand-computed value. + /// + /// Basin A: rootA <- a1, a2, a3 (3 intra-basin edges) + /// Basin B: rootB <- b1, b2 (2 intra-basin edges) + /// Cross : a3 -> rootB (1 cross-basin edge) + /// Total 6 edges, 5 local ⇒ locality = 5/6. 
+ fn planted_two_basin() -> (Vec<(String, String)>, &'static str) { + let edges = vec![ + ("a1".into(), "rootA".into()), + ("a2".into(), "rootA".into()), + ("a3".into(), "rootA".into()), + ("b1".into(), "rootB".into()), + ("b2".into(), "rootB".into()), + ("a3".into(), "rootB".into()), // the one cross-basin edge + ]; + (edges, "5/6") + } + + #[test] + fn locality_on_planted_two_basin_is_five_sixths() { + let (iri_edges, _) = planted_two_basin(); + let graph = ClassGraph::from_edges(&iri_edges); + let basin = graph.assign_basins(); + + // a3 has parents {rootA, rootB}; smallest interned id wins as its + // representative basin. Interning order: a1,rootA,a2,a3,b1,rootB,b2. + // rootA interns before rootB, so a3's basin = rootA. + let id = |s: &str| graph.names.iter().position(|n| n == s).unwrap(); + assert_eq!(basin[id("a3")], basin[id("rootA")], "a3 should land in basin A"); + + let (local, total, frac) = locality(&graph.edges, &basin); + assert_eq!(total, 6); + assert_eq!(local, 5, "exactly one edge (a3->rootB) crosses basins"); + assert!((frac - 5.0 / 6.0).abs() < 1e-12, "locality must be exactly 5/6"); + } + + #[test] + fn two_clean_basins_give_perfect_locality_and_high_q() { + // No cross edges: two disjoint stars ⇒ locality = 1.0, Q should be + // clearly positive (two well-separated communities). + let iri_edges: Vec<(String, String)> = vec![ + ("a1".into(), "rootA".into()), + ("a2".into(), "rootA".into()), + ("b1".into(), "rootB".into()), + ("b2".into(), "rootB".into()), + ]; + let graph = ClassGraph::from_edges(&iri_edges); + let basin = graph.assign_basins(); + let (_, _, frac) = locality(&graph.edges, &basin); + assert!((frac - 1.0).abs() < 1e-12, "fully disjoint basins ⇒ locality 1.0"); + + let q = modularity_q(&graph, &basin); + assert!(q > 0.3, "two clean communities should give Q > 0.3, got {q}"); + } + + #[test] + fn fan_out_counts_distinct_parent_basins() { + // a3 has two parents in two different basins ⇒ fan-out 2 for a3. 
+ let (iri_edges, _) = planted_two_basin(); + let graph = ClassGraph::from_edges(&iri_edges); + let basin = graph.assign_basins(); + let (max_fo, hist) = fan_out(&graph, &basin); + // a3's two parents rootA, rootB are in basins {rootA, rootB} → 2 distinct. + assert_eq!(max_fo, 2, "a3 reaches 2 distinct parent-basins"); + // Most classes have fan-out 0 (roots) or 1 (single parent). + assert!(hist.contains_key(&0)); + assert!(hist.contains_key(&1)); + assert_eq!(hist.get(&2), Some(&1), "exactly one class has fan-out 2"); + } + + #[test] + fn verdict_thresholds() { + assert_eq!(verdict(0.95, 8), Verdict::Pass); + assert_eq!(verdict(0.95, 17), Verdict::Marginal); // fan-out over 16 + assert_eq!(verdict(0.80, 4), Verdict::Marginal); // locality below 90% + assert_eq!(verdict(0.50, 4), Verdict::Fail); + assert_eq!(verdict(0.0, 0), Verdict::Fail); + } + + #[test] + fn cycle_is_broken_defensively() { + // A 2-cycle must not infinite-loop; both nodes get a basin. + let iri_edges: Vec<(String, String)> = + vec![("x".into(), "y".into()), ("y".into(), "x".into())]; + let graph = ClassGraph::from_edges(&iri_edges); + let basin = graph.assign_basins(); + assert_eq!(basin.len(), 2); + assert!(basin.iter().all(|&b| b != usize::MAX), "every node assigned"); + } +} diff --git a/crates/lance-graph-contract/src/soa_view.rs b/crates/lance-graph-contract/src/soa_view.rs index 4f4695c0..51198c1d 100644 --- a/crates/lance-graph-contract/src/soa_view.rs +++ b/crates/lance-graph-contract/src/soa_view.rs @@ -72,6 +72,28 @@ pub trait MailboxSoaView { // add `fn qualia(&self) -> &[crate::qualia::QualiaI4_16D]` when the first consumer // (planner strategy selection) needs it; keep the read surface minimal until then. 
+ // NOTE (follow-up, P2 of the three-Markovs / EW64 reactive-seam ordering): + // the EpisodicWitness64 column accessor is intentionally omitted for now — + // add `fn episodic_witness(&self) -> &[EpisodicWitness64]` (same deferred- + // accessor pattern as `qualia` above) when the first consumer needs it. + // + // WHAT EpisodicWitness64 IS: it is **AriGraph living in the mailbox SoA view**. + // AriGraph is a Markov chain in the cold path (`lance-graph::graph::arigraph`: + // `episodic` / `witness_corpus` / `triplet_graph`); this column is that same + // episodic graph **promoted to the hot path** as a per-row SoA column — the + // `CausalEdge64` W-slot → witness arc (the deterministic "Markov #1" chain; + // see `witness_table.rs`: "the chain of W-references across edges forms a + // Markov-style belief-update arc through episodic-reference vectors"). EW64 is + // the *particle* (discrete, addressable, exact witness pointer); the windowed + // projection `arigraph::markov_soa` is the *wave*. Both ARE AriGraph. + // + // STATUS: `EpisodicWitness64` is NOT YET a code symbol (a queued design — see + // EPIPHANIES `E-EW64-IS-PREDICTIVE-PREFETCH`; the shipped seeds are the 6-bit + // W-slot `causal-edge::CausalEdge64` + `WitnessTable<64>`/`WitnessEntry` + + // `arigraph::{episodic,witness_corpus}`). Like every column the contract holds + // it stays AGNOSTIC: the witness arc carries SPO from ANY source — the + // *language* layer (DeepNSM/COCA) stays strictly upstream and never reaches in. + // ── per-row scalar read (mirrors `MailboxSoA::energy_at`) ── /// Energy at `row`. 
Default indexes [`energy`](MailboxSoaView::energy); override diff --git a/crates/lance-graph/src/graph/arigraph/markov_soa.rs b/crates/lance-graph/src/graph/arigraph/markov_soa.rs new file mode 100644 index 00000000..332a6ce9 --- /dev/null +++ b/crates/lance-graph/src/graph/arigraph/markov_soa.rs @@ -0,0 +1,325 @@ +// SPDX-License-Identifier: Apache-2.0 +// SPDX-FileCopyrightText: Copyright The Lance Authors + +//! `markov_soa` — the EXPLICIT, AUDITABLE, **vocabulary-agnostic** SoA-window +//! proposer (the Markov *wave* over a window of the Markov *particle* arc). +//! +//! ## markov_soa IS AriGraph (cold path promoted to hot path) +//! +//! This is not a generic projector that merely *lives in* AriGraph — **it is +//! AriGraph**. AriGraph is a Markov chain in the **cold path**; `markov_soa` is +//! that same chain **promoted to the hot path** (the per-mailbox SoA). Same +//! object, same agnostic nature, hot instead of cold. EW64 / the `CausalEdge64` +//! W-slot → witness arc is the *particle* (discrete, addressable, exact); this +//! windowed projection is the *wave* (accumulated resonance). Both are AriGraph. +//! +//! It previously lived (wrongly) in `deepnsm`, which made the agnostic hot-path +//! graph depend on a *language sensor* — a layer inversion. Dependency flows +//! AriGraph(core) → sensor, never the reverse. +//! +//! ## AriGraph is agnostic — and is NOT necessarily English +//! +//! AriGraph holds SPO from ANY source (business, GoBD, Wikidata, English text); +//! its agnosticism is structural — the SoA row is three **opaque `u16` ranks** +//! that carry no language. The match metric is **AriGraph's own +//! `cam_pq::DistanceTables`** (the graph's native semantic distance), injected +//! as `Fn(u16, u16) -> u8` so the projector itself names no encoding. +//! +//! **The language layer stays UPSTREAM, in DeepNSM, and never reaches in here.** +//! DeepNSM / COCA-4096 / the grammar templates are the *English-language input +//! 
sensor*: they scan flat data (usually English), parse it, and EMIT SPO +//! triplets into AriGraph. They must stay English — the grammar templates get +//! messy the instant they are not. Injecting a COCA/language distance into this +//! hot-path graph would be the GoBD-with-Rumi error: running a *language* lens +//! over an *agnostic* graph. Don't. The injected distance here is AriGraph's +//! cam_pq, not a language table. SPO *can* be English (when DeepNSM produced it), +//! but the SoA / AriGraph mailbox-view is never *forced* into a language. +//! +//! ## Strictly a fuzzy proposer — "hybrid+ autocomplete" (Markov #2) +//! +//! Output is a **best-guess match** (System-1 priming, "feels like a Sicilian +//! with a pinch of death trap"): it proposes *where to look* / *what this +//! resembles*, **never asserts truth**. The deterministic particle chain +//! (CE64→witness arc + the 32k SPO-W triplets) ALWAYS confirms. A wrong guess +//! costs a cheap reprioritization, never a wrong answer. **Invariant: the fuzz +//! is only legitimate while leashed to the deterministic chain that confirms it +//! — an unleashed bundle degrades into "sink-in-and-pray" (Markov #3).** +//! +//! ## STATUS: provisional / unverified-offline +//! +//! Authored against the grounded `contract::soa_view::MailboxSoaView` surface, +//! but `lance-graph` core does NOT build in the offline sandbox (its +//! `lance`/`datafusion`/`arrow` deps fetch from crates.io). Compile-verify on a +//! full checkout before relying on it. The truly-correct home is *inside the +//! EW64-in-SoA seam* (P1+P2 of the three-Markovs ordering); this module is the +//! agnostic wave-projector that seam will host. + +use lance_graph_contract::soa_view::MailboxSoaView; + +/// An SPO triple as three **opaque** `u16` ranks — vocabulary-agnostic. The +/// class above the mailbox says which vocabulary decodes these (COCA / business +/// / QID); the rank itself carries no meaning (C2 agnostic register). 
+#[derive(Debug, Clone, Copy, PartialEq, Eq)] +pub struct SpoRanks { + /// Subject rank (opaque; vocabulary resolved by the class). + pub s: u16, + /// Predicate rank (opaque). + pub p: u16, + /// Object rank (opaque; may be a no-role sentinel). + pub o: u16, +} + +/// One row's contribution to a projection, recorded for audit — the thing that +/// makes the wave NOT a black box: every fold is attributable. All integer. +#[derive(Debug, Clone, Copy, PartialEq, Eq)] +pub struct RowContribution { + /// SoA row index that contributed. + pub row: usize, + /// `entity_type`/`class_id` of the row (the class that resolves its vocabulary). + pub class_id: u16, + /// `|delta|` from the focal row — the recency/proximity prior, integer. + pub proximity: u32, +} + +/// The provenance of a projection: the ordered list of what folded in. A +/// projection + this + the SoA fully reconstruct the wave — nothing lost. +#[derive(Debug, Clone, Default, PartialEq)] +pub struct BundleProvenance { + /// Source mailbox id (which cohort this projection summarizes). + pub mailbox_id: u32, + /// Rows that contributed, in fold order. + pub contributions: Vec<RowContribution>, +} + +impl BundleProvenance { + /// How many rows contributed a triple. + #[must_use] + pub fn row_count(&self) -> usize { + self.contributions.len() + } +} + +/// A deterministic, vocabulary-agnostic projection of a SoA window: the opaque +/// rank-triples in a ±radius window + their provenance. The triples stay +/// **addressable** (no superposition destroys the register); matching is an +/// injected per-vocabulary distance, never float cosine, never a learned embed. +#[derive(Debug, Clone, Default, PartialEq)] +pub struct WaveProjection { + /// The opaque rank-triples in the window, in fold order — the explicit content. + pub triples: Vec<SpoRanks>, + /// The replayable construction. + pub provenance: BundleProvenance, +} + +impl WaveProjection { + /// **Best-guess match** to another projection — the System-1 priming read.
+ /// Deterministic, integer: for each triple here take the nearest triple + /// there by mean per-role distance under the injected `dist` closure, then + /// average. `dist(a, b)` is **AriGraph's own** `cam_pq::DistanceTables` + /// (the graph's native semantic distance), injected so this function names + /// no encoding. NOT a language/COCA table — language stays upstream in + /// DeepNSM. `0.0` if either side is empty. + #[must_use] + pub fn best_guess_match(&self, other: &WaveProjection, dist: impl Fn(u16, u16) -> u8) -> f32 { + if self.triples.is_empty() || other.triples.is_empty() { + return 0.0; + } + let mut acc = 0.0f32; + for a in &self.triples { + let mut nearest = u8::MAX; + for b in &other.triples { + let d = ((dist(a.s, b.s) as u16 + dist(a.p, b.p) as u16 + dist(a.o, b.o) as u16) + / 3) as u8; + if d < nearest { + nearest = d; + } + } + // similarity = 1 - normalized distance (caller's table is the metric; + // u8::MAX = maximally dissimilar). Integer-derived, deterministic. + acc += 1.0 - (nearest as f32 / u8::MAX as f32); + } + acc / self.triples.len() as f32 + } +} + +/// Folds a [`MailboxSoaView`] window into the opaque rank-triples + provenance. +/// Deterministic: same SoA + focal + radius ⇒ identical triples and provenance. +/// `row_triple(row) -> Option<SpoRanks>` resolves a row to its triple (from the +/// deterministic SoA/AriGraph state); untripled rows are skipped, not recorded. +/// The projector invents nothing and names no vocabulary. +#[derive(Debug, Clone, Copy)] +pub struct SoaWavePrimer { + /// ±window radius over mailboxes (the Markov proximity prior). + pub radius: u32, +} + +impl Default for SoaWavePrimer { + fn default() -> Self { + Self { radius: 5 } + } +} + +impl SoaWavePrimer { + /// New primer with an explicit ±radius window. + #[must_use] + pub fn new(radius: u32) -> Self { + Self { radius } + } + + /// Project the window centered on `focal_row`.
+ pub fn project<V, F>(&self, soa: &V, focal_row: usize, row_triple: F) -> WaveProjection + where + V: MailboxSoaView, + F: Fn(usize) -> Option<SpoRanks>, + { + let mut triples = Vec::new(); + let mut contributions = Vec::new(); + let n = soa.n_rows(); + let r = self.radius as i32; + let class_ids = soa.class_id(); + for d in -r..=r { + let row_i = focal_row as i32 + d; + if row_i < 0 || row_i as usize >= n { + continue; + } + let row = row_i as usize; + let Some(t) = row_triple(row) else { continue }; + triples.push(t); + contributions.push(RowContribution { + row, + class_id: class_ids[row], + proximity: d.unsigned_abs(), + }); + } + WaveProjection { + triples, + provenance: BundleProvenance { + mailbox_id: soa.mailbox_id(), + contributions, + }, + } + } +} + +#[cfg(test)] +mod tests { + use super::*; + use lance_graph_contract::collapse_gate::MailboxId; + use lance_graph_contract::kanban::KanbanColumn; + + struct FakeSoa { + entity_type: Vec<u16>, + } + impl MailboxSoaView for FakeSoa { + fn mailbox_id(&self) -> MailboxId { + 42 + } + fn n_rows(&self) -> usize { + self.entity_type.len() + } + fn w_slot(&self) -> u8 { + 0 + } + fn current_cycle(&self) -> u32 { + 0 + } + fn phase(&self) -> KanbanColumn { + KanbanColumn::Planning + } + fn energy(&self) -> &[f32] { + &[] + } + fn edges_raw(&self) -> &[u64] { + &[] + } + fn meta_raw(&self) -> &[u32] { + &[] + } + fn entity_type(&self) -> &[u16] { + &self.entity_type + } + } + + fn row_triple(row: usize) -> Option<SpoRanks> { + Some(SpoRanks { + s: row as u16, + p: (row + 1) as u16, + o: (row + 2) as u16, + }) + } + fn soa(n: usize) -> FakeSoa { + FakeSoa { + entity_type: (0..n as u16).collect(), + } + } + + #[test] + fn projection_is_deterministic() { + let s = soa(20); + let p = SoaWavePrimer::new(3); + let a = p.project(&s, 10, row_triple); + let b = p.project(&s, 10, row_triple); + assert_eq!(a, b); + assert_eq!(a.provenance.row_count(), 7); + } + + #[test] + fn window_clamps_and_records_proximity() { + let s = soa(20); + let proj =
SoaWavePrimer::new(5).project(&s, 1, row_triple); + assert_eq!(proj.provenance.row_count(), 7); // rows 0..=6 + assert_eq!( + proj.provenance + .contributions + .iter() + .find(|c| c.row == 1) + .unwrap() + .proximity, + 0 + ); + assert_eq!( + proj.provenance + .contributions + .iter() + .find(|c| c.row == 6) + .unwrap() + .proximity, + 5 + ); + assert_eq!(proj.provenance.mailbox_id, 42); + } + + #[test] + fn match_uses_injected_distance_no_vocabulary_named() { + // identity distance: equal ranks → near (0), else far (max). + let dist = |x: u16, y: u16| -> u8 { + if x == y { + 0 + } else { + u8::MAX + } + }; + let s = soa(20); + let p = SoaWavePrimer::new(2); + let here = p.project(&s, 10, row_triple); + let same = p.project(&s, 10, row_triple); + let far = p.project(&s, 2, row_triple); + let self_m = here.best_guess_match(&same, dist); + let far_m = here.best_guess_match(&far, dist); + assert!( + self_m > far_m, + "identical window must out-resemble a distant one" + ); + assert!((self_m - 1.0).abs() < 1e-6, "exact-twin match = 1.0"); + } + + #[test] + fn empty_matches_zero() { + let dist = |_: u16, _: u16| 0u8; + let empty = WaveProjection::default(); + let s = soa(5); + let ne = SoaWavePrimer::new(2).project(&s, 2, row_triple); + assert_eq!(empty.best_guess_match(&ne, dist), 0.0); + assert_eq!(ne.best_guess_match(&empty, dist), 0.0); + } +} diff --git a/crates/lance-graph/src/graph/arigraph/mod.rs b/crates/lance-graph/src/graph/arigraph/mod.rs index faf4e889..1bd94729 100644 --- a/crates/lance-graph/src/graph/arigraph/mod.rs +++ b/crates/lance-graph/src/graph/arigraph/mod.rs @@ -7,6 +7,7 @@ pub mod episodic; pub mod language; +pub mod markov_soa; pub mod orchestrator; pub mod retrieval; pub mod sensorium;