diff --git a/.claude/board/AGENT_LOG.md b/.claude/board/AGENT_LOG.md
index ac8c4849..b6e7fc8e 100644
--- a/.claude/board/AGENT_LOG.md
+++ b/.claude/board/AGENT_LOG.md
@@ -1,3 +1,13 @@
+## 2026-06-16 — 5-specialist framing of #497 OCR-transcode plans → plans rebaselined to #498 + probes spec'd
+
+**Main thread (Opus 4.8 1M) + 5 Opus specialists in parallel** (cascade-architect / family-codec-smith / palette-engineer / dto-soa-savant / truth-architect), each read the 7 merged #497 plans + post-#498 source in full (Rule 7 — read, don't grep-judge). Operator: *"review the plans against your awareness of the new architecture incl. the last 15 PR arc (Morton Cascade + Helix 48 + turbovec residue) — send 5 specialist framing it."* See `EPIPHANIES.md` E-OCR-PLAN-DRIFT-1 for the consolidated framing.
+
+**Two showstoppers:** (1) the "reversible without a hash" migration rationale is FALSE in code (no `residue→rank` inverse; `vocabulary.rs` is a stored string-table keyed by rank) — truth-architect; (2) the "Morton-tile stacked-pyramid perturbation-shader cascade" does NOT exist (0 hits; Morton rejected for Hilbert) — cascade-architect. **Convergent drift (≥4 lenses):** dead 48 B HelixResidue (now 6 B), D-OCR-50 already shipped (#498), `ValueSchema::Ocr`/`Meta`-5-jobs/`TurbovecResidue`-wrong-carrier §0 tripwires, HHTL = coherent address-trie not a blur.
+
+**Outcome:** all 7 plans corrected on `claude/wonderful-hawking-lodtql` (rebaselined #496→#498, Morton purged, reversibility reframed, §0 tripwires fixed, master critical-path fixed = the open CodeRabbit Major on #497). New `ocr-probes-v1.md` (4 gating probes OCR-RT/DET/POST/SCHEMA + 3 cascade perf probes). **OCR-SCHEMA shipped as a contract test** (`ocr::tests::ocr_schema_fit_rides_existing_preset_no_new_variant`). contract 620 lib green; fmt clean. Both #497 + #498 review threads resolved/dispositioned.
+
+**Next:** open the follow-up PR; run OCR-DET (deepnsm example) + OCR-RT (needs deepnsm+helix wiring) before any transcode code is funded.
+
 ## 2026-06-15 — integrated-cognitive-planner-v1: 3-hardener verdicts folded (§9) + §0 anti-invention guardrail

 **Main thread (Opus 4.8 1M) + 3 Opus brutal hardeners** (PP-13 brutally-honest-tester / PP-15 baton-handoff-auditor / PP-16 preflight-drift-auditor), all pinned to the plan by `file:line`. Verdicts: **HOLD / CATCH-LATENT / READY-TO-DISPATCH** — all fixes spec-text, no architectural rewrite; all three confirmed the grounding + dependency-wall claims + measure-first ratio.
diff --git a/.claude/board/EPIPHANIES.md b/.claude/board/EPIPHANIES.md
index 3a3f949a..9fbffac0 100644
--- a/.claude/board/EPIPHANIES.md
+++ b/.claude/board/EPIPHANIES.md
@@ -1,3 +1,26 @@
+## 2026-06-16 — E-OCR-PLAN-DRIFT-1 — the #497 OCR-transcode plans drifted from the substrate in 6 ways; 2 were showstoppers
+
+**Status:** FINDING (5-specialist framing — cascade-architect / family-codec-smith / palette-engineer / dto-soa-savant / truth-architect, each read the merged plans + source in full).
+**Confidence:** High — every claim cited plan file:line vs current source file:line; convergent across ≥4 lenses for the load-bearing ones.
+
+**Context.** #497 (Tesseract→tesseract-rs transcode plan family, 7 design docs) and #498 (helix `Signed360` + GUID keystone) merged within hours. The #497 plans were authored against the pre-#498 branch, so they reason against a substrate that shifted under them. Five specialists framed the merged plans against the post-#498 architecture.
+
+**The two showstoppers:**
+1. **The "reversible without a hash" rationale is false in code** (truth-architect). The migration's headline — "OCR text reconstructs from residue + codebook, no string column" — has no support: `deepnsm/vocabulary.rs` maps `rank→&str` via a stored table, every decode entry point takes a *known* rank as input, and there is no `residue→rank` inverse (helix encode is lossy). The "reversible residue" was a renamed stored string-table keyed by index — the very thing it claimed to avoid.
+2. **The "Morton-tile stacked-pyramid perturbation-shader cascade" does not exist** (cascade-architect). 0 hits in either repo; Morton is explicitly *rejected* for Hilbert (`linalg/hilbert.rs:50`). Three deliverables (D-OCR-52, the reconstruction round-trip, the whole soa-centroid synthesis plan) were built on a fabricated subsystem name.
+
+**Convergent drift (≥4 lenses):**
+- Plans argue "HelixResidue is 48 B, category-wrong, don't use it" — #498 made it **6 B** (a stored `Signed360` place index), which IS the keep-the-index design the plan wanted. Every byte budget was dead (Full 154→112, carve `[32,186)`→`[32,144)`).
+- D-OCR-50 (`LayoutBlock::to_node_row`) already SHIPPED in #498 — described as future work.
+- §0 tripwires: `ValueSchema::Ocr` (5th preset variant = anti-invention violation; ride Full/Compressed); `Meta` u64 overloaded 5 ways (split → Energy/Plasticity/residues/content-store); `TurbovecResidue` is the *edge* codec (rank-only fidelity) — wrong carrier for glyph→word (use DeepNSM CamCodes).
+- HHTL Doc→Page→Block→Line→Token onto HEEL/HIP/TWIG+family is a *coherent address-trie, NOT a Frankenstein* (family-codec + cascade) — but it spends the similarity-basin semantics on layout, so OCR nodes must be `classid`-marked as layout-addressed.
+
+**Disposition.** All 7 plans corrected (rebaselined to #498; Morton purged → real primitives `framebuffer::build_mipmap_pyramid` / `splat3d/depth_cascade` / CAKES; reversibility reframed to identity→content-store + codebook-as-repair-signal; §0 tripwires fixed; master critical-path fixed per the open CodeRabbit Major). Unmeasured claims (int8-exact LSTM, bit-reproducible diff, 200k-LOC 1:1 layout) gated behind 4 probes in `ocr-probes-v1.md` (OCR-RT/DET/POST/SCHEMA); **OCR-SCHEMA shipped as a contract test** proving OCR rides an existing preset (no new `ValueSchema` variant).
+
+**Lesson.** When two PRs touch the same substrate within hours, the later merge silently invalidates the earlier plan's premises. Plans citing sizes/budgets/file:line must be rebaselined the moment a substrate PR lands — and "reversible / never-stored" claims must be PROVEN against the actual decode path before becoming a migration's rationale.
+
+---
+
 ## 2026-06-13 — E-TURBOVEC-AMX-WRONG-TOOL-1 — AMX accelerates the operation TurboQuant deliberately removed

 **Status:** FINDING (benchmarked; AVX-512+VNNI host, `amx_available=false`).
diff --git a/.claude/plans/ocr-canonical-soa-integration-v1.md b/.claude/plans/ocr-canonical-soa-integration-v1.md
index a2e9ecff..2bf5c670 100644
--- a/.claude/plans/ocr-canonical-soa-integration-v1.md
+++ b/.claude/plans/ocr-canonical-soa-integration-v1.md
@@ -2,7 +2,7 @@

 > **Type:** plan (sub-plan — the one that binds OCR to the lance-graph substrate). Deliverables D-OCR-50/51/52/53.
 > **Status:** PLANTED 2026-06-15 — design only. THIS is "use the new architecture we raced for."
-> **Front:** post-#496. Integration surface = `canonical_node.rs` (`NodeGuid`/`EdgeBlock`/`EdgeCodecFlavor`/`NodeRow`/`ValueTenant`/`ValueSchema`/`NodeRowPacket`) + `class_view.rs` (`ClassView`/`FieldMask`).
+> **Front:** post-#498 (helix `Signed360` right-sized **48 B → 6 B**; OCR keystone `LayoutBlock::to_node_row` + `BlockKind::entity_type` + `classid_read_mode` **already SHIPPED** in `ocr.rs`; `ENVELOPE_LAYOUT_VERSION` = 2; value carve `[32,144)`, Full 112 B / Compressed 56 B). Integration surface = `canonical_node.rs` (`NodeGuid`/`EdgeBlock`/`EdgeCodecFlavor`/`NodeRow`/`ValueTenant`/`ValueSchema`/`NodeRowPacket`) + `class_view.rs` (`ClassView`/`FieldMask`) + shipped `ocr.rs`.
 > **Canon anchors:** OGAR/CLAUDE.md P0 GUID; lance-graph/CLAUDE.md SoA node (`4ea6ac9`); soa-three-tier-model; DeepNSM crate (`lance-graph/crates/deepnsm`); helix/CAM-PQ (`crates/helix`, `bgz-tensor` CAM-PQ).
 > **Skip-by-rule:** OCR introduces NO bespoke row geometry. It rides the existing value-tenant carve.

@@ -35,41 +35,69 @@ point of the splat-native / "one representation, many views" doctrine, applied t
 - Mint an OCR class family in OGAR (`ogar-ontology`): `Document → Page → Block → Line → Token`, with leaf token subtypes (`Word`, `Number`, `Date`, `Currency`,
-  `Glyph`, `TableCell`). Until OGAR mints them, hardcode the classid prefix space
-  per the reserve-don't-reclaim ladder (the classid bytes stay reserved at offset 0).
+  `Glyph`, `TableCell`). **Stay at classid `0x0000_0000` (bootstrap address,
+  identity-only discrimination) until OGAR actually mints the class** — do NOT
+  hardcode a non-zero classid prefix, which would wake prefix-routing with no
+  registry entry and fall silently to `ReadMode::DEFAULT` (matches shipped `ocr.rs`,
+  which writes `NodeGuid::new(classid, 0,0,0, FAMILY_DEFAULT, identity)`).
+- **HHTL = a layout-address trie for OCR nodes, NOT a similarity cascade.** The
+  5 layout levels map onto 3 key tiers + family + identity as a *prefix*
+  decomposition: Document/Page/Block → HEEL/HIP/TWIG (radix-walk prefix), Line →
+  family (locality basin), Token → identity. This deliberately forgoes the
+  *similarity-basin* reading of HEEL/HIP/TWIG/family (canon's coarse→fine
+  neighbourhood tiers); the OCR `classid` marks these nodes as layout-addressed so no
+  cross-document family-purity / two-basin benchmark runs against these coordinates.
 - `ClassView` for the OCR class declares `edge_codec_flavor` (`CoarseOnly`) and
-  `value_schema` (the OCR preset, §3).
+  `value_schema` (ride `Full` POC / `Compressed` — no new variant, §3).

 ## 3. OCR `ValueSchema` preset over EXISTING tenants (D-OCR-51)

-The 480-byte value slab already carves into `VALUE_TENANTS`. An OCR token is **not
-a stored string and not a hash** — it is the *terminal of the perturbation cascade*,
-reconstructed exactly like every other node. Text = codebook index + residue.
+The 480-byte value slab already carves into `VALUE_TENANTS`. An OCR token's
+**recognized string is NOT stored in the node** (I-VSA-IDENTITIES, enforced by
+shipped `ocr.rs:97-101`): the node is the *identity that points to* OCR content;
+the string + pixel geometry live in an external content store keyed by `identity`.
+The value tenants carry typed scalars + a compressed similarity coordinate for
+*repair / disambiguation*, never a reversible text payload.

-| Tenant (existing) | OCR role |
+| Tenant (existing, post-#498 sizes) | OCR role |
 |---|---|
-| helix residue = **centroid attention field** (NOT a stored code) | The 24-bit golden index is the **query↔centroid alignment** (φ-spiral direction = how this point attends to its place-centroid); the Morton-tile stacked-pyramid perturbation-shader is **multi-scale attention** (coarse centroid → fine perturbation = HHTL cascade in residue space). The field is **evaluated from the φ-template, never stored** ("8K resolution at Super-8 cost" — only the index is kept). Place=HHTL centroid; residue=perturbation off it. The 48-byte `ValueTenant::HelixResidue` is category-wrong (stores a field that must be computed) — do NOT use it. |
-| `TurbovecResidue` (16 B, PQ) | PQ edge residue → CAKES nearest-valid-token search over the codebook |
-| `Meta` (u64) | codebook index/anchor + confidence + char-confusion/NSM-repair flags + recoder-code fallback for true-OOV |
+| `HelixResidue` (**6 B = 48 bit `Signed360`**, NOT 48 B) | A **stored** golden-spiral *place index* (rim 3 B + sign-partition polar 1 B + golden azimuth 2 B; `helix/src/residue.rs:63-116`). The 6 B IS the kept index; the multi-scale field is the *deterministic decode* of it (`RollingFloor::quantize` / `HemispherePoint::lift`, pure `&self`) — "8K resolution at Super-8 cost." It is a place code, **not** a confidence carrier. (The old "48-byte, category-wrong, do NOT use it" line was written pre-#498 against a bits→bytes slip; the tenant is **6 B** and is exactly the keep-the-index design — use it.) |
+| `TurbovecResidue` (16 B, `Pq32x4`) | The **edge-block** PQ residue (`EdgeCodecFlavor::Pq32x4`, rank-preserving / absolute-distance-lossy, ICC 0.11–0.29). NOT the glyph→word carrier — nearest-**valid**-token needs absolute distance, so the glyph→word search uses **DeepNSM's `Codebook` CamCodes** (6×256×16, 6 B; `deepnsm/src/codebook.rs`) + `vocabulary.rs` reverse, not this tenant. |
+| `Meta` (u64) | A SMALL codebook anchor only (a ≤12-bit vocab rank fits). It does NOT carry confidence (→ `Energy`, shipped `ocr.rs:112-114`), repair flags (→ `Plasticity`), or the OOV recoder-code (→ external content store). `Meta` is the cognitive `MetaWord`; overloading it 5 ways is an I-LEGACY-API-FEATURE-GATED hazard (one u64, different meaning per class). Prefer a future `ValueTenant::OcrEvidence` (OD-1) for OCR-specific evidence. |
 | `EntityType` (u16) | token subtype (Word/Number/Date/Glyph/TableCell) |
 | `Plasticity` (u32) | correction history / last-repair stamp |

-**Reconstruction (this is the round-trip, and it answers Codex P1):**
-`text ⇄ codebook_index(Meta) + field-eval(helix 24-bit golden-index attention ⊕ TurbovecResidue PQ)`. Decode =
-the DeepNSM Morton-tile **stacked-pyramid perturbation-shader cascade** applied to
-the residue → CAKES nearest-valid-token over the codebook (DeepNSM `vocabulary` /
-coca `word_frequency`) → the word. No `Fingerprint` hash, no string column. The
-reversibility lives in residue + codebook, which is the architecture's whole point.
-
-**True-OOV (no codebook neighbor — a raw code like `69B8`):** falls back to the
-**recoder-code residue** — `recodebeam` already emits recoder codes, not pixels, so
-the codes themselves are the reversible payload in `Meta`, repaired by the
-char-confusion grammar (D-OCR-52). Still a residue, never a hash.
-
-**ValueSchema:** `Cognitive` does NOT include `HelixResidue`/`TurbovecResidue`, so
-OCR needs a dedicated **`ValueSchema::Ocr`** = `FieldMask` over
-{`HelixResidue`,`TurbovecResidue`,`Meta`,`EntityType`,`Plasticity`}. Selection only;
-moves no tenant (canon: tenants never move/reuse).
+**Recognition vs reconstruction — be precise (corrects the pre-#498 framing).**
+The recognized string is recovered by **identity → external content-store lookup**,
+not by inverting a residue. There is **no `residue → codebook-rank` inverse** in the
+code: `deepnsm/vocabulary.rs` maps `rank → &str` via a stored table, and every
+`nearest_words(rank,k)` / `word_neighbors(word,k)` entry point takes a *known*
+rank/word as input. So "reversible without a hash" is NOT a property of the substrate
+today — the codebook code is a **repair / disambiguation signal**, not a lossless
+text payload. The honest mapping:
+- **text** → external content store keyed by `identity` (I-VSA-IDENTITIES).
+- **`Meta` codebook anchor + `TurbovecResidue` / `HelixResidue`** → similarity
+  coordinates that feed *repair* (DeepNSM plausibility + char-confusion + CAKES
+  nearest-valid-token), NOT round-trip text recovery.
+- the multi-scale decode uses the **real** primitives — `framebuffer::build_mipmap_pyramid`
+  / the HHTL `splat3d/depth_cascade` / the helix φ-template / the CAKES ladder
+  (`high_heel.rs:16-24`) — there is **no** "Morton-tile stacked-pyramid perturbation-shader"
+  in either repo (Morton is explicitly rejected for Hilbert in `linalg/hilbert.rs:50`).
+- **Gate:** if a measured `residue → rank` round-trip is ever wanted, it must be
+  PROVEN first (probe **OCR-RT**, see `ocr-probes-v1.md`) — it is CONJECTURE today.
+
+**True-OOV (a raw code like `69B8`):** `recodebeam` emits recoder codes; the code +
+its char-confusion repair (D-OCR-52) live in the content store with the token text,
+keyed by `identity`. Not bundled into the node.
+
+**ValueSchema:** do **NOT** add a 5th `ValueSchema::Ocr` enum variant — that is a
+contract-surface addition against the #496 §0 anti-invention guardrail. Shipped
+`ocr.rs` already transcodes by riding the POC-`Full` default (`classid_read_mode →
+Full`) and writing only the tenants it populates. Post-POC, OCR rides the existing
+**`Compressed`** preset (already = Fingerprint + HelixResidue + TurbovecResidue +
+EntityType) — or, if a distinct tenant set is truly needed, **mint an OCR class** in
+OGAR whose `ClassView` selects existing tenants (the §0-sanctioned opt-in route).
+New capability = new column/class, never a new enum variant.

 ## 4. Repair: DeepNSM + CAM/PQ nearest-valid-token (D-OCR-52)

@@ -81,12 +109,18 @@ The recognizer emits candidates+confidence; repair is the brainstem we already h
   `deepnsm/word_frequency`.)
 - **Word layer = `deepnsm`:** `vocabulary` → `codebook` → `parser`/`pos` → `encoder` → `similarity`/`cam64`/`crystal_neighborhood`. Word-level plausibility + disambiguation.
-- **Nearest-valid-token = helix / CAM-PQ / CAKES:** the glyph `TurbovecResidue`
-  (PQ) + `HelixResidue` feed CAKES nearest-valid-token; CHAODA (clustered-hierarchical outlier detection) flags anomalous
-  tokens (likely-misrecognized). This is `bgz-tensor` CAM-PQ + `crates/helix`.
-
-Repaired token writes back: corrected text → `Fingerprint`/`EntityType`, repair
-provenance → `Meta`/`Plasticity`.
+- **Nearest-valid-token = DeepNSM codebook + the L1 CAM-PQ cascade (NOT Hamming):**
+  glyph → **DeepNSM `Codebook` CamCodes** (`deepnsm/codebook.rs`) → `vocabulary`
+  reverse, ranked through the **L1** CAM-PQ stroke cascade (`ndarray::hpc::cam_pq`
+  `cascade_query`) + CLAM DFS-sieve (`clam.rs` `knn_dfs_sieve`); CHAODA
+  (clustered-hierarchical outlier detection) flags anomalous tokens. The Hamming
+  σ-band `Cascade` (`cascade.rs`) is for binary fingerprints only — palette/codebook
+  data is L1 (see `bgz-tensor::hdr_belichtung`). `TurbovecResidue` is the edge codec,
+  not the glyph carrier (§3).
+
+Repaired token writes back: corrected **text → external content store** (keyed by
+`identity`, I-VSA-IDENTITIES — never into `Fingerprint`); token subtype → `EntityType`;
+confidence → `Energy`; repair provenance / last-repair stamp → `Plasticity`.

 ## 5. Persistence + planner (kv-lance / surreal)

@@ -101,17 +135,25 @@ provenance → `Meta`/`Plasticity`.
 The transcode oracle (D-OCR-2x) makes OCR a **deterministic regression source for the whole SoA migration**: the same line crop → C++ Tesseract text AND Rust port
-text AND the resulting `NodeRow` bytes. Because every stage is supposed to be
-bit-reproducible (DeepNSM bit-reproducible, envelope version-stamped, CausalEdge64
-locked), a golden-file diff over (crop → NodeRow) exercises exactly the muscles the
-migration must harden: `ndarray::hpc` hydration, the envelope LE round-trip, and
-SIMD numeric exactness. OCR is the best external oracle the substrate has.
+text AND the resulting `NodeRow` bytes. The VSA bind/bundle path is integer
+(bit-reproducible), but **DeepNSM's repair / similarity stage is f32** (`encoder.rs`
+similarity → f32; `pipeline.rs` weighted blend) — so the **frozen-mode** golden diff
+MUST exclude (or pin) the f32 repair stage, and the helix `floor_version` MUST be
+fixed in the golden bytes (else the rolling floor rolls and the diff spuriously
+fails). With those carve-outs (probe **OCR-DET**), a golden-file diff over
+(crop → NodeRow) exercises exactly the muscles the migration must harden:
+`ndarray::hpc` hydration, the envelope LE round-trip, and SIMD numeric exactness.
+OCR is a strong external oracle for the substrate.

 ## 7. Deliverables

-- **D-OCR-50:** OCR class + HHTL address scheme; `ClassView` impl for OCR class.
-- **D-OCR-51:** `ValueSchema` OCR preset (FieldMask over existing tenants); a token
-  round-trips token→NodeRow→token with no geometry change.
+- **D-OCR-50 (PARTIALLY SHIPPED in #498):** block→`NodeRow` already lands via
+  `ocr.rs` `LayoutBlock::to_node_row` + `BlockKind::entity_type`. Remaining: (a)
+  token-grain nodes (OD-50a), (b) populate the HHTL layout-trie (HHT currently 0),
+  (c) mint the OGAR OCR class. Re-cast as *extend the shipped `ocr.rs`*, not build.
+- **D-OCR-51:** OCR rides the `Full` POC / `Compressed` preset (**NO** new
+  `ValueSchema` variant — §0); a token lands token→NodeRow with identity→content-store
+  text recovery (no in-row reversible text — see §3).
 - **D-OCR-52:** DeepNSM + character-confusion layer + CAM/PQ repair wired; a known OCR-garbage fixture (`69B8`, `rn`→`m`) is repaired by plausibility.
 - **D-OCR-53:** golden-file (crop → NodeRow bytes) regression green, shared with the
diff --git a/.claude/plans/ocr-probes-v1.md b/.claude/plans/ocr-probes-v1.md
new file mode 100644
index 00000000..ee3e3203
--- /dev/null
+++ b/.claude/plans/ocr-probes-v1.md
@@ -0,0 +1,98 @@
+# OCR Transcode — Gating Probes v1
+
+> **Type:** plan (probe queue for the tesseract-rs OCR transcode family).
+> **Status:** PLANTED 2026-06-16 — from the 5-specialist framing of #497 (cascade /
+> family-codec / palette / dto-soa / truth-architect) against the post-#498 substrate.
+> **Why:** the #497 plan family makes several load-bearing claims that are **asserted,
+> not measured**. Per the workspace insight-update cycle (CLAUDE.md: Claim → Probe →
+> Run → FINDING/correct), these probes gate the expensive transcode work. Run the
+> cheap ones (< 3 h, existing crates) BEFORE funding the ~200k-LOC layout transcode.
+> **Cross-ref:** `tesseract-rs-transcode-master-v1.md`, `ocr-canonical-soa-integration-v1.md`,
+> `soa-centroid-attention-field-synthesis-v1.md`.
+
+---
+
+## The four primary gating probes
+
+### OCR-RT — residue → codebook-rank round-trip (settles the "reversible without a hash" claim)
+
+- **Claim under test:** "an OCR token is reversible through residue + codebook, no
+  hash/string column" (the migration's headline rationale).
+- **Current evidence (FINDING):** there is **no `residue → rank` inverse** in code.
+  `deepnsm/vocabulary.rs` maps `rank → &str` via a stored table; every
+  `nearest_words(rank,k)` / `word_neighbors(word,k)` entry point takes a *known*
+  rank/word as input. Helix `encode` is lossy quantization; `from_bytes` recovers the
+  6 bytes, never the source `n`. So the round-trip does not exist today.
+- **Probe:** given a word → its codebook rank → encode to (helix `Signed360` 6 B ⊕
+  turbovec PQ residue), attempt to recover the **rank** from the residue bytes ALONE
+  (no stored-rank lookup). Needs deepnsm `Codebook` + helix `Signed360` wired in one
+  crate (they are not today — that wiring is itself part of the gate).
+- **Pass:** ≥ 99 % of the 4096-word vocab round-trips residue→rank→word exactly.
+- **Fail:** < 99 %, OR recovery requires the original rank as input ⇒ "reversible
+  without a hash" is FALSE; the corrected plans already say text = identity →
+  content-store lookup, codebook = repair signal (this probe confirms or lifts that).
+- **Cost:** ~80 LOC once deepnsm+helix are co-located; the wiring is the real work.
+
+### OCR-DET — repair-stage determinism (settles the bit-reproducible golden diff)
+
+- **Claim under test:** D-OCR-53 "crop → NodeRow byte golden diff is bit-reproducible."
+- **Current evidence:** DeepNSM repair/similarity is **f32** (`encoder.rs` similarity
+  → f32; `pipeline.rs` weighted blend). VSA bind/bundle is integer, but repair is on
+  the critical path before the NodeRow is written.
+- **Probe:** run the DeepNSM repair path (`nearest_words` + similarity) twice / across
+  a scalar-vs-SIMD toggle on 1k garbage fixtures; compare repaired-token bytes.
+- **Pass:** byte-identical both runs ⇒ repair may stay inside the frozen-mode boundary.
+- **Fail:** any divergence ⇒ the f32 repair stage MUST be carved out of (or pinned in)
+  the frozen-mode bit-repro boundary, and the helix `floor_version` MUST be pinned in
+  the golden bytes (else the rolling floor rolls and the diff spuriously fails).
+- **Cost:** ~60 LOC, < 1 h, pure deepnsm (compilable today as a `deepnsm` example).
+
+### OCR-POST — GGUF posterior parity (gates the entire LSTM host swap)
+
+- **Claim under test:** "int8-exact LSTM posteriors" across `.traineddata` → GGUF →
+  candle → ndarray-AMX, so the transcoded recodebeam yields 1:1 text.
+- **Probe:** dump libtesseract per-timestep `[T,C]` posteriors for ONE crop; run the
+  same recognizer GGUF via embedanything(candle)→ndarray; compare.
+- **Pass:** max per-timestep |Δposterior| within the int8 quantization step on the crop.
+- **Fail:** candle cannot express Tesseract's BiLSTM/CTC (OD-10b), OR Δ exceeds the
+  int8 step ⇒ the "1:1 text" chain is unfounded; the swap needs a candle-fork op
+  before any decode work proceeds.
+- **Cost:** needs candle wiring + a GGUF model + libtesseract oracle — NOT runnable in
+  this checkout. A 1-crop spike, not a 1k-crop acceptance, until it passes once.
+
+### OCR-SCHEMA — ValueSchema fit (settles the §0 anti-invention question)
+
+- **Claim under test:** "OCR needs a dedicated `ValueSchema::Ocr`."
+- **Current evidence (FINDING):** shipped `ocr.rs` rides the POC-`Full` default and
+  writes per-tenant; `Compressed` already = {Fingerprint, HelixResidue, TurbovecResidue,
+  EntityType}. A 5th enum variant collides with the #496 §0 guardrail.
+- **Probe (pure design / test, ~30 min):** assert against `canonical_node.rs` that the
+  OCR tenant set (minus the deferred-to-content-store fields) is ⊆ an existing preset's
+  `field_mask()`; if it fits `Full`/`Compressed`, no new variant is needed.
+- **Pass:** a tenant-by-tenant table fitting an existing preset ⇒ ride it (no enum change).
+- **Fail:** a genuine gap ⇒ escalate to operator to **mint an OCR class** (ClassView
+  selecting existing tenants), NOT extend the `ValueSchema` enum.
+- **Cost:** compilable today as a `lance-graph-contract` test (`ocr_schema_fit`).
+
+---
+
+## Secondary probes (cascade performance — convert asserted numbers to facts)
+
+These back the perf claims ("95 % pairs skipped", "8K at Super-8 cost", early-exit):
+
+- **P-OCR-EARLYEXIT:** CAM-PQ stroke-cascade skip-rate on a real OCR-token codebook
+  (`cam_pq.rs` `cascade_query` with calibrated `heel_threshold`/`branch_threshold`).
+- **P-OCR-CAKES-RECALL:** `clam.rs` `knn_dfs_sieve` recall vs brute force on the token codebook.
+- **P-HELIX-OCR-FIDELITY:** reuse the already-owed #459 ≥ 0.9980 Pearson floor gate for
+  the residue round-trip fidelity (still NOT RUN — see #459 deferred).
+
+---
+
+## DAG honesty
+
+The master critical path ends `… → {50,51} → 53` (golden diff). D-OCR-53 silently
+assumes OCR-RT + OCR-DET + OCR-POST all hold; none is measured. **OCR-DET and
+OCR-SCHEMA cost < 2 h with existing crates** — run them first; record results here
+(CONJECTURE → FINDING). If **OCR-RT** fails, the "use the new architecture's
+reversibility" rationale collapses regardless of transcode fidelity, so it is the
+single highest-leverage probe before the ~200k-LOC layout transcode is funded.
diff --git a/.claude/plans/soa-centroid-attention-field-synthesis-v1.md b/.claude/plans/soa-centroid-attention-field-synthesis-v1.md
index 952fdc64..b03bad72 100644
--- a/.claude/plans/soa-centroid-attention-field-synthesis-v1.md
+++ b/.claude/plans/soa-centroid-attention-field-synthesis-v1.md
@@ -8,19 +8,26 @@

 ## 0. The one idea

-The **48-bit helix residue + Morton-tile stacked-pyramid perturbation-shader IS a
-centroid attention field.** Place (HHTL) = centroid; residue (24-bit golden index)
-= each point's perturbation off it = the **query↔key alignment**; the pyramid =
-**multi-scale attention** (coarse centroid → fine). The field is *evaluated from the
-φ-spiral template, never stored*. Everything below is a **read of this one field at
-a different scale** — not separate engines bolted together.
+The **6-byte (48-bit) helix `Signed360` place index IS the stored handle of a
+centroid attention field.** Place (HHTL) = centroid; the stored 6 B index (rim +
+sign-partition polar + golden azimuth) is each point's perturbation off it; the
+**field is the deterministic *decode* of that stored index** via the φ-template
+(`HemispherePoint::lift` / `RollingFloor::quantize`) — "8K resolution at Super-8
+cost": the **index is stored, the field is evaluated from it** (NOT "never stored" —
+that earlier framing fought the stored-index reality and would break the zero-copy
+`NodeRowPacket` round-trip the D-OCR-53 golden diff depends on). Multi-scale
+coarse→fine uses the **real** primitives — `framebuffer::build_mipmap_pyramid`, the
+HHTL `splat3d/depth_cascade`, the CAKES ladder (`high_heel.rs`). There is **no**
+"Morton-tile stacked-pyramid perturbation-shader" in either repo (Morton is
+explicitly rejected for Hilbert, `linalg/hilbert.rs:50`) — drop the name. Everything
+below is a **read of this one field at a different scale** — not separate engines.

 ## 1. The reads (each is the same field, different scale)

 | Capability | Real crate / source | What it is, as a field read |
 |---|---|---|
 | **Perception (ONNX/LSTM)** | embedanything(candle)/GGUF host | emits a **query** into the field (golden index + posteriors); the ONLY learned-perceptual part, stays hosted |
-| **Attention eval** | `helix` (golden index, curve-ruler, `DistanceLut`) | query↔centroid alignment; Morton pyramid = coarse→fine resolution |
+| **Attention eval** | `helix` (stored 6 B `Signed360` index, curve-ruler, `DistanceLut`) | query↔centroid alignment; coarse→fine via mipmap pyramid / HHTL depth-cascade (NOT "Morton") |
 | **Markov context building / bundling** | `deepnsm::markov_bundle`, `encoder` | temporal **superposition along the field** = the bundling read (context = bundled perturbations) |
 | **Quorum + NARS reasoning** | `causal-edge::{pearl,nars,syllogism}` | centroid **coupling** = edge read; quorum = agreement of multiple field reads; NARS truth = coupling strength |
 | **Grammar heuristics** | `deepnsm::{parser,pos,morphology,spo,syllogism}` | syntactic **field masks** = structured attention over the field |
@@ -42,6 +49,14 @@ autoencoder is replaced by the integer codebook-distance oracle (the field) and
 still mines rules — neurosymbolic learning with NO autoencoder, NO SGD, NO seed. What's missing is only **plasticity** (centroid drift), not a learner.
+> **CONJECTURE (gate before citing as fact):** the central equivalence "VSA
+> bind/bundle/similarity *are* the helix field operations" is **asserted, not
+> measured**. Probe it: does `deepnsm::markov_bundle` over a set of residues produce
+> the same alignment *ordering* as `helix::DistanceLut` over the same residues
+> (Spearman ρ ≥ target)? Until that probe runs, this plan stays a phase-2 marker,
+> not a buildable spec — it is the synthesis plan most at risk of unification-without-
+> measurement. See `ocr-probes-v1.md`.
+
 ## 3. Phase-2: make the field plastic (the "learning edges")

 Not new tenants — **the field adapts**:
@@ -59,7 +74,7 @@ coupling resolve to the token. So ONNX = the perceptual *encoder into* the field
 the field = everything symbolic/sequential/relational. One substrate, two scales.

 ## 5. Determinism split (non-negotiable)
-- **Frozen mode** (centroids/gains fixed) → bit-reproducible → the Tesseract oracle + golden-file harness run here.
+- **Frozen mode** (centroids/gains fixed, helix `floor_version` **pinned** in the golden bytes, f32 repair stage carved out) → bit-reproducible → the Tesseract oracle + golden-file harness run here.
 - **Plastic mode** (field adapts) → live use; NOT golden-diffable; gated by snapshot.

 Two modes, explicitly separated, or the bit-repro guarantee is lost.
diff --git a/.claude/plans/tesseract-rs-ast-dll-codegen-v1.md b/.claude/plans/tesseract-rs-ast-dll-codegen-v1.md
index 22adaa05..e356d932 100644
--- a/.claude/plans/tesseract-rs-ast-dll-codegen-v1.md
+++ b/.claude/plans/tesseract-rs-ast-dll-codegen-v1.md
@@ -2,7 +2,7 @@

 > **Type:** plan (sub-plan). Deliverables D-OCR-40/41/42. The transcode *mechanism*.
 > **Status:** PLANTED 2026-06-15 v2 — layout IS in scope (1:1 raw-pointer), not skipped.
-> **Front:** post-#496. Uses `AdaWorldAPI/ruff` AST/codegen crates as the Rust-emission engine.
+> **Front:** post-#498. Uses `AdaWorldAPI/ruff` AST/codegen crates as the Rust-emission engine.
 > **Canon anchors:** master §4. Deterministic + diff-gated (bit-reproducibility doctrine).
 > **Skip-by-rule:** only leaf/mechanical modules are codegen targets; ownership-heavy code is transcribed faithfully as raw-pointer Rust (1:1), with safe-refactor deferred to a later oracle-gated pass.

@@ -75,7 +75,7 @@ the harness is re-runnable to prove the commit equals the generator output.
 ## 6. Deliverables

 - **D-OCR-40:** libclang → stable IR dump for the codegen-target module set; NOT-CODEGENABLE flagging works.
-- **D-OCR-41:** IR → committed Rust via ruff emission; re-run is byte-identical.
+- **D-OCR-41:** IR → committed Rust via ruff emission; re-run is byte-identical (CONJECTURE — a determinism property of an as-yet-unbuilt harness; prove with a re-run diff once D-OCR-40 lands).
 - **D-OCR-42:** behavioral + structural diff-gate green for the target modules vs the FFI oracle.

 ## 7. Open decisions
diff --git a/.claude/plans/tesseract-rs-transcode-master-v1.md b/.claude/plans/tesseract-rs-transcode-master-v1.md
index 2c6f51e3..4abd4427 100644
--- a/.claude/plans/tesseract-rs-transcode-master-v1.md
+++ b/.claude/plans/tesseract-rs-transcode-master-v1.md
@@ -3,7 +3,7 @@
 > **Type:** plan family root. SUPERSEDES v1 (which wrongly skipped layout).
 > **Status:** PLANTED 2026-06-15 v2 — design locked. 1:1 behavioral transcode of ALL
 > of Tesseract; the LSTM forward is the ONLY swapped component.
-> **Front:** post-#496. Hosts: `embedanything` DTO (GGUF→candle→ndarray-AMX, per
+> **Front:** post-#498 (helix `Signed360` 48 B→6 B; OCR keystone `LayoutBlock::to_node_row` SHIPPED; `ENVELOPE_LAYOUT_VERSION`=2). Hosts: `embedanything` DTO (GGUF→candle→ndarray-AMX, per
 > `.grok/NDARRAY_BGZ_EMBEDANYTHING_INTEGRATION.md`); `bgz_tensor` weight store.
 > **Canon:** OGAR/CLAUDE.md GUID P0; lance-graph/CLAUDE.md SoA node; canonical_node.rs.

@@ -69,16 +69,19 @@ the only change the OCR use forces on the shared interface. Narrow, additive.
| D-OCR-31 | minimal Leptonica ops on image/imageproc (numeric parity) | — | | D-OCR-40 | AST-DLL clang→IR→Rust codegen harness (ruff emission) | — | | D-OCR-42 | oracle diff-gate (every module vs libtesseract FFI) | D-OCR-21,30 | -| D-OCR-50 | OCR token → canonical NodeRow (OGAR class, HHTL, ValueSchema, edges) | canon | -| D-OCR-52 | DeepNSM + char-confusion + CAM/PQ token repair | D-OCR-50 | -| D-OCR-53 | bit-reproducibility harness (crop→text→NodeRow golden diff) | D-OCR-21,30,**50,51** | +| D-OCR-50 | OCR token → canonical NodeRow (**PARTIALLY SHIPPED #498**: `ocr.rs` block→NodeRow; remaining: token-grain + HHTL trie + OGAR class) | canon (#498) | +| D-OCR-52 | DeepNSM + char-confusion + CAM/PQ token repair (L1 cascade, not Hamming) | D-OCR-50 | +| D-OCR-53 | bit-reproducibility harness (crop→text→NodeRow golden diff; f32 repair carved out, `floor_version` pinned) | D-OCR-21,30,**50,51** | -Critical path: **40 → {10,30} → 16 → 21 → 42 → 53**. D-OCR-15 parallel (tiny). The -layout transcode (30) and the recognizer host (16) are independent until decode. +Critical path: **40 → {10,30} → 16 → 21 → 42 → {50,51} → 53**. D-OCR-15 parallel +(tiny). The layout transcode (30) and the recognizer host (16) are independent until +decode. D-OCR-53 (golden diff) needs the SoA row layout defined first, so D-OCR-50/51 +(in `ocr-canonical-soa-integration-v1`) precede it on the path — matching its +dependency list above. ## 5. Success criteria -1. `tesseract-rs` reproduces libtesseract output **byte-identical** on ≥10k line crops AND full-page layouts (oracle gate) — layout included, since it's 1:1. +1. `tesseract-rs` reproduces libtesseract output **byte-identical** on ≥10k line crops AND full-page layouts (oracle gate) — layout included, since it's 1:1. 
**CONJECTURE until measured:** byte-identity of raw-pointer-transcribed ~200k-LOC layout AND the GGUF→candle→ndarray int8 posterior path are the two biggest unproven claims — gate them with probe **OCR-POST** (posterior parity on one crop) BEFORE funding the full transcode. See `ocr-probes-v1.md`. 2. LSTM runs ONLY via embedanything(candle)→ndarray; no transcoded LSTM kernels. 3. Zero crates.io (forks + ndarray/lance family); `ort` sole opt-in C++ (off by default). 4. Recognized tokens land as canonical NodeRows with no bespoke geometry. diff --git a/crates/lance-graph-contract/src/ocr.rs b/crates/lance-graph-contract/src/ocr.rs index d6d69d8b..6724a942 100644 --- a/crates/lance-graph-contract/src/ocr.rs +++ b/crates/lance-graph-contract/src/ocr.rs @@ -232,4 +232,40 @@ mod tests { "a tenant the transcode doesn't populate stays zero" ); } + + #[test] + fn ocr_schema_fit_rides_existing_preset_no_new_variant() { + // Probe OCR-SCHEMA (.claude/plans/ocr-probes-v1.md): the OCR value tenants + // fit an EXISTING ValueSchema preset, so a 5th `ValueSchema::Ocr` enum variant + // is NOT needed (#496 §0 anti-invention). The codec-residue set OCR rides — + // HelixResidue + TurbovecResidue + EntityType (+ Fingerprint) — is exactly + // `Compressed`; everything else OCR could want is in the POC `Full` default. + let compressed = ValueSchema::Compressed; + for t in [ + ValueTenant::HelixResidue, + ValueTenant::TurbovecResidue, + ValueTenant::EntityType, + ValueTenant::Fingerprint, + ] { + assert!( + compressed.has(t), + "Compressed already carries {t:?} — OCR rides it" + ); + } + // The shipped transcode rides POC `Full`, which carries every tenant OCR touches + // (incl. Meta anchor / Energy confidence / Plasticity provenance). 
+ let full = ValueSchema::Full; + for t in [ + ValueTenant::HelixResidue, + ValueTenant::TurbovecResidue, + ValueTenant::EntityType, + ValueTenant::Meta, + ValueTenant::Energy, + ValueTenant::Plasticity, + ] { + assert!(full.has(t), "Full POC default carries {t:?}"); + } + // Both presets are layout-preserving — riding either needs no ENVELOPE_LAYOUT_VERSION bump. + assert!(compressed.is_layout_preserving() && full.is_layout_preserving()); + } }
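Reviewer sketch for the ordering-agreement probe named in the CONJECTURE note on the synthesis plan (Spearman ρ between the `deepnsm::markov_bundle` ordering and the `helix::DistanceLut` ordering over the same residues). This is a minimal sketch, not the probe in `ocr-probes-v1.md`. The function names `average_ranks` and `spearman_rho` are hypothetical, and the two input slices stand in for per-residue scores from the two systems, whose real signatures are not shown in this diff. The pass threshold ("ρ ≥ target") is left to the caller because the plan does not state a value.

```rust
/// Assign 1-based ranks to `scores`; tied values share the average of their ranks.
/// (Hypothetical helper, not part of any crate in this diff.)
fn average_ranks(scores: &[f64]) -> Vec<f64> {
    let mut order: Vec<usize> = (0..scores.len()).collect();
    order.sort_by(|&a, &b| scores[a].total_cmp(&scores[b]));

    let mut ranks = vec![0.0; scores.len()];
    let mut i = 0;
    while i < order.len() {
        // Extend j over the run of values equal to the one at sorted position i.
        let mut j = i;
        while j + 1 < order.len() && scores[order[j + 1]] == scores[order[i]] {
            j += 1;
        }
        // Sorted positions i..=j hold 1-based ranks i+1..=j+1; take their mean.
        let avg = (i + j) as f64 / 2.0 + 1.0;
        for &idx in &order[i..=j] {
            ranks[idx] = avg;
        }
        i = j + 1;
    }
    ranks
}

/// Spearman rank correlation: Pearson correlation of the two rank vectors.
/// Returns `None` when the inputs differ in length, have fewer than two
/// items, or when either side is constant (correlation undefined).
fn spearman_rho(a: &[f64], b: &[f64]) -> Option<f64> {
    if a.len() != b.len() || a.len() < 2 {
        return None;
    }
    let (ra, rb) = (average_ranks(a), average_ranks(b));
    let n = ra.len() as f64;
    let mean_a = ra.iter().sum::<f64>() / n;
    let mean_b = rb.iter().sum::<f64>() / n;

    let (mut cov, mut var_a, mut var_b) = (0.0, 0.0, 0.0);
    for (x, y) in ra.iter().zip(&rb) {
        let (dx, dy) = (x - mean_a, y - mean_b);
        cov += dx * dy;
        var_a += dx * dx;
        var_b += dy * dy;
    }
    if var_a == 0.0 || var_b == 0.0 {
        return None;
    }
    Some(cov / (var_a * var_b).sqrt())
}
```

A probe built on this would feed both systems the same residue set, collect one score per residue from each, and compare `spearman_rho(&bundle_scores, &lut_scores)` against the agreed target. Because only the ordering is compared, the two score scales do not need to match; a `None` result should be treated as a probe failure rather than a pass.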