diff --git a/.claude/plans/ocr-canonical-soa-integration-v1.md b/.claude/plans/ocr-canonical-soa-integration-v1.md new file mode 100644 index 00000000..a2e9ecff --- /dev/null +++ b/.claude/plans/ocr-canonical-soa-integration-v1.md @@ -0,0 +1,127 @@ +# OCR → Canonical SoA Integration v1 + +> **Type:** plan (sub-plan — the one that binds OCR to the lance-graph substrate). Deliverables D-OCR-50/51/52/53. +> **Status:** PLANTED 2026-06-15 — design only. THIS is "use the new architecture we raced for." +> **Front:** post-#496. Integration surface = `canonical_node.rs` (`NodeGuid`/`EdgeBlock`/`EdgeCodecFlavor`/`NodeRow`/`ValueTenant`/`ValueSchema`/`NodeRowPacket`) + `class_view.rs` (`ClassView`/`FieldMask`). +> **Canon anchors:** OGAR/CLAUDE.md P0 GUID; lance-graph/CLAUDE.md SoA node (`4ea6ac9`); soa-three-tier-model; DeepNSM crate (`lance-graph/crates/deepnsm`); helix/CAM-PQ (`crates/helix`, `bgz-tensor` CAM-PQ). +> **Skip-by-rule:** OCR introduces NO bespoke row geometry. It rides the existing value-tenant carve. + +--- + +## 0. Intent + +An OCR token is not a foreign payload that needs a boundary adapter — it **is** a +canonical SoA node. This plan defines the mapping so recognized text lands directly +in the substrate: addressed by HHTL, classed by OGAR, valued by a `ValueSchema` +preset over *existing* `ValueTenant`s, edged by `EdgeCodecFlavor`, repaired by +DeepNSM + CAM/PQ, and persisted via `NodeRowPacket`. Zero boundary tax — the whole +point of the splat-native / "one representation, many views" doctrine, applied to OCR. + +## 1. OCR token → `NodeRow` mapping (D-OCR-50) + +**Key (`NodeGuid`, 16 B):** `classid · HEEL · HIP · TWIG · family · identity`. +- `classid` = the minted OCR class prefix (see §2). `0x0000_0000` fallback until OGAR mints it. +- HHTL path (HEEL/HIP/TWIG) = document → page → block (the layout hierarchy from `ocrs::layout_analysis`). +- `family` (3 B) = line/region basin; `identity` (3 B) = token ordinal within the basin. 
+ → `local_key()` (trailing 6 B) addresses a token within its line after the trie walk. + +**Edges (`EdgeBlock`, 16 B = 12 in-family + 4 out-of-family):** +- in-family (12): reading-order + local-layout adjacency (prev/next token, same-line + neighbors, baseline siblings). `EdgeCodecFlavor::CoarseOnly` (1 B/slot) — pure topology. +- out-of-family (4): inherited adapters — (A) table-cell membership, (B) block/column + parent, (C) semantic/coref link (post-DeepNSM), (D) source-region (bbox → page geometry). + +## 2. OCR class + HHTL address scheme (D-OCR-50) + +- Mint an OCR class family in OGAR (`ogar-ontology`): `Document → Page → Block → + Line → Token`, with leaf token subtypes (`Word`, `Number`, `Date`, `Currency`, + `Glyph`, `TableCell`). Until OGAR mints them, hardcode the classid prefix space + per the reserve-don't-reclaim ladder (the classid bytes stay reserved at offset 0). +- `ClassView` for the OCR class declares `edge_codec_flavor` (`CoarseOnly`) and + `value_schema` (the OCR preset, §3). + +## 3. OCR `ValueSchema` preset over EXISTING tenants (D-OCR-51) + +The 480-byte value slab already carves into `VALUE_TENANTS`. An OCR token is **not +a stored string and not a hash** — it is the *terminal of the perturbation cascade*, +reconstructed exactly like every other node. Text = codebook index + residue. + +| Tenant (existing) | OCR role | +|---|---| +| helix residue = **centroid attention field** (NOT a stored code) | The 24-bit golden index is the **query↔centroid alignment** (φ-spiral direction = how this point attends to its place-centroid); the Morton-tile stacked-pyramid perturbation-shader is **multi-scale attention** (coarse centroid → fine perturbation = HHTL cascade in residue space). The field is **evaluated from the φ-template, never stored** ("8K resolution at Super-8 cost" — only the index is kept). Place=HHTL centroid; residue=perturbation off it. 
The 48-byte `ValueTenant::HelixResidue` is category-wrong (stores a field that must be computed) — do NOT use it. | +| `TurbovecResidue` (16 B, PQ) | PQ edge residue → CAKES nearest-valid-token search over the codebook | +| `Meta` (u64) | codebook index/anchor + confidence + char-confusion/NSM-repair flags + recoder-code fallback for true-OOV | +| `EntityType` (u16) | token subtype (Word/Number/Date/Glyph/TableCell) | +| `Plasticity` (u32) | correction history / last-repair stamp | + +**Reconstruction (this is the round-trip, and it answers Codex P1):** +`text ⇄ codebook_index(Meta) + field-eval(helix 24-bit golden-index attention ⊕ TurbovecResidue PQ)`. Decode = +the DeepNSM Morton-tile **stacked-pyramid perturbation-shader cascade** applied to +the residue → CAKES nearest-valid-token over the codebook (DeepNSM `vocabulary` / +coca `word_frequency`) → the word. No `Fingerprint` hash, no string column. The +reversibility lives in residue + codebook, which is the architecture's whole point. + +**True-OOV (no codebook neighbor — a raw code like `69B8`):** falls back to the +**recoder-code residue** — `recodebeam` already emits recoder codes, not pixels, so +the codes themselves are the reversible payload in `Meta`, repaired by the +char-confusion grammar (D-OCR-52). Still a residue, never a hash. + +**ValueSchema:** `Cognitive` does NOT include `HelixResidue`/`TurbovecResidue`, so +OCR needs a dedicated **`ValueSchema::Ocr`** = `FieldMask` over +{`HelixResidue`,`TurbovecResidue`,`Meta`,`EntityType`,`Plasticity`}. Selection only; +moves no tenant (canon: tenants never move/reuse). + +## 4. Repair: DeepNSM + CAM/PQ nearest-valid-token (D-OCR-52) + +The recognizer emits candidates+confidence; repair is the brainstem we already have: +- **Character/orthographic layer (new, thin, below DeepNSM):** `0/O 1/I/l 5/S rn/m` + confusion table + number/date/currency/table-cell grammars. Repairs orthography on + OOV garbage (codes, IDs like `69B8`) BEFORE the word layer. 
(This is the only + genuinely greenfield code; the word-frequency half already exists as + `deepnsm/word_frequency`.) +- **Word layer = `deepnsm`:** `vocabulary` → `codebook` → `parser`/`pos` → `encoder` + → `similarity`/`cam64`/`crystal_neighborhood`. Word-level plausibility + disambiguation. +- **Nearest-valid-token = helix / CAM-PQ / CAKES:** the glyph `TurbovecResidue` + (PQ) + `HelixResidue` feed CAKES nearest-valid-token; CHAODA (clustered-hierarchical outlier detection) flags anomalous + tokens (likely-misrecognized). This is `bgz-tensor` CAM-PQ + `crates/helix`. + +Repaired token writes back: corrected text → `Fingerprint`/`EntityType`, repair +provenance → `Meta`/`Plasticity`. + +## 5. Persistence + planner (kv-lance / surreal) + +- `NodeRowPacket` → `SoaEnvelope` → Lance (kv-lance backend, per `surrealdb` fork). + OCR nodes are ordinary rows; a Lance version is a coherent page/document snapshot. +- `surreal_container` as the **OCR-job control plane** (per its role: planner / AST + adapter / time-series / kanban): kanban of OCR jobs (queued→detect→recognize→ + repair→persisted via the Rubicon transitions already in `soa_view.rs`), time-series + of throughput, AST API for the repair-grammar (compile-time vs JIT grammars). + +## 6. Bit-reproducibility harness (D-OCR-53) — the migration payoff + +The transcode oracle (D-OCR-2x) makes OCR a **deterministic regression source for +the whole SoA migration**: the same line crop → C++ Tesseract text AND Rust port +text AND the resulting `NodeRow` bytes. Because every stage is supposed to be +bit-reproducible (DeepNSM bit-reproducible, envelope version-stamped, CausalEdge64 +locked), a golden-file diff over (crop → NodeRow) exercises exactly the muscles the +migration must harden: `ndarray::hpc` hydration, the envelope LE round-trip, and +SIMD numeric exactness. OCR is the best external oracle the substrate has. + +## 7. Deliverables + +- **D-OCR-50:** OCR class + HHTL address scheme; `ClassView` impl for OCR class. 
+- **D-OCR-51:** `ValueSchema` OCR preset (FieldMask over existing tenants); a token + round-trips token→NodeRow→token with no geometry change. +- **D-OCR-52:** DeepNSM + character-confusion layer + CAM/PQ repair wired; a known + OCR-garbage fixture (`69B8`, `rn`→`m`) is repaired by plausibility. +- **D-OCR-53:** golden-file (crop → NodeRow bytes) regression green, shared with the + SoA migration suite. **Prereq: D-OCR-50 + D-OCR-51** (class/HHTL/ValueSchema must + define the row layout before bytes can be golden-diffed). + +## 8. Open decisions + +- **OD-1:** dedicated `ValueTenant::OcrEvidence` vs ride `Meta`+`HelixResidue` (POC rides). +- **OD-50a:** is a "Token" one node, or is a "Line" the node and tokens are value-slab + sub-records? (Node-per-token is simpler + edges are natural; node-per-line is denser.) +- **OD-52a:** character-confusion layer as a `deepnsm` submodule vs a sibling + `coca-codebook` crate. (Word-frequency half already lives in `deepnsm/word_frequency`.) diff --git a/.claude/plans/soa-centroid-attention-field-synthesis-v1.md b/.claude/plans/soa-centroid-attention-field-synthesis-v1.md new file mode 100644 index 00000000..952fdc64 --- /dev/null +++ b/.claude/plans/soa-centroid-attention-field-synthesis-v1.md @@ -0,0 +1,69 @@ +# SoA Centroid Attention Field — Unified Synthesis v1 + +> **Type:** plan (phase-2 marker / co-architecture). Unifies recognition + reasoning + grammar as reads of ONE field. +> **Status:** PLANTED 2026-06-15. Gated on `cycle-coherent-soa-snapshot-v1` (plastic field ⇒ COW writes). +> **Canon:** helix crate (golden-index residue, φ-template); deepnsm; causal-edge (pearl/nars); TEKAMOLO (#495). + +--- + +## 0. The one idea + +The **48-bit helix residue + Morton-tile stacked-pyramid perturbation-shader IS a +centroid attention field.** Place (HHTL) = centroid; residue (24-bit golden index) += each point's perturbation off it = the **query↔key alignment**; the pyramid = +**multi-scale attention** (coarse centroid → fine). 
The field is *evaluated from the +φ-spiral template, never stored*. Everything below is a **read of this one field at +a different scale** — not separate engines bolted together. + +## 1. The reads (each is the same field, different scale) + +| Capability | Real crate / source | What it is, as a field read | +|---|---|---| +| **Perception (ONNX/LSTM)** | embedanything(candle)/GGUF host | emits a **query** into the field (golden index + posteriors); the ONLY learned-perceptual part, stays hosted | +| **Attention eval** | `helix` (golden index, curve-ruler, `DistanceLut`) | query↔centroid alignment; Morton pyramid = coarse→fine resolution | +| **Markov context building / bundling** | `deepnsm::markov_bundle`, `encoder` | temporal **superposition along the field** = the bundling read (context = bundled perturbations) | +| **Quorum + NARS reasoning** | `causal-edge::{pearl,nars,syllogism}` | centroid **coupling** = edge read; quorum = agreement of multiple field reads; NARS truth = coupling strength | +| **Grammar heuristics** | `deepnsm::{parser,pos,morphology,spo,syllogism}` | syntactic **field masks** = structured attention over the field | +| **Relative-pronoun / syntax order** | TEKAMOLO resolver (#495) | resolves adverbial/relative-pronoun binding = constrained attention path | +| **Rule learning (the real "aerial")** | `lance-graph-arm-discovery::aerial` — Aerial+ transcode (arXiv 2504.19354), **autoencoder replaced by integer codebook-distance oracle** (palette256, ρ=0.9973 vs cosine) | mines SPO association rules **float-free / bitwise-deterministic** → `arm_to_truth_u8` → `CausalEdge64` confidence_u8 + i4 mantissa. This IS "learning edges" — the field's codebook distance replaces the f32 autoencoder. | +| **Episodic / coref** | AriGraph (`EpisodicWitness64`) | temporal chain read = the field over witness-time | +| **Nearest-valid-token** | `crystal_neighborhood`, `cam64`, CAKES + `turbovec` | field-alignment argmax = read-off to codebook word | + +## 2. 
Why this is one object, not a pipeline + +VSA bind/bundle/similarity **are** the field operations: bind = perturbation off +centroid, bundle = the pyramid's coarse-level superposition, similarity = field +alignment (`DistanceLut`). So DeepNSM's markov_bundle is the *symbolic readout* of +the field; NARS/quorum is the *edge coupling*; grammar/TEKAMOLO are *attention +masks*. No separate learning machine is needed — the attention field already does +binding/bundling/attention in one structure (Frady/Kleyko 1707.01429: trained-RNN +⊁ VSA for symbol sequences). **`aerial` is the proof in-tree:** Aerial+'s f32 +autoencoder is replaced by the integer codebook-distance oracle (the field) and +still mines rules — neurosymbolic learning with NO autoencoder, NO SGD, NO seed. +What's missing is only **plasticity** (centroid drift), not a learner. + +## 3. Phase-2: make the field plastic (the "learning edges") + +Not new tenants — **the field adapts**: +- centroid **drift** (place-centroids move toward corpus density); +- shader **perturbation-gain** adaptation (the pyramid's response sharpens); +- timed by `Plasticity` tenant; coupled by `CausalEdge64` strength (NARS mantissa moves). +Evaluated from the φ-template (not materialized). **Hard dep:** `cycle-coherent-soa-snapshot` +COW — plastic field mutates per cycle; without snapshot it thrashes Lance. + +## 4. ONNX combination (operator's point) + +The ONNX-shaped recognizer and the field **meet at the query boundary**: ONNX emits +posteriors → the field's golden-index query; field eval + grammar masks + NARS +coupling resolve to the token. So ONNX = the perceptual *encoder into* the field; +the field = everything symbolic/sequential/relational. One substrate, two scales. + +## 5. Determinism split (non-negotiable) +- **Frozen mode** (centroids/gains fixed) → bit-reproducible → the Tesseract oracle + golden-file harness run here. +- **Plastic mode** (field adapts) → live use; NOT golden-diffable; gated by snapshot. 
+Two modes, explicitly separated, or the bit-repro guarantee is lost. + +## 6. Open +- **OD-A:** RESOLVED — "aerial" = `lance-graph-arm-discovery::aerial` (Aerial+ rule-mining, codebook-distance oracle, not AriGraph). +- **OD-B:** centroid drift rule — Hebbian on `Plasticity`, or NARS-revision on `CausalEdge64`? (probe-gate, measure first.) +- **OD-C:** operator sign-off required for any new tenant (anti-invention guardrail) — phase-2 should need NONE (field is evaluated, not stored). diff --git a/.claude/plans/tesseract-rs-ast-dll-codegen-v1.md b/.claude/plans/tesseract-rs-ast-dll-codegen-v1.md new file mode 100644 index 00000000..22adaa05 --- /dev/null +++ b/.claude/plans/tesseract-rs-ast-dll-codegen-v1.md @@ -0,0 +1,87 @@ +# tesseract-rs — AST-DLL C++→Rust Codegen Harness v1 + +> **Type:** plan (sub-plan). Deliverables D-OCR-40/41/42. The transcode *mechanism*. +> **Status:** PLANTED 2026-06-15 v2 — layout IS in scope (1:1 raw-pointer), not skipped. +> **Front:** post-#496. Uses `AdaWorldAPI/ruff` AST/codegen crates as the Rust-emission engine. +> **Canon anchors:** master §4. Deterministic + diff-gated (bit-reproducibility doctrine). +> **Skip-by-rule:** only leaf/mechanical modules are codegen targets; ownership-heavy code is transcribed faithfully as raw-pointer Rust (1:1), with safe-refactor deferred to a later oracle-gated pass. + +--- + +## 0. Intent + +Transcode the *mechanical* C++ leaf modules (container parse, unicharset, recoder, +dawg node-arrays, weight-matrix struct walks) into Rust by a **deterministic, +reviewable codegen harness** rather than by hand — so the faithful tier is +auditable and re-runnable. The harness pairs a **clang C++ AST frontend** with a +**Rust emission backend built on the `ruff` AST/codegen crates**. + +## 1. Why ruff (honest scoping) + +`ruff` is a *Python* toolchain — `ruff_python_parser` / `ruff_python_ast` parse +**Python**, not C++. So ruff is **not** the C++ frontend. 
Its value here is the +mature, battle-tested **Rust-side AST → source emission discipline**: +`ruff_python_codegen` (AST → formatted source), `ruff_formatter` (the formatting +IR), `ruff_source_file`, and the `ruff_python_dto_check` pattern (structural +invariant checks on a typed AST). We reuse those *patterns and crates* as the +emission/formatting backend for a `RustAst → rust source` pipeline. The C++ side is +clang. + +``` +C++ source ──(libclang)──► Clang AST ──► [AST DLL: stable IR dump] ──► RustAst builder + ──► (ruff codegen/formatter discipline) ──► formatted .rs ──► diff-gate vs FFI oracle +``` + +## 2. The "AST DLL" (D-OCR-40) + +The C++ AST is extracted once into a **stable, serializable IR** (the "AST DLL"): +a libclang traversal that dumps the subset we transcode (struct/enum decls, plain +methods, table initializers, fixed-size array walks) as a typed IR that is independent of +clang version drift, so the emission step is reproducible. Functions touching +pointers-into-mutable-graphs, virtual dispatch, or template metaprogramming are +**flagged NOT-CODEGENABLE** and routed to hand-port. Layout code is NOT skipped under v2: +intrusive `textord`/`ccstruct` structures are transcribed as faithful raw-pointer Rust (§5, D-OCR-30). + +## 3. Rust emission via ruff crates (D-OCR-41) + +A `RustAst` builder consumes the IR and emits idiomatic Rust: +- field-by-field struct/enum transcription (canon: byte layout preserved); +- table/array initializers → `const`/`static` Rust tables; +- the emission goes through ruff's formatter IR so output is deterministic and + diff-stable (re-running codegen produces byte-identical source); +- a `dto_check`-style pass asserts the **LE byte contract** is preserved per struct + (no silent re-ordering / re-widening; this is the same invariant the SoA envelope audit + enforces). + +## 4. Diff-gate (D-OCR-42) + +Every codegen'd module is validated against the FFI oracle: +- behavioral: emitted Rust function vs `libtesseract` function on the same inputs + (e.g.
unicharset id↔utf8, recoder encode/decode, dawg word-membership) → byte-equal; +- structural: `dto_check` confirms each emitted struct's byte image matches the C++ + `sizeof`/offset dump. +Codegen output is committed (not generated at build) so reviewers see real Rust; +the harness is re-runnable to prove the commit equals the generator output. + +## 5. Module assignment (codegen vs hand vs replace) + +| C++ area | Route | +|---|---| +| `tessdatamanager`, `unicharset`, `unicharcompress` (recoder), `dawg`/`trie` node arrays, `weightmatrix` struct/quant walks | **CODEGEN (D-OCR-41)** | +| `recodebeam` (beam + dawg interaction), int8 GEMV rounding | **HAND-PORT** (numeric/behavioral subtlety) | +| `textord`/`ccstruct` layout | **CODEGEN → faithful raw-pointer Rust (D-OCR-30)** — intrusive ELIST/CLIST transcribed 1:1, NOT replaced | +| Leptonica (~dozen ops only) | hand-port to image/imageproc (D-OCR-31) | + +## 6. Deliverables + +- **D-OCR-40:** libclang → stable IR dump for the codegen-target module set; NOT-CODEGENABLE flagging works. +- **D-OCR-41:** IR → committed Rust via ruff emission; re-run is byte-identical. +- **D-OCR-42:** behavioral + structural diff-gate green for the target modules vs the FFI oracle. + +## 7. Open decisions + +- **OD-3 (from master):** libclang in-process vs clang `-ast-dump=json` consumed by + a Rust IR. JSON is simpler/decoupled; libclang is richer/faster. Default: clang + JSON dump for v1 (decoupled, reproducible), libclang later if needed. +- **OD-40a:** is the AST-DLL harness OCR-specific, or a reusable + `AdaWorldAPI/` tool? (It would also serve other C++→Rust ports.) 
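Reviewer note on the structural half of the diff-gate (§3/§4 of the AST-DLL plan above): a minimal sketch of what a `dto_check`-style pass could assert, assuming the C++ side supplies a `sizeof`/offset dump. `EdgeRecord`, `CppLayout`, and `dto_check` are hypothetical names for illustration; the struct is not Tesseract's real layout and this is not the ruff fork's API.

```rust
use std::mem::{offset_of, size_of};

/// Hypothetical transcribed record; NOT Tesseract's real struct layout.
#[repr(C)]
pub struct EdgeRecord {
    pub unichar_id: u32,
    pub next_node: u32,
    pub flags: u8,
    pub reserved: [u8; 3],
}

/// Layout facts as an (assumed) C++ `sizeof`/offset dump would report them.
pub struct CppLayout<'a> {
    pub size: usize,
    pub fields: &'a [(&'a str, usize)],
}

/// Field offsets of the Rust struct, as computed by the compiler.
pub fn rust_fields() -> Vec<(&'static str, usize)> {
    vec![
        ("unichar_id", offset_of!(EdgeRecord, unichar_id)),
        ("next_node", offset_of!(EdgeRecord, next_node)),
        ("flags", offset_of!(EdgeRecord, flags)),
        ("reserved", offset_of!(EdgeRecord, reserved)),
    ]
}

/// Ok(()) when the total size and every dumped field offset agree;
/// otherwise a description of the first mismatch found.
pub fn dto_check(cpp: &CppLayout) -> Result<(), String> {
    let rust_size = size_of::<EdgeRecord>();
    if cpp.size != rust_size {
        return Err(format!("size: C++ {} vs Rust {}", cpp.size, rust_size));
    }
    let rust = rust_fields();
    for (name, cpp_off) in cpp.fields {
        match rust.iter().find(|(n, _)| n == name) {
            Some((_, rust_off)) if rust_off == cpp_off => {}
            Some((_, rust_off)) => {
                return Err(format!("{name}: C++ offset {cpp_off} vs Rust {rust_off}"));
            }
            None => return Err(format!("{name}: missing in Rust struct")),
        }
    }
    Ok(())
}
```

The check only compares layout metadata; it says nothing about endianness of the stored values, which the behavioral half of the gate would still have to cover.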
diff --git a/.claude/plans/tesseract-rs-layout-transcode-v1.md b/.claude/plans/tesseract-rs-layout-transcode-v1.md new file mode 100644 index 00000000..554275ca --- /dev/null +++ b/.claude/plans/tesseract-rs-layout-transcode-v1.md @@ -0,0 +1,44 @@ +# tesseract-rs — Layout (textord/ccstruct) 1:1 Transcode v1 + +> **Type:** plan (sub-plan, v2 — the part v1 wrongly skipped). Deliverables D-OCR-30/31. +> **Status:** PLANTED 2026-06-15. FAITHFUL 1:1, raw-pointer where C++ is intrusive. +> **Canon:** oracle-gated behavioral parity; safe-refactor is a later, separate, oracle-preserving pass. + +--- + +## 0. Intent +Reproduce Tesseract's page layout (the ~tens-of-thousands LOC the v1 plan wrongly +proposed replacing with ocrs) byte-for-byte. This is the bulk of the "free 200k LOC": +mechanical, codegen-amenable in structure, made tractable by accepting raw-pointer Rust. + +## 1. Faithful-transcription ruling +Tesseract layout is intrusive `ELIST`/`CLIST` doubly-linked lists + cyclic mutable +blob graphs (`BLOBNBOX`, `TO_BLOCK`, `ColPartition`, tab-stop finder, baseline fit, +reading order). The 1:1 image is **raw-pointer Rust** (`*mut`, intrusive nodes, +manual lifetimes) — behavior-identical, NOT redesigned. Ownership redesign (arena/ +slotmap/index graphs) is a LATER pass, gated to preserve oracle output. Do NOT +redesign during transcode; that breaks 1:1 and is the trap. + +## 2. Modules (D-OCR-30) +`ccstruct/{blobbox,coutline,polyblk,...}`, `textord/{tabfind,colfind,colpartition, +tablefind,baselinefit,textlineprojection,wordseg,...}`. AST-DLL codegen emits the +struct/method skeletons + intrusive-list ops as raw-pointer Rust; the gnarly +control flow is reviewed against the C++ AST, diff-gated per function. + +## 3. Leptonica ops (D-OCR-31) +Only the ~dozen ops Tesseract calls: Otsu/Sauvola binarize, projection/Radon +deskew, despeckle, connected-component label, scale. 
Hand-port onto `image`/ +`imageproc` with per-op numeric parity vs the Leptonica fork (built as oracle). +The rest of Leptonica is NOT ported. + +## 4. Acceptance (D-OCR-30/31) +Full-page layout output (block/line/word boxes + reading order) byte-identical to +libtesseract on a fixed page set; per-op image results bit-equal on fixtures. + +## 5. Open +- **OD-30a:** one intrusive-list helper crate (`tess_elist`) shared across modules, or inline per-module? (Shared reduces unsafe surface.) + +## 6. Front-end constraint (carried from retired neural-layout plan) +The PDF/image front-end MUST be pure-Rust to honor the zero-C posture: image inputs +via `image`/`imageproc`; PDF via a pure-Rust path (`ferrules`), **not** `pdfium-render` +(it wraps native PDFium = C). The zero-C acceptance gate is otherwise unsatisfiable. diff --git a/.claude/plans/tesseract-rs-recodebeam-transcode-v1.md b/.claude/plans/tesseract-rs-recodebeam-transcode-v1.md new file mode 100644 index 00000000..2fb8b3df --- /dev/null +++ b/.claude/plans/tesseract-rs-recodebeam-transcode-v1.md @@ -0,0 +1,29 @@ +# tesseract-rs — recodebeam Decoder 1:1 Transcode v1 + +> **Type:** plan (sub-plan, v2). Deliverable D-OCR-21. Replaces v1 lstm-recodebeam (LSTM now hosted). +> **Status:** PLANTED 2026-06-15. Decoder transcoded 1:1 over HOSTED posteriors. +> **Canon:** consumes `[T,C]` from D-OCR-16 (embedanything host); oracle-gated. + +--- + +## 0. Intent +Transcode only the DECODER. The LSTM forward is hosted (D-OCR-16); recodebeam takes +its `[T, n_classes]` posteriors and produces text exactly as Tesseract does. + +## 1. recodebeam (`tesseract-rs/src/recodebeam.rs`) — hand-port, D-OCR-21 +Beam over recoder codes; dawg-constrained + unconstrained beams; certainty/rating +tie-break 1:1. DAWG node-array walks = codegen (D-OCR-40); the beam↔dawg interaction +is hand-ported (behavioral subtlety). Output: text + per-token rating/certainty → +becomes per-token confidence at emit (integration plan). + +## 2. 
int8 exactness boundary +Numeric exactness now lives at the HOST boundary (D-OCR-16 posteriors), not in +transcoded kernels. recodebeam consumes posteriors; if the host reproduces C++ +posteriors to the int8 contract, the decoder's 1:1 transcription yields 1:1 text. + +## 3. Acceptance +recodebeam text byte-identical to libtesseract on ≥10k crops given oracle posteriors +(isolates decoder correctness from host numeric parity). + +## 4. Open +- **OD-21a:** support `lstm_choice_mode` top-k now (feeds OCR evidence tenant) or later. diff --git a/.claude/plans/tesseract-rs-traineddata-gguf-v1.md b/.claude/plans/tesseract-rs-traineddata-gguf-v1.md new file mode 100644 index 00000000..a9845114 --- /dev/null +++ b/.claude/plans/tesseract-rs-traineddata-gguf-v1.md @@ -0,0 +1,36 @@ +# tesseract-rs — traineddata → GGUF → embedanything Host v1 + +> **Type:** plan (sub-plan, v2). Deliverables D-OCR-10/16. Replaces v1 `traineddata-ndarray`. +> **Status:** PLANTED 2026-06-15. The LSTM is HOSTED, not transcoded. +> **Host chain:** `.traineddata` → GGUF → `embedanything` DTO (candle) → `ndarray` AMX; `bgz_tensor` weight store. +> **Canon:** `.grok/NDARRAY_BGZ_EMBEDANYTHING_INTEGRATION.md`; ndarray::hpc GGUF loader. + +--- + +## 0. Intent +Make Tesseract's recognizer "just another GGUF model behind embedanything." Parse +the `.traineddata`, extract the recognizer net + weights, **export GGUF**, and run +it on the existing inference runbook. No bespoke LSTM kernels; reuse candle's GGUF +loader + ndarray AMX + bgz_tensor storage. + +## 1. Loader (`tesseract-rs/src/traineddata/`) — D-OCR-10 +Parse the modern (LSTM) `.traineddata` components: `lstm` (VGSL net + weights), +`lstm-unicharset`, `lstm-recoder` (UnicharCompress), `unicharset`, `*.dawg`. +Skip legacy (`inttemp`/`normproto`/`shapetable`/adaptive). Container/unicharset/ +recoder/dawg parse = **AST-DLL codegen** (D-OCR-40); VGSL net-spec = hand (tiny grammar). + +## 2. 
GGUF export (D-OCR-10) +Walk the VGSL graph → emit a GGUF model file: +- map Tesseract layers (Conv, LSTM/BiLSTM, FullyConnected, Output/Softmax) to GGUF tensors + arch metadata; +- preserve int8 quantization + per-row scales exactly so the hosted run can match C++ within the int8 contract (note: GGUF Q8_0 stores one f16 scale per 32-value block, not per row or per tensor, so an exact mapping may need the custom layout in OD-10a); +- `bgz_tensor` stores the exported tensors (compressed, Lance-native, random-access) per its weight-store role. + +## 3. Hosted run (D-OCR-16) +`embedanything::infer_sequence` (the D-OCR-15 extension) loads the GGUF via candle, +runs CNN+BiLSTM on the ndarray AMX path, returns `[T, n_classes]` posteriors. +Acceptance: per-timestep posteriors match a libtesseract dump within the int8 +exactness contract (float path 1:1) on a 1k-crop set. recodebeam (D-OCR-21) consumes these. + +## 4. Open +- **OD-10a:** GGUF int8 (Q8_0) vs keep Tesseract's exact int8 layout in a custom GGUF kv: whichever reproduces C++ posteriors to the bit. +- **OD-10b:** candle's BiLSTM/CTC coverage: confirm candle expresses Tesseract's BiLSTM + softmax exactly, else add the missing op in the candle fork. diff --git a/.claude/plans/tesseract-rs-transcode-master-v1.md b/.claude/plans/tesseract-rs-transcode-master-v1.md new file mode 100644 index 00000000..2c6f51e3 --- /dev/null +++ b/.claude/plans/tesseract-rs-transcode-master-v1.md @@ -0,0 +1,91 @@ +# Tesseract → tesseract-rs: 1:1 Transcode Master Plan v2 + +> **Type:** plan family root. SUPERSEDES v1 (which wrongly skipped layout). +> **Status:** PLANTED 2026-06-15 v2; design locked. 1:1 behavioral transcode of ALL +> of Tesseract; the LSTM forward is the ONLY swapped component. +> **Front:** post-#496. Hosts: `embedanything` DTO (GGUF→candle→ndarray-AMX, per +> `.grok/NDARRAY_BGZ_EMBEDANYTHING_INTEGRATION.md`); `bgz_tensor` weight store. +> **Canon:** OGAR/CLAUDE.md GUID P0; lance-graph/CLAUDE.md SoA node; canonical_node.rs. + +--- + +## 0.
The whole decision in one line + +Transcode **every** Tesseract module 1:1 for behavioral parity (validated against +the `tesseract-rs` FFI oracle), EXCEPT the LSTM recognizer forward pass, which is +**hosted** on the existing runbook: recognizer weights → GGUF → `embedanything` +DTO (candle backend) → `ndarray` AMX path → per-timestep posteriors. Everything +else — container, unicharset, recoder, **textord/ccstruct layout**, recodebeam, +DAWG dict, the minimal Leptonica ops Tesseract calls — is faithfully ported. + +## 1. What is 1:1 transcoded (≈200k LOC, mechanically) + +| Tesseract area | Route | 1:1 fidelity rule | +|---|---|---| +| `ccutil/tessdatamanager`, `unicharset`, `unicharcompress` (recoder) | AST-DLL codegen | byte-faithful tables | +| `dict/{dawg,trie,permdawg}` | AST-DLL codegen | node-array walks 1:1 | +| **`textord/` + `ccstruct/` (layout, tab-stops, ColPartition, reading order)** | AST-DLL codegen → **faithful raw-pointer/unsafe Rust** | ELIST/CLIST + cyclic blob graphs transcribed as raw-pointer intrusive lists; behavior-identical; safe-refactor is a LATER behavior-preserving pass, never a 1:1 deviation | +| `recodebeam` (beam + DAWG interaction) | hand-port | tie-break/normalization 1:1 | +| Leptonica ops Tesseract actually calls (Otsu/Sauvola, deskew, despeckle, CC-label, scale) | hand-port onto `image`/`imageproc` | numeric parity per-op | +| `lstm/{network,lstm,fullyconnected,convolve,weightmatrix}` | **NOT transcoded — HOSTED** | see §2 | + +**The unsafe-is-fine ruling:** a true 1:1 image of intrusive-pointer C++ is +raw-pointer Rust. We accept `unsafe` as the faithful transcription; correctness is +proven by the oracle diff, not by safe-Rust aesthetics. Refactor to arena/index +graphs is a separate, oracle-gated step AFTER 1:1 is green. This is what makes the +200k LOC mechanical (codegen + faithful transcription) instead of an ownership redesign. + +## 2. 
The ONE swap — LSTM hosted, not ported + +``` +.traineddata LSTM weights ──► GGUF export ──► embedanything DTO (candle) ──► ndarray AMX + ──► per-timestep posteriors ──► (transcoded) recodebeam + DAWG ──► text + confidence +``` + +- The `.traineddata` loader (D-OCR-10) extracts the recognizer net + weights and + **exports GGUF**, not raw ndarray hydration — so it enters the existing runbook + unchanged. `bgz_tensor` stores/streams the weight tensors (its §2.3 use-case). +- `embedanything::infer()` runs the CNN+BiLSTM. recodebeam (transcoded) decodes the + posteriors. Only the matmul/gate compute is delegated; the decode stays 1:1. +- Reuses every existing optimization (candle GGUF loader, ndarray AMX, bgz storage) + — zero new inference stack. + +## 3. The one DTO extension OCR imposes + +`embedanything::infer` today shapes outputs as `EmbeddingVector`/`HypothesisScore`. +The recognizer needs **per-timestep posteriors** (`[T, n_classes]`) for CTC/recodebeam. +→ **D-OCR-15:** add a sequence-output variant to the DTO (`infer_sequence → [T,C]`), +the only change the OCR use forces on the shared interface. Narrow, additive. + +## 4. 
Deliverable index (D-OCR-NN) + +| ID | Deliverable | Depends | +|---|---|---| +| D-OCR-10 | `.traineddata` parse → unicharset/recoder/DAWG + recognizer-net → **GGUF export** | D-OCR-40 | +| D-OCR-15 | `embedanything` sequence-output (`infer_sequence → [T,C]`) | none | +| D-OCR-16 | recognizer GGUF runs via embedanything(candle)→ndarray; posteriors match C++ | D-OCR-10,15 | +| D-OCR-21 | recodebeam + DAWG transcode (decode over hosted posteriors) | D-OCR-16 | +| D-OCR-30 | **textord/ccstruct layout 1:1** (raw-pointer faithful) | D-OCR-40 | +| D-OCR-31 | minimal Leptonica ops on image/imageproc (numeric parity) | none | +| D-OCR-40 | AST-DLL clang→IR→Rust codegen harness (ruff emission) | none | +| D-OCR-41 | IR → committed Rust via ruff emission (re-run is byte-identical) | D-OCR-40 | +| D-OCR-42 | oracle diff-gate (every module vs libtesseract FFI) | D-OCR-21,30 | +| D-OCR-50 | OCR token → canonical NodeRow (OGAR class, HHTL, ValueSchema, edges) | canon | +| D-OCR-51 | `ValueSchema` OCR preset (FieldMask over existing tenants); token round-trip | D-OCR-50 | +| D-OCR-52 | DeepNSM + char-confusion + CAM/PQ token repair | D-OCR-50 | +| D-OCR-53 | bit-reproducibility harness (crop→text→NodeRow golden diff) | D-OCR-21,30,**50,51** | + +Critical path: **40 → {10,30} → 16 → 21 → 42 → 53**. D-OCR-15 parallel (tiny). The +layout transcode (30) and the recognizer host (16) are independent until decode. + +## 5. Success criteria + +1. `tesseract-rs` reproduces libtesseract output **byte-identical** on ≥10k line crops AND full-page layouts (oracle gate); layout is included, since it is 1:1. +2. LSTM runs ONLY via embedanything(candle)→ndarray; no transcoded LSTM kernels. +3. Zero crates.io (forks + ndarray/lance family); `ort` sole opt-in C++ (off by default). +4. Recognized tokens land as canonical NodeRows with no bespoke geometry. + +## 6. Sub-plans (re-cast for v2) +- `tesseract-rs-traineddata-gguf-v1` (D-OCR-10/16): loader → GGUF → embedanything host. +- `tesseract-rs-layout-transcode-v1` (D-OCR-30/31): textord/ccstruct 1:1 + Leptonica ops. +- `tesseract-rs-recodebeam-transcode-v1` (D-OCR-21): decoder over hosted posteriors.
+- `tesseract-rs-ast-dll-codegen-v1` (D-OCR-40/41/42): the codegen harness (covers layout). +- `ocr-canonical-soa-integration-v1` (D-OCR-50/51/52/53): OCR token = NodeRow + repair.
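Reviewer note on "OCR token = NodeRow": the integration plan describes the key as a 16-byte `NodeGuid` of `classid · HEEL · HIP · TWIG · family · identity`, with `family` and `identity` at 3 bytes each and `local_key()` returning the trailing 6 bytes. A minimal sketch of that packing follows. The 4-byte `classid`, the 2-byte width of each HHTL level, and the little-endian field encoding are assumptions made so the widths sum to 16; `NodeGuidSketch` is a hypothetical name, not the type in `canonical_node.rs`.

```rust
/// Sketch of a 16-byte key. Assumed layout (not confirmed by the plan):
/// bytes 0..4 classid, 4..6 HEEL, 6..8 HIP, 8..10 TWIG,
/// 10..13 family, 13..16 identity, all little-endian.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NodeGuidSketch(pub [u8; 16]);

impl NodeGuidSketch {
    /// Packs the fields; returns None if family or identity exceed 3 bytes.
    pub fn pack(
        classid: u32,
        heel: u16,
        hip: u16,
        twig: u16,
        family: u32,
        identity: u32,
    ) -> Option<Self> {
        if family > 0x00FF_FFFF || identity > 0x00FF_FFFF {
            return None;
        }
        let mut b = [0u8; 16];
        b[0..4].copy_from_slice(&classid.to_le_bytes());
        b[4..6].copy_from_slice(&heel.to_le_bytes());
        b[6..8].copy_from_slice(&hip.to_le_bytes());
        b[8..10].copy_from_slice(&twig.to_le_bytes());
        // Keep only the low three bytes of each 24-bit field.
        b[10..13].copy_from_slice(&family.to_le_bytes()[..3]);
        b[13..16].copy_from_slice(&identity.to_le_bytes()[..3]);
        Some(Self(b))
    }

    /// classid stays at offset 0, per the reserve-don't-reclaim note.
    pub fn classid(&self) -> u32 {
        u32::from_le_bytes([self.0[0], self.0[1], self.0[2], self.0[3]])
    }

    /// Trailing 6 bytes: family (3 B) followed by identity (3 B).
    pub fn local_key(&self) -> [u8; 6] {
        let mut k = [0u8; 6];
        k.copy_from_slice(&self.0[10..16]);
        k
    }
}
```

Under these assumptions a token on document 1, page 2, block 3, line basin `0x000A0B`, ordinal 7 would be `NodeGuidSketch::pack(0, 1, 2, 3, 0x000A0B, 7)`, using the `0x0000_0000` fallback classid until OGAR mints the OCR class.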