From d911a1c9cbf5157194db60e8a30d839fd4e6e589 Mon Sep 17 00:00:00 2001
From: AdaWorldAPI <jan@exo.red>
Date: Mon, 15 Jun 2026 19:49:03 +0200
Subject: [PATCH 01/25] docs(plan): plant ocr-canonical-soa-integration-v1.md

---
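Note on §1 (`NodeGuid`, 16 B): the key layout can be pictured as a plain byte
carve. The sketch below is illustrative only. The type name `OcrGuidSketch`, the
field widths (classid 4 B, HEEL/HIP/TWIG 2 B each) and the little-endian packing
are assumptions made here so the arithmetic sums to 16 B; the authoritative
definition is `NodeGuid` in `canonical_node.rs`.

```rust
/// Hypothetical stand-in for `NodeGuid`; not the real type.
/// Assumed carve: classid 4 B | HEEL 2 B | HIP 2 B | TWIG 2 B | family 3 B | identity 3 B.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct OcrGuidSketch([u8; 16]);

impl OcrGuidSketch {
    /// Packs the six fields little-endian; returns None if a 3-byte field overflows.
    fn pack(classid: u32, heel: u16, hip: u16, twig: u16, family: u32, identity: u32) -> Option<Self> {
        // family and identity must each fit in 24 bits.
        if family > 0x00FF_FFFF || identity > 0x00FF_FFFF {
            return None;
        }
        let mut b = [0u8; 16];
        b[0..4].copy_from_slice(&classid.to_le_bytes());
        b[4..6].copy_from_slice(&heel.to_le_bytes());
        b[6..8].copy_from_slice(&hip.to_le_bytes());
        b[8..10].copy_from_slice(&twig.to_le_bytes());
        b[10..13].copy_from_slice(&family.to_le_bytes()[..3]);
        b[13..16].copy_from_slice(&identity.to_le_bytes()[..3]);
        Some(OcrGuidSketch(b))
    }

    /// Trailing 6 bytes (family + identity), i.e. the token's address within its basin.
    fn local_key(&self) -> [u8; 6] {
        let mut k = [0u8; 6];
        k.copy_from_slice(&self.0[10..16]);
        k
    }
}
```

With the `0x0000_0000` fallback classid, `OcrGuidSketch::pack(0, doc, page, block, line, ordinal)`
yields a key whose `local_key()` depends only on the line basin and the token ordinal.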
 .../plans/ocr-canonical-soa-integration-v1.md | 119 ++++++++++++++++++
 1 file changed, 119 insertions(+)
 create mode 100644 .claude/plans/ocr-canonical-soa-integration-v1.md
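Note on §4 (character/orthographic layer): a minimal sketch of how the
`0/O 1/I/l 5/S rn/m` confusion table could drive candidate generation. The function
names, the single-edit limit and the `HashSet` lexicon are assumptions for
illustration; the plan leaves the real design open (OD-52a), and the number/date
grammars are not modelled here.

```rust
use std::collections::HashSet;

/// Confusable pairs from §4, listed in both directions.
const CONFUSIONS: &[(&str, &str)] = &[
    ("0", "O"), ("O", "0"),
    ("1", "I"), ("I", "1"), ("1", "l"), ("l", "1"), ("I", "l"), ("l", "I"),
    ("5", "S"), ("S", "5"),
    ("rn", "m"), ("m", "rn"),
];

/// Every string reachable from `token` by one substitution at one position.
fn single_edit_candidates(token: &str) -> Vec<String> {
    let mut out = Vec::new();
    for &(from, to) in CONFUSIONS {
        // All patterns are ASCII, so the match indices are valid slice boundaries.
        for (idx, _) in token.match_indices(from) {
            let mut s = String::with_capacity(token.len() + to.len());
            s.push_str(&token[..idx]);
            s.push_str(to);
            s.push_str(&token[idx + from.len()..]);
            out.push(s);
        }
    }
    out
}

/// Keeps a token the lexicon accepts; otherwise returns the first accepted single-edit candidate.
fn repair(token: &str, lexicon: &HashSet<&str>) -> Option<String> {
    if lexicon.contains(token) {
        return Some(token.to_string());
    }
    single_edit_candidates(token)
        .into_iter()
        .find(|c| lexicon.contains(c.as_str()))
}
```

In the plan's terms the lexicon lookup would be replaced by DeepNSM plausibility and
CAKES nearest-valid-token search; the set lookup above only stands in for that step.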

diff --git a/.claude/plans/ocr-canonical-soa-integration-v1.md b/.claude/plans/ocr-canonical-soa-integration-v1.md
new file mode 100644
index 00000000..a31524d9
--- /dev/null
+++ b/.claude/plans/ocr-canonical-soa-integration-v1.md
@@ -0,0 +1,119 @@
+# OCR → Canonical SoA Integration v1
+
+> **Type:** plan (sub-plan — the one that binds OCR to the lance-graph substrate). Deliverables D-OCR-50/51/52/53.
+> **Status:** PLANTED 2026-06-15 — design only. THIS is "use the new architecture we raced for."
+> **Front:** post-#496. Integration surface = `canonical_node.rs` (`NodeGuid`/`EdgeBlock`/`EdgeCodecFlavor`/`NodeRow`/`ValueTenant`/`ValueSchema`/`NodeRowPacket`) + `class_view.rs` (`ClassView`/`FieldMask`).
+> **Canon anchors:** OGAR/CLAUDE.md P0 GUID; lance-graph/CLAUDE.md SoA node (`4ea6ac9`); soa-three-tier-model; DeepNSM crate (`lance-graph/crates/deepnsm`); helix/CAM-PQ (`crates/helix`, `bgz-tensor` CAM-PQ).
+> **Skip-by-rule:** OCR introduces NO bespoke row geometry. It rides the existing value-tenant carve.
+
+---
+
+## 0. Intent
+
+An OCR token is not a foreign payload that needs a boundary adapter — it **is** a
+canonical SoA node. This plan defines the mapping so recognized text lands directly
+in the substrate: addressed by HHTL, classed by OGAR, valued by a `ValueSchema`
+preset over *existing* `ValueTenant`s, edged by `EdgeCodecFlavor`, repaired by
+DeepNSM + CAM/PQ, and persisted via `NodeRowPacket`. Zero boundary tax — the whole
+point of the splat-native / "one representation, many views" doctrine, applied to OCR.
+
+## 1. OCR token → `NodeRow` mapping (D-OCR-50)
+
+**Key (`NodeGuid`, 16 B):** `classid · HEEL · HIP · TWIG · family · identity`.
+- `classid` = the minted OCR class prefix (see §2). `0x0000_0000` fallback until OGAR mints it.
+- HHTL path (HEEL/HIP/TWIG) = document → page → block (the layout hierarchy from `ocrs::layout_analysis`).
+- `family` (3 B) = line/region basin; `identity` (3 B) = token ordinal within the basin.
+  → `local_key()` (trailing 6 B) addresses a token within its line after the trie walk.
+
+**Edges (`EdgeBlock`, 16 B = 12 in-family + 4 out-of-family):**
+- in-family (12): reading-order + local-layout adjacency (prev/next token, same-line
+  neighbors, baseline siblings). `EdgeCodecFlavor::CoarseOnly` (1 B/slot) — pure topology.
+- out-of-family (4): inherited adapters — (A) table-cell membership, (B) block/column
+  parent, (C) semantic/coref link (post-DeepNSM), (D) source-region (bbox → page geometry).
+
+## 2. OCR class + HHTL address scheme (D-OCR-50)
+
+- Mint an OCR class family in OGAR (`ogar-ontology`): `Document → Page → Block →
+  Line → Token`, with leaf token subtypes (`Word`, `Number`, `Date`, `Currency`,
+  `Glyph`, `TableCell`). Until OGAR mints them, hardcode the classid prefix space
+  per the reserve-don't-reclaim ladder (the classid bytes stay reserved at offset 0).
+- `ClassView` for the OCR class declares `edge_codec_flavor` (`CoarseOnly`) and
+  `value_schema` (the OCR preset, §3).
+
+## 3. OCR `ValueSchema` preset over EXISTING tenants (D-OCR-51)
+
+The 480-byte value slab already carves into `VALUE_TENANTS`. OCR **rides existing
+tenants** — no new tenant for the POC:
+
+| Tenant (existing) | OCR use |
+|---|---|
+| `Fingerprint` (32 B / 256-bit) | glyph/line identity print (DeepNSM `encoder` XOR-bind/bundle of the crop) |
+| `TurbovecResidue` (16 B, PQ) | glyph embedding → CAKES nearest-valid-token search |
+| `HelixResidue` (48 B) | orthogonal residue: per-token deviation from class centroid (confidence-as-residue) |
+| `Meta` (u64) | packed confidence + NSM-repair flags + token-subtype bits |
+| `EntityType` (u16) | OCR token class discriminator (Word/Number/Date/Glyph/TableCell) |
+| `Plasticity` (u32) | correction history / last-repair stamp |
+
+→ define `ValueSchema::Ocr` (or select `Cognitive` if its mask already covers the
+above) as a `FieldMask` over those `ValueTenant` positions. Selection only — it
+carves *within* the slab, moves nothing (canon: tenants never move/reuse).
+
+**OD-1 (deferred):** a dedicated `ValueTenant::OcrEvidence` (bbox `[f16;4]` +
+per-char confidence + top-k recodebeam candidates) is the clean home for
+recognizer evidence. Adding a tenant is canon-significant, so the POC packs a
+compressed form into `Meta`+`HelixResidue` and defers the dedicated tenant to a
+follow-up once the evidence shape is stable (needs D-OCR-21 `lstm_choice_mode`).
+
+## 4. Repair: DeepNSM + CAM/PQ nearest-valid-token (D-OCR-52)
+
+The recognizer emits candidates+confidence; repair is the brainstem we already have:
+- **Character/orthographic layer (new, thin, below DeepNSM):** `0/O 1/I/l 5/S rn/m`
+  confusion table + number/date/currency/table-cell grammars. Repairs orthography on
+  OOV garbage (codes, IDs like `69B8`) BEFORE the word layer. (This is the only
+  genuinely greenfield code; the word-frequency half already exists as
+  `deepnsm/word_frequency`.)
+- **Word layer = `deepnsm`:** `vocabulary` → `codebook` → `parser`/`pos` → `encoder`
+  → `similarity`/`cam64`/`crystal_neighborhood`. Word-level plausibility + disambiguation.
+- **Nearest-valid-token = helix / CAM-PQ / CAKES:** the glyph `TurbovecResidue`
+  (PQ) + `HelixResidue` feed CAKES nearest-valid-token; CHAODA flags anomalous
+  tokens (likely-misrecognized). This is `bgz-tensor` CAM-PQ + `crates/helix`.
+
+Repaired token writes back: corrected text → `Fingerprint`/`EntityType`, repair
+provenance → `Meta`/`Plasticity`.
+
+## 5. Persistence + planner (kv-lance / surreal)
+
+- `NodeRowPacket` → `SoaEnvelope` → Lance (kv-lance backend, per `surrealdb` fork).
+  OCR nodes are ordinary rows; a Lance version is a coherent page/document snapshot.
+- `surreal_container` as the **OCR-job control plane** (per its role: planner / AST
+  adapter / time-series / kanban): kanban of OCR jobs (queued→detect→recognize→
+  repair→persisted via the Rubicon transitions already in `soa_view.rs`), time-series
+  of throughput, AST API for the repair-grammar (compile-time vs JIT grammars).
+
+## 6. Bit-reproducibility harness (D-OCR-53) — the migration payoff
+
+The transcode oracle (D-OCR-2x) makes OCR a **deterministic regression source for
+the whole SoA migration**: the same line crop → C++ Tesseract text AND Rust port
+text AND the resulting `NodeRow` bytes. Because every stage is supposed to be
+bit-reproducible (DeepNSM bit-reproducible, envelope version-stamped, CausalEdge64
+locked), a golden-file diff over (crop → NodeRow) exercises exactly the muscles the
+migration must harden: `ndarray::hpc` hydration, the envelope LE round-trip, and
+SIMD numeric exactness. OCR is the best external oracle the substrate has.
+
+## 7. Deliverables
+
+- **D-OCR-50:** OCR class + HHTL address scheme; `ClassView` impl for OCR class.
+- **D-OCR-51:** `ValueSchema` OCR preset (FieldMask over existing tenants); a token
+  round-trips token→NodeRow→token with no geometry change.
+- **D-OCR-52:** DeepNSM + character-confusion layer + CAM/PQ repair wired; a known
+  OCR-garbage fixture (`69B8`, `rn`→`m`) is repaired by plausibility.
+- **D-OCR-53:** golden-file (crop → NodeRow bytes) regression green, shared with the
+  SoA migration suite.
+
+## 8. Open decisions
+
+- **OD-1:** dedicated `ValueTenant::OcrEvidence` vs ride `Meta`+`HelixResidue` (POC rides).
+- **OD-50a:** is a "Token" one node, or is a "Line" the node and tokens are value-slab
+  sub-records? (Node-per-token is simpler + edges are natural; node-per-line is denser.)
+- **OD-52a:** character-confusion layer as a `deepnsm` submodule vs a sibling
+  `coca-codebook` crate. (Word-frequency half already lives in `deepnsm/word_frequency`.)

From 9725503960781cad2fe5e0987cfd801ba8f0a345 Mon Sep 17 00:00:00 2001
From: AdaWorldAPI <jan@exo.red>
Date: Mon, 15 Jun 2026 19:49:04 +0200
Subject: [PATCH 02/25] docs(plan): plant tesseract-rs-ast-dll-codegen-v1.md

---
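Note on §4 (structural diff-gate): one way to picture the `dto_check`-style pass is
a comparison of each emitted struct's size and field offsets against the C++
`sizeof`/offset dump. The struct, its fields and the `FieldDump` shape below are
hypothetical; they are not Tesseract types and not the actual `dto_check` API.

```rust
use std::mem::{offset_of, size_of};

/// Hypothetical emitted struct; field names and widths are illustrative only.
#[repr(C)]
struct EmittedRecord {
    id: u32,
    flags: u16,
    kind: u8,
    pad: u8,
    weight: f32,
}

/// One row of an assumed C++ offset dump: field name and byte offset.
struct FieldDump {
    name: &'static str,
    offset: usize,
}

/// Fails on the first size or offset that disagrees with the dump.
fn check_layout(expected_size: usize, dump: &[FieldDump]) -> Result<(), String> {
    let actual_size = size_of::<EmittedRecord>();
    if actual_size != expected_size {
        return Err(format!("size {} != expected {}", actual_size, expected_size));
    }
    for f in dump {
        let actual = match f.name {
            "id" => offset_of!(EmittedRecord, id),
            "flags" => offset_of!(EmittedRecord, flags),
            "kind" => offset_of!(EmittedRecord, kind),
            "pad" => offset_of!(EmittedRecord, pad),
            "weight" => offset_of!(EmittedRecord, weight),
            other => return Err(format!("unknown field {}", other)),
        };
        if actual != f.offset {
            return Err(format!("{}: offset {} != expected {}", f.name, actual, f.offset));
        }
    }
    Ok(())
}
```

A generated check of this kind would be emitted per struct alongside the Rust source,
so a silent re-ordering or re-widening fails the gate rather than the behavioral diff.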
 .../plans/tesseract-rs-ast-dll-codegen-v1.md  | 86 +++++++++++++++++++
 1 file changed, 86 insertions(+)
 create mode 100644 .claude/plans/tesseract-rs-ast-dll-codegen-v1.md

diff --git a/.claude/plans/tesseract-rs-ast-dll-codegen-v1.md b/.claude/plans/tesseract-rs-ast-dll-codegen-v1.md
new file mode 100644
index 00000000..ebe1d21b
--- /dev/null
+++ b/.claude/plans/tesseract-rs-ast-dll-codegen-v1.md
@@ -0,0 +1,86 @@
+# tesseract-rs — AST-DLL C++→Rust Codegen Harness v1
+
+> **Type:** plan (sub-plan). Deliverables D-OCR-40/41/42. The transcode *mechanism*.
+> **Status:** PLANTED 2026-06-15 — design only.
+> **Front:** post-#496. Uses `AdaWorldAPI/ruff` AST/codegen crates as the Rust-emission engine.
+> **Canon anchors:** master §4. Deterministic + diff-gated (bit-reproducibility doctrine).
+> **Skip-by-rule:** only leaf/mechanical modules are codegen targets; ownership-heavy code is hand-ported or replaced.
+
+---
+
+## 0. Intent
+
+Transcode the *mechanical* C++ leaf modules (container parse, unicharset, recoder,
+dawg node-arrays, weight-matrix struct walks) into Rust by a **deterministic,
+reviewable codegen harness** rather than by hand — so the faithful tier is
+auditable and re-runnable. The harness pairs a **clang C++ AST frontend** with a
+**Rust emission backend built on the `ruff` AST/codegen crates**.
+
+## 1. Why ruff (honest scoping)
+
+`ruff` is a *Python* toolchain — `ruff_python_parser` / `ruff_python_ast` parse
+**Python**, not C++. So ruff is **not** the C++ frontend. Its value here is the
+mature, battle-tested **Rust-side AST → source emission discipline**:
+`ruff_python_codegen` (AST → formatted source), `ruff_formatter` (the formatting
+IR), `ruff_source_file`, and the `ruff_python_dto_check` pattern (structural
+invariant checks on a typed AST). We reuse those *patterns and crates* as the
+emission/formatting backend for a `RustAst → rust source` pipeline. The C++ side is
+clang.
+
+```
+C++ source ──(libclang)──► Clang AST ──► [AST DLL: stable IR dump] ──► RustAst builder
+   ──► (ruff codegen/formatter discipline) ──► formatted .rs ──► diff-gate vs FFI oracle
+```
+
+## 2. The "AST DLL" — D-OCR-40
+
+The C++ AST is extracted once into a **stable, serializable IR** (the "AST DLL"):
+a libclang traversal that dumps the subset we transcode (struct/enum decls, plain
+methods, table initializers, fixed-size array walks) as a typed IR — independent of
+clang version drift, so the emission step is reproducible. Functions touching
+pointers-into-mutable-graphs, virtual dispatch, or template metaprogramming are
+**flagged NOT-CODEGENABLE** and routed to hand-port/replace (most of them are layout
+code, which master §3 already skips).
+
+## 3. Rust emission via ruff crates — D-OCR-41
+
+A `RustAst` builder consumes the IR and emits idiomatic Rust:
+- field-by-field struct/enum transcription (canon: byte layout preserved);
+- table/array initializers → `const`/`static` Rust tables;
+- the emission goes through ruff's formatter IR so output is deterministic and
+  diff-stable (re-running codegen produces byte-identical source).
+- a `dto_check`-style pass asserts the **LE byte contract** is preserved per struct
+  (no silent re-ordering / re-widening — the same invariant the SoA envelope audit
+  enforces).
+
+## 4. Diff-gate — D-OCR-42
+
+Every codegen'd module is validated against the FFI oracle:
+- behavioral: emitted Rust function vs `libtesseract` function on the same inputs
+  (e.g. unicharset id↔utf8, recoder encode/decode, dawg word-membership) → byte-equal;
+- structural: `dto_check` confirms each emitted struct's byte image matches the C++ `sizeof`/offset dump.
+
+Codegen output is committed (not generated at build) so reviewers see real Rust;
+the harness is re-runnable to prove the commit equals the generator output.
+
+## 5. Module assignment (codegen vs hand vs replace)
+
+| C++ area | Route |
+|---|---|
+| `tessdatamanager`, `unicharset`, `unicharcompress` (recoder), `dawg`/`trie` node arrays, `weightmatrix` struct/quant walks | **CODEGEN (D-OCR-41)** |
+| `recodebeam` (beam + dawg interaction), int8 GEMV rounding | **HAND-PORT** (numeric/behavioral subtlety) |
+| `textord`/`ccstruct` layout, Leptonica | **REPLACE** (ocrs / minimal imageproc) — never enters the harness |
+
+## 6. Deliverables
+
+- **D-OCR-40:** libclang → stable IR dump for the codegen-target module set; NOT-CODEGENABLE flagging works.
+- **D-OCR-41:** IR → committed Rust via ruff emission; re-run is byte-identical.
+- **D-OCR-42:** behavioral + structural diff-gate green for the target modules vs the FFI oracle.
+
+## 7. Open decisions
+
+- **OD-3 (from master):** libclang in-process vs clang `-ast-dump=json` consumed by
+  a Rust IR. JSON is simpler/decoupled; libclang is richer/faster. Default: clang
+  JSON dump for v1 (decoupled, reproducible), libclang later if needed.
+- **OD-40a:** is the AST-DLL harness OCR-specific, or a reusable
+  `AdaWorldAPI/<cpp-transcode>` tool? (It would also serve other C++→Rust ports.)

From 1e246009f487ad15c9adf732395d4306d1ea7c51 Mon Sep 17 00:00:00 2001
From: AdaWorldAPI <jan@exo.red>
Date: Mon, 15 Jun 2026 19:49:05 +0200
Subject: [PATCH 03/25] docs(plan): plant tesseract-rs-lstm-recodebeam-v1.md

---
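Note on §3 (int8 conformance): a scalar reference plus a bitwise comparison is the
simplest shape the harness could take. The sketch below assumes row-major `i8`
weights, `i32` accumulation in index order and one `f32` scale per row; those
choices are illustrative and are not claimed to be Tesseract's `IntSimdMatrix`
order or rounding, which D-OCR-22 still has to pin down.

```rust
/// Scalar reference GEMV: i8 x i8 products accumulated in i32 in index order,
/// then scaled once per row. Layout and scaling are assumptions for illustration.
fn gemv_i8_reference(weights: &[i8], rows: usize, cols: usize, scales: &[f32], input: &[i8]) -> Vec<f32> {
    assert_eq!(weights.len(), rows * cols);
    assert_eq!(scales.len(), rows);
    assert_eq!(input.len(), cols);
    (0..rows)
        .map(|r| {
            let row = &weights[r * cols..(r + 1) * cols];
            let acc: i32 = row
                .iter()
                .zip(input)
                .map(|(&w, &x)| i32::from(w) * i32::from(x))
                .sum();
            acc as f32 * scales[r]
        })
        .collect()
}

/// Index of the first element whose bit pattern differs (a 0-ULP comparison),
/// or None when the slices are bit-identical. A length mismatch reports the
/// shorter length as the mismatch index.
fn first_bit_mismatch(a: &[f32], b: &[f32]) -> Option<usize> {
    if a.len() != b.len() {
        return Some(a.len().min(b.len()));
    }
    a.iter().zip(b).position(|(x, y)| x.to_bits() != y.to_bits())
}
```

Comparing `to_bits()` rather than using `==` means `0.0` and `-0.0` count as a
mismatch, which is the strict reading of "within 0 ULP".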
 .../plans/tesseract-rs-lstm-recodebeam-v1.md  | 79 +++++++++++++++++++
 1 file changed, 79 insertions(+)
 create mode 100644 .claude/plans/tesseract-rs-lstm-recodebeam-v1.md

diff --git a/.claude/plans/tesseract-rs-lstm-recodebeam-v1.md b/.claude/plans/tesseract-rs-lstm-recodebeam-v1.md
new file mode 100644
index 00000000..be4cd8dc
--- /dev/null
+++ b/.claude/plans/tesseract-rs-lstm-recodebeam-v1.md
@@ -0,0 +1,79 @@
+# tesseract-rs — LSTM Forward + recodebeam Decoder v1
+
+> **Type:** plan (sub-plan). Deliverables D-OCR-20/21/22.
+> **Status:** PLANTED 2026-06-15 — design only.
+> **Front:** post-#496. Forward pass targets `ndarray` (SIMD/BLAS/CLAM provider). Oracle = `tesseract-rs` FFI fork.
+> **Canon anchors:** master §4; `ndarray` SIMD kernels; bit-reproducibility doctrine (DeepNSM "bit-reproducible", envelope version-stamp).
+> **Skip-by-rule:** no legacy matcher; no layout. Input is a line crop, output is text + per-step posteriors.
+
+---
+
+## 0. Intent
+
+Run the hydrated LSTM (D-OCR-11) forward on `ndarray` to produce per-timestep
+class posteriors, then decode them with a faithful `recodebeam` (dictionary-aware,
+CTC-style) to text — **byte-identical to Tesseract** on fixed line crops. This is
+the tier that makes "1:1 Tesseract" provable; everything else is plumbing.
+
+## 1. Forward pass (`tesseract-rs/src/lstm/`, on ndarray) — D-OCR-20
+
+Faithful transcode of `lstm/` numerics. Each maps to ndarray ops:
+
+| Tesseract unit | ndarray realization | Exactness note |
+|---|---|---|
+| `WeightMatrix::MatrixDotVector` (int8) | int8 GEMV via ndarray SIMD kernel | **accumulation order + rounding must match** (D-OCR-22) |
+| `FullyConnected` | matmul + bias + activation LUT | activation table must be the same fixed-point LUT |
+| `LSTM` cell (gates i/f/o/g, peephole) | elementwise on ndarray slices | sigmoid/tanh LUTs identical to C++ |
+| `Convolve` / `Maxpool` | im2col + GEMM / window-max | stride/pad identical |
+| `Softmax` / `LogSoftmax` | row softmax | only at the output; feeds the beam |
+
+Float path is straightforward. **The int8 path is where silent drift lives** — it
+is the whole point of D-OCR-22.
+
+## 2. recodebeam decoder (`tesseract-rs/src/recodebeam.rs`) — D-OCR-21
+
+Hand-port (NOT codegen): tie-breaking, normalization, and dawg interaction are
+under-documented and behaviorally subtle.
+
+- Beam over the **recoder** codes (not raw unichars): the `RecodeBeamSearch`
+  maintains dawg-constrained and unconstrained beams; final path picks per
+  Tesseract's certainty/rating rule.
+- DAWG dictionary (`dict/{dawg,trie,permdawg}`) — **codegen-amenable** node-array
+  walks; the *interaction* with the beam is hand-ported.
+- Output: best text + per-token rating/certainty → becomes per-token confidence at
+  the emit stage (master §1).
+
+## 3. int8-SIMD numeric exactness conformance — D-OCR-22
+
+The conformance contract that earns "1:1":
+
+1. Pin the int8 GEMV accumulation order to Tesseract's (block/tile order matters).
+2. Match the fixed-point rounding mode of `IntSimdMatrix` (AVX2/512/NEON variants
+   reduce in a defined order — replicate it, do not "improve" it).
+3. Identical activation LUTs (sigmoid/tanh/softmax) — copy the tables, not the
+   formulae.
+4. Conformance harness: feed N line crops, compare per-timestep argmax AND the
+   full posterior (within 0 ULP for int8) against an FFI dump from the oracle.
+
+## 4. Ground-truth oracle
+
+The `AdaWorldAPI/tesseract-rs` FFI fork (thin bindings, `src/{lib,page_seg_mode}.rs`)
+is built **only** as the oracle: it runs real `libtesseract` to dump (a) per-matrix
+weights, (b) per-timestep posteriors, (c) final decoded text for the same crops.
+The Rust port is diffed against these. The oracle is a dev/test dependency, never a
+runtime path, and the lone place the Leptonica C fork is compiled.
+
+## 5. Deliverables
+
+- **D-OCR-20:** forward pass on ndarray reproduces C++ per-timestep posteriors
+  (float path 1:1; int8 path within the D-OCR-22 contract) on a 1k-crop set.
+- **D-OCR-21:** `recodebeam` + DAWG reproduces C++ decoded text byte-identical on
+  the same set.
+- **D-OCR-22:** int8 conformance harness green on ≥ 10k crops across 2+ languages.
+
+## 6. Open decisions
+
+- **OD-20a:** target one SIMD width first (AVX2) for exactness, then NEON/AVX-512;
+  or define the scalar reference as canonical and treat SIMD as "must equal scalar"?
+- **OD-21a:** support Tesseract's `lstm_choice_mode` (top-k per timestep) now, since it
+  feeds the OCR `ValueTenant` top-k candidates (`ocr-canonical-soa-integration-v1` OD-1), or later?

From 09b0b4e776b922b2c05a7793d9995dee9491e710 Mon Sep 17 00:00:00 2001
From: AdaWorldAPI <jan@exo.red>
Date: Mon, 15 Jun 2026 19:49:05 +0200
Subject: [PATCH 04/25] docs(plan): plant tesseract-rs-neural-layout-ocrs-v1.md

---
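Note on §2 (backend wiring): the "same token struct behind one trait" idea can be
sketched as below. `Token`, `LineCrop`, the method names and the `String` error type
are all hypothetical; the plan only states that backends implement `Recognizer` and
emit tokens with confidence and bbox. `FixedBackend` is a stand-in that performs no
OCR and exists only to exercise the trait.

```rust
/// Hypothetical normalized token; the plan only names text, confidence and bbox.
#[derive(Debug, Clone, PartialEq)]
struct Token {
    text: String,
    confidence: f32,
    bbox: [f32; 4],
}

/// Hypothetical input: an 8-bit grayscale line crop, row-major.
struct LineCrop<'a> {
    width: u32,
    height: u32,
    pixels: &'a [u8],
}

/// Sketch of the engine trait; the signatures are assumptions.
trait Recognizer {
    fn name(&self) -> &'static str;
    fn recognize(&self, crop: &LineCrop) -> Result<Vec<Token>, String>;
}

/// Stand-in backend that returns canned tokens after validating the crop size.
struct FixedBackend(Vec<Token>);

impl Recognizer for FixedBackend {
    fn name(&self) -> &'static str {
        "fixed"
    }

    fn recognize(&self, crop: &LineCrop) -> Result<Vec<Token>, String> {
        let expected = crop.width as usize * crop.height as usize;
        if crop.pixels.len() != expected {
            return Err(format!("crop has {} pixels, expected {}", crop.pixels.len(), expected));
        }
        Ok(self.0.clone())
    }
}

/// The emit stage only sees the trait object, so it stays engine-agnostic.
fn run(engine: &dyn Recognizer, crop: &LineCrop) -> Result<Vec<Token>, String> {
    engine.recognize(crop)
}
```

Swapping `FixedBackend` for an `ocrs`, `tract` or transcode backend would not change
`run` or anything downstream of it, which is the property D-OCR-30 relies on.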
 .../tesseract-rs-neural-layout-ocrs-v1.md     | 75 +++++++++++++++++++
 1 file changed, 75 insertions(+)
 create mode 100644 .claude/plans/tesseract-rs-neural-layout-ocrs-v1.md

diff --git a/.claude/plans/tesseract-rs-neural-layout-ocrs-v1.md b/.claude/plans/tesseract-rs-neural-layout-ocrs-v1.md
new file mode 100644
index 00000000..376e15cf
--- /dev/null
+++ b/.claude/plans/tesseract-rs-neural-layout-ocrs-v1.md
@@ -0,0 +1,75 @@
+# tesseract-rs — Neural Layout + Recognition via ocrs/rten v1
+
+> **Type:** plan (sub-plan). Deliverables D-OCR-30/31.
+> **Status:** PLANTED 2026-06-15 — design only. Ships FIRST (POC default recognizer).
+> **Front:** post-#496. Uses `AdaWorldAPI/{ocrs,rten,tract,ort}` forks as-is.
+> **Canon anchors:** master §2 (engine trait), §3 (skip layout). Pure-Rust posture (no C/C++ deps except opt-in `ort`).
+> **Skip-by-rule:** this plan is *why* Tesseract `textord` is never transcoded — neural layout replaces it.
+
+---
+
+## 0. Intent
+
+Provide the default, modern recognition path with **zero C/C++ dependencies** by
+wiring the already-forked neural OCR stack behind the `Recognizer` trait. This is
+both the POC recognizer (ships before the transcode tier) and the permanent
+replacement for Tesseract's layout heuristics.
+
+## 1. The forked stack (confirmed surfaces)
+
+- **`ocrs`** (`ocrs/ocrs/src/`): `detection.rs` (text detection), `layout_analysis.rs`
+  + `layout_analysis/` (reading order / columns), `recognition.rs` (line recognizer),
+  `model.rs`, `preprocess.rs`, `text_items.rs`. Full detect→layout→recognize.
+- **`rten`** (`rten-*` crates): `rten-convert` (ONNX→`.rten`), `rten-onnx`,
+  `rten-model-file` (`.rten`), `rten-text` (CTC/text decode), `rten-simd` /
+  `rten-gemm` / `rten-vecmath` (kernels), `rten-imageproc`. ocrs runs on rten.
+- **`rten-ndarray-demo`**: reference for image-rs + ndarray + rten integration
+  (`mobilenet.rten` present) — the wiring template.
+- **`tract`** (pure-Rust general ONNX/TF) — arbitrary model escape hatch.
+- **`ort`** (ONNX Runtime FFI) — GPU / exotic-op escape, **feature-gated, off by default**.
+
+## 2. Backend wiring — D-OCR-30
+
+`tesseract-rs/src/engine/ocrs_backend.rs` implements `Recognizer`:
+
+```
+preprocess (image/imageproc) ─► ocrs::detection ─► ocrs::layout_analysis (reading order)
+   ─► per-line crops ─► ocrs::recognition (rten / rten-text CTC) ─► tokens + confidence + bbox
+```
+
+- Models: convert the ocrs detection + recognition ONNX to `.rten` via `rten-convert`
+  once; vendor the `.rten` blobs (or fetch in CI). Confirm the converter + current
+  model assets are present in the fork before relying on them (D-OCR-30 acceptance).
+- Output is normalized to the **same token struct** the transcode tier emits, so the
+  emit stage (`ocr-canonical-soa-integration-v1`) is engine-agnostic.
+
+## 3. Layout: why neural, not ported — D-OCR-30 (rationale)
+
+Tesseract's `textord/`+`ccstruct/` layout is intrusive linked-list + cyclic-mutable
+blob-graph heuristics (`BLOBNBOX`, `ColPartition`, tab-stop finder). A syntax-directed
+transcode (ruff/AST codegen rewrites *syntax*, not *ownership*) turns it into
+`Rc<RefCell<>>` sludge or `unsafe`. `ocrs` already does detection+layout+reading-order
+as a neural model — strictly better and pure-Rust. **Decision: layout is never
+transcoded; it is `ocrs`.** Tesseract's recognizer (the LSTM) is the only thing worth
+the bit-exact transcode (D-OCR-2x), and even then only as a compat/accuracy fallback.
+
+## 4. tract + ort escape hatches — D-OCR-31
+
+- `tract_backend.rs`: load an arbitrary ONNX OCR model (custom-trained, PaddleOCR
+  export). Pure Rust, CPU. The "I have a specific recognizer" path.
+- `ort_backend.rs`: GPU (CUDA/CoreML/TensorRT) or an op `tract` rejects. Feature
+  `ort-gpu`, off by default; documented as the sole C++ dependency.
+
+## 5. Deliverables
+
+- **D-OCR-30:** `Ocrs` backend end-to-end on a scanned PDF page → tokens+bbox+conf;
+  models present/converted; image→`.rten` path runs with no C deps.
+- **D-OCR-31:** `Tract` backend loads + runs one custom ONNX recognizer; `Ort`
+  backend compiles only under `--features ort-gpu`.
+
+## 6. Open decisions
+
+- **OD-30a:** front-end = hand-rolled `pdfium-render`+`image`+`imageproc`, or adopt
+  `AdaWorldAPI/ferrules` (layout-aware Rust document parser) as the PDF front-end?
+- **OD-30b:** keep a third "OCR-free" branch (`colpali`/`Qwen3-VL-Embedding`,
+  document-image → retrieval directly) as a separate plan, or note-and-defer?

From 1c9736fde0bc253e8fc974f6d009d3097e846311 Mon Sep 17 00:00:00 2001
From: AdaWorldAPI <jan@exo.red>
Date: Mon, 15 Jun 2026 19:49:06 +0200
Subject: [PATCH 05/25] docs(plan): plant
 tesseract-rs-traineddata-ndarray-v1.md

---
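Note on §2 (`container.rs`): the "offset table parse → component byte slices" step
might look like the sketch below. It assumes the file starts with a little-endian
`u32` entry count followed by that many little-endian `i64` offsets, with `-1`
marking an absent component. That is my reading of `TessdataManager` and should be
verified against the C++ (including its byte-swap handling, which is omitted here)
before anything depends on it. The function name is hypothetical.

```rust
/// Splits a buffer into per-component byte slices under the assumed layout:
/// LE u32 count, then `count` LE i64 offsets, -1 meaning "component absent".
fn split_components(buf: &[u8]) -> Result<Vec<Option<&[u8]>>, String> {
    let head = buf.get(0..4).ok_or("truncated header")?;
    let count = u32::from_le_bytes([head[0], head[1], head[2], head[3]]) as usize;
    let table_len = count.checked_mul(8).ok_or("entry count overflow")?;
    let table_end = table_len.checked_add(4).ok_or("entry count overflow")?;
    let table = buf.get(4..table_end).ok_or("truncated offset table")?;

    let offsets: Vec<i64> = table
        .chunks_exact(8)
        .map(|c| {
            let mut a = [0u8; 8];
            a.copy_from_slice(c);
            i64::from_le_bytes(a)
        })
        .collect();

    let mut out = Vec::with_capacity(count);
    for (i, &off) in offsets.iter().enumerate() {
        if off < 0 {
            out.push(None);
            continue;
        }
        // A component runs to the next present offset, or to the end of the buffer.
        let end = offsets[i + 1..]
            .iter()
            .copied()
            .find(|&o| o >= 0)
            .unwrap_or(buf.len() as i64);
        let slice = buf
            .get(off as usize..end as usize)
            .ok_or_else(|| format!("component {} out of bounds", i))?;
        out.push(Some(slice));
    }
    Ok(out)
}
```

The loader would then map slot indices to component names and drop the legacy ones
listed in §1; the slices borrow from the input, so nothing is copied at this stage.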
 .../tesseract-rs-traineddata-ndarray-v1.md    | 86 +++++++++++++++++++
 1 file changed, 86 insertions(+)
 create mode 100644 .claude/plans/tesseract-rs-traineddata-ndarray-v1.md

diff --git a/.claude/plans/tesseract-rs-traineddata-ndarray-v1.md b/.claude/plans/tesseract-rs-traineddata-ndarray-v1.md
new file mode 100644
index 00000000..8619f89f
--- /dev/null
+++ b/.claude/plans/tesseract-rs-traineddata-ndarray-v1.md
@@ -0,0 +1,86 @@
+# tesseract-rs — traineddata → ndarray Model Loader v1
+
+> **Type:** plan (sub-plan of `tesseract-rs-transcode-master-v1`). Deliverables D-OCR-10/11.
+> **Status:** PLANTED 2026-06-15 — design only.
+> **Front:** post-#496. Reuses `ndarray::hpc` weight-loading pattern (`src/hpc/{gguf.rs,gguf_indexer.rs,safetensors.rs,models/safetensors.rs}`).
+> **Canon anchors:** master plan §4; `ndarray` hpc loaders; `lance-graph-contract` LE column contract.
+> **Skip-by-rule:** legacy classifier `.traineddata` components (templates, adaptive) are NOT loaded.
+
+---
+
+## 0. Intent
+
+Make a modern Tesseract `.traineddata` file loadable in pure Rust as a hydrated
+weight set + symbol tables, using the *same* discipline as the existing GGUF /
+safetensors loaders in `ndarray::hpc`. A `.traineddata` is, for our purposes, just
+another model container: parse the directory of components, pull the LSTM weights,
+and hydrate them into `ndarray` arrays addressable by the forward pass (D-OCR-20).
+
+## 1. `.traineddata` is a TAR-like component bundle
+
+Modern (LSTM) `.traineddata` holds, among ~15 components, the ones we need:
+
+| Component | What it is | We need it for |
+|---|---|---|
+| `lstm` | the recognizer network (VGSL spec + weights) | D-OCR-11 weight hydration |
+| `lstm-unicharset` | LSTM-specific unicharset | symbol ↔ class index |
+| `lstm-recoder` | `UnicharCompress` recoder (unichar-id → recode codes) | recodebeam (D-OCR-21) |
+| `unicharset` | full unicharset (props, scripts, ranges) | char props / number-grammar |
+| `*.lstm-*-dawg` | dictionary DAWGs (word/number/punc/system) | dict correction (D-OCR-21) |
+| `version`, `config` | metadata | provenance, default params |
+
+Components we **ignore** (legacy): `inttemp`, `pffmtable`, `normproto`,
+`shapetable`, `*.params-model` (adaptive). The loader skips them by name.
+
+## 2. Loader layout (`tesseract-rs/src/traineddata/`)
+
+```
+traineddata/
+  container.rs     // TessdataManager: offset table parse → component byte slices  (CODEGEN: D-OCR-4x)
+  unicharset.rs    // Unicharset parse: id↔utf8, char props, script, ranges          (CODEGEN)
+  recoder.rs       // UnicharCompress: unichar-id↔recode-code maps                   (CODEGEN)
+  vgsl.rs          // VGSL network-spec parser → layer graph                          (hand-port: small, gnarly grammar)
+  weights.rs       // weight blobs → ndarray hydration (calls ndarray::hpc pattern)   (hand)
+  dawg.rs          // DAWG/Trie node arrays (squished + unsquished)                   (CODEGEN)
+  mod.rs           // TrainedData { net, unicharset, recoder, dicts }
+```
+
+## 3. Weight hydration — sibling of the GGUF path (D-OCR-11)
+
+The forward pass (D-OCR-20) consumes `ndarray` views, exactly as the GGUF/Qwen
+work does. `weights.rs` mirrors `ndarray::hpc::gguf`:
+
+- Add a `ModelSource::TrainedData` alongside the existing GGUF/safetensors sources
+  in `ndarray::hpc` so the LSTM weights flow through the same hydration/indexer.
+- Each Tesseract `WeightMatrix` (int8 quantized + float scale, or float) becomes
+  an `ndarray` 2-D array plus a per-row scale vector — preserve the quantization
+  exactly (see D-OCR-22; bit-exactness depends on it).
+- No transpose/normalization on load: store as Tesseract stores, defer layout to
+  the forward kernels so the numeric path is auditable against the C++.
+
+**Iron rule (inherited from the SoA envelope work):** the loader knows the LE byte
+contract for every weight blob; it never guesses. The component byte slices carry
+their own descriptor; hydration is a typed carve, not a reinterpretation.
+
+## 4. Codegen vs hand-port assignment
+
+| Module | Route | Why |
+|---|---|---|
+| `container.rs`, `unicharset.rs`, `recoder.rs`, `dawg.rs` | **AST-DLL codegen (D-OCR-4x)** | mechanical struct/table walks; faithful 1:1 from C++ AST |
+| `vgsl.rs` | hand-port | tiny but a bespoke mini-grammar; codegen overkill |
+| `weights.rs` | hand | bridges into `ndarray::hpc`; integration glue, not transcode |
+
+## 5. Deliverables
+
+- **D-OCR-10:** `TrainedData` loads a real `eng.traineddata` (Tesseract 4/5 format) → unicharset,
+  recoder, DAWGs, and the VGSL net spec, with legacy components skipped. Test:
+  round-trip the unicharset id↔utf8 against a C++ `combine_tessdata -u` dump.
+- **D-OCR-11:** LSTM weights hydrated into `ndarray` via a new
+  `ndarray::hpc::ModelSource::TrainedData`; shapes + quantization scales match a
+  C++ dump of the same network. Test: per-matrix shape + int8 scale equality.
+
+## 6. Open decisions
+
+- **OD-10a:** support both float and int8 `.traineddata`, or int8-only for v1?
+  (int8 is the `tessdata_fast` shipping form and the harder exactness case; `tessdata_best` is float.)
+- **OD-10b:** vendor a pinned `eng.traineddata` as a test fixture, or fetch in CI?

From 6ddbd9775e705f915a5c0b3f61bee4d9d2be3b0c Mon Sep 17 00:00:00 2001
From: AdaWorldAPI <jan@exo.red>
Date: Mon, 15 Jun 2026 19:49:08 +0200
Subject: [PATCH 06/25] docs(plan): plant tesseract-rs-transcode-master-v1.md

---
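Note on §2 (engine choice): the backend table reduces to a small selector. The enum
below mirrors the four rows of that table; the `parse` helper, its string keys and
the `needs_cpp` method are hypothetical names introduced only for this sketch.

```rust
/// The four backends from the §2 table; `Ocrs` is the stated default.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
enum Backend {
    TranscodeLstm,
    #[default]
    Ocrs,
    Tract,
    Ort,
}

impl Backend {
    /// Only the `Ort` escape hatch pulls in a C++ dependency (ONNX Runtime).
    fn needs_cpp(self) -> bool {
        matches!(self, Backend::Ort)
    }

    /// Hypothetical config-string mapping; the keys are not defined by the plan.
    fn parse(name: &str) -> Option<Backend> {
        match name {
            "transcode-lstm" => Some(Backend::TranscodeLstm),
            "ocrs" => Some(Backend::Ocrs),
            "tract" => Some(Backend::Tract),
            "ort" => Some(Backend::Ort),
            _ => None,
        }
    }
}
```

Keeping the choice in one enum makes the "runtime/feature choice, never a fork of
the pipeline" rule checkable: everything after selection goes through `Recognizer`.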
 .../plans/tesseract-rs-transcode-master-v1.md | 134 ++++++++++++++++++
 1 file changed, 134 insertions(+)
 create mode 100644 .claude/plans/tesseract-rs-transcode-master-v1.md

diff --git a/.claude/plans/tesseract-rs-transcode-master-v1.md b/.claude/plans/tesseract-rs-transcode-master-v1.md
new file mode 100644
index 00000000..3d77acf9
--- /dev/null
+++ b/.claude/plans/tesseract-rs-transcode-master-v1.md
@@ -0,0 +1,134 @@
+# Tesseract → tesseract-rs Transcode — Master Plan v1
+
+> **Type:** plan family root (forward marker / co-architecture). Plants the
+>   sub-plans; owns the deliverable index, the dependency DAG, and the
+>   skip-list rationale.
+> **Status:** PLANTED 2026-06-15 — design only, no code. Layout/contracts proposed
+>   against the post-#496 front.
+> **Front:** post-#496. `canonical_node.rs` carries `NodeGuid` / `EdgeBlock` /
+>   `EdgeCodecFlavor` / `NodeRow` / `ValueTenant` / `ValueSchema` / `NodeRowPacket`;
+>   `class_view.rs` carries `ClassView` (`edge_codec_flavor`, `value_schema`) +
+>   `FieldMask`. These are the integration surface, not a thing to re-derive.
+> **Canon anchors (all sub-plans must match, never restate):**
+>   - Operator GUID canon — `OGAR/CLAUDE.md` P0 (`classid·HEEL·HIP·TWIG·family·identity`, RFC-waived).
+>   - Doc-lock — `lance-graph/CLAUDE.md` (commit `4ea6ac9`): SoA node, zero-fallback ladder, reserve-don't-reclaim.
+>   - Code form — `crates/lance-graph-contract/src/canonical_node.rs` (#489+#490+#492+#494+#496).
+>   - Three-tier model — `docs/architecture/soa-three-tier-model.md` (zero-copy, no emission).
+>   - Supersession map — `.claude/plans/soa-migration-diff-resolution-2026-06-13.md`.
+> **Skip-by-rule:** anything behind the migration front is residue, not authority.
+>   This plan does NOT conform to deprecated `BindSpace` row geometry.
+
+---
+
+## 0. Intent (one paragraph)
+
+Stand up `tesseract-rs` as a **pure-Rust, bit-reproducible OCR substrate** that (a)
+runs *existing* modern Tesseract `.traineddata` LSTM models with byte-identical
+output, (b) replaces Tesseract's accreted C++ layout heuristics with the neural
+`ocrs`/`rten` path already forked into the account, (c) uses a clang-AST → Rust
+**codegen harness** (built on the `ruff` AST/codegen crates) for the mechanical
+leaf modules instead of hand-porting C++, and (d) emits every recognized token as
+a **canonical SoA `NodeRow`** so OCR output lands directly in the lance-graph
+cognitive substrate (OGAR class, HHTL address, `ValueSchema` over `ValueTenant`s,
+`EdgeCodecFlavor` adjacency, DeepNSM/CAM-PQ correction) with no boundary tax.
+
+The whole effort doubles as the **bit-reproducibility regression harness** the SoA
+migration needs: Tesseract C++ (via the `tesseract-rs` FFI fork as oracle) vs the
+Rust port, diffed to the byte on fixed line crops.
+
+## 1. The spine (target data flow)
+
+```
+PDF / image
+  └─► [front-end]      pdfium-render + image + imageproc  ── OR ──  ferrules (layout-aware)
+        └─► [segment]  ocrs detection + layout_analysis   (NEURAL — replaces textord)   ┐
+        └─► [recognize] EITHER  tesseract-rs traineddata→LSTM-on-ndarray→recodebeam      │  engine
+                        OR      ocrs recognition (rten / rten-text CTC)                   │  trait
+                        OR      tract (arbitrary ONNX recognizer)                         ┘
+              └─► tokens + per-token confidence + bbox + top-k candidates
+                    └─► [repair] DeepNSM (vocabulary·codebook·parser·encoder·similarity)
+                          + CAM/PQ nearest-valid-token (helix / TurbovecResidue / CAKES)
+                          └─► [emit] canonical NodeRow  (classid=OCR class, HHTL address,
+                                ValueSchema OCR preset, EdgeCodecFlavor adjacency)
+                                └─► NodeRowPacket → SoaEnvelope → Lance (kv-lance)
+```
+
+## 2. Three engine paths behind ONE trait (the central decision)
+
+`tesseract-rs` exposes a single `Recognizer` trait with three pure-Rust backends plus
+an opt-in `Ort` escape hatch; the engine is a runtime/feature choice, never a pipeline fork:
+
+| Backend | Source | Use when | C/C++ deps |
+|---|---|---|---|
+| `TranscodeLstm` | this plan family (traineddata→ndarray→recodebeam) | need an *existing* `.traineddata` language model, bit-identical to Tesseract | **none** |
+| `Ocrs` | `AdaWorldAPI/ocrs` + `rten` | default modern path; detection+layout+recognition, pure Rust | **none** |
+| `Tract` | `AdaWorldAPI/tract` | arbitrary/custom ONNX recognizer (PaddleOCR export, custom-trained) | **none** |
+| `Ort` (escape) | `AdaWorldAPI/ort` | GPU needed, or an op `tract` rejects | ONNX Runtime (C++) |
+
+**Default = `Ocrs`.** `TranscodeLstm` is the accuracy/compat fallback AND the
+oracle's reference. `Ort` is opt-in only (it's the lone C++ dependency).
+
+## 3. What we SKIP and why (do not transcode)
+
+| Tesseract component | Disposition | Reason |
+|---|---|---|
+| Legacy pattern matcher (`classify/`, integer matcher, adaptive classifier) | **DROP** | Deprecated in Tesseract 4/5; LSTM supersedes it. |
+| Leptonica (C, ~250k LOC) | **REPLACE, not port** | Reimplement only the ~dozen ops Tesseract calls (Otsu/Sauvola, deskew, despeckle, CC-label, scale) on `image`/`imageproc`. `AdaWorldAPI/leptonica` fork kept only as oracle build dep. |
+| `textord/` + `ccstruct/` layout analysis (tab-stops, columns, `ColPartition`, baselines, reading order) | **REPLACE with `ocrs`** | Tens of thousands of LOC of intrusive linked-list / cyclic-mutable-graph heuristics. `ruff`-style syntax codegen cannot redesign ownership; `ocrs` neural layout is better anyway. See `tesseract-rs-neural-layout-ocrs-v1`. |
+| CMake / `BOOL_VAR`/`INT_VAR` global param registry / renderer plumbing | **DROP** | Cargo + a config struct + the `ClassView` registry replace it. |
+| C# (NuGet `Tesseract`, app wrappers) | **N/A** | No C# in OCR core; C# only in app wrappers — out of scope. |
+
+## 4. What we transcode faithfully (bit-exact tier)
+
+Routed through the AST-DLL codegen harness where mechanical, hand-ported where
+numeric exactness is subtle. See sub-plans for module-by-module assignment.
+
+- `.traineddata` container (`ccutil/tessdatamanager`), `unicharset`, recoder
+  (`lstm/unicharcompress`) → **D-OCR-1x** (`...-traineddata-ndarray-v1`).
+- LSTM forward (`lstm/{network,lstm,fullyconnected,convolve,weightmatrix}`) +
+  `recodebeam` decoder + DAWG dict (`dict/{dawg,trie,permdawg}`) → **D-OCR-2x**
+  (`...-lstm-recodebeam-v1`).
+
+## 5. Deliverable index (D-OCR-NN) and DAG
+
+| ID | Deliverable | Sub-plan | Depends on |
+|---|---|---|---|
+| D-OCR-00 | This master + skip-list + engine trait | (this) | — |
+| D-OCR-10 | `.traineddata`/unicharset/recoder reader | traineddata-ndarray | D-OCR-40 |
+| D-OCR-11 | LSTM weight hydration onto `ndarray::hpc` | traineddata-ndarray | D-OCR-10 |
+| D-OCR-20 | LSTM forward pass on ndarray | lstm-recodebeam | D-OCR-11 |
+| D-OCR-21 | `recodebeam` decoder + DAWG dict | lstm-recodebeam | D-OCR-20 |
+| D-OCR-22 | int8-SIMD numeric-exactness conformance | lstm-recodebeam | D-OCR-20 |
+| D-OCR-30 | `ocrs`/`rten` neural detect+layout+recognize backend | neural-layout-ocrs | — |
+| D-OCR-31 | `tract` arbitrary-ONNX backend + `ort` escape | neural-layout-ocrs | D-OCR-30 |
+| D-OCR-40 | clang C++ AST → IR ("AST DLL") | ast-dll-codegen | — |
+| D-OCR-41 | IR → Rust emission via `ruff` codegen crates | ast-dll-codegen | D-OCR-40 |
+| D-OCR-42 | diff-gate: emitted Rust vs FFI oracle | ast-dll-codegen | D-OCR-41 |
+| D-OCR-50 | OCR class + HHTL address scheme | ocr-canonical-soa-integration | canon |
+| D-OCR-51 | OCR `ValueSchema` preset over existing tenants | ocr-canonical-soa-integration | D-OCR-50 |
+| D-OCR-52 | DeepNSM + CAM/PQ token repair wiring | ocr-canonical-soa-integration | D-OCR-51 |
+| D-OCR-53 | bit-reproducibility harness (oracle diff) | ocr-canonical-soa-integration | D-OCR-21, D-OCR-30 |
+
+DAG critical path: **D-OCR-40 → 10 → 11 → 20 → 21 → 53** (the transcode oracle),
+with **D-OCR-30** (ocrs default) parallel and independent — ships first as the POC
+recognizer while the transcode tier matures behind it.
+
+## 6. Success criteria
+
+1. `Ocrs` backend produces tokens → canonical `NodeRow`s persisted via `NodeRowPacket` (POC gate).
+2. `TranscodeLstm` reproduces Tesseract C++ output **byte-identical** on ≥ 10k fixed line crops (oracle gate).
+3. Zero new crates.io deps (forks/path/`ndarray`/`lance`-family only); `Ort` is the sole opt-in C++ dep, feature-gated off by default.
+4. No layout-heuristic C++ transcoded (skip-list honored).
+5. OCR nodes carry no bespoke row geometry — they ride `ValueSchema`/`ValueTenant`/`EdgeCodecFlavor` unchanged.
+
+## 7. Open co-architecture decisions (resolve with operator)
+
+- **OD-1:** Does an OCR token get a *dedicated* `ValueTenant` (bbox + per-char
+  confidence + top-k recodebeam candidates), or ride `Meta`+`Fingerprint`+
+  `TurbovecResidue`? Adding a tenant is canon-significant (tenants never move/reuse).
+  POC rides existing; dedicated tenant deferred. (See ocr-soa-integration §OD-1.)
+- **OD-2:** Is `ocrs`/`rten` the permanent default with `TranscodeLstm` as compat
+  fallback, or is the transcode tier the eventual primary? Affects how much codegen
+  effort D-OCR-4x warrants.
+- **OD-3:** Does the AST-DLL frontend use libclang directly, or a clang→JSON dump
+  consumed by a Rust IR? (See ast-dll-codegen §OD-3.)

From b021d2c33e8e97ac6312e4575a1942193a900342 Mon Sep 17 00:00:00 2001
From: AdaWorldAPI <jan@exo.red>
Date: Mon, 15 Jun 2026 20:01:39 +0200
Subject: [PATCH 07/25] docs(plan): retire
 tesseract-rs-traineddata-ndarray-v1.md (v2 supersession)

---
 .../tesseract-rs-traineddata-ndarray-v1.md    | 86 -------------------
 1 file changed, 86 deletions(-)
 delete mode 100644 .claude/plans/tesseract-rs-traineddata-ndarray-v1.md

diff --git a/.claude/plans/tesseract-rs-traineddata-ndarray-v1.md b/.claude/plans/tesseract-rs-traineddata-ndarray-v1.md
deleted file mode 100644
index 8619f89f..00000000
--- a/.claude/plans/tesseract-rs-traineddata-ndarray-v1.md
+++ /dev/null
@@ -1,86 +0,0 @@
-# tesseract-rs — traineddata → ndarray Model Loader v1
-
-> **Type:** plan (sub-plan of `tesseract-rs-transcode-master-v1`). Deliverables D-OCR-10/11.
-> **Status:** PLANTED 2026-06-15 — design only.
-> **Front:** post-#496. Reuses `ndarray::hpc` weight-loading pattern (`src/hpc/{gguf.rs,gguf_indexer.rs,safetensors.rs,models/safetensors.rs}`).
-> **Canon anchors:** master plan §4; `ndarray` hpc loaders; `lance-graph-contract` LE column contract.
-> **Skip-by-rule:** legacy classifier `.traineddata` components (templates, adaptive) are NOT loaded.
-
----
-
-## 0. Intent
-
-Make a modern Tesseract `.traineddata` file loadable in pure Rust as a hydrated
-weight set + symbol tables, using the *same* discipline as the existing GGUF /
-safetensors loaders in `ndarray::hpc`. A `.traineddata` is, for our purposes, just
-another model container: parse the directory of components, pull the LSTM weights,
-and hydrate them into `ndarray` arrays addressable by the forward pass (D-OCR-20).
-
-## 1. `.traineddata` is a TAR-like component bundle
-
-Modern (LSTM) `.traineddata` holds, among ~15 components, the ones we need:
-
-| Component | What it is | We need it for |
-|---|---|---|
-| `lstm` | the recognizer network (VGSL spec + weights) | D-OCR-11 weight hydration |
-| `lstm-unicharset` | LSTM-specific unicharset | symbol ↔ class index |
-| `lstm-recoder` | `UnicharCompress` recoder (codepoint → recode codes) | recodebeam (D-OCR-21) |
-| `unicharset` | full unicharset (props, scripts, ranges) | char props / number-grammar |
-| `*.lstm-*-dawg` | dictionary DAWGs (word/number/punc/system) | dict correction (D-OCR-21) |
-| `version`, `config` | metadata | provenance, default params |
-
-Components we **ignore** (legacy): `inttemp`, `pffmtable`, `normproto`,
-`shapetable`, `*.params-model` (adaptive). The loader skips them by name.
-
-## 2. Loader layout (`tesseract-rs/src/traineddata/`)
-
-```
-traineddata/
-  container.rs     // TessdataManager: offset table parse → component byte slices  (CODEGEN: D-OCR-40)
-  unicharset.rs    // Unicharset parse: id↔utf8, char props, script, ranges          (CODEGEN)
-  recoder.rs       // UnicharCompress: codepoint↔recode-code maps                    (CODEGEN)
-  vgsl.rs          // VGSL network-spec parser → layer graph                          (hand-port: small, gnarly grammar)
-  weights.rs       // weight blobs → ndarray hydration (calls ndarray::hpc pattern)   (hand)
-  dawg.rs          // DAWG/Trie node arrays (squished + unsquished)                   (CODEGEN)
-  mod.rs           // TrainedData { net, unicharset, recoder, dicts }
-```
-
-## 3. Weight hydration — sibling of the GGUF path (D-OCR-11)
-
-The forward pass (D-OCR-20) consumes `ndarray` views, exactly as the GGUF/Qwen
-work does. `weights.rs` mirrors `ndarray::hpc::gguf`:
-
-- Add a `ModelSource::TrainedData` alongside the existing GGUF/safetensors sources
-  in `ndarray::hpc` so the LSTM weights flow through the same hydration/indexer.
-- Each Tesseract `WeightMatrix` (int8 quantized + float scale, or float) becomes
-  an `ndarray` 2-D array plus a per-row scale vector — preserve the quantization
-  exactly (see D-OCR-22; bit-exactness depends on it).
-- No transpose/normalization on load: store as Tesseract stores, defer layout to
-  the forward kernels so the numeric path is auditable against the C++.
-
-**Iron rule (inherited from the SoA envelope work):** the loader knows the LE byte
-contract for every weight blob; it never guesses. The component byte slices carry
-their own descriptor; hydration is a typed carve, not a reinterpretation.
-
-## 4. Codegen vs hand-port assignment
-
-| Module | Route | Why |
-|---|---|---|
-| `container.rs`, `unicharset.rs`, `recoder.rs`, `dawg.rs` | **AST-DLL codegen (D-OCR-4x)** | mechanical struct/table walks; faithful 1:1 from C++ AST |
-| `vgsl.rs` | hand-port | tiny but a bespoke mini-grammar; codegen overkill |
-| `weights.rs` | hand | bridges into `ndarray::hpc`; integration glue, not transcode |
-
-## 5. Deliverables
-
-- **D-OCR-10:** `TrainedData` loads a real `eng.traineddata` (4/5) → unicharset,
-  recoder, DAWGs, and the VGSL net spec, with legacy components skipped. Test:
-  round-trip the unicharset id↔utf8 against a C++ `combine_tessdata -u` dump.
-- **D-OCR-11:** LSTM weights hydrated into `ndarray` via a new
-  `ndarray::hpc::ModelSource::TrainedData`; shapes + quantization scales match a
-  C++ dump of the same network. Test: per-matrix shape + int8 scale equality.
-
-## 6. Open decisions
-
-- **OD-10a:** support both float and int8 `.traineddata`, or int8-only for v1?
-  (int8 is the common `*_best`/`*_fast` shipping form and the harder exactness case.)
-- **OD-10b:** vendor a pinned `eng.traineddata` as a test fixture, or fetch in CI?

From 5d6b51c5d67607e4ab5918190e2f9bcc5179d9e8 Mon Sep 17 00:00:00 2001
From: AdaWorldAPI <jan@exo.red>
Date: Mon, 15 Jun 2026 20:01:39 +0200
Subject: [PATCH 08/25] docs(plan): retire tesseract-rs-lstm-recodebeam-v1.md
 (v2 supersession)

---
 .../plans/tesseract-rs-lstm-recodebeam-v1.md  | 79 -------------------
 1 file changed, 79 deletions(-)
 delete mode 100644 .claude/plans/tesseract-rs-lstm-recodebeam-v1.md

diff --git a/.claude/plans/tesseract-rs-lstm-recodebeam-v1.md b/.claude/plans/tesseract-rs-lstm-recodebeam-v1.md
deleted file mode 100644
index be4cd8dc..00000000
--- a/.claude/plans/tesseract-rs-lstm-recodebeam-v1.md
+++ /dev/null
@@ -1,79 +0,0 @@
-# tesseract-rs — LSTM Forward + recodebeam Decoder v1
-
-> **Type:** plan (sub-plan). Deliverables D-OCR-20/21/22.
-> **Status:** PLANTED 2026-06-15 — design only.
-> **Front:** post-#496. Forward pass targets `ndarray` (SIMD/BLAS/CLAM provider). Oracle = `tesseract-rs` FFI fork.
-> **Canon anchors:** master §4; `ndarray` SIMD kernels; bit-reproducibility doctrine (DeepNSM "bit-reproducible", envelope version-stamp).
-> **Skip-by-rule:** no legacy matcher; no layout. Input is a line crop, output is text + per-step posteriors.
-
----
-
-## 0. Intent
-
-Run the hydrated LSTM (D-OCR-11) forward on `ndarray` to produce per-timestep
-class posteriors, then decode them with a faithful `recodebeam` (dictionary-aware,
-CTC-style) to text — **byte-identical to Tesseract** on fixed line crops. This is
-the tier that makes "1:1 Tesseract" provable; everything else is plumbing.
-
-## 1. Forward pass (`tesseract-rs/src/lstm/`, on ndarray) — D-OCR-20
-
-Faithful transcode of `lstm/` numerics. Each maps to ndarray ops:
-
-| Tesseract unit | ndarray realization | Exactness note |
-|---|---|---|
-| `WeightMatrix::MatrixDotVector` (int8) | int8 GEMV via ndarray SIMD kernel | **accumulation order + rounding must match** (D-OCR-22) |
-| `FullyConnected` | matmul + bias + activation LUT | activation table must be the same fixed-point LUT |
-| `LSTM` cell (gates i/f/o/g, peephole) | elementwise on ndarray slices | sigmoid/tanh LUTs identical to C++ |
-| `Convolve` / `Maxpool` | im2col + GEMM / window-max | stride/pad identical |
-| `Softmax` / `LogSoftmax` | row softmax | only at the output; feeds the beam |
-
-Float path is straightforward. **The int8 path is where silent drift lives** — it
-is the whole point of D-OCR-22.
-
-## 2. recodebeam decoder (`tesseract-rs/src/recodebeam.rs`) — D-OCR-21
-
-Hand-port (NOT codegen): tie-breaking, normalization, and dawg interaction are
-under-documented and behaviorally subtle.
-
-- Beam over the **recoder** codes (not raw unichars): the `RecodeBeamSearch`
-  maintains dawg-constrained and unconstrained beams; final path picks per
-  Tesseract's certainty/rating rule.
-- DAWG dictionary (`dict/{dawg,trie,permdawg}`) — **codegen-amenable** node-array
-  walks; the *interaction* with the beam is hand-ported.
-- Output: best text + per-token rating/certainty → becomes per-token confidence at
-  the emit stage (master §1).
-
-## 3. int8-SIMD numeric exactness conformance — D-OCR-22
-
-The conformance contract that earns "1:1":
-
-1. Pin the int8 GEMV accumulation order to Tesseract's (block/tile order matters).
-2. Match the fixed-point rounding mode of `IntSimdMatrix` (AVX2/512/NEON variants
-   reduce in a defined order — replicate it, do not "improve" it).
-3. Identical activation LUTs (sigmoid/tanh/softmax) — copy the tables, not the
-   formulae.
-4. Conformance harness: feed N line crops, compare per-timestep argmax AND the
-   full posterior (within 0 ULP for int8) against an FFI dump from the oracle.
-
-## 4. Ground-truth oracle
-
-The `AdaWorldAPI/tesseract-rs` FFI fork (thin bindings, `src/{lib,page_seg_mode}.rs`)
-is built **only** as the oracle: it runs real `libtesseract` to dump (a) per-matrix
-weights, (b) per-timestep posteriors, (c) final decoded text for the same crops.
-The Rust port is diffed against these. The oracle is a dev/test dependency, never a
-runtime path, and the lone place the Leptonica C fork is compiled.
-
-## 5. Deliverables
-
-- **D-OCR-20:** forward pass on ndarray reproduces C++ per-timestep posteriors
-  (float path 1:1; int8 path within the D-OCR-22 contract) on a 1k-crop set.
-- **D-OCR-21:** `recodebeam` + DAWG reproduces C++ decoded text byte-identical on
-  the same set.
-- **D-OCR-22:** int8 conformance harness green on ≥ 10k crops across 2+ languages.
-
-## 6. Open decisions
-
-- **OD-20a:** target one SIMD width first (AVX2) for exactness, then NEON/AVX-512;
-  or define the scalar reference as canonical and treat SIMD as "must equal scalar"?
-- **OD-21a:** support Tesseract's `lstm_choice_mode` (top-k per timestep) now — it
-  feeds the OCR `ValueTenant` top-k candidates (ocr-soa-integration OD-1) — or later?

From 7f85b58d9bb87fd744123d19b8b8b030af40deae Mon Sep 17 00:00:00 2001
From: AdaWorldAPI <jan@exo.red>
Date: Mon, 15 Jun 2026 20:01:40 +0200
Subject: [PATCH 09/25] docs(plan): retire
 tesseract-rs-neural-layout-ocrs-v1.md (v2 supersession)

---
 .../tesseract-rs-neural-layout-ocrs-v1.md     | 75 -------------------
 1 file changed, 75 deletions(-)
 delete mode 100644 .claude/plans/tesseract-rs-neural-layout-ocrs-v1.md

diff --git a/.claude/plans/tesseract-rs-neural-layout-ocrs-v1.md b/.claude/plans/tesseract-rs-neural-layout-ocrs-v1.md
deleted file mode 100644
index 376e15cf..00000000
--- a/.claude/plans/tesseract-rs-neural-layout-ocrs-v1.md
+++ /dev/null
@@ -1,75 +0,0 @@
-# tesseract-rs — Neural Layout + Recognition via ocrs/rten v1
-
-> **Type:** plan (sub-plan). Deliverables D-OCR-30/31.
-> **Status:** PLANTED 2026-06-15 — design only. Ships FIRST (POC default recognizer).
-> **Front:** post-#496. Uses `AdaWorldAPI/{ocrs,rten,tract,ort}` forks as-is.
-> **Canon anchors:** master §2 (engine trait), §3 (skip layout). Pure-Rust posture (no C/C++ deps except opt-in `ort`).
-> **Skip-by-rule:** this plan is *why* Tesseract `textord` is never transcoded — neural layout replaces it.
-
----
-
-## 0. Intent
-
-Provide the default, modern recognition path with **zero C/C++ dependencies** by
-wiring the already-forked neural OCR stack behind the `Recognizer` trait. This is
-both the POC recognizer (ships before the transcode tier) and the permanent
-replacement for Tesseract's layout heuristics.
-
-## 1. The forked stack (confirmed surfaces)
-
-- **`ocrs`** (`ocrs/ocrs/src/`): `detection.rs` (text detection), `layout_analysis.rs`
-  + `layout_analysis/` (reading order / columns), `recognition.rs` (line recognizer),
-  `model.rs`, `preprocess.rs`, `text_items.rs`. Full detect→layout→recognize.
-- **`rten`** (`rten-*` crates): `rten-convert` (ONNX→`.rten`), `rten-onnx`,
-  `rten-model-file` (`.rten`), `rten-text` (CTC/text decode), `rten-simd` /
-  `rten-gemm` / `rten-vecmath` (kernels), `rten-imageproc`. ocrs runs on rten.
-- **`rten-ndarray-demo`**: reference for image-rs + ndarray + rten integration
-  (`mobilenet.rten` present) — the wiring template.
-- **`tract`** (pure-Rust general ONNX/TF) — arbitrary model escape hatch.
-- **`ort`** (ONNX Runtime FFI) — GPU / exotic-op escape, **feature-gated, off by default**.
-
-## 2. Backend wiring — D-OCR-30
-
-`tesseract-rs/src/engine/ocrs_backend.rs` implements `Recognizer`:
-
-```
-preprocess (image/imageproc) ─► ocrs::detection ─► ocrs::layout_analysis (reading order)
-   ─► per-line crops ─► ocrs::recognition (rten / rten-text CTC) ─► tokens + confidence + bbox
-```
-
-- Models: convert the ocrs detection + recognition ONNX to `.rten` via `rten-convert`
-  once; vendor the `.rten` blobs (or fetch in CI). Confirm the converter + current
-  model assets are present in the fork before relying on them (D-OCR-30 acceptance).
-- Output normalized to the **same token struct** the transcode tier emits, so the
-  emit stage (ocr-soa-integration) is engine-agnostic.
-
-## 3. Layout: why neural, not ported — D-OCR-30 (rationale)
-
-Tesseract's `textord/`+`ccstruct/` layout is intrusive linked-list + cyclic-mutable
-blob-graph heuristics (`BLOBNBOX`, `ColPartition`, tab-stop finder). A syntax-directed
-transcode (ruff/AST codegen rewrites *syntax*, not *ownership*) turns it into
-`Rc<RefCell<>>` sludge or `unsafe`. `ocrs` already does detection+layout+reading-order
-as a neural model — strictly better and pure-Rust. **Decision: layout is never
-transcoded; it is `ocrs`.** Tesseract's recognizer (the LSTM) is the only thing worth
-the bit-exact transcode (D-OCR-2x), and even then only as a compat/accuracy fallback.
-
-## 4. tract + ort escape hatches — D-OCR-31
-
-- `tract_backend.rs`: load an arbitrary ONNX OCR model (custom-trained, PaddleOCR
-  export). Pure Rust, CPU. The "I have a specific recognizer" path.
-- `ort_backend.rs`: GPU (CUDA/CoreML/TensorRT) or an op `tract` rejects. Feature
-  `ort-gpu`, off by default; documented as the sole C++ dependency.
-
-## 5. Deliverables
-
-- **D-OCR-30:** `Ocrs` backend end-to-end on a scanned PDF page → tokens+bbox+conf;
-  models present/converted; image→`.rten` path runs with no C deps.
-- **D-OCR-31:** `Tract` backend loads + runs one custom ONNX recognizer; `Ort`
-  backend compiles only under `--features ort-gpu`.
-
-## 6. Open decisions
-
-- **OD-30a:** front-end = hand-rolled `pdfium-render`+`image`+`imageproc`, or adopt
-  `AdaWorldAPI/ferrules` (layout-aware Rust document parser) as the PDF front-end?
-- **OD-30b:** keep a third "OCR-free" branch (`colpali`/`Qwen3-VL-Embedding`,
-  document-image → retrieval directly) as a separate plan, or note-and-defer?

From a324864a9fd1c930a36b53c42a426bcc838730bb Mon Sep 17 00:00:00 2001
From: AdaWorldAPI <jan@exo.red>
Date: Mon, 15 Jun 2026 20:01:41 +0200
Subject: [PATCH 10/25] =?UTF-8?q?docs(plan):=20v2=20=E2=80=94=20ocr-canoni?=
 =?UTF-8?q?cal-soa-integration-v1.md?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit


From 357634e73697be9343bc3ecb648e544a0b0e0828 Mon Sep 17 00:00:00 2001
From: AdaWorldAPI <jan@exo.red>
Date: Mon, 15 Jun 2026 20:01:42 +0200
Subject: [PATCH 11/25] =?UTF-8?q?docs(plan):=20v2=20=E2=80=94=20tesseract-?=
 =?UTF-8?q?rs-ast-dll-codegen-v1.md?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .claude/plans/tesseract-rs-ast-dll-codegen-v1.md | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/.claude/plans/tesseract-rs-ast-dll-codegen-v1.md b/.claude/plans/tesseract-rs-ast-dll-codegen-v1.md
index ebe1d21b..22adaa05 100644
--- a/.claude/plans/tesseract-rs-ast-dll-codegen-v1.md
+++ b/.claude/plans/tesseract-rs-ast-dll-codegen-v1.md
@@ -1,10 +1,10 @@
 # tesseract-rs — AST-DLL C++→Rust Codegen Harness v1
 
 > **Type:** plan (sub-plan). Deliverables D-OCR-40/41/42. The transcode *mechanism*.
-> **Status:** PLANTED 2026-06-15 — design only.
+> **Status:** PLANTED 2026-06-15 v2 — layout IS in scope (1:1 raw-pointer), not skipped.
 > **Front:** post-#496. Uses `AdaWorldAPI/ruff` AST/codegen crates as the Rust-emission engine.
 > **Canon anchors:** master §4. Deterministic + diff-gated (bit-reproducibility doctrine).
-> **Skip-by-rule:** only leaf/mechanical modules are codegen targets; ownership-heavy code is hand-ported or replaced.
+> **Skip-by-rule:** only leaf/mechanical modules are codegen targets; ownership-heavy code is transcribed faithfully as raw-pointer Rust (1:1), with safe-refactor deferred to a later oracle-gated pass.
 
 ---
 
@@ -69,7 +69,8 @@ the harness is re-runnable to prove the commit equals the generator output.
 |---|---|
 | `tessdatamanager`, `unicharset`, `unicharcompress` (recoder), `dawg`/`trie` node arrays, `weightmatrix` struct/quant walks | **CODEGEN (D-OCR-41)** |
 | `recodebeam` (beam + dawg interaction), int8 GEMV rounding | **HAND-PORT** (numeric/behavioral subtlety) |
-| `textord`/`ccstruct` layout, Leptonica | **REPLACE** (ocrs / minimal imageproc) — never enters the harness |
+| `textord`/`ccstruct` layout | **CODEGEN → faithful raw-pointer Rust (D-OCR-30)** — intrusive ELIST/CLIST transcribed 1:1, NOT replaced |
+| Leptonica (~dozen ops only) | **HAND-PORT** to `image`/`imageproc` (D-OCR-31) |
 
 ## 6. Deliverables
 

From 2dc636b0e3539291858bae3ea3237fc6ccf4ccd5 Mon Sep 17 00:00:00 2001
From: AdaWorldAPI <jan@exo.red>
Date: Mon, 15 Jun 2026 20:01:43 +0200
Subject: [PATCH 12/25] =?UTF-8?q?docs(plan):=20v2=20=E2=80=94=20tesseract-?=
 =?UTF-8?q?rs-layout-transcode-v1.md?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .../plans/tesseract-rs-layout-transcode-v1.md | 39 +++++++++++++++++++
 1 file changed, 39 insertions(+)
 create mode 100644 .claude/plans/tesseract-rs-layout-transcode-v1.md

diff --git a/.claude/plans/tesseract-rs-layout-transcode-v1.md b/.claude/plans/tesseract-rs-layout-transcode-v1.md
new file mode 100644
index 00000000..3890ff49
--- /dev/null
+++ b/.claude/plans/tesseract-rs-layout-transcode-v1.md
@@ -0,0 +1,39 @@
+# tesseract-rs — Layout (textord/ccstruct) 1:1 Transcode v1
+
+> **Type:** plan (sub-plan, v2 — the part v1 wrongly skipped). Deliverables D-OCR-30/31.
+> **Status:** PLANTED 2026-06-15. FAITHFUL 1:1, raw-pointer where C++ is intrusive.
+> **Canon:** oracle-gated behavioral parity; safe-refactor is a later, separate, oracle-preserving pass.
+
+---
+
+## 0. Intent
+Reproduce Tesseract's page layout (the tens of thousands of LOC the v1 plan wrongly
+proposed replacing with ocrs) byte-for-byte. This is the bulk of the "free 200k LOC":
+mechanical, codegen-amenable in structure, made tractable by accepting raw-pointer Rust.
+
+## 1. Faithful-transcription ruling
+Tesseract layout is intrusive `ELIST`/`CLIST` doubly-linked lists + cyclic mutable
+blob graphs (`BLOBNBOX`, `TO_BLOCK`, `ColPartition`, tab-stop finder, baseline fit,
+reading order). The 1:1 image is **raw-pointer Rust** (`*mut`, intrusive nodes,
+manual lifetimes) — behavior-identical, NOT redesigned. Ownership redesign (arena/
+slotmap/index graphs) is a LATER pass, gated to preserve oracle output. Do NOT
+redesign during transcode; that breaks 1:1 and is the trap.
+
+## 2. Modules (D-OCR-30)
+`ccstruct/{blobbox,coutline,polyblk,...}`, `textord/{tabfind,colfind,colpartition,
+tablefind,baselinefit,textlineprojection,wordseg,...}`. AST-DLL codegen emits the
+struct/method skeletons + intrusive-list ops as raw-pointer Rust; the gnarly
+control flow is reviewed against the C++ AST, diff-gated per function.
+
+## 3. Leptonica ops (D-OCR-31)
+Only the ~dozen ops Tesseract calls: Otsu/Sauvola binarize, projection/Radon
+deskew, despeckle, connected-component label, scale. Hand-port onto `image`/
+`imageproc` with per-op numeric parity vs the Leptonica fork (built as oracle).
+The rest of Leptonica is NOT ported.
+
+## 4. Acceptance (D-OCR-30/31)
+Full-page layout output (block/line/word boxes + reading order) byte-identical to
+libtesseract on a fixed page set; per-op image results bit-equal on fixtures.
+
+## 5. Open
+- **OD-30a:** one intrusive-list helper crate (`tess_elist`) shared across modules, or inline per-module? (Shared reduces unsafe surface.)

From 534a9c53dfd73eebf1a5c800d4457830a0863b68 Mon Sep 17 00:00:00 2001
From: AdaWorldAPI <jan@exo.red>
Date: Mon, 15 Jun 2026 20:01:44 +0200
Subject: [PATCH 13/25] =?UTF-8?q?docs(plan):=20v2=20=E2=80=94=20tesseract-?=
 =?UTF-8?q?rs-recodebeam-transcode-v1.md?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .../tesseract-rs-recodebeam-transcode-v1.md   | 29 +++++++++++++++++++
 1 file changed, 29 insertions(+)
 create mode 100644 .claude/plans/tesseract-rs-recodebeam-transcode-v1.md

diff --git a/.claude/plans/tesseract-rs-recodebeam-transcode-v1.md b/.claude/plans/tesseract-rs-recodebeam-transcode-v1.md
new file mode 100644
index 00000000..2fb8b3df
--- /dev/null
+++ b/.claude/plans/tesseract-rs-recodebeam-transcode-v1.md
@@ -0,0 +1,29 @@
+# tesseract-rs — recodebeam Decoder 1:1 Transcode v1
+
+> **Type:** plan (sub-plan, v2). Deliverable D-OCR-21. Replaces v1 lstm-recodebeam (LSTM now hosted).
+> **Status:** PLANTED 2026-06-15. Decoder transcoded 1:1 over HOSTED posteriors.
+> **Canon:** consumes `[T,C]` from D-OCR-16 (embedanything host); oracle-gated.
+
+---
+
+## 0. Intent
+Transcode only the DECODER. The LSTM forward is hosted (D-OCR-16); recodebeam takes
+its `[T, n_classes]` posteriors and produces text exactly as Tesseract does.
+
+## 1. recodebeam (`tesseract-rs/src/recodebeam.rs`) — hand-port, D-OCR-21
+Beam over recoder codes; dawg-constrained + unconstrained beams; certainty/rating
+tie-break 1:1. DAWG node-array walks = codegen (D-OCR-40); the beam↔dawg interaction
+is hand-ported (behavioral subtlety). Output: text + per-token rating/certainty →
+becomes per-token confidence at emit (integration plan).
+
+## 2. int8 exactness boundary
+Numeric exactness now lives at the HOST boundary (D-OCR-16 posteriors), not in
+transcoded kernels. recodebeam consumes posteriors; if the host reproduces C++
+posteriors to the int8 contract, the decoder's 1:1 transcription yields 1:1 text.
+
+## 3. Acceptance
+recodebeam text byte-identical to libtesseract on ≥10k crops given oracle posteriors
+(isolates decoder correctness from host numeric parity).
+
+## 4. Open
+- **OD-21a:** support `lstm_choice_mode` top-k now (feeds the OCR evidence tenant), or later?

From dbd63862c0894537d2ef6081578a5521d3a5c94c Mon Sep 17 00:00:00 2001
From: AdaWorldAPI <jan@exo.red>
Date: Mon, 15 Jun 2026 20:01:45 +0200
Subject: [PATCH 14/25] =?UTF-8?q?docs(plan):=20v2=20=E2=80=94=20tesseract-?=
 =?UTF-8?q?rs-traineddata-gguf-v1.md?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .../plans/tesseract-rs-traineddata-gguf-v1.md | 36 +++++++++++++++++++
 1 file changed, 36 insertions(+)
 create mode 100644 .claude/plans/tesseract-rs-traineddata-gguf-v1.md

diff --git a/.claude/plans/tesseract-rs-traineddata-gguf-v1.md b/.claude/plans/tesseract-rs-traineddata-gguf-v1.md
new file mode 100644
index 00000000..a9845114
--- /dev/null
+++ b/.claude/plans/tesseract-rs-traineddata-gguf-v1.md
@@ -0,0 +1,36 @@
+# tesseract-rs — traineddata → GGUF → embedanything Host v1
+
+> **Type:** plan (sub-plan, v2). Deliverables D-OCR-10/16. Replaces v1 `traineddata-ndarray`.
+> **Status:** PLANTED 2026-06-15. The LSTM is HOSTED, not transcoded.
+> **Host chain:** `.traineddata` → GGUF → `embedanything` DTO (candle) → `ndarray` AMX; `bgz_tensor` weight store.
+> **Canon:** `.grok/NDARRAY_BGZ_EMBEDANYTHING_INTEGRATION.md`; ndarray::hpc GGUF loader.
+
+---
+
+## 0. Intent
+Make Tesseract's recognizer "just another GGUF model behind embedanything." Parse
+the `.traineddata`, extract the recognizer net + weights, **export GGUF**, and run
+it on the existing inference runbook. No bespoke LSTM kernels; reuse candle's GGUF
+loader + ndarray AMX + bgz_tensor storage.
+
+## 1. Loader (`tesseract-rs/src/traineddata/`) — D-OCR-10
+Parse the modern (LSTM) `.traineddata` components: `lstm` (VGSL net + weights),
+`lstm-unicharset`, `lstm-recoder` (UnicharCompress), `unicharset`, `*.dawg`.
+Skip legacy (`inttemp`/`normproto`/`shapetable`/adaptive). Container/unicharset/
+recoder/dawg parse = **AST-DLL codegen** (D-OCR-40); VGSL net-spec = hand (tiny grammar).
+
+## 2. GGUF export — D-OCR-10
+Walk the VGSL graph → emit a GGUF model file:
+- map Tesseract layers (Conv, LSTM/BiLSTM, FullyConnected, Output/Softmax) to GGUF tensors + arch metadata;
+- preserve int8 quantization + per-row scales exactly (GGUF Q8 / per-tensor scale) so the hosted run can match C++ within the int8 contract;
+- `bgz_tensor` stores the exported tensors (compressed, Lance-native, random-access) per its weight-store role.
+
+## 3. Hosted run — D-OCR-16
+`embedanything::infer_sequence` (the D-OCR-15 extension) loads the GGUF via candle,
+runs CNN+BiLSTM on the ndarray AMX path, returns `[T, n_classes]` posteriors.
+Acceptance: per-timestep posteriors match a libtesseract dump within the int8
+exactness contract (float path 1:1) on a 1k-crop set. recodebeam (D-OCR-21) consumes these.
+
+## 4. Open
+- **OD-10a:** GGUF int8 (`Q8_0`, which scales per 32-element block rather than per row) vs keeping Tesseract's exact int8 layout in a custom GGUF kv? Pick whichever reproduces C++ posteriors to the bit.
+- **OD-10b:** candle's BiLSTM/CTC coverage — confirm candle expresses Tesseract's BiLSTM + softmax exactly, else add the missing op in the candle fork.

From eeb0e7f9f813c35eb9ac0c458a92ec090f26e256 Mon Sep 17 00:00:00 2001
From: AdaWorldAPI <jan@exo.red>
Date: Mon, 15 Jun 2026 20:01:45 +0200
Subject: [PATCH 15/25] =?UTF-8?q?docs(plan):=20v2=20=E2=80=94=20tesseract-?=
 =?UTF-8?q?rs-transcode-master-v1.md?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .../plans/tesseract-rs-transcode-master-v1.md | 195 +++++++-----------
 1 file changed, 76 insertions(+), 119 deletions(-)

diff --git a/.claude/plans/tesseract-rs-transcode-master-v1.md b/.claude/plans/tesseract-rs-transcode-master-v1.md
index 3d77acf9..d265ec6f 100644
--- a/.claude/plans/tesseract-rs-transcode-master-v1.md
+++ b/.claude/plans/tesseract-rs-transcode-master-v1.md
@@ -1,134 +1,91 @@
-# Tesseract → tesseract-rs Transcode — Master Plan v1
-
-> **Type:** plan family root (forward marker / co-architecture). Plants the
->   sub-plans; owns the deliverable index, the dependency DAG, and the
->   skip-list rationale.
-> **Status:** PLANTED 2026-06-15 — design only, no code. Layout/contracts proposed
->   against the post-#496 front.
-> **Front:** post-#496. `canonical_node.rs` carries `NodeGuid` / `EdgeBlock` /
->   `EdgeCodecFlavor` / `NodeRow` / `ValueTenant` / `ValueSchema` / `NodeRowPacket`;
->   `class_view.rs` carries `ClassView` (`edge_codec_flavor`, `value_schema`) +
->   `FieldMask`. These are the integration surface, not a thing to re-derive.
-> **Canon anchors (all sub-plans must match, never restate):**
->   - Operator GUID canon — `OGAR/CLAUDE.md` P0 (`classid·HEEL·HIP·TWIG·family·identity`, RFC-waived).
->   - Doc-lock — `lance-graph/CLAUDE.md` (commit `4ea6ac9`): SoA node, zero-fallback ladder, reserve-don't-reclaim.
->   - Code form — `crates/lance-graph-contract/src/canonical_node.rs` (#489+#490+#492+#494+#496).
->   - Three-tier model — `docs/architecture/soa-three-tier-model.md` (zero-copy, no emission).
->   - Supersession map — `.claude/plans/soa-migration-diff-resolution-2026-06-13.md`.
-> **Skip-by-rule:** anything behind the migration front is residue, not authority.
->   This plan does NOT conform to deprecated `BindSpace` row geometry.
+# Tesseract → tesseract-rs — 1:1 Transcode Master Plan v2
+
+> **Type:** plan family root. SUPERSEDES v1 (which wrongly skipped layout).
+> **Status:** PLANTED 2026-06-15 v2 — design locked. 1:1 behavioral transcode of ALL
+>   of Tesseract; the LSTM forward is the ONLY swapped component.
+> **Front:** post-#496. Hosts: `embedanything` DTO (GGUF→candle→ndarray-AMX, per
+>   `.grok/NDARRAY_BGZ_EMBEDANYTHING_INTEGRATION.md`); `bgz_tensor` weight store.
+> **Canon:** OGAR/CLAUDE.md GUID P0; lance-graph/CLAUDE.md SoA node; canonical_node.rs.
 
 ---
 
-## 0. Intent (one paragraph)
+## 0. The whole decision in one line
+
+Transcode **every** Tesseract module 1:1 for behavioral parity (validated against
+the `tesseract-rs` FFI oracle), EXCEPT the LSTM recognizer forward pass, which is
+**hosted** on the existing runbook: recognizer weights → GGUF → `embedanything`
+DTO (candle backend) → `ndarray` AMX path → per-timestep posteriors. Everything
+else — container, unicharset, recoder, **textord/ccstruct layout**, recodebeam,
+DAWG dict, the minimal Leptonica ops Tesseract calls — is faithfully ported.
 
-Stand up `tesseract-rs` as a **pure-Rust, bit-reproducible OCR substrate** that (a)
-runs *existing* modern Tesseract `.traineddata` LSTM models with byte-identical
-output, (b) replaces Tesseract's accreted C++ layout heuristics with the neural
-`ocrs`/`rten` path already forked into the account, (c) uses a clang-AST → Rust
-**codegen harness** (built on the `ruff` AST/codegen crates) for the mechanical
-leaf modules instead of hand-porting C++, and (d) emits every recognized token as
-a **canonical SoA `NodeRow`** so OCR output lands directly in the lance-graph
-cognitive substrate (OGAR class, HHTL address, `ValueSchema` over `ValueTenant`s,
-`EdgeCodecFlavor` adjacency, DeepNSM/CAM-PQ correction) with no boundary tax.
+## 1. What is 1:1 transcoded (≈200k LOC, mechanically)
 
-The whole effort doubles as the **bit-reproducibility regression harness** the SoA
-migration needs: Tesseract C++ (via the `tesseract-rs` FFI fork as oracle) vs the
-Rust port, diffed to the byte on fixed line crops.
+| Tesseract area | Route | 1:1 fidelity rule |
+|---|---|---|
+| `ccutil/tessdatamanager`, `unicharset`, `unicharcompress` (recoder) | AST-DLL codegen | byte-faithful tables |
+| `dict/{dawg,trie,permdawg}` | AST-DLL codegen | node-array walks 1:1 |
+| **`textord/` + `ccstruct/` (layout, tab-stops, ColPartition, reading order)** | AST-DLL codegen → **faithful raw-pointer/unsafe Rust** | ELIST/CLIST + cyclic blob graphs transcribed as raw-pointer intrusive lists; behavior-identical; safe-refactor is a LATER behavior-preserving pass, never a 1:1 deviation |
+| `recodebeam` (beam + DAWG interaction) | hand-port | tie-break/normalization 1:1 |
+| Leptonica ops Tesseract actually calls (Otsu/Sauvola, deskew, despeckle, CC-label, scale) | hand-port onto `image`/`imageproc` | numeric parity per-op |
+| `lstm/{network,lstm,fullyconnected,convolve,weightmatrix}` | **NOT transcoded — HOSTED** | see §2 |
 
-## 1. The spine (target data flow)
+**The unsafe-is-fine ruling:** a true 1:1 image of intrusive-pointer C++ is
+raw-pointer Rust. We accept `unsafe` as the faithful transcription; correctness is
+proven by the oracle diff, not by safe-Rust aesthetics. Refactor to arena/index
+graphs is a separate, oracle-gated step AFTER 1:1 is green. This is what makes the
+200k LOC mechanical (codegen + faithful transcription) instead of an ownership redesign.
+
+## 2. The ONE swap — LSTM hosted, not ported
 
 ```
-PDF / image
-  └─► [front-end]      pdfium-render + image + imageproc  ── OR ──  ferrules (layout-aware)
-        └─► [segment]  ocrs detection + layout_analysis   (NEURAL — replaces textord)   ┐
-        └─► [recognize] EITHER  tesseract-rs traineddata→LSTM-on-ndarray→recodebeam      │  engine
-                        OR      ocrs recognition (rten / rten-text CTC)                   │  trait
-                        OR      tract (arbitrary ONNX recognizer)                         ┘
-              └─► tokens + per-token confidence + bbox + top-k candidates
-                    └─► [repair] DeepNSM (vocabulary·codebook·parser·encoder·similarity)
-                          + CAM/PQ nearest-valid-token (helix / TurbovecResidue / CAKES)
-                          └─► [emit] canonical NodeRow  (classid=OCR class, HHTL address,
-                                ValueSchema OCR preset, EdgeCodecFlavor adjacency)
-                                └─► NodeRowPacket → SoaEnvelope → Lance (kv-lance)
+.traineddata LSTM weights ──► GGUF export ──► embedanything DTO (candle) ──► ndarray AMX
+   ──► per-timestep posteriors ──► (transcoded) recodebeam + DAWG ──► text + confidence
 ```
 
-## 2. Three engine paths behind ONE trait (the central decision)
-
-`tesseract-rs` exposes a single `Recognizer` trait with three backends; the engine
-is a runtime/feature choice, never a fork of the pipeline:
+- The `.traineddata` loader (D-OCR-10) extracts the recognizer net + weights and
+  **exports GGUF**, not raw ndarray hydration — so it enters the existing runbook
+  unchanged. `bgz_tensor` stores/streams the weight tensors (its §2.3 use-case).
+- `embedanything::infer_sequence` (the D-OCR-15 extension, §3) runs the CNN+BiLSTM. recodebeam (transcoded) decodes the
+  posteriors. Only the matmul/gate compute is delegated; the decode stays 1:1.
+- Reuses every existing optimization (candle GGUF loader, ndarray AMX, bgz storage)
+  — zero new inference stack.
 
-| Backend | Source | Use when | C/C++ deps |
-|---|---|---|---|
-| `TranscodeLstm` | this plan family (traineddata→ndarray→recodebeam) | need an *existing* `.traineddata` language model, bit-identical to Tesseract | **none** |
-| `Ocrs` | `AdaWorldAPI/ocrs` + `rten` | default modern path; detection+layout+recognition, pure Rust | **none** |
-| `Tract` | `AdaWorldAPI/tract` | arbitrary/custom ONNX recognizer (PaddleOCR export, custom-trained) | **none** |
-| `Ort` (escape) | `AdaWorldAPI/ort` | GPU needed, or an op `tract` rejects | ONNX Runtime (C++) |
+## 3. The one DTO extension OCR imposes
 
-**Default = `Ocrs`.** `TranscodeLstm` is the accuracy/compat fallback AND the
-oracle's reference. `Ort` is opt-in only (it's the lone C++ dependency).
+`embedanything::infer` today shapes outputs as `EmbeddingVector`/`HypothesisScore`.
+The recognizer needs **per-timestep posteriors** (`[T, n_classes]`) for CTC/recodebeam.
+→ **D-OCR-15:** add a sequence-output variant to the DTO (`infer_sequence → [T,C]`),
+the only change the OCR use forces on the shared interface. Narrow, additive.
 
-## 3. What we SKIP and why (do not transcode)
+## 4. Deliverable index (D-OCR-NN)
 
-| Tesseract component | Disposition | Reason |
+| ID | Deliverable | Depends |
 |---|---|---|
-| Legacy pattern matcher (`classify/`, integer matcher, adaptive classifier) | **DROP** | Deprecated in Tesseract 4/5; LSTM supersedes it. |
-| Leptonica (C, ~250k LOC) | **REPLACE, not port** | Reimplement only the ~dozen ops Tesseract calls (Otsu/Sauvola, deskew, despeckle, CC-label, scale) on `image`/`imageproc`. `AdaWorldAPI/leptonica` fork kept only as oracle build dep. |
-| `textord/` + `ccstruct/` layout analysis (tab-stops, columns, `ColPartition`, baselines, reading order) | **REPLACE with `ocrs`** | Tens of thousands of LOC of intrusive linked-list / cyclic-mutable-graph heuristics. `ruff`-style syntax codegen cannot redesign ownership; `ocrs` neural layout is better anyway. See `tesseract-rs-neural-layout-ocrs-v1`. |
-| CMake / `BOOL_VAR`/`INT_VAR` global param registry / renderer plumbing | **DROP** | Cargo + a config struct + the `ClassView` registry replace it. |
-| C# (NuGet `Tesseract`, app wrappers) | **N/A** | No C# in OCR core; C# only in app wrappers — out of scope. |
-
-## 4. What we transcode faithfully (bit-exact tier)
-
-Routed through the AST-DLL codegen harness where mechanical, hand-ported where
-numeric exactness is subtle. See sub-plans for module-by-module assignment.
-
-- `.traineddata` container (`ccutil/tessdatamanager`), `unicharset`, recoder
-  (`lstm/unicharcompress`) → **D-OCR-1x** (`...-traineddata-ndarray-v1`).
-- LSTM forward (`lstm/{network,lstm,fullyconnected,convolve,weightmatrix}`) +
-  `recodebeam` decoder + DAWG dict (`dict/{dawg,trie,permdawg}`) → **D-OCR-2x**
-  (`...-lstm-recodebeam-v1`).
-
-## 5. Deliverable index (D-OCR-NN) and DAG
-
-| ID | Deliverable | Sub-plan | Depends on |
-|---|---|---|---|
-| D-OCR-00 | This master + skip-list + engine trait | (this) | — |
-| D-OCR-10 | `.traineddata`/unicharset/recoder reader | traineddata-ndarray | D-OCR-40 |
-| D-OCR-11 | LSTM weight hydration onto `ndarray::hpc` | traineddata-ndarray | D-OCR-10 |
-| D-OCR-20 | LSTM forward pass on ndarray | lstm-recodebeam | D-OCR-11 |
-| D-OCR-21 | `recodebeam` decoder + DAWG dict | lstm-recodebeam | D-OCR-20 |
-| D-OCR-22 | int8-SIMD numeric-exactness conformance | lstm-recodebeam | D-OCR-20 |
-| D-OCR-30 | `ocrs`/`rten` neural detect+layout+recognize backend | neural-layout-ocrs | — |
-| D-OCR-31 | `tract` arbitrary-ONNX backend + `ort` escape | neural-layout-ocrs | D-OCR-30 |
-| D-OCR-40 | clang C++ AST → IR ("AST DLL") | ast-dll-codegen | — |
-| D-OCR-41 | IR → Rust emission via `ruff` codegen crates | ast-dll-codegen | D-OCR-40 |
-| D-OCR-42 | diff-gate: emitted Rust vs FFI oracle | ast-dll-codegen | D-OCR-41 |
-| D-OCR-50 | OCR class + HHTL address scheme | ocr-canonical-soa-integration | canon |
-| D-OCR-51 | OCR `ValueSchema` preset over existing tenants | ocr-canonical-soa-integration | D-OCR-50 |
-| D-OCR-52 | DeepNSM + CAM/PQ token repair wiring | ocr-canonical-soa-integration | D-OCR-51 |
-| D-OCR-53 | bit-reproducibility harness (oracle diff) | ocr-canonical-soa-integration | D-OCR-21, D-OCR-30 |
-
-DAG critical path: **D-OCR-40 → 10 → 11 → 20 → 21 → 53** (the transcode oracle),
-with **D-OCR-30** (ocrs default) parallel and independent — ships first as the POC
-recognizer while the transcode tier matures behind it.
-
-## 6. Success criteria
-
-1. `Ocrs` backend produces tokens → canonical `NodeRow`s persisted via `NodeRowPacket` (POC gate).
-2. `TranscodeLstm` reproduces Tesseract C++ output **byte-identical** on ≥ 10k fixed line crops (oracle gate).
-3. Zero new crates.io deps (forks/path/`ndarray`/`lance`-family only); `Ort` is the sole opt-in C++ dep, feature-gated off by default.
-4. No layout-heuristic C++ transcoded (skip-list honored).
-5. OCR nodes carry no bespoke row geometry — they ride `ValueSchema`/`ValueTenant`/`EdgeCodecFlavor` unchanged.
-
-## 7. Open co-architecture decisions (resolve with operator)
-
-- **OD-1:** Does an OCR token get a *dedicated* `ValueTenant` (bbox + per-char
-  confidence + top-k recodebeam candidates), or ride `Meta`+`Fingerprint`+
-  `TurbovecResidue`? Adding a tenant is canon-significant (tenants never move/reuse).
-  POC rides existing; dedicated tenant deferred. (See ocr-soa-integration §OD-1.)
-- **OD-2:** Is `ocrs`/`rten` the permanent default with `TranscodeLstm` as compat
-  fallback, or is the transcode tier the eventual primary? Affects how much codegen
-  effort D-OCR-4x warrants.
-- **OD-3:** Does the AST-DLL frontend use libclang directly, or a clang→JSON dump
-  consumed by a Rust IR? (See ast-dll-codegen §OD-3.)
+| D-OCR-10 | `.traineddata` parse → unicharset/recoder/DAWG + recognizer-net → **GGUF export** | D-OCR-40 |
+| D-OCR-15 | `embedanything` sequence-output (`infer_sequence → [T,C]`) | — |
+| D-OCR-16 | recognizer GGUF runs via embedanything(candle)→ndarray; posteriors match C++ | D-OCR-10,15 |
+| D-OCR-21 | recodebeam + DAWG transcode (decode over hosted posteriors) | D-OCR-16 |
+| D-OCR-30 | **textord/ccstruct layout 1:1** (raw-pointer faithful) | D-OCR-40 |
+| D-OCR-31 | minimal Leptonica ops on image/imageproc (numeric parity) | — |
+| D-OCR-40 | AST-DLL clang→IR→Rust codegen harness (ruff emission) | — |
+| D-OCR-42 | oracle diff-gate (every module vs libtesseract FFI) | D-OCR-21,30 |
+| D-OCR-50 | OCR token → canonical NodeRow (OGAR class, HHTL, ValueSchema, edges) | canon |
+| D-OCR-52 | DeepNSM + char-confusion + CAM/PQ token repair | D-OCR-50 |
+| D-OCR-53 | bit-reproducibility harness (crop→text→NodeRow golden diff) | D-OCR-21,30 |
+
+Critical path: **40 → {10,30} → 16 → 21 → 42 → 53**. D-OCR-15 parallel (tiny). The
+layout transcode (30) and the recognizer host (16) are independent until decode.
+
+## 5. Success criteria
+
+1. `tesseract-rs` reproduces libtesseract output **byte-identical** on ≥10k line crops AND full-page layouts (oracle gate) — layout included, since it's 1:1.
+2. LSTM runs ONLY via embedanything(candle)→ndarray; no transcoded LSTM kernels.
+3. Zero new crates.io deps (forks + ndarray/lance family only); `ort` is the sole opt-in C++ dep (off by default).
+4. Recognized tokens land as canonical NodeRows with no bespoke geometry.
+
+## 6. Sub-plans (re-cast for v2)
+- `tesseract-rs-traineddata-gguf-v1` (D-OCR-10/16) — loader → GGUF → embedanything host.
+- `tesseract-rs-layout-transcode-v1` (D-OCR-30/31) — textord/ccstruct 1:1 + Leptonica ops.
+- `tesseract-rs-recodebeam-transcode-v1` (D-OCR-21) — decoder over hosted posteriors.
+- `tesseract-rs-ast-dll-codegen-v1` (D-OCR-40/42) — the codegen harness (covers layout).
+- `ocr-canonical-soa-integration-v1` (D-OCR-50/52/53) — OCR token = NodeRow + repair.
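
The deliverable index in §4 is a dependency DAG, so its ordering claims can be checked mechanically. The sketch below is an assumption-laden illustration: `topo_order` and `deliverables` are hypothetical helpers, the edges are copied from the "Depends" column as it stands in this patch, and the external `canon` dependency of D-OCR-50 is modelled as "no prerequisite".

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Returns a dependency-respecting order, or None if the graph has a cycle
/// or references an ID that is not itself listed.
fn topo_order(deps: &[(u32, Vec<u32>)]) -> Option<Vec<u32>> {
    let mut pending: BTreeMap<u32, BTreeSet<u32>> = deps
        .iter()
        .map(|(id, d)| (*id, d.iter().copied().collect()))
        .collect();
    let mut order = Vec::with_capacity(pending.len());
    while !pending.is_empty() {
        // Smallest ID whose prerequisites have all been scheduled already.
        let ready = pending
            .iter()
            .find(|(_, d)| d.is_empty())
            .map(|(id, _)| *id)?;
        pending.remove(&ready);
        for d in pending.values_mut() {
            d.remove(&ready);
        }
        order.push(ready);
    }
    Some(order)
}

/// D-OCR-NN IDs and their prerequisites, as listed in the §4 table.
fn deliverables() -> Vec<(u32, Vec<u32>)> {
    vec![
        (10, vec![40]),
        (15, vec![]),
        (16, vec![10, 15]),
        (21, vec![16]),
        (30, vec![40]),
        (31, vec![]),
        (40, vec![]),
        (42, vec![21, 30]),
        (50, vec![]), // "canon" is external to this index
        (52, vec![50]),
        (53, vec![21, 30]),
    ]
}
```

Any order this returns places 40 before 10 and 30, and 16 before 21 before 42, which is the shape of the stated critical path; a `None` result would mean the table contains a cycle or a dangling ID.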

From 5b7ba97254853eda291437f0e52327173169fe55 Mon Sep 17 00:00:00 2001
From: AdaWorldAPI <jan@exo.red>
Date: Mon, 15 Jun 2026 20:08:00 +0200
Subject: [PATCH 16/25] =?UTF-8?q?docs(plan):=20residue-cascade=20text=20re?=
 =?UTF-8?q?construction=20(no=20hashes);=20gate=20D-OCR-53;=20pure-Rust=20?=
 =?UTF-8?q?front-end=20=E2=80=94=20addresses=20Codex?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .../plans/ocr-canonical-soa-integration-v1.md | 44 +++++++++++--------
 1 file changed, 26 insertions(+), 18 deletions(-)

diff --git a/.claude/plans/ocr-canonical-soa-integration-v1.md b/.claude/plans/ocr-canonical-soa-integration-v1.md
index a31524d9..11efc96a 100644
--- a/.claude/plans/ocr-canonical-soa-integration-v1.md
+++ b/.claude/plans/ocr-canonical-soa-integration-v1.md
@@ -42,27 +42,34 @@ point of the splat-native / "one representation, many views" doctrine, applied t
 
 ## 3. OCR `ValueSchema` preset over EXISTING tenants (D-OCR-51)
 
-The 480-byte value slab already carves into `VALUE_TENANTS`. OCR **rides existing
-tenants** — no new tenant for the POC:
+The 480-byte value slab already carves into `VALUE_TENANTS`. An OCR token is **not
+a stored string and not a hash** — it is the *terminal of the perturbation cascade*,
+reconstructed exactly like every other node. Text = codebook index + residue.
 
-| Tenant (existing) | OCR use |
+| Tenant (existing) | OCR role |
 |---|---|
-| `Fingerprint` (32 B / 256-bit) | glyph/line identity print (DeepNSM `encoder` XOR-bind/bundle of the crop) |
-| `TurbovecResidue` (16 B, PQ) | glyph embedding → CAKES nearest-valid-token search |
-| `HelixResidue` (48 B) | orthogonal residue: per-token deviation from class centroid (confidence-as-residue) |
-| `Meta` (u64) | packed confidence + NSM-repair flags + token-subtype bits |
-| `EntityType` (u16) | OCR token class discriminator (Word/Number/Date/Glyph/TableCell) |
+| `HelixResidue` (48 B) | golden-spiral place/perturbation residue — how THIS token deviates from its codebook centroid (the recognizer/DeepNSM output IS this residue) |
+| `TurbovecResidue` (16 B, PQ) | PQ edge residue → CAKES nearest-valid-token search over the codebook |
+| `Meta` (u64) | codebook index/anchor + confidence + char-confusion/NSM-repair flags + recoder-code fallback for true-OOV |
+| `EntityType` (u16) | token subtype (Word/Number/Date/Glyph/TableCell) |
 | `Plasticity` (u32) | correction history / last-repair stamp |
 
-→ define `ValueSchema::Ocr` (or select `Cognitive` if its mask already covers the
-above) as a `FieldMask` over those `ValueTenant` positions. Selection only — it
-carves *within* the slab, moves nothing (canon: tenants never move/reuse).
+**Reconstruction (this is the round-trip, and it answers Codex P1):**
+`text  ⇄  codebook_index(Meta) + residue(HelixResidue ⊕ TurbovecResidue)`. Decode =
+the DeepNSM Morton-tile **stacked-pyramid perturbation-shader cascade** applied to
+the residue → CAKES nearest-valid-token over the codebook (DeepNSM `vocabulary` /
+coca `word_frequency`) → the word. No `Fingerprint` hash, no string column. The
+reversibility lives in residue + codebook, which is the architecture's whole point.
 
-**OD-1 (deferred):** a dedicated `ValueTenant::OcrEvidence` (bbox `[f16;4]` +
-per-char confidence + top-k recodebeam candidates) is the clean home for
-recognizer evidence. Adding a tenant is canon-significant, so the POC packs a
-compressed form into `Meta`+`HelixResidue` and defers the dedicated tenant to a
-follow-up once the evidence shape is stable (needs D-OCR-21 `lstm_choice_mode`).
+**True-OOV (no codebook neighbor — a raw code like `69B8`):** falls back to the
+**recoder-code residue** — `recodebeam` already emits recoder codes, not pixels, so
+the codes themselves are the reversible payload in `Meta`, repaired by the
+char-confusion grammar (D-OCR-52). Still a residue, never a hash.
+
+**ValueSchema:** `Cognitive` does NOT include `HelixResidue`/`TurbovecResidue`, so
+OCR needs a dedicated **`ValueSchema::Ocr`** = `FieldMask` over
+{`HelixResidue`,`TurbovecResidue`,`Meta`,`EntityType`,`Plasticity`}. Selection only;
+moves no tenant (canon: tenants never move/reuse).
 
 ## 4. Repair: DeepNSM + CAM/PQ nearest-valid-token (D-OCR-52)
 
@@ -75,7 +82,7 @@ The recognizer emits candidates+confidence; repair is the brainstem we already h
 - **Word layer = `deepnsm`:** `vocabulary` → `codebook` → `parser`/`pos` → `encoder`
   → `similarity`/`cam64`/`crystal_neighborhood`. Word-level plausibility + disambiguation.
 - **Nearest-valid-token = helix / CAM-PQ / CAKES:** the glyph `TurbovecResidue`
-  (PQ) + `HelixResidue` feed CAKES nearest-valid-token; CHAODA flags anomalous
+  (PQ) + `HelixResidue` feed CAKES nearest-valid-token; CHAODA (Clustered Hierarchical Anomaly and Outlier Detection Algorithms) flags anomalous
   tokens (likely-misrecognized). This is `bgz-tensor` CAM-PQ + `crates/helix`.
 
 Repaired token writes back: corrected text → `Fingerprint`/`EntityType`, repair
@@ -108,7 +115,8 @@ SIMD numeric exactness. OCR is the best external oracle the substrate has.
 - **D-OCR-52:** DeepNSM + character-confusion layer + CAM/PQ repair wired; a known
   OCR-garbage fixture (`69B8`, `rn`→`m`) is repaired by plausibility.
 - **D-OCR-53:** golden-file (crop → NodeRow bytes) regression green, shared with the
-  SoA migration suite.
+  SoA migration suite. **Prereq: D-OCR-50 + D-OCR-51** (class/HHTL/ValueSchema must
+  define the row layout before bytes can be golden-diffed).
 
 ## 8. Open decisions
 
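
The "selection only, moves no tenant" rule for `ValueSchema::Ocr` in §3 amounts to a bitmask over tenant positions. The sketch below shows that idea as of this patch; `SketchTenant` and `SketchMask` are hypothetical stand-ins, not the real `ValueTenant`/`FieldMask` definitions in `canonical_node.rs` and `class_view.rs`, and the bit positions are invented for illustration.

```rust
/// Hypothetical tenant list; the discriminants are illustrative bit
/// positions, not the canonical slab layout.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SketchTenant {
    Meta,
    EntityType,
    Plasticity,
    Fingerprint,
    TurbovecResidue,
    HelixResidue,
}

/// One bit per tenant: a mask selects tenants without moving any of them.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct SketchMask(u32);

impl SketchMask {
    /// Builds a mask by OR-ing the bit of every listed tenant.
    fn of(tenants: &[SketchTenant]) -> Self {
        SketchMask(tenants.iter().fold(0, |bits, &t| bits | (1u32 << t as u32)))
    }

    /// True if the tenant's bit is set in this mask.
    fn contains(self, t: SketchTenant) -> bool {
        (self.0 & (1u32 << t as u32)) != 0
    }
}

/// The five tenants §3 lists for the OCR preset.
fn ocr_mask() -> SketchMask {
    SketchMask::of(&[
        SketchTenant::HelixResidue,
        SketchTenant::TurbovecResidue,
        SketchTenant::Meta,
        SketchTenant::EntityType,
        SketchTenant::Plasticity,
    ])
}
```

Because the mask only names positions, a schema that omits `Fingerprint` leaves that tenant's bytes where they are, which is the property the canon rule protects.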

From aa0ef21a6f8201ff0721f47ba810edef969ffca7 Mon Sep 17 00:00:00 2001
From: AdaWorldAPI <jan@exo.red>
Date: Mon, 15 Jun 2026 20:08:01 +0200
Subject: [PATCH 17/25] =?UTF-8?q?docs(plan):=20residue-cascade=20text=20re?=
 =?UTF-8?q?construction=20(no=20hashes);=20gate=20D-OCR-53;=20pure-Rust=20?=
 =?UTF-8?q?front-end=20=E2=80=94=20addresses=20Codex?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .claude/plans/tesseract-rs-transcode-master-v1.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.claude/plans/tesseract-rs-transcode-master-v1.md b/.claude/plans/tesseract-rs-transcode-master-v1.md
index d265ec6f..2c6f51e3 100644
--- a/.claude/plans/tesseract-rs-transcode-master-v1.md
+++ b/.claude/plans/tesseract-rs-transcode-master-v1.md
@@ -71,7 +71,7 @@ the only change the OCR use forces on the shared interface. Narrow, additive.
 | D-OCR-42 | oracle diff-gate (every module vs libtesseract FFI) | D-OCR-21,30 |
 | D-OCR-50 | OCR token → canonical NodeRow (OGAR class, HHTL, ValueSchema, edges) | canon |
 | D-OCR-52 | DeepNSM + char-confusion + CAM/PQ token repair | D-OCR-50 |
-| D-OCR-53 | bit-reproducibility harness (crop→text→NodeRow golden diff) | D-OCR-21,30 |
+| D-OCR-53 | bit-reproducibility harness (crop→text→NodeRow golden diff) | D-OCR-21,30,**50,51** |
 
 Critical path: **40 → {10,30} → 16 → 21 → 42 → 53**. D-OCR-15 parallel (tiny). The
 layout transcode (30) and the recognizer host (16) are independent until decode.

From abdbf9d2380fc01222e5100760f755f27bdb85a7 Mon Sep 17 00:00:00 2001
From: AdaWorldAPI <jan@exo.red>
Date: Mon, 15 Jun 2026 20:08:02 +0200
Subject: [PATCH 18/25] =?UTF-8?q?docs(plan):=20residue-cascade=20text=20re?=
 =?UTF-8?q?construction=20(no=20hashes);=20gate=20D-OCR-53;=20pure-Rust=20?=
 =?UTF-8?q?front-end=20=E2=80=94=20addresses=20Codex?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .claude/plans/tesseract-rs-layout-transcode-v1.md | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/.claude/plans/tesseract-rs-layout-transcode-v1.md b/.claude/plans/tesseract-rs-layout-transcode-v1.md
index 3890ff49..554275ca 100644
--- a/.claude/plans/tesseract-rs-layout-transcode-v1.md
+++ b/.claude/plans/tesseract-rs-layout-transcode-v1.md
@@ -37,3 +37,8 @@ libtesseract on a fixed page set; per-op image results bit-equal on fixtures.
 
 ## 5. Open
 - **OD-30a:** one intrusive-list helper crate (`tess_elist`) shared across modules, or inline per-module? (Shared reduces unsafe surface.)
+
+## 6. Front-end constraint (carried from retired neural-layout plan)
+The PDF/image front-end MUST be pure-Rust to honor the zero-C posture: image inputs
+via `image`/`imageproc`; PDF via a pure-Rust path (`ferrules`), **not** `pdfium-render`
+(it binds native PDFium, a C++ library behind a C API). The zero-C acceptance gate is otherwise unsatisfiable.

From c089518e754667d41422fbf13e7a1dde6746e068 Mon Sep 17 00:00:00 2001
From: AdaWorldAPI <jan@exo.red>
Date: Mon, 15 Jun 2026 20:12:45 +0200
Subject: [PATCH 19/25] docs(plan): helix residue = 48-bit (2x ResidueEdge),
 distinct from 48-byte tenant

---
 .claude/plans/ocr-canonical-soa-integration-v1.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/.claude/plans/ocr-canonical-soa-integration-v1.md b/.claude/plans/ocr-canonical-soa-integration-v1.md
index 11efc96a..f6b8e5c3 100644
--- a/.claude/plans/ocr-canonical-soa-integration-v1.md
+++ b/.claude/plans/ocr-canonical-soa-integration-v1.md
@@ -48,14 +48,14 @@ reconstructed exactly like every other node. Text = codebook index + residue.
 
 | Tenant (existing) | OCR role |
 |---|---|
-| `HelixResidue` (48 B) | golden-spiral place/perturbation residue — how THIS token deviates from its codebook centroid (the recognizer/DeepNSM output IS this residue) |
+| `HelixResidue` | the token's **48-bit** perturbation = 2× `ResidueEdge` (each 24-bit: `start_idx`,`end_idx`,`floor_version`), how THIS token deviates from its codebook centroid; stored within the 48-**byte** HelixResidue tenant (≤16 edges = the Morton-tile stacked-pyramid cascade levels) |
 | `TurbovecResidue` (16 B, PQ) | PQ edge residue → CAKES nearest-valid-token search over the codebook |
 | `Meta` (u64) | codebook index/anchor + confidence + char-confusion/NSM-repair flags + recoder-code fallback for true-OOV |
 | `EntityType` (u16) | token subtype (Word/Number/Date/Glyph/TableCell) |
 | `Plasticity` (u32) | correction history / last-repair stamp |
 
 **Reconstruction (this is the round-trip, and it answers Codex P1):**
-`text  ⇄  codebook_index(Meta) + residue(HelixResidue ⊕ TurbovecResidue)`. Decode =
+`text  ⇄  codebook_index(Meta) + residue(helix 48-bit / 2×ResidueEdge ⊕ TurbovecResidue PQ)`. Decode =
 the DeepNSM Morton-tile **stacked-pyramid perturbation-shader cascade** applied to
 the residue → CAKES nearest-valid-token over the codebook (DeepNSM `vocabulary` /
 coca `word_frequency`) → the word. No `Fingerprint` hash, no string column. The

From 39a23deed05a0a4614e781b02c24cbb7aa63f5a8 Mon Sep 17 00:00:00 2001
From: AdaWorldAPI <jan@exo.red>
Date: Mon, 15 Jun 2026 20:14:05 +0200
Subject: [PATCH 20/25] docs(plan): helix residue = phi-spiral endpoint-pair
 edge (3B/24bit), crate's own framing

---
 .claude/plans/ocr-canonical-soa-integration-v1.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/.claude/plans/ocr-canonical-soa-integration-v1.md b/.claude/plans/ocr-canonical-soa-integration-v1.md
index f6b8e5c3..fb9214a0 100644
--- a/.claude/plans/ocr-canonical-soa-integration-v1.md
+++ b/.claude/plans/ocr-canonical-soa-integration-v1.md
@@ -48,14 +48,14 @@ reconstructed exactly like every other node. Text = codebook index + residue.
 
 | Tenant (existing) | OCR role |
 |---|---|
-| `HelixResidue` | the token's **48-bit** perturbation = 2× `ResidueEdge` (each 24-bit: `start_idx`,`end_idx`,`floor_version`), how THIS token deviates from its codebook centroid; stored within the 48-**byte** HelixResidue tenant (≤16 edges = the Morton-tile stacked-pyramid cascade levels) |
+| `HelixResidue` | helix residue = the orthogonal φ-spiral **endpoint-pair edge** at the token's place (curve-ruler MODULUS 17 / STRIDE 4, regenerated from template — "8K resolution at Super-8 cost", only the endpoint pair stored at **3 B / 24-bit per edge**). The token's perturbation = **2 edges = 48-bit**; stored within the 48-**byte** tenant (16 edges = the Morton-tile stacked-pyramid cascade levels). HHTL is PLACE, helix is the RESIDUE. |
 | `TurbovecResidue` (16 B, PQ) | PQ edge residue → CAKES nearest-valid-token search over the codebook |
 | `Meta` (u64) | codebook index/anchor + confidence + char-confusion/NSM-repair flags + recoder-code fallback for true-OOV |
 | `EntityType` (u16) | token subtype (Word/Number/Date/Glyph/TableCell) |
 | `Plasticity` (u32) | correction history / last-repair stamp |
 
 **Reconstruction (this is the round-trip, and it answers Codex P1):**
-`text  ⇄  codebook_index(Meta) + residue(helix 48-bit / 2×ResidueEdge ⊕ TurbovecResidue PQ)`. Decode =
+`text  ⇄  codebook_index(Meta) + residue(helix 48-bit = 2 endpoint-pair edges ⊕ TurbovecResidue PQ)`. Decode =
 the DeepNSM Morton-tile **stacked-pyramid perturbation-shader cascade** applied to
 the residue → CAKES nearest-valid-token over the codebook (DeepNSM `vocabulary` /
 coca `word_frequency`) → the word. No `Fingerprint` hash, no string column. The

From 4c46707116796a757f83130fb0462263e8cf805f Mon Sep 17 00:00:00 2001
From: AdaWorldAPI <jan@exo.red>
Date: Mon, 15 Jun 2026 20:18:06 +0200
Subject: [PATCH 21/25] docs(plan): helix residue in BITS (24b/edge,
 48b/token); disown bogus byte-tenant

---
 .claude/plans/ocr-canonical-soa-integration-v1.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.claude/plans/ocr-canonical-soa-integration-v1.md b/.claude/plans/ocr-canonical-soa-integration-v1.md
index fb9214a0..c5423bf0 100644
--- a/.claude/plans/ocr-canonical-soa-integration-v1.md
+++ b/.claude/plans/ocr-canonical-soa-integration-v1.md
@@ -48,7 +48,7 @@ reconstructed exactly like every other node. Text = codebook index + residue.
 
 | Tenant (existing) | OCR role |
 |---|---|
-| `HelixResidue` | helix residue = the orthogonal φ-spiral **endpoint-pair edge** at the token's place (curve-ruler MODULUS 17 / STRIDE 4, regenerated from template — "8K resolution at Super-8 cost", only the endpoint pair stored at **3 B / 24-bit per edge**). The token's perturbation = **2 edges = 48-bit**; stored within the 48-**byte** tenant (16 edges = the Morton-tile stacked-pyramid cascade levels). HHTL is PLACE, helix is the RESIDUE. |
+| helix residue (real `helix` crate, NOT the bogus value-tenant) | the orthogonal φ-spiral **endpoint-pair edge** at the token's place (`ResidueEncoder::encode`, curve-ruler MODULUS 17 / STRIDE 4, regenerated from template — "8K resolution at Super-8 cost"). One edge = **24 bits**; the token's perturbation = **2 edges = 48 bits**. HHTL is PLACE, helix is the RESIDUE. **Do NOT use the `ValueTenant::HelixResidue` slab — it is a mislabeled placeholder that does no φ/golden math and is slated for removal.** |
 | `TurbovecResidue` (16 B, PQ) | PQ edge residue → CAKES nearest-valid-token search over the codebook |
 | `Meta` (u64) | codebook index/anchor + confidence + char-confusion/NSM-repair flags + recoder-code fallback for true-OOV |
 | `EntityType` (u16) | token subtype (Word/Number/Date/Glyph/TableCell) |

From df48c87bae13d6049b51e82a0da87ff815682d10 Mon Sep 17 00:00:00 2001
From: AdaWorldAPI <jan@exo.red>
Date: Mon, 15 Jun 2026 20:20:11 +0200
Subject: [PATCH 22/25] docs(plan): helix residue = 24-bit golden index (probe
 #495); helix-48 tenant is the dilution gap

---
 .claude/plans/ocr-canonical-soa-integration-v1.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/.claude/plans/ocr-canonical-soa-integration-v1.md b/.claude/plans/ocr-canonical-soa-integration-v1.md
index c5423bf0..6a9bc5e8 100644
--- a/.claude/plans/ocr-canonical-soa-integration-v1.md
+++ b/.claude/plans/ocr-canonical-soa-integration-v1.md
@@ -48,14 +48,14 @@ reconstructed exactly like every other node. Text = codebook index + residue.
 
 | Tenant (existing) | OCR role |
 |---|---|
-| helix residue (real `helix` crate, NOT the bogus value-tenant) | the orthogonal φ-spiral **endpoint-pair edge** at the token's place (`ResidueEncoder::encode`, curve-ruler MODULUS 17 / STRIDE 4, regenerated from template — "8K resolution at Super-8 cost"). One edge = **24 bits**; the token's perturbation = **2 edges = 48 bits**. HHTL is PLACE, helix is the RESIDUE. **Do NOT use the `ValueTenant::HelixResidue` slab — it is a mislabeled placeholder that does no φ/golden math and is slated for removal.** |
+| helix residue (real `helix` crate, probe-backed #495 `helix_bitdepth_probe`) | direction as a **discrete golden index — 24-bit lossless** (φ-spiral, vs ≤f16). This is THE helix residue. The 48-**byte** `ValueTenant::HelixResidue` is the diluted placeholder #496 named the *"helix-48 dilution gap"* (PROBE-GATED research, not real φ-geometry) — **do NOT use it**. HHTL = PLACE, helix 24-bit golden index = RESIDUE. |
 | `TurbovecResidue` (16 B, PQ) | PQ edge residue → CAKES nearest-valid-token search over the codebook |
 | `Meta` (u64) | codebook index/anchor + confidence + char-confusion/NSM-repair flags + recoder-code fallback for true-OOV |
 | `EntityType` (u16) | token subtype (Word/Number/Date/Glyph/TableCell) |
 | `Plasticity` (u32) | correction history / last-repair stamp |
 
 **Reconstruction (this is the round-trip, and it answers Codex P1):**
-`text  ⇄  codebook_index(Meta) + residue(helix 48-bit = 2 endpoint-pair edges ⊕ TurbovecResidue PQ)`. Decode =
+`text  ⇄  codebook_index(Meta) + residue(helix 24-bit golden index ⊕ TurbovecResidue PQ)`. Decode =
 the DeepNSM Morton-tile **stacked-pyramid perturbation-shader cascade** applied to
 the residue → CAKES nearest-valid-token over the codebook (DeepNSM `vocabulary` /
 coca `word_frequency`) → the word. No `Fingerprint` hash, no string column. The

From 209ea6abcbb836e122446568f7d23e0e725d368f Mon Sep 17 00:00:00 2001
From: AdaWorldAPI <jan@exo.red>
Date: Mon, 15 Jun 2026 20:41:22 +0200
Subject: [PATCH 23/25] =?UTF-8?q?docs(plan):=20centroid=20attention=20fiel?=
 =?UTF-8?q?d=20synthesis=20=E2=80=94=20helix=20residue=20as=20field,=20uni?=
 =?UTF-8?q?fy=20perception/reasoning/grammar=20(ocr-canonical-soa-integrat?=
 =?UTF-8?q?ion-v1.md)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
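Reviewer note (illustration only, not part of the commit). The row changed below says the helix residue is evaluated from the φ-template and that only the 24-bit index is kept. A minimal sketch of that idea, assuming a golden-angle (Fibonacci) lattice on the unit sphere: the 3-D setting, the lattice construction, and the names `direction_from_index`, `pack`, and `unpack` are all hypothetical and are not the `helix` crate API.

```rust
/// Hypothetical illustration: NOT the `helix` crate API.
const INDEX_BITS: u32 = 24;
const N: u32 = 1 << INDEX_BITS;

/// Golden angle in radians: pi * (3 - sqrt(5)).
fn golden_angle() -> f64 {
    std::f64::consts::PI * (3.0 - 5.0_f64.sqrt())
}

/// Evaluate the unit direction for a golden index on a Fibonacci sphere
/// lattice. Nothing but the index is stored; the direction is recomputed.
fn direction_from_index(i: u32) -> [f64; 3] {
    assert!(i < N, "index must fit in 24 bits");
    // z steps evenly through (-1, 1); the azimuth advances by the golden angle.
    let z = 1.0 - (2.0 * i as f64 + 1.0) / N as f64;
    let r = (1.0 - z * z).sqrt();
    let theta = golden_angle() * i as f64;
    [r * theta.cos(), r * theta.sin(), z]
}

/// Pack a 24-bit index into three little-endian bytes.
fn pack(i: u32) -> [u8; 3] {
    assert!(i < N, "index must fit in 24 bits");
    let b = i.to_le_bytes();
    [b[0], b[1], b[2]]
}

/// Recover the index from its three stored bytes.
fn unpack(b: [u8; 3]) -> u32 {
    u32::from_le_bytes([b[0], b[1], b[2], 0])
}
```

Under these assumptions the stored cost is three bytes per token, and `direction_from_index(unpack(bytes))` regenerates the direction on demand, which is the "evaluated, never stored" property the table row describes.
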
 .claude/plans/ocr-canonical-soa-integration-v1.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/.claude/plans/ocr-canonical-soa-integration-v1.md b/.claude/plans/ocr-canonical-soa-integration-v1.md
index 6a9bc5e8..a2e9ecff 100644
--- a/.claude/plans/ocr-canonical-soa-integration-v1.md
+++ b/.claude/plans/ocr-canonical-soa-integration-v1.md
@@ -48,14 +48,14 @@ reconstructed exactly like every other node. Text = codebook index + residue.
 
 | Tenant (existing) | OCR role |
 |---|---|
-| helix residue (real `helix` crate, probe-backed #495 `helix_bitdepth_probe`) | direction as a **discrete golden index — 24-bit lossless** (φ-spiral, vs ≤f16). This is THE helix residue. The 48-**byte** `ValueTenant::HelixResidue` is the diluted placeholder #496 named the *"helix-48 dilution gap"* (PROBE-GATED research, not real φ-geometry) — **do NOT use it**. HHTL = PLACE, helix 24-bit golden index = RESIDUE. |
+| helix residue = **centroid attention field** (NOT a stored code) | The 24-bit golden index is the **query↔centroid alignment** (φ-spiral direction = how this point attends to its place-centroid); the Morton-tile stacked-pyramid perturbation-shader is **multi-scale attention** (coarse centroid → fine perturbation = HHTL cascade in residue space). The field is **evaluated from the φ-template, never stored** ("8K resolution at Super-8 cost" — only the index is kept). Place=HHTL centroid; residue=perturbation off it. The 48-byte `ValueTenant::HelixResidue` is category-wrong (stores a field that must be computed) — do NOT use it. |
 | `TurbovecResidue` (16 B, PQ) | PQ edge residue → CAKES nearest-valid-token search over the codebook |
 | `Meta` (u64) | codebook index/anchor + confidence + char-confusion/NSM-repair flags + recoder-code fallback for true-OOV |
 | `EntityType` (u16) | token subtype (Word/Number/Date/Glyph/TableCell) |
 | `Plasticity` (u32) | correction history / last-repair stamp |
 
 **Reconstruction (this is the round-trip, and it answers Codex P1):**
-`text  ⇄  codebook_index(Meta) + residue(helix 24-bit golden index ⊕ TurbovecResidue PQ)`. Decode =
+`text  ⇄  codebook_index(Meta) + field-eval(helix 24-bit golden-index attention ⊕ TurbovecResidue PQ)`. Decode =
 the DeepNSM Morton-tile **stacked-pyramid perturbation-shader cascade** applied to
 the residue → CAKES nearest-valid-token over the codebook (DeepNSM `vocabulary` /
 coca `word_frequency`) → the word. No `Fingerprint` hash, no string column. The

From 6e316fa6e2d69c7705ae683ecdc5eaca9d99abd3 Mon Sep 17 00:00:00 2001
From: AdaWorldAPI <jan@exo.red>
Date: Mon, 15 Jun 2026 20:41:23 +0200
Subject: [PATCH 24/25] =?UTF-8?q?docs(plan):=20centroid=20attention=20fiel?=
 =?UTF-8?q?d=20synthesis=20=E2=80=94=20helix=20residue=20as=20field,=20uni?=
 =?UTF-8?q?fy=20perception/reasoning/grammar=20(soa-centroid-attention-fie?=
 =?UTF-8?q?ld-synthesis-v1.md)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
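Reviewer note (illustration only, not part of the commit). OD-B in the plan added below leaves open whether centroid drift should follow NARS revision on `CausalEdge64`. For reference, a minimal sketch of the standard NARS revision rule over evidence weights. It uses `f64` for readability, whereas the plan implies an integer encoding; the `Truth` struct, the `revise` function, and the neutral result for zero evidence are assumptions, not the `causal-edge` API.

```rust
/// Hypothetical NARS truth value: frequency f in [0, 1], confidence c in [0, 1).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Truth {
    f: f64,
    c: f64,
}

/// Evidential horizon k (the usual NARS default is 1).
const K: f64 = 1.0;

/// Convert confidence to an evidence weight: w = k * c / (1 - c).
fn evidence(t: Truth) -> f64 {
    K * t.c / (1.0 - t.c)
}

/// Revision: pool the evidence of two independent judgments.
fn revise(a: Truth, b: Truth) -> Truth {
    assert!(a.c < 1.0 && b.c < 1.0, "confidence must stay below 1");
    let (wa, wb) = (evidence(a), evidence(b));
    let w = wa + wb;
    if w == 0.0 {
        // Assumption: with no evidence on either side, return a neutral value.
        return Truth { f: 0.5, c: 0.0 };
    }
    Truth {
        f: (a.f * wa + b.f * wb) / w, // evidence-weighted frequency
        c: w / (w + K),               // pooled evidence raises confidence
    }
}
```

Revising two conflicting judgments of equal confidence 0.5 yields frequency 0.5 with confidence 2/3: the coupling strength moves toward the pooled evidence, and confidence grows because more evidence was seen.
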
 ...a-centroid-attention-field-synthesis-v1.md | 65 +++++++++++++++++++
 1 file changed, 65 insertions(+)
 create mode 100644 .claude/plans/soa-centroid-attention-field-synthesis-v1.md

diff --git a/.claude/plans/soa-centroid-attention-field-synthesis-v1.md b/.claude/plans/soa-centroid-attention-field-synthesis-v1.md
new file mode 100644
index 00000000..66e6c2df
--- /dev/null
+++ b/.claude/plans/soa-centroid-attention-field-synthesis-v1.md
@@ -0,0 +1,65 @@
+# SoA Centroid Attention Field — Unified Synthesis v1
+
+> **Type:** plan (phase-2 marker / co-architecture). Unifies recognition + reasoning + grammar as reads of ONE field.
+> **Status:** PLANTED 2026-06-15. Gated on `cycle-coherent-soa-snapshot-v1` (plastic field ⇒ COW writes).
+> **Canon:** helix crate (golden-index residue, φ-template); deepnsm; causal-edge (pearl/nars); TEKAMOLO (#495).
+
+---
+
+## 0. The one idea
+
+The **24-bit helix residue + Morton-tile stacked-pyramid perturbation-shader IS a
+centroid attention field.** Place (HHTL) = centroid; residue (24-bit golden index)
+= each point's perturbation off it = the **query↔key alignment**; the pyramid =
+**multi-scale attention** (coarse centroid → fine). The field is *evaluated from the
+φ-spiral template, never stored*. Everything below is a **read of this one field at
+a different scale** — not separate engines bolted together.
+
+## 1. The reads (each is the same field, different scale)
+
+| Capability | Real crate / source | What it is, as a field read |
+|---|---|---|
+| **Perception (ONNX/LSTM)** | embedanything(candle)/GGUF host | emits a **query** into the field (golden index + posteriors); the ONLY learned-perceptual part, stays hosted |
+| **Attention eval** | `helix` (golden index, curve-ruler, `DistanceLut`) | query↔centroid alignment; Morton pyramid = coarse→fine resolution |
+| **Markov context building / bundling** | `deepnsm::markov_bundle`, `encoder` | temporal **superposition along the field** = the bundling read (context = bundled perturbations) |
+| **Quorum + NARS reasoning** | `causal-edge::{pearl,nars,syllogism}` | centroid **coupling** = edge read; quorum = agreement of multiple field reads; NARS truth = coupling strength |
+| **Grammar heuristics** | `deepnsm::{parser,pos,morphology,spo,syllogism}` | syntactic **field masks** = structured attention over the field |
+| **Relative-pronoun / syntax order** | TEKAMOLO resolver (#495) | resolves adverbial/relative-pronoun binding = constrained attention path |
+| **Episodic / coref** | AriGraph (`EpisodicWitness64`) *(name "aerial" — confirm)* | temporal chain read = the field over witness-time |
+| **Nearest-valid-token** | `crystal_neighborhood`, `cam64`, CAKES + `turbovec` | field-alignment argmax = read-off to codebook word |
+
+## 2. Why this is one object, not a pipeline
+
+VSA bind/bundle/similarity **are** the field operations: bind = perturbation off
+centroid, bundle = the pyramid's coarse-level superposition, similarity = field
+alignment (`DistanceLut`). So DeepNSM's markov_bundle is the *symbolic readout* of
+the field; NARS/quorum is the *edge coupling*; grammar/TEKAMOLO are *attention
+masks*. No separate learning machine is needed — the attention field already does
+binding/bundling/attention in one structure (Frady/Kleyko 1707.01429: trained-RNN
+⊁ VSA for symbol sequences). What's missing is only **plasticity**.
+
+## 3. Phase-2: make the field plastic (the "learning edges")
+
+Not new tenants — **the field adapts**:
+- centroid **drift** (place-centroids move toward corpus density);
+- shader **perturbation-gain** adaptation (the pyramid's response sharpens);
+- timed by `Plasticity` tenant; coupled by `CausalEdge64` strength (NARS mantissa moves).
+The adapted field is still evaluated from the φ-template (not materialized). **Hard dep:** `cycle-coherent-soa-snapshot`
+COW: the plastic field mutates per cycle, and without the snapshot it thrashes Lance.
+
+## 4. ONNX combination (operator's point)
+
+The ONNX-shaped recognizer and the field **meet at the query boundary**: ONNX emits
+posteriors → the field's golden-index query; field eval + grammar masks + NARS
+coupling resolve to the token. So ONNX = the perceptual *encoder into* the field;
+the field = everything symbolic/sequential/relational. One substrate, two scales.
+
+## 5. Determinism split (non-negotiable)
+- **Frozen mode** (centroids/gains fixed) → bit-reproducible → the Tesseract oracle + golden-file harness run here.
+- **Plastic mode** (field adapts) → live use; NOT golden-diffable; gated by snapshot.
+Two modes, explicitly separated, or the bit-repro guarantee is lost.
+
+## 6. Open
+- **OD-A:** confirm "aerial" = AriGraph (`EpisodicWitness64`) vs a distinct AST crate.
+- **OD-B:** centroid drift rule — Hebbian on `Plasticity`, or NARS-revision on `CausalEdge64`? (probe-gate, measure first.)
+- **OD-C:** operator sign-off required for any new tenant (anti-invention guardrail) — phase-2 should need NONE (field is evaluated, not stored).

From 298e8e93f64639c532e47cbaae01347709ba2018 Mon Sep 17 00:00:00 2001
From: AdaWorldAPI <jan@exo.red>
Date: Mon, 15 Jun 2026 20:43:40 +0200
Subject: [PATCH 25/25] =?UTF-8?q?docs(plan):=20aerial=20=3D=20lance-graph-?=
 =?UTF-8?q?arm-discovery=20(Aerial+=20codebook-distance=20rule=20mining),?=
 =?UTF-8?q?=20not=20AriGraph=20=E2=80=94=20the=20in-tree=20proof=20of=20fi?=
 =?UTF-8?q?eld-as-learner?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
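Reviewer note (illustration only, not part of the commit). The row added below describes an integer codebook-distance oracle that is float-free and bitwise-deterministic. A minimal sketch of that style of oracle, assuming a toy 4-entry palette with integer coordinates and squared-L2 distances: the palette contents and the names `build_lut` and `code_distance` are hypothetical and are not taken from `lance-graph-arm-discovery`.

```rust
/// Hypothetical palette; the plan's palette256 would hold 256 entries.
const PALETTE: [[i32; 2]; 4] = [[0, 0], [3, 4], [6, 8], [-3, -4]];

/// Precompute integer squared distances between all palette entries.
fn build_lut() -> [[u32; 4]; 4] {
    let mut lut = [[0u32; 4]; 4];
    for a in 0..4 {
        for b in 0..4 {
            let dx = PALETTE[a][0] - PALETTE[b][0];
            let dy = PALETTE[a][1] - PALETTE[b][1];
            lut[a][b] = (dx * dx + dy * dy) as u32;
        }
    }
    lut
}

/// Distance between two equal-length code sequences: a sum of table lookups.
/// Integer arithmetic only, so the result is identical on every platform.
fn code_distance(lut: &[[u32; 4]; 4], x: &[u8], y: &[u8]) -> u32 {
    assert_eq!(x.len(), y.len(), "code sequences must have equal length");
    x.iter()
        .zip(y)
        .map(|(&a, &b)| lut[a as usize][b as usize])
        .sum()
}
```

Because the oracle is a lookup table plus integer addition, there is no rounding mode, no accumulation-order effect, and no random seed, which is what makes a bit-reproducible rule-mining run possible under these assumptions.
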
 .../plans/soa-centroid-attention-field-synthesis-v1.md | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/.claude/plans/soa-centroid-attention-field-synthesis-v1.md b/.claude/plans/soa-centroid-attention-field-synthesis-v1.md
index 66e6c2df..952fdc64 100644
--- a/.claude/plans/soa-centroid-attention-field-synthesis-v1.md
+++ b/.claude/plans/soa-centroid-attention-field-synthesis-v1.md
@@ -25,7 +25,8 @@ a different scale** — not separate engines bolted together.
 | **Quorum + NARS reasoning** | `causal-edge::{pearl,nars,syllogism}` | centroid **coupling** = edge read; quorum = agreement of multiple field reads; NARS truth = coupling strength |
 | **Grammar heuristics** | `deepnsm::{parser,pos,morphology,spo,syllogism}` | syntactic **field masks** = structured attention over the field |
 | **Relative-pronoun / syntax order** | TEKAMOLO resolver (#495) | resolves adverbial/relative-pronoun binding = constrained attention path |
-| **Episodic / coref** | AriGraph (`EpisodicWitness64`) *(name "aerial" — confirm)* | temporal chain read = the field over witness-time |
+| **Rule learning (the real "aerial")** | `lance-graph-arm-discovery::aerial` — Aerial+ transcode (arXiv 2504.19354), **autoencoder replaced by integer codebook-distance oracle** (palette256, ρ=0.9973 vs cosine) | mines SPO association rules **float-free / bitwise-deterministic** → `arm_to_truth_u8` → `CausalEdge64` confidence_u8 + i4 mantissa. This IS "learning edges" — the field's codebook distance replaces the f32 autoencoder. |
+| **Episodic / coref** | AriGraph (`EpisodicWitness64`) | temporal chain read = the field over witness-time |
 | **Nearest-valid-token** | `crystal_neighborhood`, `cam64`, CAKES + `turbovec` | field-alignment argmax = read-off to codebook word |
 
 ## 2. Why this is one object, not a pipeline
@@ -36,7 +37,10 @@ alignment (`DistanceLut`). So DeepNSM's markov_bundle is the *symbolic readout*
 the field; NARS/quorum is the *edge coupling*; grammar/TEKAMOLO are *attention
 masks*. No separate learning machine is needed — the attention field already does
 binding/bundling/attention in one structure (Frady/Kleyko 1707.01429: trained-RNN
-⊁ VSA for symbol sequences). What's missing is only **plasticity**.
+⊁ VSA for symbol sequences). **`aerial` is the proof in-tree:** Aerial+'s f32
+autoencoder is replaced by the integer codebook-distance oracle (the field) and
+still mines rules — neurosymbolic learning with NO autoencoder, NO SGD, NO seed.
+What's missing is only **plasticity** (centroid drift), not a learner.
 
 ## 3. Phase-2: make the field plastic (the "learning edges")
 
@@ -60,6 +64,6 @@ the field = everything symbolic/sequential/relational. One substrate, two scales
 Two modes, explicitly separated, or the bit-repro guarantee is lost.
 
 ## 6. Open
-- **OD-A:** confirm "aerial" = AriGraph (`EpisodicWitness64`) vs a distinct AST crate.
+- **OD-A:** RESOLVED — "aerial" = `lance-graph-arm-discovery::aerial` (Aerial+ rule-mining, codebook-distance oracle, not AriGraph).
 - **OD-B:** centroid drift rule — Hebbian on `Plasticity`, or NARS-revision on `CausalEdge64`? (probe-gate, measure first.)
 - **OD-C:** operator sign-off required for any new tenant (anti-invention guardrail) — phase-2 should need NONE (field is evaluated, not stored).