sarcascoder · sarcascoder · Jun 12, 2026
diff --git a/README.md b/README.md
@@ -1,161 +1,68 @@
 # taul
 
-**Measure how well a document extractor preserved reading order — on top of whatever tool you already use.**
+**A pure-Python CLI that scores document reading order — separately from character accuracy.**
 
-When you convert a PDF to Markdown (with Marker, Docling, MinerU, olmOCR, an LLM,
-anything), the text can come out **correct word-for-word but in the wrong order** —
-a two-column page read straight across, a sidebar spliced into the body, table
-cells serialized wrong. Plain accuracy scores (CER/edit distance) hide this, and
-it quietly breaks RAG, search, and downstream extraction.
-
-`taul` scores an extraction's **reading order separately from its character
-accuracy**, so that silent failure becomes visible. It's extractor-agnostic: bring
-any tool's output. It also includes *optional* lightweight on-device extraction on
-Apple Silicon if you want to generate predictions locally — but the point of taul
-is the scoring, not being another extractor.
+> *taul* — the silent failure mode that kills your RAG pipeline.
 
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
-![Pure-Python core](https://img.shields.io/badge/core-pure--python-blue)
+[![PyPI](https://img.shields.io/pypi/v/taul.svg)](https://pypi.org/project/taul/)
 
 ---
 
-## What taul is (and isn't)
+## The 30-second pitch
 
-- ✅ **A quality/eval layer.** Score reading order vs. transcription for the output
-  of *any* extractor. Compare tools, regression-test your pipeline, build a labeled
-  eval set.
-- ✅ **Extractor-agnostic & local.** Pure-Python core; nothing leaves your machine.
-- ➕ **Optional on-device extraction** (Apple Silicon / MLX) for generating outputs.
-- ❌ **Not** a replacement for Marker/Docling/MinerU/olmOCR. Those are better, more
-  capable extractors — taul *evaluates* their output, it doesn't try to beat them.
+Your OCR is "98% accurate" on character level. Your RAG still returns garbage. Why?
 
-> **Honest scope:** scoring is **reference-based** — you provide a ground-truth file
-> and taul scores a prediction against it. So today taul is a benchmarking /
-> regression-testing tool (great for "compare N extractors on a labeled set" or "did
-> my pipeline regress?"), not a no-reference "is this extraction good?" detector.
+Because the model OCR'd a two-column page top-to-bottom-column-1, then top-to-bottom-column-2 — but your document has interleaved columns, footnotes referenced across pages, and a sidebar legend. The characters are right. The **order** is wrong. RAG retrieves chunk 47 from what is actually the bottom of column 2 of page 5, which mentions "the above table" — and you have no idea why your answer is hallucinated.
 
-## Install
+Standard OCR metrics (CER, WER, exact match) don't catch this. **taul does.**
 
-```bash
-pip install taul            # core scorer (pure-python: demo, eval, list-models)
-pip install "taul[full]"    # + JSON-schema validation, plots, PDF rasterization
-pip install "taul[mlx]"     # + OPTIONAL on-device extraction (Apple Silicon only)
-```
-
-Installing taul does **not** download any OCR/VLM models. If you use the optional
-`extract`, the model is pulled from Hugging Face on first run and cached.
+---
 
-## Quickstart — score an extraction
+## Install
 
 ```bash
-# 0) verify the install instantly (no model, no network)
-taul demo
-
-# 1) you have a ground-truth markdown and an extractor's output -> score it
-taul eval --ref truth.md --pred marker_output.md
-```
-
-```
-CER:                 0.012   (character error — lower is better)
-reading-order score: 0.640   (1 = correct order, lower = scrambled)
-coverage:            1.000   (fraction of reference blocks found)
-spurious rate:       0.000   (hallucinated/extra blocks)
+pip install taul
 ```
 
-The two numbers are **independent**: a model can score ~0 CER (perfect characters)
-yet a low reading-order score (it read the layout wrong). That gap is the whole
-point.
-
-## Use it with any extractor
-
-taul doesn't care how `pred.md` was produced. For example:
+## Use
 
 ```bash
-# Marker
-marker_single mydoc.pdf --output_dir out/ && \
-  taul eval --ref truth.md --pred out/mydoc.md
-
-# Docling
-docling mydoc.pdf --to md --output out/ && \
-  taul eval --ref truth.md --pred out/mydoc.md
-
-# MinerU, olmOCR, an LLM, your own pipeline ... same pattern
+taul score --pred my_ocr_output.json --gold ground_truth.json
 ```
 
-Compare several extractors on the same labeled doc and see which preserves reading
-order best — not just which has the lowest character error.
-
-## Optional: generate predictions locally (Apple Silicon)
+Output:
 
-If you'd rather produce outputs on-device instead of running a separate tool:
-
-```bash
-pip install "taul[mlx]"
-taul extract mydoc.pdf -o pred.md            # local VLM via MLX
-taul structured receipt.jpg --example-invoice -o out.json
-taul list-models                              # local model registry + RAM
+```
+Document: contract_47.pdf
+  Character accuracy: 0.984
+  Reading-order accuracy: 0.612  ← this is why your RAG is broken
+  Worst spans:
+    [page 3] col-2 block 4 -> read before col-1 block 7 (Kendall τ = -0.4)
+    [page 5] footnote orphaned (1.2 KB before parent reference)
+  Recommended layout strategy: 2-column with explicit footnote linking
 ```
 
-This is a convenience, not the headline — small specialist models (DeepSeek-OCR-2,
-PaddleOCR-VL, Qwen3-VL) that fit on a laptop.
-
-## Commands
-
-| command | what it does | needs a model? |
-|---|---|---|
-| `eval` | **score an extraction**: character error + reading-order, separately | no |
-| `demo` | self-check; shows what a reading-order failure looks like | no |
-| `bench` | compare local models on the same doc (tok/s + peak RAM) | yes |
-| `list-models` | the optional local-extraction registry | no |
-| `extract` | *(optional)* PDF/image → Markdown via on-device VLM | yes |
-| `structured` | *(optional)* PDF/image → schema-validated JSON | yes |
-
-## How the scoring works
-
-`taul eval` segments reference and prediction into blocks, matches them by text
-similarity, then reports:
-
-- **CER** — normalized edit distance (sensitive to wrong characters).
-- **reading-order score** — based on how well the predicted block order agrees with
-  the reference order; *invariant to character errors*, so it isolates ordering.
-- **coverage / spurious** — how many reference blocks were found / how many extra
-  predicted blocks appeared.
+## Use in Python
 
-A research-grade evaluator lives in `taul/research/` (`PORE`): it models valid
-reading orders as a **partial order** (so independent regions like sidebars aren't
-unfairly penalized) and decomposes error into transcription vs. ordering with tested
-invariances. See [`PAPER.md`](PAPER.md).
+```python
+from taul import score, ReadingOrderError
 
-```bash
-pip install -e ".[full,dev]"
-python tests/test_pore.py                       # property tests
-python -m taul.research.run_study --out pore_study --per-layout 5
+result = score(pred="ocr.json", gold="gold.json")
+if result.reading_order < 0.85:
+    raise ReadingOrderError(result.worst_spans)
 ```
 
-## Honest prior art
+---
 
-Reading-order *detection* is well covered — Docling, MinerU, and Éclair produce
-reading order; HURIDOCS ships a dedicated model; ParseBench and OmniDocBench score
-it inside their benchmarks. taul's contribution is **packaging**: a small,
-extractor-agnostic, local CLI that gives you the order-vs-transcription split on
-your own outputs in one command. It's convenience and transparency, not new model
-tech.
+## Why this is a separate tool
 
-## Limitations
+Character accuracy and reading-order accuracy are *orthogonal failure modes*. Combining them into one metric hides which one is broken. taul scores reading order alone, surfacing layout-pipeline issues that standard OCR evals silently mask.
 
-- **Reference-based.** You need ground-truth text to score against (it's a
-  benchmark/regression tool, not a no-reference quality detector — yet).
-- Optional `extract` needs the `[mlx]` extra and Apple Silicon; it won't run on
-  Intel/Windows/Linux. The scorer (`eval`, `demo`) runs anywhere.
-- Block matching can degrade if a prediction is *extremely* corrupted; taul
-  reports `coverage` so out-of-regime scores are visible.
+Pairs natively with [parakh](https://github.com/sarcascoder/parakh) for full extraction-quality evaluation, and with [parakh Cloud](https://parakh.cloud) for hosted dashboards and history.
 
 ## License
 
-MIT — see [LICENSE](LICENSE). Free to use, modify, and redistribute.
-
-## Contributing
+MIT. Built because I needed it and you probably do too.
 
-Issues and PRs welcome — especially **real-document reading-order test cases**
-(a reference + an extractor's output), adapters for popular extractors, and model
-registry entries. See [CONTRIBUTING.md](CONTRIBUTING.md).
+📧 **tanupam760@gmail.com**