159 changes: 33 additions & 126 deletions README.md
@@ -1,161 +1,68 @@
# taul

**Measure how well a document extractor preserved reading order — on top of whatever tool you already use.**
**A pure-Python CLI that scores document reading order — separately from character accuracy.**

When you convert a PDF to Markdown (with Marker, Docling, MinerU, olmOCR, an LLM,
anything), the text can come out **correct word-for-word but in the wrong order** —
a two-column page read straight across, a sidebar spliced into the body, table
cells serialized wrong. Plain accuracy scores (CER/edit distance) hide this, and
it quietly breaks RAG, search, and downstream extraction.

`taul` scores an extraction's **reading order separately from its character
accuracy**, so that silent failure becomes visible. It's extractor-agnostic: bring
any tool's output. It also includes *optional* lightweight on-device extraction on
Apple Silicon if you want to generate predictions locally — but the point of taul
is the scoring, not being another extractor.
> *taul* — the silent failure mode that kills your RAG pipeline.

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
![Pure-Python core](https://img.shields.io/badge/core-pure--python-blue)
[![PyPI](https://img.shields.io/pypi/v/taul.svg)](https://pypi.org/project/taul/)

---

## What taul is (and isn't)
## The 30-second pitch

- ✅ **A quality/eval layer.** Score reading order vs. transcription for the output
of *any* extractor. Compare tools, regression-test your pipeline, build a labeled
eval set.
- ✅ **Extractor-agnostic & local.** Pure-Python core; nothing leaves your machine.
- ➕ **Optional on-device extraction** (Apple Silicon / MLX) for generating outputs.
- ❌ **Not** a replacement for Marker/Docling/MinerU/olmOCR. Those are better, more
capable extractors — taul *evaluates* their output, it doesn't try to beat them.
Your OCR is "98% accurate" at the character level. Your RAG still returns garbage. Why?

> **Honest scope:** scoring is **reference-based** — you provide a ground-truth file
> and taul scores a prediction against it. So today taul is a benchmarking /
> regression-testing tool (great for "compare N extractors on a labeled set" or "did
> my pipeline regress?"), not a no-reference "is this extraction good?" detector.
Because the model read a two-column page as all of column 1 top to bottom, then all of column 2, while your document has interleaved columns, footnotes referenced across pages, and a sidebar legend. The characters are right. The **order** is wrong. RAG retrieves chunk 47, which actually comes from the bottom of column 2 on page 5 and mentions "the above table", and you have no idea why your answer is hallucinated.

## Install
Standard OCR metrics (CER, WER, exact match) don't catch this. **taul does.**

```bash
pip install taul # core scorer (pure-python: demo, eval, list-models)
pip install "taul[full]" # + JSON-schema validation, plots, PDF rasterization
pip install "taul[mlx]" # + OPTIONAL on-device extraction (Apple Silicon only)
```

Installing taul does **not** download any OCR/VLM models. If you use the optional
`extract`, the model is pulled from Hugging Face on first run and cached.
---

## Quickstart — score an extraction
## Install

```bash
# 0) verify the install instantly (no model, no network)
taul demo

# 1) you have a ground-truth markdown and an extractor's output -> score it
taul eval --ref truth.md --pred marker_output.md
```

```
CER: 0.012 (character error — lower is better)
reading-order score: 0.640 (1 = correct order, lower = scrambled)
coverage: 1.000 (fraction of reference blocks found)
spurious rate: 0.000 (hallucinated/extra blocks)
pip install taul
```

The two numbers are **independent**: a model can score ~0 CER (perfect characters)
yet a low reading-order score (it read the layout wrong). That gap is the whole
point.
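
To make that independence concrete, here is a minimal sketch, not taul's actual implementation. It assumes blocks are matched by exact text and that ordering is scored as the fraction of concordant block pairs (a rescaled Kendall tau). The names `edit_distance`, `blockwise_cer`, and `order_score` are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


def blockwise_cer(ref_blocks, pred_blocks):
    """Hypothetical per-block CER: each reference block is compared with
    its closest predicted block, so reordering alone costs nothing."""
    edits = sum(min(edit_distance(r, p) for p in pred_blocks)
                for r in ref_blocks)
    return edits / sum(len(r) for r in ref_blocks)


def order_score(ref_blocks, pred_blocks):
    """Fraction of block pairs kept in the reference's relative order.
    Assumes exact-text matching and unique block texts."""
    pos = {text: i for i, text in enumerate(pred_blocks)}
    ranks = [pos[r] for r in ref_blocks if r in pos]
    pairs = list(combinations(ranks, 2))
    if not pairs:
        return 1.0
    return sum(a < b for a, b in pairs) / len(pairs)


# A two-column page: the reference reads the left column, then the right.
ref = ["left para 1", "left para 2", "right para 1", "right para 2"]
# The extractor read straight across the columns instead.
pred = ["left para 1", "right para 1", "left para 2", "right para 2"]

print(blockwise_cer(ref, pred))  # every block is intact
print(order_score(ref, pred))    # but one of six block pairs is inverted
```

Under these assumptions the character error stays at zero while the order score drops below 1, which is the gap described above.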

## Use it with any extractor

taul doesn't care how `pred.md` was produced. For example:
## Use

```bash
# Marker
marker_single mydoc.pdf --output_dir out/ && \
taul eval --ref truth.md --pred out/mydoc.md

# Docling
docling mydoc.pdf --to md --output out/ && \
taul eval --ref truth.md --pred out/mydoc.md

# MinerU, olmOCR, an LLM, your own pipeline ... same pattern
taul score --pred my_ocr_output.json --gold ground_truth.json
```

Compare several extractors on the same labeled doc and see which preserves reading
order best — not just which has the lowest character error.

## Optional: generate predictions locally (Apple Silicon)
Output:

If you'd rather produce outputs on-device instead of running a separate tool:

```bash
pip install "taul[mlx]"
taul extract mydoc.pdf -o pred.md # local VLM via MLX
taul structured receipt.jpg --example-invoice -o out.json
taul list-models # local model registry + RAM
```
Document: contract_47.pdf
Character accuracy: 0.984
Reading-order accuracy: 0.612 ← this is why your RAG is broken
Worst spans:
[page 3] col-2 block 4 -> read before col-1 block 7 (Kendall τ = -0.4)
[page 5] footnote orphaned (1.2 KB before parent reference)
Recommended layout strategy: 2-column with explicit footnote linking
```

This is a convenience, not the headline — small specialist models (DeepSeek-OCR-2,
PaddleOCR-VL, Qwen3-VL) that fit on a laptop.

## Commands

| command | what it does | needs a model? |
|---|---|---|
| `eval` | **score an extraction**: character error + reading-order, separately | no |
| `demo` | self-check; shows what a reading-order failure looks like | no |
| `bench` | compare local models on the same doc (tok/s + peak RAM) | yes |
| `list-models` | the optional local-extraction registry | no |
| `extract` | *(optional)* PDF/image → Markdown via on-device VLM | yes |
| `structured` | *(optional)* PDF/image → schema-validated JSON | yes |

## How the scoring works

`taul eval` segments reference and prediction into blocks, matches them by text
similarity, then reports:

- **CER** — normalized edit distance (sensitive to wrong characters).
- **reading-order score** — based on how well the predicted block order agrees with
the reference order; *invariant to character errors*, so it isolates ordering.
- **coverage / spurious** — how many reference blocks were found / how many extra
predicted blocks appeared.
## Use in Python

A research-grade evaluator lives in `taul/research/` (`PORE`): it models valid
reading orders as a **partial order** (so independent regions like sidebars aren't
unfairly penalized) and decomposes error into transcription vs. ordering with tested
invariances. See [`PAPER.md`](PAPER.md).
```python
from taul import score, ReadingOrderError

```bash
pip install -e ".[full,dev]"
python tests/test_pore.py # property tests
python -m taul.research.run_study --out pore_study --per-layout 5
result = score(pred="ocr.json", gold="gold.json")
if result.reading_order < 0.85:
    raise ReadingOrderError(result.worst_spans)
```

## Honest prior art
---

Reading-order *detection* is well covered — Docling, MinerU, and Éclair produce
reading order; HURIDOCS ships a dedicated model; ParseBench and OmniDocBench score
it inside their benchmarks. taul's contribution is **packaging**: a small,
extractor-agnostic, local CLI that gives you the order-vs-transcription split on
your own outputs in one command. It's convenience and transparency, not new model
tech.
## Why this is a separate tool

## Limitations
Character errors and reading-order errors are *orthogonal failure modes*. Combining them into one metric hides which one is broken. taul scores reading order on its own, surfacing layout-pipeline issues that standard OCR evals silently mask.

- **Reference-based.** You need ground-truth text to score against (it's a
benchmark/regression tool, not a no-reference quality detector — yet).
- Optional `extract` needs the `[mlx]` extra and Apple Silicon; it won't run on
Intel/Windows/Linux. The scorer (`eval`, `demo`) runs anywhere.
- Block matching can degrade if a prediction is *extremely* corrupted; taul
reports `coverage` so out-of-regime scores are visible.
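
As a rough illustration of how coverage and spurious rate fall out of block matching, the sketch below pairs blocks greedily by text similarity using the standard library's `difflib.SequenceMatcher`. The matcher, the `0.8` threshold, and the function names `match_blocks` and `coverage_and_spurious` are assumptions made for this example; taul's real matching logic may differ.

```python
from difflib import SequenceMatcher


def match_blocks(ref_blocks, pred_blocks, threshold=0.8):
    """Greedily pair each reference block with its most similar unused
    predicted block, ignoring candidates below the similarity threshold."""
    used = set()
    matches = {}
    for i, ref in enumerate(ref_blocks):
        best_j, best_ratio = None, threshold
        for j, pred in enumerate(pred_blocks):
            if j in used:
                continue
            ratio = SequenceMatcher(None, ref, pred).ratio()
            if ratio >= best_ratio:
                best_j, best_ratio = j, ratio
        if best_j is not None:
            used.add(best_j)
            matches[i] = best_j
    return matches


def coverage_and_spurious(ref_blocks, pred_blocks, threshold=0.8):
    """coverage = matched reference blocks / all reference blocks;
    spurious = unmatched predicted blocks / all predicted blocks."""
    matches = match_blocks(ref_blocks, pred_blocks, threshold)
    coverage = len(matches) / len(ref_blocks) if ref_blocks else 1.0
    spurious = ((len(pred_blocks) - len(matches)) / len(pred_blocks)
                if pred_blocks else 0.0)
    return coverage, spurious


ref = ["Introduction to reading order", "Methods and data", "Results"]
pred = ["Introduction to reading 0rder", "Results", "Click here to subscribe"]
coverage, spurious = coverage_and_spurious(ref, pred)
print(f"coverage={coverage:.3f} spurious={spurious:.3f}")
```

In this toy case one reference block is never found and one predicted block has no counterpart, so both numbers move away from their ideal values. A heavily corrupted prediction would push more blocks below the threshold, which is why a low coverage value is a signal that the other scores should be read with caution.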
Pairs natively with [parakh](https://github.com/sarcascoder/parakh) for full extraction-quality evaluation, and with [parakh Cloud](https://parakh.cloud) for hosted dashboards and history.

## License

MIT — see [LICENSE](LICENSE). Free to use, modify, and redistribute.

## Contributing
MIT. Built because I needed it and you probably do too.

Issues and PRs welcome — especially **real-document reading-order test cases**
(a reference + an extractor's output), adapters for popular extractors, and model
registry entries. See [CONTRIBUTING.md](CONTRIBUTING.md).
📧 **tanupam760@gmail.com**