Measure how well a document extractor preserved reading order — on top of whatever tool you already use.
When you convert a PDF to Markdown (with Marker, Docling, MinerU, olmOCR, an LLM, anything), the text can come out correct word-for-word but in the wrong order — a two-column page read straight across, a sidebar spliced into the body, table cells serialized wrong. Plain accuracy scores (CER/edit distance) hide this, and it quietly breaks RAG, search, and downstream extraction.
taul scores an extraction's reading order separately from its character
accuracy, so that silent failure becomes visible. It's extractor-agnostic: bring
any tool's output. It also includes optional lightweight on-device extraction on
Apple Silicon if you want to generate predictions locally — but the point of taul
is the scoring, not being another extractor.
- ✅ A quality/eval layer. Score reading order vs. transcription for the output of any extractor. Compare tools, regression-test your pipeline, build a labeled eval set.
- ✅ Extractor-agnostic & local. Pure-Python core; nothing leaves your machine.
- ➕ Optional on-device extraction (Apple Silicon / MLX) for generating outputs.
- ❌ Not a replacement for Marker/Docling/MinerU/olmOCR. Those are better, more capable extractors — taul evaluates their output, it doesn't try to beat them.
Honest scope: scoring is reference-based — you provide a ground-truth file and taul scores a prediction against it. So today taul is a benchmarking / regression-testing tool (great for "compare N extractors on a labeled set" or "did my pipeline regress?"), not a no-reference "is this extraction good?" detector.
pip install taul # core scorer (pure-python: demo, eval, list-models)
pip install "taul[full]" # + JSON-schema validation, plots, PDF rasterization
pip install "taul[mlx]" # + OPTIONAL on-device extraction (Apple Silicon only)

Installing taul does not download any OCR/VLM models. If you use the optional
extract command, the model is pulled from Hugging Face on first run and cached.
# 0) verify the install instantly (no model, no network)
taul demo
# 1) you have a ground-truth markdown and an extractor's output -> score it
taul eval --ref truth.md --pred marker_output.md

CER: 0.012 (character error — lower is better)
reading-order score: 0.640 (1 = correct order, lower = scrambled)
coverage: 1.000 (fraction of reference blocks found)
spurious rate: 0.000 (hallucinated/extra blocks)
CER and the reading-order score are independent: an extractor can reach ~0 CER (perfect characters) and still get a low reading-order score (it read the layout wrong). That gap is the whole point.
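For regression testing it can help to turn that report into numbers. The sketch below parses text shaped like the sample output above and flags a low reading-order score. It assumes the report keeps the `label: value` layout shown in the example; `parse_report`, `order_regressed`, and the `0.9` threshold are hypothetical names and values, not part of taul.

```python
import re

# Abridged copy of the sample report above (explanatory parentheticals dropped).
# Assumption: real `taul eval` output keeps this "label: value" layout.
SAMPLE_REPORT = """\
CER: 0.012
reading-order score: 0.640
coverage: 1.000
spurious rate: 0.000
"""

# A label made of letters, spaces, or hyphens, then a colon, then a number.
METRIC_LINE = re.compile(
    r"^\s*([A-Za-z][A-Za-z \-]*?):\s*([0-9]*\.?[0-9]+)", re.MULTILINE
)


def parse_report(text: str) -> dict[str, float]:
    """Map each lower-cased metric label to its numeric value."""
    return {
        label.strip().lower(): float(value)
        for label, value in METRIC_LINE.findall(text)
    }


def order_regressed(metrics: dict[str, float], min_order: float = 0.9) -> bool:
    """True when the reading-order score falls below the chosen threshold."""
    return metrics["reading-order score"] < min_order


metrics = parse_report(SAMPLE_REPORT)
print(metrics)
print("reading order regressed:", order_regressed(metrics))
```

In the sample, CER alone looks healthy (0.012) while the reading-order check fails, which is the gap described above.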
taul doesn't care how pred.md was produced. For example:
# Marker
marker_single mydoc.pdf --output_dir out/ && \
taul eval --ref truth.md --pred out/mydoc.md
# Docling
docling mydoc.pdf --to md --output out/ && \
taul eval --ref truth.md --pred out/mydoc.md
# MinerU, olmOCR, an LLM, your own pipeline ... same pattern

Compare several extractors on the same labeled doc and see which preserves reading order best, not just which has the lowest character error.
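One way to run that comparison is a small driver script, sketched here. It uses only the `taul eval --ref ... --pred ...` invocation shown above; the helper names and the output paths (`out_marker/mydoc.md`, `out_docling/mydoc.md`) are hypothetical placeholders for wherever your extractors wrote their Markdown.

```python
import shutil
import subprocess


def build_eval_command(ref: str, pred: str) -> list[str]:
    """Build the `taul eval` invocation shown in the examples above."""
    return ["taul", "eval", "--ref", ref, "--pred", pred]


def compare(ref: str, predictions: dict[str, str]) -> dict[str, str]:
    """Run `taul eval` once per extractor and collect the raw report text."""
    reports = {}
    for name, pred in predictions.items():
        result = subprocess.run(
            build_eval_command(ref, pred), capture_output=True, text=True
        )
        # Keep stderr when the command fails so the problem stays visible.
        reports[name] = result.stdout if result.returncode == 0 else result.stderr
    return reports


if __name__ == "__main__":
    # Hypothetical paths: one Markdown output per extractor for the same PDF.
    predictions = {
        "marker": "out_marker/mydoc.md",
        "docling": "out_docling/mydoc.md",
    }
    if shutil.which("taul") is None:
        print("taul is not on PATH; install it first with: pip install taul")
    else:
        for name, report in compare("truth.md", predictions).items():
            print(f"== {name} ==")
            print(report)
```

Reading the reports side by side makes it easy to spot an extractor with a low CER but a poor reading-order score.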
If you'd rather produce outputs on-device instead of running a separate tool:
pip install "taul[mlx]"
taul extract mydoc.pdf -o pred.md # local VLM via MLX
taul structured receipt.jpg --example-invoice -o out.json
taul list-models # local model registry + RAM

This is a convenience, not the headline: small specialist models (DeepSeek-OCR-2, PaddleOCR-VL, Qwen3-VL) that fit on a laptop.
| command | what it does | needs a model? |
|---|---|---|
| eval | score an extraction: character error + reading-order, separately | no |
| demo | self-check; shows what a reading-order failure looks like | no |
| bench | compare local models on the same doc (tok/s + peak RAM) | yes |
| list-models | the optional local-extraction registry | no |
| extract | (optional) PDF/image → Markdown via on-device VLM | yes |
| structured | (optional) PDF/image → schema-validated JSON | yes |
taul eval segments reference and prediction into blocks, matches them by text
similarity, then reports:
- CER — normalized edit distance (sensitive to wrong characters).
- reading-order score — based on how well the predicted block order agrees with the reference order; invariant to character errors, so it isolates ordering.
- coverage / spurious — how many reference blocks were found / how many extra predicted blocks appeared.
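The pipeline above can be illustrated with a toy version. This is a minimal sketch, not taul's implementation: it assumes blocks are separated by blank lines, matches them greedily with `difflib.SequenceMatcher`, and scores order as the fraction of matched block pairs that keep their relative order. The `0.6` similarity threshold and all function names are illustrative choices, and CER is left out for brevity.

```python
from difflib import SequenceMatcher
from itertools import combinations


def split_blocks(markdown: str) -> list[str]:
    """Assumption: blocks are separated by blank lines."""
    return [b.strip() for b in markdown.split("\n\n") if b.strip()]


def match_blocks(ref_blocks, pred_blocks, threshold=0.6):
    """Greedy one-to-one matching; returns {ref index: pred index}."""
    matches, used = {}, set()
    for i, ref in enumerate(ref_blocks):
        best_j, best_sim = None, threshold
        for j, pred in enumerate(pred_blocks):
            if j in used:
                continue
            sim = SequenceMatcher(None, ref, pred).ratio()
            if sim >= best_sim:
                best_j, best_sim = j, sim
        if best_j is not None:
            matches[i] = best_j
            used.add(best_j)
    return matches


def order_score(matches) -> float:
    """Fraction of matched block pairs that keep their relative order."""
    pairs = list(combinations(sorted(matches), 2))
    if not pairs:
        return 1.0
    agree = sum(1 for a, b in pairs if matches[a] < matches[b])
    return agree / len(pairs)


def score(ref_md: str, pred_md: str) -> dict[str, float]:
    ref_blocks, pred_blocks = split_blocks(ref_md), split_blocks(pred_md)
    matches = match_blocks(ref_blocks, pred_blocks)
    return {
        "order": order_score(matches),
        "coverage": len(matches) / len(ref_blocks) if ref_blocks else 1.0,
        "spurious": (len(pred_blocks) - len(matches)) / len(pred_blocks)
        if pred_blocks else 0.0,
    }


reference = "\n\n".join([
    "Introduction to the survey.",
    "Left column: methods and data.",
    "Right column: results and charts.",
    "Conclusion and future work.",
])
# The two columns are swapped and one block has a typo ("dala").
prediction = "\n\n".join([
    "Introduction to the survey.",
    "Right column: results and charts.",
    "Left column: methods and dala.",
    "Conclusion and future work.",
])
result = score(reference, prediction)
print(result)
```

The typo does not change the order score because blocks are matched by similarity first; only the swapped pair counts against it (5 of 6 pairs agree).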
A research-grade evaluator lives in taul/research/ (PORE): it models valid
reading orders as a partial order (so independent regions like sidebars aren't
unfairly penalized) and decomposes error into transcription vs. ordering with tested
invariances. See PAPER.md.
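To make the partial-order idea concrete, here is a minimal sketch. It is not the PORE algorithm (see PAPER.md for that): it assumes the valid orders are given as a set of "must come before" pairs and scores a prediction by the fraction of those pairs it respects. The block names and `partial_order_score` are hypothetical.

```python
def partial_order_score(predicted: list[str], must_precede: set[tuple[str, str]]) -> float:
    """Fraction of precedence constraints satisfied by the predicted order.

    Constraints that mention a block missing from the prediction are skipped.
    """
    position = {block: i for i, block in enumerate(predicted)}
    checked = [(a, b) for a, b in must_precede if a in position and b in position]
    if not checked:
        return 1.0
    satisfied = sum(1 for a, b in checked if position[a] < position[b])
    return satisfied / len(checked)


# The sidebar only has to follow the title; it is unconstrained relative to the body.
constraints = {
    ("title", "body1"),
    ("body1", "body2"),
    ("title", "sidebar"),
}

sidebar_last = ["title", "body1", "body2", "sidebar"]
sidebar_first = ["title", "sidebar", "body1", "body2"]
body_swapped = ["title", "body2", "body1", "sidebar"]

print(partial_order_score(sidebar_last, constraints))
print(partial_order_score(sidebar_first, constraints))
print(partial_order_score(body_swapped, constraints))
```

Both sidebar placements satisfy every constraint, so neither is penalized; only the swapped body paragraphs lower the score. A single linear reference order would have penalized one of the sidebar placements.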
pip install -e ".[full,dev]"
python tests/test_pore.py # property tests
python -m taul.research.run_study --out pore_study --per-layout 5

Reading-order detection is well covered: Docling, MinerU, and Éclair produce reading order; HURIDOCS ships a dedicated model; ParseBench and OmniDocBench score it inside their benchmarks. taul's contribution is packaging: a small, extractor-agnostic, local CLI that gives you the order-vs-transcription split on your own outputs in one command. It's convenience and transparency, not new model tech.
- Reference-based. You need ground-truth text to score against (it's a benchmark/regression tool, not a no-reference quality detector — yet).
- Optional extract needs the [mlx] extra and Apple Silicon; it won't run on Intel/Windows/Linux. The scorer (eval, demo) runs anywhere.
- Block matching can degrade if a prediction is extremely corrupted; taul reports coverage so out-of-regime scores are visible.
MIT — see LICENSE. Free to use, modify, and redistribute.
Issues and PRs welcome — especially real-document reading-order test cases (a reference + an extractor's output), adapters for popular extractors, and model registry entries. See CONTRIBUTING.md.