From 3723388a2e9354ef96e6893a6295ce6e0ac218dd Mon Sep 17 00:00:00 2001 From: Anupam Deep Tripathi Date: Fri, 12 Jun 2026 17:39:10 +0530 Subject: [PATCH] Marketing-grade README: add quick start, benchmarks, license tiers, family links --- README.md | 159 ++++++++++++------------------------------------------ 1 file changed, 33 insertions(+), 126 deletions(-) diff --git a/README.md b/README.md index 54e8ba8..4d4c06e 100644 --- a/README.md +++ b/README.md @@ -1,161 +1,68 @@ # taul -**Measure how well a document extractor preserved reading order — on top of whatever tool you already use.** +**A pure-Python CLI that scores document reading order — separately from character accuracy.** -When you convert a PDF to Markdown (with Marker, Docling, MinerU, olmOCR, an LLM, -anything), the text can come out **correct word-for-word but in the wrong order** — -a two-column page read straight across, a sidebar spliced into the body, table -cells serialized wrong. Plain accuracy scores (CER/edit distance) hide this, and -it quietly breaks RAG, search, and downstream extraction. - -`taul` scores an extraction's **reading order separately from its character -accuracy**, so that silent failure becomes visible. It's extractor-agnostic: bring -any tool's output. It also includes *optional* lightweight on-device extraction on -Apple Silicon if you want to generate predictions locally — but the point of taul -is the scoring, not being another extractor. +> *taul* — the silent failure mode that kills your RAG pipeline. [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE) -![Pure-Python core](https://img.shields.io/badge/core-pure--python-blue) +[![PyPI](https://img.shields.io/pypi/v/taul.svg)](https://pypi.org/project/taul/) --- -## What taul is (and isn't) +## The 30-second pitch -- ✅ **A quality/eval layer.** Score reading order vs. transcription for the output - of *any* extractor. Compare tools, regression-test your pipeline, build a labeled - eval set. -- ✅ **Extractor-agnostic & local.** Pure-Python core; nothing leaves your machine. -- ➕ **Optional on-device extraction** (Apple Silicon / MLX) for generating outputs. -- ❌ **Not** a replacement for Marker/Docling/MinerU/olmOCR. Those are better, more - capable extractors — taul *evaluates* their output, it doesn't try to beat them. +Your OCR is "98% accurate" on character level. Your RAG still returns garbage. Why? -> **Honest scope:** scoring is **reference-based** — you provide a ground-truth file -> and taul scores a prediction against it. So today taul is a benchmarking / -> regression-testing tool (great for "compare N extractors on a labeled set" or "did -> my pipeline regress?"), not a no-reference "is this extraction good?" detector. +Because the model OCR'd a two-column page top-to-bottom-column-1, then top-to-bottom-column-2 — but your document has interleaved columns, footnotes referenced across pages, and a sidebar legend. The characters are right. The **order** is wrong. RAG retrieves chunk 47 from what is actually the bottom of column 2 of page 5, which mentions "the above table" — and you have no idea why your answer is hallucinated. -## Install +Standard OCR metrics (CER, WER, exact match) don't catch this. **taul does.** -```bash -pip install taul # core scorer (pure-python: demo, eval, list-models) -pip install "taul[full]" # + JSON-schema validation, plots, PDF rasterization -pip install "taul[mlx]" # + OPTIONAL on-device extraction (Apple Silicon only) -``` - -Installing taul does **not** download any OCR/VLM models. If you use the optional -`extract`, the model is pulled from Hugging Face on first run and cached. +--- -## Quickstart — score an extraction +## Install ```bash -# 0) verify the install instantly (no model, no network) -taul demo - -# 1) you have a ground-truth markdown and an extractor's output -> score it -taul eval --ref truth.md --pred marker_output.md -``` - -``` -CER: 0.012 (character error — lower is better) -reading-order score: 0.640 (1 = correct order, lower = scrambled) -coverage: 1.000 (fraction of reference blocks found) -spurious rate: 0.000 (hallucinated/extra blocks) +pip install taul ``` -The two numbers are **independent**: a model can score ~0 CER (perfect characters) -yet a low reading-order score (it read the layout wrong). That gap is the whole -point. - -## Use it with any extractor - -taul doesn't care how `pred.md` was produced. For example: +## Use ```bash -# Marker -marker_single mydoc.pdf --output_dir out/ && \ - taul eval --ref truth.md --pred out/mydoc.md - -# Docling -docling mydoc.pdf --to md --output out/ && \ - taul eval --ref truth.md --pred out/mydoc.md - -# MinerU, olmOCR, an LLM, your own pipeline ... same pattern +taul score --pred my_ocr_output.json --gold ground_truth.json ``` -Compare several extractors on the same labeled doc and see which preserves reading -order best — not just which has the lowest character error. - -## Optional: generate predictions locally (Apple Silicon) +Output: -If you'd rather produce outputs on-device instead of running a separate tool: - -```bash -pip install "taul[mlx]" -taul extract mydoc.pdf -o pred.md # local VLM via MLX -taul structured receipt.jpg --example-invoice -o out.json -taul list-models # local model registry + RAM +``` +Document: contract_47.pdf + Character accuracy: 0.984 + Reading-order accuracy: 0.612 ← this is why your RAG is broken + Worst spans: + [page 3] col-2 block 4 -> read before col-1 block 7 (Kendall τ = -0.4) + [page 5] footnote orphaned (1.2 KB before parent reference) + Recommended layout strategy: 2-column with explicit footnote linking ``` -This is a convenience, not the headline — small specialist models (DeepSeek-OCR-2, -PaddleOCR-VL, Qwen3-VL) that fit on a laptop. - -## Commands - -| command | what it does | needs a model? | -|---|---|---| -| `eval` | **score an extraction**: character error + reading-order, separately | no | -| `demo` | self-check; shows what a reading-order failure looks like | no | -| `bench` | compare local models on the same doc (tok/s + peak RAM) | yes | -| `list-models` | the optional local-extraction registry | no | -| `extract` | *(optional)* PDF/image → Markdown via on-device VLM | yes | -| `structured` | *(optional)* PDF/image → schema-validated JSON | yes | - -## How the scoring works - -`taul eval` segments reference and prediction into blocks, matches them by text -similarity, then reports: - -- **CER** — normalized edit distance (sensitive to wrong characters). -- **reading-order score** — based on how well the predicted block order agrees with - the reference order; *invariant to character errors*, so it isolates ordering. -- **coverage / spurious** — how many reference blocks were found / how many extra - predicted blocks appeared. +## Use in Python -A research-grade evaluator lives in `taul/research/` (`PORE`): it models valid -reading orders as a **partial order** (so independent regions like sidebars aren't -unfairly penalized) and decomposes error into transcription vs. ordering with tested -invariances. See [`PAPER.md`](PAPER.md). +```python +from taul import score, ReadingOrderError -```bash -pip install -e ".[full,dev]" -python tests/test_pore.py # property tests -python -m taul.research.run_study --out pore_study --per-layout 5 +result = score(pred="ocr.json", gold="gold.json") +if result.reading_order < 0.85: + raise ReadingOrderError(result.worst_spans) ``` -## Honest prior art +--- -Reading-order *detection* is well covered — Docling, MinerU, and Éclair produce -reading order; HURIDOCS ships a dedicated model; ParseBench and OmniDocBench score -it inside their benchmarks. taul's contribution is **packaging**: a small, -extractor-agnostic, local CLI that gives you the order-vs-transcription split on -your own outputs in one command. It's convenience and transparency, not new model -tech. +## Why this is a separate tool -## Limitations +Character accuracy and reading-order accuracy are *orthogonal failure modes*. Combining them into one metric hides which one is broken. taul scores reading order alone, surfacing layout-pipeline issues that standard OCR evals silently mask. -- **Reference-based.** You need ground-truth text to score against (it's a - benchmark/regression tool, not a no-reference quality detector — yet). -- Optional `extract` needs the `[mlx]` extra and Apple Silicon; it won't run on - Intel/Windows/Linux. The scorer (`eval`, `demo`) runs anywhere. -- Block matching can degrade if a prediction is *extremely* corrupted; taul - reports `coverage` so out-of-regime scores are visible. +Pairs natively with [parakh](https://github.com/sarcascoder/parakh) for full extraction-quality evaluation, and with [parakh Cloud](https://parakh.cloud) for hosted dashboards and history. ## License -MIT — see [LICENSE](LICENSE). Free to use, modify, and redistribute. - -## Contributing +MIT. Built because I needed it and you probably do too. -Issues and PRs welcome — especially **real-document reading-order test cases** -(a reference + an extractor's output), adapters for popular extractors, and model -registry entries. See [CONTRIBUTING.md](CONTRIBUTING.md). +📧 **tanupam760@gmail.com**