feat(bench): add Presidio and compromise comparison runs #195
Scores two external tools on the bench corpus through the same scorer and reference annotations:

- Microsoft Presidio (presidio-analyzer with its documented spaCy defaults) via comparison/presidio/run.py, which converts offsets to UTF-16 code units and writes the bench interchange format; Czech documents are skipped because Presidio has no Czech support
- compromise (closest JS baseline with span output) via src/run-compromise.ts, English documents only

run-quality now reports documents a tool cannot process as skipped instead of failing, and the renderer surfaces the skip note. The bench README documents the comparison methodology and its fairness caveats (reference-annotation bias, Presidio's ORG default, the DATE_TIME label asymmetry); the root README links the results.
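The offset conversion mentioned for comparison/presidio/run.py can be sketched as follows. This is a minimal illustration, not the code in the PR: it assumes the analyzer reports spans as Python string indices (code points) and that the reference annotations count UTF-16 code units, as JavaScript strings do. The helper names `utf16_offsets` and `to_utf16_span` are hypothetical.

```python
def utf16_offsets(text: str) -> list[int]:
    """Map every code point index (0..len(text)) to its UTF-16 code unit offset."""
    offsets = [0]
    for ch in text:
        # Code points above U+FFFF are encoded as a surrogate pair (two code units).
        offsets.append(offsets[-1] + (2 if ord(ch) > 0xFFFF else 1))
    return offsets


def to_utf16_span(text: str, start: int, end: int) -> tuple[int, int]:
    """Convert a [start, end) span in code points to UTF-16 code units."""
    table = utf16_offsets(text)
    return table[start], table[end]
```

For text that stays inside the Basic Multilingual Plane the two offset systems agree, so the conversion only changes spans that follow an astral character such as an emoji. When many spans are converted for one document, building the table once per document and reusing it avoids repeated scans.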
Code Review
This pull request introduces a benchmarking suite (packages/bench) to compare the project's anonymization pipeline against Microsoft Presidio and the compromise NLP library, including scripts to run these tools and updated quality reporting to handle skipped documents. The review feedback is highly constructive, suggesting to explicitly verify directories in the Presidio Python script to avoid processing non-directory files, and recommending a simpler, more standard array indexing approach instead of .at(0) when rendering skipped document languages.
    for language_dir in sorted(FIXTURES_DIR.iterdir()):
        language = language_dir.name
        if language not in LANGUAGE_MODELS:
            print(f"skipping {language}: no Presidio language support")
            continue
To prevent noisy console output or potential errors when encountering non-directory files (such as .DS_Store or README.md) in the fixtures directory, it is safer to explicitly verify that each item is a directory before processing.
Suggested change:

    -for language_dir in sorted(FIXTURES_DIR.iterdir()):
    -    language = language_dir.name
    -    if language not in LANGUAGE_MODELS:
    -        print(f"skipping {language}: no Presidio language support")
    -        continue
    +for language_dir in sorted(FIXTURES_DIR.iterdir()):
    +    if not language_dir.is_dir():
    +        continue
    +    language = language_dir.name
    +    if language not in LANGUAGE_MODELS:
    +        print(f"skipping {language}: no Presidio language support")
    +        continue
    const languages = [
      ...new Set(skipped.map((id) => id.split("/").at(0) ?? id)),
    ].toSorted();
Using id.split("/")[0] is simpler, more standard, and avoids the need for ES2022 .at(0) and the redundant ?? id fallback (since split always returns an array with at least one element).
Suggested change:

    -const languages = [
    -  ...new Set(skipped.map((id) => id.split("/").at(0) ?? id)),
    -].toSorted();
    +const languages = [
    +  ...new Set(skipped.map((id) => id.split("/")[0])),
    +].toSorted();
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cdccfe86b9
    const docs = allDocs.filter((doc) => predictionsById.has(doc.id));
    const skippedDocIds = allDocs
      .filter((doc) => !predictionsById.has(doc.id))
      .map((doc) => doc.id);
Require explicit skips before dropping corpus docs
When an external predictions file omits a document because of an adapter typo, crash, or path mismatch, this now silently removes that document from the scoring denominator and reports it as skipped. That makes recall and per-language coverage look better than they are for any accidental omission; the Presidio/compromise language gaps should be declared explicitly, or gated by an allow-skip option/language filter, so missing supported documents still fail fast.
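One way to read this request is sketched below. It is written in Python for brevity even though run-quality.ts is TypeScript, and it is not the project's implementation: the function name `partition_docs`, the `allowed_skip_languages` parameter, and the error message are hypothetical. The only detail taken from the PR is that a document id starts with its language segment, as the renderer's `id.split("/")` suggests.

```python
def partition_docs(doc_ids, predicted_ids, allowed_skip_languages):
    """Split corpus documents into scored and skipped ones.

    A document without predictions is skipped only when its language was
    declared as unsupported; any other omission raises instead of shrinking
    the scoring denominator.
    """
    scored, skipped, missing = [], [], []
    for doc_id in doc_ids:
        language = doc_id.split("/")[0]  # assumed id layout: "<language>/<name>"
        if doc_id in predicted_ids:
            scored.append(doc_id)
        elif language in allowed_skip_languages:
            skipped.append(doc_id)
        else:
            missing.append(doc_id)
    if missing:
        raise ValueError(f"predictions missing for supported documents: {missing}")
    return scored, skipped
```

With this shape, a Presidio run would declare Czech as skippable, while an English document dropped by an adapter crash or a path mismatch would still fail the run.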
    # Models (installed via `python -m spacy download <name>`):
    # en_core_web_lg
    # de_core_news_lg
Pin the spaCy model package versions
The Presidio comparison is meant to be reproducible, but these model installs are left as bare names, so python -m spacy download en_core_web_lg / de_core_news_lg will install the best compatible model available at rerun time rather than the exact model used for the committed numbers. If spaCy publishes a new compatible model, the same pinned requirements.txt can produce different entities and benchmark results; pin the model wheel versions or direct download names alongside the Python deps.
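Alongside pinning the wheels, a run script could refuse to produce numbers when the installed models differ from the pinned ones. The sketch below assumes the spaCy models are installed as regular Python distributions under their model names, so `importlib.metadata.version` can read their versions. The function name `check_model_versions` and the version strings in the usage comment are placeholders, not the versions behind the committed results.

```python
from importlib import metadata


def check_model_versions(expected, installed_version=metadata.version):
    """Return one human-readable line per model whose version is not the pinned one."""
    mismatches = []
    for name, wanted in expected.items():
        try:
            found = installed_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches.append(
                f"{name}: expected {wanted}, found {found or 'not installed'}"
            )
    return mismatches


# Usage (placeholder versions, to be replaced with the real pins):
#   problems = check_model_versions(
#       {"en_core_web_lg": "X.Y.Z", "de_core_news_lg": "X.Y.Z"}
#   )
#   if problems:
#       raise SystemExit("\n".join(problems))
```

The `installed_version` parameter exists only so the lookup can be replaced in tests; by default it queries the active environment.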
Summary
Stacked on #194. Adds the first external-tool comparison runs to the bench workspace: Microsoft Presidio and compromise, executed over the same contract corpus and scored span-by-span with the same scorer and reference annotations as the anonymize pipeline.
What's included
- comparison/presidio/run.py + pinned requirements.txt: runs presidio-analyzer with its documented spaCy defaults (en_core_web_lg, de_core_news_lg), converts offsets to UTF-16 code units to match the reference annotations, and writes the bench interchange format. Czech fixtures are skipped (Presidio has no Czech support) and reported as such rather than scored as zero.
- src/run-compromise.ts: the closest JS-ecosystem baseline that reports spans; English documents, person + organization labels.
- run-quality.ts now treats documents missing from a predictions file as skipped (recorded in the report and rendered), instead of refusing to score; an empty intersection still fails.

Headline observations (committed in results/RESULTS.md)
Verification
bun run lint, bun run typecheck (6/6), bun run format:check, and bun test in packages/bench pass. Comparison outputs contain only offsets and labels (no extracted text); the corpus fixtures are the intentionally public ones already in the repository.