feat(bench): add Presidio and compromise comparison runs by jan-kubica · Pull Request #195 · stella/anonymize

jan-kubica · 2026-06-12T11:46:58Z

Summary

Stacked on #194. Adds the first external-tool comparison runs to the bench workspace: Microsoft Presidio and compromise, executed over the same contract corpus and scored span-by-span with the same scorer and reference annotations as the anonymize pipeline.

What's included

comparison/presidio/run.py + pinned requirements.txt: runs presidio-analyzer with its documented spaCy defaults (en_core_web_lg, de_core_news_lg), converts offsets to UTF-16 code units to match the reference annotations, and writes the bench interchange format. Czech fixtures are skipped (Presidio has no Czech support) and reported as such rather than scored as zero.
src/run-compromise.ts: the closest JS-ecosystem baseline that reports spans; English documents, person + organization labels.
run-quality.ts now treats documents missing from a predictions file as skipped (recorded in the report and rendered), instead of refusing to score; an empty intersection still fails.
Methodology notes in the bench README covering the fairness caveats: reference-annotation bias, Presidio's deliberate ORG-by-default exclusion (enabled here because organizations are unavoidable in contracts), and the DATE_TIME/date label asymmetry.
Root README section linking the benchmark results.

Headline observations (committed in results/RESULTS.md)

Presidio cannot process 8 of 13 corpus documents (no Czech support); on en/de it reaches 0.81 precision / 0.92 recall (overlap) for persons, but spaCy ORG produces 183 organization false positives and DATE_TIME over-fires on durations.
compromise (en only, person + organization) lands at 0.65 overlap F1.
The anonymize self-score remains a by-construction regression signal, stated as such wherever the numbers are rendered.

Verification

bun run lint, bun run typecheck (6/6), bun run format:check, and bun test in packages/bench pass. Comparison outputs contain only offsets and labels (no extracted text); the corpus fixtures are the intentionally public ones already in the repository.

Scores two external tools on the bench corpus through the same scorer and reference annotations: - Microsoft Presidio (presidio-analyzer with its documented spaCy defaults) via comparison/presidio/run.py, which converts offsets to UTF-16 code units and writes the bench interchange format; Czech documents are skipped because Presidio has no Czech support - compromise (closest JS baseline with span output) via src/run-compromise.ts, English documents only run-quality now reports documents a tool cannot process as skipped instead of failing, and the renderer surfaces the skip note. The bench README documents the comparison methodology and its fairness caveats (reference-annotation bias, Presidio's ORG default, the DATE_TIME label asymmetry); the root README links the results.

gemini-code-assist

Code Review

This pull request introduces a benchmarking suite (packages/bench) to compare the project's anonymization pipeline against Microsoft Presidio and the compromise NLP library, including scripts to run these tools and updated quality reporting to handle skipped documents. The review feedback is highly constructive, suggesting to explicitly verify directories in the Presidio Python script to avoid processing non-directory files, and recommending a simpler, more standard array indexing approach instead of .at(0) when rendering skipped document languages.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-12T11:48:21Z

+    for language_dir in sorted(FIXTURES_DIR.iterdir()):
+        language = language_dir.name
+        if language not in LANGUAGE_MODELS:
+            print(f"skipping {language}: no Presidio language support")
+            continue


To prevent noisy console output or potential errors when encountering non-directory files (such as .DS_Store or README.md) in the fixtures directory, it is safer to explicitly verify that each item is a directory before processing.

Suggested change

for language_dir in sorted(FIXTURES_DIR.iterdir()):

language = language_dir.name

if language not in LANGUAGE_MODELS:

print(f"skipping {language}: no Presidio language support")

continue

for language_dir in sorted(FIXTURES_DIR.iterdir()):

if not language_dir.is_dir():

continue

language = language_dir.name

if language not in LANGUAGE_MODELS:

print(f"skipping {language}: no Presidio language support")

continue

gemini-code-assist · 2026-06-12T11:48:22Z

+    const languages = [
+      ...new Set(skipped.map((id) => id.split("/").at(0) ?? id)),
+    ].toSorted();


Using id.split("/")[0] is simpler, more standard, and avoids the need for ES2022 .at(0) and the redundant ?? id fallback (since split always returns an array with at least one element).

Suggested change

const languages = [

...new Set(skipped.map((id) => id.split("/").at(0) ?? id)),

].toSorted();

const languages = [

...new Set(skipped.map((id) => id.split("/")[0])),

].toSorted();

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cdccfe86b9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-12T12:11:39Z

+const docs = allDocs.filter((doc) => predictionsById.has(doc.id));
+const skippedDocIds = allDocs
+  .filter((doc) => !predictionsById.has(doc.id))
+  .map((doc) => doc.id);


Require explicit skips before dropping corpus docs

When an external predictions file omits a document because of an adapter typo, crash, or path mismatch, this now silently removes that document from the scoring denominator and reports it as skipped. That makes recall and per-language coverage look better than they are for any accidental omission; the Presidio/compromise language gaps should be declared explicitly, or gated by an allow-skip option/language filter, so missing supported documents still fail fast.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-06-12T12:11:39Z

+# Models (installed via `python -m spacy download <name>`):
+#   en_core_web_lg
+#   de_core_news_lg


Pin the spaCy model package versions

The Presidio comparison is meant to be reproducible, but these model installs are left as bare names, so python -m spacy download en_core_web_lg / de_core_news_lg will install the best compatible model available at rerun time rather than the exact model used for the committed numbers. If spaCy publishes a new compatible model, the same pinned requirements.txt can produce different entities and benchmark results; pin the model wheel versions or direct download names alongside the Python deps.

Useful? React with 👍 / 👎.

gemini-code-assist Bot reviewed Jun 12, 2026

View reviewed changes

jan-kubica marked this pull request as ready for review June 12, 2026 12:04

chatgpt-codex-connector Bot reviewed Jun 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(bench): add Presidio and compromise comparison runs#195

feat(bench): add Presidio and compromise comparison runs#195
jan-kubica wants to merge 1 commit into
feat/benchfrom
feat/bench-comparison

jan-kubica commented Jun 12, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 12, 2026

Uh oh!

chatgpt-codex-connector Bot Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

jan-kubica commented Jun 12, 2026

Summary

What's included

Headline observations (committed in results/RESULTS.md)

Verification

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant