Skip to content

feat(bench): add Presidio and compromise comparison runs#195

Open
jan-kubica wants to merge 1 commit into
feat/benchfrom
feat/bench-comparison
Open

feat(bench): add Presidio and compromise comparison runs#195
jan-kubica wants to merge 1 commit into
feat/benchfrom
feat/bench-comparison

Conversation

@jan-kubica

Copy link
Copy Markdown
Contributor

Summary

Stacked on #194. Adds the first external-tool comparison runs to the bench workspace: Microsoft Presidio and compromise, executed over the same contract corpus and scored span-by-span with the same scorer and reference annotations as the anonymize pipeline.

What's included

  • comparison/presidio/run.py + pinned requirements.txt: runs presidio-analyzer with its documented spaCy defaults (en_core_web_lg, de_core_news_lg), converts offsets to UTF-16 code units to match the reference annotations, and writes the bench interchange format. Czech fixtures are skipped (Presidio has no Czech support) and reported as such rather than scored as zero.
  • src/run-compromise.ts: the closest JS-ecosystem baseline that reports spans; English documents, person + organization labels.
  • run-quality.ts now treats documents missing from a predictions file as skipped (recorded in the report and rendered), instead of refusing to score; an empty intersection still fails.
  • Methodology notes in the bench README covering the fairness caveats: reference-annotation bias, Presidio's deliberate ORG-by-default exclusion (enabled here because organizations are unavoidable in contracts), and the DATE_TIME/date label asymmetry.
  • Root README section linking the benchmark results.

Headline observations (committed in results/RESULTS.md)

  • Presidio cannot process 8 of 13 corpus documents (no Czech support); on en/de it reaches 0.81 precision / 0.92 recall (overlap) for persons, but spaCy ORG produces 183 organization false positives and DATE_TIME over-fires on durations.
  • compromise (en only, person + organization) lands at 0.65 overlap F1.
  • The anonymize self-score remains a by-construction regression signal, stated as such wherever the numbers are rendered.

Verification

bun run lint, bun run typecheck (6/6), bun run format:check, and bun test in packages/bench pass. Comparison outputs contain only offsets and labels (no extracted text); the corpus fixtures are the intentionally public ones already in the repository.

Scores two external tools on the bench corpus through the same
scorer and reference annotations:

- Microsoft Presidio (presidio-analyzer with its documented spaCy
  defaults) via comparison/presidio/run.py, which converts offsets
  to UTF-16 code units and writes the bench interchange format;
  Czech documents are skipped because Presidio has no Czech support
- compromise (closest JS baseline with span output) via
  src/run-compromise.ts, English documents only

run-quality now reports documents a tool cannot process as skipped
instead of failing, and the renderer surfaces the skip note. The
bench README documents the comparison methodology and its fairness
caveats (reference-annotation bias, Presidio's ORG default, the
DATE_TIME label asymmetry); the root README links the results.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a benchmarking suite (packages/bench) to compare the project's anonymization pipeline against Microsoft Presidio and the compromise NLP library, including scripts to run these tools and updated quality reporting to handle skipped documents. The review feedback is highly constructive, suggesting to explicitly verify directories in the Presidio Python script to avoid processing non-directory files, and recommending a simpler, more standard array indexing approach instead of .at(0) when rendering skipped document languages.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +81 to +85
for language_dir in sorted(FIXTURES_DIR.iterdir()):
language = language_dir.name
if language not in LANGUAGE_MODELS:
print(f"skipping {language}: no Presidio language support")
continue

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To prevent noisy console output or potential errors when encountering non-directory files (such as .DS_Store or README.md) in the fixtures directory, it is safer to explicitly verify that each item is a directory before processing.

Suggested change
for language_dir in sorted(FIXTURES_DIR.iterdir()):
language = language_dir.name
if language not in LANGUAGE_MODELS:
print(f"skipping {language}: no Presidio language support")
continue
for language_dir in sorted(FIXTURES_DIR.iterdir()):
if not language_dir.is_dir():
continue
language = language_dir.name
if language not in LANGUAGE_MODELS:
print(f"skipping {language}: no Presidio language support")
continue

Comment on lines +80 to +82
const languages = [
...new Set(skipped.map((id) => id.split("/").at(0) ?? id)),
].toSorted();

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using id.split("/")[0] is simpler, more standard, and avoids the need for ES2022 .at(0) and the redundant ?? id fallback (since split always returns an array with at least one element).

Suggested change
const languages = [
...new Set(skipped.map((id) => id.split("/").at(0) ?? id)),
].toSorted();
const languages = [
...new Set(skipped.map((id) => id.split("/")[0])),
].toSorted();

@jan-kubica jan-kubica marked this pull request as ready for review June 12, 2026 12:04

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cdccfe86b9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +68 to +71
const docs = allDocs.filter((doc) => predictionsById.has(doc.id));
const skippedDocIds = allDocs
.filter((doc) => !predictionsById.has(doc.id))
.map((doc) => doc.id);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Require explicit skips before dropping corpus docs

When an external predictions file omits a document because of an adapter typo, crash, or path mismatch, this now silently removes that document from the scoring denominator and reports it as skipped. That makes recall and per-language coverage look better than they are for any accidental omission; the Presidio/compromise language gaps should be declared explicitly, or gated by an allow-skip option/language filter, so missing supported documents still fail fast.

Useful? React with 👍 / 👎.

Comment on lines +3 to +5
# Models (installed via `python -m spacy download <name>`):
# en_core_web_lg
# de_core_news_lg

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Pin the spaCy model package versions

The Presidio comparison is meant to be reproducible, but these model installs are left as bare names, so python -m spacy download en_core_web_lg / de_core_news_lg will install the best compatible model available at rerun time rather than the exact model used for the committed numbers. If spaCy publishes a new compatible model, the same pinned requirements.txt can produce different entities and benchmark results; pin the model wheel versions or direct download names alongside the Python deps.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant