feat(bench): add quality and throughput benchmark workspace#194

Open
jan-kubica wants to merge 2 commits into main from feat/bench
Conversation

@jan-kubica

Summary

Adds packages/bench, a private workspace that benchmarks the deterministic pipeline (NER off) over the contract fixture corpus, as groundwork for publishing reproducible performance and quality numbers and for comparing other anonymization tools on the same corpus.

Stacked on #193: the bench imports the built dist like a production consumer, which is how the non-Western corpus bundling regression fixed there was found; the first commit here is that fix.

What's included

  • Scorer (src/scorer.ts): span-level, per-label precision/recall/F1 with one-to-one matching in two modes (exact bounds; label + largest overlap), plus unit tests.
  • Quality runner (src/run-quality.ts): scores predictions against the reviewed .snapshot.json reference annotations, per label, per language, micro-averaged. --predictions file.json scores an external tool's output through the same scorer; the interchange format and label-mapping expectations are documented in the README.
  • Throughput runner (src/run-throughput.ts): one-time costs (dictionary load, search preparation) separated from steady-state latency; warmup + measured passes with per-document medians and corpus chars/s.
  • Renderer producing results/RESULTS.md from the JSON reports; results from a developer machine are committed alongside the methodology README.
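
As an illustration of the scorer bullet above, here is a minimal sketch of span-level, per-label scoring with one-to-one matching in both modes. The `Span` shape, the function names, and the greedy largest-overlap-first pairing are assumptions made for this example; the actual src/scorer.ts may be structured differently.

```typescript
// Hypothetical span shape; the real types in src/scorer.ts may differ.
interface Span {
  start: number; // inclusive character offset
  end: number; // exclusive character offset
  label: string;
}

type MatchMode = "exact" | "overlap";

interface Score {
  tp: number;
  fp: number;
  fn: number;
  precision: number;
  recall: number;
  f1: number;
}

// Number of characters two spans share (0 when they are disjoint).
function overlap(a: Span, b: Span): number {
  return Math.max(0, Math.min(a.end, b.end) - Math.max(a.start, b.start));
}

// Scores spans that all carry the same label.
function scoreLabel(reference: Span[], predicted: Span[], mode: MatchMode): Score {
  const pairs: { r: number; p: number; weight: number }[] = [];
  reference.forEach((ref, r) => {
    predicted.forEach((pred, p) => {
      const shared = overlap(ref, pred);
      const matches =
        mode === "exact"
          ? ref.start === pred.start && ref.end === pred.end
          : shared > 0;
      if (matches) pairs.push({ r, p, weight: shared });
    });
  });

  // Assumed strategy: greedy, largest overlap first, each span used at most once.
  pairs.sort((a, b) => b.weight - a.weight);
  const usedRef = new Set<number>();
  const usedPred = new Set<number>();
  for (const { r, p } of pairs) {
    if (usedRef.has(r) || usedPred.has(p)) continue;
    usedRef.add(r);
    usedPred.add(p);
  }

  const tp = usedRef.size;
  const fp = predicted.length - tp;
  const fn = reference.length - tp;
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  const f1 =
    precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { tp, fp, fn, precision, recall, f1 };
}

// Groups spans by label so a prediction can only match a reference of the same label.
function scoreByLabel(
  reference: Span[],
  predicted: Span[],
  mode: MatchMode,
): Map<string, Score> {
  const labels = new Set([...reference, ...predicted].map((span) => span.label));
  const result = new Map<string, Score>();
  for (const label of labels) {
    result.set(
      label,
      scoreLabel(
        reference.filter((span) => span.label === label),
        predicted.filter((span) => span.label === label),
        mode,
      ),
    );
  }
  return result;
}
```

Matching within a label keeps a correctly located but mislabeled span from counting as a hit, and the one-to-one constraint keeps a single long prediction from being credited against several reference spans.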

Methodology note

The reference annotations derive from reviewed pipeline output, so the pipeline's own score against them is ~100% by construction; the README states this explicitly and frames the number as a regression/drift signal. The meaningful outputs are throughput and, next, cross-tool comparisons via the predictions interchange format.

Verification

bun run lint, bun run typecheck (6/6 tasks), bun run format:check, and bun test in packages/bench pass. check:version and packlist tooling are unaffected (explicit package lists; bench is private).

The per-locale name corpus files were loaded with a template-literal
dynamic import, which the bundler cannot resolve statically. The
import survived into dist as a runtime-relative path that does not
exist in the published package, so name detection was silently
disabled for consumers of the built output (the regression suite
imports from src and never hit the path).

Replace the template literal with a map of literal import
specifiers keyed by locale so each corpus file becomes a build
chunk, and pin one chunk in check-packlist so the regression cannot
ship again.
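
The fix described in this commit message can be sketched roughly as follows. The locale codes, file paths, corpus contents, and function names are hypothetical and chosen only for illustration; the loaders are inline stubs so the sketch runs without the real corpus files.

```typescript
// Hypothetical locale codes and corpus contents, for illustration only.
type Locale = "ja" | "ko" | "zh";
type CorpusModule = { default: readonly string[] };

// Before: import(`./corpus/${locale}.js`)
// The specifier is computed at runtime, so the bundler cannot know which
// files to emit and the path is left to be resolved relative to dist.
//
// After: one literal specifier per locale. In a real package each loader
// would read `() => import("./corpus/ja.js")` (path assumed here); inline
// stubs stand in so this sketch is self-contained.
const CORPUS_LOADERS: Record<Locale, () => Promise<CorpusModule>> = {
  ja: () => Promise.resolve({ default: ["Sato", "Suzuki"] }),
  ko: () => Promise.resolve({ default: ["Kim", "Lee"] }),
  zh: () => Promise.resolve({ default: ["Wang", "Li"] }),
};

// Narrows an arbitrary string to a locale that has a bundled corpus.
function isLocale(value: string): value is Locale {
  return Object.prototype.hasOwnProperty.call(CORPUS_LOADERS, value);
}

async function loadNameCorpus(locale: string): Promise<readonly string[]> {
  if (!isLocale(locale)) {
    // Fail loudly rather than silently disabling name detection.
    throw new Error(`No name corpus bundled for locale "${locale}"`);
  }
  const loaded = await CORPUS_LOADERS[locale]();
  return loaded.default;
}
```

Because every specifier is a string literal, a bundler can see each corpus file at build time, and an unsupported locale surfaces as an explicit error instead of a missing-file failure at runtime.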

New private packages/bench workspace measuring the deterministic
pipeline (NER off) over the contract fixture corpus:

- span-level scorer (per-label precision/recall/F1, exact and
  overlap matching, one-to-one within label) with unit tests
- quality runner scoring the pipeline against the reviewed
  .snapshot.json reference annotations; accepts external tool
  predictions via a documented JSON interchange format so other
  anonymizers can be scored by the same scorer on the same corpus
- throughput runner (warmup + measured passes, per-document medians,
  corpus chars/s, one-time dictionary and prepare costs)
- methodology README covering what the reference annotations can and
  cannot support, plus rendered results

The bench imports the built dist like a production consumer, which
is how it caught the non-Western corpus bundling regression fixed in
the previous commit.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a new benchmarking package, @stll/anonymize-bench, to measure the quality and throughput of the anonymization pipeline. It also refactors non-Western name imports in the main package to use literal import specifiers, ensuring correct bundler resolution. The reviewer identified three key areas for improvement: filtering readdirSync in fixtures.ts to avoid crashing on non-directory entries, deduplicating city names in dictionaries.ts using a Set for better memory efficiency, and adding a guard in run-throughput.ts to prevent a potential division-by-zero error when calculating characters per second.

Comment on lines +29 to +30
for (const language of readdirSync(CONTRACTS_DIR).toSorted()) {
  const languageDir = join(CONTRACTS_DIR, language);

high

Using readdirSync directly on CONTRACTS_DIR will return all directory entries, including hidden files (such as .DS_Store) or any non-directory files. Attempting to call readdirSync on these files in the next line will throw an ENOTDIR error and crash the benchmark. Filtering for directories only using withFileTypes: true prevents this issue.

  const languages = readdirSync(CONTRACTS_DIR, { withFileTypes: true })
    .filter((dirent) => dirent.isDirectory())
    .map((dirent) => dirent.name)
    .toSorted();
  for (const language of languages) {
    const languageDir = join(CONTRACTS_DIR, language);

Comment on lines +119 to +125
const mergedCities: string[] = [];
for (const { country, entries } of cityResults) {
  citiesByCountry[country] = entries;
  for (const entry of entries) {
    mergedCities.push(entry);
  }
}

medium

The mergedCities array is populated by pushing every city from every country, which can result in duplicate city names (e.g., "London" in both GB and US/CA). Deduplicating the list using a Set avoids redundant entries in the search automaton and improves memory efficiency.

Suggested change
const mergedCitiesSet = new Set<string>();
for (const { country, entries } of cityResults) {
  citiesByCountry[country] = entries;
  for (const entry of entries) {
    mergedCitiesSet.add(entry);
  }
}
const mergedCities = Array.from(mergedCitiesSet);

medianMs: roundMs(medianMs),
minMs: roundMs(Math.min(...samples)),
maxMs: roundMs(Math.max(...samples)),
charsPerSecond: Math.round(doc.text.length / (medianMs / 1_000)),

medium

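If a document is processed faster than the timer can resolve, for example under the coarse timer resolution of some virtualized environments, medianMs can be 0. JavaScript does not throw on division by zero; charsPerSecond would instead evaluate to Infinity (or NaN for an empty document), and JSON serialization writes both as null in the report. Adding a guard prevents this.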

    charsPerSecond: medianMs === 0 ? 0 : Math.round(doc.text.length / (medianMs / 1_000)),
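
For context, a measurement loop with warmup passes, a per-document median, and the guarded rate could look roughly like the sketch below. The function names, the default pass counts, and the injectable `now` clock are assumptions for this example, not the code in src/run-throughput.ts; a higher-resolution clock such as performance.now() would normally be passed in place of the Date.now default.

```typescript
// Middle value of the samples (mean of the two middle values for even counts).
function median(samples: number[]): number {
  if (samples.length === 0) throw new RangeError("median of an empty sample set");
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const upper = sorted[mid] as number;
  if (sorted.length % 2 === 1) return upper;
  return ((sorted[mid - 1] as number) + upper) / 2;
}

// Characters per second, reported as 0 when the median is below timer resolution.
function charsPerSecond(chars: number, medianMs: number): number {
  return medianMs > 0 ? Math.round(chars / (medianMs / 1_000)) : 0;
}

interface Timing {
  medianMs: number;
  minMs: number;
  maxMs: number;
  charsPerSecond: number;
}

// Runs `run` a few times untimed, then times `passes` further runs.
function measure(
  run: () => void,
  chars: number,
  warmup = 2,
  passes = 5,
  now: () => number = () => Date.now(), // assumed default; swap in a finer clock
): Timing {
  for (let i = 0; i < warmup; i++) run(); // warmup results are discarded

  const samples: number[] = [];
  for (let i = 0; i < passes; i++) {
    const start = now();
    run();
    samples.push(now() - start);
  }

  const medianMs = median(samples);
  return {
    medianMs,
    minMs: Math.min(...samples),
    maxMs: Math.max(...samples),
    charsPerSecond: charsPerSecond(chars, medianMs),
  };
}
```

Reporting 0 for an unmeasurable document follows the suggestion above; omitting the field or flagging the sample would be an equally defensible choice, as long as the report never contains a non-finite number.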

@jan-kubica jan-kubica marked this pull request as ready for review June 12, 2026 12:03