feat(bench): add quality and throughput benchmark workspace#194
jan-kubica wants to merge 2 commits
The per-locale name corpus files were loaded with a template-literal dynamic import, which the bundler cannot resolve statically. The import survived into dist as a runtime-relative path that does not exist in the published package, so name detection was silently disabled for consumers of the built output (the regression suite imports from src and never hit the path). Replace the template literal with a map of literal import specifiers keyed by locale so each corpus file becomes a build chunk, and pin one chunk in check-packlist so the regression cannot ship again.
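As an illustration of the fix described above, here is a minimal sketch of a loader map keyed by locale. The locale keys, the corpus shape, the `./corpora/...` paths, and the names `CORPUS_LOADERS`, `isLocale`, and `loadNameCorpus` are all hypothetical and not taken from the package. Inline stubs stand in for the real imports so the sketch runs without the corpus files.

```typescript
// Hypothetical corpus shape and locale keys; the real package may differ.
type NameCorpus = { readonly names: readonly string[] };
type Locale = "ja" | "ko" | "zh";

// Problematic form (not statically resolvable by a bundler):
//   const corpus = await import(`./corpora/${locale}.json`);
//
// Fixed form: one literal specifier per locale, e.g.
//   ja: () => import("./corpora/ja.json"),
// so each file can become its own build chunk.
// The stubs below replace those imports to keep the sketch self-contained.
const CORPUS_LOADERS: Record<Locale, () => Promise<NameCorpus>> = {
  ja: async () => ({ names: ["Sato", "Suzuki"] }),
  ko: async () => ({ names: ["Kim", "Lee"] }),
  zh: async () => ({ names: ["Wang", "Li"] }),
};

// Own-property check so inherited keys such as "toString" are rejected.
function isLocale(value: string): value is Locale {
  return Object.prototype.hasOwnProperty.call(CORPUS_LOADERS, value);
}

// Loads the corpus for a locale, failing loudly instead of silently
// disabling name detection when the locale is unknown.
async function loadNameCorpus(locale: string): Promise<NameCorpus> {
  if (!isLocale(locale)) {
    throw new Error(`No name corpus registered for locale "${locale}"`);
  }
  return CORPUS_LOADERS[locale]();
}
```

Rejecting unknown locales explicitly is a design choice of this sketch; it makes a missing corpus visible at the call site rather than at a later detection step.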
New private packages/bench workspace measuring the deterministic pipeline (NER off) over the contract fixture corpus:
- span-level scorer (per-label precision/recall/F1, exact and overlap matching, one-to-one within label) with unit tests
- quality runner scoring the pipeline against the reviewed .snapshot.json reference annotations; accepts external tool predictions via a documented JSON interchange format so other anonymizers can be scored by the same scorer on the same corpus
- throughput runner (warmup + measured passes, per-document medians, corpus chars/s, one-time dictionary and prepare costs)
- methodology README covering what the reference annotations can and cannot support, plus rendered results

The bench imports the built dist like a production consumer, which is how it caught the non-Western corpus bundling regression fixed in the previous commit.
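The span-level scoring described above can be sketched roughly as follows. This is not the code in src/scorer.ts: the `Span` shape (end-exclusive offsets), the function names, and the greedy largest-overlap-first assignment are assumptions chosen for illustration, and the actual scorer may resolve one-to-one matches differently.

```typescript
// Assumed span shape: character offsets with an exclusive end.
type Span = { start: number; end: number; label: string };
type Mode = "exact" | "overlap";
type Score = { tp: number; fp: number; fn: number; precision: number; recall: number; f1: number };

function overlap(a: Span, b: Span): number {
  return Math.max(0, Math.min(a.end, b.end) - Math.max(a.start, b.start));
}

// Counts one-to-one matches between same-label spans.
// Candidate pairs are taken greedily, largest weight first.
function countMatches(refs: Span[], preds: Span[], mode: Mode): number {
  const candidates: { r: number; p: number; weight: number }[] = [];
  refs.forEach((ref, r) => {
    preds.forEach((pred, p) => {
      if (ref.label !== pred.label) return;
      const exact = ref.start === pred.start && ref.end === pred.end;
      const weight = mode === "exact" ? (exact ? 1 : 0) : overlap(ref, pred);
      if (weight > 0) candidates.push({ r, p, weight });
    });
  });
  candidates.sort((a, b) => b.weight - a.weight);
  const usedRefs = new Set<number>();
  const usedPreds = new Set<number>();
  let matched = 0;
  for (const { r, p } of candidates) {
    if (usedRefs.has(r) || usedPreds.has(p)) continue;
    usedRefs.add(r);
    usedPreds.add(p);
    matched += 1;
  }
  return matched;
}

// Returns 0 instead of NaN when the denominator is empty.
function ratio(n: number, d: number): number {
  return d === 0 ? 0 : n / d;
}

function scoreByLabel(refs: Span[], preds: Span[], mode: Mode): Map<string, Score> {
  const labels = new Set([...refs, ...preds].map((s) => s.label));
  const scores = new Map<string, Score>();
  labels.forEach((label) => {
    const r = refs.filter((s) => s.label === label);
    const p = preds.filter((s) => s.label === label);
    const tp = countMatches(r, p, mode);
    const fp = p.length - tp;
    const fn = r.length - tp;
    const precision = ratio(tp, tp + fp);
    const recall = ratio(tp, tp + fn);
    const f1 = ratio(2 * precision * recall, precision + recall);
    scores.set(label, { tp, fp, fn, precision, recall, f1 });
  });
  return scores;
}

// Made-up example data: one exact hit and two boundary mismatches.
const exampleRefs: Span[] = [
  { start: 0, end: 10, label: "PERSON" },
  { start: 20, end: 30, label: "PERSON" },
  { start: 40, end: 50, label: "DATE" },
];
const examplePreds: Span[] = [
  { start: 0, end: 10, label: "PERSON" },
  { start: 22, end: 35, label: "PERSON" },
  { start: 41, end: 50, label: "DATE" },
];
const exactScores = scoreByLabel(exampleRefs, examplePreds, "exact");
const overlapScores = scoreByLabel(exampleRefs, examplePreds, "overlap");
```

In the example data, exact mode counts only the first PERSON span as a true positive, while overlap mode also credits the two predictions whose boundaries are slightly off. Comparing the two modes in this way separates boundary errors from outright misses.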
Code Review
This pull request introduces a new benchmarking package, @stll/anonymize-bench, to measure the quality and throughput of the anonymization pipeline. It also refactors non-Western name imports in the main package to use literal import specifiers, ensuring correct bundler resolution. The reviewer identified three key areas for improvement: filtering readdirSync in fixtures.ts to avoid crashing on non-directory entries, deduplicating city names in dictionaries.ts using a Set for better memory efficiency, and adding a guard in run-throughput.ts to prevent a potential division-by-zero error when calculating characters per second.
for (const language of readdirSync(CONTRACTS_DIR).toSorted()) {
  const languageDir = join(CONTRACTS_DIR, language);
Using readdirSync directly on CONTRACTS_DIR will return all directory entries, including hidden files (such as .DS_Store) or any non-directory files. Attempting to call readdirSync on these files in the next line will throw an ENOTDIR error and crash the benchmark. Filtering for directories only using withFileTypes: true prevents this issue.
const languages = readdirSync(CONTRACTS_DIR, { withFileTypes: true })
  .filter((dirent) => dirent.isDirectory())
  .map((dirent) => dirent.name)
  .toSorted();
for (const language of languages) {
  const languageDir = join(CONTRACTS_DIR, language);

const mergedCities: string[] = [];
for (const { country, entries } of cityResults) {
  citiesByCountry[country] = entries;
  for (const entry of entries) {
    mergedCities.push(entry);
  }
}
The mergedCities array is populated by pushing every city from every country, which can result in duplicate city names (e.g., "London" in both GB and US/CA). Deduplicating the list using a Set avoids redundant entries in the search automaton and improves memory efficiency.
- const mergedCities: string[] = [];
- for (const { country, entries } of cityResults) {
-   citiesByCountry[country] = entries;
-   for (const entry of entries) {
-     mergedCities.push(entry);
-   }
- }
+ const mergedCitiesSet = new Set<string>();
+ for (const { country, entries } of cityResults) {
+   citiesByCountry[country] = entries;
+   for (const entry of entries) {
+     mergedCitiesSet.add(entry);
+   }
+ }
+ const mergedCities = Array.from(mergedCitiesSet);
medianMs: roundMs(medianMs),
minMs: roundMs(Math.min(...samples)),
maxMs: roundMs(Math.max(...samples)),
charsPerSecond: Math.round(doc.text.length / (medianMs / 1_000)),
medianMs can be 0 when a document is processed extremely quickly or when timer resolution is coarse, for example in virtualized environments. JavaScript does not throw on division by zero; the expression evaluates to Infinity (or NaN for an empty document), so charsPerSecond would be reported as a non-finite value. Adding a guard prevents this.
charsPerSecond: medianMs === 0 ? 0 : Math.round(doc.text.length / (medianMs / 1_000)),
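For context, a warmup-plus-measured-passes loop with this guard could look like the sketch below. It is not the code in src/run-throughput.ts: the `median`, `charsPerSecond`, and `measureDocument` names, the options object, the synchronous `run` callback, and the injectable clock are assumptions made so the zero-elapsed case can be exercised deterministically.

```typescript
// Median of a non-empty list of samples (milliseconds).
function median(samples: readonly number[]): number {
  if (samples.length === 0) throw new Error("median of empty sample set");
  const sorted = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid]! : (sorted[mid - 1]! + sorted[mid]!) / 2;
}

// Guarded rate: a zero median yields 0 rather than Infinity or NaN,
// mirroring the suggestion above.
function charsPerSecond(chars: number, medianMs: number): number {
  return medianMs > 0 ? Math.round(chars / (medianMs / 1_000)) : 0;
}

type MeasureOptions = {
  warmup: number; // untimed passes, e.g. to let the JIT settle
  passes: number; // timed passes
  now?: () => number; // clock in milliseconds; assumed injectable for testing
};

// Assumes a synchronous pipeline call; an async pipeline would need awaits.
function measureDocument(run: (text: string) => void, text: string, opts: MeasureOptions) {
  const now = opts.now ?? (() => performance.now());
  for (let i = 0; i < opts.warmup; i += 1) run(text);
  const samples: number[] = [];
  for (let i = 0; i < opts.passes; i += 1) {
    const start = now();
    run(text);
    samples.push(now() - start);
  }
  const medianMs = median(samples);
  return {
    medianMs,
    minMs: Math.min(...samples),
    maxMs: Math.max(...samples),
    charsPerSecond: charsPerSecond(text.length, medianMs),
  };
}
```

Reporting 0 for an unmeasurable document follows the review suggestion; a report format that distinguishes "too fast to measure" from "zero throughput" would be an alternative, but that is outside what the suggestion proposes.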
Summary
Adds packages/bench, a private workspace that benchmarks the deterministic pipeline (NER off) over the contract fixture corpus, as groundwork for publishing reproducible performance and quality numbers and for comparing other anonymization tools on the same corpus.

Stacked on #193: the bench imports the built dist like a production consumer, which is how the non-Western corpus bundling regression fixed there was found; the first commit here is that fix.

What's included
- Scorer (src/scorer.ts): span-level, per-label precision/recall/F1 with one-to-one matching in two modes (exact bounds; label + largest overlap), plus unit tests.
- Quality runner (src/run-quality.ts): scores predictions against the reviewed .snapshot.json reference annotations, per label, per language, micro-averaged. --predictions file.json scores an external tool's output through the same scorer; the interchange format and label-mapping expectations are documented in the README.
- Throughput runner (src/run-throughput.ts): one-time costs (dictionary load, search preparation) separated from steady-state latency; warmup + measured passes with per-document medians and corpus chars/s.
- results/RESULTS.md, rendered from the JSON reports; results from a developer machine are committed alongside the methodology README.

Methodology note
The reference annotations derive from reviewed pipeline output, so the pipeline's own score against them is ~100% by construction; the README states this explicitly and frames the number as a regression/drift signal. The meaningful outputs are throughput and, next, cross-tool comparisons via the predictions interchange format.
Verification
bun run lint, bun run typecheck (6/6 tasks), bun run format:check, and bun test in packages/bench pass. check:version and packlist tooling are unaffected (explicit package lists; bench is private).