Automate benchmark discovery sweep via eval JSON #278
…json. Export script refreshes metrics from agent DB; page imports JSON for discovery table and rule-card agent gold. Drift check validates JSON instead of parsing page.tsx.
Pull request overview
This PR moves the benchmark “discovery sweep” data source from a hard-coded TS array in the benchmark page to a committed eval/benchmark-discovery-sweep.json, and adds scripts to export/refresh that JSON from the local agent corpus DB plus a drift check to keep it in sync.
Changes:
- Add eval/benchmark-discovery-sweep.json and update the benchmark page to import it via a new @eval/* TS path alias.
- Add scripts/export-benchmark-discovery-sweep.py + scripts/benchmark_discovery_lib.py to refresh JSON metrics from the corpus DB (and update the refresh PS script to run it).
- Refactor the drift check to compare JSON vs DB and update docs/workflow path triggers accordingly.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| site/tsconfig.json | Adds @eval/* path alias for importing eval JSON into the site. |
| site/app/benchmark/page.tsx | Imports discovery sweep + rule-card gold metrics from JSON instead of inlining. |
| scripts/refresh-agent-corpus.ps1 | Runs the new export script before the drift check in the local refresh pipeline. |
| scripts/export-benchmark-discovery-sweep.py | New exporter to refresh JSON metrics from the agent corpus DB while preserving notes/names. |
| scripts/corpus-benchmark-discovery-drift.py | Refactors drift check to compare JSON vs DB using shared library helpers. |
| scripts/benchmark_discovery_lib.py | New shared helpers for DB metric extraction + sweep JSON parsing/mapping. |
| eval/benchmark-discovery-sweep.json | New committed JSON containing discovery sweep rows and rule-card agent-gold map. |
| docs/rules.md | Updates refresh instructions to include exporting the sweep JSON before drift checking. |
| .github/workflows/benchmark-discovery-drift.yml | Expands workflow path triggers to include new/updated files. |
| card_gold = doc.get("ruleCardAgentGold", {}) | ||
| for rule_id in list(card_gold.keys()): | ||
| if rule_id in gold: | ||
| card_gold[rule_id] = gold[rule_id] | ||
|
|
| actual_trigger = triggers.get(rule_id) | ||
| if actual_trigger and exp["triggerPct"] != actual_trigger: | ||
| if actual_trigger and exp.get("triggerPct") != actual_trigger: | ||
| errors.append( | ||
| f"{rule_id} triggerPct page={exp['triggerPct']} db={actual_trigger}" | ||
| f"{rule_id} triggerPct json={exp.get('triggerPct')} db={actual_trigger}" | ||
| ) |
| if exp.get("goldPrecision"): | ||
| actual_gold = gold.get(rule_id) | ||
| if actual_gold and exp["goldPrecision"] != actual_gold: | ||
| errors.append( | ||
| f"{rule_id} goldPrecision page={exp['goldPrecision']} db={actual_gold}" | ||
| f"{rule_id} goldPrecision json={exp['goldPrecision']} db={actual_gold}" | ||
| ) |
| for rule_id, exp_gold in card_gold.items(): | ||
| actual_gold = gold.get(rule_id) | ||
| if actual_gold and exp_gold != actual_gold: | ||
| errors.append( | ||
| f"{rule_id} ruleCardAgentGold json={exp_gold} db={actual_gold}" | ||
| ) |
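The drift-check fragments above interleave the old and new sides of the diff. Read as one function, the new JSON-vs-DB comparison might look like the following minimal sketch. The function name `find_drift` and the `ruleId` row key are assumptions made for illustration; the field names `triggerPct`, `goldPrecision`, and `ruleCardAgentGold` come from the snippets.

```python
from typing import Dict, List, Mapping, Sequence


def find_drift(
    rows: Sequence[Mapping[str, str]],
    card_gold: Mapping[str, str],
    triggers: Mapping[str, str],
    gold: Mapping[str, str],
) -> List[str]:
    """Return one message per metric where the JSON disagrees with the DB."""
    errors: List[str] = []
    for exp in rows:
        rule_id = exp["ruleId"]  # assumed key name, not confirmed by the PR
        actual_trigger = triggers.get(rule_id)
        # Only compare when the DB has a value for this rule.
        if actual_trigger and exp.get("triggerPct") != actual_trigger:
            errors.append(
                f"{rule_id} triggerPct json={exp.get('triggerPct')} db={actual_trigger}"
            )
        if exp.get("goldPrecision"):
            actual_gold = gold.get(rule_id)
            if actual_gold and exp["goldPrecision"] != actual_gold:
                errors.append(
                    f"{rule_id} goldPrecision json={exp['goldPrecision']} db={actual_gold}"
                )
    # Rule-card values live in a separate map and are checked the same way.
    for rule_id, exp_gold in card_gold.items():
        actual_gold = gold.get(rule_id)
        if actual_gold and exp_gold != actual_gold:
            errors.append(f"{rule_id} ruleCardAgentGold json={exp_gold} db={actual_gold}")
    return errors
```

Because every comparison is guarded by the DB value being present, a rule that is missing from the DB produces no error in this sketch.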
| const discoverySweepJune2026 = | ||
| discoverySweepDoc.discoveryRows as DiscoverySweepRow[]; | ||
|
|
||
| const agentGoldByRuleId = discoverySweepDoc.ruleCardAgentGold as Record< | ||
| string, | ||
| string | ||
| >; |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4f949d2586
| import { Header } from "@/components/header"; | ||
| import { Footer } from "@/components/footer"; | ||
| import { Breadcrumbs } from "@/components/breadcrumbs"; | ||
| import discoverySweepDoc from "@eval/benchmark-discovery-sweep.json"; |
Add the eval data file to the Pages deploy trigger
Since this page now imports eval/benchmark-discovery-sweep.json into the static Next bundle, a refresh commit that only updates that JSON file will not redeploy the benchmark page. I checked .github/workflows/nextjs.yml and its push path filter only includes site/** and the workflow file, so gauntletci.com/benchmark would keep serving the old bundled metrics until an unrelated site change or manual dispatch occurs. Include the eval JSON in the Pages workflow trigger or keep the data under site/.
| gold_rows = cur.execute( | ||
| """ | ||
| WITH latest AS ( | ||
| SELECT fixture_id, MAX(id) AS run_id |
Select latest runs by completion time
When the corpus DB contains multiple runs for the same fixture, MAX(id) picks the lexicographically greatest run id rather than the newest completed run; run ids are GUID/string values, while the rest of the corpus readers use completed_at_utc DESC, id DESC with status = 'COMPLETED'. This can make the new export/drift path compute agent-gold precision from an older run and publish stale percentages after a refresh. Use the same latest-run CTE as corpus_db_read.compute_labeled_rule_metrics.
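To illustrate the difference, here is a minimal sketch of a latest-run query ordered by completion time and restricted to completed runs. The table name `runs` and the helper `latest_completed_runs` are assumptions for illustration; only the columns `fixture_id`, `id`, `status`, and `completed_at_utc` are taken from the comment above, and the actual CTE in corpus_db_read.compute_labeled_rule_metrics may be written differently.

```python
import sqlite3
from typing import Dict

# For each fixture, keep the run that sorts first by
# completed_at_utc DESC, id DESC among COMPLETED runs.
# Table name "runs" is an assumption for this sketch.
LATEST_COMPLETED_RUNS_SQL = """
SELECT r.fixture_id, r.id AS run_id
FROM runs AS r
WHERE r.id = (
    SELECT r2.id
    FROM runs AS r2
    WHERE r2.fixture_id = r.fixture_id
      AND r2.status = 'COMPLETED'
    ORDER BY r2.completed_at_utc DESC, r2.id DESC
    LIMIT 1
)
"""


def latest_completed_runs(conn: sqlite3.Connection) -> Dict[str, str]:
    """Map each fixture_id to the id of its newest completed run."""
    return dict(conn.execute(LATEST_COMPLETED_RUNS_SQL).fetchall())
```

With string run ids, a fixture whose newest completed run has id "aaa" and whose older run has id "zzz" resolves to "aaa" here, whereas MAX(id) would return "zzz".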
| for rule_id in list(card_gold.keys()): | ||
| if rule_id in gold: | ||
| card_gold[rule_id] = gold[rule_id] |
Drop stale rule-card gold entries on export
When a rule-card entry no longer has a labeled precision in the DB (for example, no latest-run triggers so the rule is absent from gold), this loop leaves the old JSON value in place. The row-level metrics above remove stale goldPrecision, but site/app/benchmark/page.tsx reads ruleCardAgentGold directly for the cards, so a refresh can continue publishing an obsolete agent-gold percentage. Remove missing entries or fail the export when a configured card has no current gold metric.
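Both options suggested above can be expressed in one small helper. This is a sketch, not the exporter's actual code: the function name `refresh_card_gold` and the `strict` flag are hypothetical, while `card_gold` and `gold` mirror the variable names in the snippet.

```python
from typing import Dict, Mapping


def refresh_card_gold(
    card_gold: Mapping[str, str],
    gold: Mapping[str, str],
    strict: bool = False,
) -> Dict[str, str]:
    """Rebuild the rule-card map from current DB gold metrics.

    Entries with no current metric are dropped, or rejected when
    strict is True, so a stale percentage is never carried over.
    """
    missing = sorted(rule_id for rule_id in card_gold if rule_id not in gold)
    if strict and missing:
        raise ValueError(
            "no current gold metric for rule cards: " + ", ".join(missing)
        )
    # Build a fresh dict instead of mutating in place, keeping card order.
    return {rule_id: gold[rule_id] for rule_id in card_gold if rule_id in gold}
```

Returning a new mapping rather than updating `card_gold` in place makes it harder for an old value to survive the refresh by accident.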
Summary
- eval/benchmark-discovery-sweep.json (discovery rows + rule-card agent gold map)
- export-benchmark-discovery-sweep.py refreshes metrics from agent corpus DB (preserves names/notes)
- @eval/* path alias
- --skip-if-missing-db
Test plan
npm run build