Automate benchmark discovery sweep via eval JSON #278

Merged
EricCogen merged 1 commit into main from chore/export-benchmark-discovery-sweep-json on Jun 14, 2026

@EricCogen

Owner

Summary

  • Adds eval/benchmark-discovery-sweep.json (discovery rows + rule-card agent gold map)
  • export-benchmark-discovery-sweep.py refreshes metrics from agent corpus DB (preserves names/notes)
  • Benchmark page imports JSON via @eval/* path alias
  • Drift check compares JSON to DB; CI still skips DB with --skip-if-missing-db
  • Refresh pipeline runs export before drift
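The CI skip behaviour described in the summary could be sketched roughly as follows. This is a minimal illustration, not the script from this PR: the `--db` option, its default path, and the exit code `2` are assumptions; only the `--skip-if-missing-db` flag name comes from the summary.

```python
import argparse
import sys
from pathlib import Path


def main(argv=None):
    parser = argparse.ArgumentParser(description="Discovery sweep drift check (sketch)")
    # --db and its default are hypothetical; --skip-if-missing-db is the flag named above.
    parser.add_argument("--db", type=Path, default=Path("agent-corpus.db"))
    parser.add_argument("--skip-if-missing-db", action="store_true")
    args = parser.parse_args(argv)

    if not args.db.exists():
        if args.skip_if_missing_db:
            # CI runners have no corpus DB, so report the skip and succeed.
            print(f"SKIP: {args.db} not found")
            return 0
        print(f"ERROR: {args.db} not found", file=sys.stderr)
        return 2  # assumed failure code

    # The JSON-vs-DB comparison would run here.
    return 0
```

A caller would typically end the script with `sys.exit(main())`, so CI sees exit status 0 when the DB is absent and the flag is set.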

Test plan

  • Export + drift OK on agent DB
  • CI skip mode OK without DB
  • npm run build
  • Full CI

…json.

Export script refreshes metrics from agent DB; page imports JSON for discovery table and rule-card agent gold. Drift check validates JSON instead of parsing page.tsx.
Copilot AI review requested due to automatic review settings June 14, 2026 01:32

Copilot AI left a comment

Pull request overview

This PR moves the benchmark “discovery sweep” data source from a hard-coded TS array in the benchmark page to a committed eval/benchmark-discovery-sweep.json, and adds scripts to export/refresh that JSON from the local agent corpus DB plus a drift check to keep it in sync.

Changes:

  • Add eval/benchmark-discovery-sweep.json and update the benchmark page to import it via a new @eval/* TS path alias.
  • Add scripts/export-benchmark-discovery-sweep.py + scripts/benchmark_discovery_lib.py to refresh JSON metrics from the corpus DB (and update the refresh PS script to run it).
  • Refactor the drift check to compare JSON vs DB and update docs/workflow path triggers accordingly.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| site/tsconfig.json | Adds @eval/* path alias for importing eval JSON into the site. |
| site/app/benchmark/page.tsx | Imports discovery sweep + rule-card gold metrics from JSON instead of inlining. |
| scripts/refresh-agent-corpus.ps1 | Runs the new export script before the drift check in the local refresh pipeline. |
| scripts/export-benchmark-discovery-sweep.py | New exporter to refresh JSON metrics from the agent corpus DB while preserving notes/names. |
| scripts/corpus-benchmark-discovery-drift.py | Refactors drift check to compare JSON vs DB using shared library helpers. |
| scripts/benchmark_discovery_lib.py | New shared helpers for DB metric extraction + sweep JSON parsing/mapping. |
| eval/benchmark-discovery-sweep.json | New committed JSON containing discovery sweep rows and rule-card agent-gold map. |
| docs/rules.md | Updates refresh instructions to include exporting the sweep JSON before drift checking. |
| .github/workflows/benchmark-discovery-drift.yml | Expands workflow path triggers to include new/updated files. |

Comment on lines +33 to +37
card_gold = doc.get("ruleCardAgentGold", {})
for rule_id in list(card_gold.keys()):
    if rule_id in gold:
        card_gold[rule_id] = gold[rule_id]

Comment on lines 55 to 59
  actual_trigger = triggers.get(rule_id)
- if actual_trigger and exp["triggerPct"] != actual_trigger:
+ if actual_trigger and exp.get("triggerPct") != actual_trigger:
      errors.append(
-         f"{rule_id} triggerPct page={exp['triggerPct']} db={actual_trigger}"
+         f"{rule_id} triggerPct json={exp.get('triggerPct')} db={actual_trigger}"
      )

Comment on lines 60 to 65
  if exp.get("goldPrecision"):
      actual_gold = gold.get(rule_id)
      if actual_gold and exp["goldPrecision"] != actual_gold:
          errors.append(
-             f"{rule_id} goldPrecision page={exp['goldPrecision']} db={actual_gold}"
+             f"{rule_id} goldPrecision json={exp['goldPrecision']} db={actual_gold}"
          )

Comment on lines +67 to +72
for rule_id, exp_gold in card_gold.items():
    actual_gold = gold.get(rule_id)
    if actual_gold and exp_gold != actual_gold:
        errors.append(
            f"{rule_id} ruleCardAgentGold json={exp_gold} db={actual_gold}"
        )

Comment on lines +73 to +79
const discoverySweepJune2026 =
  discoverySweepDoc.discoveryRows as DiscoverySweepRow[];

const agentGoldByRuleId = discoverySweepDoc.ruleCardAgentGold as Record<
  string,
  string
>;

@EricCogen EricCogen merged commit 2059ccb into main Jun 14, 2026
16 checks passed
@EricCogen EricCogen deleted the chore/export-benchmark-discovery-sweep-json branch June 14, 2026 01:37

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4f949d2586

import { Header } from "@/components/header";
import { Footer } from "@/components/footer";
import { Breadcrumbs } from "@/components/breadcrumbs";
import discoverySweepDoc from "@eval/benchmark-discovery-sweep.json";

P2: Add the eval data file to the Pages deploy trigger

Since this page now imports eval/benchmark-discovery-sweep.json into the static Next bundle, a refresh commit that only updates that JSON file will not redeploy the benchmark page. I checked .github/workflows/nextjs.yml and its push path filter only includes site/** and the workflow file, so gauntletci.com/benchmark would keep serving the old bundled metrics until an unrelated site change or manual dispatch occurs. Include the eval JSON in the Pages workflow trigger or keep the data under site/.

gold_rows = cur.execute(
    """
    WITH latest AS (
        SELECT fixture_id, MAX(id) AS run_id

P2: Select latest runs by completion time

When the corpus DB contains multiple runs for the same fixture, MAX(id) picks the lexicographically greatest run id rather than the newest completed run; run ids are GUID/string values, while the rest of the corpus readers use completed_at_utc DESC, id DESC with status = 'COMPLETED'. This can make the new export/drift path compute agent-gold precision from an older run and publish stale percentages after a refresh. Use the same latest-run CTE as corpus_db_read.compute_labeled_rule_metrics.
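The difference between the two selection strategies can be illustrated with a small in-memory SQLite example. This is a sketch under assumptions, not the query from `corpus_db_read.compute_labeled_rule_metrics`: the table name `runs`, the sample ids, and the timestamps are made up, and only the column names and the `completed_at_utc DESC, id DESC` ordering come from the comment above. The window function requires SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical schema: the table name `runs` is an assumption; the column
# names are the ones mentioned in the review comment.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE runs (
        id TEXT PRIMARY KEY,
        fixture_id TEXT NOT NULL,
        status TEXT NOT NULL,
        completed_at_utc TEXT
    );
    INSERT INTO runs VALUES
        ('f3a1', 'fixture-a', 'COMPLETED', '2026-06-01T00:00:00Z'),
        ('0b7c', 'fixture-a', 'COMPLETED', '2026-06-10T00:00:00Z'),
        ('ffff', 'fixture-a', 'FAILED', NULL);
    """
)

# MAX(id) compares strings, so it returns the lexicographically greatest id,
# regardless of status or completion time.
by_max_id = conn.execute(
    "SELECT fixture_id, MAX(id) FROM runs GROUP BY fixture_id"
).fetchall()

# Rank completed runs per fixture by completion time, then id, and keep rank 1.
LATEST_COMPLETED = """
WITH ranked AS (
    SELECT id, fixture_id,
           ROW_NUMBER() OVER (
               PARTITION BY fixture_id
               ORDER BY completed_at_utc DESC, id DESC
           ) AS rn
    FROM runs
    WHERE status = 'COMPLETED'
)
SELECT fixture_id, id FROM ranked WHERE rn = 1
"""
by_completion = conn.execute(LATEST_COMPLETED).fetchall()
```

In this sample data, `by_max_id` selects the failed run `ffff`, while `by_completion` selects `0b7c`, the most recently completed run.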

Comment on lines +34 to +36
for rule_id in list(card_gold.keys()):
    if rule_id in gold:
        card_gold[rule_id] = gold[rule_id]

P2: Drop stale rule-card gold entries on export

When a rule-card entry no longer has a labeled precision in the DB (for example, no latest-run triggers so the rule is absent from gold), this loop leaves the old JSON value in place. The row-level metrics above remove stale goldPrecision, but site/app/benchmark/page.tsx reads ruleCardAgentGold directly for the cards, so a refresh can continue publishing an obsolete agent-gold percentage. Remove missing entries or fail the export when a configured card has no current gold metric.
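One way to implement either option from this suggestion is sketched below. The helper name `refresh_card_gold`, its `strict` switch, and the sample rule ids and percentages are hypothetical; only `card_gold` and `gold` mirror the names in the reviewed snippet, and the values are assumed to be percentage strings.

```python
def refresh_card_gold(card_gold, gold, strict=False):
    """Return a refreshed rule-card map containing only rules with current gold.

    card_gold: existing ruleCardAgentGold map from the JSON (rule id -> percent string)
    gold: current labeled precision per rule id, as read from the DB
    strict: hypothetical switch; when True, raise instead of dropping entries
    """
    missing = sorted(rule_id for rule_id in card_gold if rule_id not in gold)
    if missing and strict:
        raise ValueError(f"no current gold metric for: {', '.join(missing)}")
    # Rebuild the map so entries absent from the DB are dropped, not kept stale.
    return {rule_id: gold[rule_id] for rule_id in card_gold if rule_id in gold}


refreshed = refresh_card_gold(
    {"RULE_A": "80%", "RULE_B": "50%"},  # made-up ids and values
    {"RULE_A": "90%"},
)
```

Here `refreshed` keeps only `RULE_A` with its current value. Passing `strict=True` with the same inputs raises `ValueError` instead, which corresponds to failing the export.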
