Automate benchmark discovery sweep via eval JSON by EricCogen · Pull Request #278 · EricCogen/GauntletCI

EricCogen · 2026-06-14T01:32:13Z

Summary

Adds eval/benchmark-discovery-sweep.json (discovery rows + rule-card agent gold map)
export-benchmark-discovery-sweep.py refreshes metrics from agent corpus DB (preserves names/notes)
Benchmark page imports JSON via @eval/* path alias
Drift check compares JSON to DB; CI still skips the comparison when the DB is absent, via --skip-if-missing-db
Refresh pipeline runs export before drift
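
The export behaviour summarized above (refresh metrics, keep hand-written names and notes) might look roughly like the sketch below. It is an illustration under assumptions, not the PR's actual script: `refresh_rows`, the `ruleId` key, and the sample values in the usage note are hypothetical, while `discoveryRows`, `triggerPct`, and `goldPrecision` appear in this PR's diff.

```python
import copy


def refresh_rows(doc, triggers, gold):
    """Return a copy of `doc` with metric fields refreshed from DB-derived maps.

    Only `triggerPct` and `goldPrecision` are touched; every other field
    (names, notes, and so on) is carried over unchanged.
    """
    out = copy.deepcopy(doc)
    for row in out.get("discoveryRows", []):
        rule_id = row["ruleId"]  # hypothetical key name
        if rule_id in triggers:
            row["triggerPct"] = triggers[rule_id]
        if rule_id in gold:
            row["goldPrecision"] = gold[rule_id]
        else:
            # No current labeled precision in the DB: drop the stale value.
            row.pop("goldPrecision", None)
    return out
```

For example, calling `refresh_rows(doc, {"R1": "2%"}, {})` on a document whose `R1` row carries a `notes` field would update `triggerPct`, remove `goldPrecision`, and leave `notes` as written.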

Test plan

Export + drift OK on agent DB
CI skip mode OK without DB
npm run build
Full CI

Copilot

Pull request overview

This PR moves the benchmark “discovery sweep” data source from a hard-coded TS array in the benchmark page to a committed eval/benchmark-discovery-sweep.json, and adds scripts to export/refresh that JSON from the local agent corpus DB plus a drift check to keep it in sync.

Changes:

Add eval/benchmark-discovery-sweep.json and update the benchmark page to import it via a new @eval/* TS path alias.
Add scripts/export-benchmark-discovery-sweep.py + scripts/benchmark_discovery_lib.py to refresh JSON metrics from the corpus DB (and update the refresh PS script to run it).
Refactor the drift check to compare JSON vs DB and update docs/workflow path triggers accordingly.
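
A JSON-versus-DB drift check of the kind described in the changes above could be sketched as follows. This is a simplified illustration, not the code in scripts/corpus-benchmark-discovery-drift.py: the `ruleId` key, the `load_metrics` callback, the function names, and the exit codes are assumptions, while the comparison logic mirrors the diff hunks shown in this review.

```python
import os
import sys


def find_drift(doc, triggers, gold):
    """Compare committed JSON values with DB-derived metrics; return mismatch messages."""
    errors = []
    for row in doc.get("discoveryRows", []):
        rule_id = row.get("ruleId")  # hypothetical key name
        actual_trigger = triggers.get(rule_id)
        if actual_trigger and row.get("triggerPct") != actual_trigger:
            errors.append(
                f"{rule_id} triggerPct json={row.get('triggerPct')} db={actual_trigger}"
            )
        actual_gold = gold.get(rule_id)
        if row.get("goldPrecision") and actual_gold and row["goldPrecision"] != actual_gold:
            errors.append(
                f"{rule_id} goldPrecision json={row['goldPrecision']} db={actual_gold}"
            )
    for rule_id, exp_gold in doc.get("ruleCardAgentGold", {}).items():
        actual_gold = gold.get(rule_id)
        if actual_gold and exp_gold != actual_gold:
            errors.append(f"{rule_id} ruleCardAgentGold json={exp_gold} db={actual_gold}")
    return errors


def run_check(db_path, doc, load_metrics, skip_if_missing_db=False):
    """Return an exit code (assumed convention): 0 clean or skipped, 1 drift, 2 DB missing."""
    if not os.path.exists(db_path):
        if skip_if_missing_db:
            print(f"{db_path} not found; skipping drift check")
            return 0
        print(f"{db_path} not found", file=sys.stderr)
        return 2
    triggers, gold = load_metrics(db_path)  # hypothetical callback returning two dicts
    errors = find_drift(doc, triggers, gold)
    for message in errors:
        print(message, file=sys.stderr)
    return 1 if errors else 0
```

Keeping the comparison in a pure function (`find_drift`) separate from the DB lookup makes the skip path trivial to exercise in CI, where the corpus DB is not available.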

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
site/tsconfig.json	Adds `@eval/*` path alias for importing eval JSON into the site.
site/app/benchmark/page.tsx	Imports discovery sweep + rule-card gold metrics from JSON instead of inlining.
scripts/refresh-agent-corpus.ps1	Runs the new export script before the drift check in the local refresh pipeline.
scripts/export-benchmark-discovery-sweep.py	New exporter to refresh JSON metrics from the agent corpus DB while preserving notes/names.
scripts/corpus-benchmark-discovery-drift.py	Refactors drift check to compare JSON vs DB using shared library helpers.
scripts/benchmark_discovery_lib.py	New shared helpers for DB metric extraction + sweep JSON parsing/mapping.
eval/benchmark-discovery-sweep.json	New committed JSON containing discovery sweep rows and rule-card agent-gold map.
docs/rules.md	Updates refresh instructions to include exporting the sweep JSON before drift checking.
.github/workflows/benchmark-discovery-drift.yml	Expands workflow path triggers to include new/updated files.

+    card_gold = doc.get("ruleCardAgentGold", {})
+    for rule_id in list(card_gold.keys()):
+        if rule_id in gold:
+            card_gold[rule_id] = gold[rule_id]
+


        actual_trigger = triggers.get(rule_id)
-        if actual_trigger and exp["triggerPct"] != actual_trigger:
+        if actual_trigger and exp.get("triggerPct") != actual_trigger:
            errors.append(
-                f"{rule_id} triggerPct page={exp['triggerPct']} db={actual_trigger}"
+                f"{rule_id} triggerPct json={exp.get('triggerPct')} db={actual_trigger}"
            )


        if exp.get("goldPrecision"):
            actual_gold = gold.get(rule_id)
            if actual_gold and exp["goldPrecision"] != actual_gold:
                errors.append(
-                    f"{rule_id} goldPrecision page={exp['goldPrecision']} db={actual_gold}"
+                    f"{rule_id} goldPrecision json={exp['goldPrecision']} db={actual_gold}"
                )


+    for rule_id, exp_gold in card_gold.items():
+        actual_gold = gold.get(rule_id)
+        if actual_gold and exp_gold != actual_gold:
+            errors.append(
+                f"{rule_id} ruleCardAgentGold json={exp_gold} db={actual_gold}"
+            )


+const discoverySweepJune2026 =
+  discoverySweepDoc.discoveryRows as DiscoverySweepRow[];
+
+const agentGoldByRuleId = discoverySweepDoc.ruleCardAgentGold as Record<
+  string,
+  string
+>;


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4f949d2586

chatgpt-codex-connector · 2026-06-14T01:40:03Z

 import { Header } from "@/components/header";
 import { Footer } from "@/components/footer";
 import { Breadcrumbs } from "@/components/breadcrumbs";
+import discoverySweepDoc from "@eval/benchmark-discovery-sweep.json";


Add the eval data file to the Pages deploy trigger

Since this page now imports eval/benchmark-discovery-sweep.json into the static Next bundle, a refresh commit that only updates that JSON file will not redeploy the benchmark page. I checked .github/workflows/nextjs.yml and its push path filter only includes site/** and the workflow file, so gauntletci.com/benchmark would keep serving the old bundled metrics until an unrelated site change or manual dispatch occurs. Include the eval JSON in the Pages workflow trigger or keep the data under site/.
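
To make the trigger gap concrete, the sketch below approximates a push path filter with Python's `fnmatch`. GitHub Actions uses its own glob rules, so this is only a rough model; the helper name `would_trigger` is hypothetical, and the filter list reflects what this comment reports about .github/workflows/nextjs.yml rather than the file itself.

```python
from fnmatch import fnmatchcase


def would_trigger(changed_files, path_filters):
    """Return True if any changed file matches any path filter (approximate glob model)."""
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in path_filters
    )


# Filters as described in the review comment above.
current_filters = ["site/**", ".github/workflows/nextjs.yml"]

# A refresh commit that only touches the eval JSON.
refresh_commit = ["eval/benchmark-discovery-sweep.json"]

print(would_trigger(refresh_commit, current_filters))
print(would_trigger(refresh_commit, current_filters + ["eval/benchmark-discovery-sweep.json"]))
```

Under this model the first call reports no match and the second reports a match, which is the behaviour the suggested fix is after.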

chatgpt-codex-connector · 2026-06-14T01:40:03Z

+    gold_rows = cur.execute(
+        """
+        WITH latest AS (
+            SELECT fixture_id, MAX(id) AS run_id


Select latest runs by completion time

When the corpus DB contains multiple runs for the same fixture, MAX(id) picks the lexicographically greatest run id rather than the newest completed run. Run ids are GUID/string values, and the rest of the corpus readers order by completed_at_utc DESC, id DESC with status = 'COMPLETED'. The new export/drift path can therefore compute agent-gold precision from an older run and publish stale percentages after a refresh. Use the same latest-run CTE as corpus_db_read.compute_labeled_rule_metrics.
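
The difference between the two selection strategies can be reproduced with an in-memory SQLite database. This is a minimal sketch: the table name `runs` and the sample rows are invented for illustration, and the second query is one possible way to order by completion time, not necessarily the CTE used in corpus_db_read. The column names come from the comment above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE runs ("
    "id TEXT PRIMARY KEY, fixture_id TEXT, status TEXT, completed_at_utc TEXT)"
)
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?, ?)",
    [
        # The older run happens to have the lexicographically greater id.
        ("f000", "fx1", "COMPLETED", "2026-06-01T00:00:00Z"),
        ("a111", "fx1", "COMPLETED", "2026-06-10T00:00:00Z"),
    ],
)

# Strategy flagged in the review: greatest id per fixture.
by_max_id = conn.execute(
    "SELECT fixture_id, MAX(id) FROM runs GROUP BY fixture_id"
).fetchall()

# Alternative: newest completed run per fixture, with id as a tie-breaker.
by_completion = conn.execute(
    """
    SELECT r.fixture_id, r.id
    FROM runs AS r
    WHERE r.id = (
        SELECT r2.id
        FROM runs AS r2
        WHERE r2.fixture_id = r.fixture_id AND r2.status = 'COMPLETED'
        ORDER BY r2.completed_at_utc DESC, r2.id DESC
        LIMIT 1
    )
    """
).fetchall()

print(by_max_id)      # selects run "f000", the older run
print(by_completion)  # selects run "a111", the newest completed run
```

Because string ids carry no ordering information about time, any aggregate over `id` alone can land on an arbitrary run once a fixture has been executed more than once.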

chatgpt-codex-connector · 2026-06-14T01:40:03Z

+    for rule_id in list(card_gold.keys()):
+        if rule_id in gold:
+            card_gold[rule_id] = gold[rule_id]


Drop stale rule-card gold entries on export

When a rule-card entry no longer has a labeled precision in the DB (for example, no latest-run triggers so the rule is absent from gold), this loop leaves the old JSON value in place. The row-level metrics above remove stale goldPrecision, but site/app/benchmark/page.tsx reads ruleCardAgentGold directly for the cards, so a refresh can continue publishing an obsolete agent-gold percentage. Remove missing entries or fail the export when a configured card has no current gold metric.
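
Both remedies suggested above (dropping missing entries, or failing the export) can be sketched in one small helper. The function name `refresh_card_gold` and the `strict` flag are hypothetical; only the `ruleCardAgentGold` map and the DB-derived `gold` map come from this PR.

```python
def refresh_card_gold(card_gold, gold, strict=False):
    """Rebuild the rule-card gold map from current DB metrics.

    Entries with no current metric are dropped, or raise when `strict` is set,
    so a refresh never republishes an obsolete percentage.
    """
    missing = sorted(rule_id for rule_id in card_gold if rule_id not in gold)
    if strict and missing:
        raise ValueError(f"no current gold metric for: {', '.join(missing)}")
    return {rule_id: gold[rule_id] for rule_id in card_gold if rule_id in gold}
```

Returning a new dict rather than mutating in place keeps the stale entries out by construction: anything not present in `gold` simply never makes it into the result.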

Source benchmark discovery sweep from eval/benchmark-discovery-sweep.…

4f949d2

…json. Export script refreshes metrics from agent DB; page imports JSON for discovery table and rule-card agent gold. Drift check validates JSON instead of parsing page.tsx.

Copilot AI review requested due to automatic review settings June 14, 2026 01:32

Copilot started reviewing on behalf of EricCogen June 14, 2026 01:32

Copilot AI reviewed Jun 14, 2026

EricCogen merged commit 2059ccb into main Jun 14, 2026
16 checks passed

EricCogen deleted the chore/export-benchmark-discovery-sweep-json branch June 14, 2026 01:37

chatgpt-codex-connector Bot reviewed Jun 14, 2026

EricCogen merged 1 commit into main from chore/export-benchmark-discovery-sweep-json