sync: bring v0.6-integration into release/v0.6.0 (ATIF/ADP runtime emission + router docs/tests + smoke gate) #680
Changes from all commits
c1fbbc9
8ed3181
41d42e8
6131305
57d8610
313f0d0
0aff775
970eefb
a32799b
3a6500b
587a941
76aa0de
93e0ae9
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -90,10 +90,36 @@ If `ruff format` changed files: `git diff --name-only`, then `git add <those fil | |
|
|
||
| ```bash | ||
| source .env 2>/dev/null || true | ||
| .venv/bin/python -m pytest -m live tests/test_smoke.py -v | ||
| .venv/bin/python -m pytest -m live tests/test_smoke.py -v -ra \ | ||
| --junitxml=/tmp/smoke.xml | ||
| # A skipped live smoke is NOT green — exit 0 on a run that never executed | ||
| # would false-green the e2e gate. pytest puts tests/skipped on the nested | ||
| # <testsuite> elements, so sum over them and fail unless one ran clean. | ||
| .venv/bin/python -c 'import sys,xml.etree.ElementTree as ET; \ | ||
| es=list(ET.parse("/tmp/smoke.xml").getroot().iter("testsuite")); \ | ||
| t=sum(int(e.get("tests",0)) for e in es); \ | ||
| s=sum(int(e.get("skipped",0)) for e in es); \ | ||
| sys.exit(0) if t and not s else (print("LIVE SMOKE SKIPPED OR NOT RUN — gate is RED, not green") or sys.exit(1))' | ||
|
Review comment on lines +98 to +102:
When the live smoke actually runs but fails (for example an assertion/verifier failure), pytest still writes a JUnit suite with … |
||
| ``` | ||
|
|
||
| If Docker is unavailable, warn and ask to skip or abort — do not skip silently. Expected: `test_hello_world_smoke` passes with reward > 0 and non-empty trajectory. | ||
| The live smoke `skipif`s when Docker is down or the chosen model has no | ||
| credential, and pytest exits `0` on a skip. The JUnit check above turns that | ||
| silent pass into a hard failure so a skipped smoke cannot false-green the gate. | ||
| Expected: `test_hello_world_smoke` passes with reward > 0 and non-empty | ||
| trajectory. | ||
|
|
||
| If the only credential on the machine is non-Anthropic, point the smoke at an | ||
| agent/model it can authenticate instead of skipping (proven combo: | ||
| openhands + deepseek): | ||
|
|
||
| ```bash | ||
| export BENCHFLOW_SMOKE_AGENT=openhands | ||
| export BENCHFLOW_SMOKE_MODEL=deepseek/deepseek-chat | ||
| export DEEPSEEK_API_KEY=... DEEPSEEK_BASE_URL=https://api.deepseek.com | ||
| ``` | ||
|
|
||
| `BENCHFLOW_SMOKE_AGENT` and `BENCHFLOW_SMOKE_MODEL` must be set together; the | ||
| skip reason names the exact missing credential for whichever model is selected. | ||
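The "must be set together" rule can be sketched as a small check. This is an illustration only: `smoke_selection` is a hypothetical helper, and the real smoke test may enforce the pairing differently.

```python
import os


def smoke_selection(env=None):
    """Return (agent, model) when both overrides are set, or None when neither is.

    Raises ValueError when only one of the two variables is set.
    """
    env = os.environ if env is None else env
    agent = env.get("BENCHFLOW_SMOKE_AGENT")
    model = env.get("BENCHFLOW_SMOKE_MODEL")
    # Exactly one of the two being set is the misconfiguration to reject.
    if bool(agent) != bool(model):
        raise ValueError(
            "BENCHFLOW_SMOKE_AGENT and BENCHFLOW_SMOKE_MODEL must be set together"
        )
    return (agent, model) if agent else None
```

Returning `None` here stands for "use whatever default the smoke test ships with", which this sketch does not try to model.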
|
|
||
| --- | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,330 @@ | ||
| # Benchmark adoption | ||
|
|
||
| Adopt an upstream benchmark into a BenchFlow benchmark with `bench agent`. | ||
|
|
||
| ## What the router is | ||
|
|
||
| `bench agent` is the benchmark-adoption router. It *routes* an external | ||
| benchmark into `benchmarks/<name>/` — scaffold, codex-driven conversion, and a | ||
| parity gate — so the result is a first-class BenchFlow benchmark. It sits | ||
| upstream of evaluation: the router *adopts*, while `bench eval create` *runs* | ||
| the resulting tasks. Once `bench agent verify <name>` reports | ||
| `parity-confirmed`, you point `bench eval create` at the converted tasks and run | ||
| them like any other benchmark. | ||
|
|
||
| Three subcommands form the adopt → verify loop: | ||
|
|
||
| ``` | ||
| $ bench agent --help | ||
| ╭─ Commands ───────────────────────────────────────────────────────────────────╮ | ||
| │ create Scaffold benchmarks/<name>/ for a new benchmark adoption. │ | ||
| │ run Drive the CONVERT.md workflow by launching the host codex CLI. │ | ||
| │ verify Run the parity gate for an adopted benchmark; emit a verdict. │ | ||
| ╰──────────────────────────────────────────────────────────────────────────────╯ | ||
| ``` | ||
|
|
||
| The reference for what a finished adoption looks like is | ||
| [`benchmarks/programbench/`](../benchmarks/programbench/); the conversion | ||
| contract is [`benchmarks/CONVERT.md`](../benchmarks/CONVERT.md). The router | ||
| embeds both into the conversion workflow for you. | ||
|
|
||
| ## `bench agent create <name>` | ||
|
|
||
| `create` writes a deterministic scaffold under `benchmarks/<name>/`, matching | ||
| the reference layout and the CONVERT.md contract. Use `--benchmarks-dir` to | ||
| target a directory other than the repo's `benchmarks/`: | ||
|
|
||
| ``` | ||
| $ bench agent create webarena-lite --benchmarks-dir /tmp/router-docs/benchmarks | ||
| Scaffolded /tmp/router-docs/benchmarks/webarena-lite | ||
| README.md | ||
| __init__.py | ||
| benchflow.py | ||
| benchmark.yaml | ||
| main.py | ||
| parity_experiment.json | ||
| parity_test.py | ||
| run_webarena_lite.py | ||
| webarena-lite.yaml | ||
| ``` | ||
|
|
||
| That produces this tree: | ||
|
|
||
| ``` | ||
| webarena-lite/ | ||
| ├── __init__.py | ||
| ├── benchflow.py # converter: source instances → Harbor task dirs | ||
| ├── main.py # converter CLI delegator | ||
| ├── parity_test.py # structural / eval / side-by-side parity checks | ||
| ├── parity_experiment.json # recorded parity results (read by verify) | ||
| ├── benchmark.yaml # standard benchmark descriptor | ||
| ├── run_webarena_lite.py # runner: convert, then evaluate via BenchFlow | ||
| ├── webarena-lite.yaml # BenchFlow job config (how to run) | ||
| └── README.md # generated workflow notes | ||
| ``` | ||
|
|
||
| What each file is for: | ||
|
|
||
| - **`benchflow.py`** — the converter. Its documented `convert()` / | ||
| `convert_all()` entry points are `NotImplementedError` stubs that point at | ||
| CONVERT.md step 2; you fill them in to map each source instance to a | ||
| Harbor-format task directory (`task.toml`, `instruction.md`, | ||
| `environment/Dockerfile`, `tests/test.sh`). | ||
| - **`parity_test.py`** — the parity harness, with `--mode full | eval-parity | | ||
| side-by-side` (CONVERT.md steps 3–5). Side-by-side parity records the | ||
| per-criterion `original_verdict` / `adapted_verdict` pairs that `verify` | ||
| scores. | ||
| - **`parity_experiment.json`** — the recorded parity results `verify` reads. The | ||
| scaffold writes a `status: "template"` placeholder with empty | ||
| `conversion_parity.tasks` and `reward_distribution_parity.samples`; you | ||
| populate it from a real parity run. | ||
| - **`benchmark.yaml`** — the standard descriptor (name, conversion method, | ||
| verification method, parity tallies). Fields start as `TODO`/`0`. | ||
|
|
||
| `main.py`, `run_webarena_lite.py`, and `webarena-lite.yaml` are the converter | ||
| CLI delegator, the convert-then-evaluate runner, and the BenchFlow job config | ||
| respectively. | ||
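As a rough aid while filling in `convert()`, the four files named above can be checked for presence in a converted task directory. This is a sketch under assumptions: `missing_task_files` and `REQUIRED_TASK_FILES` are hypothetical names, and the check says nothing about what the files must contain (CONVERT.md remains the contract).

```python
import tempfile
from pathlib import Path

# Relative paths the text above lists for a Harbor-format task directory.
REQUIRED_TASK_FILES = (
    "task.toml",
    "instruction.md",
    "environment/Dockerfile",
    "tests/test.sh",
)


def missing_task_files(task_dir):
    """Return the required relative paths that are absent from task_dir."""
    root = Path(task_dir)
    return [rel for rel in REQUIRED_TASK_FILES if not (root / rel).is_file()]


# Demo on a throwaway directory: create two of the four files, report the rest.
with tempfile.TemporaryDirectory() as tmp:
    task = Path(tmp) / "shopping-001"
    (task / "tests").mkdir(parents=True)
    (task / "instruction.md").write_text("placeholder instruction\n")
    (task / "tests" / "test.sh").write_text("#!/bin/sh\nexit 0\n")
    demo_missing = missing_task_files(task)

print(demo_missing)
```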
|
|
||
| ### Fail-closed behavior | ||
|
|
||
| `create` refuses to overwrite an existing benchmark — re-running it is an error, | ||
| not a silent clobber: | ||
|
|
||
| ``` | ||
| $ bench agent create webarena-lite --benchmarks-dir /tmp/router-docs/benchmarks | ||
| benchmark already exists: /tmp/router-docs/benchmarks/webarena-lite (refusing to | ||
| overwrite) | ||
| ``` | ||
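One common way to get this refuse-to-overwrite behavior is to let directory creation itself fail when the target exists. The sketch below assumes that approach; `scaffold_dir` is a hypothetical helper and not the router's actual implementation.

```python
import tempfile
from pathlib import Path


def scaffold_dir(benchmarks_dir, name):
    """Create benchmarks_dir/name, exiting non-zero if it already exists."""
    target = Path(benchmarks_dir) / name
    try:
        # exist_ok=False makes an existing target an error instead of a no-op.
        target.mkdir(parents=True, exist_ok=False)
    except FileExistsError:
        raise SystemExit(
            f"benchmark already exists: {target} (refusing to overwrite)"
        ) from None
    return target


# Demo: the first call creates the directory, the second is refused.
with tempfile.TemporaryDirectory() as tmp:
    first_ok = scaffold_dir(tmp, "webarena-lite").is_dir()
    try:
        scaffold_dir(tmp, "webarena-lite")
        second_refused = False
    except SystemExit as exc:
        second_refused = True
        refusal_message = str(exc)
```

Raising `SystemExit` with a string prints the message to stderr and exits with status 1 when uncaught, which matches the non-zero exit described here.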
|
|
||
| Names must be lowercase slugs (leading letter, single internal hyphens). The | ||
| slug is also the security floor — it keeps `create`/`verify` from being steered | ||
| outside `benchmarks/`. An uppercase or underscored name is rejected: | ||
|
|
||
| ``` | ||
| $ bench agent create WebArena_Lite --benchmarks-dir /tmp/router-docs/benchmarks | ||
| invalid benchmark name 'WebArena_Lite': use a lowercase slug like 'my-bench' | ||
| (letters/digits, single internal hyphens, leading letter) | ||
| ``` | ||
|
|
||
| Both fail-closed cases exit non-zero. | ||
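The slug rule (leading letter, lowercase letters and digits, single internal hyphens) can be expressed as one regular expression. This is a sketch of a pattern that matches the stated rule; `SLUG_RE` and `is_valid_slug` are hypothetical names, and the router's real validator may differ in detail.

```python
import re

# Leading lowercase letter, then letters/digits, with single hyphens allowed
# only between alphanumeric runs (no leading, trailing, or doubled hyphens).
SLUG_RE = re.compile(r"[a-z][a-z0-9]*(?:-[a-z0-9]+)*")


def is_valid_slug(name: str) -> bool:
    """Return True when name is a lowercase slug as described above."""
    return SLUG_RE.fullmatch(name) is not None
```

Because the pattern admits no `/`, `.`, or `_`, a name such as `../etc` cannot pass, which is the sense in which the slug keeps paths inside `benchmarks/`.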
|
|
||
| ## `bench agent run <source> [--name]` | ||
|
|
||
| `run` drives the conversion. It assembles an adoption prompt — the source, the | ||
| target `benchmarks/<name>/` path, the adoption skills (CONVERT.md, the | ||
| programbench worked example, the parity harness), and the full embedded | ||
| CONVERT.md guide — then launches the host `codex` CLI to do the conversion | ||
| toward a pull request. If you omit `--name`, the slug is derived from the source | ||
| basename (so `.../webarena` becomes `webarena`). | ||
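Deriving the default name from the source basename might look like the sketch below. It assumes the source is a URL or a POSIX-style path; `derive_name` is a hypothetical helper, and any further normalization the router applies (for example to make the result a valid slug) is not shown in the text and is not modeled here.

```python
import posixpath
from urllib.parse import urlparse


def derive_name(source: str) -> str:
    """Return the last path component of a URL or POSIX-style path."""
    # For a plain local path urlparse leaves everything in .path.
    path = urlparse(source).path or source
    return posixpath.basename(path.rstrip("/"))
```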
|
|
||
| Use `--dry-run` to print the exact command the router would launch without | ||
| running it: | ||
|
|
||
| ``` | ||
| $ bench agent run https://github.com/web-arena-x/webarena --name webarena-lite --dry-run | ||
| codex exec --cd /path/to/benchflow --skip-git-repo-check --sandbox workspace-write '# Benchmark adoption: webarena-lite | ||
|
|
||
| Adopt the source benchmark below into a BenchFlow benchmark by | ||
| following the conversion guide. Produce the converter, parity tests, | ||
| metadata, and task directories, then open a pull request. | ||
|
|
||
| Source benchmark: https://github.com/web-arena-x/webarena | ||
| Target directory: benchmarks/webarena-lite/ | ||
|
|
||
| ## Adoption skills | ||
| - conversion-guide: benchmarks/CONVERT.md | ||
| - reference-benchmark: benchmarks/programbench/ (worked example) | ||
| - parity-harness: parity_test.py + parity_experiment.json (verify gate) | ||
|
|
||
| ## Conversion guide (benchmarks/CONVERT.md) | ||
|
|
||
| # Benchmark Conversion Guide | ||
| ... | ||
| ## Definition of done | ||
| - benchmarks/webarena-lite/ has benchflow.py, parity_test.py, | ||
| parity_experiment.json, benchmark.yaml, run_webarena_lite.py, | ||
| README.md | ||
| - `bench agent verify webarena-lite` reports parity-confirmed' | ||
| ``` | ||
|
|
||
| The full prompt embeds CONVERT.md verbatim (elided above). The `codex exec` | ||
| argv is constructed deterministically: it runs in the repo root | ||
| (`--cd <repo>`), with `--skip-git-repo-check` and | ||
| `--sandbox workspace-write`. Pass `--model` to set the codex driver model and | ||
| `--codex-bin` to point at a different codex binary. | ||
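A deterministic argv of this shape can be sketched as a plain list. The flags are the ones shown in the dry-run output; `build_codex_argv` is a hypothetical helper, and how `--model` is forwarded to codex is not shown in the text, so it is left out.

```python
import shlex


def build_codex_argv(repo_root, prompt, codex_bin="codex"):
    """Build the codex exec argv in a fixed order, with the prompt as the last argument."""
    return [
        codex_bin,
        "exec",
        "--cd",
        str(repo_root),
        "--skip-git-repo-check",
        "--sandbox",
        "workspace-write",
        prompt,
    ]


argv = build_codex_argv("/path/to/benchflow", "# Benchmark adoption: webarena-lite")
# shlex.join quotes the prompt so the printed line is copy-pasteable in a shell.
print(shlex.join(argv))
```

Keeping the argv as a list (rather than a shell string) avoids quoting problems when the prompt embeds the whole conversion guide.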
|
|
||
| A live run (drop `--dry-run`) requires codex credentials and fails closed | ||
| without them — set `OPENAI_API_KEY` (or `CODEX_API_KEY`), or run `codex login` | ||
| to create `~/.codex/auth.json`. Without credentials `run` errors before | ||
| assembling any context: | ||
|
|
||
| ``` | ||
| codex needs credentials to launch: set OPENAI_API_KEY (or CODEX_API_KEY), or run | ||
| `codex login` to create ~/.codex/auth.json | ||
| ``` | ||
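The credential pre-check described above could be sketched as follows. It assumes the three sources named in the error message are the only ones consulted; `has_codex_credentials` is a hypothetical helper, and it checks only that a value or file is present, not that it is valid.

```python
import os
from pathlib import Path


def has_codex_credentials(env=None, home=None) -> bool:
    """Return True if an API key variable is set or ~/.codex/auth.json exists."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    if env.get("OPENAI_API_KEY") or env.get("CODEX_API_KEY"):
        return True
    return (home / ".codex" / "auth.json").is_file()
```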
|
|
||
| The codex run is the manual-validation step — it iterates on the converter and | ||
| parity tests until `bench agent verify` confirms parity. | ||
|
|
||
| ## `bench agent verify <name>` | ||
|
|
||
| `verify` is the gate that closes the loop. It reads the adopted benchmark's | ||
| `parity_experiment.json` and emits a confidence verdict. The gate is *parity | ||
| only*: a faithful conversion must reproduce the original's behavior on identical | ||
| inputs — including any reward-hackability the original has. It never "improves" | ||
| or sanitizes the source. | ||
|
|
||
| It scores two layers: | ||
|
|
||
| - **Conversion parity (deterministic floor)** — every compared criterion's | ||
| converted verdict must match the original's verdict on identical inputs. | ||
| - **Reward-distribution parity (statistical layer)** — every | ||
| legacy-vs-converted reward delta must sit within `--tolerance` (default | ||
| `0.02`). | ||
|
|
||
| A layer with no recorded data does not block the verdict. The three verdicts: | ||
|
|
||
| | Verdict | Meaning | | ||
| | --- | --- | | ||
| | `parity-confirmed` | Every recorded layer agrees; high-confidence the conversion is faithful. | | ||
| | `parity-divergent` | A criterion disagrees or a reward delta exceeds tolerance. | | ||
| | `insufficient-evidence` | No recorded comparisons at all — run `parity_test.py` and record results first. | | ||
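The two layers and three verdicts can be sketched as one function over the recorded JSON. This is an illustration under assumptions: `score_parity` is a hypothetical name, agreement is recomputed from the verdict pair rather than read from the `agreement` field, and a delta equal to the tolerance is treated as within it.

```python
def score_parity(experiment: dict, tolerance: float = 0.02) -> str:
    """Return a verdict string from a parity_experiment.json-shaped dict."""
    tasks = experiment.get("conversion_parity", {}).get("tasks", [])
    criteria = [c for t in tasks for c in t.get("criteria_results", [])]
    samples = experiment.get("reward_distribution_parity", {}).get("samples", [])

    # No recorded comparisons in either layer: nothing to judge.
    if not criteria and not samples:
        return "insufficient-evidence"

    # Deterministic floor: every converted verdict must match the original.
    conversion_ok = all(
        c["original_verdict"] == c["adapted_verdict"] for c in criteria
    )
    # Statistical layer: every reward delta must sit within the tolerance.
    reward_ok = all(
        abs(s["legacy_reward"] - s["converted_reward"]) <= tolerance
        for s in samples
    )
    return "parity-confirmed" if conversion_ok and reward_ok else "parity-divergent"
```

Because `all()` over an empty list is true, a layer with no recorded data does not block the verdict, which mirrors the rule stated above.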
|
|
||
| A freshly scaffolded benchmark has no recorded parity, so it is | ||
| `insufficient-evidence` and exits non-zero: | ||
|
|
||
| ``` | ||
| $ bench agent verify webarena-lite --benchmarks-dir /tmp/router-docs/benchmarks | ||
| Verdict: insufficient-evidence | ||
| conversion: 0/0 criteria agree (rate 0.0000) | ||
| Insufficient evidence: no recorded parity comparisons. Run parity_test.py and | ||
| record results before trusting the conversion. | ||
| ... | ||
| ``` | ||
|
|
||
| ### A parity-confirmed run | ||
|
|
||
| Populate `parity_experiment.json` from a parity run. `verify` reads | ||
| per-criterion verdicts under `conversion_parity.tasks` and reward samples under | ||
| `reward_distribution_parity.samples`: | ||
|
|
||
| ```json | ||
| { | ||
| "experiment": "side-by-side-parity", | ||
| "benchmark": "webarena-lite", | ||
| "status": "recorded", | ||
| "judge_model": "gemini-3.1-flash-lite", | ||
| "conversion_parity": { | ||
| "tasks": [ | ||
| { | ||
| "task_id": "shopping-001", | ||
| "n_criteria": 2, | ||
| "criteria_results": [ | ||
| {"criterion_id": "C-001", "original_verdict": "pass", "adapted_verdict": "pass", "agreement": true}, | ||
| {"criterion_id": "C-002", "original_verdict": "fail", "adapted_verdict": "fail", "agreement": true} | ||
| ] | ||
| }, | ||
| { | ||
| "task_id": "reddit-002", | ||
| "n_criteria": 1, | ||
| "criteria_results": [ | ||
| {"criterion_id": "C-001", "original_verdict": "pass", "adapted_verdict": "pass", "agreement": true} | ||
| ] | ||
| } | ||
| ] | ||
| }, | ||
| "reward_distribution_parity": { | ||
| "samples": [ | ||
| {"task_id": "shopping-001", "legacy_reward": 0.50, "converted_reward": 0.50}, | ||
| {"task_id": "reddit-002", "legacy_reward": 1.00, "converted_reward": 1.00} | ||
| ] | ||
| } | ||
| } | ||
| ``` | ||
|
|
||
| With every criterion agreeing and every reward delta at zero, the verdict is | ||
| `parity-confirmed` and `verify` exits zero: | ||
|
|
||
| ``` | ||
| $ bench agent verify webarena-lite --benchmarks-dir /tmp/router-docs/benchmarks | ||
| Verdict: parity-confirmed | ||
| conversion: 3/3 criteria agree (rate 1.0000) | ||
| reward: max abs delta 0.0000 (tolerance 0.0200) | ||
| High-confidence: the converted evaluation reproduces the original's verdicts on | ||
| every compared criterion and stays within reward tolerance. | ||
| ``` | ||
|
|
||
| ### A parity-divergent run | ||
|
|
||
| Flip one criterion so the converted verdict no longer matches the original | ||
| (here `C-002`'s `adapted_verdict` goes from `fail` to `pass`). The deterministic | ||
| floor trips, the verdict becomes `parity-divergent`, and `verify` prints a draft | ||
| GitHub issue body for the support path: | ||
|
|
||
| ``` | ||
| $ bench agent verify webarena-lite --benchmarks-dir /tmp/router-docs/benchmarks | ||
| Verdict: parity-divergent | ||
| conversion: 2/3 criteria agree (rate 0.6667) | ||
| reward: max abs delta 0.0000 (tolerance 0.0200) | ||
| Divergence found: the conversion does not yet reproduce the original's behavior | ||
| — iterate, then open an issue for support. | ||
| ## Benchmark adoption parity: webarena-lite | ||
|
|
||
| **Verdict:** parity-divergent | ||
|
|
||
| Divergence found: the conversion does not yet reproduce the original's behavior | ||
| — iterate, then open an issue for support. | ||
|
|
||
| ### Conversion parity (deterministic floor) | ||
| - criteria compared: 3 | ||
| - agreed: 2 | ||
| - agreement rate: 0.6667 | ||
| - shopping-001/C-002: original=fail converted=pass | ||
|
|
||
| ### Reward-distribution parity (statistical layer) | ||
| - samples: 2 | ||
| - max abs delta: 0.0000 | ||
| - tolerance: 0.0200 | ||
|
|
||
| ### Ask | ||
| Parity could not be closed for this conversion. The translation must | ||
| reproduce the original's behavior on identical inputs (including any | ||
| reward-hackability it has). This draft has NOT been filed — review it, | ||
| iterate on the converter, and open it manually if you need support. | ||
| ``` | ||
|
|
||
| The draft is **never filed automatically** — it is printed for a human to | ||
| review and open if they need support. Pass `--issue-out PATH` to write it to a | ||
| file instead of stdout: | ||
|
|
||
| ``` | ||
| $ bench agent verify webarena-lite --benchmarks-dir /tmp/router-docs/benchmarks --issue-out /tmp/router-docs/divergence.md | ||
| Verdict: parity-divergent | ||
| ... | ||
| Issue draft written to /tmp/router-docs/divergence.md | ||
| ``` | ||
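The tally and the per-criterion divergence lines in the draft can be reproduced from the recorded JSON. This sketch assumes the verdict pair is compared directly; `conversion_tally` and `list_divergences` are hypothetical helpers, and the sample data is the divergent variant of the example above, reduced to its conversion layer.

```python
def _criteria(experiment):
    """Yield (task_id, criterion) pairs from the conversion-parity layer."""
    for task in experiment.get("conversion_parity", {}).get("tasks", []):
        for criterion in task.get("criteria_results", []):
            yield task["task_id"], criterion


def conversion_tally(experiment):
    """Return (agreed, compared, rate); rate is 0.0 when nothing was compared."""
    pairs = list(_criteria(experiment))
    agreed = sum(c["original_verdict"] == c["adapted_verdict"] for _, c in pairs)
    rate = agreed / len(pairs) if pairs else 0.0
    return agreed, len(pairs), rate


def list_divergences(experiment):
    """Return one 'task/criterion: original=X converted=Y' line per mismatch."""
    return [
        f"{task_id}/{c['criterion_id']}: "
        f"original={c['original_verdict']} converted={c['adapted_verdict']}"
        for task_id, c in _criteria(experiment)
        if c["original_verdict"] != c["adapted_verdict"]
    ]


divergent = {
    "conversion_parity": {"tasks": [
        {"task_id": "shopping-001", "criteria_results": [
            {"criterion_id": "C-001", "original_verdict": "pass", "adapted_verdict": "pass"},
            {"criterion_id": "C-002", "original_verdict": "fail", "adapted_verdict": "pass"},
        ]},
        {"task_id": "reddit-002", "criteria_results": [
            {"criterion_id": "C-001", "original_verdict": "pass", "adapted_verdict": "pass"},
        ]},
    ]},
}
agreed, compared, rate = conversion_tally(divergent)
print(f"conversion: {agreed}/{compared} criteria agree (rate {rate:.4f})")
print("\n".join(list_divergences(divergent)))
```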
|
|
||
| ### The `--roundtrip-task` structural hook | ||
|
|
||
| By default `verify` scores the recorded `parity_experiment.json` at the | ||
| benchmark level. Pass `--roundtrip-task <task-dir>` to also run the structural | ||
| round-trip conformance check on one concrete task tree (it reuses the existing | ||
| Harbor round-trip parity utility). It is opt-in because that harness needs a | ||
| concrete task directory, which the benchmark-level verdict does not require. | ||
|
|
||
| `verify` exits non-zero for `parity-divergent` and `insufficient-evidence`, and | ||
| errors if the benchmark was never adopted: | ||
|
|
||
| ``` | ||
| $ bench agent verify nonexistent-bench --benchmarks-dir /tmp/router-docs/benchmarks | ||
| benchmark not adopted: /tmp/router-docs/benchmarks/nonexistent-bench — run | ||
| `bench agent create nonexistent-bench` first | ||
| ``` | ||
|
|
||
| ## From adoption to evaluation | ||
|
|
||
| Once `verify` reports `parity-confirmed`, the benchmark is a normal BenchFlow | ||
| benchmark: run its tasks with `bench eval create` (see | ||
| [Running benchmarks](./running-benchmarks.md)), using the job config the | ||
| scaffold generated. The router's job ends at `parity-confirmed`; evaluation | ||
| takes it from there. |
Review comment (Cursor Bugbot, commit 93e0ae9)
Smoke gate ignores failures
High Severity
The new launch-prep JUnit follow-up only requires a positive `tests` count and zero `skipped`. It never sums `failures` or `errors`, so a live smoke that ran but failed, or errored during the session fixture, can still make the Python check exit 0 after `pytest` already reported failure.
Additional Locations (1): tests/test_smoke_wiring.py#L174-L183
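One way to address this comment is to count `failures` and `errors` alongside `skipped`. The sketch below assumes pytest's JUnit output carries `tests`, `skipped`, `failures`, and `errors` attributes on each `<testsuite>` element; `smoke_gate_is_green` is a hypothetical helper, not code from this PR.

```python
import xml.etree.ElementTree as ET


def smoke_gate_is_green(junit_xml: str) -> bool:
    """Return True only if at least one test ran and none skipped, failed, or errored."""
    root = ET.fromstring(junit_xml)
    # iter() also yields the root itself when the root is a <testsuite>.
    suites = list(root.iter("testsuite"))

    def total(attr: str) -> int:
        return sum(int(suite.get(attr, 0)) for suite in suites)

    ran = total("tests")
    not_clean = total("skipped") + total("failures") + total("errors")
    return ran > 0 and not_clean == 0
```

A caller would read `/tmp/smoke.xml`, pass its text to this function, and exit non-zero when it returns `False`. Checking pytest's own exit status before parsing the XML would be a complementary guard.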