sync: bring v0.6-integration into release/v0.6.0 (ATIF/ADP runtime emission + router docs/tests + smoke gate) by xdotli · Pull Request #680 · benchflow-ai/benchflow

xdotli · 2026-06-11T22:33:13Z

Why

release/v0.6.0 and v0.6-integration have diverged. The release branch has the ATIF/ADP export library but is missing the runtime wiring, so the artifacts never emit on a real run — the CHANGELOG's "emitted from every scored rollout" would be false as shipped. This PR brings the integration delta into the release candidate.

What this brings into the release

🔴 Critical: ATIF/ADP runtime emission (feat(trajectories): emit ATIF and ADP from every scored rollout #662) — rollout.py::_write_trainer_artifact now writes trainer/atif.json + trainer/adp.jsonl on every scored rollout (+ job-level ADP aggregation). Proven on a live run (28-step ATIF, 53-item ADP with terminal reward). Without this the emitters are dead code.
Router docs + tests — docs/benchmark-adoption.md (dogfooded guide for bench agent create/run/verify) and tests/test_agent_router_cli_e2e.py (10 real-CLI integration tests; the prior tests were fakes-only).
Smoke gate false-green fix — a skipped live smoke now fails the launch-prep gate instead of silently passing; BENCHFLOW_SMOKE_AGENT/MODEL escape hatch for non-Anthropic contributors. New tests/test_smoke_wiring.py.
Doc truth — canonical result.json field surface pinned (no phantom top-level scalars); BENCHFLOW_DAYTONA_AUTO_REAP documented.

Full suite on v0.6-integration tip (93e0ae9d): 3,436 passed, 0 failed.

⚠️ For the release owner to resolve

Version/CHANGELOG: v0.6-integration is 0.6.0; release/v0.6.0 is on the 0.6.0rcN cadence with its own CHANGELOG. Keep the release branch's version + CHANGELOG on merge; this PR's value is the code delta, not its version line.
May overlap with in-flight PRs, so review for conflicts: #670 (compat: 4 downstream-consumer fixes from clawsbench v0.6 integration: daytona extra error, SDK.run alias, lenient reward.json, bare-model provider routing), #678 (fix(cli,examples): resolve v0.6.0 stress-test defects (ENG-248..256)), and #679 (feat(agents): Add MiMo Code (mimo) ACP agent).

Opened as a sync PR (not a force-push) per coordination between the two integration tracks.

Note

Low Risk
Mostly documentation and test coverage; smoke gate behavior changes release prep (skipped live tests now fail the gate) but does not alter production eval paths.

Overview
Adds benchmark adoption documentation for bench agent create|run|verify and wires it into README and Mintlify nav.

Release gate / smoke: Launch-prep Step 3 now writes JUnit XML and fails if the live smoke was skipped or never ran (pytest exit 0 on skip no longer counts as green). Live smoke gains BENCHFLOW_SMOKE_AGENT + BENCHFLOW_SMOKE_MODEL (must be set together) and provider-aware credential checks so skips name the missing vars; optional openhands + deepseek path for non-Anthropic machines. New test_smoke_wiring.py and test_agent_router_cli_e2e.py lock router CLI behavior and the JUnit gate.

Docs / contract: running-benchmarks.md documents canonical result.json fields (rewards.reward, agent_result.total_tokens; no top-level reward/status). CLI docs note BENCHFLOW_DAYTONA_AUTO_REAP=0. test_verify.py asserts nested token totals and absence of phantom top-level scalars.

Reviewed by Cursor Bugbot for commit 93e0ae9. Bugbot is set up for automated code reviews on this repo.

Document the bench agent create/run/verify benchmark-adoption router with real captured output: the scaffolded file tree and per-file purpose, the fail-closed create behavior, the --dry-run codex launch command, and the parity-gate verdicts (parity-confirmed / parity-divergent / insufficient-evidence) including the divergence issue draft. Link the guide from the docs.json Guides nav and the README docs table.

…n/verify Drive the real registered `bench agent create|run|verify` commands through CliRunner against an on-disk benchmarks tree, with no fake exec/report layer on the create/verify path. Covers: scaffold lands on disk and the generated Python compiles + YAML parses; create refuses on re-run and rejects invalid slugs; realistic parity records flip the verify verdict and exit code (confirmed exits 0, divergent exits non-zero and --issue-out writes a non-empty draft); run --dry-run prints the codex command with the source + CONVERT.md context markers; and the live run path fails closed on missing credentials without spawning codex.

…-level scalars The rollout result.json nests reward under rewards.reward and token totals under agent_result.total_tokens/final_metrics; outcome is a derived classification, not a stored field. There is intentionally no top-level reward/total_tokens/status key, so a naive consumer reading result["reward"] sees a missing key rather than a real value. Document the canonical fields in running-benchmarks.md (correcting a stale snippet that showed nonexistent top-level passed/verifier_output keys) and add mutation-resistant contract tests pinning (1) the absence of invented top-level scalars and (2) that a captured in-process token total round-trips into agent_result.total_tokens instead of dropping to null.
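A minimal sketch of a consumer that reads the nested fields described above rather than top-level keys. The helper names `read_reward` and `read_total_tokens` and the sample payload are hypothetical; only the key paths `rewards.reward` and `agent_result.total_tokens` come from the commit message.

```python
import json
from typing import Any, Optional


def read_reward(result: dict[str, Any]) -> Optional[float]:
    """Return rewards.reward, or None when the nested key is absent."""
    rewards = result.get("rewards") or {}
    return rewards.get("reward")


def read_total_tokens(result: dict[str, Any]) -> Optional[int]:
    """Return agent_result.total_tokens, or None when it is absent."""
    agent_result = result.get("agent_result") or {}
    return agent_result.get("total_tokens")


# Hypothetical payload: the values are made up, only the nesting follows the text.
sample = json.loads(
    '{"rewards": {"reward": 1.0}, "agent_result": {"total_tokens": 1234}}'
)

# There is no top-level "reward" key, so result["reward"] would raise KeyError.
assert "reward" not in sample
print(read_reward(sample), read_total_tokens(sample))
```

Returning `None` for a missing nested key keeps "field absent" distinct from a real reward of 0.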

…greening The live smoke is skipped via skipif (pytest exit 0) when Docker is down or the chosen model has no credential. The launch-prep e2e gate ran it and treated exit 0 as green, so a skipped smoke false-greened the e2e step on a run that never executed.
- Gate Step 3 now writes JUnit XML and fails explicitly when the live smoke was skipped or never ran (counts summed over the nested <testsuite> elements, where pytest writes them).
- Add BENCHFLOW_SMOKE_AGENT / BENCHFLOW_SMOKE_MODEL so a contributor whose only credential is non-Anthropic can run the smoke against an agent/model they can authenticate (proven: openhands + deepseek) instead of skipping. Default claude-agent-acp + Haiku 4.5 behavior is unchanged.
- Skip reasons are now credential-aware and name the exact missing env var for the resolved model.
- tests/test_smoke_wiring.py covers the false-green path: skip-reason wiring, the escape hatch, and the gate's skipped-vs-passed JUnit predicate.

…v var

cursor

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.

cursor · 2026-06-11T22:34:40Z

+  es=list(ET.parse("/tmp/smoke.xml").getroot().iter("testsuite")); \
+  t=sum(int(e.get("tests",0)) for e in es); \
+  s=sum(int(e.get("skipped",0)) for e in es); \
+  sys.exit(0) if t and not s else (print("LIVE SMOKE SKIPPED OR NOT RUN — gate is RED, not green") or sys.exit(1))'


Smoke gate ignores failures

High Severity

The new launch-prep JUnit follow-up only requires a positive tests count and zero skipped. It never sums failures or errors, so a live smoke that ran but failed—or errored during the session fixture—can still make the Python check exit 0 after pytest already reported failure.

Additional Locations (1)

tests/test_smoke_wiring.py#L174-L183

cursor · 2026-06-11T22:34:40Z

-        return "no ANTHROPIC_API_KEY and no ~/.claude/.credentials.json"
-    return None
+    _, model = resolve_smoke_target()
+    return _missing_model_credentials(model)


Half-set smoke env crashes

Medium Severity

_smoke_skip_reason calls resolve_smoke_target(), which raises RuntimeError when only one of BENCHFLOW_SMOKE_AGENT and BENCHFLOW_SMOKE_MODEL is set. That surfaces as a pytest setup error instead of a named skip reason, which pairs poorly with the JUnit gate that does not treat errors as red.
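One way to address this, sketched under assumptions: catch the `RuntimeError` inside the skip-reason helper so a half-set environment yields a named reason instead of a setup error. The `resolve_smoke_target` below is a hypothetical stand-in, not the repository's implementation, and the default agent/model strings are placeholders. The credential check the real helper performs afterwards is omitted.

```python
import os
from typing import Mapping, Optional, Tuple

# Placeholder defaults for illustration; the real default ids are not shown in this thread.
DEFAULT_AGENT = "default-agent"
DEFAULT_MODEL = "default-model"


def resolve_smoke_target(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Hypothetical stand-in: both override vars must be set, or neither."""
    agent = env.get("BENCHFLOW_SMOKE_AGENT")
    model = env.get("BENCHFLOW_SMOKE_MODEL")
    if bool(agent) != bool(model):
        raise RuntimeError(
            "BENCHFLOW_SMOKE_AGENT and BENCHFLOW_SMOKE_MODEL must be set together"
        )
    if agent and model:
        return agent, model
    return DEFAULT_AGENT, DEFAULT_MODEL


def smoke_skip_reason(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a named skip reason for a half-set env instead of raising."""
    try:
        resolve_smoke_target(env)
    except RuntimeError as exc:
        return f"smoke target misconfigured: {exc}"
    # The real helper would go on to check credentials for the resolved model.
    return None
```

Because the gate treats a skipped smoke as red, converting the error into a skip reason would still fail launch prep while naming the misconfiguration.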

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 93e0ae9d92

chatgpt-codex-connector · 2026-06-11T22:35:26Z

+.venv/bin/python -c 'import sys,xml.etree.ElementTree as ET; \
+  es=list(ET.parse("/tmp/smoke.xml").getroot().iter("testsuite")); \
+  t=sum(int(e.get("tests",0)) for e in es); \
+  s=sum(int(e.get("skipped",0)) for e in es); \
+  sys.exit(0) if t and not s else (print("LIVE SMOKE SKIPPED OR NOT RUN — gate is RED, not green") or sys.exit(1))'


Check smoke failures before accepting JUnit counts

When the live smoke actually runs but fails (for example an assertion/verifier failure), pytest still writes a JUnit suite with tests=1 and skipped=0 while returning non-zero. This snippet then runs a second Python command whose predicate ignores failures/errors, so if the block is pasted into a normal shell without set -e, the final command can exit 0 and mark a failed smoke as green. Chain the pytest command with && or include failure/error counts in the XML predicate so the release gate preserves pytest's failure status.
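The second option suggested above, including failure and error counts in the XML predicate, could look roughly like this. It is a sketch rather than the repository's gate code: the function name `smoke_is_green` and the sample XML strings are illustrative, and it takes the XML as a string instead of reading `/tmp/smoke.xml`.

```python
import xml.etree.ElementTree as ET


def smoke_is_green(junit_xml: str) -> bool:
    """Green only if at least one test ran and none were skipped, failed, or errored."""
    root = ET.fromstring(junit_xml)
    # iter() also yields the root itself when the root is a <testsuite>.
    suites = list(root.iter("testsuite"))

    def total(attr: str) -> int:
        return sum(int(suite.get(attr, 0)) for suite in suites)

    return (
        total("tests") > 0
        and total("skipped") == 0
        and total("failures") == 0
        and total("errors") == 0
    )


# Illustrative JUnit documents with the counts on the nested <testsuite>.
PASSED = '<testsuites><testsuite name="pytest" tests="1" skipped="0" failures="0" errors="0"/></testsuites>'
SKIPPED = '<testsuites><testsuite name="pytest" tests="1" skipped="1" failures="0" errors="0"/></testsuites>'
FAILED = '<testsuites><testsuite name="pytest" tests="1" skipped="0" failures="1" errors="0"/></testsuites>'
EMPTY = '<testsuites/>'

for label, doc in [("passed", PASSED), ("skipped", SKIPPED), ("failed", FAILED), ("empty", EMPTY)]:
    print(label, smoke_is_green(doc))
```

In a gate script the result would feed the exit code, for example `sys.exit(0 if smoke_is_green(xml_text) else 1)`, so a failed or errored smoke stays red even if the preceding pytest exit status is lost.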

xdotli added 13 commits June 10, 2026 20:19

Merge remote-tracking branch 'origin/release/v0.6.0' into integrate-back

c1fbbc9

Merge remote-tracking branch 'origin/release/v0.6.0' into sync-int

8ed3181

Merge remote-tracking branch 'origin/release/v0.6.0' into sync-int2

41d42e8

Merge branch 'docs/agent-router-guide' into integrate-router-docs

970eefb

Merge branch 'test/agent-router-cli-e2e' into integrate-router-docs

a32799b

docs(cli): document the BENCHFLOW_DAYTONA_AUTO_REAP automatic-reap en…

3a6500b

…v var

Merge branch 'docs/auto-reap-env' into int-reapdoc

587a941

Merge branch 'fix/smoke-no-false-green' into int-fixes

76aa0de

Merge branch 'fix/result-json-toplevel-fields' into int-fixes

93e0ae9

cursor Bot reviewed Jun 11, 2026

chatgpt-codex-connector Bot reviewed Jun 11, 2026

xdotli merged commit a19f57b into release/v0.6.0 Jun 11, 2026
1 of 2 checks passed

xdotli deleted the v0.6-integration branch June 12, 2026 03:38
