sync: bring v0.6-integration into release/v0.6.0 (ATIF/ADP runtime emission + router docs/tests + smoke gate) #680
Document the bench agent create/run/verify benchmark-adoption router with real captured output: the scaffolded file tree and per-file purpose, the fail-closed create behavior, the --dry-run codex launch command, and the parity-gate verdicts (parity-confirmed / parity-divergent / insufficient-evidence) including the divergence issue draft. Link the guide from the docs.json Guides nav and the README docs table.
…n/verify Drive the real registered `bench agent create|run|verify` commands through CliRunner against an on-disk benchmarks tree, with no fake exec/report layer on the create/verify path. Covers: scaffold lands on disk and the generated Python compiles + YAML parses; create refuses on re-run and rejects invalid slugs; realistic parity records flip the verify verdict and exit code (confirmed exits 0, divergent exits non-zero and --issue-out writes a non-empty draft); run --dry-run prints the codex command with the source + CONVERT.md context markers; and the live run path fails closed on missing credentials without spawning codex.
…-level scalars The rollout result.json nests reward under rewards.reward and token totals under agent_result.total_tokens/final_metrics; outcome is a derived classification, not a stored field. There is intentionally no top-level reward/total_tokens/status key, so a naive consumer reading result["reward"] sees a missing key rather than a real value. Document the canonical fields in running-benchmarks.md (correcting a stale snippet that showed nonexistent top-level passed/verifier_output keys) and add mutation-resistant contract tests pinning (1) the absence of invented top-level scalars and (2) that a captured in-process token total round-trips into agent_result.total_tokens instead of dropping to null.
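A minimal sketch of a consumer that reads the nested fields described above rather than a top-level key. The helper names `read_reward` and `read_total_tokens` and the sample payload are hypothetical; only the key paths `rewards.reward` and `agent_result.total_tokens` come from the commit message.

```python
from typing import Any, Optional


def read_reward(result: dict[str, Any]) -> Optional[float]:
    """Return rewards.reward, or None when no reward was recorded."""
    rewards = result.get("rewards") or {}
    return rewards.get("reward")


def read_total_tokens(result: dict[str, Any]) -> Optional[int]:
    """Return agent_result.total_tokens, or None when it was not captured."""
    agent_result = result.get("agent_result") or {}
    return agent_result.get("total_tokens")


# Hypothetical payload shaped like the description above (values are made up).
sample = {
    "rewards": {"reward": 1.0},
    "agent_result": {"total_tokens": 1234},
}

# There is no top-level scalar to read by mistake.
assert "reward" not in sample
print(read_reward(sample), read_total_tokens(sample))
```

Returning `None` for a missing branch keeps "not recorded" distinct from a real value of zero, which is the confusion the contract tests are meant to prevent.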
…greening The live smoke is skipped via `skipif` (pytest exits 0) when Docker is down or the chosen model has no credential. The launch-prep e2e gate ran it and treated exit 0 as green, so a skipped smoke false-greened the e2e step on a run that never executed.
- Gate Step 3 now writes JUnit XML and fails explicitly when the live smoke was skipped or never ran (counts summed over the nested <testsuite> elements, where pytest writes them).
- Add BENCHFLOW_SMOKE_AGENT / BENCHFLOW_SMOKE_MODEL so a contributor whose only credential is non-Anthropic can run the smoke against an agent/model they can authenticate (proven: openhands + deepseek) instead of skipping. Default claude-agent-acp + Haiku 4.5 behavior is unchanged.
- Skip reasons are now credential-aware and name the exact missing env var for the resolved model.
- tests/test_smoke_wiring.py covers the false-green path: skip-reason wiring, the escape hatch, and the gate's skipped-vs-passed JUnit predicate.
Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 93e0ae9. Configure here.
```bash
  es=list(ET.parse("/tmp/smoke.xml").getroot().iter("testsuite")); \
  t=sum(int(e.get("tests",0)) for e in es); \
  s=sum(int(e.get("skipped",0)) for e in es); \
  sys.exit(0) if t and not s else (print("LIVE SMOKE SKIPPED OR NOT RUN — gate is RED, not green") or sys.exit(1))'
```
Smoke gate ignores failures
High Severity
The new launch-prep JUnit follow-up only requires a positive tests count and zero skipped. It never sums failures or errors, so a live smoke that ran but failed—or errored during the session fixture—can still make the Python check exit 0 after pytest already reported failure.
```python
return "no ANTHROPIC_API_KEY and no ~/.claude/.credentials.json"
return None
_, model = resolve_smoke_target()
return _missing_model_credentials(model)
```
Half-set smoke env crashes
Medium Severity
_smoke_skip_reason calls resolve_smoke_target(), which raises RuntimeError when only one of BENCHFLOW_SMOKE_AGENT and BENCHFLOW_SMOKE_MODEL is set. That surfaces as a pytest setup error instead of a named skip reason, which pairs poorly with the JUnit gate that does not treat errors as red.
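One possible way to address this is to catch the error and return it as a named skip reason. The sketch below is an assumption, not the project's code: `resolve_smoke_target` is a hypothetical stand-in that takes the environment as a dict, the default agent/model strings are placeholders, and `smoke_skip_reason` is an illustrative name.

```python
from typing import Optional


def resolve_smoke_target(env: dict[str, str]) -> tuple[str, str]:
    """Hypothetical stand-in: both override vars must be set together."""
    agent = env.get("BENCHFLOW_SMOKE_AGENT")
    model = env.get("BENCHFLOW_SMOKE_MODEL")
    if bool(agent) != bool(model):
        raise RuntimeError(
            "BENCHFLOW_SMOKE_AGENT and BENCHFLOW_SMOKE_MODEL must be set together"
        )
    # Placeholder defaults; the real defaults live in the project.
    return agent or "default-agent", model or "default-model"


def smoke_skip_reason(env: dict[str, str]) -> Optional[str]:
    """Turn a half-set override into a named skip reason, not a setup error."""
    try:
        resolve_smoke_target(env)
    except RuntimeError as exc:
        return f"smoke target misconfigured: {exc}"
    # A real helper would go on to check credentials for the resolved model.
    return None


print(smoke_skip_reason({"BENCHFLOW_SMOKE_AGENT": "openhands"}))
```

Because the gate already treats a skipped smoke as red, reporting the misconfiguration as a skip would still fail the release step while naming the cause.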
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 93e0ae9d92
```bash
.venv/bin/python -c 'import sys,xml.etree.ElementTree as ET; \
  es=list(ET.parse("/tmp/smoke.xml").getroot().iter("testsuite")); \
  t=sum(int(e.get("tests",0)) for e in es); \
  s=sum(int(e.get("skipped",0)) for e in es); \
  sys.exit(0) if t and not s else (print("LIVE SMOKE SKIPPED OR NOT RUN — gate is RED, not green") or sys.exit(1))'
```
Check smoke failures before accepting JUnit counts
When the live smoke actually runs but fails (for example an assertion/verifier failure), pytest still writes a JUnit suite with tests=1 and skipped=0 while returning non-zero. This snippet then runs a second Python command whose predicate ignores failures/errors, so if the block is pasted into a normal shell without set -e, the final command can exit 0 and mark a failed smoke as green. Chain the pytest command with && or include failure/error counts in the XML predicate so the release gate preserves pytest's failure status.
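A sketch of the second option, folding failure and error counts into the XML predicate. The function name `smoke_gate_is_green` is hypothetical, and it takes the XML as a string rather than reading `/tmp/smoke.xml` so the example is self-contained; `tests`, `skipped`, `failures`, and `errors` are the count attributes pytest writes on each `<testsuite>` element.

```python
import xml.etree.ElementTree as ET


def smoke_gate_is_green(junit_xml: str) -> bool:
    """Green only if at least one test ran and none skipped, failed, or errored."""
    root = ET.fromstring(junit_xml)
    # iter() also yields the root itself when the root is a <testsuite>.
    suites = list(root.iter("testsuite"))

    def total(attr: str) -> int:
        return sum(int(s.get(attr, 0)) for s in suites)

    ran = total("tests")
    bad = total("skipped") + total("failures") + total("errors")
    return ran > 0 and bad == 0


# Hand-written reports for illustration only.
passed = '<testsuites><testsuite tests="1" skipped="0" failures="0" errors="0"/></testsuites>'
failed = '<testsuites><testsuite tests="1" skipped="0" failures="1" errors="0"/></testsuites>'
print(smoke_gate_is_green(passed), smoke_gate_is_green(failed))
```

With this predicate the XML check stays red on its own for a failed or errored smoke, so the result no longer depends on whether the shell preserves pytest's exit status.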


Why
`release/v0.6.0` and `v0.6-integration` have diverged. The release branch has the ATIF/ADP export library but is missing the runtime wiring, so the artifacts never emit on a real run; the CHANGELOG's "emitted from every scored rollout" would be false as shipped. This PR brings the integration delta into the release candidate.

What this brings into the release
- `rollout.py::_write_trainer_artifact` now writes `trainer/atif.json` + `trainer/adp.jsonl` on every scored rollout (+ job-level ADP aggregation). Proven on a live run (28-step ATIF, 53-item ADP with terminal reward). Without this the emitters are dead code.
- `docs/benchmark-adoption.md` (dogfooded guide for `bench agent create/run/verify`) and `tests/test_agent_router_cli_e2e.py` (10 real-CLI integration tests; the prior tests were fakes-only).
- `BENCHFLOW_SMOKE_AGENT`/`MODEL` escape hatch for non-Anthropic contributors. New `tests/test_smoke_wiring.py`.
- `result.json` field surface pinned (no phantom top-level scalars); `BENCHFLOW_DAYTONA_AUTO_REAP` documented.

Full suite on `v0.6-integration` tip (93e0ae9d): 3,436 passed, 0 failed.

`v0.6-integration` is `0.6.0`; `release/v0.6.0` is on the `0.6.0rcN` cadence with its own CHANGELOG. Keep the release branch's version + CHANGELOG on merge; this PR's value is the code delta, not its version line.

Opened as a sync PR (not a force-push) per coordination between the two integration tracks.
Note
Low Risk
Mostly documentation and test coverage; smoke gate behavior changes release prep (skipped live tests now fail the gate) but does not alter production eval paths.
Overview
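Adds benchmark adoption documentation for `bench agent create|run|verify` and wires it into README and Mintlify nav.

Release gate / smoke: Launch-prep Step 3 now writes JUnit XML and fails if the live smoke was skipped or never ran (pytest exit 0 on skip no longer counts as green). Live smoke gains `BENCHFLOW_SMOKE_AGENT` + `BENCHFLOW_SMOKE_MODEL` (must be set together) and provider-aware credential checks so skips name the missing vars; optional openhands + deepseek path for non-Anthropic machines. New `test_smoke_wiring.py` and `test_agent_router_cli_e2e.py` lock router CLI behavior and the JUnit gate.

Docs / contract: `running-benchmarks.md` documents canonical `result.json` fields (`rewards.reward`, `agent_result.total_tokens`; no top-level `reward`/`status`). CLI docs note `BENCHFLOW_DAYTONA_AUTO_REAP=0`. `test_verify.py` asserts nested token totals and absence of phantom top-level scalars.

Reviewed by Cursor Bugbot for commit 93e0ae9.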
running-benchmarks.mddocuments canonicalresult.jsonfields (rewards.reward,agent_result.total_tokens; no top-levelreward/status). CLI docs noteBENCHFLOW_DAYTONA_AUTO_REAP=0.test_verify.pyasserts nested token totals and absence of phantom top-level scalars.Reviewed by Cursor Bugbot for commit 93e0ae9. Bugbot is set up for automated code reviews on this repo. Configure here.