sync: bring v0.6-integration into release/v0.6.0 (ATIF/ADP runtime emission + router docs/tests + smoke gate)#680

Merged
xdotli merged 13 commits into release/v0.6.0 from v0.6-integration
Jun 11, 2026
@xdotli xdotli commented Jun 11, 2026

Why

release/v0.6.0 and v0.6-integration have diverged. The release branch has the ATIF/ADP export library but is missing the runtime wiring, so the artifacts never emit on a real run — the CHANGELOG's "emitted from every scored rollout" would be false as shipped. This PR brings the integration delta into the release candidate.

What this brings into the release

  • 🔴 Critical: ATIF/ADP runtime emission (feat(trajectories): emit ATIF and ADP from every scored rollout #662): rollout.py::_write_trainer_artifact now writes trainer/atif.json + trainer/adp.jsonl on every scored rollout (+ job-level ADP aggregation). Proven on a live run (28-step ATIF, 53-item ADP with terminal reward). Without this the emitters are dead code.
  • Router docs + tests: docs/benchmark-adoption.md (dogfooded guide for bench agent create/run/verify) and tests/test_agent_router_cli_e2e.py (10 real-CLI integration tests; the prior tests were fakes-only).
  • Smoke gate false-green fix — a skipped live smoke now fails the launch-prep gate instead of silently passing; BENCHFLOW_SMOKE_AGENT/MODEL escape hatch for non-Anthropic contributors. New tests/test_smoke_wiring.py.
  • Doc truth — canonical result.json field surface pinned (no phantom top-level scalars); BENCHFLOW_DAYTONA_AUTO_REAP documented.

Full suite on v0.6-integration tip (93e0ae9d): 3,436 passed, 0 failed.

⚠️ For the release owner to resolve

Opened as a sync PR (not a force-push) per coordination between the two integration tracks.


Note

Low Risk
Mostly documentation and test coverage; smoke gate behavior changes release prep (skipped live tests now fail the gate) but does not alter production eval paths.

Overview
Adds benchmark adoption documentation for bench agent create|run|verify and wires it into README and Mintlify nav.

Release gate / smoke: Launch-prep Step 3 now writes JUnit XML and fails if the live smoke was skipped or never ran (pytest exit 0 on skip no longer counts as green). Live smoke gains BENCHFLOW_SMOKE_AGENT + BENCHFLOW_SMOKE_MODEL (must be set together) and provider-aware credential checks so skips name the missing vars; optional openhands + deepseek path for non-Anthropic machines. New test_smoke_wiring.py and test_agent_router_cli_e2e.py lock router CLI behavior and the JUnit gate.

Docs / contract: running-benchmarks.md documents canonical result.json fields (rewards.reward, agent_result.total_tokens; no top-level reward/status). CLI docs note BENCHFLOW_DAYTONA_AUTO_REAP=0. test_verify.py asserts nested token totals and absence of phantom top-level scalars.

Reviewed by Cursor Bugbot for commit 93e0ae9.

xdotli added 13 commits June 10, 2026 20:19
Document the bench agent create/run/verify benchmark-adoption router with
real captured output: the scaffolded file tree and per-file purpose, the
fail-closed create behavior, the --dry-run codex launch command, and the
parity-gate verdicts (parity-confirmed / parity-divergent /
insufficient-evidence) including the divergence issue draft. Link the guide
from the docs.json Guides nav and the README docs table.
…n/verify

Drive the real registered `bench agent create|run|verify` commands through
CliRunner against an on-disk benchmarks tree, with no fake exec/report layer on
the create/verify path. Covers: scaffold lands on disk and the generated Python
compiles + YAML parses; create refuses on re-run and rejects invalid slugs;
realistic parity records flip the verify verdict and exit code (confirmed exits
0, divergent exits non-zero and --issue-out writes a non-empty draft); run
--dry-run prints the codex command with the source + CONVERT.md context markers;
and the live run path fails closed on missing credentials without spawning
codex.
…-level scalars

The rollout result.json nests reward under rewards.reward and token totals
under agent_result.total_tokens/final_metrics; outcome is a derived
classification, not a stored field. There is intentionally no top-level
reward/total_tokens/status key, so a naive consumer reading result["reward"]
sees a missing key rather than a real value.

Document the canonical fields in running-benchmarks.md (correcting a stale
snippet that showed nonexistent top-level passed/verifier_output keys) and add
mutation-resistant contract tests pinning (1) the absence of invented
top-level scalars and (2) that a captured in-process token total round-trips
into agent_result.total_tokens instead of dropping to null.
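
As a rough illustration of the field surface described above, a consumer could read the nested values like this. This is only a sketch: the helper name `read_reward_and_tokens` and the sample payload are hypothetical, and only the key paths `rewards.reward` and `agent_result.total_tokens` come from the commit message.

```python
import json
from typing import Any, Optional


def read_reward_and_tokens(result: dict[str, Any]) -> tuple[Optional[float], Optional[int]]:
    """Read reward and token total from their nested locations.

    Returns None for a value whose parent object or key is absent,
    instead of raising KeyError.
    """
    reward = (result.get("rewards") or {}).get("reward")
    total_tokens = (result.get("agent_result") or {}).get("total_tokens")
    return reward, total_tokens


# Hypothetical payload shaped like the fields named above; the values are made up.
sample = json.loads(
    '{"rewards": {"reward": 1.0}, "agent_result": {"total_tokens": 1234}}'
)

reward, total_tokens = read_reward_and_tokens(sample)
print(reward, total_tokens)

# A naive top-level lookup finds nothing, which is the point of the contract.
print("reward" in sample)
```

Reading through `.get(...)` keeps a missing nested object from raising, while the absence of a top-level `reward` key stays visible to the caller.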
…greening

The live smoke is skipped via skipif (pytest exits 0) when Docker is down or
the chosen model has no credential. The launch-prep e2e gate ran it and
treated exit 0 as green, so a skipped smoke false-greened the e2e step on a
run that never executed.

- Gate Step 3 now writes JUnit XML and fails explicitly when the live smoke
  was skipped or never ran (counts summed over the nested <testsuite>
  elements, where pytest writes them).
- Add BENCHFLOW_SMOKE_AGENT / BENCHFLOW_SMOKE_MODEL so a contributor whose
  only credential is non-Anthropic can run the smoke against an agent/model
  they can authenticate (proven: openhands + deepseek) instead of skipping.
  Default claude-agent-acp + Haiku 4.5 behavior is unchanged.
- Skip reasons are now credential-aware and name the exact missing env var
  for the resolved model.
- tests/test_smoke_wiring.py covers the false-green path: skip-reason wiring,
  the escape hatch, and the gate's skipped-vs-passed JUnit predicate.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.

es=list(ET.parse("/tmp/smoke.xml").getroot().iter("testsuite")); \
t=sum(int(e.get("tests",0)) for e in es); \
s=sum(int(e.get("skipped",0)) for e in es); \
sys.exit(0) if t and not s else (print("LIVE SMOKE SKIPPED OR NOT RUN — gate is RED, not green") or sys.exit(1))'

Smoke gate ignores failures

High Severity

The new launch-prep JUnit follow-up only requires a positive tests count and zero skipped. It never sums failures or errors, so a live smoke that ran but failed—or errored during the session fixture—can still make the Python check exit 0 after pytest already reported failure.

Comment thread tests/test_smoke.py
return "no ANTHROPIC_API_KEY and no ~/.claude/.credentials.json"
return None
_, model = resolve_smoke_target()
return _missing_model_credentials(model)

Half-set smoke env crashes

Medium Severity

_smoke_skip_reason calls resolve_smoke_target(), which raises RuntimeError when only one of BENCHFLOW_SMOKE_AGENT and BENCHFLOW_SMOKE_MODEL is set. That surfaces as a pytest setup error instead of a named skip reason, which pairs poorly with the JUnit gate that does not treat errors as red.
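
One possible way to address this, sketched under assumptions: check the two variables before resolving the target and return a named reason when only one is set. The helper `half_set_smoke_env_reason` is hypothetical and is not part of the PR; only the two environment variable names and the "must be set together" rule come from the description above.

```python
import os
from typing import Mapping, Optional


def half_set_smoke_env_reason(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return a readable reason when exactly one of the two smoke variables is set.

    Returns None when both are set or both are unset, so the caller can go on
    to resolve the smoke target as usual.
    """
    agent = env.get("BENCHFLOW_SMOKE_AGENT")
    model = env.get("BENCHFLOW_SMOKE_MODEL")
    if bool(agent) == bool(model):
        return None
    missing = "BENCHFLOW_SMOKE_MODEL" if agent else "BENCHFLOW_SMOKE_AGENT"
    return (
        f"{missing} is not set; BENCHFLOW_SMOKE_AGENT and "
        "BENCHFLOW_SMOKE_MODEL must be set together"
    )


# Only the agent is set, so the reason names the missing model variable.
print(half_set_smoke_env_reason({"BENCHFLOW_SMOKE_AGENT": "openhands"}))
```

Whether a half-set configuration should skip or fail outright is a judgment call; either way the message names the variable to fix rather than surfacing as a setup error.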

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 93e0ae9d92

Comment on lines +98 to +102
.venv/bin/python -c 'import sys,xml.etree.ElementTree as ET; \
es=list(ET.parse("/tmp/smoke.xml").getroot().iter("testsuite")); \
t=sum(int(e.get("tests",0)) for e in es); \
s=sum(int(e.get("skipped",0)) for e in es); \
sys.exit(0) if t and not s else (print("LIVE SMOKE SKIPPED OR NOT RUN — gate is RED, not green") or sys.exit(1))'

P2: Check smoke failures before accepting JUnit counts

When the live smoke actually runs but fails (for example an assertion/verifier failure), pytest still writes a JUnit suite with tests=1 and skipped=0 while returning non-zero. This snippet then runs a second Python command whose predicate ignores failures/errors, so if the block is pasted into a normal shell without set -e, the final command can exit 0 and mark a failed smoke as green. Chain the pytest command with && or include failure/error counts in the XML predicate so the release gate preserves pytest's failure status.
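
The second option, folding failure and error counts into the XML predicate, might look roughly like the sketch below. The function name `smoke_gate_is_green` and the sample XML strings are hypothetical; the summing over nested `<testsuite>` elements mirrors the snippet under review, and `failures` and `errors` are the attributes pytest writes on each `<testsuite>` element of its JUnit report.

```python
import xml.etree.ElementTree as ET


def smoke_gate_is_green(junit_xml: str) -> bool:
    """Green only if at least one test ran and none skipped, failed, or errored."""
    root = ET.fromstring(junit_xml)
    # iter() also yields the root itself when the root is a <testsuite>.
    suites = list(root.iter("testsuite"))

    def total(attr: str) -> int:
        return sum(int(suite.get(attr, 0)) for suite in suites)

    return (
        total("tests") > 0
        and total("skipped") == 0
        and total("failures") == 0
        and total("errors") == 0
    )


# Hypothetical reports covering the four cases the gate has to tell apart.
PASSED = '<testsuites><testsuite tests="1" skipped="0" failures="0" errors="0"/></testsuites>'
FAILED = '<testsuites><testsuite tests="1" skipped="0" failures="1" errors="0"/></testsuites>'
SKIPPED = '<testsuites><testsuite tests="1" skipped="1" failures="0" errors="0"/></testsuites>'
NOT_RUN = "<testsuites/>"

for name, xml in [("passed", PASSED), ("failed", FAILED), ("skipped", SKIPPED), ("not run", NOT_RUN)]:
    print(name, smoke_gate_is_green(xml))
```

With this predicate the XML check alone is enough to keep a failed or errored smoke red, regardless of whether the shell running the gate uses `set -e`.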

@xdotli xdotli merged commit a19f57b into release/v0.6.0 Jun 11, 2026
1 of 2 checks passed
@xdotli xdotli deleted the v0.6-integration branch June 12, 2026 03:38