chore: release v0.6.0 #665
…ide state The pre-agent workspace snapshot /testbed_verify is seeded world-readable (chmod -R o+rX) so the root verifier can diff against it, but it was missing from _DEFAULT_LOCKED. That let the non-root agent read grading-side state in the snapshot — verifier/judge config, rubrics, expected outputs, and judge credentials — enabling reward forgery. Add /testbed_verify to _DEFAULT_LOCKED. Seeding runs before lockdown_paths in install_agent, so lockdown's chown root:root + chmod 700 closes the read window before the agent starts. The verifier runs as root and still reads the snapshot regardless of mode, so oracle/verifier execution is unaffected. Tests: pin /testbed_verify into the default and resolved locked sets (mutation-killers), assert the emitted lockdown command chowns root and chmod 700s it, and add a hermetic sandbox-level test proving the seed->lockdown sequence leaves no group/other read or traverse bits on the snapshot dir.
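The seed-then-lockdown ordering described above can be sketched with plain mode bits. This is an illustration only: a temporary directory stands in for /testbed_verify, the `chown root:root` half is omitted because it needs root, and the helper names `seed_world_readable`, `lock_down`, and `group_other_bits` are hypothetical, not benchflow functions.

```python
import os
import stat
import tempfile
from pathlib import Path


def seed_world_readable(root: Path) -> None:
    """Rough stand-in for `chmod -R o+rX`: others may read files and traverse dirs."""
    for dirpath, _dirnames, filenames in os.walk(root):
        directory = Path(dirpath)
        directory.chmod(directory.stat().st_mode | stat.S_IROTH | stat.S_IXOTH)
        for name in filenames:
            file_path = directory / name
            file_path.chmod(file_path.stat().st_mode | stat.S_IROTH)


def lock_down(root: Path) -> None:
    """Stand-in for the `chmod 700` half of lockdown on the snapshot directory."""
    root.chmod(0o700)


def group_other_bits(path: Path) -> int:
    """Return whichever group/other permission bits are still set on `path`."""
    return stat.S_IMODE(path.stat().st_mode) & (stat.S_IRWXG | stat.S_IRWXO)


with tempfile.TemporaryDirectory() as tmp:
    snapshot = Path(tmp) / "testbed_verify"
    snapshot.mkdir()
    (snapshot / "rubric.toml").write_text("expected = 1\n")

    seed_world_readable(snapshot)
    bits_after_seed = group_other_bits(snapshot)

    lock_down(snapshot)
    bits_after_lockdown = group_other_bits(snapshot)
```

Once the top-level directory has no group/other traverse bit, a non-owner cannot reach anything beneath it, which is why locking the directory itself is enough to close the read window in this sketch.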
…fold edit list
running-any-benchmark.md: Layer 2 conflated two distinct translation flows. Split them:
(a) migrate a task you control -> validate with bench tasks check (no parity gate);
(b) adopt a foreign benchmark -> bench agent create then prove with bench agent verify.
bench agent verify only scores an adopted benchmarks/<name>/ and errors 'benchmark not adopted ... run bench agent create first' on a migrated task.md.
agent-quickstart.md:
- STEP 7: add the three placeholder files the task.md scaffold also writes (verifier/verifier.md, verifier/rubrics/verifier.md, verifier/rubrics/verifier.toml) to the replace-the-placeholders list so a literal follower passes 'bench tasks check' first try.
- STEP 1: add a release-candidate install note. 0.6.0 is not yet on PyPI, so point testers at the GitHub prerelease wheel for the real 0.6 flow today and keep the post-publish 'benchflow==0.6.0' command. Add a one-line trap about running 'uv run bench' from inside the project dir.
… full init scaffold bench agent verify <name> crashed with AttributeError on a top-level JSON array parity_experiment.json (extract_criterion_comparisons / extract_reward_samples called .get on a list). Guard a non-Mapping payload in both parsers so verify returns a documented verdict (insufficient-evidence) instead of crashing, and never fabricates phantom zero-delta reward samples from summary records. Widen the related type hints (load_parity_experiment, build_verify_report) to match reality. bench tasks init's "Created:" summary under-reported the scaffold (omitted verifier/test_outputs.py and verifier/rubrics/verifier.toml, both validated by bench tasks check). Add scaffold_task() returning every file actually written (derived from disk), keep init_task() as a thin wrapper, and have the CLI list the real file set. Tests: top-level-array parity file -> verify verdict not exception (unit + CLI); init reported set == written set across format/no-pytest/no-oracle.
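A minimal sketch of the non-Mapping guard, assuming a simplified payload shape. The `samples`, `original`, and `converted` keys, the `verify` helper, and the `parity-not-confirmed` label are hypothetical; only the `insufficient-evidence` and `parity-confirmed` verdicts and the 0.02 tolerance appear in this PR.

```python
from collections.abc import Mapping
from typing import Any

TOLERANCE = 0.02  # reward-delta tolerance quoted elsewhere in this PR


def extract_reward_samples(payload: Any) -> list[tuple[float, float]]:
    """Return (original, converted) reward pairs, or [] when the payload is unusable."""
    if not isinstance(payload, Mapping):
        # A top-level JSON array has no `.get`; return no samples instead of crashing
        # and never invent zero-delta samples from it.
        return []
    raw = payload.get("samples")  # hypothetical key
    if not isinstance(raw, list):
        return []
    pairs = []
    for item in raw:
        if isinstance(item, Mapping) and "original" in item and "converted" in item:
            pairs.append((float(item["original"]), float(item["converted"])))
    return pairs


def verify(payload: Any) -> str:
    """Map a parsed parity file to a verdict string."""
    samples = extract_reward_samples(payload)
    if not samples:
        return "insufficient-evidence"
    if all(abs(a - b) <= TOLERANCE for a, b in samples):
        return "parity-confirmed"
    return "parity-not-confirmed"  # hypothetical label for the failing case
```

The point of the guard is that a list-shaped file produces a verdict rather than an exception, and that verdict is never a false `parity-confirmed`.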
…e gate
The artifacts section promised `bench agent verify` scores every shipped parity_experiment.json and that the recorded experiments span per-criterion verdict agreement. Verified against code, neither holds:
- `build_verify_report`/`extract_criterion_comparisons` require the file to be a JSON object; one shipped experiment is a top-level list of runs, which the gate cannot score, so `verify` does not work uniformly across the shipped set.
- No shipped experiment records per-criterion data (every `verify` run reports 0/0 criteria), so "spanning per-criterion verdict agreement" was aspirational.
Rewrite the section to state the object-shape requirement, give a worked `bench agent verify programbench` example (parity-confirmed, reward delta within the 0.02 tolerance), note the shipped set is non-uniform (others return insufficient-evidence; one is a list the gate cannot score), and reword the Layer 2 lead-in so it no longer implies verify scores every adopted benchmark.
…y' into integrate-iter3
iter2: verifier-home isolation (/testbed_verify locked from agent read). iter3: bench agent verify no longer crashes on list-shaped parity files (graceful verdict; guarded both parsers, avoids a false parity-confirmed); bench tasks init Created-summary now lists the real file set; docs L2 migrate-vs-adopt disambiguation + quickstart placeholder/RC-wheel fixes. Suite 3,437 green; harvey-lab verify confirmed non-crashing.
…th in authoring example Final cold-read nits (full flow already passed): the RC install example pinned an older tag (literal copy fetched rc.2); the authoring example used a relative hello.txt that could need an oracle iteration to resolve. Both docs-only; no code/wheel change.
Remove the dashboard/ dev status board (Linear/CI/HF-scores surface, never shipped in the wheel) and its 10 dashboard-only test modules. Redirect or reword every reference so nothing dangles:
- pyproject.toml: daytona>=0.184 floor comment now cites sandbox/daytona.py (the actual sync-client user) instead of the removed dashboard panel.
- tests/test_paths_symlink_helpers.py: drop the cross-ref to the removed symlink-ingestion test.
- experiments/skillsbench-fill/README.md: describe the ledger summary instead of the removed Experiments tab path.
- .claude/launch.json: drop the dashboard run config.
- docs/running-any-benchmark.md: neutral wording for downstream consumers.
Collapse labs/ by archiving the two 0.2.x-era research artifacts under labs/archive/ (git mv, evidence preserved):
- benchjack-sandbox-hardening, reward-hack-matrix
Add labs/archive/README.md and point README, docs/concepts.md, docs/sandbox-hardening.md, and .claude/dev-docs/labs.md at the archived location with a clear (historical, 0.2.x-era) note. The security story in docs/sandbox-hardening.md is unchanged.
Gate: pytest tests/ (excl. integration) green; no dangling dashboard/ paths or broken labs/ links remain.
…hecks
Pure structural quick wins in sandbox/daytona.py with no behavior change:
- Add _require_sandbox() helper and route the repeated 'if not self._sandbox: raise' preconditions through it so the type checker narrows AsyncSandbox|None at each call site (same falsiness check and error message as the inlined guards).
- Define _SDK_RETRY once and apply it to the six idempotent SDK helpers that inlined an identical tenacity policy; _create_sandbox and _stop_sandbox keep their distinct policies.
- Add _reject_non_main_service() for the three identical single-container ValueError guards in the direct strategy.
- Build CreateSandboxFromImageParams once in _DaytonaDirect.start by computing the image in an inner if/else (ownership label preserved).
- Drop the no-op 'try/finally: pass' wrapper in _sandbox_exec, keeping the explanatory comment.
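The `_require_sandbox()` idea can be sketched as follows. `FakeSandbox`, `Environment`, and the error text are hypothetical stand-ins; the real helper lives in sandbox/daytona.py and returns the SDK's sandbox type.

```python
from typing import Optional


class FakeSandbox:
    """Hypothetical stand-in for the SDK's AsyncSandbox."""

    def run(self, cmd: str) -> str:
        return f"ran: {cmd}"


class Environment:
    def __init__(self) -> None:
        self._sandbox: Optional[FakeSandbox] = None

    def _require_sandbox(self) -> FakeSandbox:
        # One shared precondition. The non-Optional return type lets a type
        # checker treat the result as FakeSandbox at every call site.
        if not self._sandbox:
            raise RuntimeError("sandbox not started")  # hypothetical message
        return self._sandbox

    def start(self) -> None:
        self._sandbox = FakeSandbox()

    def exec(self, cmd: str) -> str:
        # No inline `if not self._sandbox: raise` needed here any more.
        return self._require_sandbox().run(cmd)
```

Returning the narrowed value, rather than only asserting, is what removes the repeated guard from each method body.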
- Hoist shutil to a single top-level import; drop redundant local re/tempfile/shlex/shutil imports.
- Extract _record_agent_timeout() shared by both TimeoutError handlers.
- Add RolloutConfig._primary_role; collapse the three primary_* properties to one-liners.
- Initialize _effective_task_path/_task_tmp/_session_traj_count/_session_tool_count in __init__ and drop the now-redundant getattr guards on the install_agent()/execute() reads. Guards in _capture_partial_acp_trajectory() and cleanup() are kept: they intentionally tolerate Rollout.__new__() test stubs that skip __init__ (documented in-code).
…ult reporter
Pure structural cleanups in the CLI lockdown surface (zero behavior change):
- lockdown: drop the dead _under_path helper (the live copy is inlined in the trusted-PATH script string).
- cli: extract _report_eval_result for the duplicated Score/errors summary print used by job and eval create.
- cli: collapse the two near-identical EvaluationConfig constructions in eval create into one _make_eval_config(source_provenance=...) builder.
- cli: add benchflow/cli/_options.py with reusable Typer option aliases (--agent/--model/--sandbox/--concurrency/--jobs-dir/--skill-mode) and apply them across commands; per-command defaults stay at the call site so --help output is byte-identical.
The daytona module is import-safe without the SDK (#358), so the factory's import guard never tripped and selecting the Daytona environment leaked a raw ImportError deep inside DaytonaSandbox.__init__ (after CPU/memory clamping). Force _load_daytona_sdk() at env selection and route the failure through _raise_missing_optional_sandbox_dependency, matching the modal branch — a clear 'uv sync --extra sandbox-daytona' / 'pip install benchflow[sandbox-daytona]' error. Widen the helper's exc type to ImportError. Add a focused unit test.
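A rough sketch of the eager-load pattern: import the optional SDK at selection time and convert a raw ImportError into an actionable one. `load_optional_sdk` is a hypothetical name; the install hints are the ones quoted in the commit message.

```python
import importlib
from types import ModuleType


def load_optional_sdk(module_name: str, extra: str) -> ModuleType:
    """Import an optional SDK now, so a missing extra fails early and clearly."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"The '{module_name}' package is required for this sandbox. "
            f"Install it with 'uv sync --extra {extra}' or "
            f"'pip install benchflow[{extra}]'."
        ) from exc
```

Calling such a loader when the environment is selected, rather than inside the sandbox constructor, means the user sees the install hint before any resource clamping or setup work runs.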
v0.6 renamed SDK.run(trial_name=...) to rollout_name with no alias, a hard break for pre-v0.6 callers. Re-add an optional 'trial_name' keyword that maps to 'rollout_name' and emits a DeprecationWarning (matching the warnings.warn( ..., DeprecationWarning, stacklevel=2) style used in task/config.py). Passing both is ambiguous and raises TypeError. Adds tests/test_sdk_run_alias.py covering the warn+map path, the no-warn rollout_name path, and the TypeError on both.
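The alias behavior can be sketched on a free function. This `run` is a simplified stand-in for `SDK.run`, and the warning and error texts are assumptions; only the keyword names, the `DeprecationWarning` with `stacklevel=2`, and the `TypeError` on both come from the commit message.

```python
import warnings
from typing import Optional


def run(*, rollout_name: Optional[str] = None, trial_name: Optional[str] = None) -> Optional[str]:
    """Accept the deprecated `trial_name` keyword and map it to `rollout_name`."""
    if trial_name is not None:
        if rollout_name is not None:
            # Both names supplied: refuse to guess which one the caller meant.
            raise TypeError("pass either 'rollout_name' or 'trial_name', not both")
        warnings.warn(
            "'trial_name' is deprecated; use 'rollout_name'",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller's line
        )
        rollout_name = trial_name
    return rollout_name
```

A pre-v0.6 caller using `run(trial_name="t1")` keeps working and sees one warning, while `run(rollout_name="t1")` stays silent.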
The strict validate_reward_map rejects any unrecognized non-numeric
top-level key (e.g. the Harbor-era {"reward":1.0,"done":true}) and any
non-numeric metric, so the classic test.sh->reward.json flow fails the
whole run. Add an opt-in lenient path aligned with the existing
validation abstraction.
- validate_reward_map(..., lenient=False): new flag. Lenient drops
unrecognized/non-numeric top-level keys and non-numeric metric entries
(and malformed recognized-structured keys) with a single warnings.warn
listing everything dropped, instead of raising. Still requires a usable
scalar reward: from 'reward', else a numeric 'score'/'rewards' alias,
else numeric metrics + a declared aggregate policy. Default stays strict
(no behaviour change).
- reward_lenient_from_env(): reads BENCHFLOW_REWARD_LENIENT (truthy
1/true/yes/on), matching benchflow's BENCHFLOW_* operator-toggle
convention. Threaded into Verifier._parse_reward_json and the final
_ensure_canonical_rewards gate in rollout.py.
- tests/test_reward_lenient.py: strict still raises on {reward,done};
lenient drops done + non-numeric metrics, keeps reward, warns; alias
derivation; env parsing.
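A simplified sketch of the strict and lenient paths. It covers only the `reward` key, the `score`/`rewards` aliases, and a `metrics` mapping, and it leaves out the aggregate-policy fallback. The internals, the error messages, and the `env` parameter on `reward_lenient_from_env` are assumptions made so the example is self-contained.

```python
import os
import warnings
from numbers import Real
from typing import Any, Mapping, Optional

_ALIASES = ("score", "rewards")
_TRUTHY = {"1", "true", "yes", "on"}


def _is_number(value: Any) -> bool:
    # bool is a subclass of int, but `"done": true` must not count as numeric.
    return isinstance(value, Real) and not isinstance(value, bool)


def validate_reward_map(data: Mapping[str, Any], *, lenient: bool = False) -> dict:
    dropped: list[str] = []

    def reject(key: str) -> None:
        if not lenient:
            raise ValueError(f"invalid reward entry: {key!r}")
        dropped.append(key)

    clean: dict = {}
    for key, value in data.items():
        if key == "metrics" and isinstance(value, Mapping):
            kept = {}
            for name, metric in value.items():
                if _is_number(metric):
                    kept[name] = float(metric)
                else:
                    reject(f"metrics.{name}")
            clean["metrics"] = kept
        elif _is_number(value):
            clean[key] = float(value)
        else:
            reject(key)

    if "reward" not in clean:
        # Fall back to a numeric alias; still an error if none is usable.
        alias = next((a for a in _ALIASES if a in clean), None)
        if alias is None:
            raise ValueError("no usable scalar reward")
        clean["reward"] = clean[alias]

    if dropped:
        # One warning listing everything that was dropped.
        warnings.warn(f"dropped reward entries: {', '.join(dropped)}", stacklevel=2)
    return clean


def reward_lenient_from_env(env: Optional[Mapping[str, str]] = None) -> bool:
    source = os.environ if env is None else env
    return source.get("BENCHFLOW_REWARD_LENIENT", "").strip().lower() in _TRUTHY
```

With this shape, the Harbor-era `{"reward": 1.0, "done": true}` payload raises in strict mode and yields `{"reward": 1.0}` plus one warning in lenient mode.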
strip_provider_prefix leaves a bare model id (e.g. deepseek-v4-flash) that find_provider can no longer match, so the openclaw shim's _infer_provider_prefix defaulted every non-gemini/gpt id to anthropic -- silently running deepseek/glm/qwen/... as anthropic. Add a registry-owned find_provider_for_bare_model(model): each provider declares the bare-model family tokens it owns via a new ProviderConfig.model_prefixes field, and the helper does longest-token matching (with a non-letter family boundary so glm matches glm-4.6 / glm5 but not glmnext), falling back to an unambiguous declared models[].id. _infer_provider_prefix now consults it before the native gemini/gpt heuristics, then anthropic -- existing gemini/gpt/anthropic behavior unchanged. Tests: tests/test_bare_model_provider.py.
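The longest-token match with a non-letter family boundary might look like the sketch below. The registry contents and provider names are invented for illustration, and the fallback to a declared `models[].id` is left out.

```python
from typing import Optional

# Hypothetical registry: provider name -> bare-model family tokens it owns.
MODEL_PREFIXES = {
    "provider_glm": ("glm",),
    "provider_a": ("deepseek",),
    "provider_b": ("deepseek-coder",),
}


def find_provider_for_bare_model(model: str) -> Optional[str]:
    """Return the provider owning the longest matching family token, if any."""
    name = model.lower()
    best: Optional[tuple[int, str]] = None  # (token length, provider)
    for provider, tokens in MODEL_PREFIXES.items():
        for token in tokens:
            if not name.startswith(token):
                continue
            rest = name[len(token):]
            if rest and rest[0].isalpha():
                # Family boundary: "glmnext" is a different word, not the glm family.
                continue
            if best is None or len(token) > best[0]:
                best = (len(token), provider)
    return best[1] if best else None
```

Under these assumptions `glm-4.6` and `glm5` resolve to the glm owner, `glmnext` resolves to nothing, and `deepseek-coder-v2` prefers the longer `deepseek-coder` token over `deepseek`.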
…integrate-cleanup
16-probe deterministic sweep (task.md gate, CLI arg validation, provider/agent wiring, judge routing, ATIF/ADP, Daytona reaper) with adversarial verification, plus 4 live rollout campaigns (oracle Docker/Daytona, deepseek-v4-flash openhands Daytona, Docker/Daytona parity isolation, live gemini-3.1-flash-lite judge). 9 confirmed defects (2 P1, 6 P2, 3 P3) + 1 broken example oracle; 4 false alarms refuted. Engine, trajectory artifacts, secret redaction, reaper ownership-scoping, and conversion gate all pass under stress (3473/3474 fast tests).
From a 92-agent dead-code audit of the release HEAD. This PR removes only the high-confidence, pure-deletion subset: symbols re-verified zero-reference WITH CLASS CONTEXT (the audit's blanket signal flagged e.g. RolloutPaths.log_path as "dead" while 14 unrelated `.log_path` attributes existed on other classes; each item here was disambiguated against its real reference sites):
- task/paths.py: 7 unused `*_path` `@property` accessors: TaskPaths.{readme_path, gitignore_path, verifier_document_path} and RolloutPaths.{artifacts_manifest_path, result_path, exception_message_path, log_path}. The underlying files are still written/read via inline `dir / "name"`; only the unused accessors go.
- sandbox/modal_impl.py: ModalSandbox.{supports_gpus, can_disable_internet}, vestigial capability properties not on the Sandbox Protocol (GPU config reads task_env_config.gpus; internet policy uses SandboxConfig.allow_internet).
- cli/continue_cmd.py: unused module-level `logger` + its `import logging`.
- experimental/mcp/hooks.py: orphaned `mcp_service_hooks_from_config` (not exported, no callers); `mcp_reviewer_hook` is kept (referenced in docs).
No public API touched. Full suite 4069 passed; ruff/format/ty clean.
Deferred to a follow-up (need per-site verification or an outward-facing decision, NOT included here): dataclass fields with positional-construction risk (AgentLift/JudgeConfig/ServiceConfig/ToolCall/SandboxReplayProxy), back-compat aliases requiring test repoints (scoring.classify_result_dict/count_result_outcomes, SDK pass-through wrappers, RuntimeResult.to_run_result), and the public-`__all__` / Protocol-surface items (OTelCollector, ACP types models, the Sandbox.read_file/write_file runtime_checkable isinstance-trap, RoundResult.n_tool_calls).
chore: dead-code purge round 2 (verified zero-reference)
Continues the audit-driven cleanup. The keystone fix is the Sandbox Protocol:
- Removed `read_file`/`write_file` from the `@runtime_checkable` `Sandbox` Protocol. No docker/daytona/modal backend implements them (backends expose the `upload_file`/`download_file` family) and there were zero call sites on a benchflow Sandbox; they were a latent `isinstance` trap on the contract surface. (`harvey_lab_acp_shim.DirectSandbox` and the deepagents test workspace define their own `read_file`/`write_file`; those are unrelated classes, not this Protocol.)
Plus round-3 dead code, each symbol re-verified zero-reference WITH CLASS CONTEXT (the audit's blanket signal mislabeled a couple of items, e.g. `session_load`, which actually has a test caller and was therefore NOT removed):
- `TaskMetrics.audit_outcome` property + its now-orphaned `classify_audit_outcome` import (`TaskMetrics.outcome` deliberately kept: a `result.outcome` ref could not be cleanly disambiguated, so it waits for a follow-up).
- `OTelCollector.endpoint`, `ReplayRouter.cursor` (both unused `@property` accessors; the look-alike `.endpoint`/`.cursor` reads belong to other classes).
- `RuntimeResult.to_run_result` (legacy SDK-compat converter, only caller was its own test; both removed).
- Never-read dataclass fields `ToolCall.output`, `JudgeConfig.{reference, prompt_template}` (+ their `_parse_judge` writes + a docs/llm-judge.md row).
- Write-only `ReplayProxy._host`; inert `AgentProtocolError.code` annotation; unused `retry_if_exception_type` import + fallback in `sandbox/daytona.py`.
Also renamed the misleading `TestReadFileError` test class (it only exercised `ExecResult`, never `read_file`). No public API affected. Full suite 4068 passed; ruff/format/ty clean; an adversarial second-pass review confirmed all 10 removals are unreferenced.
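The `isinstance` trap can be reproduced in a few lines. The two Protocols and the `Backend` class below are simplified stand-ins (the real `exec` returns an `ExecResult`, not a `str`), but the `runtime_checkable` behavior shown is standard `typing` semantics: the check requires every declared member to be present.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SandboxWithStubs(Protocol):
    """Protocol shape before the cleanup: declares a method no backend has."""

    def exec(self, cmd: str) -> str: ...
    def upload_file(self, src: str, dst: str) -> None: ...
    def read_file(self, path: str) -> str: ...


@runtime_checkable
class Sandbox(Protocol):
    """Protocol shape after the cleanup: only what backends really expose."""

    def exec(self, cmd: str) -> str: ...
    def upload_file(self, src: str, dst: str) -> None: ...


class Backend:
    """A backend implementing the real file-transfer family only."""

    def exec(self, cmd: str) -> str:
        return ""

    def upload_file(self, src: str, dst: str) -> None:
        return None


backend = Backend()
trapped = isinstance(backend, SandboxWithStubs)  # read_file is missing
accepted = isinstance(backend, Sandbox)
```

In this sketch `trapped` is False and `accepted` is True, so any `isinstance(x, Sandbox)` check against the old shape would have rejected every working backend.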
…ocol chore: remove dead Sandbox Protocol stubs + dead-code purge round 3
…file removal The `Sandbox` Protocol example still listed `read_file`/`write_file` — the unimplemented stub methods removed from the Protocol in #745 (no backend implements them; a custom sandbox would be misled into "must implement" them). Replace with the real file-transfer contract (`upload_file`/`download_file`) and correct the `exec` signature to match `sandbox/protocol.py` (`cmd: str` → `ExecResult`, not `cmd: list[str]` → `SandboxExecResult`).
docs(python-api): fix Sandbox Protocol example after read_file/write_file removal
`OTelCollector` was a designed-but-never-wired OTLP/HTTP receiver from the v2
rewrite: it would have captured an agent's OpenTelemetry GenAI spans (prompts,
completions, model, token usage) into a benchflow Trajectory. But repo-wide it
is never instantiated, has zero tests, and is not part of any run path — the
default capture goes through ACP session events and the LiteLLM callback path.
Its only references were its own definition + the public `__all__` re-exports.
Per maintainer decision ("remove for now"), drop the `trajectories/otel.py`
module and its exports from `benchflow.__all__` and `benchflow.trajectories`.
This removes a public name, so it is a (minor) breaking change — appropriate in
the RC window. Re-add with a test + real wiring if OTel-based capture is revived.
Full suite 4068 passed; ruff/ty clean; `import benchflow` healthy.
chore: remove the unwired OTelCollector (public-API removal)
Pre-release cleanup of stale notes (per maintainer call):
- `.claude/dev-docs/0.3-plan.md`: the 0.3-era dev plan; unreferenced and superseded (we're at 0.6). The actively-referenced dev-docs (architecture/harden-sandbox/labs/tested-agents, used by the review skills) are kept.
- `docs/architecture-explorer.html`: an unlinked, stale generated explorer artifact (still showed the pre-rename 7-group CLI taxonomy); not in the docs nav and referenced nowhere.
- `docs/task-standard-roadmap.md`: the roadmap doc; removed from both `docs.json` navs and de-linked from `task-standard.md` (the "Open Primitives and Roadmap" subsection that only pointed at it is dropped).
Docs-only. Both docs.json files validate; no dangling references remain; the CLI<->docs drift guard passes.
chore(docs): remove stale dev/old notes
- pyproject: 0.6.0.dev0 -> 0.6.0 (final release version; __version__ derives from package metadata, so this is the single source of truth).
- CHANGELOG: fold the accumulated `[Unreleased]` RC-loop entries into a single dated `## 0.6.0 — 2026-06-13` section, merging the two duplicate 0.6.0 blocks (the 06-10 draft + the [Unreleased] hardening) into one: Added / Changed / Renamed / Removed / Fixed, each appearing once. Updated the Added adoption-router bullet to the shipped `bench eval adopt` names (was the pre-rename `bench agent create|run|verify`).
NOTE: install-doc wheel URLs still point at the rc.6 wheel; switch those to the final 0.6.0 / PyPI install at publish time (changing now would 404 until the wheel exists).
Full suite 4068 passed; ruff/ty clean.
chore: cut 0.6.0 — bump version + consolidate CHANGELOG
`test_environment_group_is_hidden_but_still_resolves` matched `│`-anchored rows in `bench --help` raw output WITHOUT stripping ANSI. On CI, Rich emits color codes (FORCE_COLOR), so the `^\s*│` anchor matched nothing → `rows == set()` → `assert 'sandbox' in set()` failed. It passed locally only because there were no ANSI codes. This was the sole failure on #665's `test` gate (1 failed / 4112 passed), blocking the 0.6.0 release. Fix: assert against the Click command registry (`typer.main.get_command(app)` + `.hidden`) instead of a regex over rendered help — authoritative and immune to ANSI/locale rendering differences. Verified the fix passes under FORCE_COLOR=1 (which reproduces the CI failure with the old code). The two other help-row parsers (test_cli_docs_drift, test_cli_adopt_aliases) already strip ANSI, so they were unaffected.
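The failure mode is easy to reproduce with the standard library. The help text below is invented sample output, not real `bench --help` output, and `command_rows` is a hypothetical helper. The sketch shows the stripping approach that the other two help-row parsers are said to use; the fix in this commit queries the Click command registry instead.

```python
import re

ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")  # SGR color/style escape sequences
ROW_RE = re.compile(r"^\s*│\s*(\S+)", re.MULTILINE)

# Invented sample rows: the same table with and without color codes.
plain = "│ sandbox   Manage sandboxes │\n│ tasks     Author tasks │"
colored = "\x1b[2m│\x1b[0m \x1b[1;36msandbox\x1b[0m   Manage sandboxes \x1b[2m│\x1b[0m"


def command_rows(text: str) -> set:
    """Collect the first word of every `│`-anchored row."""
    return set(ROW_RE.findall(text))


rows_plain = command_rows(plain)
rows_colored_raw = command_rows(colored)  # the escape byte precedes `│`
rows_colored_clean = command_rows(ANSI_RE.sub("", colored))
```

Because the escape sequence sits between the start of the line and the `│` character, the anchored pattern finds no rows in the raw colored text, which matches the `rows == set()` symptom described above.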
test(cli): fix CI-fragile help-parse blocking the 0.6.0 release gate
#750's test rewrite passed `ruff check` (lint) but I didn't run `ruff format --check` — the CI `test` job runs format-check as a gating step and failed on `Would reformat: tests/test_cli_hub_env.py`. This applies ruff format (collapses the set comprehension to ruff's canonical form). No logic change. Verified: ruff format --check + ruff check + ty + the test file all pass.
style: ruff-format tests/test_cli_hub_env.py (CI gate fix)
…s-repros/)
A fan-out dead-docs sweep (adversarially verified against release/v0.6.0) confirmed these as orphaned stray artifacts with ZERO inbound references:
- STRESS-TEST-v0.6.0.md: a one-off, dated (2026-06-11, rc3) stress-test campaign report at the repo root. Exhausted: the defects it found were fixed and filed in Linear (ENG-248..257). Not in either docs.json nav.
- stress-repros/: the 12-file scratch repro tree (README, LINEAR-ISSUES, and 10 p*.sh scripts) for that campaign. Never wired into the test suite or CI; references an out-of-repo /tmp env file. Its only internal link was the back-link to STRESS-TEST-v0.6.0.md (also removed), so no dangling refs remain. Substantive content lives in Linear.
Docs/scratch-only; no code, test, nav, or CI reference touched (verified by grep across md/json/py/toml/yml).
chore: purge dead stress-test artifacts

Release v0.6.0
The task.md standard release. See CHANGELOG.md for the full list.
Headline
- `bench tasks init/check/migrate/export`) with a leaderboard-grade acceptance gate
- `bench agent` adoption router (create/run/verify): scaffolds, drives conversion, and runs the parity gate
- `ors-episode` verifier recognition. The hosted-environment episode runner is in progress (draft feat: OpenReward hosted-environment inbound runner #660) and is NOT in this release.
Evidence
Checklist
Drafts to land separately: #660 (OpenReward episode runner — account credits), #661 (AgentBeats — @Yiminnn review).
Note
Medium Risk
Removes the status dashboard and large internal planning docs while adding a secrets-dependent integration workflow; release notes claim broad product/CLI changes reviewers should cross-check against the full branch beyond this doc-heavy diff.
Overview
Release packaging for 0.6.0: `CHANGELOG.md` is finalized with the 0.6 feature set (native `task.md`, `bench eval adopt`, ATIF/ADP outputs, ORS interop, Daytona auto-reap, CLI/sandbox fixes, and removals such as the legacy CLI tree, `experiments/`, and unwired OTel). `CITATION.cff` bumps to 0.6.0 with an updated release date.
Public docs and positioning: `README.md` is rewritten around the “universal environment framework” story: RC wheel install, quickstart via `bench eval create`, `task.md` authoring, hosted `--source-env`, loop strategies, and clearer separation of `bench agent` vs `bench eval adopt`. `AGENTS.md` updates Bedrock Opus 4.8 guidance to the LiteLLM proxy patch path. `benchmarks/CONVERT.md` documents `bench eval adopt verify` parity inputs and switches adoption runs to `Evaluation` instead of `Job`.
Repo cleanup: The entire `dashboard/` tree and the Claude `dashboard` launch config are removed; internal 0.3 plan doc (`.claude/dev-docs/0.3-plan.md`) is deleted. Labs index (`.claude/dev-docs/labs.md`) now points at archived `docs/labs/` instead of inlining runbooks.
Contributor/agent tooling: Claude skills (`benchflow`, `docs-review`, `branch-review`, `launch-prep`) are aligned with 0.6 CLI (`bench eval metrics/view`, `bench tasks init`, `jobs/` layout, RC install). `launch-prep` tightens the live smoke gate so an all-skipped pytest run cannot false-green e2e.
CI: New `.github/workflows/integration-eval.yml`: nightly/manual one-task Docker rollout, agent-judge gate, integration pytest slice (excluding duplicate agent rollout), artifact upload; fork-guarded and secrets-backed on the canonical repo.
Reviewed by Cursor Bugbot for commit cb2dca9.