chore: release v0.6.0 by xdotli · Pull Request #665 · benchflow-ai/benchflow

xdotli · 2026-06-10T23:06:01Z

Release v0.6.0

The task.md standard release. See CHANGELOG.md for the full list.

Headline

task.md task standard + authoring CLI (bench tasks init/check/migrate/export) with a leaderboard-grade acceptance gate
bench agent adoption router (create/run/verify) — scaffolds, drives conversion, and runs the parity gate
ATIF + ADP trajectory artifacts emitted from every scored rollout
OpenReward (ORS) reward-format interop — reward export + ors-episode verifier recognition. The hosted-environment episode runner is in progress (draft feat: OpenReward hosted-environment inbound runner #660) and is NOT in this release.
Daytona sandbox auto-reap — ownership-scoped: only benchflow-labeled sandboxes are reaped; foreign sandboxes on a shared key are never touched.
Quickstart/CLI docs reconciled with observed behavior; 15 dogfood findings closed

Evidence

Full suite: 3,406 passed, 0 failed
Layer-2 parity (49 tasks × 3 trials, deepseek-v4-flash): mean delta −2.4% ± 3.7% SE, 95% CI includes zero, 0 conversion defects across all 26 triaged divergent cells (independently recomputed from raw rollouts in audit)
Eight-lens adversarial ultraaudit run pre-tag: 2 P0 + several P1 found and fixed — destructive-reaper scoping, the false OpenReward claim corrected, the user-loop regression, citation + test guards. Security + privacy clean.
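
The "95% CI includes zero" statement for the Layer-2 parity figure can be checked from the reported mean and standard error. This is a minimal sketch that assumes a two-sided normal approximation (z = 1.96); the PR does not say which interval method the audit used.

```python
# Normal-approximation 95% interval for the reported Layer-2 parity delta.
mean_delta = -2.4   # percent, as reported in the Evidence list
std_error = 3.7     # percent, as reported in the Evidence list
z_95 = 1.96         # assumption: two-sided normal critical value

ci_low = mean_delta - z_95 * std_error
ci_high = mean_delta + z_95 * std_error
includes_zero = ci_low <= 0.0 <= ci_high

print(f"95% CI: [{ci_low:.2f}%, {ci_high:.2f}%], includes zero: {includes_zero}")
```

Under that assumption the interval is roughly [-9.65%, 4.85%], which straddles zero.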

Checklist

CI gate passes (ruff format/check, full pytest, uv.lock)
e2e live smoke (Docker) — run before merge
CHANGELOG updated · CITATION.cff bumped to 0.6.0
Version bumped to 0.6.0
Ultraaudit blockers resolved
Merge and tag after review (do not merge automatically)

Drafts to land separately: #660 (OpenReward episode runner — account credits), #661 (AgentBeats — @Yiminnn review).

Note

Medium Risk
Removes the status dashboard and large internal planning docs while adding a secrets-dependent integration workflow; release notes claim broad product/CLI changes reviewers should cross-check against the full branch beyond this doc-heavy diff.

Overview
Release packaging for 0.6.0 — CHANGELOG.md is finalized with the 0.6 feature set (native task.md, bench eval adopt, ATIF/ADP outputs, ORS interop, Daytona auto-reap, CLI/sandbox fixes, and removals such as the legacy CLI tree, experiments/, and unwired OTel). CITATION.cff bumps to 0.6.0 with an updated release date.

Public docs and positioning — README.md is rewritten around the “universal environment framework” story: RC wheel install, quickstart via bench eval create, task.md authoring, hosted --source-env, loop strategies, and clearer separation of bench agent vs bench eval adopt. AGENTS.md updates Bedrock Opus 4.8 guidance to the LiteLLM proxy patch path. benchmarks/CONVERT.md documents bench eval adopt verify parity inputs and switches adoption runs to Evaluation instead of Job.

Repo cleanup — The entire dashboard/ tree and the Claude dashboard launch config are removed; internal 0.3 plan doc (.claude/dev-docs/0.3-plan.md) is deleted. Labs index (.claude/dev-docs/labs.md) now points at archived docs/labs/ instead of inlining runbooks.

Contributor/agent tooling — Claude skills (benchflow, docs-review, branch-review, launch-prep) are aligned with 0.6 CLI (bench eval metrics/view, bench tasks init, jobs/ layout, RC install). launch-prep tightens the live smoke gate so an all-skipped pytest run cannot false-green e2e.

CI — New .github/workflows/integration-eval.yml: nightly/manual one-task Docker rollout, agent-judge gate, integration pytest slice (excluding duplicate agent rollout), artifact upload; fork-guarded and secrets-backed on the canonical repo.

Reviewed by Cursor Bugbot for commit cb2dca9.

chatgpt-codex-connector · 2026-06-10T23:06:07Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, add credits to your account and enable them for code reviews in your settings.

mintlify · 2026-06-10T23:06:25Z

Preview deployment for your docs. Learn more about Mintlify Previews.

Project	Status	Preview	Updated (UTC)
benchflow-bff148e7	🔴 Failed	–	Jun 10, 2026, 11:06 PM

…ide state

The pre-agent workspace snapshot /testbed_verify is seeded world-readable (chmod -R o+rX) so the root verifier can diff against it, but it was missing from _DEFAULT_LOCKED. That let the non-root agent read grading-side state in the snapshot (verifier/judge config, rubrics, expected outputs, and judge credentials), enabling reward forgery.

Add /testbed_verify to _DEFAULT_LOCKED. Seeding runs before lockdown_paths in install_agent, so lockdown's chown root:root + chmod 700 closes the read window before the agent starts. The verifier runs as root and still reads the snapshot regardless of mode, so oracle/verifier execution is unaffected.

Tests: pin /testbed_verify into the default and resolved locked sets (mutation-killers), assert the emitted lockdown command chowns root and chmod 700s it, and add a hermetic sandbox-level test proving the seed->lockdown sequence leaves no group/other read or traverse bits on the snapshot dir.
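
The seed->lockdown property described above can be illustrated with a small permission check. This is a sketch under stated assumptions, not the repo's test: it uses a temporary directory in place of /testbed_verify, skips the chown to root (which needs privileges), and `seed_then_lock` is a hypothetical helper name. It assumes a POSIX filesystem.

```python
import os
import stat
import tempfile


def seed_then_lock(path: str) -> int:
    """Seed a directory as world-readable, then lock it down to owner-only.

    Returns the final permission bits. The chown root:root step from the
    commit message is omitted here because it requires root.
    """
    os.chmod(path, 0o755)  # seed: others can read and traverse (like o+rX)
    os.chmod(path, 0o700)  # lockdown: owner-only
    return stat.S_IMODE(os.stat(path).st_mode)


with tempfile.TemporaryDirectory() as tmp:
    snapshot = os.path.join(tmp, "testbed_verify")  # stand-in for /testbed_verify
    os.mkdir(snapshot)
    final_mode = seed_then_lock(snapshot)
    # Any group/other bit left set would mean the agent could still read or traverse.
    leaked_bits = final_mode & (stat.S_IRWXG | stat.S_IRWXO)

print(oct(final_mode), "leaked bits:", oct(leaked_bits))
```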

…o integrate-iso

…fold edit list

running-any-benchmark.md: Layer 2 conflated two distinct translation flows. Split them:
(a) migrate a task you control -> validate with bench tasks check (no parity gate);
(b) adopt a foreign benchmark -> bench agent create then prove with bench agent verify.
bench agent verify only scores an adopted benchmarks/<name>/ and errors 'benchmark not adopted ... run bench agent create first' on a migrated task.md.

agent-quickstart.md:
- STEP 7: add the three placeholder files the task.md scaffold also writes (verifier/verifier.md, verifier/rubrics/verifier.md, verifier/rubrics/verifier.toml) to the replace-the-placeholders list so a literal follower passes 'bench tasks check' first try.
- STEP 1: add a release-candidate install note. 0.6.0 is not yet on PyPI, so point testers at the GitHub prerelease wheel for the real 0.6 flow today and keep the post-publish 'benchflow==0.6.0' command. Add a one-line trap about running 'uv run bench' from inside the project dir.

… full init scaffold

bench agent verify <name> crashed with AttributeError on a top-level JSON array parity_experiment.json (extract_criterion_comparisons / extract_reward_samples called .get on a list). Guard a non-Mapping payload in both parsers so verify returns a documented verdict (insufficient-evidence) instead of crashing, and never fabricates phantom zero-delta reward samples from summary records. Widen the related type hints (load_parity_experiment, build_verify_report) to match reality.

bench tasks init's "Created:" summary under-reported the scaffold (omitted verifier/test_outputs.py and verifier/rubrics/verifier.toml, both validated by bench tasks check). Add scaffold_task() returning every file actually written (derived from disk), keep init_task() as a thin wrapper, and have the CLI list the real file set.

Tests: top-level-array parity file -> verify verdict not exception (unit + CLI); init reported set == written set across format/no-pytest/no-oracle.
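
The non-Mapping guard described above might look roughly like the following. This is a hedged sketch: the function names carry a `_sketch` suffix because they are illustrative, and both the `reward_samples` key and the `has-evidence` verdict string are assumptions, not the real parity_experiment.json schema or the gate's verdict set.

```python
import json
from collections.abc import Mapping


def extract_reward_samples_sketch(payload: object) -> list[float]:
    """Return numeric reward samples, or nothing if the payload is not an object."""
    if not isinstance(payload, Mapping):
        # A top-level JSON array has no .get(); bail out instead of crashing,
        # and do not invent zero-delta samples from it.
        return []
    samples = payload.get("reward_samples", [])  # hypothetical key name
    return [
        float(s)
        for s in samples
        if isinstance(s, (int, float)) and not isinstance(s, bool)
    ]


def verdict_sketch(raw_json: str) -> str:
    """Map a parity file's text to a coarse verdict without raising on list input."""
    samples = extract_reward_samples_sketch(json.loads(raw_json))
    return "has-evidence" if samples else "insufficient-evidence"


list_shaped = verdict_sketch('[{"reward": 1.0}, {"reward": 0.0}]')
object_shaped = verdict_sketch('{"reward_samples": [0.5, 0.6]}')
print(list_shaped, object_shaped)
```

The point of the guard is that a list-shaped file yields a documented verdict rather than an AttributeError or a false parity confirmation.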

…e gate

The artifacts section promised `bench agent verify` scores every shipped parity_experiment.json and that the recorded experiments span per-criterion verdict agreement. Verified against code, neither holds:
- `build_verify_report`/`extract_criterion_comparisons` require the file to be a JSON object; one shipped experiment is a top-level list of runs, which the gate cannot score, so `verify` does not work uniformly across the shipped set.
- No shipped experiment records per-criterion data (every `verify` run reports 0/0 criteria), so "spanning per-criterion verdict agreement" was aspirational.

Rewrite the section to state the object-shape requirement, give a worked `bench agent verify programbench` example (parity-confirmed, reward delta within the 0.02 tolerance), note the shipped set is non-uniform (others return insufficient-evidence; one is a list the gate cannot score), and reword the Layer 2 lead-in so it no longer implies verify scores every adopted benchmark.

…y' into integrate-iter3

…integrate-iter3

iter2: verifier-home isolation (/testbed_verify locked from agent read). iter3: bench agent verify no longer crashes on list-shaped parity files (graceful verdict; guarded both parsers, avoids a false parity-confirmed); bench tasks init Created-summary now lists the real file set; docs L2 migrate-vs-adopt disambiguation + quickstart placeholder/RC-wheel fixes. Suite 3,437 green; harvey-lab verify confirmed non-crashing.

…th in authoring example Final cold-read nits (full flow already passed): the RC install example pinned an older tag (literal copy fetched rc.2); the authoring example used a relative hello.txt that could need an oracle iteration to resolve. Both docs-only; no code/wheel change.

Remove the dashboard/ dev status board (Linear/CI/HF-scores surface, never shipped in the wheel) and its 10 dashboard-only test modules. Redirect or reword every reference so nothing dangles:
- pyproject.toml: daytona>=0.184 floor comment now cites sandbox/daytona.py (the actual sync-client user) instead of the removed dashboard panel.
- tests/test_paths_symlink_helpers.py: drop the cross-ref to the removed symlink-ingestion test.
- experiments/skillsbench-fill/README.md: describe the ledger summary instead of the removed Experiments tab path.
- .claude/launch.json: drop the dashboard run config.
- docs/running-any-benchmark.md: neutral wording for downstream consumers.

Collapse labs/ by archiving the two 0.2.x-era research artifacts under labs/archive/ (git mv, evidence preserved):
- benchjack-sandbox-hardening, reward-hack-matrix

Add labs/archive/README.md and point README, docs/concepts.md, docs/sandbox-hardening.md, and .claude/dev-docs/labs.md at the archived location with a clear (historical, 0.2.x-era) note. The security story in docs/sandbox-hardening.md is unchanged.

Gate: pytest tests/ (excl. integration) green; no dangling dashboard/ paths or broken labs/ links remain.

…hecks

Pure structural quick wins in sandbox/daytona.py with no behavior change:
- Add _require_sandbox() helper and route the repeated 'if not self._sandbox: raise' preconditions through it so the type checker narrows AsyncSandbox|None at each call site (same falsiness check and error message as the inlined guards).
- Define _SDK_RETRY once and apply it to the six idempotent SDK helpers that inlined an identical tenacity policy; _create_sandbox and _stop_sandbox keep their distinct policies.
- Add _reject_non_main_service() for the three identical single-container ValueError guards in the direct strategy.
- Build CreateSandboxFromImageParams once in _DaytonaDirect.start by computing the image in an inner if/else (ownership label preserved).
- Drop the no-op 'try/finally: pass' wrapper in _sandbox_exec, keeping the explanatory comment.
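
The `_require_sandbox()` pattern from the list above can be sketched as follows. This is an illustration only: `FakeSandbox` and `DaytonaLike` are hypothetical stand-ins for the SDK's AsyncSandbox and the real sandbox class, and the exception type and message are assumptions.

```python
from typing import Optional


class FakeSandbox:
    """Hypothetical stand-in for the SDK's AsyncSandbox."""

    def run(self, cmd: str) -> str:
        return f"ran: {cmd}"


class DaytonaLike:
    """Hypothetical owner of an optional sandbox handle."""

    def __init__(self) -> None:
        self._sandbox: Optional[FakeSandbox] = None

    def _require_sandbox(self) -> FakeSandbox:
        # One shared precondition: the non-Optional return type lets a type
        # checker treat the result as FakeSandbox at every call site.
        if not self._sandbox:
            raise RuntimeError("sandbox not started")  # illustrative message
        return self._sandbox

    def start(self) -> None:
        self._sandbox = FakeSandbox()

    def exec(self, cmd: str) -> str:
        return self._require_sandbox().run(cmd)


box = DaytonaLike()
try:
    box.exec("ls")
    raised_before_start = False
except RuntimeError:
    raised_before_start = True

box.start()
result_after_start = box.exec("ls")
```

Returning the narrowed value, rather than only raising, is what removes the need for a repeated inline `if not self._sandbox` at each call site.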

- Hoist shutil to a single top-level import; drop redundant local re/tempfile/shlex/shutil imports.
- Extract _record_agent_timeout() shared by both TimeoutError handlers.
- Add RolloutConfig._primary_role; collapse the three primary_* properties to one-liners.
- Initialize _effective_task_path/_task_tmp/_session_traj_count/_session_tool_count in __init__ and drop the now-redundant getattr guards on the install_agent()/execute() reads. Guards in _capture_partial_acp_trajectory() and cleanup() are kept: they intentionally tolerate Rollout.__new__() test stubs that skip __init__ (documented in-code).

…ult reporter

Pure structural cleanups in the CLI lockdown surface (zero behavior change):
- lockdown: drop the dead _under_path helper (the live copy is inlined in the trusted-PATH script string).
- cli: extract _report_eval_result for the duplicated Score/errors summary print used by job and eval create.
- cli: collapse the two near-identical EvaluationConfig constructions in eval create into one _make_eval_config(source_provenance=...) builder.
- cli: add benchflow/cli/_options.py with reusable Typer option aliases (--agent/--model/--sandbox/--concurrency/--jobs-dir/--skill-mode) and apply them across commands; per-command defaults stay at the call site so --help output is byte-identical.

The daytona module is import-safe without the SDK (#358), so the factory's import guard never tripped and selecting the Daytona environment leaked a raw ImportError deep inside DaytonaSandbox.__init__ (after CPU/memory clamping). Force _load_daytona_sdk() at env selection and route the failure through _raise_missing_optional_sandbox_dependency, matching the modal branch — a clear 'uv sync --extra sandbox-daytona' / 'pip install benchflow[sandbox-daytona]' error. Widen the helper's exc type to ImportError. Add a focused unit test.

v0.6 renamed SDK.run(trial_name=...) to rollout_name with no alias, a hard break for pre-v0.6 callers. Re-add an optional 'trial_name' keyword that maps to 'rollout_name' and emits a DeprecationWarning (matching the warnings.warn( ..., DeprecationWarning, stacklevel=2) style used in task/config.py). Passing both is ambiguous and raises TypeError. Adds tests/test_sdk_run_alias.py covering the warn+map path, the no-warn rollout_name path, and the TypeError on both.
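
A deprecated keyword alias of the kind described above is commonly written like this. It is a minimal sketch using a free function named `run_sketch` (hypothetical); the real `SDK.run` signature has other parameters that are not shown in this PR, and the warning and error messages are illustrative.

```python
import warnings
from typing import Optional


def run_sketch(*, rollout_name: Optional[str] = None,
               trial_name: Optional[str] = None) -> Optional[str]:
    """Accept the old 'trial_name' keyword as a deprecated alias for 'rollout_name'."""
    if trial_name is not None:
        if rollout_name is not None:
            # Passing both is ambiguous, so refuse rather than pick one.
            raise TypeError("pass either 'rollout_name' or 'trial_name', not both")
        warnings.warn(
            "'trial_name' is deprecated; use 'rollout_name'",
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not this function
        )
        rollout_name = trial_name
    return rollout_name


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    mapped = run_sketch(trial_name="t1")      # warns and maps to rollout_name
    direct = run_sketch(rollout_name="r1")    # no warning

deprecation_count = sum(
    issubclass(w.category, DeprecationWarning) for w in caught
)
```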

The strict validate_reward_map rejects any unrecognized non-numeric top-level key (e.g. the Harbor-era {"reward":1.0,"done":true}) and any non-numeric metric, so the classic test.sh->reward.json flow fails the whole run. Add an opt-in lenient path aligned with the existing validation abstraction.
- validate_reward_map(..., lenient=False): new flag. Lenient drops unrecognized/non-numeric top-level keys and non-numeric metric entries (and malformed recognized-structured keys) with a single warnings.warn listing everything dropped, instead of raising. Still requires a usable scalar reward: from 'reward', else a numeric 'score'/'rewards' alias, else numeric metrics + a declared aggregate policy. Default stays strict (no behaviour change).
- reward_lenient_from_env(): reads BENCHFLOW_REWARD_LENIENT (truthy 1/true/yes/on), matching benchflow's BENCHFLOW_* operator-toggle convention. Threaded into Verifier._parse_reward_json and the final _ensure_canonical_rewards gate in rollout.py.
- tests/test_reward_lenient.py: strict still raises on {reward,done}; lenient drops done + non-numeric metrics, keeps reward, warns; alias derivation; env parsing.
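
The strict-versus-lenient behaviour in the list above can be sketched in a few lines. This is a simplified illustration, not benchflow's validator: it recognizes only `reward` and `metrics`, omits the 'score'/'rewards' aliases and aggregate policies, and the `_sketch` function names and error messages are hypothetical. Only the `BENCHFLOW_REWARD_LENIENT` variable name and its truthy values come from the commit message.

```python
import os
import warnings


def _is_number(value: object) -> bool:
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_reward_map_sketch(data: dict, lenient: bool = False) -> dict:
    """Keep numeric 'reward' and numeric 'metrics' entries; handle the rest by mode."""
    cleaned: dict = {}
    dropped: list[str] = []
    for key, value in data.items():
        if key == "reward" and _is_number(value):
            cleaned["reward"] = float(value)
        elif key == "metrics" and isinstance(value, dict):
            good = {k: float(v) for k, v in value.items() if _is_number(v)}
            dropped += [f"metrics.{k}" for k in value if k not in good]
            cleaned["metrics"] = good
        else:
            dropped.append(key)
    if dropped and not lenient:
        raise ValueError(f"unrecognized or non-numeric reward keys: {sorted(dropped)}")
    if "reward" not in cleaned:
        raise ValueError("no usable scalar reward")
    if dropped:
        # Lenient mode: one warning listing everything that was dropped.
        warnings.warn(f"dropped reward keys: {sorted(dropped)}", stacklevel=2)
    return cleaned


def reward_lenient_from_env_sketch(environ=None) -> bool:
    env = os.environ if environ is None else environ
    return env.get("BENCHFLOW_REWARD_LENIENT", "").strip().lower() in {"1", "true", "yes", "on"}


harbor_style = {"reward": 1.0, "done": True}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    lenient_result = validate_reward_map_sketch(harbor_style, lenient=True)
```

Keeping strict as the default means existing runs behave the same unless the operator opts in.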

strip_provider_prefix leaves a bare model id (e.g. deepseek-v4-flash) that find_provider can no longer match, so the openclaw shim's _infer_provider_prefix defaulted every non-gemini/gpt id to anthropic -- silently running deepseek/glm/qwen/... as anthropic. Add a registry-owned find_provider_for_bare_model(model): each provider declares the bare-model family tokens it owns via a new ProviderConfig.model_prefixes field, and the helper does longest-token matching (with a non-letter family boundary so glm matches glm-4.6 / glm5 but not glmnext), falling back to an unambiguous declared models[].id. _infer_provider_prefix now consults it before the native gemini/gpt heuristics, then anthropic -- existing gemini/gpt/anthropic behavior unchanged. Tests: tests/test_bare_model_provider.py.
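
Longest-token matching with a non-letter family boundary, as described above, can be sketched like this. The function name carries a `_sketch` suffix and the provider names in `OWNED_PREFIXES` are hypothetical examples; the fallback to a declared `models[].id` is omitted.

```python
from typing import Optional

# Hypothetical registry: provider name -> bare-model family tokens it owns.
OWNED_PREFIXES = {
    "deepseek": ("deepseek",),
    "zhipu": ("glm",),
    "qwen": ("qwen",),
}


def find_provider_for_bare_model_sketch(
    model: str, prefixes: dict[str, tuple[str, ...]] = OWNED_PREFIXES
) -> Optional[str]:
    """Return the provider owning the longest family token that prefixes `model`."""
    name = model.lower()
    best_provider: Optional[str] = None
    best_len = 0
    for provider, tokens in prefixes.items():
        for token in tokens:
            if not name.startswith(token):
                continue
            rest = name[len(token):]
            # Family boundary: the token must be followed by a non-letter
            # (or nothing), so "glm" matches "glm-4.6" and "glm5" but not "glmnext".
            if rest and rest[0].isalpha():
                continue
            if len(token) > best_len:
                best_provider, best_len = provider, len(token)
    return best_provider


matches = {
    m: find_provider_for_bare_model_sketch(m)
    for m in ("deepseek-v4-flash", "glm-4.6", "glm5", "glmnext")
}
```

Returning `None` for an unknown family lets the caller fall through to its existing heuristics instead of silently assuming a provider.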

…tegrate-qw

…to integrate-qw

…integrate-cleanup

16-probe deterministic sweep (task.md gate, CLI arg validation, provider/agent wiring, judge routing, ATIF/ADP, Daytona reaper) with adversarial verification, plus 4 live rollout campaigns (oracle Docker/Daytona, deepseek-v4-flash openhands Daytona, Docker/Daytona parity isolation, live gemini-3.1-flash-lite judge). 9 confirmed defects (2 P1, 6 P2, 3 P3) + 1 broken example oracle; 4 false alarms refuted. Engine, trajectory artifacts, secret redaction, reaper ownership-scoping, and conversion gate all pass under stress (3473/3474 fast tests).

From a 92-agent dead-code audit of the release HEAD. This PR removes only the high-confidence, pure-deletion subset: symbols re-verified zero-reference WITH CLASS CONTEXT (the audit's blanket signal flagged e.g. RolloutPaths.log_path as "dead" while 14 unrelated `.log_path` attributes existed on other classes; each item here was disambiguated against its real reference sites):
- task/paths.py: 7 unused `*_path` `@property` accessors: TaskPaths.{readme_path, gitignore_path, verifier_document_path} and RolloutPaths.{artifacts_manifest_path, result_path, exception_message_path, log_path}. The underlying files are still written/read via inline `dir / "name"`; only the unused accessors go.
- sandbox/modal_impl.py: ModalSandbox.{supports_gpus, can_disable_internet}, vestigial capability properties not on the Sandbox Protocol (GPU config reads task_env_config.gpus; internet policy uses SandboxConfig.allow_internet).
- cli/continue_cmd.py: unused module-level `logger` + its `import logging`.
- experimental/mcp/hooks.py: orphaned `mcp_service_hooks_from_config` (not exported, no callers); `mcp_reviewer_hook` is kept (referenced in docs).

No public API touched. Full suite 4069 passed; ruff/format/ty clean.

Deferred to a follow-up (need per-site verification or an outward-facing decision, NOT included here): dataclass fields with positional-construction risk (AgentLift/JudgeConfig/ServiceConfig/ToolCall/SandboxReplayProxy), back-compat aliases requiring test repoints (scoring.classify_result_dict/count_result_outcomes, SDK pass-through wrappers, RuntimeResult.to_run_result), and the public-`__all__` / Protocol-surface items (OTelCollector, ACP types models, the Sandbox.read_file/write_file runtime_checkable isinstance-trap, RoundResult.n_tool_calls).

chore: dead-code purge round 2 (verified zero-reference)

Continues the audit-driven cleanup. The keystone fix is the Sandbox Protocol:
- Removed `read_file`/`write_file` from the `@runtime_checkable` `Sandbox` Protocol. No docker/daytona/modal backend implements them (backends expose the `upload_file`/`download_file` family) and there were zero call sites on a benchflow Sandbox; they were a latent `isinstance` trap on the contract surface. (`harvey_lab_acp_shim.DirectSandbox` and the deepagents test workspace define their own `read_file`/`write_file`; those are unrelated classes, not this Protocol.)

Plus round-3 dead code, each symbol re-verified zero-reference WITH CLASS CONTEXT (the audit's blanket signal mislabeled a couple of items, e.g. `session_load`, which actually has a test caller and was therefore NOT removed):
- `TaskMetrics.audit_outcome` property + its now-orphaned `classify_audit_outcome` import (`TaskMetrics.outcome` deliberately kept: a `result.outcome` ref could not be cleanly disambiguated, so it waits for a follow-up).
- `OTelCollector.endpoint`, `ReplayRouter.cursor` (both unused `@property` accessors; the look-alike `.endpoint`/`.cursor` reads belong to other classes).
- `RuntimeResult.to_run_result` (legacy SDK-compat converter, only caller was its own test; both removed).
- Never-read dataclass fields `ToolCall.output`, `JudgeConfig.{reference, prompt_template}` (+ their `_parse_judge` writes + a docs/llm-judge.md row).
- Write-only `ReplayProxy._host`; inert `AgentProtocolError.code` annotation; unused `retry_if_exception_type` import + fallback in `sandbox/daytona.py`.

Also renamed the misleading `TestReadFileError` test class (it only exercised `ExecResult`, never `read_file`).

No public API affected. Full suite 4068 passed; ruff/format/ty clean; an adversarial second-pass review confirmed all 10 removals are unreferenced.
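
The `isinstance` trap mentioned above comes from how `typing.runtime_checkable` works: an `isinstance` check against such a Protocol requires every declared member to be present on the object. The sketch below uses two hypothetical Protocols and a hypothetical backend class; their method signatures are illustrative and are not copied from `sandbox/protocol.py`.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class SandboxWithStubs(Protocol):
    """Hypothetical pre-cleanup shape: declares a method no backend implements."""

    def exec(self, cmd: str) -> str: ...
    def upload_file(self, src: str, dst: str) -> None: ...
    def read_file(self, path: str) -> str: ...


@runtime_checkable
class SandboxTrimmed(Protocol):
    """Hypothetical post-cleanup shape: only what backends actually expose."""

    def exec(self, cmd: str) -> str: ...
    def upload_file(self, src: str, dst: str) -> None: ...


class ExampleBackend:
    """Hypothetical backend with exec/upload_file but no read_file."""

    def exec(self, cmd: str) -> str:
        return ""

    def upload_file(self, src: str, dst: str) -> None:
        return None


backend = ExampleBackend()
passes_with_stubs = isinstance(backend, SandboxWithStubs)  # missing read_file
passes_trimmed = isinstance(backend, SandboxTrimmed)
print(passes_with_stubs, passes_trimmed)
```

With the unimplemented member declared, a perfectly usable backend fails the runtime check; trimming the Protocol to the real contract makes the check agree with what the backends provide.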

…ocol chore: remove dead Sandbox Protocol stubs + dead-code purge round 3

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 6affa1c.

…file removal The `Sandbox` Protocol example still listed `read_file`/`write_file` — the unimplemented stub methods removed from the Protocol in #745 (no backend implements them; a custom sandbox would be misled into "must implement" them). Replace with the real file-transfer contract (`upload_file`/`download_file`) and correct the `exec` signature to match `sandbox/protocol.py` (`cmd: str` → `ExecResult`, not `cmd: list[str]` → `SandboxExecResult`).

docs(python-api): fix Sandbox Protocol example after read_file/write_file removal

`OTelCollector` was a designed-but-never-wired OTLP/HTTP receiver from the v2 rewrite: it would have captured an agent's OpenTelemetry GenAI spans (prompts, completions, model, token usage) into a benchflow Trajectory. But repo-wide it is never instantiated, has zero tests, and is not part of any run path — the default capture goes through ACP session events and the LiteLLM callback path. Its only references were its own definition + the public `__all__` re-exports. Per maintainer decision ("remove for now"), drop the `trajectories/otel.py` module and its exports from `benchflow.__all__` and `benchflow.trajectories`. This removes a public name, so it is a (minor) breaking change — appropriate in the RC window. Re-add with a test + real wiring if OTel-based capture is revived. Full suite 4068 passed; ruff/ty clean; `import benchflow` healthy.

chore: remove the unwired OTelCollector (public-API removal)

Pre-release cleanup of stale notes (per maintainer call):
- `.claude/dev-docs/0.3-plan.md`: the 0.3-era dev plan; unreferenced and superseded (we're at 0.6). The actively-referenced dev-docs (architecture/harden-sandbox/labs/tested-agents, used by the review skills) are kept.
- `docs/architecture-explorer.html`: an unlinked, stale generated explorer artifact (still showed the pre-rename 7-group CLI taxonomy); not in the docs nav and referenced nowhere.
- `docs/task-standard-roadmap.md`: the roadmap doc; removed from both `docs.json` navs and de-linked from `task-standard.md` (the "Open Primitives and Roadmap" subsection that only pointed at it is dropped).

Docs-only. Both docs.json files validate; no dangling references remain; the CLI<->docs drift guard passes.

chore(docs): remove stale dev/old notes

- pyproject: 0.6.0.dev0 -> 0.6.0 (final release version; __version__ derives from package metadata, so this is the single source of truth).
- CHANGELOG: fold the accumulated `[Unreleased]` RC-loop entries into a single dated `## 0.6.0 — 2026-06-13` section, merging the two duplicate 0.6.0 blocks (the 06-10 draft + the [Unreleased] hardening) into one: Added / Changed / Renamed / Removed / Fixed, each appearing once. Updated the Added adoption-router bullet to the shipped `bench eval adopt` names (was the pre-rename `bench agent create|run|verify`).

NOTE: install-doc wheel URLs still point at the rc.6 wheel. Switch those to the final 0.6.0 / PyPI install at publish time (changing now would 404 until the wheel exists).

Full suite 4068 passed; ruff/ty clean.

chore: cut 0.6.0 — bump version + consolidate CHANGELOG

`test_environment_group_is_hidden_but_still_resolves` matched `│`-anchored rows in `bench --help` raw output WITHOUT stripping ANSI. On CI, Rich emits color codes (FORCE_COLOR), so the `^\s*│` anchor matched nothing → `rows == set()` → `assert 'sandbox' in set()` failed. It passed locally only because there were no ANSI codes. This was the sole failure on #665's `test` gate (1 failed / 4112 passed), blocking the 0.6.0 release. Fix: assert against the Click command registry (`typer.main.get_command(app)` + `.hidden`) instead of a regex over rendered help — authoritative and immune to ANSI/locale rendering differences. Verified the fix passes under FORCE_COLOR=1 (which reproduces the CI failure with the old code). The two other help-row parsers (test_cli_docs_drift, test_cli_adopt_aliases) already strip ANSI, so they were unaffected.
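
The failure mode described above is easy to reproduce without the CLI. The sketch below uses hand-written sample help rows (not real `bench --help` output) and a simplified SGR-only ANSI pattern; it shows why the `^\s*│` anchor finds nothing once color codes precede the border character, and how stripping ANSI restores the match. The PR's actual fix took a different route and asserted against the Click command registry instead.

```python
import re

# Simplified pattern for SGR color/style escapes only (an assumption; Rich can
# emit other escape sequences too).
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")
# Capture the first word of each row that starts with a box-drawing border.
ROW = re.compile(r"^\s*│\s*(\S+)", re.MULTILINE)

plain_help = "│ sandbox   Manage sandboxes │\n"
colored_help = (
    "\x1b[2m│\x1b[0m \x1b[1;36msandbox\x1b[0m   Manage sandboxes \x1b[2m│\x1b[0m\n"
)

rows_plain = set(ROW.findall(plain_help))
rows_colored = set(ROW.findall(colored_help))            # escape code precedes │
rows_stripped = set(ROW.findall(ANSI_SGR.sub("", colored_help)))

print(rows_plain, rows_colored, rows_stripped)
```

Because the colored row begins with an escape byte rather than whitespace or `│`, the anchored pattern matches nothing and the resulting set is empty, which is the `rows == set()` condition the commit message describes.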

test(cli): fix CI-fragile help-parse blocking the 0.6.0 release gate

#750's test rewrite passed `ruff check` (lint) but I didn't run `ruff format --check` — the CI `test` job runs format-check as a gating step and failed on `Would reformat: tests/test_cli_hub_env.py`. This applies ruff format (collapses the set comprehension to ruff's canonical form). No logic change. Verified: ruff format --check + ruff check + ty + the test file all pass.

style: ruff-format tests/test_cli_hub_env.py (CI gate fix)

…s-repros/)

A fan-out dead-docs sweep (adversarially verified against release/v0.6.0) confirmed these as orphaned stray artifacts with ZERO inbound references:
- STRESS-TEST-v0.6.0.md: a one-off, dated (2026-06-11, rc3) stress-test campaign report at the repo root. Exhausted: the defects it found were fixed and filed in Linear (ENG-248..257). Not in either docs.json nav.
- stress-repros/: the 12-file scratch repro tree (README, LINEAR-ISSUES, and 10 p*.sh scripts) for that campaign. Never wired into the test suite or CI; references an out-of-repo /tmp env file. Its only internal link was the back-link to STRESS-TEST-v0.6.0.md (also removed), so no dangling refs remain.

Substantive content lives in Linear. Docs/scratch-only; no code, test, nav, or CI reference touched (verified by grep across md/json/py/toml/yml).

chore: purge dead stress-test artifacts

cursor · 2026-06-13T23:44:49Z

Bugbot couldn't run - usage limit reached

Bugbot is counted against Cursor usage for this user or team, and this run hit a usage or spend limit.

A user or team admin can review and increase usage limits in the Cursor dashboard.

(requestId: serverGenReqId_15517029-9dbe-4f7f-9367-b0d0441c2a1d)

cursor Bot reviewed Jun 10, 2026

Comment thread docs/examples/task-md/real-skillsbench/3d-scan-calc/verifier/test_outputs.py

cursor Bot reviewed Jun 11, 2026

Comment thread .../examples/task-md/generated-skill-eval/models-as-skills/regex-email-parser/verifier/judge.py

This was referenced Jun 11, 2026

V0.6 integration #664

Closed

docs: v0.6 release-candidate preview pointer #666

Closed

cursor Bot reviewed Jun 11, 2026

Comment thread .../examples/task-md/generated-skill-eval/models-as-skills/regex-email-parser/verifier/judge.py

xdotli added 2 commits June 11, 2026 02:44

Merge remote-tracking branch 'origin/rc2/verifier-home-isolation' int…

6322183

…o integrate-iso

cursor Bot reviewed Jun 11, 2026

Comment thread docs/agent-quickstart.md

xdotli and others added 20 commits June 11, 2026 03:09

Merge remote-tracking branch 'origin/rc3/verify-crash-and-init-summar…

93aab7b

…y' into integrate-iter3

Merge remote-tracking branch 'origin/rc3/doc-l2-and-quickstart' into …

2871f61

…integrate-iter3

Merge remote-tracking branch 'origin/smell/rollout-quickwins' into in…

c206cca

…tegrate-qw

Merge remote-tracking branch 'origin/smell/daytona-quickwins' into in…

36f517f

…tegrate-qw

Merge remote-tracking branch 'origin/smell/cli-lockdown-quickwins' in…

b98cf58

…to integrate-qw

Merge remote-tracking branch 'origin/rc/cleanup-dashboard-labs' into …

0f25f78

…integrate-cleanup

xdotli added 4 commits June 13, 2026 17:17

Merge pull request #744 from benchflow-ai/chore/dead-code-purge-r9

4a8db01

chore: dead-code purge round 2 (verified zero-reference)

Merge pull request #745 from benchflow-ai/chore/dead-code-r3-and-prot…

6affa1c

…ocol chore: remove dead Sandbox Protocol stubs + dead-code purge round 3

mintlify Bot deployed to staging - docs June 13, 2026 21:43 View deployment

cursor Bot reviewed Jun 13, 2026

Comment thread .claude/skills/launch-prep/SKILL.md

xdotli added 2 commits June 13, 2026 17:45

Merge pull request #746 from benchflow-ai/docs/v06-dogfood-truth

2a3a4a1

docs(python-api): fix Sandbox Protocol example after read_file/write_file removal

mintlify Bot deployed to staging - docs June 13, 2026 21:46 View deployment

xdotli added 5 commits June 13, 2026 18:36

Merge pull request #747 from benchflow-ai/chore/remove-unwired-otel

c38090a

chore: remove the unwired OTelCollector (public-API removal)

Merge pull request #748 from benchflow-ai/chore/remove-stale-notes

3a46e03

chore(docs): remove stale dev/old notes

xdotli mentioned this pull request Jun 13, 2026

chore: cut 0.6.0 — bump version + consolidate CHANGELOG #749

Merged

xdotli added 2 commits June 13, 2026 18:58

Merge pull request #749 from benchflow-ai/chore/cut-0.6.0

c121554

chore: cut 0.6.0 — bump version + consolidate CHANGELOG

xdotli mentioned this pull request Jun 13, 2026

test(cli): fix CI-fragile help-parse blocking the 0.6.0 release gate #750

Merged

xdotli added 2 commits June 13, 2026 19:20

Merge pull request #750 from benchflow-ai/fix/ci-help-parse-ansi

306cf29

test(cli): fix CI-fragile help-parse blocking the 0.6.0 release gate

xdotli mentioned this pull request Jun 13, 2026

style: ruff-format tests/test_cli_hub_env.py (CI gate fix) #751

Merged

xdotli added 3 commits June 13, 2026 19:24

Merge pull request #751 from benchflow-ai/fix/format-hub-env-test

3ef837d

style: ruff-format tests/test_cli_hub_env.py (CI gate fix)

Merge pull request #752 from benchflow-ai/chore/purge-stress-artifacts

cb2dca9

chore: purge dead stress-test artifacts

xdotli merged commit 73fbca5 into main Jun 13, 2026
3 of 4 checks passed

bingran-you mentioned this pull request Jun 14, 2026

Dashboard computes stale release evidence but never renders the warning #544

Closed

xdotli deleted the release/v0.6.0 branch June 14, 2026 21:44

This was referenced Jun 15, 2026

ManifestEnvironment service start lacks fd detachment — manifest evals wedge on Daytona (same class as downstream service-hook wedge) #676

Closed

fix: fully detach manifest service starts so Daytona evals don't wedge (#676) #734

Closed

xdotli merged 339 commits into main from release/v0.6.0

3 participants
