V0.6 integration by xdotli · Pull Request #664 · benchflow-ai/benchflow

xdotli · 2026-06-10T23:04:11Z

Note

Low Risk
Changes are primarily release notes, navigation, and documentation/examples with a version bump; no application logic appears in the shown diff, so runtime risk is minimal unless unstaged code ships separately.

Overview
Release packaging and public docs for BenchFlow 0.6.0, aligning README, CHANGELOG, CITATION.cff, and install pins with the new version and documenting major 0.6 capabilities (task.md, bench tasks, hosted --source-env, trainer ATIF/ADP artifacts, and related fixes called out in the changelog).

Documentation and site structure: Mintlify docs.json adds agent quickstart, task-standard pages, architecture/environment guides, and a validation report; new docs/agent-quickstart.md is a copy-paste agent workflow. Historical v0.5 evidence moves under docs/archive/ with machine-specific paths replaced by repo-relative ones. Internal Claude skills now point at .claude/dev-docs/ instead of .dev-docs/, and the 0.3 plan is marked historical.

Authoring model and examples: User-facing docs treat task.md (YAML frontmatter + prompt) as the canonical task shape alongside legacy split layouts. Large new docs/examples/task-md/ fixtures cover schema-only packages, multi-scene/nudgebench patterns, SkillsBench ports, and generated skill-eval tasks with LLM-judge verifiers. Example notebooks/scripts bump to 0.6.0 and correct multi-agent guidance (shared-workspace file handoff, not automatic outbox injection).

Reviewed by Cursor Bugbot for commit 41d42e8.

Comment/docstring-only cleanup split out of #652 so the task.md feature PRs review clean. No behavior change.

…xamples

The task.md standard document, the 2026-06-09 validation evidence (88/88 conversion parity, 6/6 oracle E2E, 12 of 15 live mode-cells), and the worked example corpus. Split from #652: pure new docs, no code. The task package implementation stacks on this PR because its test suites use these examples as fixtures.

… export, rollout integration

The task.md package: TaskDocument parser, verifier strategies and document, runtime capabilities, round-trip export with loss report, prompt-plan compilation with roles/scenes sidecars, document-nudge user contracts, and the rollout integration that loads task.md scenes. Test suites included; the docs/examples PR underneath provides their fixtures. Lives outside this slice (next in stack): acceptance_live + authoring CLI, adapters/sandbox/cli wiring, and their suites. Split from #652.

…sk.md

The remainder of #652 on top of the core package: the live acceptance gate (acceptance_live), the authoring pipeline (init/check/migrate/export via task_authoring + CLI --format task-md), inbound adapter promotion (tests/->verifier/, solution/->oracle/), ORS tool-output reward events, sandbox lockdown/launch validation and Daytona download fallback, skill_eval + trace generation output formats, and the remaining test updates.

Add tests for benchflow.task.acceptance_live and the acceptance evidence gate in task_authoring, which previously had no coverage:

- spec parsing accept/reject paths, including generated calibration cases, report path handling, and leaderboard declarations
- execution paths against a scripted fake rollout layer: report schema and sha256 sidecar, run records, reward expectations, flake budgets, verifier error classification, oracle reruns, and leaderboard suitability
- static evidence gate accept/reject paths, including tampered and malformed artifacts, sha256 pinning, and threshold cross-checks
- check_task wiring: validation-level gating, fail-closed live execution preconditions, and a green end-to-end acceptance-live run

An explicit zero, negative, or non-finite [verifier].timeout_sec passed validation and flowed straight into the verifier execution budget (asyncio.wait_for / sandbox exec), producing an instant silent timeout. Reject unusable budgets at parse time with an error naming the field; omitting the field keeps inheriting the documented 600s default, and authoring profiles keep overriding it.
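
The parse-time rejection described above can be sketched as follows. This is a minimal illustration, not the project's parser: the function name `parse_verifier_timeout` and the plain-dict input are hypothetical, and only the 600s default and the rejected value classes come from the commit message.

```python
import math

# The documented default named in the commit message.
DEFAULT_VERIFIER_TIMEOUT_SEC = 600.0


def parse_verifier_timeout(verifier_section: dict) -> float:
    """Return a usable verifier budget, or raise an error naming the field.

    Hypothetical helper: the real parser's name and input type are not shown
    in the source.
    """
    if "timeout_sec" not in verifier_section:
        # Omitting the field keeps inheriting the default.
        return DEFAULT_VERIFIER_TIMEOUT_SEC
    value = verifier_section["timeout_sec"]
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"[verifier].timeout_sec must be a number, got {value!r}")
    if not math.isfinite(value) or value <= 0:
        raise ValueError(
            f"[verifier].timeout_sec must be finite and positive, got {value!r}"
        )
    return float(value)
```

Rejecting the value before it reaches the execution budget turns an instant silent timeout into an authoring-time error.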

Convert captured ACP trajectory events to two open agent-trajectory interchange formats, alongside the existing Verifiers/ORS exporter:

- export_atif.py: emits one ATIF trajectory document per rollout (trainer/atif.json), pinned against the Harbor pydantic models (harbor-framework/harbor, src/harbor/models/trajectories/*.py) and the ATIF RFC (rfcs/0001-trajectory-format.md), schema ATIF-v1.7. Honors the spec validators: sequential step_ids, agent-only fields, and same-step source_call_id references. ACP titles/statuses ride in ToolCall.extra rather than being passed off as structured arguments.
- export_adp.py: emits one ADP Trajectory record per rollout (trainer/adp.jsonl) plus a job-level aggregator, pinned against the ADP schemas (neulab/agent-data-protocol, schema/*.py), version 1.3.1. Every api_action carries a unique tool_call_id with exactly one matched environment text_observation, as ADP validation requires.

Both reuse the house redaction and non-finite-scrub seams and carry golden-fixture tests asserting the exact emitted records.

…formance

Replace the synthetic Harbor parity fixtures (fabricated sha256 digests) with five small text-only tasks vendored verbatim from benchflow-ai/skillsbench at the commit pinned in manifest.json, so the conversion-parity claim is reproducible from the public repo. The test is now a conformance runner: each vendored task goes through build_harbor_roundtrip_conformance_report (split -> task.md -> split) and must report zero mismatches across the canonical task.toml config, normalized instruction.md prompt, and environment/solution/tests file maps. Companion tests pin the fixture inventory and per-file digests to the manifest and enforce the text-only, under-300KB footprint budget.

litellm defaults to LITELLM_MODE=DEV, which calls load_dotenv() on first import and walks up parent directories until it finds a developer `.env`. Real credentials from that file then leak into os.environ for every test that runs after the first litellm import, breaking tests that assert resolved agent envs carry no inherited API keys (the subscription-auth alias test failed this way whenever a `.env` existed above the checkout). Default the suite to LITELLM_MODE=PRODUCTION, mirroring the existing BENCHFLOW_DOTENV_PATH isolation.
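
A minimal sketch of that suite-level default, assuming it sits at module scope in tests/conftest.py so it runs before any test imports litellm. Using `setdefault` (rather than an unconditional assignment) is an illustrative choice, not something the commit message specifies.

```python
import os

# Must execute before the first `import litellm` anywhere in the suite,
# because the dotenv loading described above happens at import time.
# setdefault keeps an explicit LITELLM_MODE from the caller's shell intact.
os.environ.setdefault("LITELLM_MODE", "PRODUCTION")
```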

- New docs/task-authoring-task-md.md: minimal three-file native task, frontmatter key classes and shorthands, prompts/ sidecars, verifier.md strategy declaration, oracle layout, and legacy migration; linked from docs.json nav and the split-layout authoring guide.
- task-standard.md: scope the cross-standard interop sequencing claim -- ORS interop is a reward-contract bridge, there is no AAA import adapter, and Harbor split-layout conversion is the proven import/export path.
- 2026-06-09 validation report: state live-run coverage as 12 of 15 mode-cells (4 of 5 tasks across 3 skill modes; the fifth fails for non-task.md infrastructure reasons).

… end

The three native verifier.md strategies had no coverage through Verifier.verify(). Add tests/test_verifier_strategies.py driving the full verify() dispatch over a fake sandbox and a real rollout directory:

- reward-kit: every reachable criteria aggregation (weighted_mean, weighted_sum, all_pass, any_pass, threshold) with exact reward assertions including >=0.5 boundary cases, the runner command/env/manifest contract, and the metrics-must-match-criteria error.
- agent-judge: judge model stubbed at the call_judge seam; pass, fail, and fractional-score verdicts plus malformed responses, which fail closed (VerifierOutputParseError, no reward file emitted) and missing declared inputs (AgentJudgeInputError before the judge is called).
- ors-episode: terminal-reward extraction pinned against surrounding dense step events, which are retained in the ORS evidence but never aggregated into the headline reward; dense-only episodes, out-of-range terminal rewards, and missing evidence pinned to their error types.

Running the suite locally surfaced cross-test secret leakage: importing litellm runs load_dotenv(), which walks ancestor directories and injects any developer .env into os.environ mid-suite, breaking later env-sensitive tests (test_subscription_auth). Set LITELLM_MODE=PRODUCTION in conftest to disable that import-time side effect, matching the intent of the existing isolate_local_dotenv fixture.

Address review findings on the vendored conversion-parity slice:

- Restore the eight runtime-results checker tests (rc codes, output messages, schema normalization, reward drift, provenance, skill lane, source-sha, baseline pin) as tests/test_check_skillsbench_harbor_parity.py; the checker module is still wired into tests/integration/run_suite.py and must keep fast unit coverage.
- Rename the conformance file to tests/test_skillsbench_conversion_conformance.py so its name stops implying it covers the runtime-results checker.
- Add a negative conformance test that tampers one exported environment file via a wrapped export_task_to_split_layout and asserts the report flags drift with the exact mismatch path and reason, so a silently disabled comparator fails the suite.
- Derive the vendored task list and parametrization from manifest.json instead of a hand-maintained tuple; the inventory test asserts disk directories match manifest keys.
- Shield test_claude_oauth_alias_satisfies_anthropic_key_requirement from an ANTHROPIC_API_KEY exported in the developer shell with monkeypatch.delenv, matching the sibling subscription tests.

…s, split roadmap

- task-authoring-task-md.md: agent.timeout_sec is optional in the implementation (AgentConfig defaults to None; check_task treats it as optional; rollout falls back to no wall-clock cap), so describe it as strongly recommended instead of fail-closed required. Drop the stale version pin from the minimal example; the schema default applies.
- task-authoring.md / task-standard roadmap: replace named hosted platforms with neutral 'hosted competition platforms' framing.
- Split the Open Primitives and Roadmap section out of task-standard.md into docs/task-standard-roadmap.md (task-standard.md was over the 1,000-line bar) and add the new page to the docs.json nav.
- Harden two environment-sensitive tests to isolate host credentials: the litellm required-usage test now hides a real ~/.codex/auth.json via a temp HOME, and the Claude OAuth alias test clears host Anthropic/Claude env vars and redirects expanduser.

Extract the format-independent pieces of the Verifiers/ATIF/ADP emitters into trajectories/_export_common.py:

- aggregate_rollout_jsonl single-owns the job-level skip/normalize/redact aggregation; write_job_verifiers_jsonl and write_job_adp_jsonl were verbatim copies that could drift, and now both delegate to it.
- content_blocks_to_text moves out of export_atif so the ADP emitter no longer reaches into the ATIF module for a generic ACP content renderer. Its Verifiers sibling _tool_call_to_content is cross-referenced for the eventual consolidation of the renderers.
- ThoughtBuffer single-owns the agent_thought join-and-clear bookkeeping that the ATIF and ADP event walkers duplicated.

write_rollout_atif_json now builds the steps once, catching the documented empty-trajectory ValueError instead of pre-walking the events, and the ATIF fallback tool-call id documents why it needs no trajectory-wide dedupe (unlike ADP's claim_call_id). Direct tests cover the shared aggregator's newline normalization and unreadable-artifact skip paths, which were previously only reachable through the wrappers.

# Conflicts:
#	tests/conftest.py

The default scaffold pinned agent: claude-agent-acp inside a scenes block, which silently overrode --agent on bench eval create — the scaffold's own documented oracle smoke test ran claude-agent-acp and failed on ACP auth. Scaffold is now bare prose (front matter + prompt); roles/scenes stay opt-in. Regression test asserts the scaffold pins no agent and declares no scenes.

On Docker 29.x, compose up can print 'Network ... Created' yet fail the container create/start that follows with 'Error response from daemon: failed to set up container networking: network <project>_default not found' — a daemon-side race between network creation and attachment. Retry compose up with bounded backoff (2s, 5s) only when the failure matches that daemon network-not-found signature, logging each retry. All other compose up failures surface unchanged.
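
A sketch of the signature-gated retry, under stated assumptions: `run_compose_up` is a hypothetical callable returning `(returncode, stderr)`, and the substring match below is an illustrative stand-in for whatever matcher the real code uses. Only the (2s, 5s) backoff and the daemon error text come from the commit message.

```python
import time
from typing import Callable, Sequence, Tuple

BACKOFF_SECONDS: Sequence[float] = (2.0, 5.0)
NETWORK_RACE_PREFIX = "failed to set up container networking: network "


def is_network_race(stderr: str) -> bool:
    """True only for the daemon 'network ... not found' failure."""
    start = stderr.find(NETWORK_RACE_PREFIX)
    return start != -1 and "not found" in stderr[start:]


def compose_up_with_retry(
    run_compose_up: Callable[[], Tuple[int, str]],
    sleep: Callable[[float], None] = time.sleep,
    log: Callable[[str], None] = print,
) -> Tuple[int, str]:
    """Retry compose up at most len(BACKOFF_SECONDS) times, and only on the race."""
    returncode, stderr = run_compose_up()
    for delay in BACKOFF_SECONDS:
        if returncode == 0 or not is_network_race(stderr):
            break  # success, or an unrelated failure that must surface unchanged
        log(f"compose up hit the network-not-found race; retrying in {delay}s")
        sleep(delay)
        returncode, stderr = run_compose_up()
    return returncode, stderr
```

Injecting `sleep` and `log` keeps the retry policy testable without a Docker daemon.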

A bare msg[:50] slice in the per-rollout console line could cut an embedded token in half with no marker, so an error like 'Docker compose command failed for environment authored-task...' rendered as '... environment auth' — reading as a complete environment named 'auth'. Add _utils.text.truncate_end, which only ever removes the tail of a message, backs up to the previous word boundary when the cut lands mid-token, and marks the cut with an ellipsis while staying within the width budget. Apply it to the [ERR] result line, the retry preview, and the resume-skip log in evaluation.py.
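
The behavior described above can be sketched like this. It is an illustration of the stated contract (tail-only removal, word-boundary backoff, visible marker, width budget), not the actual `_utils.text.truncate_end`; the signature and the single-character marker are assumptions.

```python
def truncate_end(text: str, width: int, marker: str = "…") -> str:
    """Shorten text to at most `width` characters by removing only its tail."""
    if len(text) <= width:
        return text
    if width <= len(marker):
        return marker[:width]
    keep = width - len(marker)
    head = text[:keep]
    # The cut lands mid-token when the characters on both sides are non-space.
    if not text[keep].isspace() and not head[-1].isspace():
        boundary = head.rfind(" ")
        if boundary > 0:
            head = head[:boundary]  # back up to the previous word boundary
    return head.rstrip() + marker


message = "Docker compose command failed for environment authored-task-1"
print(truncate_end(message, 50))
# prints "Docker compose command failed for environment…"
```

A single token longer than the budget has no earlier boundary to back up to, so this sketch falls back to a hard cut followed by the marker.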

- Lead the first-eval examples with the local --tasks-dir pattern and note that --source-repo clones the full repository, with a sparse checkout alternative for fetching a single task
- Document the <PROVIDER>_API_KEY + <PROVIDER>_BASE_URL convention for provider-prefixed models and the export requirement for .env files
- Correct the output layout to the real <jobs-dir>/<timestamp>/<task>__<hash8>/ contract and list what lands in result.json, trajectory/, trainer/, and verifier/
- Add a 'Reading results' note on exit codes, Score vs reward, and the Docker daemon requirement
- Document bench tasks migrate/normalize and the --level acceptance checks as development-branch features, with a pointer to the benchflow.evidence schema in docs/task-standard.md

…-consistent

Three fixes from live-run dogfooding of sandbox rollouts:

- config.json recorded sandbox_setup_timeout (default 120) while the agent install step enforced the per-agent registry install_timeout (900), so the recorded budget never matched the 'Installing <agent> in sandbox (timeout=900s)' log. The fallback chain now lives in one place, agents.install.effective_install_timeout: a per-agent registry value overrides the configured sandbox_setup_timeout, installers without a registry entry fall back to the config value instead of a hardcoded 900, and config.json records the effective value as agent_install_timeout (None when the agent has no install step). Hosted-env config.json carries the same key for schema parity.
- 'Discovered N pytest plugins from container' said container even on non-container backends; the discovery messages now say sandbox.
- Docker rollouts left agent/acp_trajectory.jsonl on the host while Daytona rollouts did not. The publication exists for in-sandbox verifiers (/logs/agent/acp_trajectory.jsonl, read by LLM-judge verifiers); on Docker the host copy appeared only because /logs/agent is bind-mounted to the host agent dir, while remote sandboxes never mirror it back. Decision: write the host copy on every backend rather than suppress it on Docker: /logs/agent maps to the host agent dir by mount design, the sandbox-side file must keep existing for verifiers, and dropping the Docker copy would delete an already-published artifact. _publish_trajectory_for_verifier now writes the redacted payload to the rollout agent/ dir itself, so both backends produce the same artifact set.
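
The install-timeout fallback chain from the first fix can be sketched as a pure function. The signature below is hypothetical (the real agents.install.effective_install_timeout is not shown in the source); the agent names in the usage lines are placeholders.

```python
from typing import Mapping, Optional


def effective_install_timeout(
    agent: str,
    registry_install_timeouts: Mapping[str, int],
    sandbox_setup_timeout: int,
    has_install_step: bool = True,
) -> Optional[int]:
    """Resolve the single value that is both enforced and recorded.

    Order, per the commit message: a per-agent registry value wins, otherwise
    the configured sandbox_setup_timeout applies; agents with no install step
    record None.
    """
    if not has_install_step:
        return None
    registry_value = registry_install_timeouts.get(agent)
    if registry_value is not None:
        return registry_value
    return sandbox_setup_timeout


registry = {"agent-with-entry": 900}  # placeholder registry contents
print(effective_install_timeout("agent-with-entry", registry, 120))  # prints 900
print(effective_install_timeout("agent-without-entry", registry, 120))  # prints 120
```

Resolving the value once and writing that same value to config.json is what keeps the recorded budget and the install log in agreement.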

bench tasks migrate materialized the full TaskConfig model dump into the generated task.md frontmatter — dozens of runtime defaults the author never declared, including a verifier.judge block pinning a judge model. Migration now emits only the keys the source task.toml actually declared: import_task_config_toml exposes the declared (extras-sanitized) mapping, and render_task_md_from_legacy renders that surface instead of reparsing a full model dump. The existing migrate-time equivalence check still proves the minimal document parses to the same TaskConfig. Also reconcile the version key spelling between authoring entrypoints: init scaffolded 'version:' while migrate wrote 'schema_version'. The task standard treats schema_version as canonical with version as a parse-time alias, so both init and migrate now emit schema_version (pinned first in migrated frontmatter), and the alias remains accepted on parse.
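
A small sketch of the two rules above: emit only declared keys, and write the canonical schema_version first. The helper name and the plain-dict representation are hypothetical; preferring schema_version when both spellings are declared is an assumption the source does not settle.

```python
def minimal_frontmatter(declared: dict) -> dict:
    """Build frontmatter from the keys the source task.toml actually declared.

    No runtime defaults are added. The legacy 'version' spelling is rewritten
    to 'schema_version', which is pinned first in the output.
    """
    remaining = dict(declared)
    legacy = remaining.pop("version", None)
    version = remaining.pop("schema_version", legacy)
    if version is None:
        return remaining
    return {"schema_version": version, **remaining}


print(minimal_frontmatter({"name": "demo-task", "version": "1.3"}))
# prints {'schema_version': '1.3', 'name': 'demo-task'}
```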

…tions

- Quote the full missing-base-URL error including 'to build the provider base URL', matching the message raised during agent env resolution.
- Split exit codes accurately: unknown agents / missing credentials exit 1, CLI usage errors (bad flags) exit 2.
- Mark the jobs-root summary.json as a copy of the latest job summary and llm_trajectory.jsonl as conditional on captured proxy exchanges.
- Scope the <PROVIDER>_API_KEY + <PROVIDER>_BASE_URL convention to user-supplied-endpoint providers; fixed-endpoint providers need only a key.
- Note the dataset cache resolves against the enclosing git repo root.
- Restyle the tasks check --level availability note as a callout to match the migrate/normalize sections.

Add three cohesive subcommands to the existing `bench agent` group for adopting an upstream benchmark into a BenchFlow benchmark:

- create <name>: deterministic, fail-closed scaffold of benchmarks/<name>/ matching the reference layout and benchmarks/CONVERT.md (converter with a documented convert() entry point, parity test, templated parity_experiment, benchmark.yaml, runner, README). Validates the slug; refuses to overwrite.
- run <source>: driver that assembles the adoption context (source + CONVERT.md + adoption skills) and launches the host codex CLI to drive the conversion. Context assembly and launch-command construction are pure functions, unit-proven with a fake exec layer; live codex run deferred. Fails closed with a precise message when codex has no credentials.
- verify <name>: parity-only gate over a deterministic per-criterion conversion-faithfulness floor plus a statistical reward-distribution layer. Emits a confidence verdict and, on divergence, a draft GitHub issue body for human support (never auto-filed).

Real logic lives in benchflow/agent_router.py (+ scaffold templates module); cli/main.py only registers the commands. Tests cover scaffold contents, name validation, refuse-on-exists, context/command assembly, the credential fail-closed path, and pass/fail/insufficient verdicts with issue drafts.

…e text, drop unused arg

Review follow-ups: --roundtrip-task wires roundtrip_conformance_status into verify (was library-only); confidence_line no longer claims criterion agreement when zero criteria were compared (reward-only confirm path); collect_adoption_skills drops the unused repo_root; new test exercises the real build_harbor_roundtrip_conformance_report import so the wiring can't silently rot.

mintlify · 2026-06-10T23:04:31Z

Preview deployment for your docs. Learn more about Mintlify Previews.

Project	Status	Preview	Updated (UTC)
benchflow-bff148e7	🔴 Failed	–	Jun 10, 2026, 11:04 PM

cursor · 2026-06-10T23:05:01Z

+
+        # Component Separation
+        def quantize(v):
+            return (round(v[0], 4), round(v[1], 4), round(v[2], 4))


Verifier skill vertex precision mismatch

High Severity

The pytest ground-truth path quantizes STL vertices to four decimals when building connected components, while the bundled MeshAnalyzer skill rounds to five. That can split or merge components differently than agents using the skill, so correct mass_report.json outputs may fail material_id or mass checks.

Additional Locations (1)

docs/examples/task-md/real-skillsbench/3d-scan-calc/environment/skills/mesh-analysis/scripts/mesh_tool.py#L85-L87
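
To illustrate why the rounding precision matters for connected components, here is a small sketch with two made-up vertices. The `quantize` helper mirrors the snippet under review but takes the precision as a parameter; the coordinates are invented for illustration and are not from the fixture.

```python
from typing import Tuple

Vertex = Tuple[float, float, float]


def quantize(v: Vertex, decimals: int) -> Vertex:
    """Round each coordinate so nearly identical vertices share one key."""
    return (round(v[0], decimals), round(v[1], decimals), round(v[2], decimals))


# Two vertices that differ only in the fifth decimal place of x.
a: Vertex = (0.12341, 0.0, 0.0)
b: Vertex = (0.12344, 0.0, 0.0)

same_at_4 = quantize(a, 4) == quantize(b, 4)  # True: treated as one shared vertex
same_at_5 = quantize(a, 5) == quantize(b, 5)  # False: treated as two vertices
print(same_at_4, same_at_5)
# prints "True False"
```

Triangles touching `a` and `b` are therefore joined into one component at four decimals and kept apart at five, which is how the ground truth and the skill can disagree on component membership.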

Reviewed by Cursor Bugbot for commit 650dcc6.

cursor · 2026-06-10T23:05:01Z

+                    model=model if is_gemini_model else "gemini-3.1-flash-lite",
+                    contents=prompt,
+                )
+                return response.text


Judge ignores configured model

Medium Severity

In call_llm, whenever GOOGLE_API_KEY or GEMINI_API_KEY is set, the judge calls Gemini (defaulting to gemini-3.1-flash-lite) before Anthropic, even when JUDGE_MODEL is a Claude model. Skill-eval scores can come from a different model than configured.

Reviewed by Cursor Bugbot for commit 650dcc6.

…nt-matter shape

Normalize all 11 task.md examples without adding/removing any or changing what each one teaches:

- Front matter key: every example now declares schema_version: "1.3" (the parser's current TaskConfig default); the legacy 'version:' alias and the two missing-key cases are gone.
- Agent pinning convention: examples whose pedagogy is roles/scenes (multi-scene, nudgebench-team, harbor-parity, private-facts-nudges) keep their agents/scenes blocks but pin one agent uniformly: claude-agent-acp, the registry's default ACP agent (DEFAULT_AGENT in evaluation.py). Per-role model/reasoning_effort pins are dropped as noise; roles stay differentiated by capabilities. Examples whose point is not roles/scenes pin no agent.
- Simulated-user blocks use the neutral 'model: scripted' runtime consistently instead of naming specific model ids.
- Timeout values normalized to integers in the real-skillsbench trio.
- docs/examples/task-md/README.md index now lists clean-body-roles-scenes among the schema-only fixtures.
- docs/examples/README.md notes that nanofirm-task intentionally stays in the legacy task.toml split layout as the bench tasks migrate fixture.

Gate: bench tasks check passes for all 11 (schema level for the four authoring fixtures, publication-grade for the seven runnable packages); all examples parse via TaskDocument.from_path; full non-integration pytest suite green.

…0.6 surface

- coder-reviewer-demo.py: describe the shared-workspace file handoff (the outbox auto-injection convention no longer exists in the runtime); pin benchflow==0.6.0
- scene-patterns.ipynb: replace the removed outbox convention with the explicit feedback-file handoff used by the runtime and the demo script; clear pattern-3 and comparison outputs recorded under the old prompts; pin benchflow==0.6.0
- swebench_pro_progressive_disclosure.ipynb: fix ../docs/ relative links (notebook lives in docs/examples/), replace the hardcoded author-machine chdir with a repo-root walk, re-execute the pure-local cells so the committed RoundResult output shows the 0.6 fields (scene, role, handoff_from, handoff_to), document the task.md frontmatter form of the hardening opt-out, and add the task.md-declared user pattern to the multi-round comparison
- user_dogfood.py: round-0 comment names task.md as the canonical instruction source alongside legacy instruction.md

- Bump release-state text and install pins from 0.5.2 to 0.6.0 across README, getting-started, release, llm-judge, skill-eval, python-api, and the runnable examples; update release.md worked examples to the 0.6.x line (next dev: 0.6.1.dev0).
- Mark the task standard as shipped with 0.6.0; lead the Task primitive and authoring links with the native task.md format.
- Fix doc/code drift: timeout_sec is optional (no wall-clock cap when unset), bench tasks check no longer claims to require it, oracle runs use --agent (no -a short flag), legacy scaffold examples now pass --format legacy, and the results tree documents rewards.jsonl, timing.json, prompts.json, and the trainer ATIF/ADP artifacts.
- CLI reference: document bench tasks export, eval create worker/build concurrency flags and --reasoning-effort, bench continue --proxy-mode, bench continue-batch, tasks init --format, and tasks generate --task-format; add an export note to the task.md authoring guide.
- Add docs/architecture and docs/environment-plane to the site nav.
- Archive dated v0.4/v0.5 evidence docs under docs/archive/ (with README) and scrub absolute home-dir paths from the archived reports; update the skill-eval inbound links.
- Fix broken .dev-docs/ links in labs READMEs and internal skill files; mark the 0.3 plan as historical.
- Add docs/agent-quickstart.md (copy-paste prompt that walks an AI agent through install, one real eval, artifact inspection, and task authoring) and link it from the README and nav.

# Conflicts:
#	docs/examples/coder-reviewer-demo.py
#	docs/examples/scene-patterns.ipynb

bench tasks init wrote schema_version "1.0" while the parser default, the docs, and the migrate path all say "1.3" — fresh scaffolds started life two schema versions behind. Found during the example-corpus harmonization.

… PyPI

The all-or-nothing version gate stopped a verbatim cold user during the publish window. Steps 0-6 work on 0.5.x; the prompt now continues in degraded mode with explicit expectations (verifiers.jsonl only, split scaffold) and tells the user what 0.6.0 unlocks.

cursor

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.

There are 4 total unresolved issues (including 2 from previous reviews).

Reviewed by Cursor Bugbot for commit 8ed3181.

cursor · 2026-06-11T00:22:12Z

+                    model=model if is_gemini_model else "gemini-3.1-flash-lite",
+                    contents=prompt,
+                )
+                return response.text


Judge ignores Anthropic model

Medium Severity

In the new skill-eval example judge.py verifiers, call_llm enters the Gemini branch whenever GOOGLE_API_KEY or GEMINI_API_KEY is set, even when JUDGE_MODEL is an Anthropic model (the default). It then calls Gemini with a hardcoded flash-lite model instead of the configured judge, so rubric scoring can silently use the wrong provider.

Additional Locations (2)

docs/examples/task-md/generated-skill-eval/models-as-skills/optimize-quadratic-to-nlogn/verifier/judge.py#L91-L109

docs/examples/task-md/generated-skill-eval/models-as-skills/topo-sort-with-cycle-detection/verifier/judge.py#L91-L109
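
One way to address the finding is to let the configured model, not key availability, select the provider. The sketch below is a hedged illustration rather than the example's actual call_llm: the function name, the name-prefix convention ("gemini", "claude"), and the fail-closed errors are assumptions, and no provider SDK is called.

```python
from typing import Mapping, Tuple


def pick_judge_provider(judge_model: str, env: Mapping[str, str]) -> Tuple[str, str]:
    """Return (provider, model) for the configured judge, or fail closed.

    Assumes Gemini model ids start with 'gemini' and Anthropic ids with
    'claude'; a real implementation may need a richer mapping.
    """
    has_gemini_key = bool(env.get("GOOGLE_API_KEY") or env.get("GEMINI_API_KEY"))
    has_anthropic_key = bool(env.get("ANTHROPIC_API_KEY"))

    if judge_model.startswith("gemini"):
        if not has_gemini_key:
            raise RuntimeError(f"{judge_model} needs GOOGLE_API_KEY or GEMINI_API_KEY")
        return ("gemini", judge_model)
    if judge_model.startswith("claude"):
        if not has_anthropic_key:
            raise RuntimeError(f"{judge_model} needs ANTHROPIC_API_KEY")
        return ("anthropic", judge_model)
    raise ValueError(f"unrecognized JUDGE_MODEL: {judge_model!r}")
```

With this shape, a Gemini key that happens to be present in the environment can no longer redirect a Claude-configured judge to a different model; a missing credential becomes an explicit error instead of a silent provider swap.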

Reviewed by Cursor Bugbot for commit 8ed3181.

cursor · 2026-06-11T00:22:12Z

+  2. `agent_result.total_tokens` > 0   (real model traffic was captured)
+  3. `rewards` is present and its value is not null (the verifier scored it)
+Check them with a one-liner (kept on one line so indentation cannot break it):
+    python3 -c "import json,sys; d=json.load(open(sys.argv[1])); t=d.get('n_tool_calls') or 0; k=(d.get('agent_result') or {}).get('total_tokens') or 0; r=d.get('rewards'); print('n_tool_calls:',t,'total_tokens:',k,'rewards:',r); print('REAL run' if t>0 and k>0 and r else 'NOT a real run'); sys.exit(0 if t>0 and k>0 and r else 1)" "$(find "$JOBS_DIR" -name result.json | head -1)"


Quickstart REAL check too weak

Low Severity

Step 5 tells agents a run is REAL only when rewards is present and not null, but the bundled one-liner treats any non-empty rewards object as success. A result.json with rewards: {} or {"reward": null} can be labeled REAL even though no verifier score was recorded.
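
A stricter form of the check can be sketched as below. It assumes `rewards` is a mapping from reward name to a numeric score and treats the run as scored when at least one value is a finite number; both points are assumptions, since the result.json schema is not shown here.

```python
import math


def has_recorded_reward(rewards: object) -> bool:
    """True only if rewards is a non-empty mapping holding a finite number."""
    if not isinstance(rewards, dict) or not rewards:
        return False
    return any(
        isinstance(v, (int, float)) and not isinstance(v, bool) and math.isfinite(v)
        for v in rewards.values()
    )


def is_real_run(result: dict) -> bool:
    """Apply the three quickstart conditions to a parsed result.json."""
    tool_calls = result.get("n_tool_calls") or 0
    tokens = (result.get("agent_result") or {}).get("total_tokens") or 0
    return tool_calls > 0 and tokens > 0 and has_recorded_reward(result.get("rewards"))


base = {"n_tool_calls": 3, "agent_result": {"total_tokens": 1200}}
print(is_real_run({**base, "rewards": {"reward": 0.0}}))   # prints True
print(is_real_run({**base, "rewards": {"reward": None}}))  # prints False
print(is_real_run({**base, "rewards": {}}))                # prints False
```

Note that a legitimate score of 0.0 still counts as recorded, which a plain truthiness test on the value would get wrong.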

Reviewed by Cursor Bugbot for commit 8ed3181.

…tadata

Audit P0/P1: the 0.6.0 changelog claimed an OpenReward hosted-environment episode runner, but only the ORS reward-format export + ors-episode verifier recognition ship in this release (the runner is in progress, PR pending). Reworded to what actually ships. Also bump CITATION.cff to 0.6.0, a missed release step that would show the wrong version in GitHub's cite box.

The user-driven rollout loop ran the scene-step loop and then a free-round while-loop. Both break paths out of the scene-step loop — user.run() raising (self._error set) and a classic user returning None (stop) — fell straight into the free-round loop, which called user.run() again with the same round_num. This resurrected stopped users and retried users that had already raised, executing extra rounds while self._error stayed set: a half-script rollout reported as errored. Set a loop_terminated flag on both break paths and guard the free-round while-loop on it so it neither executes nor re-calls user.run() once the scene-step loop has stopped or errored. Normal completion is unchanged. Add call-counting tests in test_user.py: after an explicit stop, run() is invoked exactly once and zero free rounds execute; after a raise, the loop terminates with the error set and no further run()/agent rounds occur. Both fail if the guard is removed (the prior tests used a user that always raised, masking the resurrection).
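
The guard can be sketched on a stripped-down loop. Everything here is a simplified stand-in for the real rollout code: `user_run` is a hypothetical callable, agent rounds are omitted, and only the `loop_terminated` flag and the two break paths come from the commit message.

```python
from typing import Callable, Optional, Sequence, Tuple


def run_user_loop(
    user_run: Callable[[int], Optional[str]],
    scene_steps: Sequence[str],
    max_free_rounds: int,
) -> Tuple[int, Optional[Exception]]:
    """Return (completed rounds, error). A stop or raise ends both loops."""
    round_num = 0
    error: Optional[Exception] = None
    loop_terminated = False

    for _step in scene_steps:
        try:
            prompt = user_run(round_num)
        except Exception as exc:  # the user raised: record it and stop for good
            error = exc
            loop_terminated = True
            break
        if prompt is None:  # a classic user returned None, meaning stop
            loop_terminated = True
            break
        round_num += 1

    free_rounds = 0
    # Without the flag, this loop would call user_run again with the same
    # round_num after a stop or a raise.
    while not loop_terminated and free_rounds < max_free_rounds:
        try:
            prompt = user_run(round_num)
        except Exception as exc:
            error = exc
            break
        if prompt is None:
            break
        round_num += 1
        free_rounds += 1
    return round_num, error
```

A call-counting user, as in the tests the commit adds, is what exposes the bug: a user that always raises hides the resurrection because every extra call fails the same way.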

The legacy split scaffold still ships on this release: bench tasks init --format legacy emits task.toml + instruction.md + tests/ + solution/, and the legacy code paths (render_task_md_from_legacy, the tests/ and solution/ aliases, legacy_solution_dir) remain in src/benchflow. Ten tests guarding that behavior had been marked skip as 'legacy scaffold superseded', leaving the shipped path unguarded. Un-skip all ten and pin them to the current contract:

- TestCheckTask.test_missing_tests_dir asserts the current missing verifier-package message that names the legacy tests/ fallback.
- The nine init scaffold tests move into TestInitTaskLegacyScaffold and call init_task(..., task_format='legacy'), so they exercise the legacy layout (tests/, solution/) rather than the new task.md default.
- The replace-all-placeholders test now also replaces tests/test_outputs.py, which the fresh legacy scaffold ships with a [REPLACE: ...] marker, so a fully edited task is the only one that passes check_task.

Full suite green; tests/test_tasks.py carries no skips.

The Daytona auto-reaper deleted every sandbox visible to DAYTONA_API_KEY by age alone. On a key shared across an org or with other tools this irreversibly destroyed unrelated sandboxes, since nothing distinguished benchflow's sandboxes from foreign ones. Stamp every sandbox benchflow creates with a 'benchflow.managed' label at all six CreateSandboxFromImageParams / CreateSandboxFromSnapshotParams sites (the pinned daytona==0.184.0 SDK supports labels on CreateSandboxBaseParams), and restrict reap_stale_sandboxes to only those sandboxes before applying the age TTL. Foreign or unlabeled sandboxes are never touched, regardless of age. The TTL tiers and the BENCHFLOW_DAYTONA_AUTO_REAP gate are unchanged. A fresh label dict is built per creation call because the SDK mutates params.labels in place (it injects the language label). Tests: foreign/unlabeled sandboxes (even 9999m old) are never reaped; owned stale ones still are; the scope predicate and CLI cleanup wrapper are covered directly, including value-equality and non-mapping cases.
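
The ownership scoping can be sketched without the Daytona SDK. Sandboxes are modeled as plain dicts with hypothetical `labels` and `age_minutes` fields, and the label value "true" is an assumption; only the 'benchflow.managed' label name, the fresh-dict-per-call rule, and the label-then-TTL ordering come from the commit message.

```python
from typing import Any, Dict, List, Mapping

MANAGED_LABEL = "benchflow.managed"
MANAGED_VALUE = "true"  # assumed value; the source names only the label key


def managed_labels() -> Dict[str, str]:
    """Build a fresh dict per creation call, since the SDK mutates labels in place."""
    return {MANAGED_LABEL: MANAGED_VALUE}


def is_benchflow_managed(labels: Any) -> bool:
    """Scope predicate: value equality on the label, False for non-mappings."""
    return isinstance(labels, Mapping) and labels.get(MANAGED_LABEL) == MANAGED_VALUE


def select_reapable(sandboxes: List[dict], ttl_minutes: float) -> List[dict]:
    """Filter by ownership first, then apply the age TTL."""
    return [
        s
        for s in sandboxes
        if is_benchflow_managed(s.get("labels")) and s["age_minutes"] > ttl_minutes
    ]


fleet = [
    {"id": "owned-stale", "labels": managed_labels(), "age_minutes": 500},
    {"id": "owned-fresh", "labels": managed_labels(), "age_minutes": 5},
    {"id": "foreign-old", "labels": {"team": "other"}, "age_minutes": 9999},
    {"id": "unlabeled-old", "labels": None, "age_minutes": 9999},
]
print([s["id"] for s in select_reapable(fleet, ttl_minutes=60)])
# prints ['owned-stale']
```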

…udit-fixes

xdotli · 2026-06-11T05:53:32Z

Consolidating v0.6 onto a single working branch: release/v0.6.0 is now the v0.6 branch (carries the same content as v0.6-integration). Closing this duplicate review vehicle; v0.6 work and the release cut track via #665.

Test User and others added 30 commits June 9, 2026 19:07

chore: normalize comment banners and stale task.toml doc references

99360e9

Comment/docstring-only cleanup split out of #652 so the task.md feature PRs review clean. No behavior change.

Merge branch 'harden/acceptance-live-tests' into integrate/wiring

949fa8b

Merge branch 'harden/verifier-strategy-e2e' into integrate/wiring

d5af890

Merge branch 'harden/timeout-sec-failclosed' into integrate/wiring

77183b1

Merge branch 'harden/vendored-parity-slice' into integrate/wiring

51c3e24

# Conflicts:
#	tests/conftest.py

Merge branch 'harden/migrate-minimal-frontmatter' into integrate/wiring2

74e821a

Merge branch 'harden/console-error-truncation' into integrate/wiring2

4a01d62

Merge branch 'harden/compose-network-retry' into integrate/wiring2

0996c85

Merge branch 'harden/daytona-recorded-config' into integrate/wiring2

0445363

Merge remote-tracking branch 'origin/v06/cleanup-code-quality' into i…

2d0bdfe

…ntegrate/v06

xdotli added 4 commits June 10, 2026 18:48

Merge remote-tracking branch 'origin/feat/agent-router' into integrat…

4697116

…e-router

Merge remote-tracking branch 'origin/docs/readme-v06' into integrate-…

650dcc6

…readme

cursor Bot reviewed Jun 10, 2026

xdotli added 11 commits June 10, 2026 19:05

chore: release v0.6.0

40c4757

Merge remote-tracking branch 'origin/docs/v06-final-polish' into inte…

59480c0

…grate-final

Merge branch 'docs/harmonize-examples' into integrate-final

84082a2

Merge branch 'docs/harmonize-examples-wip' into integrate-final

433b015

# Conflicts:
#	docs/examples/coder-reviewer-demo.py
#	docs/examples/scene-patterns.ipynb

fix(authoring): scaffold emits the current schema_version

3eba1a5

bench tasks init wrote schema_version "1.0" while the parser default, the docs, and the migrate path all say "1.3" — fresh scaffolds started life two schema versions behind. Found during the example-corpus harmonization.
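One way to keep a scaffold and a parser default from drifting apart is to have both read a single constant. The sketch below is illustrative only: CURRENT_SCHEMA_VERSION, parse_frontmatter, scaffold_frontmatter, and the "name" field are hypothetical names, and only the "1.3" value comes from the message above.

```python
CURRENT_SCHEMA_VERSION = "1.3"  # value taken from the commit message


def parse_frontmatter(fields: dict) -> dict:
    """Hypothetical parser: fill in the default schema_version when absent."""
    return {"schema_version": CURRENT_SCHEMA_VERSION, **fields}


def scaffold_frontmatter(name: str) -> dict:
    """Hypothetical scaffold: stamp the shared constant instead of a literal."""
    return {"name": name, "schema_version": CURRENT_SCHEMA_VERSION}
```

A regression test can then assert that a fresh scaffold and an empty parse agree on the version, so a stale literal such as "1.0" cannot reappear unnoticed.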

Merge remote-tracking branch 'origin/release/v0.6.0' into integrate-back

c1fbbc9

Merge remote-tracking branch 'origin/release/v0.6.0' into sync-int

8ed3181

cursor Bot reviewed Jun 11, 2026

xdotli added 9 commits June 10, 2026 22:52

Merge remote-tracking branch 'origin/fix/daytona-reaper-ownership-sco…

8c9f3d1

…pe' into integrate-audit-fixes

Merge remote-tracking branch 'origin/fix/user-loop-stop-flag' into in…

5a6f95c

…tegrate-audit-fixes

Merge remote-tracking branch 'origin/fix/unskip-legacy-scaffold-tests…

6d9db71

…' into integrate-audit-fixes

Merge remote-tracking branch 'origin/fix/audit-docs' into integrate-a…

7d49b59

…udit-fixes

Merge remote-tracking branch 'origin/release/v0.6.0' into sync-int2

41d42e8

xdotli closed this Jun 11, 2026

V0.6 integration #664

xdotli wants to merge 65 commits into main from v0.6-integration

cursor Bot Jun 10, 2026

Verifier skill vertex precision mismatch

cursor Bot Jun 10, 2026

Judge ignores configured model

cursor Bot Jun 11, 2026

Judge ignores Anthropic model

cursor Bot Jun 11, 2026

Quickstart REAL check too weak
