V0.6 integration #664

Closed
xdotli wants to merge 65 commits into main from v0.6-integration

Conversation

@xdotli xdotli commented Jun 10, 2026

Member

Note

Low Risk
Changes are primarily release notes, navigation, and documentation/examples with a version bump; no application logic appears in the shown diff, so runtime risk is minimal unless unstaged code ships separately.

Overview
Release packaging and public docs for BenchFlow 0.6.0, aligning README, CHANGELOG, CITATION.cff, and install pins with the new version and documenting major 0.6 capabilities (task.md, bench tasks, hosted --source-env, trainer ATIF/ADP artifacts, and related fixes called out in the changelog).

Documentation and site structure: Mintlify docs.json adds agent quickstart, task-standard pages, architecture/environment guides, and a validation report; new docs/agent-quickstart.md is a copy-paste agent workflow. Historical v0.5 evidence moves under docs/archive/ with machine-specific paths replaced by repo-relative ones. Internal Claude skills now point at .claude/dev-docs/ instead of .dev-docs/, and the 0.3 plan is marked historical.

Authoring model and examples: User-facing docs treat task.md (YAML frontmatter + prompt) as the canonical task shape alongside legacy split layouts. Large new docs/examples/task-md/ fixtures cover schema-only packages, multi-scene/nudgebench patterns, SkillsBench ports, and generated skill-eval tasks with LLM-judge verifiers. Example notebooks/scripts bump to 0.6.0 and correct multi-agent guidance (shared-workspace file handoff, not automatic outbox injection).

Reviewed by Cursor Bugbot for commit 41d42e8. Bugbot is set up for automated code reviews on this repo. Configure here.

Test User and others added 30 commits June 9, 2026 19:07
Comment/docstring-only cleanup split out of #652 so the task.md feature
PRs review clean. No behavior change.
…xamples

The task.md standard document, the 2026-06-09 validation evidence
(88/88 conversion parity, 6/6 oracle E2E, 12 of 15 live mode-cells),
and the worked example corpus. Split from #652 — pure new docs, no
code. The task package implementation stacks on this PR because its
test suites use these examples as fixtures.
… export, rollout integration

The task.md package: TaskDocument parser, verifier strategies and
document, runtime capabilities, round-trip export with loss report,
prompt-plan compilation with roles/scenes sidecars, document-nudge
user contracts, and the rollout integration that loads task.md scenes.
Test suites included; the docs/examples PR underneath provides their
fixtures.

Lives outside this slice (next in stack): acceptance_live + authoring
CLI, adapters/sandbox/cli wiring, and their suites. Split from #652.
…sk.md

The remainder of #652 on top of the core package: the live acceptance
gate (acceptance_live), the authoring pipeline (init/check/migrate/
export via task_authoring + CLI --format task-md), inbound adapter
promotion (tests/->verifier/, solution/->oracle/), ORS tool-output
reward events, sandbox lockdown/launch validation and Daytona download
fallback, skill_eval + trace generation output formats, and the
remaining test updates.
Add tests for benchflow.task.acceptance_live and the acceptance evidence
gate in task_authoring, which previously had no coverage:

- spec parsing accept/reject paths, including generated calibration
  cases, report path handling, and leaderboard declarations
- execution paths against a scripted fake rollout layer: report schema
  and sha256 sidecar, run records, reward expectations, flake budgets,
  verifier error classification, oracle reruns, and leaderboard
  suitability
- static evidence gate accept/reject paths, including tampered and
  malformed artifacts, sha256 pinning, and threshold cross-checks
- check_task wiring: validation-level gating, fail-closed live
  execution preconditions, and a green end-to-end acceptance-live run
An explicit zero, negative, or non-finite [verifier].timeout_sec passed
validation and flowed straight into the verifier execution budget
(asyncio.wait_for / sandbox exec), producing an instant silent timeout.
Reject unusable budgets at parse time with an error naming the field;
omitting the field keeps inheriting the documented 600s default, and
authoring profiles keep overriding it.
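
The parse-time rejection described above can be sketched as follows. This is a minimal illustration, not the BenchFlow implementation: the function name `parse_verifier_timeout` and the plain-dict input are assumptions, while the 600s default and the `[verifier].timeout_sec` field name come from the commit message.

```python
import math

DEFAULT_VERIFIER_TIMEOUT_SEC = 600.0  # documented default per the commit message


def parse_verifier_timeout(verifier_table: dict) -> float:
    """Return a usable verifier budget, or raise an error naming the field.

    Hypothetical helper: the real parser's name and signature may differ.
    """
    if "timeout_sec" not in verifier_table:
        # Omitting the field keeps inheriting the default.
        return DEFAULT_VERIFIER_TIMEOUT_SEC
    value = verifier_table["timeout_sec"]
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"[verifier].timeout_sec must be a number, got {value!r}")
    if not math.isfinite(value) or value <= 0:
        raise ValueError(
            f"[verifier].timeout_sec must be a positive finite number, got {value!r}"
        )
    return float(value)
```

Rejecting here, rather than at execution time, means a zero or NaN budget never reaches `asyncio.wait_for`, where it would surface only as an instant timeout.
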
Convert captured ACP trajectory events to two open agent-trajectory
interchange formats, alongside the existing Verifiers/ORS exporter:

- export_atif.py: emits one ATIF trajectory document per rollout
  (trainer/atif.json), pinned against the Harbor pydantic models
  (harbor-framework/harbor, src/harbor/models/trajectories/*.py) and
  the ATIF RFC (rfcs/0001-trajectory-format.md), schema ATIF-v1.7.
  Honors the spec validators: sequential step_ids, agent-only fields,
  and same-step source_call_id references. ACP titles/statuses ride in
  ToolCall.extra rather than being passed off as structured arguments.

- export_adp.py: emits one ADP Trajectory record per rollout
  (trainer/adp.jsonl) plus a job-level aggregator, pinned against the
  ADP schemas (neulab/agent-data-protocol, schema/*.py), version 1.3.1.
  Every api_action carries a unique tool_call_id with exactly one
  matched environment text_observation, as ADP validation requires.

Both reuse the house redaction and non-finite-scrub seams and carry
golden-fixture tests asserting the exact emitted records.
…formance

Replace the synthetic Harbor parity fixtures (fabricated sha256 digests)
with five small text-only tasks vendored verbatim from
benchflow-ai/skillsbench at the commit pinned in manifest.json, so the
conversion-parity claim is reproducible from the public repo.

The test is now a conformance runner: each vendored task goes through
build_harbor_roundtrip_conformance_report (split -> task.md -> split)
and must report zero mismatches across the canonical task.toml config,
normalized instruction.md prompt, and environment/solution/tests file
maps. Companion tests pin the fixture inventory and per-file digests to
the manifest and enforce the text-only, under-300KB footprint budget.
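
Pinning per-file digests to a manifest could look roughly like the sketch below. The manifest shape (a flat mapping of relative path to sha256 hex digest) and the name `digest_mismatches` are assumptions for illustration; the actual `manifest.json` layout is not shown in this PR.

```python
import hashlib
from pathlib import Path


def digest_mismatches(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return relative paths whose sha256 differs from, or is absent on, disk.

    `manifest` maps posix-style relative paths to hex digests (assumed shape).
    """
    on_disk = {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
    # Files present on disk but unknown to, or different from, the manifest.
    drift = [path for path, sha in on_disk.items() if manifest.get(path) != sha]
    # Files the manifest pins that no longer exist on disk.
    missing = [path for path in manifest if path not in on_disk]
    return sorted(drift + missing)
```

An empty list means the vendored inventory and the manifest agree in both directions.
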
litellm defaults to LITELLM_MODE=DEV, which calls load_dotenv() on first
import and walks up parent directories until it finds a developer
`.env`. Real credentials from that file then leak into os.environ for
every test that runs after the first litellm import, breaking tests
that assert resolved agent envs carry no inherited API keys (the
subscription-auth alias test failed this way whenever a `.env` existed
above the checkout). Default the suite to LITELLM_MODE=PRODUCTION,
mirroring the existing BENCHFLOW_DOTENV_PATH isolation.
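
A minimal sketch of that default, assuming it sits at module level in a `conftest.py` that is imported before any test imports litellm (the exact placement in the repo is not shown here):

```python
# conftest.py (sketch)
import os

# Must run before the first `import litellm`, because the dotenv load is an
# import-time side effect. setdefault leaves an explicit caller override alone.
os.environ.setdefault("LITELLM_MODE", "PRODUCTION")
```
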
- New docs/task-authoring-task-md.md: minimal three-file native task,
  frontmatter key classes and shorthands, prompts/ sidecars, verifier.md
  strategy declaration, oracle layout, and legacy migration; linked from
  docs.json nav and the split-layout authoring guide.
- task-standard.md: scope the cross-standard interop sequencing claim --
  ORS interop is a reward-contract bridge, there is no AAA import adapter,
  and Harbor split-layout conversion is the proven import/export path.
- 2026-06-09 validation report: state live-run coverage as 12 of 15
  mode-cells (4 of 5 tasks across 3 skill modes; the fifth fails for
  non-task.md infrastructure reasons).
… end

The three native verifier.md strategies had no coverage through
Verifier.verify(). Add tests/test_verifier_strategies.py driving the full
verify() dispatch over a fake sandbox and a real rollout directory:

- reward-kit: every reachable criteria aggregation (weighted_mean,
  weighted_sum, all_pass, any_pass, threshold) with exact reward
  assertions including >=0.5 boundary cases, the runner command/env/
  manifest contract, and the metrics-must-match-criteria error.
- agent-judge: judge model stubbed at the call_judge seam; pass, fail,
  and fractional-score verdicts plus malformed responses, which fail
  closed (VerifierOutputParseError, no reward file emitted) and missing
  declared inputs (AgentJudgeInputError before the judge is called).
- ors-episode: terminal-reward extraction pinned against surrounding
  dense step events, which are retained in the ORS evidence but never
  aggregated into the headline reward; dense-only episodes, out-of-range
  terminal rewards, and missing evidence pinned to their error types.

Running the suite locally surfaced cross-test secret leakage: importing
litellm runs load_dotenv(), which walks ancestor directories and injects
any developer .env into os.environ mid-suite, breaking later
env-sensitive tests (test_subscription_auth). Set LITELLM_MODE=PRODUCTION
in conftest to disable that import-time side effect, matching the intent
of the existing isolate_local_dotenv fixture.
Address review findings on the vendored conversion-parity slice:

- Restore the eight runtime-results checker tests (rc codes, output
  messages, schema normalization, reward drift, provenance, skill lane,
  source-sha, baseline pin) as
  tests/test_check_skillsbench_harbor_parity.py; the checker module is
  still wired into tests/integration/run_suite.py and must keep fast
  unit coverage.
- Rename the conformance file to
  tests/test_skillsbench_conversion_conformance.py so its name stops
  implying it covers the runtime-results checker.
- Add a negative conformance test that tampers one exported environment
  file via a wrapped export_task_to_split_layout and asserts the report
  flags drift with the exact mismatch path and reason, so a silently
  disabled comparator fails the suite.
- Derive the vendored task list and parametrization from
  manifest.json instead of a hand-maintained tuple; the inventory test
  asserts disk directories match manifest keys.
- Shield test_claude_oauth_alias_satisfies_anthropic_key_requirement
  from an ANTHROPIC_API_KEY exported in the developer shell with
  monkeypatch.delenv, matching the sibling subscription tests.
…s, split roadmap

- task-authoring-task-md.md: agent.timeout_sec is optional in the
  implementation (AgentConfig defaults to None; check_task treats it as
  optional; rollout falls back to no wall-clock cap), so describe it as
  strongly recommended instead of fail-closed required. Drop the stale
  version pin from the minimal example; the schema default applies.
- task-authoring.md / task-standard roadmap: replace named hosted
  platforms with neutral 'hosted competition platforms' framing.
- Split the Open Primitives and Roadmap section out of task-standard.md
  into docs/task-standard-roadmap.md (task-standard.md was over the
  1,000-line bar) and add the new page to the docs.json nav.
- Harden two environment-sensitive tests to isolate host credentials:
  the litellm required-usage test now hides a real ~/.codex/auth.json
  via a temp HOME, and the Claude OAuth alias test clears host
  Anthropic/Claude env vars and redirects expanduser.
Extract the format-independent pieces of the Verifiers/ATIF/ADP
emitters into trajectories/_export_common.py:

- aggregate_rollout_jsonl single-owns the job-level skip/normalize/
  redact aggregation; write_job_verifiers_jsonl and write_job_adp_jsonl
  were verbatim copies that could drift, and now both delegate to it.
- content_blocks_to_text moves out of export_atif so the ADP emitter no
  longer reaches into the ATIF module for a generic ACP content
  renderer. Its Verifiers sibling _tool_call_to_content is
  cross-referenced for the eventual consolidation of the renderers.
- ThoughtBuffer single-owns the agent_thought join-and-clear
  bookkeeping that the ATIF and ADP event walkers duplicated.

write_rollout_atif_json now builds the steps once, catching the
documented empty-trajectory ValueError instead of pre-walking the
events, and the ATIF fallback tool-call id documents why it needs no
trajectory-wide dedupe (unlike ADP's claim_call_id). Direct tests cover
the shared aggregator's newline normalization and unreadable-artifact
skip paths, which were previously only reachable through the wrappers.
The default scaffold pinned agent: claude-agent-acp inside a scenes
block, which silently overrode --agent on bench eval create — the
scaffold's own documented oracle smoke test ran claude-agent-acp and
failed on ACP auth. Scaffold is now bare prose (front matter + prompt);
roles/scenes stay opt-in. Regression test asserts the scaffold pins no
agent and declares no scenes.
On Docker 29.x, compose up can print 'Network ... Created' yet fail the
container create/start that follows with 'Error response from daemon:
failed to set up container networking: network <project>_default not
found' — a daemon-side race between network creation and attachment.

Retry compose up with bounded backoff (2s, 5s) only when the failure
matches that daemon network-not-found signature, logging each retry.
All other compose up failures surface unchanged.
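
The retry policy can be sketched as below. This is an illustration under stated assumptions: `run_compose_up` is a hypothetical callable standing in for the real `docker compose up` invocation, and the regex is one plausible way to match the daemon message quoted above. The 2s and 5s delays come from the commit message.

```python
import re
import time

# Matches the daemon error quoted in the commit message (regex is an assumption).
NETWORK_RACE = re.compile(
    r"failed to set up container networking: network \S+ not found"
)
BACKOFF_SECONDS = (2, 5)  # bounded backoff: at most two retries


def compose_up_with_retry(run_compose_up, sleep=time.sleep, log=print):
    """Call run_compose_up() -> (returncode, stderr), retrying only the race."""
    delays = list(BACKOFF_SECONDS)
    while True:
        returncode, stderr = run_compose_up()
        # Success, an unrelated failure, or an exhausted budget all return as-is.
        if returncode == 0 or not NETWORK_RACE.search(stderr) or not delays:
            return returncode, stderr
        delay = delays.pop(0)
        log(f"compose up hit the network-not-found race; retrying in {delay}s")
        sleep(delay)
```

Injecting `sleep` and `log` keeps the policy testable without a Docker daemon or real waiting.
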
A bare msg[:50] slice in the per-rollout console line could cut an
embedded token in half with no marker, so an error like 'Docker compose
command failed for environment authored-task...' rendered as
'... environment auth' — reading as a complete environment named 'auth'.

Add _utils.text.truncate_end, which only ever removes the tail of a
message, backs up to the previous word boundary when the cut lands
mid-token, and marks the cut with an ellipsis while staying within the
width budget. Apply it to the [ERR] result line, the retry preview, and
the resume-skip log in evaluation.py.
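
A sketch of the behavior described above, written independently of the actual `_utils.text.truncate_end` (whose signature and ellipsis character are not shown in this PR; the single-character ellipsis and the `width` parameter are assumptions):

```python
ELLIPSIS = "…"  # assumed marker; the real helper may use "..."


def truncate_end(text: str, width: int) -> str:
    """Trim only the tail of `text` so the result fits in `width` characters."""
    if width < 1:
        raise ValueError("width must be at least 1")
    if len(text) <= width:
        return text
    keep = width - len(ELLIPSIS)
    head = text[:keep]
    # The cut lands mid-token when the characters on both sides are non-space.
    if keep > 0 and not text[keep].isspace() and not head[-1].isspace():
        boundary = head.rfind(" ")
        if boundary > 0:
            head = head[:boundary]  # back up to the previous word boundary
    return head.rstrip() + ELLIPSIS


msg = "Docker compose command failed for environment authored-task"
print(msg[:50])               # Docker compose command failed for environment auth
print(truncate_end(msg, 50))  # Docker compose command failed for environment…
```

A token with no preceding space is still hard-cut, since there is no boundary to back up to; the ellipsis marks the cut either way.
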
- Lead the first-eval examples with the local --tasks-dir pattern and
  note that --source-repo clones the full repository, with a sparse
  checkout alternative for fetching a single task
- Document the <PROVIDER>_API_KEY + <PROVIDER>_BASE_URL convention for
  provider-prefixed models and the export requirement for .env files
- Correct the output layout to the real
  <jobs-dir>/<timestamp>/<task>__<hash8>/ contract and list what lands
  in result.json, trajectory/, trainer/, and verifier/
- Add a 'Reading results' note on exit codes, Score vs reward, and the
  Docker daemon requirement
- Document bench tasks migrate/normalize and the --level acceptance
  checks as development-branch features, with a pointer to the
  benchflow.evidence schema in docs/task-standard.md
…-consistent

Three fixes from live-run dogfooding of sandbox rollouts:

- config.json recorded sandbox_setup_timeout (default 120) while the agent
  install step enforced the per-agent registry install_timeout (900), so the
  recorded budget never matched the 'Installing <agent> in sandbox
  (timeout=900s)' log. The fallback chain now lives in one place,
  agents.install.effective_install_timeout: a per-agent registry value
  overrides the configured sandbox_setup_timeout, installers without a
  registry entry fall back to the config value instead of a hardcoded 900,
  and config.json records the effective value as agent_install_timeout
  (None when the agent has no install step). Hosted-env config.json carries
  the same key for schema parity.

- 'Discovered N pytest plugins from container' said container even on
  non-container backends; the discovery messages now say sandbox.

- Docker rollouts left agent/acp_trajectory.jsonl on the host while Daytona
  rollouts did not. The publication exists for in-sandbox verifiers
  (/logs/agent/acp_trajectory.jsonl, read by LLM-judge verifiers); on Docker
  the host copy appeared only because /logs/agent is bind-mounted to the
  host agent dir, while remote sandboxes never mirror it back. Decision:
  write the host copy on every backend rather than suppress it on Docker —
  /logs/agent maps to the host agent dir by mount design, the sandbox-side
  file must keep existing for verifiers, and dropping the Docker copy would
  delete an already-published artifact. _publish_trajectory_for_verifier now
  writes the redacted payload to the rollout agent/ dir itself, so both
  backends produce the same artifact set.
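
The install-timeout fallback chain from the first fix can be sketched as follows. Only the name `effective_install_timeout` and the 120/900 values come from the commit message; the parameters shown are assumptions, not the real signature in `agents.install`.

```python
from typing import Optional


def effective_install_timeout(
    has_install_step: bool,
    registry_install_timeout: Optional[int],
    sandbox_setup_timeout: int = 120,
) -> Optional[int]:
    """Resolve the budget that config.json records as agent_install_timeout."""
    if not has_install_step:
        return None  # nothing to install, so no budget is recorded
    if registry_install_timeout is not None:
        return registry_install_timeout  # per-agent registry value wins
    return sandbox_setup_timeout  # configured fallback, not a hardcoded 900


print(effective_install_timeout(True, 900))    # 900
print(effective_install_timeout(True, None))   # 120
print(effective_install_timeout(False, 900))   # None
```
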
bench tasks migrate materialized the full TaskConfig model dump into the
generated task.md frontmatter — dozens of runtime defaults the author
never declared, including a verifier.judge block pinning a judge model.
Migration now emits only the keys the source task.toml actually declared:
import_task_config_toml exposes the declared (extras-sanitized) mapping,
and render_task_md_from_legacy renders that surface instead of reparsing
a full model dump. The existing migrate-time equivalence check still
proves the minimal document parses to the same TaskConfig.

Also reconcile the version key spelling between authoring entrypoints:
init scaffolded 'version:' while migrate wrote 'schema_version'. The
task standard treats schema_version as canonical with version as a
parse-time alias, so both init and migrate now emit schema_version
(pinned first in migrated frontmatter), and the alias remains accepted
on parse.
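
The key-spelling reconciliation can be illustrated with a small sketch. The helper name `normalize_frontmatter` is hypothetical, and preferring `schema_version` when both spellings are present is an assumption the source does not address.

```python
def normalize_frontmatter(declared: dict) -> dict:
    """Emit schema_version first, accepting legacy 'version' as an alias.

    Only keys the author declared are carried over; no defaults are added.
    """
    version = declared.get("schema_version", declared.get("version"))
    out = {}
    if version is not None:
        out["schema_version"] = version  # pinned first, as in migrated frontmatter
    for key, value in declared.items():
        if key not in ("schema_version", "version"):
            out[key] = value
    return out


print(normalize_frontmatter({"agent": {"timeout_sec": 900}, "version": "1.3"}))
# {'schema_version': '1.3', 'agent': {'timeout_sec': 900}}
```
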
…tions

- Quote the full missing-base-URL error including 'to build the provider
  base URL', matching the message raised during agent env resolution.
- Split exit codes accurately: unknown agents / missing credentials exit 1,
  CLI usage errors (bad flags) exit 2.
- Mark the jobs-root summary.json as a copy of the latest job summary and
  llm_trajectory.jsonl as conditional on captured proxy exchanges.
- Scope the <PROVIDER>_API_KEY + <PROVIDER>_BASE_URL convention to
  user-supplied-endpoint providers; fixed-endpoint providers need only a key.
- Note the dataset cache resolves against the enclosing git repo root.
- Restyle the tasks check --level availability note as a callout to match
  the migrate/normalize sections.
xdotli added 4 commits June 10, 2026 18:48
Add three cohesive subcommands to the existing `bench agent` group for
adopting an upstream benchmark into a BenchFlow benchmark:

- create <name>: deterministic, fail-closed scaffold of benchmarks/<name>/
  matching the reference layout and benchmarks/CONVERT.md (converter with a
  documented convert() entry point, parity test, templated parity_experiment,
  benchmark.yaml, runner, README). Validates the slug; refuses to overwrite.
- run <source>: driver that assembles the adoption context (source +
  CONVERT.md + adoption skills) and launches the host codex CLI to drive the
  conversion. Context assembly and launch-command construction are pure
  functions, unit-proven with a fake exec layer; live codex run deferred.
  Fails closed with a precise message when codex has no credentials.
- verify <name>: parity-only gate over a deterministic per-criterion
  conversion-faithfulness floor plus a statistical reward-distribution layer.
  Emits a confidence verdict and, on divergence, a draft GitHub issue body for
  human support (never auto-filed).

Real logic lives in benchflow/agent_router.py (+ scaffold templates module);
cli/main.py only registers the commands. Tests cover scaffold contents, name
validation, refuse-on-exists, context/command assembly, the credential
fail-closed path, and pass/fail/insufficient verdicts with issue drafts.
…e text, drop unused arg

Review follow-ups: --roundtrip-task wires roundtrip_conformance_status
into verify (was library-only); confidence_line no longer claims
criterion agreement when zero criteria were compared (reward-only
confirm path); collect_adoption_skills drops the unused repo_root; new
test exercises the real build_harbor_roundtrip_conformance_report import
so the wiring can't silently rot.
@chatgpt-codex-connector


You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.
To continue using code reviews, add credits to your account and enable them for code reviews in your settings.

@mintlify

mintlify Bot commented Jun 10, 2026


Preview deployment for your docs. Learn more about Mintlify Previews.

| Project | Status | Preview | Updated (UTC) |
| --- | --- | --- | --- |
| benchflow-bff148e7 | 🔴 Failed | | Jun 10, 2026, 11:04 PM |



# Component Separation
def quantize(v):
    return (round(v[0], 4), round(v[1], 4), round(v[2], 4))


Verifier skill vertex precision mismatch

High Severity

The pytest ground-truth path quantizes STL vertices to four decimals when building connected components, while the bundled MeshAnalyzer skill rounds to five. That can split or merge components differently than agents using the skill, so correct mass_report.json outputs may fail material_id or mass checks.
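
To see why the precision gap matters, here is a small sketch with two made-up vertices. The `ndigits` parameter is added for illustration; the snippet under review hardcodes 4, and the skill is reported to use 5.

```python
def quantize(v, ndigits):
    """Round each coordinate of a 3D vertex to `ndigits` decimals."""
    return (round(v[0], ndigits), round(v[1], ndigits), round(v[2], ndigits))


# Two hypothetical vertices that differ in the fifth decimal place.
a = (1.00001, 0.0, 0.0)
b = (1.00002, 0.0, 0.0)

print(quantize(a, 4) == quantize(b, 4))  # True: merged into one vertex
print(quantize(a, 5) == quantize(b, 5))  # False: kept as two vertices
```

If triangles touching `a` and `b` are joined only through that shared vertex, the four-decimal path reports one connected component where the five-decimal path reports two.
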

Additional Locations (1)

Reviewed by Cursor Bugbot for commit 650dcc6. Configure here.

model=model if is_gemini_model else "gemini-3.1-flash-lite",
contents=prompt,
)
return response.text


Judge ignores configured model

Medium Severity

In call_llm, whenever GOOGLE_API_KEY or GEMINI_API_KEY is set, the judge calls Gemini (defaulting to gemini-3.1-flash-lite) before Anthropic, even when JUDGE_MODEL is a Claude model. Skill-eval scores can come from a different model than configured.


Reviewed by Cursor Bugbot for commit 650dcc6. Configure here.

xdotli added 11 commits June 10, 2026 19:05
…nt-matter shape

Normalize all 11 task.md examples without adding/removing any or changing
what each one teaches:

- Front matter key: every example now declares schema_version: "1.3"
  (the parser's current TaskConfig default); the legacy 'version:' alias
  and the two missing-key cases are gone.
- Agent pinning convention: examples whose pedagogy is roles/scenes
  (multi-scene, nudgebench-team, harbor-parity, private-facts-nudges)
  keep their agents/scenes blocks but pin one agent uniformly:
  claude-agent-acp, the registry's default ACP agent (DEFAULT_AGENT in
  evaluation.py). Per-role model/reasoning_effort pins are dropped as
  noise; roles stay differentiated by capabilities. Examples whose point
  is not roles/scenes pin no agent.
- Simulated-user blocks use the neutral 'model: scripted' runtime
  consistently instead of naming specific model ids.
- Timeout values normalized to integers in the real-skillsbench trio.
- docs/examples/task-md/README.md index now lists clean-body-roles-scenes
  among the schema-only fixtures.
- docs/examples/README.md notes that nanofirm-task intentionally stays in
  the legacy task.toml split layout as the bench tasks migrate fixture.

Gate: bench tasks check passes for all 11 (schema level for the four
authoring fixtures, publication-grade for the seven runnable packages);
all examples parse via TaskDocument.from_path; full non-integration
pytest suite green.
…0.6 surface

- coder-reviewer-demo.py: describe the shared-workspace file handoff
  (the outbox auto-injection convention no longer exists in the runtime);
  pin benchflow==0.6.0
- scene-patterns.ipynb: replace the removed outbox convention with the
  explicit feedback-file handoff used by the runtime and the demo script;
  clear pattern-3 and comparison outputs recorded under the old prompts;
  pin benchflow==0.6.0
- swebench_pro_progressive_disclosure.ipynb: fix ../docs/ relative links
  (notebook lives in docs/examples/), replace the hardcoded author-machine
  chdir with a repo-root walk, re-execute the pure-local cells so the
  committed RoundResult output shows the 0.6 fields (scene, role,
  handoff_from, handoff_to), document the task.md frontmatter form of the
  hardening opt-out, and add the task.md-declared user pattern to the
  multi-round comparison
- user_dogfood.py: round-0 comment names task.md as the canonical
  instruction source alongside legacy instruction.md
- Bump release-state text and install pins from 0.5.2 to 0.6.0 across
  README, getting-started, release, llm-judge, skill-eval, python-api,
  and the runnable examples; update release.md worked examples to the
  0.6.x line (next dev: 0.6.1.dev0).
- Mark the task standard as shipped with 0.6.0; lead the Task primitive
  and authoring links with the native task.md format.
- Fix doc/code drift: timeout_sec is optional (no wall-clock cap when
  unset), bench tasks check no longer claims to require it, oracle runs
  use --agent (no -a short flag), legacy scaffold examples now pass
  --format legacy, and the results tree documents rewards.jsonl,
  timing.json, prompts.json, and the trainer ATIF/ADP artifacts.
- CLI reference: document bench tasks export, eval create worker/build
  concurrency flags and --reasoning-effort, bench continue --proxy-mode,
  bench continue-batch, tasks init --format, and tasks generate
  --task-format; add an export note to the task.md authoring guide.
- Add docs/architecture and docs/environment-plane to the site nav.
- Archive dated v0.4/v0.5 evidence docs under docs/archive/ (with README)
  and scrub absolute home-dir paths from the archived reports; update the
  skill-eval inbound links.
- Fix broken .dev-docs/ links in labs READMEs and internal skill files;
  mark the 0.3 plan as historical.
- Add docs/agent-quickstart.md (copy-paste prompt that walks an AI agent
  through install, one real eval, artifact inspection, and task
  authoring) and link it from the README and nav.
# Conflicts:
#	docs/examples/coder-reviewer-demo.py
#	docs/examples/scene-patterns.ipynb
bench tasks init wrote schema_version "1.0" while the parser default,
the docs, and the migrate path all say "1.3" — fresh scaffolds started
life two schema versions behind. Found during the example-corpus
harmonization.
… PyPI

The all-or-nothing version gate stopped a verbatim cold user during the
publish window. Steps 0-6 work on 0.5.x; the prompt now continues in
degraded mode with explicit expectations (verifiers.jsonl only, split
scaffold) and tells the user what 0.6.0 unlocks.

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.

There are 4 total unresolved issues (including 2 from previous reviews).


Reviewed by Cursor Bugbot for commit 8ed3181. Configure here.

model=model if is_gemini_model else "gemini-3.1-flash-lite",
contents=prompt,
)
return response.text


Judge ignores Anthropic model

Medium Severity

In the new skill-eval example judge.py verifiers, call_llm enters the Gemini branch whenever GOOGLE_API_KEY or GEMINI_API_KEY is set, even when JUDGE_MODEL is an Anthropic model (the default). It then calls Gemini with a hardcoded flash-lite model instead of the configured judge, so rubric scoring can silently use the wrong provider.
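
One possible shape for a fix is to route on the configured model name first and only then check for that provider's key. The sketch below is not the `judge.py` code: `pick_judge_provider` is a hypothetical helper, and prefix-based routing on `gemini` and `claude` is an assumption about how model ids are named.

```python
def pick_judge_provider(judge_model: str, env: dict) -> str:
    """Choose the provider from the configured model, never from key presence alone."""
    name = judge_model.lower()
    if name.startswith("gemini"):
        if env.get("GOOGLE_API_KEY") or env.get("GEMINI_API_KEY"):
            return "gemini"
        raise RuntimeError(f"{judge_model} needs GOOGLE_API_KEY or GEMINI_API_KEY")
    if name.startswith("claude"):
        if env.get("ANTHROPIC_API_KEY"):
            return "anthropic"
        raise RuntimeError(f"{judge_model} needs ANTHROPIC_API_KEY")
    raise ValueError(f"no provider known for judge model {judge_model!r}")
```

With both keys exported, a Claude judge model then resolves to Anthropic, and a missing key fails loudly instead of silently switching providers.
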

Additional Locations (2)

Reviewed by Cursor Bugbot for commit 8ed3181. Configure here.

Comment thread docs/agent-quickstart.md
2. `agent_result.total_tokens` > 0 (real model traffic was captured)
3. `rewards` is present and its value is not null (the verifier scored it)
Check them with a one-liner (kept on one line so indentation cannot break it):
python3 -c "import json,sys; d=json.load(open(sys.argv[1])); t=d.get('n_tool_calls') or 0; k=(d.get('agent_result') or {}).get('total_tokens') or 0; r=d.get('rewards'); print('n_tool_calls:',t,'total_tokens:',k,'rewards:',r); print('REAL run' if t>0 and k>0 and r else 'NOT a real run'); sys.exit(0 if t>0 and k>0 and r else 1)" "$(find "$JOBS_DIR" -name result.json | head -1)"


Quickstart REAL check too weak

Low Severity

Step 5 tells agents a run is REAL only when rewards is present and not null, but the bundled one-liner treats any non-empty rewards object as success. A result.json with rewards: {} or {"reward": null} can be labeled REAL even though no verifier score was recorded.
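
A stricter check along the lines of this finding might look like the sketch below. It assumes `rewards` is a mapping of names to numeric scores, as the `{"reward": null}` example suggests; the helper names are hypothetical and the key names are the ones used by the quickstart one-liner.

```python
import math


def has_recorded_reward(rewards) -> bool:
    """True only when at least one finite numeric score is present."""
    if not isinstance(rewards, dict):
        return False
    return any(
        isinstance(v, (int, float)) and not isinstance(v, bool) and math.isfinite(v)
        for v in rewards.values()
    )


def is_real_run(result: dict) -> bool:
    """Apply the three quickstart conditions to a parsed result.json."""
    tool_calls = result.get("n_tool_calls") or 0
    tokens = (result.get("agent_result") or {}).get("total_tokens") or 0
    return tool_calls > 0 and tokens > 0 and has_recorded_reward(result.get("rewards"))
```

A score of `0.0` still counts as recorded, so a failed-but-scored run is not mistaken for an unscored one.
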


Reviewed by Cursor Bugbot for commit 8ed3181. Configure here.

xdotli added 9 commits June 10, 2026 22:52
…tadata

Audit P0/P1: the 0.6.0 changelog claimed an OpenReward hosted-environment
episode runner, but only the ORS reward-format export + ors-episode
verifier recognition ship in this release (the runner is in progress, PR
pending). Reworded to what actually ships. Also bump CITATION.cff to
0.6.0 — a missed release step that would show the wrong version in
GitHub's cite box.
The user-driven rollout loop ran the scene-step loop and then a
free-round while-loop. Both break paths out of the scene-step loop —
user.run() raising (self._error set) and a classic user returning None
(stop) — fell straight into the free-round loop, which called user.run()
again with the same round_num. This resurrected stopped users and
retried users that had already raised, executing extra rounds while
self._error stayed set: a half-script rollout reported as errored.

Set a loop_terminated flag on both break paths and guard the free-round
while-loop on it so it neither executes nor re-calls user.run() once the
scene-step loop has stopped or errored. Normal completion is unchanged.

Add call-counting tests in test_user.py: after an explicit stop, run()
is invoked exactly once and zero free rounds execute; after a raise, the
loop terminates with the error set and no further run()/agent rounds
occur. Both fail if the guard is removed (the prior tests used a user
that always raised, masking the resurrection).
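
The guard can be illustrated with a heavily simplified loop. Everything here except the `loop_terminated` flag is hypothetical: the real rollout loop drives agents and scenes, while this sketch only counts `user.run()` calls to show why the flag is needed.

```python
def run_rollout(user, scene_steps: int, max_free_rounds: int):
    """Run scene steps, then free rounds, unless the scene loop stopped or errored.

    user.run(round_num) returns a prompt string, or None to stop.
    Returns (rounds_completed, error).
    """
    error = None
    loop_terminated = False
    round_num = 0
    for _ in range(scene_steps):
        try:
            prompt = user.run(round_num)
        except Exception as exc:
            error = exc
            loop_terminated = True  # break path 1: user raised
            break
        if prompt is None:
            loop_terminated = True  # break path 2: user asked to stop
            break
        round_num += 1
    free_rounds = 0
    # Without the flag, both break paths would fall into this loop and call
    # user.run() again with the same round_num.
    while not loop_terminated and free_rounds < max_free_rounds:
        try:
            prompt = user.run(round_num)
        except Exception as exc:
            error = exc
            break
        if prompt is None:
            break
        round_num += 1
        free_rounds += 1
    return round_num, error
```
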
The legacy split scaffold still ships on this release: bench tasks init
--format legacy emits task.toml + instruction.md + tests/ + solution/,
and the legacy code paths (render_task_md_from_legacy, the tests/ and
solution/ aliases, legacy_solution_dir) remain in src/benchflow. Ten
tests guarding that behavior had been marked skip as 'legacy scaffold
superseded', leaving the shipped path unguarded.

Un-skip all ten and pin them to the current contract:

- TestCheckTask.test_missing_tests_dir asserts the current missing
  verifier-package message that names the legacy tests/ fallback.
- The nine init scaffold tests move into TestInitTaskLegacyScaffold and
  call init_task(..., task_format='legacy'), so they exercise the legacy
  layout (tests/, solution/) rather than the new task.md default.
- The replace-all-placeholders test now also replaces tests/test_outputs.py,
  which the fresh legacy scaffold ships with a [REPLACE: ...] marker, so a
  fully edited task is the only one that passes check_task.

Full suite green; tests/test_tasks.py carries no skips.
The Daytona auto-reaper deleted every sandbox visible to DAYTONA_API_KEY by
age alone. On a key shared across an org or with other tools this irreversibly
destroyed unrelated sandboxes, since nothing distinguished benchflow's
sandboxes from foreign ones.

Stamp every sandbox benchflow creates with a 'benchflow.managed' label at all
six CreateSandboxFromImageParams / CreateSandboxFromSnapshotParams sites (the
pinned daytona==0.184.0 SDK supports labels on CreateSandboxBaseParams), and
restrict reap_stale_sandboxes to only those sandboxes before applying the age
TTL. Foreign or unlabeled sandboxes are never touched, regardless of age. The
TTL tiers and the BENCHFLOW_DAYTONA_AUTO_REAP gate are unchanged.

A fresh label dict is built per creation call because the SDK mutates
params.labels in place (it injects the language label).

Tests: foreign/unlabeled sandboxes (even 9999m old) are never reaped; owned
stale ones still are; the scope predicate and CLI cleanup wrapper are covered
directly, including value-equality and non-mapping cases.
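
The scoping rule can be sketched with plain dictionaries standing in for Daytona sandbox objects. The label name comes from the commit message; the label value `"true"`, the dict shape, and the helper names are assumptions for illustration only.

```python
from collections.abc import Mapping

MANAGED_LABEL = "benchflow.managed"  # label name from the commit message
MANAGED_VALUE = "true"               # assumed value; the source does not state it


def managed_labels() -> dict:
    """Build a fresh dict per call, since the SDK mutates params.labels in place."""
    return {MANAGED_LABEL: MANAGED_VALUE}


def is_benchflow_managed(labels) -> bool:
    """Ownership predicate: value equality, and non-mappings are never owned."""
    return isinstance(labels, Mapping) and labels.get(MANAGED_LABEL) == MANAGED_VALUE


def select_reapable(sandboxes, ttl_minutes: int) -> list:
    """Filter by ownership first, then apply the age TTL."""
    return [
        s["id"]
        for s in sandboxes
        if is_benchflow_managed(s.get("labels")) and s["age_minutes"] > ttl_minutes
    ]
```

Returning a new dict from `managed_labels()` keeps an injected key from one creation call from leaking into the next.
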
@xdotli

xdotli commented Jun 11, 2026

Member Author

Consolidating v0.6 onto a single working branch: release/v0.6.0 is now the v0.6 branch (carries the same content as v0.6-integration). Closing this duplicate review vehicle; v0.6 work and the release cut track via #665.

@xdotli xdotli closed this Jun 11, 2026