Merged
339 commits
71d2f4f
fix(sandbox): lock /testbed_verify so the agent cannot read grading-s…
xdotli Jun 11, 2026
6322183
Merge remote-tracking branch 'origin/rc2/verifier-home-isolation' int…
xdotli Jun 11, 2026
52bccde
docs: disambiguate Layer 2 flows and fix quickstart RC install + scaf…
xdotli Jun 11, 2026
731ee8c
fix(agent verify, tasks init): guard list-shaped parity files; report…
xdotli Jun 11, 2026
058b742
docs(running-any-benchmark): narrow Layer 2 verify claims to match th…
xdotli Jun 11, 2026
93aab7b
Merge remote-tracking branch 'origin/rc3/verify-crash-and-init-summar…
xdotli Jun 11, 2026
2871f61
Merge remote-tracking branch 'origin/rc3/doc-l2-and-quickstart' into …
xdotli Jun 11, 2026
862f21a
chore: roll 0.6.0rc3 (RC-loop iterations 2-3)
xdotli Jun 11, 2026
f642afd
docs(quickstart): bump RC-wheel example to rc.3, use absolute /app pa…
xdotli Jun 11, 2026
60c10f2
chore(cleanup): remove dev dashboard, archive 0.2.x-era labs
xdotli Jun 11, 2026
a5d743b
refactor(daytona): dedupe sandbox guards, retry policy, and service c…
xdotli Jun 11, 2026
f30fed6
refactor(rollout): structural quick wins, zero behavior change
xdotli Jun 11, 2026
4bdcb1a
refactor(cli): factor shared option aliases, eval-config builder, res…
xdotli Jun 11, 2026
7e040b3
fix(sandbox): fail fast with install hint when daytona extra is missing
Jun 11, 2026
4877c7a
fix(sdk): restore deprecated trial_name alias for SDK.run
Jun 11, 2026
ef57aa0
fix(rewards): add lenient mode to reward.json validation (BF-3)
Jun 11, 2026
3e73015
fix(agents): route bare model ids to providers via registry (BF-4)
Jun 11, 2026
c206cca
Merge remote-tracking branch 'origin/smell/rollout-quickwins' into in…
xdotli Jun 11, 2026
36f517f
Merge remote-tracking branch 'origin/smell/daytona-quickwins' into in…
xdotli Jun 11, 2026
b98cf58
Merge remote-tracking branch 'origin/smell/cli-lockdown-quickwins' in…
xdotli Jun 11, 2026
0f25f78
Merge remote-tracking branch 'origin/rc/cleanup-dashboard-labs' into …
xdotli Jun 11, 2026
0400513
test(stress): v0.6.0 stress-test report + repros
xdotli Jun 11, 2026
8c09fb8
fix(agents): set up custom provider for bare model ids before prefixi…
xdotli Jun 11, 2026
88ed671
test: name guarding PR #670 in BF-* regression docstrings (codex P2)
xdotli Jun 11, 2026
b87f268
fix(sandbox): cap unbounded Daytona exec poll when timeout_sec is Non…
xdotli Jun 11, 2026
e672fbd
fix(rc): clear up four dogfood judge/task footguns
xdotli Jun 11, 2026
4cd6b55
docs: fix pickiest-dev RC install/runtime-path friction (D1-D6)
xdotli Jun 11, 2026
efc60d7
refactor(sandbox): de-fork env-file secret redaction into _base
xdotli Jun 11, 2026
697fc50
refactor(cli): split main.py into a cli/ package via register_<group>
xdotli Jun 11, 2026
4652e85
style(cli): ruff format wrap typer.Option/Argument args in legacy/mon…
xdotli Jun 11, 2026
04992a7
merge(rc): dogfood footgun fixes — gemini/ prefix strip, judge-extra …
xdotli Jun 11, 2026
464af2a
merge(rc): BR2 de-fork docker/daytona secret-redaction into sandbox/_…
xdotli Jun 11, 2026
6d48101
merge(rc): docs friction D1-D6 — RC-install truth, quickstart wheel r…
xdotli Jun 11, 2026
5072ffd
merge(rc): BR1 split cli/main.py into cli/ package (eval_create stays…
xdotli Jun 11, 2026
961fc62
refactor: address code-quality audit findings (PR #670 follow-up)
xdotli Jun 11, 2026
f37d753
release: 0.6.0rc4 — dogfood footgun fixes + BR1 cli split + BR2 secre…
xdotli Jun 11, 2026
f1118a9
refactor(lockdown): extract harden_before_verify step helpers and nam…
xdotli Jun 11, 2026
33fa4dd
refactor(sandbox): split daytona.py at cohesion seams behind re-expor…
xdotli Jun 11, 2026
e595361
refactor(eval-create): extract planning logic into benchflow.eval_plan
xdotli Jun 11, 2026
a9852eb
refactor(rollout): split rollout.py into a package behind full re-exp…
xdotli Jun 11, 2026
c93bb48
fix(stress-repros): target agent.timeout_sec block precisely
xdotli Jun 11, 2026
5bc4731
fix(agents): fall back to generic provider setup before prefixing bar…
xdotli Jun 11, 2026
ac1c2c1
Merge remote-tracking branch 'origin/release/v0.6.0' into compat/claw…
xdotli Jun 11, 2026
960fcfc
merge(rc): BR3 decompose lockdown.harden_before_verify into named ste…
xdotli Jun 11, 2026
e6424fe
merge(rc): BR4 extract eval_create business logic into eval_plan.py (…
xdotli Jun 11, 2026
a1317cd
merge(rc): split sandbox/daytona.py into reaper/pty/strategies/dind s…
xdotli Jun 11, 2026
3f9f696
merge(rc): BR5 split rollout.py 3,381 into a benchflow.rollout packag…
xdotli Jun 11, 2026
bf37ce4
release: 0.6.0rc5 — thermo-nuclear big-rock splits (rollout/daytona/l…
xdotli Jun 11, 2026
5960c17
fix(cli,examples): resolve 9 v0.6.0 stress-test defects
xdotli Jun 11, 2026
c84e167
fix(ci): restore ty-clean gate after big-rock splits
xdotli Jun 11, 2026
c3c9ed3
refactor(task): split document.py at cohesion seams behind façade re-…
xdotli Jun 11, 2026
931c6b2
refactor(task): split verifier.py god-module into prefixed submodules
xdotli Jun 11, 2026
ea87a92
refactor(task): split acceptance_live into model/validation/report su…
xdotli Jun 11, 2026
897de3b
refactor(task-authoring): split task_authoring into a package façade
xdotli Jun 11, 2026
32bf64b
merge(rc): split task/document.py into profiles/evidence/normalize/pa…
xdotli Jun 11, 2026
28d75b4
merge(rc): split task/acceptance_live.py into model/validation/report…
xdotli Jun 11, 2026
6e0ff43
merge(rc): split _utils/task_authoring.py into a package (scaffolding…
xdotli Jun 11, 2026
64cb2f6
merge(rc): split task/verifier.py god-module into 7 strategy/helper s…
xdotli Jun 11, 2026
4effb2b
release: 0.6.0rc6 — split the 4 remaining real >1000-line smells (ver…
xdotli Jun 11, 2026
9ca5395
merge: bring main's #601 (symlink-safe cache reclaim) + #602 (Bedrock…
xdotli Jun 11, 2026
a855fe8
docs(changelog): record #601 (symlink-safe cache reclaim) + #602 (Bed…
xdotli Jun 11, 2026
2aa19e6
docs(stress): record Linear issue IDs (ENG-248..257)
xdotli Jun 11, 2026
6131305
docs(agent): add benchmark-adoption guide for the bench agent router
xdotli Jun 11, 2026
57d8610
test(agent-router): add end-to-end CLI integration test for create/ru…
xdotli Jun 11, 2026
ab90c80
merge: bring release/v0.6.0 (0.6.0rc6) into stress fix branch
xdotli Jun 11, 2026
313f0d0
docs(results): pin canonical result.json reward/token surface, no top…
xdotli Jun 11, 2026
0aff775
fix(smoke): make a skipped live smoke fail the gate instead of false-…
xdotli Jun 11, 2026
970eefb
Merge branch 'docs/agent-router-guide' into integrate-router-docs
xdotli Jun 11, 2026
a32799b
Merge branch 'test/agent-router-cli-e2e' into integrate-router-docs
xdotli Jun 11, 2026
3a6500b
docs(cli): document the BENCHFLOW_DAYTONA_AUTO_REAP automatic-reap en…
xdotli Jun 11, 2026
587a941
Merge branch 'docs/auto-reap-env' into int-reapdoc
xdotli Jun 11, 2026
76aa0de
Merge branch 'fix/smoke-no-false-green' into int-fixes
xdotli Jun 11, 2026
93e0ae9
Merge branch 'fix/result-json-toplevel-fields' into int-fixes
xdotli Jun 11, 2026
5874fb8
fix(cli): reject unknown --sandbox values at planning
xdotli Jun 11, 2026
a19f57b
sync: v0.6-integration into release/v0.6.0 (ATIF/ADP runtime emission…
xdotli Jun 11, 2026
7a556c4
chore(docs): migrate labs/ into docs/labs; remove docs/archive
xdotli Jun 11, 2026
7f7f657
feat(eval): Capture silent provider API errors as unhealthy results
Yiminnn Jun 11, 2026
432bedf
fix(docs): add Mintlify config under docs root
bingran-you Jun 11, 2026
0073d67
chore(experiments): remove unreferenced scratch; keep tested + docs-c…
xdotli Jun 11, 2026
3273bb6
feat(integration): agent-as-judge CI verification on benchflow primit…
xdotli Jun 11, 2026
821b3fa
fix(cli): scope eager agent/model validation to --tasks-dir/--source-…
xdotli Jun 12, 2026
ce15895
fix(labs): resolve REPO_ROOT to the actual repo root after move to do…
xdotli Jun 12, 2026
6bd1477
fix(cli): skip --sandbox preflight for hosted source-env runs
xdotli Jun 12, 2026
a0cfea1
Merge pull request #681 from benchflow-ai/chore/labs-to-docs-rm-archive
xdotli Jun 12, 2026
e7b1c4d
Merge pull request #678 from benchflow-ai/stress/v0.6.0-fork
xdotli Jun 12, 2026
9f09fcd
Merge remote-tracking branch 'origin/release/v0.6.0' into feat/integr…
xdotli Jun 12, 2026
6e97fe6
test(integration): robust agent-as-judge scenario suite
xdotli Jun 12, 2026
ed4f6ad
fix(v06): resolve 19 bug-hunt findings across router, trajectories, h…
xdotli Jun 12, 2026
bb52839
test(integration): make agent-rollout scenario robust + skip without …
xdotli Jun 12, 2026
794bb90
fix(agent-judge): harden reward read, prompt-injection, corrupt-resul…
xdotli Jun 12, 2026
d6a44d3
Merge remote-tracking branch 'origin/feat/ci-agent-judge' into fix/ag…
xdotli Jun 12, 2026
f83f129
fix(v06): close 19 bug-hunt findings + test gap-fills (#684)
xdotli Jun 12, 2026
60d8d89
feat(ci): agent-as-judge integration check + robustness fixes (#683)
xdotli Jun 12, 2026
3f0fde5
fix(agent-router): verify shows support guidance, not a divergence dr…
xdotli Jun 12, 2026
a57e581
fix(agent-router): clean insufficient-evidence verify output (#685)
xdotli Jun 12, 2026
5f6ca8b
Merge pull request #682 from benchflow-ai/feat/api-error-capture
xdotli Jun 12, 2026
e66f6ae
fix(verifier): alias /tests -> /verifier so converted task.md verifie…
xdotli Jun 12, 2026
36e0a88
test(integration): deepseek/deepagents harness + agent-judge hardening
xdotli Jun 12, 2026
801ab8f
fix(ci): restore ty-clean + format-clean on release/v0.6.0
xdotli Jun 12, 2026
6509d4d
Merge pull request #688 from benchflow-ai/fix/v06-ci-gates
xdotli Jun 12, 2026
9323b60
test(integration): address review findings on the deepseek/deepagents…
xdotli Jun 12, 2026
1573ac3
Merge remote-tracking branch 'origin/release/v0.6.0' into feat/deepag…
xdotli Jun 12, 2026
f4de848
test(integration): bound deepagents harness timeout + quote sandbox p…
xdotli Jun 12, 2026
bb3dc0d
Merge pull request #687 from benchflow-ai/feat/deepagents-harness
xdotli Jun 12, 2026
af37a9b
Merge pull request #686 from benchflow-ai/claude/verifier-legacy-test…
xdotli Jun 12, 2026
cae1be2
Merge remote-tracking branch 'origin/release/v0.6.0' into compat/claw…
xdotli Jun 12, 2026
af98159
Merge main into release/v0.6.0: port #689 tasks-digest + #690 eval -d…
xdotli Jun 12, 2026
8defe19
Merge main into release/v0.6.0: port #691 dev-run digest + #693 deps/…
xdotli Jun 12, 2026
8d78b9a
feat(agent-router): `bench agent verify --rerun` independently re-exe…
xdotli Jun 12, 2026
c9740c6
feat(agent-router): `bench agent run -c key=value` codex config passt…
xdotli Jun 12, 2026
05ff000
fix(dataset): treat bench RC/dev builds as in-range for their release…
xdotli Jun 12, 2026
926d4de
fix(cli): bench tasks digest recognizes native task.md tasks
xdotli Jun 12, 2026
ea95cbd
Merge pull request #697 from benchflow-ai/claude/tasks-digest-recogni…
xdotli Jun 12, 2026
f7a54ad
Merge pull request #696 from benchflow-ai/claude/bench-version-prerel…
xdotli Jun 12, 2026
240c80f
Merge pull request #695 from benchflow-ai/claude/agent-run-codex-config
xdotli Jun 12, 2026
fcefe41
Merge pull request #670 from benchflow-ai/compat/clawsbench-v06-fixes
xdotli Jun 12, 2026
f49c571
Merge remote-tracking branch 'origin/release/v0.6.0' into claude/agen…
xdotli Jun 12, 2026
459c328
rework(agent-router): fail-closed + tested rerun scoring seam (review…
xdotli Jun 12, 2026
7c7017b
Merge pull request #694 from benchflow-ai/claude/agent-verify-rerun
xdotli Jun 12, 2026
3ad6931
docs(cli): document v0.6 adapter features (verify --rerun, run -c, ev…
xdotli Jun 12, 2026
ce917d5
fix(test): collapse rich line-wrap in malformed-JSON verify test (CI …
xdotli Jun 12, 2026
3676a17
Merge pull request #699 from benchflow-ai/claude/fix-ci-rich-wrap-ver…
xdotli Jun 12, 2026
82879a0
feat(eval): make out-of-range bench_version a hard gate for dataset r…
bingran-you Jun 12, 2026
113f350
fix(docs): wrap literal braces in inline code — MDX/acorn parse failure
bingran-you Jun 12, 2026
5ba5b65
Merge remote-tracking branch 'upstream/main' into v060-sync
bingran-you Jun 12, 2026
6821740
fix(agents): bump JS-agent Node pin 22.14.0 -> 22.20.0 for openclaw
xdotli Jun 12, 2026
654febf
fix(environment): detach manifest service starts for Daytona session …
xdotli Jun 12, 2026
4b01ef2
feat(rewards): task-declared reward_range widens the [0,1] contract
xdotli Jun 12, 2026
4d5390b
Merge pull request #646 from benchflow-ai/codex/daytona-pty-timeout
xdotli Jun 12, 2026
b8ab859
Merge pull request #617 from benchflow-ai/bry/jolly-ptolemy-843a37
xdotli Jun 12, 2026
1a1a58e
Merge release/v0.6.0 into #589: re-home task-MCP-through-ACP onto the…
xdotli Jun 12, 2026
d4b6e7a
Merge pull request #589 from benchflow-ai/bry/nostalgic-tereshkova-ff…
xdotli Jun 12, 2026
edf069d
Merge release/v0.6.0 into #640: re-implement clean-terminal timeout o…
xdotli Jun 12, 2026
c956dfc
Merge pull request #640 from benchflow-ai/codex/normal-timeout-acp
xdotli Jun 12, 2026
3a783b8
Merge release/v0.6.0 into #628: keep the net-new docs, drop supersede…
xdotli Jun 12, 2026
4b4b3de
Merge pull request #628 from benchflow-ai/bry/vigilant-gagarin-c62278
xdotli Jun 12, 2026
1fb7db1
Merge remote-tracking branch 'origin/release/v0.6.0' into codex/hf-pu…
xdotli Jun 12, 2026
9268931
Merge pull request #645 from benchflow-ai/codex/hf-publish-token-marker
xdotli Jun 12, 2026
b517f7d
Merge remote-tracking branch 'origin/release/v0.6.0' into codex/skill…
xdotli Jun 12, 2026
c1ff5be
Merge pull request #641 from benchflow-ai/codex/skillsbench-strict-da…
xdotli Jun 12, 2026
b01eb43
test(agents): fix codex-acp custom-provider routing test
xdotli Jun 12, 2026
d697d8a
chore: purge verified-dead code smells (mechanical, no behavior change)
xdotli Jun 12, 2026
8ada9de
Merge pull request #704 from benchflow-ai/feat/js-agent-node-pin-open…
xdotli Jun 12, 2026
17abc0d
Merge pull request #703 from benchflow-ai/feat/manifest-service-detach
xdotli Jun 12, 2026
30fa159
fix(provider): surface 429/503 provider failures behind ACP errors
bingran-you Jun 12, 2026
bfca1d2
chore: remove the Terminal-Bench inbound adapter (approved format drop)
xdotli Jun 12, 2026
c70ec33
Merge pull request #702 from benchflow-ai/feat/task-declared-reward-r…
xdotli Jun 12, 2026
f750ef1
chore: retire the legacy top-level CLI; promote metrics/view to eval …
xdotli Jun 12, 2026
e1ac5bb
Merge pull request #653 from bingran-you/bry/acp-provider-failure-cla…
bingran-you Jun 12, 2026
93e1840
fix(agent): print dry-run codex command verbatim (no rich hard-wrap)
bingran-you Jun 12, 2026
72266fe
chore: remove the experiments/ dev tooling tree (results preserved ou…
xdotli Jun 12, 2026
85e0fe1
Merge pull request #698 from benchflow-ai/claude/docs-v06-adapter-fea…
bingran-you Jun 12, 2026
6b6b7bc
Merge pull request #710 from bingran-you/bry/fix-dry-run-hard-wrap
bingran-you Jun 12, 2026
a0116fc
docs(cli): fix legacy.py module docstring to list only retained commands
xdotli Jun 12, 2026
f4fc0d3
Merge pull request #707 from benchflow-ai/chore/purge-smells-v06
xdotli Jun 12, 2026
96ce428
Merge remote-tracking branch 'origin/release/v0.6.0' into chore/retir…
xdotli Jun 12, 2026
b813f7b
Merge pull request #709 from benchflow-ai/chore/retire-legacy-cli
xdotli Jun 12, 2026
c98e048
Merge remote-tracking branch 'origin/release/v0.6.0' into chore/remov…
xdotli Jun 12, 2026
7ce4de9
Merge pull request #711 from benchflow-ai/chore/remove-experiments-dir
xdotli Jun 12, 2026
3480e1f
docs: scrub remaining Terminal-Bench adapter references (audit blockers)
xdotli Jun 12, 2026
2c4525a
Merge remote-tracking branch 'origin/release/v0.6.0' into chore/remov…
xdotli Jun 12, 2026
24052fb
Merge pull request #708 from benchflow-ai/chore/remove-terminal-bench…
xdotli Jun 12, 2026
7d7f412
fix: enable Bedrock max effort for Claude Fable 5
bingran-you Jun 12, 2026
262db36
Merge pull request #712 from bingran-you/bry/fable5-bedrock-max
bingran-you Jun 12, 2026
4b4a00e
Add verify-retry loop strategy with mid-loop verifier isolation
xdotli Jun 12, 2026
a0be941
Merge remote-tracking branch 'origin/release/v0.6.0' into loop/pr1-ve…
xdotli Jun 12, 2026
ec747ab
fix(loop): coerce dict-form loop_strategy on the non-sharded --config…
xdotli Jun 12, 2026
5fe8713
feat(agents): integrate deepagents as an ACP-shim agent (deepseek-v4-…
xdotli Jun 12, 2026
289258a
fix(task): port #651 guard fixes — schema_version major gate, judge e…
Yiminnn Jun 12, 2026
065cc6b
fix(loop): mid-loop verifier isolation is best-effort, not fatal
xdotli Jun 13, 2026
af4c97a
Merge pull request #713 from benchflow-ai/loop/pr1-verify-retry
xdotli Jun 13, 2026
82830df
Merge pull request #715 from benchflow-ai/feat/deepagents-acp-agent
xdotli Jun 13, 2026
e3dae4a
feat(loop): job-level convergence report (loop_summary / pass@iteration)
xdotli Jun 13, 2026
48150e6
Merge pull request #716 from benchflow-ai/loop/pr2-loop-summary
xdotli Jun 13, 2026
0f04eeb
feat(loop): self-review-k strategy (feedback-source contrast to verif…
xdotli Jun 13, 2026
80b4040
Merge pull request #717 from benchflow-ai/loop/pr2-self-review
xdotli Jun 13, 2026
fd01dc2
feat(loop): capture per-iteration token spend for the cost-curve x-axis
xdotli Jun 13, 2026
8a1f7ff
Merge pull request #718 from benchflow-ai/loop/pr3-token-capture
xdotli Jun 13, 2026
69b3d54
fix(loop): None (not 0) tokens when no trusted usage was captured
xdotli Jun 13, 2026
d0d95e4
Merge pull request #719 from benchflow-ai/loop/pr3-token-capture
xdotli Jun 13, 2026
6fd72e0
test(agents): fix codex-acp custom-provider routing test (#705)
xdotli Jun 13, 2026
31ea713
feat(loop): {model x loop} sweep -> pass-rate-vs-tokens cost-curve ma…
xdotli Jun 13, 2026
8752864
fix(loop): resolve sweep baseline by canonical loop identity
xdotli Jun 13, 2026
c26a652
fix(loop): injective sweep job dirs + partial-telemetry cost is undec…
xdotli Jun 13, 2026
110d803
ci: ruff-format 4 files to unbreak the format-check gate on release/v…
xdotli Jun 13, 2026
e51dcb4
fix(loop): isolate each sweep cell in its own jobs_dir
xdotli Jun 13, 2026
89443e3
docs(readme): revamp for v0.6 — hero, quickstart, loops, fresh install
xdotli Jun 13, 2026
fb0a2a8
Merge pull request #721 from benchflow-ai/ci/fix-ruff-format-release
xdotli Jun 13, 2026
5e0a452
Merge pull request #720 from benchflow-ai/loop/pr4-sweep-runner
xdotli Jun 13, 2026
e7419a5
Merge pull request #722 from benchflow-ai/docs/revamp-readme
xdotli Jun 13, 2026
4ccfa3c
ci: fix the 13 ty type-check errors blocking release/v0.6.0 CI
xdotli Jun 13, 2026
d60f6e8
docs(readme): reframe tagline as the universal environment framework
xdotli Jun 13, 2026
05496e6
Merge pull request #723 from benchflow-ai/ci/fix-ty-release
xdotli Jun 13, 2026
b8e62bf
Merge pull request #724 from benchflow-ai/docs/readme-tagline
xdotli Jun 13, 2026
a8337c5
Merge pull request #714 from benchflow-ai/fix/v06-guard-ports
xdotli Jun 13, 2026
a019489
chore(release): set version to 0.6.0.dev0 for the internal-preview ch…
xdotli Jun 13, 2026
a498879
Merge pull request #725 from benchflow-ai/chore/v0.6.0-dev-version
xdotli Jun 13, 2026
15caec6
feat(cli): live eval dashboard + UX hardening (dogfood findings)
xdotli Jun 13, 2026
636a84f
fix(cli): correct live-dashboard accounting on resume / sequential / …
xdotli Jun 13, 2026
7ebe8f0
fix(cli): config-driven eval also prints Artifacts/Summary paths
xdotli Jun 13, 2026
8c42778
Merge pull request #726 from benchflow-ai/feat/cli-live-progress
xdotli Jun 13, 2026
d287b1f
feat(cli): group top-level --help into panels; hide non-core verbs
xdotli Jun 13, 2026
da1f5c2
Merge pull request #727 from benchflow-ai/cli/help-polish
xdotli Jun 13, 2026
84db0ee
chore: remove dead code + dead test helpers (adversarially verified)
xdotli Jun 13, 2026
e12ff45
Merge pull request #728 from benchflow-ai/chore/dead-code-purge
xdotli Jun 13, 2026
d2defcd
feat(cli)!: rename `bench compat` -> `bench hub` (env-hub compatibility)
xdotli Jun 13, 2026
9b36a3e
Merge pull request #729 from benchflow-ai/cli/rename-compat-to-hub
xdotli Jun 13, 2026
debee4f
fix(cli): pre-release error-path hardening (10 fail-fast/clean-error …
xdotli Jun 13, 2026
b96c6f6
docs(prerelease): fix copy-paste-breaking drift across guides + cli r…
xdotli Jun 13, 2026
d1c1c7b
Merge pull request #730 from benchflow-ai/fix/prerelease-error-paths
xdotli Jun 13, 2026
52eb4e0
test(cli): guard CLI↔doc drift bidirectionally + pin drifted defaults…
xdotli Jun 13, 2026
3b75ae9
docs(skills): regenerate the user-invocable `benchflow` skill against…
xdotli Jun 13, 2026
d16a013
Merge pull request #731 from benchflow-ai/docs/prerelease-drift
xdotli Jun 13, 2026
8c35d4f
fix(cli): round-2 error-path & edge-case hardening from the v0.6 sweep
xdotli Jun 13, 2026
256663b
Merge pull request #732 from benchflow-ai/fix/edge-case-sweep-r2
xdotli Jun 13, 2026
cc3cedf
fix(cli): round-3 dev-ex hardening — sandbox registry, install-URL gu…
xdotli Jun 13, 2026
bb919dc
fix(eval): batch tasks-dir with a stray broken root task.md warns, no…
xdotli Jun 13, 2026
9ad4ff2
Merge pull request #733 from benchflow-ai/fix/round3-devex
xdotli Jun 13, 2026
7e1d99f
feat(cli): bench adopt (init/convert/verify) + environment --provider…
xdotli Jun 13, 2026
cfefe24
Merge pull request #735 from benchflow-ai/fix/cli-adopt-verbs
xdotli Jun 13, 2026
9702d16
fix(cli): systemic markup-escape sweep + dataset-registry guards + do…
xdotli Jun 13, 2026
7243ed3
Merge pull request #738 from benchflow-ai/fix/markup-escape-sweep
xdotli Jun 13, 2026
e610255
feat(cli): split the overloaded `bench environment` — hosted reads → …
xdotli Jun 13, 2026
1e3862c
docs(cli): broaden `bench hub` help to mention the env subgroup (revi…
xdotli Jun 13, 2026
236b974
Merge pull request #740 from benchflow-ai/fix/split-environment
xdotli Jun 13, 2026
c67cb68
feat(cli): rename local `bench environment` → `bench sandbox`; enviro…
xdotli Jun 13, 2026
4524a04
docs(cli): drop `[deprecated]` markup prefix from environment help (r…
xdotli Jun 13, 2026
cb91c39
Merge pull request #741 from benchflow-ai/fix/environment-to-sandbox
xdotli Jun 13, 2026
491ea40
feat(cli): move benchmark adoption under `bench eval adopt`
xdotli Jun 13, 2026
be5a2d8
Merge pull request #742 from benchflow-ai/fix/adopt-under-eval
xdotli Jun 13, 2026
a25b97f
fix(cli): error-path hardening (stress sweep round r9)
xdotli Jun 13, 2026
365c8e0
Merge pull request #743 from benchflow-ai/fix/error-path-hardening-r9
xdotli Jun 13, 2026
d398202
chore: dead-code purge round 2 (verified zero-reference)
xdotli Jun 13, 2026
4a8db01
Merge pull request #744 from benchflow-ai/chore/dead-code-purge-r9
xdotli Jun 13, 2026
ba59d9e
chore: remove dead Sandbox Protocol stubs + dead-code purge round 3
xdotli Jun 13, 2026
6affa1c
Merge pull request #745 from benchflow-ai/chore/dead-code-r3-and-prot…
xdotli Jun 13, 2026
dfea432
docs(python-api): fix Sandbox Protocol example after read_file/write_…
xdotli Jun 13, 2026
2a3a4a1
Merge pull request #746 from benchflow-ai/docs/v06-dogfood-truth
xdotli Jun 13, 2026
e84795d
chore: remove the unwired OTelCollector (public-API removal)
xdotli Jun 13, 2026
c38090a
Merge pull request #747 from benchflow-ai/chore/remove-unwired-otel
xdotli Jun 13, 2026
5ed243c
chore(docs): remove stale dev/old notes
xdotli Jun 13, 2026
3a46e03
Merge pull request #748 from benchflow-ai/chore/remove-stale-notes
xdotli Jun 13, 2026
508a8a7
chore: cut 0.6.0 — bump version + consolidate CHANGELOG
xdotli Jun 13, 2026
c121554
Merge pull request #749 from benchflow-ai/chore/cut-0.6.0
xdotli Jun 13, 2026
c81f926
test(cli): fix CI-fragile help-parse that blocked the release gate
xdotli Jun 13, 2026
306cf29
Merge pull request #750 from benchflow-ai/fix/ci-help-parse-ansi
xdotli Jun 13, 2026
6842565
style: ruff-format tests/test_cli_hub_env.py (CI gate)
xdotli Jun 13, 2026
3ef837d
Merge pull request #751 from benchflow-ai/fix/format-hub-env-test
xdotli Jun 13, 2026
7f15638
chore: purge dead stress-test artifacts (STRESS-TEST-v0.6.0.md, stres…
xdotli Jun 13, 2026
cb2dca9
Merge pull request #752 from benchflow-ai/chore/purge-stress-artifacts
xdotli Jun 13, 2026
362 changes: 0 additions & 362 deletions .claude/dev-docs/0.3-plan.md

This file was deleted.

86 changes: 7 additions & 79 deletions .claude/dev-docs/labs.md
@@ -2,87 +2,15 @@

Runnable, Docker-heavy experiments that exercise the full benchflow SDK end-to-end. Labs are distinct from unit tests (real Docker, no mocking) and from docs (executable, with expected output). Each lab is self-contained with its own README and orchestrator script.

Labs live under [`labs/`](../labs/).
> **Historical (0.2.x-era).** These labs are archived under [`docs/labs/`](../../docs/labs/). They compare benchflow 0.2.0 against 0.2.1/0.2.2 and are kept as cited security evidence; the hardening they validate still ships. The public write-up is [`docs/sandbox-hardening.md`](../../docs/sandbox-hardening.md).

| Lab | Question summary | Benchflow versions | API key needed |
| ----------------------------------------------------------- | -------------------------------------------------------------------------------- | ------------------ | ---------------------------- |
| [benchjack-sandbox-hardening](#benchjack-sandbox-hardening) | Does 0.2.1 block BenchJack exploits that succeed under 0.2.0 | 0.2.0 vs 0.2.1 | No |
| [reward-hack-matrix](#reward-hack-matrix) | Do the same exploits succeed on real benchmark tasks, and does 0.2.2 block them? | 0.2.0 vs 0.2.2 | Optional (`DAYTONA_API_KEY`) |
| Lab | Question summary | Benchflow versions | API key needed |
| --- | --- | --- | --- |
| [`benchjack-sandbox-hardening`](../../docs/labs/benchjack-sandbox-hardening/) | Does 0.2.1 block BenchJack exploits that succeed under 0.2.0? | 0.2.0 vs 0.2.1 | No |
| [`reward-hack-matrix`](../../docs/labs/reward-hack-matrix/) | Do the same exploits succeed on real benchmark tasks, and does 0.2.2 block them? | 0.2.0 vs 0.2.2 | Optional (`DAYTONA_API_KEY`) |

---

## benchjack-sandbox-hardening

**Question:** Does sandbox hardening in benchflow 0.2.1 block BenchJack-style exploits that succeed under 0.2.0?

**Location:** [`labs/benchjack-sandbox-hardening/`](../labs/benchjack-sandbox-hardening/)

**Prerequisites:**

- Docker daemon
- Python 3.12+
- `uv` on PATH
- Network access to PyPI
- No API keys required (uses the `oracle` agent)

**Run:**

```sh
python3 labs/benchjack-sandbox-hardening/run_comparison.py
```

- `--clean` — delete `.venvs/` and `.jobs/` before running
- First run is ~5 min (Docker builds + pip installs); subsequent runs use cached `.venvs/` (~1 min)

**Key takeaways:**

- Three exploit patterns (P1 conftest-hook, P2 answer-lookup, P7 pth-injection) flip reward from 0.0 → 1.0 against benchflow 0.2.0 and are blocked under 0.2.1 (reward stays 0.0).
- Defenses are layered: `chmod 700` on `/tests` and `/solution`, non-root `sandbox_user`, and pre-verify conftest cleanup.
- The `oracle` agent executes `solution/solve.sh` directly — deterministic and free of API costs. Swap `agent="oracle"` for `agent="claude-agent-acp"` in `_attack_runner.py` to test with a real LLM.
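
The layered defenses listed in the second takeaway can be sketched in a few lines. This is a minimal illustration under stated assumptions, not benchflow's actual hardening code: the helper name `lock_down_sketch` and the directory layout are hypothetical, and the non-root `sandbox_user` layer is left out because it needs root privileges to set up.

```python
import os
import stat
import tempfile
from pathlib import Path


def lock_down_sketch(root: Path, private_dirs=("tests", "solution")) -> list[Path]:
    """Hypothetical helper: restrict grading dirs and drop stray conftest.py files.

    Returns the conftest.py paths that were removed.
    """
    # Layer 1: owner-only access (chmod 700) on the grading directories.
    for name in private_dirs:
        target = root / name
        if target.is_dir():
            os.chmod(target, stat.S_IRWXU)

    # Layer 2: pre-verify cleanup of conftest.py hooks planted outside those dirs.
    removed = []
    for conftest in sorted(root.rglob("conftest.py")):
        if any(part in private_dirs for part in conftest.relative_to(root).parts):
            continue  # files inside the grading dirs are left alone
        conftest.unlink()
        removed.append(conftest)
    return removed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "tests").mkdir()
        (root / "solution").mkdir()
        (root / "app").mkdir()
        (root / "app" / "conftest.py").write_text("# planted hook\n")
        print(lock_down_sketch(root))
        print(oct(stat.S_IMODE((root / "tests").stat().st_mode)))
```

Keeping the two layers in one function is only for brevity; in a real sandbox each layer would likely be applied and logged separately so a failure in one does not mask the other.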

**Related:** `comparison.ipynb` — narrative deep-dive into P1; run `run_comparison.py` first, then open with:

```sh
uv run --with jupyter jupyter notebook labs/benchjack-sandbox-hardening/comparison.ipynb
```

---

## reward-hack-matrix

**Question:** Do the same BenchJack exploits succeed on real production benchmark tasks, and does benchflow 0.2.2's hardening block them there too?

**Location:** [`labs/reward-hack-matrix/`](../labs/reward-hack-matrix/)

**Prerequisites:**

- `DAYTONA_API_KEY` (default) or Docker daemon (pass `--env docker`)
- Python 3.12+
- `uv` on PATH
- Network access to PyPI and GitHub
- Corpora must be cloned first:
```sh
cd labs/reward-hack-matrix && ./fetch_corpora.sh
```

**Run:**

```sh
python labs/reward-hack-matrix/run_matrix.py
```

- `--cells "P1@swebench-verified/astropy__astropy-12907"` — run a single cell
- `--sweep` — enumerate all tasks across all three corpora
- `--clean` — remove `.venvs/`, `.jobs/`, and `.cells/`

**Key takeaways:**

- One tailored exploit per benchmark (P1 conftest-hook for swebench-verified, P7 pth-injection for skillsbench, P7 path-trojan for terminal-bench-2) achieves reward 1.0 against 0.2.0 and is blocked to 0.0 under 0.2.2.
- Each benchmark has a single structural weak point; the lab demonstrates these are closed by the same layered defenses as the synthetic lab, not by benchmark-specific patches.
- Independently corroborated by Berkeley RDI and BrachioLab (Penn) findings published concurrently in April 2026.

---
Each lab's README documents its prerequisites, the one-command repro, and key takeaways. See [`docs/sandbox-hardening.md`](../../docs/sandbox-hardening.md) for the narrative and results tables.

## See also

- [`.dev-docs/harden-sandbox.md`](../.dev-docs/harden-sandbox.md) — full seven-pattern BenchJack threat model and hardening audit
- [`harden-sandbox.md`](./harden-sandbox.md) — full seven-pattern BenchJack threat model and hardening audit
9 changes: 1 addition & 8 deletions .claude/launch.json
@@ -1,11 +1,4 @@
{
"version": "0.0.1",
"configurations": [
{
"name": "dashboard",
"runtimeExecutable": "python3",
"runtimeArgs": ["dashboard/serve.py"],
"port": 8777
}
]
"configurations": []
}
63 changes: 46 additions & 17 deletions .claude/skills/benchflow/SKILL.md
@@ -25,7 +25,7 @@ Arguments passed: `$ARGUMENTS`
1. Check if benchflow is installed: `uv tool list | grep benchflow`
2. Check if API keys are set (GEMINI_API_KEY, ANTHROPIC_API_KEY, etc.)
3. Check available agents: `bench agent list`
4. Show recent eval results if any exist in `evaluations/` or `jobs/`
4. Show recent eval results if any exist under `jobs/` (the default `--jobs-dir`)
5. Point to next action based on state

### `run <task-path>` — run a single task
@@ -94,12 +94,13 @@ max_retries: 1
### `metrics <jobs-dir>` — analyze results

```bash
bench eval list jobs/
bench eval metrics jobs/ # aggregate pass-rate / tokens / cost (add --json to pipe)
bench eval list jobs/ # per-rollout table
```

### `view <rollout-dir>` — view a trajectory

Results are in `evaluations/<eval-name>/<rollout-name>/` or `jobs/<job-name>/<rollout-name>/`:
Results land under `jobs/<job-name>/<rollout-name>/` (the default `--jobs-dir` is `jobs/`):
```
rollout-dir/
├── result.json # rewards, agent, timing
@@ -114,20 +115,38 @@ rollout-dir/
### `create-task` — create a new benchmark task

```bash
bench tasks init my-task
bench tasks init my-task --no-pytest --no-solution
bench tasks init my-task # native task.md format (default)
bench tasks init my-task --no-pytest --no-oracle
bench tasks check tasks/my-task # structural validation
```

Quick structure:
Quick structure (native `task.md` format, the default):
```
my-task/
├── task.toml # timeouts, resources, metadata
├── instruction.md # what the agent should do
├── task.md # YAML frontmatter (config) + prompt body
├── environment/
│ └── Dockerfile # sandbox setup
├── tests/
│ └── test.sh # verifier -> writes to /logs/verifier/reward.txt
└── solution/ # optional reference solution
├── verifier/
│ ├── test.sh # verifier entrypoint -> writes /logs/verifier/reward.txt
│ └── test_outputs.py
└── oracle/ # optional reference solution (solve.sh)
```

`--format legacy` instead scaffolds the older split layout (`task.toml` +
`instruction.md` + `tests/` + `solution/`).
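
To make the "YAML frontmatter (config) + prompt body" layout of `task.md` concrete, the sketch below splits such a file into its two parts. It assumes the frontmatter is delimited by lines containing only `---`, which is the common convention but is not confirmed by this diff. The helper name `split_task_md` and the `timeout_sec` key in the sample are illustrative, not benchflow's schema.

```python
def split_task_md(text: str) -> tuple[str, str]:
    """Hypothetical helper: return (frontmatter, body) from a task.md string.

    Assumes the frontmatter is fenced by lines containing only '---'.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text  # no frontmatter block; everything is prompt body
    for index in range(1, len(lines)):
        if lines[index].strip() == "---":
            frontmatter = "\n".join(lines[1:index])
            body = "\n".join(lines[index + 1:]).strip()
            return frontmatter, body
    raise ValueError("frontmatter opened with '---' but never closed")


if __name__ == "__main__":
    # The key below is a made-up example of a config entry.
    sample = """---
timeout_sec: 600
---
Fix the failing test in /app.
"""
    config, prompt = split_task_md(sample)
    print(config)
    print(prompt)
```

The frontmatter is returned as raw text on purpose: parsing it is left to a YAML library so the sketch stays dependency-free.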

### `skills` — discover and evaluate agent skills

```bash
bench skills list # discover skills on disk
bench skills eval skills/citation-management \
--agent claude-agent-acp # score a skill against its evals/evals.json
```

### `hub` — check external-environment-hub compatibility

```bash
bench hub check # inventory/structurally-check representative Harbor-registry tasks
```

### `agents` — list available agents
@@ -155,15 +174,19 @@ The underlying agent's install, env vars, credentials, and skill paths are prese

### `compare` — multi-agent comparison

Compare by running one config per agent (the `agent:` key lives in each YAML)
and printing the aggregate scores:
```python
import asyncio
from benchflow.evaluation import Evaluation

async def main():
    for agent_name in ["claude-agent-acp", "gemini", "opencode"]:
        eval_obj = Evaluation.from_yaml("benchmarks/harvey-lab/harvey-lab-gemini-flash-lite.yaml")
        result = await eval_obj.run()
        print(f"{agent_name}: {result.passed}/{result.total} ({result.score:.1%})")
    for config_path in [
        "benchmarks/harvey-lab/harvey-lab-gemini-flash-lite.yaml",
        "benchmarks/harvey-lab/harvey-lab-harness-parity.yaml",
    ]:
        result = await Evaluation.from_yaml(config_path).run()
        print(f"{config_path}: {result.passed}/{result.total} ({result.score:.1%})")

asyncio.run(main())
```
@@ -173,7 +196,10 @@ asyncio.run(main())
## Setup

```bash
uv tool install benchflow # or: uv sync --extra dev --locked (from source)
# 0.6 is pre-release — not yet on PyPI. Install the RC wheel from GitHub releases:
uv tool install --prerelease allow \
'benchflow @ https://github.com/benchflow-ai/benchflow/releases/download/0.6.0-rc.6/benchflow-0.6.0rc6-py3-none-any.whl'
# (replace rc.6 with the newest 0.6.0-rc.* release; or from source: uv sync --extra dev --locked)
export GEMINI_API_KEY=... # or ANTHROPIC_API_KEY, OPENAI_API_KEY, etc.
export DAYTONA_API_KEY=... # for cloud sandboxes
```
@@ -204,10 +230,13 @@ bench eval create \
--agent claude-agent-acp \
--sandbox daytona \
--skills-dir skills/ \
--skill-mode with-skill \
--agent-env BENCHFLOW_SKILL_NUDGE=name
```

Skills are uploaded to `/skills/` in the sandbox and symlinked to agent-specific paths.
`--skill-mode with-skill` is required whenever you pass `--skills-dir` (omitting
it errors). Skills are uploaded to `/skills/` in the sandbox and symlinked to
agent-specific paths.

## Tips

2 changes: 1 addition & 1 deletion .claude/skills/branch-review/SKILL.md
@@ -38,7 +38,7 @@ Route changed files by path:

- **tests** → `/test-review`: `tests/**/test_*.py`, `tests/**/*_test.py`
- **src** → `/code-cleanup`: `src/benchflow/**/*.py`
- **docs** → `/docs-review`: `*.md`, `docs/**/*`, `.dev-docs/**/*`, `src/benchflow/**/*.md`
- **docs** → `/docs-review`: `*.md`, `docs/**/*`, `.claude/dev-docs/**/*`, `src/benchflow/**/*.md`

**Skip silently** (no routing, no findings, no warning): `uv.lock`, `.venv/**`, `__pycache__/**`, `*.egg-info/**`, `dist/**`, `build/**`, `.pytest_cache/**`, generated files.

33 changes: 14 additions & 19 deletions .claude/skills/docs-review/SKILL.md
@@ -32,17 +32,13 @@ The user may say `/docs-review` with an optional argument:

### Light-touch (checks 1, 2, 6 only — drift, stale refs, link integrity)

- `.dev-docs/sdk-reference.md` — internal SDK surface; verify class/function
names + signatures still resolve in `src/benchflow/`.
- `.dev-docs/harden-sandbox.md` — sandbox hardening notes; verify referenced
files / knobs / env vars still exist.
- `.dev-docs/tested-agents.md` — matrix of agent × model × provider; verify
names still appear in `agents/registry.py` and `agents/providers.py`.
- `.claude/dev-docs/harden-sandbox.md` — sandbox hardening notes; verify
referenced files / knobs / env vars still exist.
- `.claude/dev-docs/tested-agents.md` — matrix of agent × model × provider;
verify names still appear in `agents/registry.py` and `agents/providers.py`.

### Skipped entirely

- `.dev-docs/sdk-refactor-notes.md` — dated refactor record (April 2026);
historical, status language is expected. Do not flag or edit.
- Anything matching `*-notes.md`, `*-archive.md`.
- `.smoke-jobs/`, `trajectories/`, `examples/`, `fixtures/` — generated or
sample output, not documentation.
@@ -64,7 +60,7 @@ entries. Cross-check:
mentioned in docs, grep `src/benchflow/agents/registry.py` and
`src/benchflow/agents/providers.py`. A name in docs but not in the
registry dict → stale; a name in the registry but not documented where
expected (`docs/architecture.md` matrix, `.dev-docs/tested-agents.md`)
expected (`docs/architecture.md` matrix, `.claude/dev-docs/tested-agents.md`)
→ gap.
- Env vars mentioned in docs (`ANTHROPIC_API_KEY`, `OPENAI_API_KEY`,
`GROQ_API_KEY`, `BENCHFLOW_*`, etc.) — still referenced in
@@ -94,11 +90,11 @@ Grep for implementation-tracking words:

For each hit, ask: is this describing the *design* (stays true) or
*in-flight work* (rots)? In-flight language belongs in commit messages,
PR descriptions, or `.dev-docs/*-notes.md`, not user-facing reference
PR descriptions, or `.claude/dev-docs/*-notes.md`, not user-facing reference
docs.

**Suppress for `.dev-docs/*-notes.md`** — dated refactor notes legitimately
carry status language.
**Suppress for `.claude/dev-docs/*-notes.md`** — dated refactor notes
legitimately carry status language.

### 4. Duplication

@@ -109,11 +105,10 @@ for benchflow:
- **SDK Run Phases** (SETUP → START → AGENT → VERIFY) — should live in
`architecture.md`; others should link.
- **Registry examples** — one copy in `architecture.md` + one in
`task-authoring.md` or `.dev-docs/sdk-reference.md` is OK if they
illustrate distinct use cases; two near-identical `register_agent(...)`
blocks is not.
`task-authoring.md` is OK if they illustrate distinct use cases; two
near-identical `register_agent(...)` blocks is not.
- **Agent × Model × Provider matrix** — live in
`.dev-docs/tested-agents.md`; `architecture.md` should link, not
`.claude/dev-docs/tested-agents.md`; `architecture.md` should link, not
duplicate.
- **Env var reference** — should live in `docs/getting-started.md` or
`docs/cli-reference.md`; not re-listed in README.
@@ -164,7 +159,7 @@ All markdown links resolve:
"how benchflow works" overview — link to architecture for that.
- **docs/getting-started.md / docs/labs.md**: tutorial tone; design
rationale belongs elsewhere.
- **.dev-docs/**: internal — can carry status language, refactor
- **.claude/dev-docs/**: internal — can carry status language, refactor
histories, signature tables.

## Execution
@@ -207,7 +202,7 @@ For a full review:
prose quality or tone.
- **Don't grow scope.** If a check isn't in the seven above, don't add it
mid-review. File a suggestion in the punch list instead.
- **Don't touch archives or refactor notes.** `.dev-docs/*-notes.md`
- **Don't touch archives or refactor notes.** `.claude/dev-docs/*-notes.md`
legitimately carry status language and reflect state at the time they
were written; don't normalize them.
- **Don't flag registry drift without reading the registry dict.**
@@ -229,7 +224,7 @@ Stale:
- docs/architecture.md:39 — "Phase 1: SETUP (host)" numbering implies sequential work-in-progress; phases are always-on
- docs/task-authoring.md:88 — "TODO: document verifier timeout knob"
- docs/getting-started.md:121 — example task ID "demo-fizzbuzz" renamed to "examples-fizzbuzz"
- .dev-docs/tested-agents.md:14 — lists claude-code-acp; registry only has claude-agent-acp
- .claude/dev-docs/tested-agents.md:14 — lists claude-code-acp; registry only has claude-agent-acp

Polish:
- README.md:98-134 — full src/ tree duplicates docs/architecture.md:12-38
32 changes: 29 additions & 3 deletions .claude/skills/launch-prep/SKILL.md
@@ -46,7 +46,7 @@ Run A → B → C → D serially (each skill spawns its own subagents; stacking
them saturates the pool).

**A — `/docs-review`** (full). Covers `README.md`, `docs/*.md`, `AGENTS.md`,
and the light-touch `.dev-docs/` set. Captures drift vs. code, stale refs,
and the light-touch `.claude/dev-docs/` set. Captures drift vs. code, stale refs,
link integrity, registry alignment. Supersedes the old ad-hoc docs pass.

**B — labs/ (ad-hoc subagent)** — `/docs-review` skips labs. Spawn one
@@ -90,10 +90,36 @@ If `ruff format` changed files: `git diff --name-only`, then `git add <those fil

```bash
source .env 2>/dev/null || true
.venv/bin/python -m pytest -m live tests/test_smoke.py -v
.venv/bin/python -m pytest -m live tests/test_smoke.py -v -ra \
--junitxml=/tmp/smoke.xml
# A skipped live smoke is NOT green — exit 0 on a run that never executed
# would false-green the e2e gate. pytest puts tests/skipped on the nested
# <testsuite> elements, so sum over them and fail unless one ran clean.
.venv/bin/python -c 'import sys,xml.etree.ElementTree as ET; \
es=list(ET.parse("/tmp/smoke.xml").getroot().iter("testsuite")); \
t=sum(int(e.get("tests",0)) for e in es); \
s=sum(int(e.get("skipped",0)) for e in es); \
sys.exit(0) if t and not s else (print("LIVE SMOKE SKIPPED OR NOT RUN — gate is RED, not green") or sys.exit(1))'
```

If Docker is unavailable, warn and ask to skip or abort — do not skip silently. Expected: `test_hello_world_smoke` passes with reward > 0 and non-empty trajectory.
The live smoke `skipif`s when Docker is down or the chosen model has no
credential, and pytest exits `0` on a skip. The JUnit check above turns that
silent pass into a hard failure so a skipped smoke cannot false-green the gate.
Expected: `test_hello_world_smoke` passes with reward > 0 and non-empty
trajectory.
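
For readers who find the `python -c` one-liner hard to follow, the same skip check can be sketched as a small function. This is an illustrative rewrite, not part of the skill: the name `smoke_gate_is_green` is hypothetical, and it takes the JUnit XML as a string rather than reading `/tmp/smoke.xml`. Like the one-liner, it looks only at the `tests` and `skipped` counters; test failures are assumed to be caught by pytest's own exit code.

```python
import xml.etree.ElementTree as ET


def smoke_gate_is_green(junit_xml: str) -> bool:
    """Return True only if at least one test ran and none were skipped.

    Hypothetical helper mirroring the inline check: pytest records the
    `tests` and `skipped` counters on <testsuite> elements, so the counts
    are summed across every <testsuite> in the report.
    """
    root = ET.fromstring(junit_xml)
    # iter() also yields the root itself when the root is a <testsuite>.
    suites = list(root.iter("testsuite"))
    total = sum(int(suite.get("tests", 0)) for suite in suites)
    skipped = sum(int(suite.get("skipped", 0)) for suite in suites)
    # Zero tests means the run never executed; any skip means it is not green.
    return total > 0 and skipped == 0
```

A caller could turn the boolean into an exit code with `sys.exit(0 if smoke_gate_is_green(xml_text) else 1)`, which matches the pass/fail behavior of the inline check.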

If the only credential on the machine is non-Anthropic, point the smoke at an
agent/model it can authenticate instead of skipping (proven combo:
openhands + deepseek):

```bash
export BENCHFLOW_SMOKE_AGENT=openhands
export BENCHFLOW_SMOKE_MODEL=deepseek/deepseek-chat
export DEEPSEEK_API_KEY=... DEEPSEEK_BASE_URL=https://api.deepseek.com
```

`BENCHFLOW_SMOKE_AGENT` and `BENCHFLOW_SMOKE_MODEL` must be set together; the
skip reason names the exact missing credential for whichever model is selected.
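
The "set together" rule can be pictured with a minimal sketch. The helper name `smoke_override` and the choice to raise `ValueError` are assumptions made for illustration; the real smoke test may report a half-configured override differently (for example, by skipping with a reason).

```python
from typing import Mapping, Optional, Tuple


def smoke_override(env: Mapping[str, str]) -> Optional[Tuple[str, str]]:
    """Return (agent, model) when both overrides are set, None when neither is.

    Hypothetical helper: raising on a half-set pair is an assumption, not
    the documented behavior of the smoke test.
    """
    agent = env.get("BENCHFLOW_SMOKE_AGENT")
    model = env.get("BENCHFLOW_SMOKE_MODEL")
    # Exactly one of the two being set is the invalid case.
    if bool(agent) != bool(model):
        raise ValueError(
            "BENCHFLOW_SMOKE_AGENT and BENCHFLOW_SMOKE_MODEL must be set together"
        )
    if agent and model:
        return agent, model
    return None
```

Passing `os.environ` as `env` would check the current shell's settings before the smoke is launched.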

---
