fix(cli): 0.6.1 dogfood papercuts batch 2 (eval view, hub footer, check hint) by xdotli · Pull Request #756 · benchflow-ai/benchflow

xdotli · 2026-06-14T01:03:26Z

Second batch of 0.6.0 dogfood fast-follow papercuts (targets main).

eval view <job-dir> no longer shows a blank "No trajectory files found" — render_rollout now indexes the job dir's rollout subdirectories so you know to drill in. (Regression test added.)
hub env list prints a Showing N of M… footer with how to refine (--search/--owner/--limit/--json) — a 20-row page of a 1270-env catalog previously read as "the provider only has 20".
tasks check --level publication-grade verifier-missing errors now include a remediation hint (author verifier/verifier.md; tasks migrate doesn't generate it) instead of a dead-end.

Full suite 4070 passed; ruff/format/ty clean.

Deferred — the scaffold no-op verifier (needs your design call)

The highest-value papercut isn't a quick patch: verifier.md wires command: ./test.sh (script strategy), test.sh writes 0.0, and a separate test_outputs.py is orphaned. Connecting them is a verifier-model decision:

1. Have test.sh run pytest test_outputs.py. This requires pytest in the task's own Dockerfile (a real constraint on every scaffolded task).
2. Switch verifier.md to a pytest strategy on test_outputs.py. benchflow's harness runs pytest, so there is no task-Dockerfile dependency (cleaner, if a pytest strategy exists).
3. Drop the orphaned test_outputs.py. The author edits test.sh only (smallest change, but it removes the pytest option).

Which model do you want? I'll implement it once you pick.

Note

Low Risk
User-facing CLI messaging and HTML viewer behavior only; no changes to eval execution, auth, or data handling.

Overview
Three small CLI/UX fixes from 0.6.0 dogfood follow-ups.

bench eval view on a job directory (typical path from eval create artifacts) no longer shows an empty “No trajectory files found” page. render_rollout scans child dirs for turn*.txt or ACP trajectory files and returns HTML that lists rollout names and how to open bench eval view <job>/<rollout>.
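
The job-directory indexing described above can be sketched roughly as follows. This is a minimal illustration, not benchflow's actual render_rollout: the helper names `find_rollout_dirs` and `render_job_index` are hypothetical, and only the turn*.txt pattern is checked because the ACP trajectory file names are not given in this PR.

```python
import html
from pathlib import Path


def find_rollout_dirs(job_dir: Path) -> list[Path]:
    """Return immediate child directories holding at least one turn*.txt file.

    Hypothetical helper: the real code also recognises ACP trajectory files,
    whose names are not stated in the PR, so they are omitted here.
    """
    return sorted(
        child
        for child in job_dir.iterdir()
        if child.is_dir() and any(child.glob("turn*.txt"))
    )


def render_job_index(job_dir: Path) -> str:
    """Build a small HTML index of rollouts, or the old fallback if none exist."""
    rollouts = find_rollout_dirs(job_dir)
    if not rollouts:
        return "<p>No trajectory files found</p>"
    # Escape directory names before embedding them in HTML.
    items = "".join(f"<li>{html.escape(r.name)}</li>" for r in rollouts)
    return f"<p>Rollouts in {html.escape(str(job_dir))}:</p><ul>{items}</ul>"
```

Only one level of subdirectories is scanned here, which matches the "job dir contains rollout dirs" layout the description implies; a deeper layout would need a recursive search.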

bench hub env list adds a dim footer: Showing N (optionally of total when the API returns it) plus pointers to --search, --owner, --limit, and --json, so a short page is not mistaken for the full catalog.
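
A footer of this shape might be built as in the sketch below. The function name `format_footer` and the exact wording of the string are assumptions for illustration; only the "Showing N", optional "of total", and the four flag names come from the PR.

```python
from typing import Optional


def format_footer(shown: int, total: Optional[int] = None) -> str:
    """Format a 'Showing N [of M]' footer with refinement pointers.

    `total` is optional because the description says the API does not
    always return it. The wording is illustrative, not the CLI's real text.
    """
    count = f"Showing {shown}" if total is None else f"Showing {shown} of {total}"
    return f"{count}. Refine with --search, --owner, --limit, or --json."


print(format_footer(20, 1270))
print(format_footer(20))
```

Keeping the total optional lets the same footer degrade gracefully when the provider does not report a catalog size.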

bench tasks check --level publication-grade expands missing verifier/ and verifier/verifier.md errors with remediation text (author per docs; bench tasks migrate does not create the verifier package). CHANGELOG documents these; a regression test covers job-dir indexing.
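
As a rough illustration of attaching a remediation hint to the verifier-missing errors, consider the sketch below. The function `check_verifier`, the constant `HINT`, and the message wording are hypothetical; the actual check lives inside bench tasks check and is not shown in this PR.

```python
from pathlib import Path

# Illustrative wording only; the real CLI message is not quoted in the PR.
HINT = "Author verifier/verifier.md by hand; `bench tasks migrate` does not generate it."


def check_verifier(task_dir: Path) -> list[str]:
    """Return verifier-related errors for a task directory, each with a hint."""
    errors: list[str] = []
    verifier_dir = task_dir / "verifier"
    if not verifier_dir.is_dir():
        errors.append(f"missing verifier/ directory. {HINT}")
    elif not (verifier_dir / "verifier.md").is_file():
        errors.append(f"missing verifier/verifier.md. {HINT}")
    return errors
```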

Reviewed by Cursor Bugbot for commit 6295c8e. Bugbot is set up for automated code reviews on this repo.

…ck hint)

Three contained dogfood papercuts:

- `bench eval view <job-dir>` rendered a blank "No trajectory files found" when given a job directory (the natural value from `eval create`'s "Artifacts:" line) because render_rollout never looked below the top level. It now indexes the rollout subdirectories with a "drill into one" pointer.
- `bench hub env list` now prints a footer (`Showing N of M…`) with how to refine (--search/--owner/--limit/--json). A 20-row page of a 1270-environment catalog previously read as "the provider only has 20".
- `bench tasks check --level publication-grade` verifier-missing errors now carry a remediation hint (author verifier/verifier.md; note `tasks migrate` does not generate it), so the migrate→publication-grade path is no longer a dead-end.

Regression test for the eval-view job-dir index. Full suite 4070 passed.

Deferred (needs a verifier-model design decision, not a quick patch): the scaffold no-op verifier. verifier.md wires the `./test.sh` script strategy while a separate test_outputs.py is orphaned; connecting them means either running pytest in the task's own sandbox (needs pytest in its Dockerfile) or switching verifier.md to a pytest strategy.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6295c8ef64


chatgpt-codex-connector · 2026-06-14T01:06:38Z

+                f"View one with <code>bench eval view {html.escape(rollout_dir.name)}/"
+                f"&lt;rollout&gt;</code>:</p><ul>{items}</ul>"


Preserve the parent job path in the drill-in hint

When the user passes the job directory printed by eval create (for example jobs/<job>), this hint only interpolates rollout_dir.name, so it suggests bench eval view <job>/<rollout> instead of bench eval view jobs/<job>/<rollout>. Running the suggested command from the same CWD as the original run still points at a nonexistent path, so the new index leads users to a broken next step; construct the hint from the original rollout_dir path rather than just its basename.
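
One way to build the hint from the full path rather than the basename is sketched below. It is an assumption-laden illustration, not the fix that landed: `drill_in_hint` is a hypothetical helper, and using `as_posix()` (forward slashes on every platform) is a choice made here, where `str(rollout_dir)` would keep the platform's native separator.

```python
import html
from pathlib import Path


def drill_in_hint(rollout_dir: Path) -> str:
    """Build the drill-in hint from the path the user passed, not its basename."""
    # Keep the parent components (e.g. "jobs/") so the suggested command
    # resolves from the same working directory as the original run.
    job_path = html.escape(rollout_dir.as_posix())
    return f"View one with <code>bench eval view {job_path}/&lt;rollout&gt;</code>"


print(drill_in_hint(Path("jobs/my-job")))
```

With `rollout_dir.name` the same input would yield `my-job/<rollout>`, which is the broken path the review describes.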


cursor · 2026-06-14T01:19:12Z

Bugbot couldn't run - usage limit reached

Bugbot is counted against Cursor usage for this user or team, and this run hit a usage or spend limit.

A user or team admin can review and increase usage limits in the Cursor dashboard.

(requestId: serverGenReqId_e0fc55a7-d812-4fab-80f6-2c616e2316dd)

xdotli merged commit cfa1b4a into main Jun 14, 2026
3 of 4 checks passed

xdotli deleted the fix/scaffold-verifier-wiring branch June 14, 2026 01:06

chatgpt-codex-connector Bot reviewed Jun 14, 2026

xdotli merged 1 commit into main from fix/scaffold-verifier-wiring
