
fix(cli): 0.6.1 dogfood papercuts batch 2 (eval view, hub footer, check hint)#756

Merged
xdotli merged 1 commit into main from fix/scaffold-verifier-wiring
Jun 14, 2026


@xdotli xdotli commented Jun 14, 2026

Member

Second batch of 0.6.0 dogfood fast-follow papercuts (targets main).

  • eval view <job-dir> no longer shows a blank "No trajectory files found" — render_rollout now indexes the job dir's rollout subdirectories so you know to drill in. (Regression test added.)
  • hub env list prints a Showing N of M… footer with how to refine (--search/--owner/--limit/--json) — a 20-row page of a 1270-env catalog previously read as "the provider only has 20".
  • tasks check --level publication-grade verifier-missing errors now include a remediation hint (author verifier/verifier.md; tasks migrate doesn't generate it) instead of a dead-end.

Full suite 4070 passed; ruff/format/ty clean.

Deferred — the scaffold no-op verifier (needs your design call)

The highest-value papercut isn't a quick patch: verifier.md wires command: ./test.sh (script strategy), test.sh writes 0.0, and a separate test_outputs.py is orphaned. Connecting them is a verifier-model decision:

  1. test.sh runs pytest test_outputs.py — requires pytest in the task's own Dockerfile (a real constraint on every scaffolded task).
  2. Switch verifier.md to a pytest strategy on test_outputs.py — benchflow's harness runs pytest, no task-Dockerfile dependency (cleaner, if a pytest strategy exists).
  3. Drop the orphaned test_outputs.py — author edits test.sh only (smallest change, removes the pytest option).

Which model do you want? I'll implement it once you pick.


Note

Low Risk
User-facing CLI messaging and HTML viewer behavior only; no changes to eval execution, auth, or data handling.

Overview
Three small CLI/UX fixes from 0.6.0 dogfood follow-ups.

bench eval view on a job directory (typical path from eval create artifacts) no longer shows an empty “No trajectory files found” page. render_rollout scans child dirs for turn*.txt or ACP trajectory files and returns HTML that lists rollout names and how to open bench eval view <job>/<rollout>.
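The job-directory indexing described above can be sketched roughly as follows. This is a minimal illustration, not the actual `render_rollout` code: the helper names `find_rollout_dirs` and `render_job_index` are hypothetical, and only the `turn*.txt` pattern is taken from the PR text (the ACP trajectory file names are not given here, so they are left out).

```python
import html
from pathlib import Path

# Assumption: only turn*.txt is named in the PR text; the ACP trajectory
# file pattern is unknown here and would need to be added.
TRAJECTORY_PATTERNS = ("turn*.txt",)


def find_rollout_dirs(job_dir: Path) -> list[Path]:
    """Return immediate child directories that contain trajectory files."""
    rollouts = []
    for child in sorted(job_dir.iterdir()):
        if child.is_dir() and any(
            any(child.glob(pattern)) for pattern in TRAJECTORY_PATTERNS
        ):
            rollouts.append(child)
    return rollouts


def render_job_index(job_dir: Path) -> str:
    """Render a small HTML index of rollouts instead of a blank page."""
    rollouts = find_rollout_dirs(job_dir)
    if not rollouts:
        return "<p>No trajectory files found</p>"
    items = "".join(f"<li>{html.escape(r.name)}</li>" for r in rollouts)
    return (
        f"<p>This is a job directory with {len(rollouts)} rollout(s):</p>"
        f"<ul>{items}</ul>"
    )
```

Only one level below the job directory is scanned in this sketch, which matches the "child dirs" wording; whether the real implementation recurses further is not stated.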

bench hub env list adds a dim footer: Showing N (optionally of total when the API returns it) plus pointers to --search, --owner, --limit, and --json, so a short page is not mistaken for the full catalog.
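As a rough illustration of the footer logic, assuming the total is optional because the API may not return it. The function name `format_footer` and the exact wording of the refine pointer are hypothetical; only the `Showing N of M` shape and the four flags come from the PR text.

```python
from typing import Optional


def format_footer(shown: int, total: Optional[int] = None) -> str:
    """Build the list footer; include the total only when it is known."""
    if total is not None:
        count = f"Showing {shown} of {total}"
    else:
        count = f"Showing {shown}"
    # Flag names are from the PR description; the sentence around them is assumed.
    return f"{count}. Refine with --search, --owner, --limit, or --json."
```

For a 20-row page of a 1270-environment catalog, `format_footer(20, 1270)` would make it explicit that the page is not the full catalog.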

bench tasks check --level publication-grade expands missing verifier/ and verifier/verifier.md errors with remediation text (author per docs; bench tasks migrate does not create the verifier package). CHANGELOG documents these; a regression test covers job-dir indexing.
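A minimal sketch of the remediation-hint idea, assuming the check inspects a task directory for `verifier/` and `verifier/verifier.md`. The function name `check_verifier` and the message wording are hypothetical and are not the actual `bench tasks check` output.

```python
from pathlib import Path

# Hypothetical hint text; the real message wording is not shown in this PR page.
HINT = (
    "Author verifier/verifier.md by hand (see the docs); "
    "`bench tasks migrate` does not generate it."
)


def check_verifier(task_dir: Path) -> list[str]:
    """Return verifier-missing errors, each carrying a remediation hint."""
    errors = []
    verifier_dir = task_dir / "verifier"
    if not verifier_dir.is_dir():
        errors.append(f"missing verifier/ directory. {HINT}")
    elif not (verifier_dir / "verifier.md").is_file():
        errors.append(f"missing verifier/verifier.md. {HINT}")
    return errors
```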

Reviewed by Cursor Bugbot for commit 6295c8e. Bugbot is set up for automated code reviews on this repo.

…ck hint)

Three contained dogfood papercuts:

- `bench eval view <job-dir>` rendered a blank "No trajectory files found" when
  given a job directory (the natural value from `eval create`'s "Artifacts:"
  line) because render_rollout never looked below the top level. It now indexes
  the rollout subdirectories with a "drill into one" pointer.
- `bench hub env list` now prints a footer (`Showing N of M…`) with how to refine
  (--search/--owner/--limit/--json). A 20-row page of a 1270-environment catalog
  previously read as "the provider only has 20".
- `bench tasks check --level publication-grade` verifier-missing errors now carry
  a remediation hint (author verifier/verifier.md; note `tasks migrate` does not
  generate it), so the migrate→publication-grade path is no longer a dead-end.

Regression test for the eval-view job-dir index. Full suite 4070 passed.

Deferred (needs a verifier-model design decision, not a quick patch): the
scaffold no-op verifier — verifier.md wires the `./test.sh` script strategy
while a separate test_outputs.py is orphaned; connecting them means either
running pytest in the task's own sandbox (needs pytest in its Dockerfile) or
switching verifier.md to a pytest strategy.
@xdotli xdotli merged commit cfa1b4a into main Jun 14, 2026
3 of 4 checks passed
@xdotli xdotli deleted the fix/scaffold-verifier-wiring branch June 14, 2026 01:06

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6295c8ef64

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +182 to +183
f"View one with <code>bench eval view {html.escape(rollout_dir.name)}/"
f"&lt;rollout&gt;</code>:</p><ul>{items}</ul>"


P2: Preserve the parent job path in the drill-in hint

When the user passes the job directory printed by eval create (for example jobs/<job>), this hint only interpolates rollout_dir.name, so it suggests bench eval view <job>/<rollout> instead of bench eval view jobs/<job>/<rollout>. Running the suggested command from the same CWD as the original run still points at a nonexistent path, so the new index leads users to a broken next step; construct the hint from the original rollout_dir path rather than just its basename.
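One way the suggestion could be applied, sketched under the assumption that `rollout_dir` is a `pathlib.Path` holding the path exactly as the user passed it. The function name `drill_in_hint` is hypothetical, and this is not the fix that was merged.

```python
import html
from pathlib import Path


def drill_in_hint(rollout_dir: Path) -> str:
    """Build the drill-in hint from the full path, not only its basename."""
    # rollout_dir.name would drop the parent ("jobs/"); as_posix() keeps it.
    shown = html.escape(rollout_dir.as_posix())
    return f"View one with <code>bench eval view {shown}/&lt;rollout&gt;</code>"
```

With `Path("jobs/job1")`, the basename alone gives `job1`, while the full path gives `jobs/job1`, so the suggested command resolves from the same working directory as the original run.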

Useful? React with 👍 / 👎.


cursor Bot commented Jun 14, 2026


Bugbot couldn't run - usage limit reached

Bugbot is counted against Cursor usage for this user or team, and this run hit a usage or spend limit.

A user or team admin can review and increase usage limits in the Cursor dashboard.

(requestId: serverGenReqId_e0fc55a7-d812-4fab-80f6-2c616e2316dd)
