Superseded by #760: task.md v0.6.2 cutover #651

Closed
Yiminnn wants to merge 11 commits into main from rfc/skillsbench-taskmd-standard

Conversation

@Yiminnn Yiminnn commented Jun 9, 2026

Collaborator

Superseded by #760.

The original RFC branch is intentionally not the merge target: it is conflict-heavy and mixed with historical evidence work. The clean v0.6.2 implementation now lives in #760 and should be reviewed together with the SkillsBench native task.md cutover.

… openhands/DeepSeek run-through (DRAFT, do not merge)

mintlify Bot commented Jun 9, 2026


Preview deployment for your docs. Learn more about Mintlify Previews.

| Project | Status | Preview | Updated (UTC) |
| --- | --- | --- | --- |
| benchflow-bff148e7 | 🔴 Failed | | Jun 9, 2026, 4:54 PM |


Yiminnn added 8 commits June 9, 2026 17:35
…rk-policy authority, agent-judge guard, restore continue, schema_version validation

- task_authoring: _promote_legacy_task_md_alias_dirs rewrites /tests/->/verifier/ and
  /solution/->/oracle/ in promoted scripts so a migrated verifier runs at its native
  mount (review: silent runtime verifier breakage)
- config(SandboxConfig): network_mode is authoritative — an explicit network_mode that
  contradicts allow_internet=False now raises instead of silently downgrading to
  no-network; re-validate network policy after reconciliation
- config(TaskConfig): schema_version rejects unknown-major / non-numeric versions
- verifier: agent-judge + ors-episode reject empty inputs at parse, plus a runtime
  backstop in _collect_agent_judge_inputs (no evidence-less graded reward)
- cli: restore register_continue(app) (benchflow continue/continue-batch were dead code)
- docs(skillsbench-scope): soften the inaccurate lint-enforced glossary claim to advisory
- tests: +20 covering all of the above
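
The network-policy bullet above can be illustrated with a minimal sketch. The real `SandboxConfig` fields, accepted mode strings, and exception type are not shown in this PR, so `SandboxPolicy`, the `"bridge"` and `"none"` mode names, and the use of `ValueError` are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mode names; the real SandboxConfig values are not shown in this PR.
NO_NETWORK = "none"
KNOWN_MODES = {"bridge", NO_NETWORK}


@dataclass
class SandboxPolicy:
    """Illustrative stand-in for SandboxConfig (assumed shape)."""

    allow_internet: bool = True
    network_mode: Optional[str] = None  # None means "not set explicitly"

    def __post_init__(self) -> None:
        explicit = self.network_mode is not None
        if explicit and self.network_mode != NO_NETWORK and not self.allow_internet:
            # An explicit networked mode contradicts allow_internet=False:
            # raise instead of silently downgrading to no-network.
            raise ValueError(
                f"network_mode={self.network_mode!r} contradicts allow_internet=False"
            )
        if explicit:
            # network_mode is authoritative; align the legacy boolean with it.
            self.allow_internet = self.network_mode != NO_NETWORK
        else:
            # No explicit mode: derive it from the legacy boolean.
            self.network_mode = "bridge" if self.allow_internet else NO_NETWORK
        self._validate()  # re-validate the policy after reconciliation

    def _validate(self) -> None:
        if self.network_mode not in KNOWN_MODES:
            raise ValueError(f"unknown network_mode {self.network_mode!r}")
        if self.allow_internet != (self.network_mode != NO_NETWORK):
            raise ValueError("network policy is inconsistent after reconciliation")
```

In this sketch `SandboxPolicy(allow_internet=False)` resolves to the no-network mode, while `SandboxPolicy(allow_internet=False, network_mode="bridge")` raises.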

Full-suite failure set identical pre/post (74 pre-existing from the focused src/tests
overlay); no regressions.
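
As a rough illustration of the `schema_version` gate described in this commit, the sketch below rejects non-numeric versions and unknown majors. The function name `validate_schema_version` and the supported-major set are assumptions; the actual `TaskConfig` validator and the majors it accepts are not shown here.

```python
# Hypothetical: the set of majors the real TaskConfig accepts is not shown in this PR.
SUPPORTED_MAJORS = {1}


def validate_schema_version(value: str) -> str:
    """Return value unchanged if it is dotted-numeric with a supported major."""
    parts = value.split(".")
    # Every dot-separated component must be plain ASCII digits.
    if not all(p.isascii() and p.isdigit() for p in parts):
        raise ValueError(f"schema_version must be dotted numeric, got {value!r}")
    major = int(parts[0])
    if major not in SUPPORTED_MAJORS:
        raise ValueError(f"unsupported schema_version major {major} in {value!r}")
    return value
```

Failing at parse time means a task written against an unknown major never reaches the runner with silently misread fields.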
…reward tests, task-standard doc

- evaluation: _is_task_dir discovers on a cheap marker (task.md OR task.toml) instead of
  full check_task, so a structurally-invalid task.md dir fails loudly at run time rather
  than being silently dropped from discovery (review MEDIUM)
- tests: verifier mount-routing (native /verifier vs legacy /tests) + reward-precedence
  (reward.json primary + strict reward.txt cross-check) — the zero-tests gap (review HIGH #10)
- docs: add docs/task-standard.md (root-key rule, agent/runtime policy, reward files),
  resolving the broken glossary.yaml/README xrefs (review MEDIUM)

+16 tests, all green; evaluation.py adds no new failures (verified vs HEAD).
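
The cheap-marker discovery in the evaluation bullet could look roughly like the following. Only the marker file names `task.md` and `task.toml` come from the commit message; `is_task_dir` and `discover_tasks` are illustrative names, not the project's actual functions.

```python
from pathlib import Path

# Marker file names taken from the commit message above.
TASK_MARKERS = ("task.md", "task.toml")


def is_task_dir(path: Path) -> bool:
    """Cheap check: a directory counts as a task if either marker file exists.

    No structural validation happens here, so a malformed task.md is still
    discovered and can fail loudly later, at run time.
    """
    return path.is_dir() and any((path / name).is_file() for name in TASK_MARKERS)


def discover_tasks(root: Path) -> list[Path]:
    """Return the immediate subdirectories of root that look like tasks."""
    return sorted(p for p in root.iterdir() if is_task_dir(p))
```

The design choice is to keep discovery permissive and leave full validation to the run step, so an invalid task shows up as a visible failure instead of a missing row.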
…ate task-list header, robust frontmatter split

- run_matrix.sh: default MODEL to validated deepseek-v4-flash + verify the model id is in
  /models (not just key reachability) — removes the dead-model footgun
- simple_tasks.txt: correct the inaccurate all-network_mode=disabled header
- adapt_clean.py: line-based _split_frontmatter (a --- inside a YAML scalar can no longer
  truncate the body) + drop unused import
- .gitignore: dedup
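
A line-based frontmatter split, as the `adapt_clean.py` bullet describes, might be sketched as below. This is an assumed reconstruction and not the code from the branch: it treats only a line consisting solely of `---` as a delimiter, so a `---` embedded in a YAML scalar or indented inside a block scalar is left alone.

```python
def split_frontmatter(text: str) -> tuple[str, str]:
    """Split text into (frontmatter, body) using whole-line '---' delimiters.

    Returns ("", text) when there is no opening delimiter or no closing one.
    """
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].rstrip("\r\n") != "---":
        return "", text
    for i in range(1, len(lines)):
        # Only an unindented line that is exactly '---' closes the frontmatter.
        if lines[i].rstrip("\r\n") == "---":
            return "".join(lines[1:i]), "".join(lines[i + 1:])
    return "", text  # unterminated frontmatter: treat everything as body
```

A substring-based split such as `text.split("---", 2)` would cut at the first `---` it meets, which is how a scalar containing `---` could truncate the body.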
…rontmatter directly

The renderer no longer emits a ## prompt heading: the agent instruction is the
Markdown body immediately after the closing frontmatter ---. The parse path
already treats the post-frontmatter preamble as the instruction, and the
no-section branch now unescapes reserved headings so an instruction containing a
literal ## prompt / ## role: line round-trips verbatim. A legacy ## prompt
heading is still accepted on read for backward compatibility. Updates demo_task,
the init scaffold, and docs/task-standard.md; adds tests/test_taskmd_prompt_format.py.
…es), CI green

Supersedes the focused-subset RFC with the complete, internally-consistent
task.md standard from the private fork, the PR review fixes applied, and the
fork test suite reconciled. Suite: 3256 passed / 6 xfailed (pre-existing
JS-agent launcher invariant) / lint+format+type clean.
…he 87x3 run

Normalizations the full openhands+deepseek-v4-flash run (261/261 healthy
slots) required, now encoded so a fresh adapt reproduces the exact task
packages the run used:

- strip stray legacy top-level version="0.0" so TaskConfig defaults to the
  current schema_version (jax-computing-basics, pddl-airport-planning,
  pddl-tpp-planning)
- rewrite non-standard CTRF outputs to /logs/verifier/ctrf.json, including
  relative --ctrf paths (hvac-control, r2r-mpc-control, fix-visual-stability)
- rewrite leftover legacy /tests references the migrator misses: [ -d /tests ]
  reward gates and relative tests/ paths in shell, Path("/tests") and
  X / "tests" constructions in promoted Python verifiers
  (exam-block-sequencing)
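
The absolute-path part of these rewrites can be sketched with a single regular expression. This is an assumption-laden illustration, not the migrator's code: `rewrite_legacy_mounts` is a hypothetical name, the `/tests` to `/verifier` and `/solution` to `/oracle` mapping comes from the earlier commit message, and the relative `tests/` and `X / "tests"` cases mentioned above are deliberately not handled.

```python
import re

# Mapping taken from the commit messages: /tests -> /verifier, /solution -> /oracle.
LEGACY_ROOTS = {"tests": "verifier", "solution": "oracle"}

# Match a root-level /tests or /solution only when it is a whole path component:
# not preceded by a word character or '.', not followed by more name characters.
_LEGACY_MOUNT = re.compile(r"(?<![\w.])/(tests|solution)(?![\w.-])")


def rewrite_legacy_mounts(script: str) -> str:
    """Rewrite absolute legacy mount paths in a shell or Python verifier script."""
    return _LEGACY_MOUNT.sub(lambda m: "/" + LEGACY_ROOTS[m.group(1)], script)
```

The lookarounds keep nested paths such as `/app/tests` and longer names such as `/testsuite` untouched, while `[ -d /tests ]` and `Path("/tests")` are both rewritten.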

Validated: adapt --all over the 87 skillsbench tasks yields 0 issues and the
previously-failing verifier files byte-match the run-fixed versions.
@Yiminnn Yiminnn marked this pull request as ready for review June 10, 2026 21:54
Yiminnn added 2 commits June 10, 2026 22:07
- drop adapt_clean.py (duplicate of adapt.py; its /tests->/verifier fix is
  upstreamed into migrate_task_to_task_md) and run_matrix.sh (env-specific
  one-off runner) — README/VALIDATION updated to the canonical bench CLI flow
- drop tools/push-task-standard-handoff.sh (scratch handoff plumbing with
  hardcoded personal paths) and docs/reports/v05-skillsbench-192-validation.md
  (separate 0.5.2 workstream, unreferenced)
- revert experiments/skillsbench-fill/{publish.py,run_cell.sh} to main: the
  branch carried unrelated drive-by regressions (dropped TOKEN_COUNTER_KEYS
  scrubber whitelist, comment churn)
main content kept (TOKEN_COUNTER_KEYS whitelist etc.); only decorative
banner comments removed to satisfy test_comment_style.
@Yiminnn Yiminnn changed the title from "RFC: SkillsBench task.md standard — authoring stack + adapter + openhands/DeepSeek run-through (DRAFT)" to "RFC (evidence anchor, do-not-merge): SkillsBench task.md standard — superseded by v0.6 line; run-validated outputs shipped via skillsbench#929 + #714" Jun 13, 2026
Yiminnn commented Jun 13, 2026

Collaborator Author

Status update (2026-06-12): this PR will not merge — it stays open as the evidence anchor until its successors land, then closes.

The task.md core here was re-engineered on the v0.6 line (release/v0.6.0, ships to main via #665), so this branch is permanently CONFLICTING. Everything of value has been extracted:

Close this once #929 and the leaderboard archive PR land.

Yiminnn commented Jun 13, 2026

Collaborator Author

Evidence-archive PR is up: https://huggingface.co/datasets/benchflow/skillsbench-leaderboard/discussions/13 (261 winning trials, 4.0GB, 3 conditions × 87 tasks). With benchflow#714 and benchflow-ai/skillsbench#929 open, all extraction from this RFC is complete — close this once #929 + the HF PR land.

xdotli added a commit that referenced this pull request Jun 13, 2026
fix(task): port guard fixes from #651 — schema_version major gate, judge empty-inputs, network-mode contradiction
bingran-you pushed a commit to bingran-you/benchflow that referenced this pull request Jun 14, 2026
…ate, judge empty-inputs, network contradiction
@bingran-you bingran-you changed the title from "RFC (evidence anchor, do-not-merge): SkillsBench task.md standard — superseded by v0.6 line; run-validated outputs shipped via skillsbench#929 + #714" to "Superseded by #760: task.md v0.6.2 cutover" Jun 14, 2026
@bingran-you

Collaborator

Closing as superseded by #760. The clean v0.6.2 implementation is now the review and merge target, while this RFC branch remains historical context only.

@xdotli xdotli deleted the rfc/skillsbench-taskmd-standard branch June 14, 2026 21:44