Superseded by #760: task.md v0.6.2 cutover by Yiminnn · Pull Request #651 · benchflow-ai/benchflow

Yiminnn · 2026-06-09T16:37:34Z

Superseded by #760.

The original RFC branch is intentionally not the merge target: it is conflict-heavy and mixed with historical evidence work. The clean v0.6.2 implementation now lives in #760 and should be reviewed together with the SkillsBench native task.md cutover.

… openhands/DeepSeek run-through (DRAFT, do not merge)

mintlify · 2026-06-09T16:54:44Z

Preview deployment for your docs. Learn more about Mintlify Previews.

| Project | Status | Preview | Updated (UTC) |
| --- | --- | --- | --- |
| benchflow-bff148e7 | 🔴 Failed | – | Jun 9, 2026, 4:54 PM |

…rk-policy authority, agent-judge guard, restore continue, schema_version validation

- task_authoring: _promote_legacy_task_md_alias_dirs rewrites /tests/->/verifier/ and /solution/->/oracle/ in promoted scripts so a migrated verifier runs at its native mount (review: silent runtime verifier breakage)
- config(SandboxConfig): network_mode is authoritative: an explicit network_mode that contradicts allow_internet=False now raises instead of silently downgrading to no-network; re-validate network policy after reconciliation
- config(TaskConfig): schema_version rejects unknown-major / non-numeric versions
- verifier: agent-judge + ors-episode reject empty inputs at parse, plus a runtime backstop in _collect_agent_judge_inputs (no evidence-less graded reward)
- cli: restore register_continue(app) (benchflow continue/continue-batch were dead code)
- docs(skillsbench-scope): soften the inaccurate lint-enforced glossary claim to advisory
- tests: +20 covering all of the above

Full-suite failure set identical pre/post (74 pre-existing from the focused src/tests overlay); no regressions.
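For reviewers skimming the network-policy change, a minimal sketch of the "network_mode is authoritative" rule from the config(SandboxConfig) item may help. This is not the code from this branch: `SandboxPolicy`, `reconcile_network`, and the `"bridge"` mode value are hypothetical names chosen for illustration. Only `allow_internet`, `network_mode`, and the raise-on-contradiction behavior come from the commit message.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SandboxPolicy:
    """Hypothetical stand-in for SandboxConfig (not the real class)."""

    allow_internet: bool = True
    network_mode: Optional[str] = None  # "disabled" or "bridge"; "bridge" is an assumed value


def reconcile_network(policy: SandboxPolicy) -> str:
    """Return the effective network mode, treating an explicit network_mode as authoritative."""
    if policy.network_mode is None:
        # Nothing explicit was set: derive the mode from the boolean flag.
        return "bridge" if policy.allow_internet else "disabled"
    if not policy.allow_internet and policy.network_mode != "disabled":
        # Contradiction: raise instead of silently downgrading to no-network.
        raise ValueError(
            f"network_mode={policy.network_mode!r} contradicts allow_internet=False"
        )
    return policy.network_mode
```

The point of raising here is that a task author who asked for network access and also set `allow_internet=False` learns about the conflict at config time rather than through a verifier that quietly runs offline.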

…reward tests, task-standard doc

- evaluation: _is_task_dir discovers on a cheap marker (task.md OR task.toml) instead of full check_task, so a structurally-invalid task.md dir fails loudly at run time rather than being silently dropped from discovery (review MEDIUM)
- tests: verifier mount-routing (native /verifier vs legacy /tests) + reward-precedence (reward.json primary + strict reward.txt cross-check), covering the zero-tests gap (review HIGH #10)
- docs: add docs/task-standard.md (root-key rule, agent/runtime policy, reward files), resolving the broken glossary.yaml/README xrefs (review MEDIUM)

+16 tests, all green; evaluation.py adds no new failures (verified vs HEAD).
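The discovery change in the evaluation item can be illustrated with a short sketch. It assumes only what the commit message states (a directory counts as a task if it contains task.md or task.toml); `is_task_dir` and `discover_tasks` below are illustrative helpers, not the actual `_is_task_dir` implementation.

```python
from pathlib import Path

# Marker files named in the commit message.
TASK_MARKERS = ("task.md", "task.toml")


def is_task_dir(path: Path) -> bool:
    """Cheap check: the directory contains at least one marker file."""
    return path.is_dir() and any((path / name).is_file() for name in TASK_MARKERS)


def discover_tasks(root: Path) -> list[Path]:
    """List candidate task directories; full validation is deferred to run time."""
    return sorted(p for p in root.iterdir() if is_task_dir(p))
```

Because the check never parses the marker file, a malformed task.md still shows up in the discovered set and can fail with a clear error when it is run, instead of vanishing from the task list.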

…ate task-list header, robust frontmatter split

- run_matrix.sh: default MODEL to validated deepseek-v4-flash + verify the model id is in /models (not just key reachability), which removes the dead-model footgun
- simple_tasks.txt: correct the inaccurate all-network_mode=disabled header
- adapt_clean.py: line-based _split_frontmatter (a --- inside a YAML scalar can no longer truncate the body) + drop unused import
- .gitignore: dedup
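To make the frontmatter fix concrete, here is a minimal sketch of a line-based split. It is an assumption about the approach, not the `_split_frontmatter` from adapt_clean.py: the function name, return shape, and error handling are chosen for illustration. The idea is that only a line consisting solely of `---` (ignoring trailing whitespace) counts as a delimiter.

```python
def split_frontmatter(text: str) -> tuple[str, str]:
    """Split a document into (frontmatter, body) using whole-line '---' delimiters only."""
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].rstrip() != "---":
        return "", text  # no frontmatter block at all
    for i in range(1, len(lines)):
        # rstrip() keeps leading indentation, so an indented '---' inside a
        # YAML block scalar is not mistaken for the closing delimiter.
        if lines[i].rstrip() == "---":
            return "".join(lines[1:i]), "".join(lines[i + 1:])
    raise ValueError("unterminated frontmatter: no closing '---' line")
```

A substring search such as `text.split("---")` would cut the document at `title: a --- b`; matching whole lines avoids that class of truncation.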

…rontmatter directly

The renderer no longer emits a ## prompt heading: the agent instruction is the Markdown body immediately after the closing frontmatter ---. The parse path already treats the post-frontmatter preamble as the instruction, and the no-section branch now unescapes reserved headings so an instruction containing a literal ## prompt / ## role: line round-trips verbatim. A legacy ## prompt heading is still accepted on read for backward compatibility. Updates demo_task, the init scaffold, and docs/task-standard.md; adds tests/test_taskmd_prompt_format.py.

…es), CI green

Supersedes the focused-subset RFC with the complete, internally-consistent task.md standard from the private fork, the PR review fixes applied, and the fork test suite reconciled. Suite: 3256 passed / 6 xfailed (pre-existing JS-agent launcher invariant) / lint+format+type clean.

…he 87x3 run

Normalizations the full openhands+deepseek-v4-flash run (261/261 healthy slots) required, now encoded so a fresh adapt reproduces the exact task packages the run used:

- strip stray legacy top-level version="0.0" so TaskConfig defaults to the current schema_version (jax-computing-basics, pddl-airport-planning, pddl-tpp-planning)
- rewrite non-standard CTRF outputs to /logs/verifier/ctrf.json, including relative --ctrf paths (hvac-control, r2r-mpc-control, fix-visual-stability)
- rewrite leftover legacy /tests references the migrator misses: [ -d /tests ] reward gates and relative tests/ paths in shell, Path("/tests") and X / "tests" constructions in promoted Python verifiers (exam-block-sequencing)

Validated: adapt --all over the 87 skillsbench tasks yields 0 issues and the previously-failing verifier files byte-match the run-fixed versions.
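The legacy-mount rewrite is easier to review with an example. The sketch below covers only the absolute-path cases (`/tests/...`, `[ -d /tests ]`, `Path("/tests")`), using the /tests to /verifier and /solution to /oracle mapping named in the earlier task_authoring commit. It is an illustrative regex, not the adapter's normalizer: `rewrite_legacy_mounts` and its boundary rules are assumptions, and the relative `tests/` and `X / "tests"` cases mentioned above are deliberately left out.

```python
import re

# Legacy mount -> native mount, as described in the commit messages.
LEGACY_MOUNTS = {"tests": "/verifier", "solution": "/oracle"}

# Match /tests or /solution only as a top-level absolute path component:
# not preceded by another path or word character, and not followed by one.
_MOUNT_RE = re.compile(r"(?<![\w./-])/(tests|solution)(?![\w-])")


def rewrite_legacy_mounts(script: str) -> str:
    """Rewrite absolute legacy mount references in a shell or Python verifier script."""
    return _MOUNT_RE.sub(lambda m: LEGACY_MOUNTS[m.group(1)], script)
```

The lookarounds keep paths such as `/app/tests/x` or `/testsuite` untouched, which matters because a blind string replace would corrupt unrelated directories.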

- drop adapt_clean.py (duplicate of adapt.py; its /tests->/verifier fix is upstreamed into migrate_task_to_task_md) and run_matrix.sh (env-specific one-off runner); README/VALIDATION updated to the canonical bench CLI flow
- drop tools/push-task-standard-handoff.sh (scratch handoff plumbing with hardcoded personal paths) and docs/reports/v05-skillsbench-192-validation.md (separate 0.5.2 workstream, unreferenced)
- revert experiments/skillsbench-fill/{publish.py,run_cell.sh} to main: the branch carried unrelated drive-by regressions (dropped TOKEN_COUNTER_KEYS scrubber whitelist, comment churn)

main content kept (TOKEN_COUNTER_KEYS whitelist etc.); only decorative banner comments removed to satisfy test_comment_style.

Yiminnn · 2026-06-13T00:01:09Z

Status update (2026-06-12): this PR will not merge — it stays open as the evidence anchor until its successors land, then closes.

The task.md core here was re-engineered on the v0.6 line (release/v0.6.0, ships to main via #665), so this branch is permanently CONFLICTING. Everything of value has been extracted:

Guard fixes → fix(task): port guard fixes from #651 — schema_version major gate, judge empty-inputs, network-mode contradiction #714 (vs release/v0.6.0): schema_version major gate, agent-judge/ors-episode empty-inputs guards, network-mode contradiction error — all verified absent on 0.6.0-rc.6. Full suite 3834 passed.
The 87 run-validated task.md packages (+14 statically-checked extras) → feat(tasks)!: convert SkillsBench to native task.md packages skillsbench#929 (skillsbench@1.2 draft, gated on chore: release v0.6.0 #665 GA).
The 87×3 run evidence (261/261 healthy, openhands + deepseek-v4-flash, Daytona): trajectories + results staged for an evidence-archive PR to benchflow/skillsbench-leaderboard (1 trial/condition — explicitly not a leaderboard entry).
The adapter (experiments/skillsbench-taskmd-run/adapt.py) with its legacy-quirk normalizers remains referenced from this branch (v0.6 deleted experiments/; the normalizers are not in bench tasks migrate).
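As a rough illustration of the schema_version major gate listed under the guard fixes, the following sketch rejects non-numeric versions and unknown majors. It is an assumption about the shape of the check, not the code ported in #714: `validate_schema_version` is a hypothetical name, and the set of supported majors is passed in because this thread does not state which majors are accepted.

```python
def validate_schema_version(value: str, supported_majors: set[int]) -> str:
    """Accept dotted numeric versions whose major component is supported."""
    parts = value.split(".")
    # Every dot-separated component must be a plain ASCII number.
    if not all(p.isascii() and p.isdigit() for p in parts):
        raise ValueError(f"schema_version {value!r} is not numeric")
    major = int(parts[0])
    if major not in supported_majors:
        raise ValueError(
            f"schema_version {value!r} has unsupported major version {major}"
        )
    return value
```

For example, `validate_schema_version("1.2", {1})` returns `"1.2"`, while `"2.0"`, `"abc"`, and an empty string all raise `ValueError` under the same supported set.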

Close this once #929 and the leaderboard archive PR land.

Yiminnn · 2026-06-13T00:10:55Z

Evidence-archive PR is up: https://huggingface.co/datasets/benchflow/skillsbench-leaderboard/discussions/13 (261 winning trials, 4.0GB, 3 conditions × 87 tasks). With benchflow#714 and benchflow-ai/skillsbench#929 open, all extraction from this RFC is complete — close this once #929 + the HF PR land.

bingran-you · 2026-06-14T16:58:05Z

Closing as superseded by #760. The clean v0.6.2 implementation is now the review and merge target, while this RFC branch remains historical context only.

RFC: SkillsBench task.md standard — authoring stack + clean adapter +…

cd38d57

… openhands/DeepSeek run-through (DRAFT, do not merge)

Yiminnn added 8 commits June 9, 2026 17:35

style: ruff lint + format (CI lint/format-check steps)

0c47b30

chore: ruff-format skillsbench adapter

14e0152

Yiminnn marked this pull request as ready for review June 10, 2026 21:54

Yiminnn added 2 commits June 10, 2026 22:07

fix: strip section-heading comments from restored skillsbench-fill files

f432fa3

main content kept (TOKEN_COUNTER_KEYS whitelist etc.); only decorative banner comments removed to satisfy test_comment_style.

This was referenced Jun 12, 2026

fix(task): port guard fixes from #651 — schema_version major gate, judge empty-inputs, network-mode contradiction #714

Merged

feat(tasks)!: convert SkillsBench to native task.md packages benchflow-ai/skillsbench#929

Merged

Yiminnn changed the title ~~RFC: SkillsBench task.md standard — authoring stack + adapter + openhands/DeepSeek run-through (DRAFT)~~ RFC (evidence anchor, do-not-merge): SkillsBench task.md standard — superseded by v0.6 line; run-validated outputs shipped via skillsbench#929 + #714 Jun 13, 2026

xdotli added a commit that referenced this pull request Jun 13, 2026

Merge pull request #714 from benchflow-ai/fix/v06-guard-ports

a8337c5

fix(task): port guard fixes from #651 — schema_version major gate, judge empty-inputs, network-mode contradiction

bingran-you pushed a commit to bingran-you/benchflow that referenced this pull request Jun 14, 2026

fix(task): port benchflow-ai#651 guard fixes — schema_version major g…

289258a

…ate, judge empty-inputs, network contradiction

bingran-you changed the title ~~RFC (evidence anchor, do-not-merge): SkillsBench task.md standard — superseded by v0.6 line; run-validated outputs shipped via skillsbench#929 + #714~~ Superseded by #760: task.md v0.6.2 cutover Jun 14, 2026

bingran-you closed this Jun 14, 2026

xdotli deleted the rfc/skillsbench-taskmd-standard branch June 14, 2026 21:44

Yiminnn wants to merge 11 commits into main from rfc/skillsbench-taskmd-standard

2 participants
