Superseded by #760: task.md v0.6.2 cutover #651
… openhands/DeepSeek run-through (DRAFT, do not merge)
…rk-policy authority, agent-judge guard, restore continue, schema_version validation
- task_authoring: _promote_legacy_task_md_alias_dirs rewrites /tests/->/verifier/ and /solution/->/oracle/ in promoted scripts so a migrated verifier runs at its native mount (review: silent runtime verifier breakage)
- config(SandboxConfig): network_mode is authoritative: an explicit network_mode that contradicts allow_internet=False now raises instead of silently downgrading to no-network; re-validate network policy after reconciliation
- config(TaskConfig): schema_version rejects unknown-major / non-numeric versions
- verifier: agent-judge + ors-episode reject empty inputs at parse, plus a runtime backstop in _collect_agent_judge_inputs (no evidence-less graded reward)
- cli: restore register_continue(app) (benchflow continue/continue-batch were dead code)
- docs(skillsbench-scope): soften the inaccurate lint-enforced glossary claim to advisory
- tests: +20 covering all of the above
Full-suite failure set identical pre/post (74 pre-existing from the focused src/tests overlay); no regressions.
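The two config guards in this commit can be illustrated with a minimal sketch. It is not the benchflow implementation: the function names, the `SUPPORTED_MAJORS` set, and the `"enabled"` mode name are assumptions made for the example; only `schema_version`, `network_mode`, `allow_internet`, and the `disabled` mode appear in this thread.

```python
from typing import Optional

# Hypothetical: the set of accepted major versions is not stated in this PR.
SUPPORTED_MAJORS = {0}


def validate_schema_version(value: str) -> str:
    """Reject non-numeric versions and versions with an unknown major."""
    parts = value.split(".")
    if not all(p.isdecimal() for p in parts):
        raise ValueError(f"schema_version must be dotted integers, got {value!r}")
    if int(parts[0]) not in SUPPORTED_MAJORS:
        raise ValueError(f"unsupported schema_version major in {value!r}")
    return value


def reconcile_network(network_mode: Optional[str], allow_internet: bool) -> str:
    """Treat an explicit network_mode as authoritative; raise on contradiction."""
    if network_mode is None:
        # No explicit mode: derive one from the boolean ("enabled" is assumed).
        return "enabled" if allow_internet else "disabled"
    if network_mode != "disabled" and not allow_internet:
        # Raise instead of silently downgrading to no-network.
        raise ValueError(
            f"network_mode={network_mode!r} contradicts allow_internet=False"
        )
    return network_mode
```

Raising on the contradiction keeps a misconfigured task from running with a quieter network policy than its author wrote down.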
…reward tests, task-standard doc
- evaluation: _is_task_dir discovers on a cheap marker (task.md OR task.toml) instead of full check_task, so a structurally-invalid task.md dir fails loudly at run time rather than being silently dropped from discovery (review MEDIUM)
- tests: verifier mount-routing (native /verifier vs legacy /tests) + reward-precedence (reward.json primary + strict reward.txt cross-check), the zero-tests gap (review HIGH #10)
- docs: add docs/task-standard.md (root-key rule, agent/runtime policy, reward files), resolving the broken glossary.yaml/README xrefs (review MEDIUM)
+16 tests, all green; evaluation.py adds no new failures (verified vs HEAD).
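A rough sketch of marker-based discovery, assuming only what the commit message states (a directory counts as a task if it contains task.md or task.toml). The names `is_task_dir` and `discover_tasks` are illustrative and not taken from evaluation.py.

```python
from pathlib import Path
from typing import List

# Marker names come from the commit message; the rest is assumed.
TASK_MARKERS = ("task.md", "task.toml")


def is_task_dir(path: Path) -> bool:
    """Cheap check: a directory that holds at least one marker file."""
    return path.is_dir() and any((path / name).is_file() for name in TASK_MARKERS)


def discover_tasks(root: Path) -> List[Path]:
    """List candidate task dirs without validating their contents."""
    return sorted(p for p in root.iterdir() if is_task_dir(p))
```

Because discovery no longer runs full validation, a malformed task.md directory is still listed and fails later, at run time, where the error is visible.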
…ate task-list header, robust frontmatter split
- run_matrix.sh: default MODEL to validated deepseek-v4-flash + verify the model id is in /models (not just key reachability); removes the dead-model footgun
- simple_tasks.txt: correct the inaccurate all-network_mode=disabled header
- adapt_clean.py: line-based _split_frontmatter (a --- inside a YAML scalar can no longer truncate the body) + drop unused import
- .gitignore: dedup
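The line-based split can be sketched as follows. This is a guess at the shape of the fix, not the code from adapt_clean.py: a delimiter is recognized only when `---` is the entire line, so a `---` embedded in a quoted scalar or indented inside a block scalar does not end the frontmatter.

```python
from typing import Tuple


def split_frontmatter(text: str) -> Tuple[str, str]:
    """Split '---' delimited frontmatter from the body, line by line.

    Returns (frontmatter, body). If the text has no opening or closing
    delimiter line, the frontmatter is empty and the body is the full text.
    """
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].rstrip("\r\n") != "---":
        return "", text
    for i in range(1, len(lines)):
        # Only an unindented line that is exactly '---' closes the block.
        if lines[i].rstrip("\r\n") == "---":
            return "".join(lines[1:i]), "".join(lines[i + 1:])
    return "", text
```

A substring search for `---` would stop at the first occurrence anywhere in the YAML, which is the truncation the commit describes.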
…rontmatter directly
The renderer no longer emits a ## prompt heading: the agent instruction is the Markdown body immediately after the closing frontmatter ---. The parse path already treats the post-frontmatter preamble as the instruction, and the no-section branch now unescapes reserved headings so an instruction containing a literal ## prompt / ## role: line round-trips verbatim. A legacy ## prompt heading is still accepted on read for backward compatibility. Updates demo_task, the init scaffold, and docs/task-standard.md; adds tests/test_taskmd_prompt_format.py.
…es), CI green
Supersedes the focused-subset RFC with the complete, internally-consistent task.md standard from the private fork, the PR review fixes applied, and the fork test suite reconciled. Suite: 3256 passed / 6 xfailed (pre-existing JS-agent launcher invariant) / lint+format+type clean.
…he 87x3 run
Normalizations the full openhands+deepseek-v4-flash run (261/261 healthy
slots) required, now encoded so a fresh adapt reproduces the exact task
packages the run used:
- strip stray legacy top-level version="0.0" so TaskConfig defaults to the
current schema_version (jax-computing-basics, pddl-airport-planning,
pddl-tpp-planning)
- rewrite non-standard CTRF outputs to /logs/verifier/ctrf.json, including
relative --ctrf paths (hvac-control, r2r-mpc-control, fix-visual-stability)
- rewrite leftover legacy /tests references the migrator misses: [ -d /tests ]
reward gates and relative tests/ paths in shell, Path("/tests") and
X / "tests" constructions in promoted Python verifiers
(exam-block-sequencing)
Validated: adapt --all over the 87 skillsbench tasks yields 0 issues and the
previously-failing verifier files byte-match the run-fixed versions.
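As an illustration of the absolute-path part of these rewrites, here is a minimal regex sketch. It is an assumption about how such a normalization could look, not the adapter's code, and it deliberately skips the relative tests/ paths and the X / "tests" constructions mentioned above. The /tests->/verifier and /solution->/oracle mapping is the one named earlier in this thread.

```python
import re

# Mapping taken from the commit messages in this PR.
_LEGACY_MOUNTS = {"tests": "verifier", "solution": "oracle"}

# Match /tests or /solution only as a root-level path component:
# not preceded by a path-like character, not followed by more name characters.
_PATTERN = re.compile(r"(?<![\w./-])/(tests|solution)(?![\w-])")


def rewrite_legacy_mounts(script: str) -> str:
    """Rewrite absolute legacy mount paths to their native names."""
    return _PATTERN.sub(lambda m: "/" + _LEGACY_MOUNTS[m.group(1)], script)
```

For example, `rewrite_legacy_mounts('[ -d /tests ] && bash /tests/test.sh')` yields `'[ -d /verifier ] && bash /verifier/test.sh'`, while a nested path such as `/app/tests/x` is left alone.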
- drop adapt_clean.py (duplicate of adapt.py; its /tests->/verifier fix is
upstreamed into migrate_task_to_task_md) and run_matrix.sh (env-specific
one-off runner) — README/VALIDATION updated to the canonical bench CLI flow
- drop tools/push-task-standard-handoff.sh (scratch handoff plumbing with
hardcoded personal paths) and docs/reports/v05-skillsbench-192-validation.md
(separate 0.5.2 workstream, unreferenced)
- revert experiments/skillsbench-fill/{publish.py,run_cell.sh} to main: the
branch carried unrelated drive-by regressions (dropped TOKEN_COUNTER_KEYS
scrubber whitelist, comment churn)
main content kept (TOKEN_COUNTER_KEYS whitelist etc.); only decorative banner comments removed to satisfy test_comment_style.
Status update (2026-06-12): this PR will not merge — it stays open as the evidence anchor until its successors land, then closes. The task.md core here was re-engineered on the v0.6 line (
Close this once #929 and the leaderboard archive PR land.
Evidence-archive PR is up: https://huggingface.co/datasets/benchflow/skillsbench-leaderboard/discussions/13 (261 winning trials, 4.0GB, 3 conditions × 87 tasks). With benchflow#714 and benchflow-ai/skillsbench#929 open, all extraction from this RFC is complete. Close this once #929 and the HF PR land.
fix(task): port guard fixes from #651 — schema_version major gate, judge empty-inputs, network-mode contradiction
…ate, judge empty-inputs, network contradiction
Closing as superseded by #760. The clean v0.6.2 implementation is now the review and merge target, while this RFC branch remains historical context only.
Superseded by #760.
The original RFC branch is intentionally not the merge target: it is conflict-heavy and mixed with historical evidence work. The clean v0.6.2 implementation now lives in #760 and should be reviewed together with the SkillsBench native task.md cutover.