fix(task): port guard fixes from #651 (schema_version major gate, judge empty-inputs, network-mode contradiction) #714
Merged
This was referenced Jun 13, 2026
Ports four validation guards from PR #651 (main) onto release/v0.6.0. A code audit of 0.6.0-rc.6 on 2026-06-12 verified all four guards are absent on this branch; each was a real hole, not a refactor artifact.

What each guard closes

1. `schema_version` major-version gate (`src/benchflow/task/config.py`)

v0.6 `TaskConfig.schema_version` accepted any string (`"banana"`, `"99.0"`) and carried it through silently. Ported `validate_schema_version` + `_SUPPORTED_SCHEMA_MAJORS = frozenset({1})`: non-numeric versions and majors outside `{1}` are now a hard `ValueError`. Minor versions stay permissive; the `"1.3"` default and existing `"1.0"`/`"1.x"` tasks are unaffected.

2. Agent-judge empty-inputs runtime backstop (`src/benchflow/task/verifier_core.py`)

`_collect_agent_judge_inputs` had no emptiness check, unlike its `_collect_ors_episode_inputs` sibling. This is the silent-zero-evidence judge hole: an `agent-judge` strategy whose `inputs` resolved empty would invoke the judge with no evidence at all and still produce a graded reward, a score backed by nothing and indistinguishable from a real one. Now raises `AgentJudgeInputError` ("agent-judge strategy ... must declare inputs") before any judge call.

3. `verifier_document` empty-list fix (`src/benchflow/task/verifier_document.py`)

The parse-time checks for `agent-judge` and `ors-episode` `inputs` used `not isinstance(inputs, list) or not all(...)`, but `all([]) is True`, so an empty list passed a check whose error message says "must be a non-empty list of strings". Added `or not inputs` to both branches, making the document-level validation match its own contract (and backstopping guard 2 at parse time).

4. Network-mode contradiction hard-error (`src/benchflow/task/config.py`)

`SandboxConfig` silently coerced an explicitly declared `network_mode` (e.g. `allowlist` + `allowed_hosts`) down to `no-network` whenever the deprecated `allow_internet = false` was also present: a silent, surprising network downgrade. Ported the `model_fields_set`-based reconciliation: an explicit contradiction is now a hard `ValueError` directing authors to drop the deprecated flag, while the legacy path (`allow_internet = false` with no explicit `network_mode`) still coerces to `no-network` as before. Reconciliation also re-runs `_validate_network_policy_fields` at the end so it can never emit a self-contradictory config.

Source

Reference implementation: PR #651 (`src/benchflow/task/config.py`, `src/benchflow/task/verifier.py`, `src/benchflow/task/verifier_document.py`), adapted to the v0.6 module layout (`verifier_core` behind the `benchflow.task.verifier` façade; `AgentJudgeInputError` from `verifier_errors`).

Tests

- `tests/test_v06_guard_ports.py` (15 tests) mirroring the #651 tests (`test_network_policy_reconcile.py`, `test_agent_judge_inputs.py`): contradiction raises, legacy back-compat coercion preserved, post-reconcile invariant, schema major gate (reject `99.0`/`banana`, accept `1.0`/default), document-level empty-`inputs` rejection for both strategy types, and the async runtime backstop on `_collect_agent_judge_inputs`.
- `test_task_config.py`, `test_verifier_document.py`, `test_internet_policy.py`, plus the verifier strategy/output/judge suites (126 passed).
- `pytest tests/ --ignore=tests/integration`: 3834 passed, 49 skipped, 0 failures; no existing fixture trips any guard, so no compatibility accommodations were needed.
- `ruff format` / `ruff check` clean on all changed files.

🤖 Generated with Claude Code
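
For reviewers who want the shape of guard 1 without opening the diff, here is a minimal standalone sketch of a major-version gate. The names `validate_schema_version` and `_SUPPORTED_SCHEMA_MAJORS` come from the description above; the parsing rule (everything before the first `.` is the major) and the error messages are assumptions, not the ported code.

```python
# Illustrative sketch only; the real validator lives in src/benchflow/task/config.py.
_SUPPORTED_SCHEMA_MAJORS = frozenset({1})


def validate_schema_version(value: str) -> str:
    """Reject non-numeric or unsupported majors; leave the minor part alone."""
    # Assumption: the major is whatever precedes the first dot.
    major_text = value.partition(".")[0]
    if not (major_text.isascii() and major_text.isdigit()):
        raise ValueError(f"schema_version {value!r} has a non-numeric major version")
    if int(major_text) not in _SUPPORTED_SCHEMA_MAJORS:
        raise ValueError(
            f"schema_version {value!r} has unsupported major {int(major_text)}; "
            f"supported majors: {sorted(_SUPPORTED_SCHEMA_MAJORS)}"
        )
    # Minor versions stay permissive, so "1.0", "1.3" and "1.x" all pass through.
    return value
```

Under these assumptions `validate_schema_version("1.3")` returns the string unchanged, while `"banana"` and `"99.0"` both raise `ValueError`.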
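
Guards 2 and 3 both come down to the fact that `all()` over an empty iterable is `True`. The sketch below shows the parse-time check before and after adding the emptiness test, plus a runtime backstop in the spirit of `_collect_agent_judge_inputs`. The element predicate (`isinstance(item, str)`), the helper names, and the local `AgentJudgeInputError` class are stand-ins chosen for illustration; the description elides the real predicate and imports the error from `verifier_errors`.

```python
import asyncio


class AgentJudgeInputError(Exception):
    """Local stand-in for the error the PR imports from verifier_errors."""


def inputs_valid_before(inputs: object) -> bool:
    # Old shape of the check: an empty list slips through because all([]) is True.
    return isinstance(inputs, list) and all(isinstance(i, str) for i in inputs)


def inputs_valid_after(inputs: object) -> bool:
    # Fixed shape: additionally require the list to be non-empty.
    return (
        isinstance(inputs, list)
        and bool(inputs)
        and all(isinstance(i, str) for i in inputs)
    )


async def collect_agent_judge_inputs(strategy_name: str, inputs: list) -> list:
    """Hypothetical runtime backstop: refuse to call the judge with no evidence."""
    if not inputs:
        raise AgentJudgeInputError(
            f"agent-judge strategy {strategy_name!r} must declare inputs"
        )
    return list(inputs)
```

Keeping both layers means a document that bypasses parsing (for example, inputs that only resolve to empty at run time) still cannot reach the judge.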
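
The reconciliation in guard 4 can be approximated without pydantic by treating the keys present in the raw mapping as the equivalent of `model_fields_set`. This is a rough sketch under several assumptions: the function name `reconcile_network` is hypothetical, an explicit `network_mode = "no-network"` alongside `allow_internet = false` is treated as agreement instead of a contradiction, and the final invariant (hosts only make sense with `allowlist`) is a guess at what `_validate_network_policy_fields` enforces.

```python
from typing import Any


def reconcile_network(raw: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical reconciliation of network_mode with deprecated allow_internet."""
    explicitly_set = set(raw)  # plays the role of pydantic's model_fields_set
    mode = raw.get("network_mode")
    hosts = list(raw.get("allowed_hosts", []))

    if raw.get("allow_internet") is False:
        if "network_mode" in explicitly_set and mode != "no-network":
            # Explicit contradiction: fail loudly instead of downgrading silently.
            raise ValueError(
                f"network_mode={mode!r} contradicts allow_internet=false; "
                "drop the deprecated allow_internet flag"
            )
        # Legacy path: no explicit mode, so coerce as before.
        mode = "no-network"

    # Assumed invariant, standing in for _validate_network_policy_fields:
    # the result must never pair allowed_hosts with a non-allowlist mode.
    if hosts and mode != "allowlist":
        raise ValueError("allowed_hosts requires network_mode='allowlist'")

    return {"network_mode": mode, "allowed_hosts": hosts}
```

Running the invariant last, after any coercion, is what prevents the function from returning a combination that the field-level validation would have rejected on its own.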