fix: validate skills eval model count before running (#550) #650

Open
raywcm wants to merge 2 commits into benchflow-ai:main from raywcm:fix/skills-eval-model-cardinality

@raywcm raywcm commented Jun 9, 2026

Summary

Closes #550.

benchflow skills eval currently validates the number of --model values inside SkillEvaluator.run(). When the user passes two models for one agent, the check fires only after the CLI has printed the eval header, and the resulting ValueError surfaces as a Rich traceback.

This PR moves that cardinality check to the Typer command boundary so the CLI fails before starting the eval and prints a concise validation message instead.
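The check described above can be sketched as a small pure function. This is a minimal illustration, not the PR's actual code: the name model_count_error and its signature are hypothetical, and only the message wording is taken from the "After the fix" output in this description.

```python
from typing import Optional, Sequence


def model_count_error(models: Sequence[str], agents: Sequence[str]) -> Optional[str]:
    """Return a validation message if the --model count is invalid, else None.

    Hypothetical helper. Accepted forms follow the PR description:
    no --model, one --model broadcast to all agents, or one --model per --agent.
    """
    if len(models) in (0, 1, len(agents)):
        return None
    return (
        "--model may be provided once for all agents or once per --agent; "
        f"got {len(models)} models for {len(agents)} agents"
    )


# Hypothetical use at the command boundary, before any header is printed.
error = model_count_error(["model-a", "model-b"], ["gemini"])
if error is not None:
    print(error)  # the real command would then exit with code 1
```

Keeping the check free of side effects means the command can call it before constructing the evaluator, so a failure prints nothing but the message.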

Changes

  • Validate --model count before constructing and running the skill evaluator.
  • Keep the existing accepted forms:
    • no --model
    • one --model broadcast to all agents
    • one model per --agent
  • Add a CLI regression test that asserts:
    • exit code is 1
    • the friendly validation message is shown
    • no Skill eval: header is printed
    • no traceback is printed
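The four assertions in the regression test can be sketched against a stand-in command. Everything here is an assumption for illustration: skills_eval_stub is a hypothetical argparse-based substitute so the example runs without Typer installed, writing the message to stderr is a guess, and the actual test in tests/test_skill_eval_cli.py presumably exercises the real Typer command instead.

```python
import argparse
import io


def skills_eval_stub(argv, stdout, stderr) -> int:
    """Hypothetical stand-in for the real command; returns the exit code."""
    parser = argparse.ArgumentParser(prog="skills-eval-stub")
    parser.add_argument("--agent", action="append", default=[])
    parser.add_argument("--model", action="append", default=[])
    args = parser.parse_args(argv)

    # The cardinality check runs first, so nothing else is printed on failure.
    if len(args.model) not in (0, 1, len(args.agent)):
        print(
            "--model may be provided once for all agents or once per --agent; "
            f"got {len(args.model)} models for {len(args.agent)} agents",
            file=stderr,
        )
        return 1

    # The header is printed only after validation succeeds.
    print("Skill eval: example-skill", file=stdout)
    return 0


def test_rejects_model_count_mismatch():
    out, err = io.StringIO(), io.StringIO()
    code = skills_eval_stub(
        ["--agent", "gemini", "--model", "model-a", "--model", "model-b"], out, err
    )
    combined = out.getvalue() + err.getvalue()
    assert code == 1                            # exit code is 1
    assert "got 2 models for 1 agents" in combined  # friendly message shown
    assert "Skill eval:" not in combined        # no header printed
    assert "Traceback" not in combined          # no traceback printed


test_rejects_model_count_mismatch()
```

Asserting on the absence of the header is what pins the ordering: the test fails if validation ever moves back behind the point where the eval starts printing.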

Testing

Reproduced the current failure before the fix:

Skill eval: citation-management (1 cases)
  Agents: gemini
  Environment: docker
  Baseline skipped (--no-baseline)
...
ValueError: models length (2) must match agents (1) or be 1

After the fix:

--model may be provided once for all agents or once per --agent; got 2 models for 1 agents

Ran:

python -m pytest -q tests/test_skill_eval_cli.py tests/test_agent_cli.py
python -m compileall -q src/benchflow/cli/main.py tests/test_skill_eval_cli.py

Result:

3 passed in 2.07s


Development

Successfully merging this pull request may close these issues.

skills eval model/agent length mismatch raises traceback instead of CLI validation
