fix: validate skills eval model count before running (#550) by raywcm · Pull Request #650 · benchflow-ai/benchflow

raywcm · 2026-06-09T08:52:20Z

Summary

Closes #550.

benchflow skills eval currently validates the number of --model values inside SkillEvaluator.run(). When the user passes two models for one agent, the check fires only after the CLI has printed the eval header, and the resulting ValueError surfaces as a Rich traceback.

This PR moves that cardinality check to the Typer command boundary so the CLI fails before starting the eval and prints a concise validation message instead.
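A minimal sketch of the cardinality rule, assuming a hypothetical helper named check_model_count. The actual code in src/benchflow/cli/main.py is not shown in this description and may differ. The helper returns a message instead of raising, so a Typer command could print it and exit with code 1 before any eval output is produced. The message text follows the "After the fix" output quoted in the Testing section.

```python
from typing import Optional, Sequence


def check_model_count(agents: Sequence[str], models: Sequence[str]) -> Optional[str]:
    """Return a validation message, or None when the --model count is acceptable.

    Hypothetical helper for illustration; not the PR's actual code.
    """
    # Accepted forms: no --model, one --model for all agents, or one per --agent.
    if len(models) in (0, 1, len(agents)):
        return None
    return (
        "--model may be provided once for all agents or once per --agent; "
        f"got {len(models)} models for {len(agents)} agents"
    )


if __name__ == "__main__":
    # Two models for one agent is the case from issue #550.
    print(check_model_count(["gemini"], ["model-a", "model-b"]))
```

Returning the message keeps the rule testable on its own; the command decides how to print it and which exit code to use.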

Changes

- Validate --model count before constructing and running the skill evaluator.
- Keep the existing accepted forms:
  - no --model
  - one --model broadcast to all agents
  - one model per --agent
- Add a CLI regression test that asserts:
  - exit code is 1
  - the friendly validation message is shown
  - no Skill eval: header is printed
  - no traceback is printed
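The four assertions listed above can be sketched as follows. This is a stdlib-only stand-in, not the test in tests/test_skill_eval_cli.py: it uses argparse in place of the real Typer command so that it runs without benchflow or Typer installed. The names skills_eval and run, and the placeholder header text, are assumptions made for illustration.

```python
import argparse
import contextlib
import io


def skills_eval(argv):
    """Stand-in command: validate the --model count first, print the header second."""
    parser = argparse.ArgumentParser(prog="skills-eval-sketch")
    parser.add_argument("--agent", action="append", default=[])
    parser.add_argument("--model", action="append", default=[])
    args = parser.parse_args(argv)

    # Reject anything other than zero models, one model, or one model per agent.
    if len(args.model) not in (0, 1, len(args.agent)):
        print(
            "--model may be provided once for all agents or once per --agent; "
            f"got {len(args.model)} models for {len(args.agent)} agents"
        )
        return 1

    print("Skill eval: example-skill (1 cases)")  # placeholder header
    return 0


def run(argv):
    """Capture stdout and the exit code, roughly as a CLI test runner would."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        code = skills_eval(argv)
    return code, buf.getvalue()


def test_rejects_extra_models():
    code, output = run(["--agent", "gemini", "--model", "m1", "--model", "m2"])
    assert code == 1
    assert "got 2 models for 1 agents" in output
    assert "Skill eval:" not in output
    assert "Traceback" not in output
```

Asserting on the absence of the header is what pins down the ordering: the test fails if validation is ever moved back behind the point where the eval starts printing.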

Testing

Reproduced the current failure before the fix:

Skill eval: citation-management (1 cases)
  Agents: gemini
  Environment: docker
  Baseline skipped (--no-baseline)
...
ValueError: models length (2) must match agents (1) or be 1

After the fix:

--model may be provided once for all agents or once per --agent; got 2 models for 1 agents

Ran:

python -m pytest -q tests/test_skill_eval_cli.py tests/test_agent_cli.py
python -m compileall -q src/benchflow/cli/main.py tests/test_skill_eval_cli.py

Result:

3 passed in 2.07s

raywcm added 2 commits June 9, 2026 08:52

fix: validate skills eval model cardinality early

ece3ae9

fix: validate skills eval model cardinality early

86ecb8e

raywcm wants to merge 2 commits into benchflow-ai:main from raywcm:fix/skills-eval-model-cardinality
