feat(cli): incremental eval runs — resume, append, and aggregate (#1110)
* feat(cli): incremental eval runs — resume, append, and aggregate
Add three related capabilities for incremental eval runs:
1. `agentv eval aggregate <runDir>` subcommand
- Reads index.jsonl, deduplicates by (test_id, target) keeping last entry
- Recomputes benchmark.json and timing.json
- Prints summary to stdout
2. `--resume` flag on `agentv eval run`
- Skips already-completed (non-error) tests
- Appends new results to existing index.jsonl
- Aggregates with deduplication at the end
3. `--rerun-failed` flag on `agentv eval run`
- Like --resume but only skips tests with execution_status "ok"
- Reruns execution_error and quality_failure tests
- New results replace old ones via last-entry-wins deduplication
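The skip and deduplication rules above can be sketched as follows. This is an illustrative reading of the commit message, not the code in artifact-writer.ts or run-eval.ts: the `IndexEntry` shape and the names `resumeKey`, `dedupeLastWins`, and `keysToSkip` are hypothetical, and treating `quality_failure` as "completed" under `--resume` is an assumption drawn from the "non-error" wording.

```typescript
// Hypothetical record shape; real index.jsonl entries likely carry more fields.
type ExecutionStatus = "ok" | "quality_failure" | "execution_error";

interface IndexEntry {
  test_id: string;
  target: string;
  execution_status: ExecutionStatus;
}

type Mode = "resume" | "rerun-failed";

// Composite key for (test_id, target). JSON.stringify avoids collisions that a
// plain separator could cause, e.g. ("a|b", "c") versus ("a", "b|c").
const resumeKey = (e: Pick<IndexEntry, "test_id" | "target">): string =>
  JSON.stringify([e.test_id, e.target]);

// Last-entry-wins: a later line replaces an earlier one with the same key.
// Map keeps the position of the first occurrence and the value of the last.
function dedupeLastWins(entries: IndexEntry[]): IndexEntry[] {
  const byKey = new Map<string, IndexEntry>();
  for (const entry of entries) {
    byKey.set(resumeKey(entry), entry);
  }
  return [...byKey.values()];
}

// Keys of tests that should NOT be executed again in the given mode.
function keysToSkip(entries: IndexEntry[], mode: Mode): Set<string> {
  const skip = new Set<string>();
  for (const entry of dedupeLastWins(entries)) {
    const completed =
      mode === "resume"
        ? entry.execution_status !== "execution_error" // assumed meaning of "non-error"
        : entry.execution_status === "ok";
    if (completed) skip.add(resumeKey(entry));
  }
  return skip;
}

// Example index: t1 failed to execute, then succeeded on a later run.
const index: IndexEntry[] = [
  { test_id: "t1", target: "default", execution_status: "execution_error" },
  { test_id: "t2", target: "default", execution_status: "quality_failure" },
  { test_id: "t3", target: "default", execution_status: "execution_error" },
  { test_id: "t1", target: "default", execution_status: "ok" },
];

const deduped = dedupeLastWins(index);
const skipOnResume = keysToSkip(index, "resume");
const skipOnRerunFailed = keysToSkip(index, "rerun-failed");
console.log(deduped.length, skipOnResume.size, skipOnRerunFailed.size);
```

In this sketch `--resume` would skip t1 and t2 and rerun only t3, while `--rerun-failed` would skip only t1 and rerun both t2 and t3.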
Key changes:
- artifact-writer.ts: Add deduplicateByTestIdTarget(), aggregateRunDir(),
writePerTestArtifacts()
- jsonl-writer.ts: Support append mode (flags: "a")
- output-writer.ts: Pass append option through
- commands/aggregate.ts: New subcommand
- commands/run.ts: Add --resume and --rerun-failed flags
- run-eval.ts: Resume/rerun skip logic, append writer, aggregate after run
Closes #1071
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* fix(cli): register aggregate in EVAL_SUBCOMMANDS for argv preprocessing
Without this, `agentv eval aggregate <dir>` was rewritten to
`agentv eval run aggregate <dir>` by preprocessArgv(), causing
aggregate to be treated as an eval file path.
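A minimal sketch of the rewrite behaviour described above, assuming preprocessArgv() inserts the implicit `run` subcommand whenever the token after `eval` is not a known subcommand. The body below is a guess at the shape, not the real implementation; only `run` and `aggregate` are named in this message, and `suite.yaml` is a made-up file name.

```typescript
// Hypothetical contents; the real EVAL_SUBCOMMANDS list may hold more entries.
const EVAL_SUBCOMMANDS = new Set<string>(["run", "aggregate"]);

// Takes argv without the binary name, e.g. ["eval", "aggregate", "./run-dir"].
function preprocessArgv(argv: string[]): string[] {
  if (argv[0] !== "eval") return argv;
  const next = argv[1];
  // A known subcommand (or nothing at all) is passed through unchanged.
  if (next === undefined || EVAL_SUBCOMMANDS.has(next)) return argv;
  // Anything else is assumed to be an eval file path or flag for `eval run`.
  return ["eval", "run", ...argv.slice(1)];
}

const withSubcommand = preprocessArgv(["eval", "aggregate", "./run-dir"]).join(" ");
const withFilePath = preprocessArgv(["eval", "suite.yaml"]).join(" ");
console.log(withSubcommand); // eval aggregate ./run-dir
console.log(withFilePath); // eval run suite.yaml
```

If `aggregate` were missing from the set, the first call would instead produce `eval run aggregate ./run-dir`, which is the bug this fix addresses.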
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* fix(cli): flush writer before summary & use full results for matrix display
- Close outputWriter before reading index.jsonl for summary computation
to avoid race condition with unflushed stream data
- Use summaryResults (all deduplicated) instead of allResults (new only)
for matrix summary in resume mode
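The ordering constraint in the first point can be illustrated with a plain Node.js write stream opened in append mode (flags: "a"). This is a sketch under assumptions: `closeStream`, `appendAndCount`, and `demo` are hypothetical helpers, and the real outputWriter may wrap its stream differently.

```typescript
import { createWriteStream } from "node:fs";
import type { WriteStream } from "node:fs";
import { mkdtemp, readFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Resolves once all buffered data has been handed to the file system.
function closeStream(stream: WriteStream): Promise<void> {
  return new Promise((resolve, reject) => {
    stream.once("error", reject);
    stream.end(() => resolve());
  });
}

// Appends JSONL lines, waits for the flush, then counts the lines on disk.
async function appendAndCount(path: string, lines: string[]): Promise<number> {
  const writer = createWriteStream(path, { flags: "a" });
  for (const line of lines) writer.write(`${line}\n`);
  // Reading before this await could miss lines still buffered in the stream.
  await closeStream(writer);
  const text = await readFile(path, "utf8");
  return text.split("\n").filter((l) => l.trim() !== "").length;
}

// Simulates an initial run followed by a resumed run against the same file.
async function demo(): Promise<[number, number]> {
  const dir = await mkdtemp(join(tmpdir(), "agentv-sketch-"));
  const file = join(dir, "index.jsonl");
  const afterFirstRun = await appendAndCount(file, ['{"test_id":"t1"}']);
  const afterResume = await appendAndCount(file, ['{"test_id":"t2"}', '{"test_id":"t3"}']);
  return [afterFirstRun, afterResume];
}
```

Because the second call appends rather than truncates, the count after the resumed run covers lines from both runs, which is why the summary has to be computed from the full file rather than from the new results alone.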
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* refactor(cli): extract eval resume key helpers
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>