
Commit 87d62a4

christso, Copilot, and claude authored
feat(bench): autoresearch optimization loop (#958, #746, #748) (#1112)
* feat(agentv-bench): add automated keep/discard logic to Step 5

  Add a new 'Automated keep/discard' subsection to the iteration loop in Step 5 (Improve). After each iteration, the agent can now automatically decide whether to keep or discard a change by running:

      agentv compare baseline.jsonl candidate.jsonl --json

  Decision rules:
  - wins > losses → keep, promote to new baseline
  - wins <= losses → discard, revert, try different mutation
  - meanDelta == 0 but simpler prompt → keep (simplicity criterion)

  Each decision is logged with rationale. Human checkpoints at iterations 3, 6, 9 still fire. Both manual and automated modes coexist.

  Closes #958

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* feat(bench): autoresearch optimization loop (#958, #746, #748)

  Add unattended eval-improve loop to agentv-bench:
  - #958: Automated keep/discard decision in Step 5 using agentv compare --json output (wins/losses/ties/meanDelta rules), preserving human checkpoints at 3/6/9
  - #746: Mutator subagent (agents/mutator.md) that rewrites artifacts from failure analysis with hill-climbing ratchet, evidence-driven mutations, and simplicity criterion
  - #748 Phase 1: Autoresearch mode wired into SKILL.md with full procedure: eval → analyze → decide → mutate → repeat. Includes _autoresearch/ output folder (original.md, best.md, iterations.jsonl, trajectory.html), convergence detection (3 consecutive no-improvement cycles), and standalone Chart.js trajectory visualization

  Skill-only change — no CLI, schema, or core code modifications.

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* docs(autoresearch): add guide, example, and fix trajectory bugs

  - Add autoresearch-automation.mdx guide with trajectory screenshot, ASCII output table, and incident classifier walkthrough
  - Add examples/features/autoresearch/ with working eval and prompt
  - Fix trajectory.html: add actual auto-refresh meta tag (was a non-functional comment), standardize badge text to "drop" (matching SKILL.md data format), guard cumulative cost against non-numeric values
  - Update SKILL.md completion step to match actual template markup
  - Update skill-improvement-workflow.mdx cross-reference to new guide

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(bench): align trajectory.html with Studio DESIGN.md and add data validation

  Rewrites the autoresearch trajectory visualization to match the Studio design system (bg-gray-950 canvas, cyan accent, emerald/red status, system sans-serif, font-medium max). Adds defensive validation: try/catch for unresolved placeholder, Array.isArray guard, per-iteration field checks with graceful error messages.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(docs): retake trajectory screenshot at 960px viewport

  Previous screenshot had excessive empty margins from a 1920px viewport. Resized to 960px so content fills the frame.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(bench): git-based versioning and multi-file artifact support for autoresearch

  Replace file-copy backup/restore (original.md, best.md) with git commit/revert. HEAD always contains the best-known version: KEEP commits, DROP reverts working tree. The mutator can now optimize directories (multi-file artifacts like skills with references/) in addition to single files.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(docs): rename autoresearch slug, simplify title, retake full-page screenshot

  - Rename autoresearch-automation.mdx → autoresearch.mdx (slug: /docs/guides/autoresearch/)
  - Title: "Autoresearch" (was "Autoresearch — Automated Optimization")
  - Retake screenshot at 1280px full-page to show all content including iteration log
  - Fix stale best.md reference in example section

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor(bench): context-safe orchestrator loop for unbounded autoresearch

  The orchestrator never reads eval results, artifacts, or transcripts into its own context — it can run indefinitely without exhausting the context window.

  - Extract scores via bash/jq (small structured outputs only)
  - trajectory.html now fetches iterations.jsonl via HTTP instead of requiring inline data injection — no file manipulation after setup
  - Mutator self-serves failure evidence from disk (grading.json, transcripts, iterations.jsonl) — orchestrator passes paths not content
  - iterations.jsonl appended via bash echo, not read-modify-write
  - Retake screenshot without bottom whitespace

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(docs): crop blank space from trajectory screenshot

  Trim 1280x1513 → 1280x1044 by removing empty background below the iteration log table.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(bench): display scores as percentages in trajectory chart

  Format all scores as percentages (48%, 90%, +42%) instead of decimals (0.48, 0.90, +0.42) in summary cards, chart axes, tooltips, and iteration log table. Underlying data stays 0–1 in iterations.jsonl.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: reorder guides sidebar from basic to advanced

  Logical progression: conceptual foundations → eval authoring → improvement workflow (manual then automated) → advanced topics (external scorers, workspace infrastructure).

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: move Skill Evals to Integrations, fix guide sidebar order

  Move agent-skills-evals.mdx from guides/ to integrations/ since it documents an external format integration, not a guide workflow. Update internal links and re-number remaining guide sidebar orders to close the gap.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: move Autoevals Integration to Integrations section

  Autoevals documents an external library (Braintrust scorers), fitting the Integrations section alongside Langfuse and Skill Evals. Re-number remaining guide sidebar orders to close the gap.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(docs): update broken autoresearch links to new slug

  The page was renamed from autoresearch-automation to autoresearch but two internal links in skill-improvement-workflow.mdx still pointed at the old slug.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(docs): rename git-cache-workspace to workspace-architecture, fix broken link

  Rename slug to match the page title "Workspace Architecture". Also fix the Workspace Pool link which was missing the /docs prefix.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 0ff9c32 commit 87d62a4

17 files changed

Lines changed: 1349 additions & 28 deletions

apps/web/src/assets/screenshots/autoresearch-trajectory.png

167 KB (binary image; no text diff shown)
apps/web/src/content/docs/docs/guides/agent-eval-layers.mdx

Lines changed: 2 additions & 0 deletions
@@ -130,6 +130,8 @@ A complete `EVAL.yaml` covering all four layers:
 
 ```yaml
 description: Four-layer agent evaluation starter
+sidebar:
+  order: 1
 
 execution:
   target: default
apps/web/src/content/docs/docs/guides/autoresearch.mdx

Lines changed: 207 additions & 0 deletions

@@ -0,0 +1,207 @@
---
title: Autoresearch
description: Run an unattended eval-improve loop that iteratively optimizes agent skills
sidebar:
  order: 5
---

import { Image } from 'astro:assets';
import trajectoryChart from '../../../../assets/screenshots/autoresearch-trajectory.png';
Autoresearch is an unattended optimization loop that **automatically improves your agent skills** through repeated eval cycles. It runs the same evaluate → analyze → improve loop described in the [Skill Improvement Workflow](/docs/guides/skill-improvement-workflow/), but does it hands-free — no human review between cycles.

<Image src={trajectoryChart} alt="Autoresearch trajectory chart showing score improvement from 0.48 to 0.90 over 9 cycles" />

The chart above shows a real optimization run: an incident severity classifier starts at 48% accuracy and reaches 90% after 9 automated cycles — each cycle taking seconds and costing fractions of a cent.

## How It Works

```
┌───────────┐
│  1. EVAL  │ ◄───────────────────────────────┐
└─────┬─────┘                                 │
      ▼                                       │
┌───────────┐                                 │
│ 2. ANALYZE│  dispatcher → analyzer subagent │
└─────┬─────┘                                 │
      ▼                                       │
┌───────────┐  wins > losses → KEEP           │
│ 3. DECIDE │  else → DROP                    │
└─────┬─────┘                                 │
      ▼                                       │
┌───────────┐                                 │
│ 4. MUTATE │  dispatcher → mutator subagent ─┘
└───────────┘

Stops after 3 consecutive no-improvement cycles
or 10 total cycles (configurable).
```

Each cycle:

1. **Runs `agentv eval`** against the current version of the artifact
2. **Analyzes** failures via the analyzer subagent
3. **Decides** keep or discard using `agentv compare --json` (automated — no human needed)
4. **Mutates** the artifact to address failing assertions, then loops back

The system uses a **hill-climbing ratchet**: each mutation builds on the best-scoring version, never a failed candidate. Improvements compound; regressions get discarded.
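
The ratchet reduces to a simple control flow. A minimal shell sketch, assuming a single scalar score (illustrative only: the real loop is orchestrated by the `agentv-bench` skill, and `run_eval_and_score` / `mutate_artifact` are hypothetical placeholders):

```bash
# Minimal sketch of the hill-climbing ratchet. Illustrative only:
# run_eval_and_score and mutate_artifact are hypothetical placeholders.
best=0
for cycle in $(seq 1 10); do                  # hard cycle limit
  score=$(run_eval_and_score)                 # e.g. `agentv eval` + score extraction
  if (( $(echo "$score > $best" | bc -l) )); then
    git add -A && git commit -m "keep cycle $cycle"   # KEEP: new best version
    best=$score
  else
    git checkout -- .                         # DROP: discard the failed candidate
  fi
  mutate_artifact                             # next mutation starts from the best version
done
```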

## What Gets Optimized

Any file or directory artifact: SKILL.md, prompt template, agent config, system prompt, or a directory of related files (e.g., a skill with `references/` and `agents/` subdirectories). The artifact mode is auto-detected — pass a file path for single-file optimization, or a directory path for multi-file optimization. The mutator rewrites artifacts in place while the eval stays fixed — same test cases, same assertions, different artifact versions.
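
For example, a directory artifact for a skill might look like this (a hypothetical layout; any file set inside a git repository works):

```
incident-classifier/
├── SKILL.md                  # main skill instructions
├── references/
│   └── severity-levels.md    # supporting reference material
└── agents/
    └── triage.md             # sub-agent definition
```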

## Prerequisites

- An eval file (EVAL.yaml or evals.json) that covers the behavior you care about
- The artifact must be a file or directory within a git repository (autoresearch uses git for versioning)
- Run at least one manual eval cycle first to validate your test cases

:::tip
Autoresearch is only as good as your eval. If your assertions don't catch the failures you care about, the optimizer won't fix them. Start with the [manual improvement loop](/docs/guides/skill-improvement-workflow/) to build confidence in your eval quality before going unattended.
:::

## Triggering Autoresearch

Autoresearch runs through the `agentv-bench` Claude Code skill. Trigger it with natural language:

```
"Run autoresearch on my classifier prompt"
"Optimize this skill unattended for 5 cycles"
"Run autoresearch on examples/features/autoresearch/EVAL.yaml"
```

No CLI flags or YAML schema changes needed — the skill handles everything.

## Output Structure

Each autoresearch session creates a self-contained experiment directory:

```
.agentv/results/runs/autoresearch-<name>/
├── _autoresearch/
│   ├── iterations.jsonl        # Per-cycle data (score, decision, mutation)
│   └── trajectory.html         # Live-updating Chart.js visualization
├── 2026-04-15T10-30-00/        # Cycle 1 run artifacts
│   ├── index.jsonl
│   ├── grading.json
│   └── timing.json
├── 2026-04-15T10-35-00/        # Cycle 2 run artifacts
│   └── ...
└── ...
```

Autoresearch uses **git-based versioning** instead of backup files. Each successful mutation is committed (`git add && git commit`), and failed mutations are reverted (`git checkout`). The optimized artifact lives in the working tree and the latest commit — no separate `best.md` to copy.

- **`_autoresearch/trajectory.html`** — Open in a browser to see the score trajectory, per-assertion breakdown, and cumulative cost. Auto-refreshes during the loop, becomes static on completion.
- **`_autoresearch/iterations.jsonl`** — Machine-readable log of every cycle for downstream analysis.
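
Each line of `iterations.jsonl` records one cycle, which makes ad-hoc analysis easy. For instance, printing the score trajectory with `jq` (the field names `.cycle`, `.score`, and `.decision` are assumptions here; check the actual file for the schema your run writes):

```bash
# Run from inside the experiment directory. Field names are assumed, not documented.
jq -r '[.cycle, .score, .decision] | @tsv' _autoresearch/iterations.jsonl
```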

Review the mutation history with `git log` after the run completes.
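
Because every kept mutation is a git commit, the standard commands apply, for example:

```bash
git log --oneline             # one commit per kept mutation
git diff HEAD~1               # what the last kept mutation changed
git diff <initial_sha> HEAD   # cumulative change since the starting version
```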

## The Keep/Drop Decision

After each eval cycle, autoresearch runs `agentv compare` between the current candidate and the best baseline:

```bash
agentv compare <baseline>/index.jsonl <candidate>/index.jsonl --json
```

The decision rule:

| Condition | Decision | Outcome |
|-----------|----------|---------|
| `wins > losses` | **KEEP** | Promote to new baseline, reset convergence counter |
| `wins <= losses` | **DROP** | Revert to best version, increment convergence counter |
| `mean_delta == 0`, simpler artifact | **KEEP** | Simpler is better at equal performance |

Three consecutive DROPs trigger convergence — the optimizer stops because it can't find improvements.
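
In shell terms, the rule amounts to something like this sketch (assuming the JSON report exposes `wins`, `losses`, and a mean-delta field; the exact schema is not documented here, and `$baseline`/`$candidate` are placeholder paths):

```bash
# Sketch of the keep/drop rule. Field names are assumptions, not a documented schema.
report=$(agentv compare "$baseline/index.jsonl" "$candidate/index.jsonl" --json)
wins=$(jq '.wins' <<<"$report")
losses=$(jq '.losses' <<<"$report")

if (( wins > losses )); then
  echo "KEEP"                  # promote candidate, reset convergence counter
  drops=0
else
  echo "DROP"                  # revert to best version via git checkout
  drops=$((drops + 1))
  (( drops >= 3 )) && echo "CONVERGED: stopping"
fi
```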

## Example: Incident Severity Classifier

Here's a real scenario showing autoresearch in action. We start with a minimal classifier prompt:

```markdown
# classifier-prompt.md (initial version)
Classify the incident into P0, P1, P2, or P3.
Give your answer as JSON with severity and reasoning fields.
```

And an eval with 7 test cases covering edge cases — payment failures, SSL cert expiry, gradual memory leaks:

```yaml
# EVAL.yaml (stays fixed — only the prompt changes)
tests:
  - id: total-outage
    assertions:
      - type: contains
        value: '"P0"'
      - type: is-json
      - "Reasoning mentions complete service outage"
  - id: payment-failures
    assertions:
      - type: contains
        value: '"P1"'
      - type: is-json
      - "Reasoning weighs revenue impact despite intermittent nature"
  # ... 5 more test cases
```

Running autoresearch produces this trajectory:

```
Cycle  Score  Decision  Mutation
─────  ─────  ────────  ──────────────────────────────────────
  1    0.48   KEEP      initial baseline — no mutations applied
  2    0.62   KEEP      added explicit JSON format, defined P0-P3 levels
  3    0.52   DROP      added verbose rules — over-constrained reasoning
  4    0.71   KEEP      added revenue-impact heuristic for P1
  5    0.81   KEEP      enforced raw JSON output — removed code fences
  6    0.86   KEEP      added time-urgency rule for SSL/cert cases
  7    0.90   KEEP      improved reasoning template — cite impact metrics
  8    0.86   DROP      attempted decision tree merge — regressed
  9    0.90   DROP      minor wording cleanup — no meaningful change
                        ↳ 3 consecutive drops → CONVERGED
```

**Result:** 0.48 → 0.90 (+42 points) in 9 cycles, $0.03 total cost. The optimized prompt is in the working tree (and the latest git commit).

Key observations:

- **Cycle 3** shows a failed mutation (verbose rules hurt reasoning) — the ratchet discarded it and continued from the cycle 2 version
- **Cycles 8–9** show convergence — the optimizer couldn't improve further and stopped automatically
- **Per-assertion tracking** reveals which aspects improved: classification accuracy reached 100% by cycle 6, while JSON format compliance and reasoning quality improved more gradually

## Convergence

Autoresearch stops when either condition is met:

- **3 consecutive no-improvement cycles** (configurable) — the optimizer has converged
- **10 total cycles** (configurable) — hard limit to bound cost

You can override both limits when triggering autoresearch:

```
"Run autoresearch with max 20 cycles and convergence threshold of 5"
```

## Best Practices

**Start manual, then automate.** Run 2–3 manual eval cycles to validate your test cases catch real issues. Once you trust the eval, switch to autoresearch.

**Same-model pairings work best.** The meta-agent running autoresearch should match the model used by the task agent (e.g., Claude optimizing a Claude agent). Same-model pairings produce better mutations because the optimizer has implicit knowledge of how the target model interprets instructions.

**Watch the per-assertion chart.** If one assertion is stuck at 0% while others improve, the eval may be too strict or testing something the prompt can't control. Consider adjusting the assertion.

**Review the optimized artifact.** Autoresearch improves scores, but always review the changes (`git diff <initial_sha>`) before adopting them. The optimizer may have found a valid but unexpected approach.

**Keep artifact directories focused.** For directory mode, keep artifacts to 5–15 files. The mutator works best when it can reason about the full scope without reading dozens of files. Split large skill directories if needed.

## Relationship to Manual Workflow

| Aspect | Manual Loop | Autoresearch |
|--------|-------------|--------------|
| Human checkpoints | Every iteration | None (opted in to unattended) |
| Keep/discard | You decide | Automated via `agentv compare` |
| Mutation | You edit the skill | Mutator subagent rewrites |
| Max iterations | Unbounded | 10 cycles or convergence |
| Best for | Building eval intuition | Scaling optimization |
| Trajectory chart | Not included | Auto-generated with live refresh |

Start with the [manual loop](/docs/guides/skill-improvement-workflow/) to understand the workflow, then use autoresearch to scale it.

apps/web/src/content/docs/docs/guides/eval-authoring.mdx

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 title: Eval Authoring Guide
 description: Practical guidance for writing workspace-based evals that work reliably across providers.
 sidebar:
-  order: 7
+  order: 3
 ---
 
 ## Workspace Setup: Skill Discovery Paths

apps/web/src/content/docs/docs/guides/evaluation-types.mdx

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 title: Execution Quality vs Trigger Quality
 description: Two distinct evaluation concerns for AI agents and skills — what AgentV measures, and what belongs to skill-creator tooling.
 sidebar:
-  order: 6
+  order: 2
 ---
 
 Agent evaluation has two fundamentally different concerns: **execution quality** and **trigger quality**. They require different tooling, different methodologies, and different optimization surfaces. Conflating them leads to eval configs that are noisy, hard to maintain, and unreliable.

apps/web/src/content/docs/docs/guides/skill-improvement-workflow.mdx

Lines changed: 10 additions & 17 deletions
@@ -2,14 +2,14 @@
 title: Skill Improvement Workflow
 description: Iteratively evaluate and improve agent skills using AgentV
 sidebar:
-  order: 6
+  order: 4
 ---
 
 ## Introduction
 
 AgentV supports a full evaluation-driven improvement loop for skills and agents. Instead of guessing whether a change makes things better, you run structured evaluations before and after, then compare.
 
-This guide teaches the **core manual loop**. For automated iteration that runs the full cycle hands-free, see [agentv-bench](#automated-iteration).
+This guide teaches the **core manual loop**. For automated iteration that runs the full cycle hands-free, see [Autoresearch](/docs/guides/autoresearch/).
 
 ## The Core Loop

@@ -244,7 +244,7 @@ After converting, you can:
 - Use `code-grader` for custom scoring logic
 - Define `tool-trajectory` assertions to check tool usage patterns
 
-See [Skill Evals (evals.json)](/docs/guides/agent-skills-evals/) for the full field mapping and side-by-side comparison.
+See [Skill Evals (evals.json)](/docs/integrations/agent-skills-evals/) for the full field mapping and side-by-side comparison.
 
 ## Migration from Skill-Creator

@@ -316,21 +316,14 @@ Start simple and add complexity only when the evaluation results demand it:
 
 ## Automated Iteration
 
-For users who want the full automated improvement cycle, the `agentv-bench` skill runs a 5-phase optimization loop:
+When you're confident in your eval quality, graduate to **autoresearch** — an unattended optimization loop that runs the full evaluate → analyze → improve cycle hands-free.
 
-1. **Analyze** — examines the current skill and evaluation results
-2. **Hypothesize** — generates improvement hypotheses from failure patterns
-3. **Implement** — applies targeted skill modifications
-4. **Evaluate** — re-runs the evaluation suite
-5. **Decide** — keeps improvements that help, reverts those that don't
+Autoresearch uses the same `agentv eval` and `agentv compare` primitives described above, but automates the human decision steps. A mutator subagent rewrites the artifact based on failure analysis, and an automated keep/discard rule promotes improvements and reverts regressions.
 
-The optimizer uses the same core loop described in this guide but automates the human steps. Start with the manual loop to build intuition, then graduate to the optimizer when you're comfortable with the evaluation workflow.
-
-Its bundled scripts map directly onto the workflow stages:
+```
+"Run autoresearch on my skill"
+```
 
-- `run-eval.ts` and `compare-runs.ts` run and compare evaluations while still delegating to `agentv`
-- `run-loop.ts` repeats the evaluation loop without moving grader logic into the script layer
-- `aggregate-benchmark.ts` and `generate-report.ts` summarize AgentV artifacts into review-friendly output
-- `improve-description.ts` proposes follow-up description experiments once execution quality is stable
+One command starts the loop. It runs until the optimizer converges (3 consecutive no-improvement cycles) or hits the cycle limit. Typical runs: 5–10 cycles, under $0.05 total cost.
 
-Code-grader execution, grading semantics, and artifact schemas still live in AgentV core. The scripts layer is orchestration glue over those existing primitives.
+See the full guide: [Autoresearch](/docs/guides/autoresearch/)

apps/web/src/content/docs/docs/guides/git-cache-workspace.mdx renamed to apps/web/src/content/docs/docs/guides/workspace-architecture.mdx

Lines changed: 2 additions & 2 deletions
@@ -2,7 +2,7 @@
 title: Workspace Architecture
 description: How AgentV clones and materializes working trees for eval runs, with performance guidance for large repos.
 sidebar:
-  order: 3
+  order: 7
 ---
 
 AgentV evaluations that use `workspace.repos` clone repositories directly from their source (git URL or local path) into a workspace directory. [Workspace pooling](/docs/guides/workspace-pool/) (enabled by default) eliminates repeated clone costs by reusing materialized workspaces across runs.

@@ -131,4 +131,4 @@ To disable pooling for a run:
 agentv eval evals/my-eval.yaml --no-pool
 ```
 
-See the [Workspace Pool](/guides/workspace-pool/) guide for details on pool configuration, clean modes, concurrency, and drift detection.
+See the [Workspace Pool](/docs/guides/workspace-pool/) guide for details on pool configuration, clean modes, concurrency, and drift detection.

apps/web/src/content/docs/docs/guides/workspace-pool.mdx

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 title: Workspace Pool
 description: Reuse materialized workspaces across eval runs with fingerprint-based pooling, eliminating repeated clone and checkout costs.
 sidebar:
-  order: 4
+  order: 8
 ---
 
 Workspace pooling keeps materialized workspaces on disk between eval runs. Instead of cloning repos and checking out files every time, pooled workspaces reset in-place — typically reducing setup from minutes to seconds for large repositories.

apps/web/src/content/docs/docs/guides/agent-skills-evals.mdx renamed to apps/web/src/content/docs/docs/integrations/agent-skills-evals.mdx

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 title: Skill Evals (evals.json)
 description: Run evals.json skill evaluations with AgentV, and graduate to EVAL.yaml when you need more power.
 sidebar:
-  order: 5
+  order: 2
 ---
 
 ## Overview

apps/web/src/content/docs/docs/guides/autoevals-integration.mdx renamed to apps/web/src/content/docs/docs/integrations/autoevals-integration.mdx

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
 title: Autoevals Integration
 description: Use Braintrust's open-source autoevals scorers (Factuality, Faithfulness, etc.) as code_grader graders in AgentV.
 sidebar:
-  order: 2
+  order: 3
 ---
 
 ## Overview
