Commit 9d016b6

christsoclaude and Claude Opus 4.6 authored

refactor(bench): extract autoresearch to reference file (#1124)

Move autoresearch mode and automated keep/discard content from SKILL.md to references/autoresearch.md, reducing SKILL.md from 715 to 418 lines. Also rename SUBAGENT_EVAL_MODE to AGENT_EVAL_MODE across skill files.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

1 parent 87d62a4 · commit 9d016b6

3 files changed: 316 additions & 304 deletions

File tree

plugins/agentv-dev/skills/agentv-bench/SKILL.md

Lines changed: 6 additions & 303 deletions
@@ -131,12 +131,12 @@ Each run produces a new `.agentv/results/runs/<timestamp>/` directory automatica
 
 If the user has not specified a mode, default to `subagent`.
 
-| `SUBAGENT_EVAL_MODE` | Mode | How |
+| `AGENT_EVAL_MODE` | Mode | How |
 |----------------------|------|-----|
 | `subagent` (default) | **Subagent mode** | Subagent-driven eval — parses eval.yaml, spawns executor + grader subagents. Zero CLI dependency. |
 | `cli` | **AgentV CLI** | `agentv eval <path>` — end-to-end, multi-provider |
 
-Set `SUBAGENT_EVAL_MODE` in `.env` at the project root as the default when no mode is specified. If absent, default to `subagent`. **User instruction always overrides this.**
+Set `AGENT_EVAL_MODE` in `.env` at the project root as the default when no mode is specified. If absent, default to `subagent`. **User instruction always overrides this.**
 
 **`subagent`** — Parses eval.yaml directly, spawns executor subagents to run each test case in the current workspace, then spawns grader subagents to evaluate all assertion types natively. No CLI or external API calls required. Read `references/subagent-pipeline.md` for the detailed procedure.
 
@@ -335,84 +335,7 @@ After improving:
 
 ### Automated keep/discard
 
-After each iteration, you can automatically decide whether to keep or discard the change using structured comparison output. This replaces manual judgment at steps 3–4 of the iteration loop above, except at human checkpoint iterations (3, 6, 9) where you must still present results to the user.
-
-#### 1. Run the comparison
-
-After re-running test cases, compare the new results against the previous iteration's baseline:
-
-```bash
-agentv compare <baseline>.jsonl <candidate>.jsonl --json
-```
-
-Where `<baseline>.jsonl` is the `index.jsonl` from the previous best iteration and `<candidate>.jsonl` is the `index.jsonl` from the run you just completed.
-
-#### 2. Parse the output
-
-The `--json` flag produces structured output:
-
-```json
-{
-  "summary": {
-    "wins": 3,
-    "losses": 1,
-    "ties": 6,
-    "mean_delta": 0.05
-  }
-}
-```
-
-- **wins**: number of test cases where the candidate scored higher than the baseline
-- **losses**: number of test cases where the candidate scored lower
-- **ties**: number of test cases with no score change
-- **mean_delta**: average score difference across all test cases (positive = candidate is better)
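As an editorial illustration of the removed text: the summary fields documented above can be pulled out of the `--json` output with jq. The `compare_json` sample here is hypothetical, not real `agentv` output.

```shell
# Hypothetical compare output matching the documented summary fields.
compare_json='{"summary":{"wins":3,"losses":1,"ties":6,"mean_delta":0.05}}'

WINS=$(echo "$compare_json" | jq -r '.summary.wins')
LOSSES=$(echo "$compare_json" | jq -r '.summary.losses')
DELTA=$(echo "$compare_json" | jq -r '.summary.mean_delta')

echo "wins=$WINS losses=$LOSSES meanDelta=$DELTA"   # wins=3 losses=1 meanDelta=0.05
```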
-
-#### 3. Apply decision rules
-
-Use these rules in order:
-
-| Condition | Decision | Action |
-|-----------|----------|--------|
-| `wins > losses` | **KEEP** | Promote the candidate to the new baseline. Copy or note its `index.jsonl` path as the baseline for the next iteration. |
-| `wins <= losses` | **DISCARD** | Revert the prompt/skill/config change. The previous baseline remains. Try a different mutation on the next iteration. |
-| `mean_delta == 0` AND candidate prompt is shorter (fewer lines) | **KEEP** | Simpler prompts are preferred when performance is equal. Promote the candidate as the new baseline. |
-
-When `mean_delta == 0` and the candidate prompt is *not* shorter, treat it as a **DISCARD** — there's no reason to keep a change that adds complexity without improving results.
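As a sketch, the decision rules above reduce to a small shell function. This is a hypothetical helper for illustration, not part of the commit.

```shell
# Hypothetical helper encoding the keep/discard rules, applied in order:
#   1) wins > losses                                  -> KEEP
#   2) mean_delta == 0 and candidate prompt shorter   -> KEEP
#   3) otherwise                                      -> DISCARD
decide() {
  wins=$1; losses=$2; mean_delta=$3; candidate_shorter=$4
  if [ "$wins" -gt "$losses" ]; then
    echo KEEP
  elif [ "$mean_delta" = "0" ] && [ "$candidate_shorter" = "yes" ]; then
    echo KEEP
  else
    echo DISCARD
  fi
}

decide 3 1 0.05 no    # KEEP (wins outweigh losses)
decide 1 2 -0.03 no   # DISCARD (losses outweigh wins)
decide 2 2 0 yes      # KEEP (equal performance, simpler prompt)
```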
-
-#### 4. Log the decision
-
-Before proceeding to the next iteration, log the decision and rationale so the user can review later:
-
-```
-Iteration 2: KEEP
-wins=3, losses=1, ties=6, meanDelta=+0.05
-Rationale: candidate wins outweigh losses (3 > 1)
-Baseline promoted: .agentv/results/runs/20250101-120000/index.jsonl
-```
-
-```
-Iteration 3: DISCARD
-wins=1, losses=2, ties=7, meanDelta=-0.03
-Rationale: candidate losses outweigh wins (2 > 1)
-Reverted to baseline: .agentv/results/runs/20250101-110000/index.jsonl
-Next: try a different mutation
-```
-
-Include this log in your progress summary. At human checkpoints (iterations 3, 6, 9), present the full log of automated decisions since the last checkpoint alongside the current results.
-
-#### 5. Integration with the iteration loop
-
-The automated keep/discard replaces the manual compare-and-present cycle (steps 3–4) during non-checkpoint iterations. The full flow becomes:
-
-1. Apply change to prompts/skills/config
-2. Re-run all test cases
-3. Run `agentv compare baseline.jsonl candidate.jsonl --json`
-4. Apply keep/discard rules → promote or revert
-5. Log the decision
-6. If this is iteration 3, 6, or 9 → present progress to the user (human checkpoint)
-7. Check stop conditions → continue or stop
-
-Both modes coexist: if the user is actively reviewing results, present to them as before. If the user has asked you to iterate autonomously, use automated keep/discard and only pause at human checkpoints.
+For autonomous iteration, use `agentv compare --json` to automatically decide whether to keep or discard each change based on wins/losses/ties. Read `references/autoresearch.md` for the full decision rules, logging format, and integration with the iteration loop.
 
 ---
 
@@ -448,230 +371,9 @@ After the agent is working well, offer to optimize the skill's `description` fie
 
 ## Autoresearch Mode
 
-Autoresearch is an unattended eval-improve loop that runs multiple optimize cycles without human intervention. The user triggers it with natural language (e.g., "run autoresearch on this skill", "optimize this skill unattended"). No YAML schema changes or CLI flags are needed.
-
-### Prerequisites
-
-- An eval file (`EVAL.yaml` or `evals.json`) must exist for the artifact being optimized.
-- The artifact must be a file or directory (SKILL.md, prompt template, agent config, or a directory of related files like a skill with references/).
-- The user should have run at least one interactive eval cycle to build confidence in eval quality before going unattended.
-
-### The loop
-
-```
-1. RUN EVAL — agentv eval with current artifact
-2. ANALYZE — dispatch analyzer subagent on results
-3. DECIDE — if score > best_score: KEEP, else DROP (automated keep/discard from Step 5)
-4. MUTATE — dispatch mutator subagent with failure analysis (agents/mutator.md)
-5. GOTO 1 — until convergence or max_cycles
-```
-
-### Experiment naming
-
-Derive the experiment name from the artifact: `autoresearch-<name>` (e.g., `autoresearch-pdf-skill`). The user can also provide a custom name.
-
-### Artifact mutation flow
-
-The mutator rewrites artifacts in the working tree in place. **Git is used for versioning** — HEAD always contains the best-known version:
-
-1. Record the starting commit SHA before the first cycle: `initial_sha=$(git rev-parse HEAD)`.
-2. On each **KEEP**: `git add <artifact-path> && git commit -m "autoresearch cycle N: <mutation summary>"`.
-3. On each **DROP**: `git checkout -- <artifact-path>` (restores working tree to HEAD, the last KEEP commit).
-4. The eval always runs against the real file path — no temp files or indirection.
-5. The mutator can reference the original via `git show <initial_sha>:<path>`.
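The git-based KEEP/DROP flow described in the removed section can be simulated in a throwaway repository. Everything here (file names, commit messages, the identity config) is illustrative.

```shell
# Illustrative only: simulate the git-based KEEP/DROP versioning flow.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "autoresearch@example.com"   # hypothetical identity
git config user.name "autoresearch"

echo "original prompt" > SKILL.md
git add SKILL.md && git commit -qm "baseline"
initial_sha=$(git rev-parse HEAD)            # step 1: record the starting SHA

echo "mutated prompt v1" > SKILL.md          # cycle 1 mutation
git add SKILL.md && git commit -qm "autoresearch cycle 1: tighten instructions"  # KEEP

echo "bad mutation" > SKILL.md               # cycle 2 mutation
git checkout -- SKILL.md                     # DROP: restore to HEAD (the last KEEP)

cat SKILL.md                                 # mutated prompt v1
git show "$initial_sha:SKILL.md"             # original prompt
```

HEAD always holds the best-known version, and the original stays reachable through `initial_sha`, matching steps 1–5 above.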
-
-### How the skill invokes eval
-
-Shell out to `agentv eval <eval-path> --experiment autoresearch-<name>` via the Bash tool, same as the existing interactive bench workflow.
-
-### Artifact layout
-
-Each cycle is a standard eval run. Autoresearch session metadata lives in `_autoresearch/` within the experiment directory:
-
-```
-.agentv/results/runs/<experiment>/
-  _autoresearch/
-    iterations.jsonl       # one line per cycle — data for chart + mutator
-    trajectory.html        # live-updating score trajectory chart
-  2026-04-15T10-30-00/     # cycle 1 — standard run artifacts
-    index.jsonl
-    grading.json
-    timing.json
-    benchmark.json
-    report.html
-  2026-04-15T10-35-00/     # cycle 2 — standard run artifacts
-    ...
-```
-
-No `original.md` or `best.md` files — git history serves as the backup. The `_` prefix convention distinguishes workflow folders from timestamped run dirs.
-
-### iterations.jsonl
-
-One JSON object per line, one line per cycle:
-
-```jsonl
-{"cycle":1,"score":0.65,"decision":"keep","cost_usd":0.12,"assertions":{"IDENTIFIES_BUG":0.8,"SUGGESTS_FIX":0.4},"mutation":"added explicit null-check instruction","run_dir":"2026-04-15T10-30-00","timestamp":"2026-04-15T10:32:15Z"}
-```
-
-Fields: `cycle` (1-indexed), `score` (overall pass rate 0–1), `decision` ("keep" or "drop"), `cost_usd` (eval run cost), `assertions` (per-assertion pass rates), `mutation` (one-line description of what changed), `run_dir` (timestamped directory name), `timestamp` (ISO 8601).
-
-### trajectory.html
-
-A standalone HTML chart file with embedded Chart.js. Copy the template from `scripts/trajectory.html` into the `_autoresearch/` directory. It fetches `iterations.jsonl` from the same directory on each auto-refresh — no data injection needed. Shows:
-
-- Score over iterations (line chart) with KEEP (green) / DISCARD (red) markers
-- Per-assertion pass rates over iterations
-- Cumulative cost across iterations
-- Best vs original score summary
-
-Auto-refreshes every 2 seconds during the loop. Becomes static after completion (remove the auto-refresh meta tag on final update).
-
-### Convergence
-
-Stop after **3** consecutive cycles with no improvement (no KEEP). Also stop at **max_cycles** (default 10). Either limit can be overridden by the user.
-
-### Human checkpoints
-
-Autoresearch mode **skips** human checkpoints at iterations 3/6/9. The user opted in to unattended operation by requesting autoresearch.
-
-### Context hygiene
-
-The orchestrator must run indefinitely without exhausting its context window. To do this:
-
-- **Never read eval results, artifacts, or transcripts into your own context.** Use bash commands (jq, agentv CLI) that output small structured summaries.
-- **Delegate all heavy reading to subagents.** The mutator reads artifacts, grading results, and transcripts from disk — you pass it paths, not content.
-- **Use bash for all file I/O** in the loop body: appending to `iterations.jsonl`, git operations, score extraction. The only tool calls per cycle should be bash commands and one subagent dispatch (mutator).
-- **trajectory.html auto-loads `iterations.jsonl`** via fetch — no need to read or update the HTML file after initial copy.
-
-### Procedure
-
-Follow this step-by-step procedure to execute autoresearch:
-
-#### 1. Setup
-
-1. Determine the **artifact path** (file or directory to optimize) and **eval path** (EVAL.yaml or evals.json).
-2. Detect **artifact mode**: `file` if the artifact path is a file, `directory` if it's a directory.
-3. Derive the **experiment name**: `autoresearch-<name>` from the artifact filename/dirname, or use a user-provided name.
-4. Set the experiment directory: `.agentv/results/runs/<experiment>/`.
-5. Create the `_autoresearch/` subdirectory inside the experiment directory.
-6. Record `initial_sha=$(git rev-parse HEAD)` — the commit before any mutations.
-7. Copy `scripts/trajectory.html` to `_autoresearch/trajectory.html`.
-8. Initialize variables:
-   - `best_score = 0`
-   - `convergence_count = 0`
-   - `cycle = 1`
-   - `max_cycles = 10` (or user-specified)
-   - `max_convergence = 3` (or user-specified)
-
-#### 2. Main loop
-
-Repeat while `cycle <= max_cycles` and `convergence_count < max_convergence`:
-
-**a. Run eval**
-
-```bash
-agentv eval <eval-path> --experiment autoresearch-<name>
-```
-
-**b. Extract scores (bash only — do NOT read result files into your context)**
-
-Find the latest timestamped directory in the experiment folder. Use bash/jq to extract small structured values:
-
-```bash
-# Find latest run dir
-RUN_DIR=$(ls -td <experiment-dir>/20*/ | head -1)
-
-# Overall score (mean of all scores in index.jsonl)
-SCORE=$(jq -sr '[.[].scores[].score] | add / length' "$RUN_DIR/index.jsonl")
-
-# Per-assertion pass rates as JSON object
-PASS_RATES=$(jq -sr '[.[].scores[]] | group_by(.type) | map({key: .[0].type, value: (map(.score) | add / length)}) | from_entries' "$RUN_DIR/index.jsonl")
-
-# Cost (if timing.json exists)
-COST=$(jq -r '.cost_usd // 0' "$RUN_DIR/timing.json" 2>/dev/null || echo 0)
-```
-
-Capture only these small outputs (`SCORE`, `PASS_RATES`, `COST`) — never read the full JSONL into context.
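The jq recipes in the removed step (b) can be exercised against a tiny synthetic `index.jsonl`. The record shape here is inferred from the recipes themselves and is an assumption; real agentv output may carry more fields.

```shell
# Synthetic index.jsonl whose shape matches what the jq recipes expect
# (an assumption for illustration): two test cases, two assertion types.
RUN_DIR=$(mktemp -d)
cat > "$RUN_DIR/index.jsonl" <<'EOF'
{"scores":[{"type":"IDENTIFIES_BUG","score":1},{"type":"SUGGESTS_FIX","score":0}]}
{"scores":[{"type":"IDENTIFIES_BUG","score":1},{"type":"SUGGESTS_FIX","score":1}]}
EOF

# Overall score: mean of all four assertion scores
SCORE=$(jq -sr '[.[].scores[].score] | add / length' "$RUN_DIR/index.jsonl")
echo "$SCORE"        # 0.75

# Per-assertion pass rates, grouped by assertion type
PASS_RATES=$(jq -sr '[.[].scores[]] | group_by(.type) | map({key: .[0].type, value: (map(.score) | add / length)}) | from_entries' "$RUN_DIR/index.jsonl")
echo "$PASS_RATES"
```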
-
-**c. Update iterations.jsonl (bash only)**
-
-After the KEEP/DROP decision (step e), append one JSON line via bash:
-
-```bash
-echo '{"cycle":'$CYCLE',"score":'$SCORE',"decision":"'$DECISION'","cost_usd":'$COST',"assertions":'$PASS_RATES',"mutation":"'"$MUTATION_DESC"'","run_dir":"'"$(basename $RUN_DIR)"'","timestamp":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"}' >> <experiment-dir>/_autoresearch/iterations.jsonl
-```
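The quote-splicing in the removed `echo` command breaks if `$MUTATION_DESC` itself contains quotes. A hedged alternative builds the line with `jq -n`, which escapes values properly; the variable values and file path below are illustrative.

```shell
# Hypothetical alternative to the echo concatenation: jq -n escapes safely.
CYCLE=1; SCORE=0.65; DECISION=keep; COST=0.12
PASS_RATES='{"IDENTIFIES_BUG":0.8,"SUGGESTS_FIX":0.4}'
MUTATION_DESC='added explicit "null-check" instruction'   # note embedded quotes
RUN_DIR=2026-04-15T10-30-00
ITER_FILE=$(mktemp)   # stands in for <experiment-dir>/_autoresearch/iterations.jsonl

jq -cn \
  --argjson cycle "$CYCLE" --argjson score "$SCORE" --arg decision "$DECISION" \
  --argjson cost "$COST" --argjson assertions "$PASS_RATES" \
  --arg mutation "$MUTATION_DESC" --arg run_dir "$RUN_DIR" \
  --arg ts "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  '{cycle:$cycle,score:$score,decision:$decision,cost_usd:$cost,assertions:$assertions,mutation:$mutation,run_dir:$run_dir,timestamp:$ts}' \
  >> "$ITER_FILE"
```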
-
-**d. trajectory.html — no action needed**
-
-The trajectory chart fetches `iterations.jsonl` directly via HTTP on each auto-refresh. No file manipulation required after the initial copy in setup.
-
-**e. Decide: KEEP or DROP**
-
-Apply the automated keep/discard rules from Step 5:
-
-1. Run `agentv compare <baseline>.jsonl <candidate>.jsonl --json` where `<baseline>` is the best iteration's `index.jsonl` (or the first run's `index.jsonl` for cycle 1) and `<candidate>` is this cycle's `index.jsonl`.
-2. If `wins > losses` → **KEEP**.
-3. If `wins <= losses` → **DISCARD**.
-4. If `mean_delta == 0` and the artifact is simpler → **KEEP** (simpler is better at equal performance). Simplicity: for files, compare line count; for directories, compare total size via `du -sb`.
-
-For cycle 1, there is no baseline to compare against — always **KEEP** the first cycle.
-
-**f. If KEEP**
-
-- Update `best_score` to this cycle's score.
-- Commit the artifact: `git add <artifact-path> && git commit -m "autoresearch cycle N: <mutation summary>"`.
-- Record the current `index.jsonl` path as the new baseline for future comparisons.
-- Reset `convergence_count = 0`.
-
-**g. If DROP**
-
-- Revert the working tree to HEAD: `git checkout -- <artifact-path>` (for files) or `git checkout -- <artifact-path>/` (for directories).
-- Increment `convergence_count`.
-
-**h. Check stop conditions**
-
-If `convergence_count >= max_convergence` or `cycle >= max_cycles` → break out of the loop.
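The loop bookkeeping in the removed steps (f) through (h) can be dry-run with a canned decision stream instead of real evals. All values here are hypothetical.

```shell
# Simulate the stop conditions: halt after 3 consecutive DROPs or 10 cycles.
decisions="keep drop drop drop keep keep"   # canned outcomes for illustration
max_cycles=10
max_convergence=3
cycle=1
convergence_count=0

for d in $decisions; do
  # h. check stop conditions before starting another cycle
  if [ "$cycle" -gt "$max_cycles" ] || [ "$convergence_count" -ge "$max_convergence" ]; then
    break
  fi
  if [ "$d" = "keep" ]; then
    convergence_count=0                              # f. KEEP resets the streak
  else
    convergence_count=$((convergence_count + 1))     # g. DROP extends the streak
  fi
  cycle=$((cycle + 1))
done

echo "stopped with cycle=$cycle convergence_count=$convergence_count"
# stopped with cycle=5 convergence_count=3
```

The third consecutive DROP (cycles 2–4 after the initial KEEP) trips `max_convergence`, so the loop never consumes the remaining KEEPs.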
-
-**i. Mutate**
-
-Dispatch the **mutator** subagent (`agents/mutator.md`) with:
-- `artifact-path`: the file or directory to mutate
-- `artifact-mode`: `file` or `directory`
-- `initial-sha`: the starting commit SHA (for referencing the original via `git show`)
-- `pass-rates`: the `$PASS_RATES` JSON object from step (b) (small — just assertion names and rates)
-- `run-dir`: path to this cycle's run directory (the mutator reads `grading.json` and transcripts itself)
-- `iterations-path`: path to `_autoresearch/iterations.jsonl` (the mutator reads mutation history itself)
-- For directory mode: `focus-files` (optional — files most likely contributing to failures, derived from assertion names)
-
-**Do NOT pass failure descriptions, transcripts, or grading content** to the mutator — pass paths and let it read what it needs from disk. This keeps the orchestrator's context clean.
-
-The mutator rewrites artifacts in place. Verify the artifact was modified (e.g., `git diff --stat`) before continuing.
-
-**j. Continue**
-
-Increment `cycle` and return to step (a).
-
-#### 3. Completion
-
-1. Finalize `trajectory.html`: remove the line containing `<!-- __AUTO_REFRESH__ -->` (which includes the `<meta http-equiv="refresh">` tag) so the chart becomes static.
-2. Log a final summary:
-   - Total cycles run
-   - Final best score vs original score (cycle 1)
-   - Number of KEEPs and DROPs
-   - Total cost across all cycles
-   - The optimized artifact is in the working tree (and the latest commit)
-   - Run `git diff <initial_sha>` to see total changes from the original
-   - Run `git log --oneline <initial_sha>..HEAD` to see the mutation history
-   - Path to `_autoresearch/trajectory.html` (the score chart)
-3. Present results to the user with a recommendation: adopt the optimized version, revert to original (`git checkout <initial_sha> -- <artifact-path>`), or continue iterating interactively.
-
-### Interactive/autonomous hybrid
-
-Users can start in interactive mode (the existing Step 3–5 loop with human checkpoints), build confidence in their eval quality, and then switch to autoresearch mode to run unattended. The two modes share the same eval infrastructure and artifact layout — autoresearch simply automates the keep/discard decisions and removes human checkpoints.
-
-### Model empathy recommendation
+Autoresearch is an unattended eval-improve loop that runs multiple optimize cycles without human intervention. The user triggers it with natural language (e.g., "run autoresearch on this skill", "optimize this skill unattended"). It uses the mutator subagent (`agents/mutator.md`) to rewrite artifacts based on failure analysis, and automated keep/discard to decide whether to keep or revert each change.
 
-For best results, use same-model pairings: the meta-agent running autoresearch should match the model used by the task agent being evaluated (e.g., Claude optimizing a Claude agent, GPT optimizing a GPT agent). Per AutoAgent research findings, same-model pairings produce better mutations because the optimizer has implicit knowledge of how the target model interprets instructions.
+Read `references/autoresearch.md` for the full procedure (prerequisites, artifact layout, keep/discard rules, the step-by-step loop, convergence criteria, and context hygiene).
 
 ---
 
@@ -694,6 +396,7 @@ The `agents/` directory contains instructions for specialized subagents. Read th
 | mutator | `agents/mutator.md` | Rewrite artifact from failure analysis | Step 5 (autoresearch — dispatched per cycle) |
 
 The `references/` directory has additional documentation:
+- `references/autoresearch.md` — Autoresearch unattended optimization loop and automated keep/discard rules
 - `references/eval-yaml-spec.md` — Eval YAML schema and assertion grading recipes
 - `references/subagent-pipeline.md` — Detailed subagent-mode pipeline commands and output structure
 - `references/description-optimization.md` — Skill description optimization workflow

0 commit comments
