AgentV's core should remain minimal.

**Extension points (prefer these over adding built-ins):**
- `code-grader` scripts for custom evaluation logic
- `llm-grader` graders with custom prompt files for domain-specific LLM grading
- CLI wrappers that consume AgentV's JSON/JSONL output for post-processing (aggregation, comparison, reporting)
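
The third option above can be sketched concretely. The snippet below is a hypothetical post-processing wrapper, assuming one JSON record per line with a `scores` array whose entries carry a `type` (a field confirmed later in this guide) and a numeric `score` (an assumed field name):

```typescript
// Hypothetical wrapper over AgentV's JSONL output: aggregate mean score per
// grader type. `scores[].type` is documented; `score` is an assumed field.
interface ScoreEntry {
  type: string;
  score?: number;
}

interface ResultRecord {
  scores?: ScoreEntry[];
}

function aggregateByType(jsonl: string): Map<string, number> {
  const sums = new Map<string, { total: number; count: number }>();
  for (const line of jsonl.split("\n")) {
    if (!line.trim()) continue;
    const record = JSON.parse(line) as ResultRecord;
    for (const entry of record.scores ?? []) {
      const bucket = sums.get(entry.type) ?? { total: 0, count: 0 };
      bucket.total += entry.score ?? 0;
      bucket.count += 1;
      sums.set(entry.type, bucket);
    }
  }
  // Reduce each bucket to a mean.
  const means = new Map<string, number>();
  for (const [type, { total, count }] of sums) {
    means.set(type, total / count);
  }
  return means;
}
```

Anything beyond reading the JSONL belongs in the wrapper, not in AgentV itself.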

**Ask yourself:** "Can this be achieved with existing primitives + a plugin or wrapper?" If yes, it should not be a built-in. This includes adding config overrides to existing graders — if a niche provider needs custom tool-name matching, that's a code-grader, not a new config field.

### 2. Built-ins for Primitives Only

Built-in graders provide **universal primitives** that users compose. A primitive is:

- Stateless and deterministic
- Has a single, clear responsibility
- Cannot be trivially composed from other primitives
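
As an illustration, a primitive in the spirit of the built-in `contains` type might reduce to a pure function. The interface below is illustrative only, not AgentV's actual grader API:

```typescript
// Illustrative shape of a primitive grader: stateless, deterministic, and
// single-purpose. This interface is an assumption, not AgentV's real API.
interface GraderResult {
  passed: boolean;
  score: number;
}

// A `contains`-style primitive: pass iff the expected substring appears.
function containsGrader(output: string, expected: string): GraderResult {
  const passed = output.includes(expected);
  return { passed, score: passed ? 1 : 0 };
}
```

Anything stateful, multi-step, or provider-specific fails the criteria above and should live in a plugin instead.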

AI agents are the primary users of AgentV—not humans reading docs.

Tests should be lean and focused on what matters. Follow these principles:

- **Regression tests > comprehensive tests.** A test that would have caught the bug is worth more than five tests that exercise happy paths.
- **Tests are executable contracts.** When a module's behavioral contract changes, the tests must reflect the new contract — not just the happy path. If you change what a function promises, update its tests to assert the new promise.

### Verifying Grader Changes

Unit tests alone are insufficient for grader changes. After implementing or modifying graders:

1. **Copy `.env` to the worktree** if running in a git worktree (e2e tests need environment variables):

   ```bash
   cp /path/to/main/.env .env
   ```

   ```powershell
   Copy-Item D:/path/to/main/.env .env
   ```

   Do not claim e2e or grader verification results unless this preflight has passed.

2. **Run an actual eval** with a real example file:

   ```bash
   bun apps/cli/src/cli.ts eval examples/features/rubric/evals/dataset.eval.yaml --test-id <test-id>
   ```

3. **Inspect the results JSONL** to verify:
   - The correct grader type is invoked (check `scores[].type`)
   - Scores are calculated as expected
   - Assertions array reflects the evaluation logic (each entry has `text`, `passed`, optional `evidence`)

4. **Update baseline files** if output format changes (e.g., type name renames). Baseline files live alongside eval YAML files as `*.baseline.jsonl` and contain expected `scores[].type` values. There are 30+ baseline files across `examples/`.

5. **Note:** `--dry-run` returns schema-valid mock responses (`{}` as output, zeroed `tokenUsage`). Built-in graders will not crash, but scores are meaningless. Use it for testing harness flow, not grader logic.
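The checks in step 3 above can be mechanized. The assertion fields (`text`, `passed`, optional `evidence`) come from step 3 itself; the top-level record layout is an assumed sketch:

```typescript
// Sketch of validating one results-JSONL line against the checks in step 3.
// `scores[].type` and the assertion fields are documented in this guide;
// the exact top-level record layout is an assumption.
interface Assertion {
  text: string;
  passed: boolean;
  evidence?: string;
}

interface Score {
  type: string;
  assertions?: Assertion[];
}

function checkRecord(line: string, expectedType: string): boolean {
  const record = JSON.parse(line) as { scores?: Score[] };
  const scores = record.scores ?? [];
  // The correct grader type must be invoked.
  if (!scores.some((s) => s.type === expectedType)) return false;
  // Every assertion entry must have the documented shape.
  return scores.every((s) =>
    (s.assertions ?? []).every(
      (a) => typeof a.text === "string" && typeof a.passed === "boolean",
    ),
  );
}
```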
### Completing Work — E2E Checklist

Before marking any branch as ready for review, complete this checklist:

- **Green (with your changes):** Run the identical scenario with your branch. Confirm the fix or feature works correctly from the end user's perspective. Capture the output.
- **Document both** red and green results in the PR description or comments so reviewers can see the before/after evidence.

For grader changes, this means running a real eval (not `--dry-run`) and inspecting the output JSONL. For CLI/UX changes, this means running the CLI command and verifying the console output.

4. **Verify no regressions** in areas adjacent to your changes (e.g., if you changed grader parsing, run an eval that exercises different grader types).

5. **Live eval verification**: For changes affecting scoring, thresholds, or grader behavior, run at least one real eval with a live provider (not `--dry-run`) and verify the output JSONL has correct scores, verdicts, and execution status.

6. **Studio UX verification**: For changes affecting config, scoring display, or studio API, use `agent-browser` to verify the studio UI still renders and functions correctly (settings page loads, pass/fail indicators are correct, config saves work).

When making changes to functionality:

1. **Docs site** (`apps/web/src/content/docs/`): Update human-readable documentation on agentv.dev. This is the comprehensive reference.

2. **Skill files** (`plugins/agentv-dev/skills/agentv-eval-builder/`): Update the AI-focused reference card if the change affects YAML schema, grader types, or CLI commands. Keep concise — link to docs site for details.

3. **Examples** (`examples/`): Update any example code, scripts, or eval YAML files that exercise the changed functionality. Examples are both documentation and integration tests.

4. **README.md**: Keep minimal. Links point to agentv.dev.

## Grader Type System

Grader types use **kebab-case** everywhere (matching promptfoo convention):

**Source of truth:** `EVALUATOR_KIND_VALUES` array in `packages/core/src/evaluation/types.ts`

**Backward compatibility:** Snake_case is accepted in YAML (`llm_judge` → `llm-grader`) via `normalizeGraderType()` in `grader-parser.ts`. Single-word types (`contains`, `equals`, `regex`, `latency`, `cost`) have no separator and are unchanged.
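
That normalization rule can be sketched as a simplified stand-in for `normalizeGraderType()`. The legacy-alias map is reduced to the one example given above, and the generic underscore-to-hyphen rule is assumed:

```typescript
// Simplified stand-in for normalizeGraderType(). Aliases that also rename
// (like llm_judge -> llm-grader) need an explicit map; only the example from
// the text above is included here.
const LEGACY_ALIASES: Record<string, string> = {
  llm_judge: "llm-grader",
};

function normalizeGraderType(raw: string): string {
  const aliased = LEGACY_ALIASES[raw];
  if (aliased !== undefined) return aliased;
  // Single-word types (contains, equals, regex, latency, cost) contain no
  // underscore and pass through unchanged.
  return raw.split("_").join("-");
}
```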

**Two type definitions exist:**
- `EvaluatorKind` in `packages/core/src/evaluation/types.ts` — internal, canonical
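
A common way to keep the runtime array and the type in sync is to derive the union from the `as const` array. The values below are drawn from types this guide mentions, not the full real list in `types.ts`:

```typescript
// Sketch: derive the EvaluatorKind union from the runtime array so the two
// cannot drift. The values shown are a subset, not the real list.
const EVALUATOR_KIND_VALUES = ["contains", "equals", "regex", "llm-grader"] as const;

type EvaluatorKind = (typeof EVALUATOR_KIND_VALUES)[number];

// Runtime guard derived from the same source of truth.
function isEvaluatorKind(value: string): value is EvaluatorKind {
  return (EVALUATOR_KIND_VALUES as readonly string[]).includes(value);
}
```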