Commit 04206fc

christso and Copilot authored
refactor(core): rename Evaluator to Grader across codebase (#1111)
* refactor(core): rename evaluators to graders
* docs: fix grader example links
* docs: remove issue plan artifact

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent: df002b3

192 files changed: 1601 additions & 1643 deletions

AGENTS.md

Lines changed: 17 additions & 17 deletions
@@ -19,13 +19,13 @@ AgentV's core should remain minimal. Complex or domain-specific logic belongs in
 
 **Extension points (prefer these over adding built-ins):**
 - `code-grader` scripts for custom evaluation logic
-- `llm-grader` evaluators with custom prompt files for domain-specific LLM grading
+- `llm-grader` graders with custom prompt files for domain-specific LLM grading
 - CLI wrappers that consume AgentV's JSON/JSONL output for post-processing (aggregation, comparison, reporting)
 
-**Ask yourself:** "Can this be achieved with existing primitives + a plugin or wrapper?" If yes, it should not be a built-in. This includes adding config overrides to existing evaluators — if a niche provider needs custom tool-name matching, that's a code-grader, not a new config field.
+**Ask yourself:** "Can this be achieved with existing primitives + a plugin or wrapper?" If yes, it should not be a built-in. This includes adding config overrides to existing graders — if a niche provider needs custom tool-name matching, that's a code-grader, not a new config field.
 
 ### 2. Built-ins for Primitives Only
-Built-in evaluators provide **universal primitives** that users compose. A primitive is:
+Built-in graders provide **universal primitives** that users compose. A primitive is:
 - Stateless and deterministic
 - Has a single, clear responsibility
 - Cannot be trivially composed from other primitives
@@ -77,11 +77,11 @@ AI agents are the primary users of AgentV—not humans reading docs. Design for
 
 ## Project Structure
 - `packages/core/` - Evaluation engine, providers, grading
-- `src/evaluation/registry/` - Extensible evaluator registry (EvaluatorRegistry, assertion discovery)
+- `src/evaluation/registry/` - Extensible grader registry (EvaluatorRegistry, assertion discovery)
 - `src/evaluation/providers/provider-registry.ts` - Provider plugin registry
 - `src/evaluation/evaluate.ts` - `evaluate()` programmatic API
 - `src/evaluation/config.ts` - `defineConfig()` for typed agentv.config.ts
-- `packages/eval/` - Lightweight assertion SDK (`defineAssertion`, `defineCodeJudge`)
+- `packages/eval/` - Lightweight assertion SDK (`defineAssertion`, `defineCodeGrader`)
 - `apps/cli/` - Command-line interface (published as `agentv`)
 - `src/commands/create/` - Scaffold commands (`agentv create assertion/eval`)
 - `examples/features/sdk-*` - SDK usage examples (custom assertion, programmatic API, config file)
@@ -261,9 +261,9 @@ Tests should be lean and focused on what matters. Follow these principles:
 - **Regression tests > comprehensive tests.** A test that would have caught the bug is worth more than five tests that exercise happy paths.
 - **Tests are executable contracts.** When a module's behavioral contract changes, the tests must reflect the new contract — not just the happy path. If you change what a function promises, update its tests to assert the new promise.
 
-### Verifying Evaluator Changes
+### Verifying Grader Changes
 
-Unit tests alone are insufficient for evaluator changes. After implementing or modifying evaluators:
+Unit tests alone are insufficient for grader changes. After implementing or modifying graders:
 
 1. **Copy `.env` to the worktree** if running in a git worktree (e2e tests need environment variables):
 ```bash
@@ -272,21 +272,21 @@ Unit tests alone are insufficient for evaluator changes. After implementing or m
 ```powershell
 Copy-Item D:/path/to/main/.env .env
 ```
-Do not claim e2e or evaluator verification results unless this preflight has passed.
+Do not claim e2e or grader verification results unless this preflight has passed.
 
 2. **Run an actual eval** with a real example file:
 ```bash
 bun apps/cli/src/cli.ts eval examples/features/rubric/evals/dataset.eval.yaml --test-id <test-id>
 ```
 
 3. **Inspect the results JSONL** to verify:
-- The correct evaluator type is invoked (check `scores[].type`)
+- The correct grader type is invoked (check `scores[].type`)
 - Scores are calculated as expected
 - Assertions array reflects the evaluation logic (each entry has `text`, `passed`, optional `evidence`)
 
 4. **Update baseline files** if output format changes (e.g., type name renames). Baseline files live alongside eval YAML files as `*.baseline.jsonl` and contain expected `scores[].type` values. There are 30+ baseline files across `examples/`.
 
-5. **Note:** `--dry-run` returns schema-valid mock responses (`{}` as output, zeroed `tokenUsage`). Built-in graders will not crash, but scores are meaningless. Use it for testing harness flow, not evaluator logic.
+5. **Note:** `--dry-run` returns schema-valid mock responses (`{}` as output, zeroed `tokenUsage`). Built-in graders will not crash, but scores are meaningless. Use it for testing harness flow, not grader logic.
 
 ### Completing Work — E2E Checklist
 
@@ -307,11 +307,11 @@ Before marking any branch as ready for review, complete this checklist:
 - **Green (with your changes):** Run the identical scenario with your branch. Confirm the fix or feature works correctly from the end user's perspective. Capture the output.
 - **Document both** red and green results in the PR description or comments so reviewers can see the before/after evidence.
 
-For evaluator changes, this means running a real eval (not `--dry-run`) and inspecting the output JSONL. For CLI/UX changes, this means running the CLI command and verifying the console output.
+For grader changes, this means running a real eval (not `--dry-run`) and inspecting the output JSONL. For CLI/UX changes, this means running the CLI command and verifying the console output.
 
-4. **Verify no regressions** in areas adjacent to your changes (e.g., if you changed evaluator parsing, run an eval that exercises different evaluator types).
+4. **Verify no regressions** in areas adjacent to your changes (e.g., if you changed grader parsing, run an eval that exercises different grader types).
 
-5. **Live eval verification**: For changes affecting scoring, thresholds, or evaluator behavior, run at least one real eval with a live provider (not `--dry-run`) and verify the output JSONL has correct scores, verdicts, and execution status.
+5. **Live eval verification**: For changes affecting scoring, thresholds, or grader behavior, run at least one real eval with a live provider (not `--dry-run`) and verify the output JSONL has correct scores, verdicts, and execution status.
 
 6. **Studio UX verification**: For changes affecting config, scoring display, or studio API, use `agent-browser` to verify the studio UI still renders and functions correctly (settings page loads, pass/fail indicators are correct, config saves work).
 
@@ -323,15 +323,15 @@ When making changes to functionality:
 
 1. **Docs site** (`apps/web/src/content/docs/`): Update human-readable documentation on agentv.dev. This is the comprehensive reference.
 
-2. **Skill files** (`plugins/agentv-dev/skills/agentv-eval-builder/`): Update the AI-focused reference card if the change affects YAML schema, evaluator types, or CLI commands. Keep concise — link to docs site for details.
+2. **Skill files** (`plugins/agentv-dev/skills/agentv-eval-builder/`): Update the AI-focused reference card if the change affects YAML schema, grader types, or CLI commands. Keep concise — link to docs site for details.
 
 3. **Examples** (`examples/`): Update any example code, scripts, or eval YAML files that exercise the changed functionality. Examples are both documentation and integration tests.
 
 4. **README.md**: Keep minimal. Links point to agentv.dev.
 
-## Evaluator Type System
+## Grader Type System
 
-Evaluator types use **kebab-case** everywhere (matching promptfoo convention):
+Grader types use **kebab-case** everywhere (matching promptfoo convention):
 
 - **YAML config:** `type: llm-grader`, `type: is-json`, `type: execution-metrics`
 - **Internal TypeScript:** `EvaluatorKind = 'llm-grader' | 'is-json' | ...`
@@ -340,7 +340,7 @@ Evaluator types use **kebab-case** everywhere (matching promptfoo convention):
 
 **Source of truth:** `EVALUATOR_KIND_VALUES` array in `packages/core/src/evaluation/types.ts`
 
-**Backward compatibility:** Snake_case is accepted in YAML (`llm_judge` → `llm-grader`) via `normalizeEvaluatorType()` in `evaluator-parser.ts`. Single-word types (`contains`, `equals`, `regex`, `latency`, `cost`) have no separator and are unchanged.
+**Backward compatibility:** Snake_case is accepted in YAML (`llm_judge` → `llm-grader`) via `normalizeGraderType()` in `grader-parser.ts`. Single-word types (`contains`, `equals`, `regex`, `latency`, `cost`) have no separator and are unchanged.
 
 **Two type definitions exist:**
 - `EvaluatorKind` in `packages/core/src/evaluation/types.ts` — internal, canonical
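The backward-compatibility rule in the AGENTS.md hunk above (snake_case accepted in YAML, kebab-case canonical) can be sketched as follows. This is a hypothetical illustration, not the actual `normalizeGraderType()` from `grader-parser.ts`; the `llm_judge` alias is the one example the document itself gives.

```typescript
// Hypothetical sketch of snake_case -> kebab-case type normalization.
// Single-word types (contains, equals, regex, latency, cost) contain no
// underscore, so the generic rule leaves them unchanged.
const LEGACY_ALIASES: Record<string, string> = {
  // Old name mapping to a differently-named canonical type (from the doc).
  llm_judge: 'llm-grader',
};

function normalizeType(raw: string): string {
  const aliased = LEGACY_ALIASES[raw];
  if (aliased) return aliased;
  // Generic rule: replace every underscore with a hyphen.
  return raw.replace(/_/g, '-');
}

console.log(normalizeType('llm_judge'));         // -> llm-grader
console.log(normalizeType('execution_metrics')); // -> execution-metrics
console.log(normalizeType('contains'));          // -> contains
```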

README.md

Lines changed: 1 addition & 1 deletion
@@ -107,7 +107,7 @@ console.log(`${summary.passed}/${summary.total} passed`);
 Full docs at [agentv.dev/docs](https://agentv.dev/docs/getting-started/introduction/).
 
 - [Eval files](https://agentv.dev/docs/evaluation/eval-files/) — format and structure
-- [Custom evaluators](https://agentv.dev/docs/evaluators/custom-evaluators/) — code graders in any language
+- [Custom graders](https://agentv.dev/docs/graders/custom-graders/) — code graders in any language
 - [Rubrics](https://agentv.dev/docs/evaluation/rubrics/) — structured criteria scoring
 - [Targets](https://agentv.dev/docs/targets/configuration/) — configure agents and providers
 - [Compare results](https://agentv.dev/docs/tools/compare/) — A/B testing and regression detection

apps/cli/src/commands/eval/artifact-writer.ts

Lines changed: 3 additions & 5 deletions
@@ -4,7 +4,7 @@ import path from 'node:path';
 import {
   DEFAULT_THRESHOLD,
   type EvaluationResult,
-  type EvaluatorResult,
+  type GraderResult,
   toTranscriptJsonLines,
 } from '@agentv/core';
 import { toSnakeCaseDeep } from '../../utils/case-conversion.js';
@@ -227,9 +227,7 @@ function buildAssertions(result: EvaluationResult): GradingArtifact['assertions'
 // Build graders list
 // ---------------------------------------------------------------------------
 
-function buildEvaluators(
-  scores: readonly EvaluatorResult[] | undefined,
-): GradingArtifact['graders'] {
+function buildEvaluators(scores: readonly GraderResult[] | undefined): GradingArtifact['graders'] {
   if (!scores || scores.length === 0) {
     return undefined;
   }
@@ -370,7 +368,7 @@ export function buildBenchmarkArtifact(
     runSummary[target] = entry as (typeof runSummary)[string];
   }
 
-  // Per-evaluator summary across all results
+  // Per-grader summary across all results
   const evaluatorScores = new Map<string, number[]>();
   for (const result of results) {
     if (result.scores) {
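The per-grader summary hunk above collects scores into a `Map<string, number[]>`. A minimal sketch of that grouping pattern, under assumed result shapes (the real types live in `@agentv/core` and are not shown here):

```typescript
// Sketch of a per-grader score summary in the spirit of the
// `evaluatorScores` map above. The shapes below are assumptions
// for illustration, not the actual @agentv/core types.
interface ScoreEntry { name: string; score: number; }
interface ResultEntry { scores?: ScoreEntry[]; }

function summarizeByGrader(results: ResultEntry[]): Map<string, number> {
  const grouped = new Map<string, number[]>();
  for (const result of results) {
    for (const s of result.scores ?? []) {
      const bucket = grouped.get(s.name) ?? [];
      bucket.push(s.score);
      grouped.set(s.name, bucket);
    }
  }
  // Reduce each grader's collected scores to a mean.
  const means = new Map<string, number>();
  for (const [name, values] of grouped) {
    means.set(name, values.reduce((a, b) => a + b, 0) / values.length);
  }
  return means;
}
```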

apps/cli/src/commands/eval/benchmark-writer.ts

Lines changed: 2 additions & 2 deletions
@@ -32,10 +32,10 @@ function computeStats(values: readonly number[]): BenchmarkStats {
 }
 
 /**
- * Compute per-test pass_rate from evaluator scores.
+ * Compute per-test pass_rate from grader scores.
  *
  * For each test, pass_rate = count(evaluator.score >= 0.8) / total_evaluators.
- * If no per-evaluator scores exist, falls back to the top-level result score
+ * If no per-grader scores exist, falls back to the top-level result score
  * with the same threshold (>= 0.8 → 1.0, else 0.0).
  */
 function computePassRate(result: EvaluationResult): number {
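The docstring above fully specifies the pass_rate rule, so it can be sketched directly. This is a simplified illustration with an assumed result shape, not the real `computePassRate` over `EvaluationResult`:

```typescript
// Sketch of the pass_rate rule from the docstring above:
// pass_rate = count(score >= 0.8) / total graders, falling back to a
// binarized top-level score when no per-grader scores exist.
const PASS_THRESHOLD = 0.8;

interface SimpleResult {
  score?: number;               // top-level fallback score
  scores?: { score: number }[]; // per-grader scores
}

function passRate(result: SimpleResult): number {
  const perGrader = result.scores ?? [];
  if (perGrader.length > 0) {
    const passed = perGrader.filter((s) => s.score >= PASS_THRESHOLD).length;
    return passed / perGrader.length;
  }
  // Fallback: binarize the top-level score with the same threshold.
  return (result.score ?? 0) >= PASS_THRESHOLD ? 1.0 : 0.0;
}
```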

apps/cli/src/commands/eval/commands/assert.ts

Lines changed: 1 addition & 1 deletion
@@ -62,7 +62,7 @@ export const evalAssertCommand = command({
       process.exit(1);
     }
 
-    // Build payload matching CodeEvaluator's expected format (snake_case).
+    // Build payload matching CodeGrader's expected format (snake_case).
     // Include all fields that defineCodeGrader validates as required.
     const payload = JSON.stringify(
       {

apps/cli/src/commands/eval/html-writer.ts

Lines changed: 3 additions & 3 deletions
@@ -500,10 +500,10 @@ const SCRIPT = `
 h+='<div class="detail-block"><h4>Output</h4><pre class="detail-pre">'+esc(r.output?JSON.stringify(r.output,null,2):"")+"</pre></div>";
 h+="</div>";
 
-/* evaluator results */
+/* grader results */
 if(r.scores&&r.scores.length>0){
-h+="<h4>Evaluator Results</h4>";
-h+='<table class="eval-table"><thead><tr><th>Evaluator</th><th>Score</th><th>Status</th><th>Assertions</th></tr></thead><tbody>';
+h+="<h4>Grader Results</h4>";
+h+='<table class="eval-table"><thead><tr><th>Grader</th><th>Score</th><th>Status</th><th>Assertions</th></tr></thead><tbody>';
 for(var i=0;i<r.scores.length;i++){
 var ev=r.scores[i],evS=ev.score>=0.5?"pass":"fail";
 var evAssertions=ev.assertions||[];

apps/cli/src/commands/inspect/score.ts

Lines changed: 20 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -2,9 +2,9 @@ import {
22
type EvalTest,
33
type EvaluationContext,
44
type EvaluationScore,
5-
type Evaluator,
6-
type EvaluatorConfig,
7-
type EvaluatorDispatchContext,
5+
type Grader,
6+
type GraderConfig,
7+
type GraderDispatchContext,
88
type Message,
99
type Provider,
1010
type ProviderRequest,
@@ -24,7 +24,7 @@ import {
2424
} from './utils.js';
2525

2626
/**
27-
* Evaluator types that work without an LLM provider.
27+
* Grader types that work without an LLM provider.
2828
*/
2929
const SUPPORTED_TYPES = [
3030
'contains',
@@ -52,7 +52,7 @@ function parseKeyValues(s: string): Record<string, string> {
5252
}
5353

5454
/**
55-
* Parse an inline evaluator spec string into an EvaluatorConfig.
55+
* Parse an inline evaluator spec string into an GraderConfig.
5656
*
5757
* Supported formats:
5858
* contains:value
@@ -64,7 +64,7 @@ function parseKeyValues(s: string): Record<string, string> {
6464
* token-usage:max_total=N,max_input=N,max_output=N
6565
* execution-metrics:max_tool_calls=N,max_tokens=N,max_llm_calls=N,...
6666
*/
67-
export function parseAssertSpec(spec: string): EvaluatorConfig {
67+
export function parseAssertSpec(spec: string): GraderConfig {
6868
const colonIdx = spec.indexOf(':');
6969
// Normalize snake_case to kebab-case for backward compat
7070
const type = (colonIdx === -1 ? spec : spec.slice(0, colonIdx)).replace(/_/g, '-');
@@ -73,31 +73,31 @@ export function parseAssertSpec(spec: string): EvaluatorConfig {
7373
switch (type) {
7474
case 'contains':
7575
if (!params) throw new Error('contains requires a value: contains:<value>');
76-
return { name: 'contains', type: 'contains', value: params } as EvaluatorConfig;
76+
return { name: 'contains', type: 'contains', value: params } as GraderConfig;
7777

7878
case 'regex':
7979
if (!params) throw new Error('regex requires a pattern: regex:<pattern>');
80-
return { name: 'regex', type: 'regex', value: params } as EvaluatorConfig;
80+
return { name: 'regex', type: 'regex', value: params } as GraderConfig;
8181

8282
case 'is-json':
83-
return { name: 'is-json', type: 'is-json' } as EvaluatorConfig;
83+
return { name: 'is-json', type: 'is-json' } as GraderConfig;
8484

8585
case 'equals':
8686
if (!params) throw new Error('equals requires a value: equals:<value>');
87-
return { name: 'equals', type: 'equals', value: params } as EvaluatorConfig;
87+
return { name: 'equals', type: 'equals', value: params } as GraderConfig;
8888

8989
case 'latency': {
9090
const threshold = Number(params);
9191
if (!params || Number.isNaN(threshold))
9292
throw new Error('latency requires a threshold in ms: latency:<ms>');
93-
return { name: 'latency', type: 'latency', threshold } as EvaluatorConfig;
93+
return { name: 'latency', type: 'latency', threshold } as GraderConfig;
9494
}
9595

9696
case 'cost': {
9797
const budget = Number(params);
9898
if (!params || Number.isNaN(budget))
9999
throw new Error('cost requires a budget in USD: cost:<usd>');
100-
return { name: 'cost', type: 'cost', budget } as EvaluatorConfig;
100+
return { name: 'cost', type: 'cost', budget } as GraderConfig;
101101
}
102102

103103
case 'token-usage': {
@@ -106,7 +106,7 @@ export function parseAssertSpec(spec: string): EvaluatorConfig {
106106
if (kv.max_total) config.max_total = Number(kv.max_total);
107107
if (kv.max_input) config.max_input = Number(kv.max_input);
108108
if (kv.max_output) config.max_output = Number(kv.max_output);
109-
return config as EvaluatorConfig;
109+
return config as GraderConfig;
110110
}
111111

112112
case 'execution-metrics': {
@@ -120,12 +120,12 @@ export function parseAssertSpec(spec: string): EvaluatorConfig {
120120
if (kv.max_tokens) config.max_tokens = Number(kv.max_tokens);
121121
if (kv.max_cost_usd) config.max_cost_usd = Number(kv.max_cost_usd);
122122
if (kv.max_duration_ms) config.max_duration_ms = Number(kv.max_duration_ms);
123-
return config as EvaluatorConfig;
123+
return config as GraderConfig;
124124
}
125125

126126
default:
127127
throw new Error(
128-
`Unsupported evaluator type: "${type}". Supported: ${SUPPORTED_TYPES.join(', ')}`,
128+
`Unsupported grader type: "${type}". Supported: ${SUPPORTED_TYPES.join(', ')}`,
129129
);
130130
}
131131
}
@@ -171,7 +171,7 @@ const stubProvider: Provider = {
171171
/**
172172
* A no-op evaluator stub used as the required llmGrader in the dispatch context.
173173
*/
174-
const stubLlmGrader: Evaluator = {
174+
const stubLlmGrader: Grader = {
175175
kind: 'llm-grader',
176176
evaluate(): EvaluationScore {
177177
throw new Error('trace score does not support LLM-based evaluators');
@@ -189,12 +189,12 @@ interface ScoreResult {
189189

190190
async function runScore(
191191
results: RawResult[],
192-
evaluatorConfig: EvaluatorConfig,
192+
evaluatorConfig: GraderConfig,
193193
testIdFilter?: string,
194194
): Promise<ScoreResult[]> {
195195
const registry = createBuiltinRegistry();
196196

197-
const dispatchContext: EvaluatorDispatchContext = {
197+
const dispatchContext: GraderDispatchContext = {
198198
llmGrader: stubLlmGrader,
199199
registry,
200200
};
@@ -308,7 +308,7 @@ export const traceScoreCommand = command({
308308
long: 'assert',
309309
short: 'a',
310310
description:
311-
'Evaluator spec: contains:<val>, regex:<pat>, is-json, equals:<val>, latency:<ms>, cost:<usd>, token-usage:<params>, execution-metrics:<params>',
311+
'Grader spec: contains:<val>, regex:<pat>, is-json, equals:<val>, latency:<ms>, cost:<usd>, token-usage:<params>, execution-metrics:<params>',
312312
}),
313313
testId: option({
314314
type: optional(string),
@@ -324,7 +324,7 @@ export const traceScoreCommand = command({
324324
},
325325
handler: async ({ file, assert: assertSpec, testId, format }) => {
326326
// Parse the evaluator spec
327-
let evaluatorConfig: EvaluatorConfig;
327+
let evaluatorConfig: GraderConfig;
328328
try {
329329
evaluatorConfig = parseAssertSpec(assertSpec);
330330
} catch (err) {
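The inline `--assert` spec format handled by `parseAssertSpec` above (`type:params`, with comma-separated `key=value` lists for `token-usage` and `execution-metrics`) can be illustrated with a stripped-down parser. This mirrors the parsing strategy shown in the diff; it is a sketch, not the real implementation:

```typescript
// Stripped-down sketch of the inline `--assert` spec format, e.g.
//   contains:<value>   latency:<ms>   token-usage:max_total=N,max_input=N
function splitSpec(spec: string): { type: string; params: string } {
  const colonIdx = spec.indexOf(':');
  // Snake_case type names are normalized to kebab-case for backward compat.
  const type = (colonIdx === -1 ? spec : spec.slice(0, colonIdx)).replace(/_/g, '-');
  const params = colonIdx === -1 ? '' : spec.slice(colonIdx + 1);
  return { type, params };
}

// Key=value parameter lists, e.g. "max_total=100,max_input=80".
function parseKV(s: string): Record<string, string> {
  const out: Record<string, string> = {};
  for (const pair of s.split(',')) {
    const eq = pair.indexOf('=');
    if (eq > 0) out[pair.slice(0, eq)] = pair.slice(eq + 1);
  }
  return out;
}

console.log(splitSpec('token_usage:max_total=100').type); // -> token-usage
```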

apps/cli/src/commands/inspect/show.ts

Lines changed: 1 addition & 1 deletion
@@ -46,7 +46,7 @@ function renderFlatTrace(result: RawResult): string {
 }
 
 /**
- * Render per-evaluator scores inline.
+ * Render per-grader scores inline.
  */
 function renderScores(scores: { name: string; score: number; type: string }[]): string {
   return scores

apps/cli/src/commands/pipeline/bench.ts

Lines changed: 1 addition & 1 deletion
@@ -28,7 +28,7 @@ interface EvaluatorScore {
 
 export const evalBenchCommand = command({
   name: 'bench',
-  description: 'Merge evaluator scores and produce benchmark artifacts',
+  description: 'Merge grader scores and produce benchmark artifacts',
   args: {
     exportDir: positional({
       type: string,
