Commit 2df533a

christso and claude authored
feat(core): auto-discover test cases from directory structure (#1142)
* feat(core): auto-discover test cases from directory structure (#1141)

  When `tests:` points to a directory, scan subdirectories for `case.yaml` files. Directory name becomes the test `id` unless overridden. A `workspace/` subdirectory auto-sets the workspace template.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* style: fix biome formatting for chained method calls

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix(core): address code review findings for directory discovery

  - Update eval-validator to recognize directory paths (no false warning)
  - Use lexicographic sort instead of locale-dependent localeCompare
  - Use strict null check for id injection (not falsy check)

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent b4adcd3 commit 2df533a

8 files changed

Lines changed: 428 additions & 18 deletions

apps/web/src/content/docs/docs/evaluation/eval-files.mdx

Lines changed: 41 additions & 1 deletion
@@ -40,7 +40,7 @@ tests:
 | `suite` | Optional suite identifier |
 | `execution` | Default execution config (`target`, `fail_on_error`, `threshold`, etc.) |
 | `workspace` | Suite-level workspace config — inline object or string path to an [external workspace file](/docs/guides/workspace-pool/#external-workspace-config) |
-| `tests` | Array of individual tests, or a string path to an external file |
+| `tests` | Array of individual tests, or a string path to an external file or directory |
 | `assertions` | Suite-level graders appended to each test unless `execution.skip_defaults: true` is set on the test |
 | `input` | Suite-level input messages prepended to each test's input unless `execution.skip_defaults: true` is set on the test |
 
@@ -178,6 +178,46 @@ tests: ./cases.yaml
 
 The path is resolved relative to the eval file's directory. The external file should contain a YAML array of test objects or a JSONL file with one test per line.
 
+### Tests as Directory Path
+
+When `tests` points to a directory, AgentV auto-discovers test cases from subdirectories. Each subdirectory containing a `case.yaml` (or `case.yml`) becomes a test case:
+
+```
+my-eval/
+  EVAL.yaml
+  cases/
+    fix-null-check/
+      case.yaml
+    add-greeting/
+      case.yaml
+      workspace/ # optional per-case workspace template
+        setup-files...
+```
+
+```yaml
+# EVAL.yaml
+name: my-benchmark
+tests: ./cases/
+```
+
+Each `case.yaml` is a single YAML object (not an array) with the same fields as an inline test:
+
+```yaml
+# cases/fix-null-check/case.yaml
+criteria: Fixes the null reference bug in the parser module
+input: Fix the null check bug in parser.ts
+```
+
+**Behavior:**
+
+- **Directory name as `id`:** If `case.yaml` doesn't specify an `id`, the directory name is used (e.g., `fix-null-check`)
+- **Alphabetical ordering:** Subdirectories are sorted alphabetically for deterministic order
+- **Per-case workspace:** A `workspace/` subdirectory inside the case directory automatically sets `workspace.template` to that path, unless the case already defines a `workspace` field
+- **Skipped directories:** Subdirectories without `case.yaml` are skipped with a warning
+- **Suite-level config applies:** Suite-level `assertions`, `input`, `workspace`, and `execution` still apply to directory-discovered cases
+
+This pattern is useful for benchmarks with many cases, where each case benefits from its own directory for workspace templates, supporting files, or documentation.
+
 ## Environment Variable Interpolation
 
 All string fields in eval files support `${{ VAR }}` syntax for environment variable interpolation. This enables portable eval configs that work across machines and CI environments without hardcoded paths.
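The `id` and `workspace` rules from the "Tests as Directory Path" section can be sketched as a small pure function. This is an illustration only, not the code from this commit: `applyDirectoryDefaults`, `CaseObject`, and the `hasWorkspaceDir` flag are hypothetical names, and the filesystem check is replaced by a boolean so the example runs without touching the disk.

```typescript
// Hypothetical shape for a parsed case.yaml; real cases carry more fields.
type CaseObject = Record<string, unknown>;

// Sketch: apply directory-derived defaults to one parsed case.
// `hasWorkspaceDir` stands in for a filesystem check on `<caseDir>/workspace`.
function applyDirectoryDefaults(
  parsed: CaseObject,
  dirName: string,
  caseDir: string,
  hasWorkspaceDir: boolean,
): CaseObject {
  const caseObj: CaseObject = { ...parsed };

  // Strict null check: only a missing or null id is replaced.
  if (caseObj.id === undefined || caseObj.id === null) {
    caseObj.id = dirName;
  }

  // A workspace already defined on the case takes precedence.
  if (!caseObj.workspace && hasWorkspaceDir) {
    caseObj.workspace = { template: `${caseDir}/workspace` };
  }

  return caseObj;
}

const discovered = applyDirectoryDefaults(
  { criteria: 'Adds a greeting message', input: 'Add a greeting feature' },
  'add-greeting',
  'cases/add-greeting',
  true,
);
const overridden = applyDirectoryDefaults(
  { id: 'custom-id' },
  'fix-null-check',
  'cases/fix-null-check',
  false,
);
console.log(discovered.id, overridden.id);
```

Here `discovered` picks up both the directory name and the workspace template, while `overridden` keeps its explicit `id` and gets no `workspace` field.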
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+name: directory-discovery
+description: Demonstrates auto-discovering test cases from a directory structure
+
+tests: ./cases/
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+criteria: Adds a greeting message that displays the user's name
+input: |
+  Add a greeting feature to the homepage. When a user logs in,
+  display "Welcome back, {name}!" at the top of the page.
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+criteria: Identifies and fixes the null reference bug in the parser module
+input: |
+  Fix the null check bug in parser.ts. The function `parseToken` crashes
+  when given an empty string because it doesn't check for null before
+  accessing `.length`.

packages/core/src/evaluation/loaders/case-file-loader.ts

Lines changed: 83 additions & 1 deletion
@@ -1,4 +1,4 @@
-import { readFile } from 'node:fs/promises';
+import { readFile, readdir, stat } from 'node:fs/promises';
 import path from 'node:path';
 import fg from 'fast-glob';
 import { parse as parseYaml } from 'yaml';
@@ -158,6 +158,88 @@ export async function resolveFileReference(
   return loadCasesFromFile(absolutePattern);
 }
 
+/**
+ * Load test cases from a directory structure.
+ * Scans immediate subdirectories for case.yaml/case.yml files.
+ * Each subdirectory becomes a test case, with the directory name used as `id`
+ * if the case file doesn't specify one. A `workspace/` subdirectory in the
+ * case directory sets the workspace template automatically.
+ */
+export async function loadCasesFromDirectory(dirPath: string): Promise<JsonObject[]> {
+  const entries = await readdir(dirPath, { withFileTypes: true });
+  const subdirs = entries
+    .filter((e) => e.isDirectory())
+    .sort((a, b) => (a.name < b.name ? -1 : a.name > b.name ? 1 : 0));
+
+  const results: JsonObject[] = [];
+  for (const subdir of subdirs) {
+    const subdirPath = path.join(dirPath, subdir.name);
+
+    // Look for case.yaml or case.yml
+    let caseFilePath: string | undefined;
+    for (const filename of ['case.yaml', 'case.yml']) {
+      const candidate = path.join(subdirPath, filename);
+      try {
+        const s = await stat(candidate);
+        if (s.isFile()) {
+          caseFilePath = candidate;
+          break;
+        }
+      } catch {
+        // File doesn't exist, try next
+      }
+    }
+
+    if (!caseFilePath) {
+      console.warn(
+        `${ANSI_YELLOW}Warning: Skipping directory '${subdir.name}' — no case.yaml found${ANSI_RESET}`,
+      );
+      continue;
+    }
+
+    // Parse case.yaml as a single object (not array)
+    let content: string;
+    try {
+      content = await readFile(caseFilePath, 'utf8');
+    } catch (error) {
+      const message = error instanceof Error ? error.message : String(error);
+      throw new Error(`Cannot read case file: ${caseFilePath}\n ${message}`);
+    }
+
+    const raw = parseYaml(content) as unknown;
+    const parsed = interpolateEnv(raw, process.env);
+    if (!isJsonObject(parsed)) {
+      throw new Error(
+        `Case file must contain a YAML object, got ${typeof parsed}: ${caseFilePath}`,
+      );
+    }
+
+    const caseObj = { ...parsed };
+
+    // Inject id from directory name if not specified
+    if (caseObj.id === undefined || caseObj.id === null) {
+      caseObj.id = subdir.name;
+    }
+
+    // Check for workspace/ subdirectory
+    if (!caseObj.workspace) {
+      const workspaceDirPath = path.join(subdirPath, 'workspace');
+      try {
+        const s = await stat(workspaceDirPath);
+        if (s.isDirectory()) {
+          caseObj.workspace = { template: workspaceDirPath };
+        }
+      } catch {
+        // No workspace directory, that's fine
+      }
+    }
+
+    results.push(caseObj);
+  }
+
+  return results;
+}
+
 /**
  * Process a tests array, expanding any file:// references into inline test objects.
  * Returns a flat array of JsonValue where all file:// strings are replaced
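The comparator in `loadCasesFromDirectory` orders names with the `<` and `>` operators instead of `localeCompare`. A minimal sketch of what that ordering looks like, using made-up directory names and a hypothetical helper called `compareCodeUnits`:

```typescript
// Compare by UTF-16 code units, the same ordering the `<` and `>` operators use.
function compareCodeUnits(a: string, b: string): number {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Hypothetical case directory names.
const names = ['fix-null-check', 'add-greeting', 'Zeta-case', 'case-10', 'case-2'];

const sorted = [...names].sort(compareCodeUnits);
console.log(sorted);
// [ 'Zeta-case', 'add-greeting', 'case-10', 'case-2', 'fix-null-check' ]
// Uppercase "Z" (code unit 90) sorts before lowercase "a" (97),
// and "case-10" sorts before "case-2" because "1" < "2".
```

The result does not depend on the runtime's locale settings, which is what makes the discovery order deterministic across machines. A locale-aware comparison could place mixed-case or numbered names differently.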

packages/core/src/evaluation/validation/eval-validator.ts

Lines changed: 44 additions & 11 deletions
@@ -1,9 +1,9 @@
-import { readFile, readdir } from 'node:fs/promises';
+import { readFile, readdir, stat } from 'node:fs/promises';
 import path from 'node:path';
 import { parse } from 'yaml';
 
 import { interpolateEnv } from '../interpolation.js';
-import { loadCasesFromFile } from '../loaders/case-file-loader.js';
+import { loadCasesFromDirectory, loadCasesFromFile } from '../loaders/case-file-loader.js';
 import { isGraderKind } from '../types.js';
 import type { ValidationError, ValidationResult } from './types.js';
 
@@ -234,20 +234,27 @@ export async function validateEvalFile(filePath: string): Promise<ValidationResu
 
   const cases: JsonValue | undefined = parsed.tests;
 
-  // tests can be a string path (external file reference) or an array
+  // tests can be a string path (external file/directory reference) or an array
   if (typeof cases === 'string') {
-    validateTestsStringPath(cases, absolutePath, errors);
     await validateWorkspaceConfig(parsed.workspace, absolutePath, errors, 'workspace');
 
-    const ext = path.extname(cases).toLowerCase();
-    if (VALID_TEST_FILE_EXTENSIONS.has(ext)) {
-      const externalCasesPath = path.resolve(path.dirname(absolutePath), cases);
+    const externalCasesPath = path.resolve(path.dirname(absolutePath), cases);
+    let isDir = false;
+    try {
+      const pathStat = await stat(externalCasesPath);
+      isDir = pathStat.isDirectory();
+    } catch {
+      // Path doesn't exist — fall through to file validation
+    }
+
+    if (isDir) {
+      // Directory path: load and validate discovered cases
       try {
-        const externalCases = await loadCasesFromFile(externalCasesPath);
-        for (let i = 0; i < externalCases.length; i++) {
-          const externalCase = externalCases[i];
+        const dirCases = await loadCasesFromDirectory(externalCasesPath);
+        for (let i = 0; i < dirCases.length; i++) {
+          const dirCase = dirCases[i];
           await validateWorkspaceConfig(
-            externalCase.workspace,
+            dirCase.workspace,
             absolutePath,
             errors,
             `tests[${i}].workspace`,
@@ -262,6 +269,32 @@ export async function validateEvalFile(filePath: string): Promise<ValidationResu
           message,
         });
       }
+    } else {
+      // File path: validate extension and load
+      validateTestsStringPath(cases, absolutePath, errors);
+      const ext = path.extname(cases).toLowerCase();
+      if (VALID_TEST_FILE_EXTENSIONS.has(ext)) {
+        try {
+          const externalCases = await loadCasesFromFile(externalCasesPath);
+          for (let i = 0; i < externalCases.length; i++) {
+            const externalCase = externalCases[i];
+            await validateWorkspaceConfig(
+              externalCase.workspace,
+              absolutePath,
+              errors,
+              `tests[${i}].workspace`,
+            );
+          }
+        } catch (error) {
+          const message = error instanceof Error ? error.message : String(error);
+          errors.push({
+            severity: 'error',
+            filePath: absolutePath,
+            location: 'tests',
+            message,
+          });
+        }
+      }
     }
 
     return {
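The validator probes the `tests` path with `stat` and treats any failure as "not a directory", so a missing path falls through to the file branch. A rough sketch of that probe as a standalone helper follows. `isExistingDirectory` is a hypothetical name, and the synchronous `node:fs` calls are used here only to keep the example short; the commit itself uses the async `stat` from `node:fs/promises`.

```typescript
import { mkdtempSync, statSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Sketch: true only when the path exists and is a directory.
function isExistingDirectory(candidate: string): boolean {
  try {
    return statSync(candidate).isDirectory();
  } catch {
    // Missing or unreadable path: report "not a directory" and leave the
    // error reporting to whichever branch handles file paths.
    return false;
  }
}

// Build a throwaway directory containing one file to probe.
const root = mkdtempSync(join(tmpdir(), 'cases-'));
const filePath = join(root, 'cases.yaml');
writeFileSync(filePath, '[]\n');

const dirResult = isExistingDirectory(root);
const fileResult = isExistingDirectory(filePath);
const missingResult = isExistingDirectory(join(root, 'does-not-exist'));
console.log({ dirResult, fileResult, missingResult });
```

Swallowing the error keeps the decision binary, which is why the extension check and its warning now run only in the non-directory branch.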

packages/core/src/evaluation/yaml-parser.ts

Lines changed: 19 additions & 5 deletions
@@ -1,12 +1,16 @@
-import { readFile } from 'node:fs/promises';
+import { readFile, stat } from 'node:fs/promises';
 import path from 'node:path';
 import micromatch from 'micromatch';
 import { parse } from 'yaml';
 
 import { collectResolvedInputFilePaths } from './input-message-utils.js';
 import { interpolateEnv } from './interpolation.js';
 import { loadTestsFromAgentSkills } from './loaders/agent-skills-parser.js';
-import { expandFileReferences, loadCasesFromFile } from './loaders/case-file-loader.js';
+import {
+  expandFileReferences,
+  loadCasesFromDirectory,
+  loadCasesFromFile,
+} from './loaders/case-file-loader.js';
 import {
   extractBudgetUsd,
   extractCacheConfig,
@@ -332,12 +336,22 @@ async function loadTestsFromYaml(
   // Parse suite-level workspace config (default for all cases)
   const evalFileDir = path.dirname(absoluteTestPath);
 
-  // Resolve tests: string path to external file, inline array, or error
+  // Resolve tests: string path to external file/directory, inline array, or error
   let expandedTestCases: readonly JsonValue[];
   if (typeof rawTestCases === 'string') {
-    // String path: load tests from external file (YAML, JSONL)
     const externalPath = path.resolve(evalFileDir, rawTestCases);
-    expandedTestCases = await loadCasesFromFile(externalPath);
+    let isDir = false;
+    try {
+      const pathStat = await stat(externalPath);
+      isDir = pathStat.isDirectory();
+    } catch {
+      // Path doesn't exist — fall through to loadCasesFromFile for its error message
+    }
+    if (isDir) {
+      expandedTestCases = await loadCasesFromDirectory(externalPath);
+    } else {
+      expandedTestCases = await loadCasesFromFile(externalPath);
+    }
   } else if (Array.isArray(rawTestCases)) {
     // Inline array: expand any file:// references
     expandedTestCases = await expandFileReferences(rawTestCases, evalFileDir);
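In the hunk above, `externalPath` is resolved against the eval file's directory, not the process working directory. A small sketch of that resolution, assuming made-up paths and a hypothetical helper named `resolveTestsPath`. It uses `path.posix` so the result is the same on every platform, whereas the commit calls the platform-default `path.resolve`.

```typescript
import { posix } from 'node:path';

// Sketch: resolve a `tests` string relative to the directory holding the eval file.
function resolveTestsPath(evalFilePath: string, tests: string): string {
  return posix.resolve(posix.dirname(evalFilePath), tests);
}

// Hypothetical locations, for illustration only.
const dirTarget = resolveTestsPath('/repo/my-eval/EVAL.yaml', './cases/');
const fileTarget = resolveTestsPath('/repo/my-eval/EVAL.yaml', '../shared/cases.yaml');

console.log(dirTarget); // /repo/my-eval/cases
console.log(fileTarget); // /repo/shared/cases.yaml
```

The trailing slash in `./cases/` is dropped during resolution, so whether the target is a directory cannot be read off the string and has to be checked on disk.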
