A harness for benchmarking coding LLMs against YOUR stack. Not a generic benchmark -- a framework for building stack-specific coding evaluations. Test Claude Code, opencode, Aider, or any CLI against tasks that mirror your actual development work.
SWE-bench is primarily Python, has known contamination issues, and doesn't test your stack. This tool lets you write tasks based on your real bugs, your real patterns, and your real framework. You bring the tasks; the harness handles execution, evaluation, and reporting.
```bash
git clone <repo> && cd llm-benchmark
pnpm install

# List available tasks (use `run`: bare `pnpm list` is pnpm's built-in dependency listing)
pnpm run list

# Dry run (no LLM calls)
pnpm benchmark --cli claude-code --model claude-sonnet-4-6 --tasks all --dry-run

# Run for real
pnpm benchmark --cli claude-code --model claude-sonnet-4-6 --tasks all --runs 1

# With opencode
pnpm benchmark --cli opencode --model github-copilot/gpt-5.3-codex --tasks all --runs 1

# Generate report
npx tsx scripts/report.ts
```

Each task lives in its own directory:

```
tasks/my-task/
├── task.json          # Metadata and evaluation config
├── prompt.md          # What the LLM sees
├── repo/              # Starting codebase
│   ├── src/
│   ├── package.json
│   └── tsconfig.json
└── eval/              # Hidden tests (copied in AFTER the LLM finishes)
    ├── tests.test.ts
    └── rubric.json
```
`prompt.md` is what the LLM agent receives. It should:
- Include a `<!-- benchmark-canary: do-not-memorize -->` comment for contamination detection
- Clearly state the requirements
- Specify which files to modify and which are readonly
- Provide enough context for the LLM to work without exploring the entire repo
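A canary check along these lines could back the first bullet. This is a minimal sketch: `CANARY`, `hasCanary`, and `leaksCanary` are hypothetical names rather than harness APIs, and treating a canary that appears in model output as a contamination signal is an assumption about how the marker is meant to be used.

```typescript
// The marker string comes from the prompt guidelines; everything else here is hypothetical.
const CANARY = "<!-- benchmark-canary: do-not-memorize -->";

// True when a prompt.md body carries the canary comment.
function hasCanary(promptMarkdown: string): boolean {
  return promptMarkdown.includes(CANARY);
}

// Assumption: a model that reproduces the canary when the text it was shown did
// not contain it may have seen the task during training.
function leaksCanary(modelOutput: string, promptShown: string): boolean {
  return modelOutput.includes(CANARY) && !promptShown.includes(CANARY);
}
```

A validation step could call `hasCanary` on every `prompt.md`, while `leaksCanary` would only be meaningful for probe prompts that deliberately omit the marker.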
`eval/tests.test.ts` is the hidden test suite. It is copied into the workspace after the LLM finishes, so the LLM never sees it during execution. Tests should verify behavior (HTTP responses, function outputs), not implementation details (variable names, internal structure).
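The ordering matters more than the mechanics, so here is a rough sketch of it using an in-memory map in place of real directories. `FileTree` and `runTask` are hypothetical; the actual workspace handling in the harness may look quite different.

```typescript
// In-memory stand-in for a directory: relative path -> file contents.
type FileTree = Map<string, string>;

// Hypothetical lifecycle: the agent only ever sees a copy of repo/.
function runTask(
  repo: FileTree,
  hiddenEval: FileTree,
  agent: (workspace: FileTree) => void,
): FileTree {
  // 1. Start from a fresh copy of repo/ so the original stays untouched.
  const workspace: FileTree = new Map(repo);
  // 2. Let the agent edit files; nothing under eval/ exists at this point.
  agent(workspace);
  // 3. Copy the hidden tests in afterwards, overwriting anything the agent put there.
  hiddenEval.forEach((text, path) => workspace.set(`eval/${path}`, text));
  // 4. The workspace is now ready for the test, lint, and rubric steps.
  return workspace;
}
```

Because step 3 runs last, an agent that writes its own `eval/tests.test.ts` would have it replaced by the real one in this sketch.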
Uses vitest. Example:
```ts
import { describe, it, expect } from "vitest";

describe("greeting endpoint", () => {
  it("returns greeting with name", async () => {
    const res = await fetch("http://localhost:3000/greet?name=World");
    expect(res.status).toBe(200);
    const body = await res.json();
    expect(body.message).toBe("Hello, World!");
  });
});
```

`rubric.json` -- automated pattern checks applied to the workspace:
```json
{
  "checks": [
    { "type": "file_exists", "path": "src/routes/greeting.ts", "points": 1 },
    { "type": "file_not_modified", "path": "src/index.ts", "points": 1 },
    { "type": "no_pattern", "glob": "src/**/*.ts", "pattern": "as any", "points": 1 },
    { "type": "pattern_present", "glob": "src/**/*.ts", "pattern": "export", "points": 1 }
  ]
}
```

Check types:
| Type | Description |
|---|---|
| `file_exists` | File must exist at `path` |
| `file_not_modified` | File at `path` must match the original |
| `no_pattern` | Pattern must NOT appear in files matching `glob` |
| `pattern_present` | Pattern must appear in files matching `glob` |
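To make the four check types concrete, here is a minimal sketch of a rubric scorer over in-memory files. It is not the harness's evaluator: `Check`, `FileMap`, `globToRegExp`, and `rubricScore` are hypothetical names, the glob support is deliberately tiny, and `pattern` is assumed to be a literal substring rather than a regular expression.

```typescript
// Check shapes follow the rubric.json example.
type Check =
  | { type: "file_exists"; path: string; points: number }
  | { type: "file_not_modified"; path: string; points: number }
  | { type: "no_pattern"; glob: string; pattern: string; points: number }
  | { type: "pattern_present"; glob: string; pattern: string; points: number };

// Relative path -> file contents; stands in for the workspace on disk.
type FileMap = Map<string, string>;

// Assumption: "**/" spans any number of directories and "*" stays within one segment.
function globToRegExp(glob: string): RegExp {
  const escaped = glob.replace(/[.+^${}()|[\]\\?]/g, "\\$&");
  const body = escaped
    .replace(/\*\*\//g, "\u0000") // park "**/" so the next step does not touch it
    .replace(/\*/g, "[^/]*")
    .replace(/\u0000/g, "(?:.*/)?");
  return new RegExp(`^${body}$`);
}

// Contents of every file whose path matches the glob.
function matching(files: FileMap, glob: string): string[] {
  const re = globToRegExp(glob);
  return Array.from(files.entries())
    .filter(([path]) => re.test(path))
    .map(([, text]) => text);
}

function passes(check: Check, workspace: FileMap, original: FileMap): boolean {
  switch (check.type) {
    case "file_exists":
      return workspace.has(check.path);
    case "file_not_modified":
      return workspace.has(check.path) && workspace.get(check.path) === original.get(check.path);
    case "no_pattern":
      return matching(workspace, check.glob).every((text) => !text.includes(check.pattern));
    case "pattern_present":
      return matching(workspace, check.glob).some((text) => text.includes(check.pattern));
  }
}

// Fraction of rubric points earned, in the range 0 to 1.
function rubricScore(checks: Check[], workspace: FileMap, original: FileMap): number {
  const possible = checks.reduce((sum, c) => sum + c.points, 0);
  const earned = checks.reduce((sum, c) => sum + (passes(c, workspace, original) ? c.points : 0), 0);
  return possible > 0 ? earned / possible : 0;
}
```

With the four example checks, a workspace that adds `src/routes/greeting.ts` and leaves `src/index.ts` alone would earn every point; editing `src/index.ts` would drop one of the four.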
- Create a directory under `tasks/` matching your task ID
- Write `task.json` with metadata, stack, and evaluation config
- Build a synthetic repo in `repo/` -- avoid using real production code to prevent contamination
- Write the prompt in `prompt.md`
- Write eval tests in `eval/` that verify expected behavior
- Add the task entry to `tasks/task-manifest.json`
- Validate the task structure: `pnpm validate`
Tips:
- Base tasks on real work you do, but use synthetic codebases (not copy-pasted production code) to prevent contamination
- Test HTTP behavior, function outputs, and file structure -- not internal method names or variable choices
- Include a mix of difficulties across categories
- Planning tasks can test written output with text assertions (use `test_weight: 0.9, lint_weight: 0.0, rubric_weight: 0.1`)
| Category | Description |
|---|---|
| `api-endpoint` | Build or modify HTTP API routes |
| `react-frontend` | React component and UI tasks |
| `library-utility` | Standalone utility functions and modules |
| `database-operation` | Database queries, aggregations, migrations |
| `docker-infrastructure` | Dockerfiles, Compose, deployment configs |
| `bug-fix` | Find and fix bugs in existing code |
| `planning` | Produce written plans, architecture docs, or investigation reports |
| Adapter | Flag | How it runs | Model format |
|---|---|---|---|
| Claude Code | `--cli claude-code` | `claude --print --model X` | `claude-opus-4-6` |
| opencode | `--cli opencode` | `opencode run -m X` | `provider/model` (e.g. `github-copilot/gpt-5.3-codex`) |
| Aider | `--cli aider` | `aider --yes-always --model X` | Model name |
| Manual | `--cli manual` | Human runs the task | N/A |
Adapters live in harness/src/adapters/. Adding a new one means implementing the adapter interface and registering it in adapters/index.ts.
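As a rough idea of what an adapter and its registry might look like, here is a sketch built only from the command lines in the table. The `CliAdapter` interface, `buildCommand`, and `getAdapter` are assumptions for illustration; the real interface in `adapters/types.ts` may differ, and how the prompt is passed to each CLI is left out because the table does not say.

```typescript
// Hypothetical adapter shape; not the interface defined by the harness.
interface CliAdapter {
  name: string;
  // Returns the argv used to launch the agent for a given model.
  buildCommand(model: string): string[];
}

// Registry keyed by the value passed to --cli.
const adapters: Record<string, CliAdapter> = {
  "claude-code": {
    name: "claude-code",
    buildCommand: (model) => ["claude", "--print", "--model", model],
  },
  opencode: {
    name: "opencode",
    buildCommand: (model) => ["opencode", "run", "-m", model],
  },
  aider: {
    name: "aider",
    buildCommand: (model) => ["aider", "--yes-always", "--model", model],
  },
};

// Looks up an adapter and fails loudly on an unknown --cli value.
function getAdapter(cli: string): CliAdapter {
  const adapter = adapters[cli];
  if (!adapter) {
    throw new Error(`Unknown --cli value "${cli}". Known: ${Object.keys(adapters).join(", ")}`);
  }
  return adapter;
}
```

Under this shape, adding a new CLI would mean one more entry in the registry with its own `buildCommand`.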
```
score = (test_pass_rate * test_weight) + (lint_pass * lint_weight) + (rubric_score * rubric_weight)
```

- `test_pass_rate`: Fraction of vitest tests that pass (0.0 to 1.0)
- `lint_pass`: Binary (1.0 if `tsc --noEmit` exits 0, else 0.0)
- `rubric_score`: Fraction of rubric check points earned
Default weights: tests 0.7, lint 0.2, rubric 0.1. Planning tasks typically use tests 0.9, lint 0.0, rubric 0.1.
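A worked example may help here. The sketch below applies the formula with the default weights; `EvalResult`, `Weights`, and `computeScore` are hypothetical names, and treating an empty test suite or empty rubric as 0 is an assumption, not documented behavior.

```typescript
// Field names mirror the weights in task.json.
interface Weights {
  test_weight: number;
  lint_weight: number;
  rubric_weight: number;
}

// Hypothetical raw evaluation output for one run.
interface EvalResult {
  testsPassed: number;
  testsTotal: number;
  lintExitCode: number; // exit code of `tsc --noEmit`
  rubricEarned: number;
  rubricPossible: number;
}

const DEFAULT_WEIGHTS: Weights = { test_weight: 0.7, lint_weight: 0.2, rubric_weight: 0.1 };

function computeScore(r: EvalResult, w: Weights = DEFAULT_WEIGHTS): number {
  // Assumption: an empty suite or rubric contributes 0 instead of dividing by zero.
  const testPassRate = r.testsTotal > 0 ? r.testsPassed / r.testsTotal : 0;
  const lintPass = r.lintExitCode === 0 ? 1 : 0;
  const rubricScore = r.rubricPossible > 0 ? r.rubricEarned / r.rubricPossible : 0;
  return testPassRate * w.test_weight + lintPass * w.lint_weight + rubricScore * w.rubric_weight;
}

// 8 of 10 tests pass, lint is clean, 3 of 4 rubric points earned:
// 0.8 * 0.7 + 1.0 * 0.2 + 0.75 * 0.1 = 0.835
const example = computeScore({
  testsPassed: 8,
  testsTotal: 10,
  lintExitCode: 0,
  rubricEarned: 3,
  rubricPossible: 4,
});
console.log(example.toFixed(3)); // prints "0.835"
```

Since lint is binary, a single type error costs the full lint weight (0.2 by default) no matter how many tests pass.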
```bash
# Leaderboard + per-task breakdown from existing result data
npx tsx scripts/report.ts

# Re-run evaluation against existing workspaces (incremental -- only missing results)
npx tsx scripts/re-evaluate.ts

# Re-evaluate everything from scratch
npx tsx scripts/re-evaluate.ts --full
```

Reports are generated as JSON and Markdown in the `results/` directory.
```
harness/src/
├── cli.ts           # CLI entry point (run, validate, list)
├── runner.ts        # Benchmark orchestrator with rate-limit retry
├── evaluator.ts     # Test runner, lint checker, rubric scorer
├── container.ts     # Workspace lifecycle (copy repo, install, run, eval)
├── stats.ts         # Bootstrap CI, Wilcoxon signed-rank test
├── reporter.ts      # JSON + Markdown report generation
├── validate.ts      # Task structure validation
└── adapters/        # CLI-specific adapters
    ├── index.ts
    ├── types.ts
    ├── claude-code.ts
    ├── opencode.ts
    ├── aider.ts
    └── manual.ts
```
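For readers unfamiliar with the bootstrap CI mentioned for `stats.ts`, here is a minimal percentile-bootstrap sketch for the mean of per-run scores. It is illustrative only: `bootstrapMeanCI`, its defaults (2000 resamples, 95% interval), and the seeded generator are assumptions, and the harness may use a different bootstrap variant.

```typescript
// Small seeded PRNG (mulberry32) so the resampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Percentile bootstrap: resample with replacement, take the mean of each
// resample, then read the interval off the sorted means.
function bootstrapMeanCI(
  scores: number[],
  resamples = 2000,
  alpha = 0.05,
  seed = 1,
): [number, number] {
  if (scores.length === 0) throw new Error("need at least one score");
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let i = 0; i < resamples; i++) {
    const sample = scores.map(() => scores[Math.floor(rand() * scores.length)]);
    means.push(mean(sample));
  }
  means.sort((x, y) => x - y);
  const lower = means[Math.floor((alpha / 2) * resamples)];
  const upper = means[Math.min(resamples - 1, Math.floor((1 - alpha / 2) * resamples))];
  return [lower, upper];
}
```

With only a handful of runs per task the interval will be wide, which is itself useful information when two models look close on a leaderboard.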
- SanityHarness — Tests 19 CLI agents on curated algorithmic tasks across 6 languages. Great for comparing agents on standardised problems, but tasks are fixed (not bring-your-own).
- Calibra — Custom task matrix with detailed reporting, but tests model APIs rather than CLI agents.
- SWE-bench — Industry-standard benchmark from real GitHub issues. Fixed dataset, primarily Python.
- Terminal-Bench — Terminal agent benchmark with Docker-containerised tasks. Community contributions welcome but not a bring-your-own-repo harness.
- Inspect AI — General-purpose LLM eval framework from UK AISI. Extensible but requires custom scaffolding for CLI agent testing.
We built this because we needed to test LLMs against our specific TypeScript/Ghost stack and none of the above quite fit. If you have similar needs, it might save you some time.
MIT
`task.json` reference, with each field annotated:

```jsonc
{
  // Identity
  "id": "my-task",                                // Unique identifier, matches directory name
  "version": 1,                                   // Bump when task changes materially
  "category": "api-endpoint",                     // See the categories table
  "difficulty": "medium",                         // easy | medium | hard
  "title": "Short human-readable title",
  "description": "What the task is about.",

  // Stack requirements
  "stack": {
    "runtime": "node:20-alpine",                  // Docker image for execution
    "package_manager": "npm",                     // npm | pnpm
    "test_runner": "vitest",                      // Test framework used in eval/
    "key_dependencies": ["hono"]                  // Notable packages in the repo
  },

  // Setup
  "setup": {
    "install_command": "npm install --silent",
    "pre_test_command": null,                     // Optional command before tests (e.g. build step)
    "environment": { "NODE_ENV": "test" },
    "services": []                                // Future: database containers, etc.
  },

  // Evaluation
  "evaluation": {
    "test_command": "node_modules/.bin/vitest run eval/tests.test.ts 2>&1",
    "lint_command": "node_modules/.bin/tsc --noEmit",
    "test_weight": 0.7,                           // Weight of test pass rate in final score
    "lint_weight": 0.2,                           // Weight of lint pass (binary) in final score
    "rubric_weight": 0.1,                         // Weight of rubric checks in final score
    "timeout_seconds": 60                         // Max time for LLM to work on the task
  },

  // Files
  "prompt_file": "prompt.md",                     // Prompt sent to the LLM
  "files_to_modify": ["src/routes/greeting.ts"],  // Files the LLM should change
  "files_readonly": ["src/index.ts"],             // Files the LLM must not touch
  "tags": ["hono", "api"]                         // For filtering
}
```
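The annotated fields suggest a few consistency checks that are cheap to run before a benchmark. The sketch below is hypothetical and is not what `pnpm validate` does: `TaskConfig` covers only a subset of fields, and the rule that the three weights sum to 1 is an assumption inferred from the default and planning weights.

```typescript
// Subset of the task.json fields; only what the checks below need.
interface TaskConfig {
  id: string;
  evaluation: { test_weight: number; lint_weight: number; rubric_weight: number };
  files_to_modify: string[];
  files_readonly: string[];
}

// Returns human-readable problems; an empty list means the config looks consistent.
function checkTaskConfig(task: TaskConfig, directoryName: string): string[] {
  const problems: string[] = [];

  // The id is documented as matching the directory name.
  if (task.id !== directoryName) {
    problems.push(`id "${task.id}" does not match directory "${directoryName}"`);
  }

  // Assumption: weights are meant to sum to 1. Compare with a tolerance because
  // floating-point addition of 0.7, 0.2 and 0.1 need not equal 1 exactly.
  const { test_weight, lint_weight, rubric_weight } = task.evaluation;
  const total = test_weight + lint_weight + rubric_weight;
  if (Math.abs(total - 1) > 1e-9) {
    problems.push(`weights sum to ${total}, expected 1`);
  }

  // A file cannot sensibly be both a target and off-limits.
  const locked = new Set(task.files_readonly);
  for (const file of task.files_to_modify) {
    if (locked.has(file)) problems.push(`${file} is listed as both modifiable and readonly`);
  }

  return problems;
}
```

Run against the example config, this would return an empty list; a task with weights of 0.7, 0.2, and 0.2 would be flagged.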