llm-benchmark

A harness for benchmarking coding LLMs against YOUR stack. Not a generic benchmark -- a framework for building stack-specific coding evaluations. Test Claude Code, opencode, Aider, or any CLI against tasks that mirror your actual development work.

Why?

SWE-bench is primarily Python, has contamination issues, and doesn't test your stack. This tool lets you write tasks based on your real bugs, your real patterns, your real framework. You bring the tasks; the harness handles execution, evaluation, and reporting.

Quick Start

git clone <repo> && cd llm-benchmark
pnpm install

# List available tasks
pnpm run list   # "pnpm list" alone is pnpm's built-in package listing, so use "run"

# Dry run (no LLM calls)
pnpm benchmark --cli claude-code --model claude-sonnet-4-6 --tasks all --dry-run

# Run for real
pnpm benchmark --cli claude-code --model claude-sonnet-4-6 --tasks all --runs 1

# With opencode
pnpm benchmark --cli opencode --model github-copilot/gpt-5.3-codex --tasks all --runs 1

# Generate report
npx tsx scripts/report.ts

Task Structure

tasks/my-task/
├── task.json        # Metadata and evaluation config
├── prompt.md        # What the LLM sees
├── repo/            # Starting codebase
│   ├── src/
│   ├── package.json
│   └── tsconfig.json
└── eval/            # Hidden tests (copied in AFTER the LLM finishes)
    ├── tests.test.ts
    └── rubric.json
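
The ordering noted in the tree above (eval/ is copied in only after the LLM finishes) can be sketched roughly as follows. This is a minimal illustration, not the harness's actual container.ts: the function name `runTaskInWorkspace` and the synchronous `runAgent` callback are assumptions made for the example, and a real adapter would spawn a CLI process instead.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper: builds a workspace from a task directory, lets the
// agent work in it, and only then copies the hidden eval/ files in.
export function runTaskInWorkspace(
  taskDir: string,
  runAgent: (workspace: string) => void, // stand-in for a CLI adapter call
): string {
  const workspace = fs.mkdtempSync(path.join(os.tmpdir(), "llm-benchmark-"));

  // 1. Start from the contents of the task's repo/ directory only.
  fs.cpSync(path.join(taskDir, "repo"), workspace, { recursive: true });

  // 2. The agent edits the workspace; eval/ does not exist there yet.
  runAgent(workspace);

  // 3. Copy the hidden tests and rubric in for evaluation.
  fs.cpSync(path.join(taskDir, "eval"), path.join(workspace, "eval"), {
    recursive: true,
  });

  return workspace;
}
```

Because step 3 happens after the agent returns, nothing the agent reads during its run can reveal the test file or the rubric.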

task.json

{
  // Identity
  "id": "my-task",                    // Unique identifier, matches directory name
  "version": 1,                       // Bump when task changes materially
  "category": "api-endpoint",         // See Categories below
  "difficulty": "medium",             // easy | medium | hard
  "title": "Short human-readable title",
  "description": "What the task is about.",

  // Stack requirements
  "stack": {
    "runtime": "node:20-alpine",      // Docker image for execution
    "package_manager": "npm",         // npm | pnpm
    "test_runner": "vitest",          // Test framework used in eval/
    "key_dependencies": ["hono"]      // Notable packages in the repo
  },

  // Setup
  "setup": {
    "install_command": "npm install --silent",
    "pre_test_command": null,         // Optional command before tests (e.g. build step)
    "environment": { "NODE_ENV": "test" },
    "services": []                    // Future: database containers, etc.
  },

  // Evaluation
  "evaluation": {
    "test_command": "node_modules/.bin/vitest run eval/tests.test.ts 2>&1",
    "lint_command": "node_modules/.bin/tsc --noEmit",
    "test_weight": 0.7,              // Weight of test pass rate in final score
    "lint_weight": 0.2,              // Weight of lint pass (binary) in final score
    "rubric_weight": 0.1,            // Weight of rubric checks in final score
    "timeout_seconds": 60            // Max time for LLM to work on the task
  },

  // Files
  "prompt_file": "prompt.md",        // Prompt sent to the LLM
  "files_to_modify": ["src/routes/greeting.ts"],  // Files the LLM should change
  "files_readonly": ["src/index.ts"],              // Files the LLM must not touch
  "tags": ["hono", "api"]            // For filtering
}
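
A few of the constraints implied by the fields above can be checked mechanically. The sketch below is illustrative only and is not the harness's validate.ts: `checkTaskConfig` is a hypothetical name, the type covers only a subset of fields, and the rule that the three weights sum to 1 is an assumption inferred from the example values rather than a documented requirement.

```typescript
// Subset of the task.json fields shown above; names follow the example.
interface TaskConfig {
  id: string;
  difficulty: "easy" | "medium" | "hard";
  evaluation: {
    test_weight: number;
    lint_weight: number;
    rubric_weight: number;
    timeout_seconds: number;
  };
}

// Hypothetical validator: returns a list of problems, empty when none are found.
export function checkTaskConfig(config: TaskConfig, directoryName: string): string[] {
  const problems: string[] = [];

  // The id is documented as matching the task's directory name.
  if (config.id !== directoryName) {
    problems.push(`id "${config.id}" does not match directory "${directoryName}"`);
  }

  const { test_weight, lint_weight, rubric_weight, timeout_seconds } = config.evaluation;
  const weights = [test_weight, lint_weight, rubric_weight];

  if (weights.some((w) => w < 0 || w > 1)) {
    problems.push("each weight should be between 0 and 1");
  }

  // Assumption: weights are meant to sum to 1 (compared with a float tolerance).
  const sum = weights.reduce((total, w) => total + w, 0);
  if (Math.abs(sum - 1) > 1e-9) {
    problems.push(`weights sum to ${sum}, expected 1`);
  }

  if (!(timeout_seconds > 0)) {
    problems.push("timeout_seconds should be positive");
  }

  return problems;
}
```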

prompt.md

This is what the LLM agent receives. It should:

- Include a comment for contamination detection
- Clearly state the requirements
- Specify which files to modify and which are readonly
- Provide enough context for the LLM to work without exploring the entire repo

eval/

The hidden test suite. Copied into the workspace after the LLM finishes, so the LLM never sees these files during execution. Tests should verify behavior (HTTP responses, function outputs), not implementation details (variable names, internal structure).

Uses vitest. Example:

import { describe, it, expect } from "vitest";

describe("greeting endpoint", () => {
  it("returns greeting with name", async () => {
    const res = await fetch("http://localhost:3000/greet?name=World");
    expect(res.status).toBe(200);
    const body = await res.json();
    expect(body.message).toBe("Hello, World!");
  });
});

rubric.json -- automated pattern checks applied to the workspace:

{
  "checks": [
    { "type": "file_exists", "path": "src/routes/greeting.ts", "points": 1 },
    { "type": "file_not_modified", "path": "src/index.ts", "points": 1 },
    { "type": "no_pattern", "glob": "src/**/*.ts", "pattern": "as any", "points": 1 },
    { "type": "pattern_present", "glob": "src/**/*.ts", "pattern": "export", "points": 1 }
  ]
}

Check types:

Type	Description
`file_exists`	File must exist at `path`
`file_not_modified`	File at `path` must match the original
`no_pattern`	Pattern must NOT appear in files matching `glob`
`pattern_present`	Pattern must appear in files matching `glob`
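
To make the four check types concrete, here is a rough sketch of how a rubric could be scored against an in-memory snapshot of the workspace. It is not the harness's evaluator.ts. Everything beyond the check shapes is an assumption made for illustration: the names `scoreRubric` and `globToRegExp`, treating `pattern` as a literal substring rather than a regular expression, reading `pattern_present` as "appears in at least one matching file", and the very small glob subset supported.

```typescript
// Check shapes mirror the rubric.json example above.
type RubricCheck =
  | { type: "file_exists"; path: string; points: number }
  | { type: "file_not_modified"; path: string; points: number }
  | { type: "no_pattern"; glob: string; pattern: string; points: number }
  | { type: "pattern_present"; glob: string; pattern: string; points: number };

// Snapshot of a directory as path -> file contents (stand-in for reading disk).
type Files = Map<string, string>;

// Minimal glob support for this sketch: "**/" spans directories, "*" stays in one segment.
function globToRegExp(glob: string): RegExp {
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  const source = escaped
    .replace(/\*\*\//g, "\u0000") // protect "**/" before handling single "*"
    .replace(/\*/g, "[^/]*")
    .replace(/\u0000/g, "(?:.*/)?");
  return new RegExp(`^${source}$`);
}

function passes(check: RubricCheck, original: Files, workspace: Files): boolean {
  switch (check.type) {
    case "file_exists":
      return workspace.has(check.path);
    case "file_not_modified":
      return workspace.has(check.path) && workspace.get(check.path) === original.get(check.path);
    case "no_pattern":
    case "pattern_present": {
      const matcher = globToRegExp(check.glob);
      const found = Array.from(workspace.entries()).some(
        ([file, text]) => matcher.test(file) && text.includes(check.pattern),
      );
      return check.type === "pattern_present" ? found : !found;
    }
  }
}

// Fraction of rubric points earned (0 when the rubric has no points at all).
export function scoreRubric(checks: RubricCheck[], original: Files, workspace: Files): number {
  let earned = 0;
  let total = 0;
  for (const check of checks) {
    total += check.points;
    if (passes(check, original, workspace)) earned += check.points;
  }
  return total === 0 ? 0 : earned / total;
}
```

With the example rubric, a workspace that leaves src/index.ts untouched but introduces `as any` in src/routes/greeting.ts would earn three of the four points.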

Creating Your Own Tasks

1. Create a directory under tasks/ matching your task ID
2. Write task.json with metadata, stack, and evaluation config
3. Build a synthetic repo in repo/ -- avoid using real production code to prevent contamination
4. Write the prompt in prompt.md
5. Write eval tests in eval/ that verify expected behavior
6. Add the task entry to tasks/task-manifest.json
7. Validate the task structure: pnpm validate

Tips:

- Base tasks on real work you do, but use synthetic codebases (not copy-pasted production code) to prevent contamination
- Test HTTP behavior, function outputs, and file structure -- not internal method names or variable choices
- Include a mix of difficulties across categories
- Planning tasks can test written output with text assertions (use test_weight: 0.9, lint_weight: 0.0, rubric_weight: 0.1)

CLI Adapters

Adapter	Flag	How it runs	Model format
Claude Code	`--cli claude-code`	`claude --print --model X`	`claude-opus-4-6`
opencode	`--cli opencode`	`opencode run -m X`	`provider/model` (e.g. `github-copilot/gpt-5.3-codex`)
Aider	`--cli aider`	`aider --yes-always --model X`	Model name
Manual	`--cli manual`	Human runs the task	N/A

Adapters live in harness/src/adapters/. Adding a new one means implementing the adapter interface and registering it in adapters/index.ts.
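
As a rough idea of what an adapter might look like, the sketch below maps each `--cli` value from the table to the command it runs. The `CliAdapter` interface, `buildCommand`, and `getAdapter` are hypothetical names chosen for this example; the real interface in harness/src/adapters/types.ts may differ, and how the prompt is passed to each CLI is deliberately left out because the table does not specify it.

```typescript
// Hypothetical adapter shape; the harness's actual interface may differ.
interface CliAdapter {
  name: string;
  // Returns the executable followed by its arguments for the given model.
  buildCommand(model: string): string[];
}

// Commands follow the "How it runs" column of the table above.
const adapters: Record<string, CliAdapter> = {
  "claude-code": {
    name: "claude-code",
    buildCommand: (model) => ["claude", "--print", "--model", model],
  },
  opencode: {
    name: "opencode",
    buildCommand: (model) => ["opencode", "run", "-m", model],
  },
  aider: {
    name: "aider",
    buildCommand: (model) => ["aider", "--yes-always", "--model", model],
  },
};

// Looks up an adapter by its --cli value and fails loudly on unknown names.
export function getAdapter(cli: string): CliAdapter {
  const adapter = adapters[cli];
  if (!adapter) {
    throw new Error(`Unknown --cli value: ${cli}`);
  }
  return adapter;
}
```

Under this shape, adding a new CLI would amount to one more entry in the registry.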

Scoring

score = (test_pass_rate * test_weight) + (lint_pass * lint_weight) + (rubric_score * rubric_weight)

- test_pass_rate: Fraction of vitest tests that pass (0.0 to 1.0)
- lint_pass: Binary (1.0 if the task's lint_command, tsc --noEmit in the example above, exits 0; otherwise 0.0)
- rubric_score: Fraction of rubric check points earned (0.0 to 1.0)

Default weights: tests 0.7, lint 0.2, rubric 0.1. Planning tasks typically use tests 0.9, lint 0.0, rubric 0.1.
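
The formula can be written out as a small function. This is a sketch rather than the harness's own code: `RunOutcome`, `computeScore`, and the choice to score an empty test suite or rubric as 0 are assumptions made for the example.

```typescript
interface Weights {
  test_weight: number;
  lint_weight: number;
  rubric_weight: number;
}

// Hypothetical container for one run's raw results.
interface RunOutcome {
  testsPassed: number;
  testsTotal: number;
  lintPassed: boolean;
  rubricEarned: number;
  rubricTotal: number;
}

const DEFAULT_WEIGHTS: Weights = { test_weight: 0.7, lint_weight: 0.2, rubric_weight: 0.1 };

// Assumption: an empty test suite or rubric contributes 0 instead of dividing by zero.
function fraction(part: number, whole: number): number {
  return whole === 0 ? 0 : part / whole;
}

export function computeScore(outcome: RunOutcome, weights: Weights = DEFAULT_WEIGHTS): number {
  const testPassRate = fraction(outcome.testsPassed, outcome.testsTotal);
  const lintPass = outcome.lintPassed ? 1.0 : 0.0;
  const rubricScore = fraction(outcome.rubricEarned, outcome.rubricTotal);
  return (
    testPassRate * weights.test_weight +
    lintPass * weights.lint_weight +
    rubricScore * weights.rubric_weight
  );
}
```

For example, with the default weights, a run that passes 8 of 10 tests, passes lint, and earns 3 of 4 rubric points scores 0.8 * 0.7 + 1.0 * 0.2 + 0.75 * 0.1, which is about 0.835.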

Reporting

# Leaderboard + per-task breakdown from existing result data
npx tsx scripts/report.ts

# Re-run evaluation against existing workspaces (incremental -- only missing results)
npx tsx scripts/re-evaluate.ts

# Re-evaluate everything from scratch
npx tsx scripts/re-evaluate.ts --full

Reports are generated as JSON and Markdown in the results/ directory.

Project Structure

harness/src/
├── cli.ts          # CLI entry point (run, validate, list)
├── runner.ts       # Benchmark orchestrator with rate-limit retry
├── evaluator.ts    # Test runner, lint checker, rubric scorer
├── container.ts    # Workspace lifecycle (copy repo, install, run, eval)
├── stats.ts        # Bootstrap CI, Wilcoxon signed-rank test
├── reporter.ts     # JSON + Markdown report generation
├── validate.ts     # Task structure validation
└── adapters/       # CLI-specific adapters
    ├── index.ts
    ├── types.ts
    ├── claude-code.ts
    ├── opencode.ts
    ├── aider.ts
    └── manual.ts

Related Projects

- SanityHarness -- Tests 19 CLI agents on curated algorithmic tasks across 6 languages. Great for comparing agents on standardised problems, but tasks are fixed (not bring-your-own).
- Calibra -- Custom task matrix with detailed reporting, but tests model APIs rather than CLI agents.
- SWE-bench -- Industry-standard benchmark from real GitHub issues. Fixed dataset, primarily Python.
- Terminal-Bench -- Terminal agent benchmark with Docker-containerised tasks. Community contributions welcome, but it is not a bring-your-own-repo harness.
- Inspect AI -- General-purpose LLM eval framework from UK AISI. Extensible, but requires custom scaffolding for CLI agent testing.

We built this because we needed to test LLMs against our specific TypeScript/Ghost stack and none of the above quite fit. If you have similar needs, it might save you some time.

License

MIT


Categories

Category	Description
`api-endpoint`	Build or modify HTTP API routes
`react-frontend`	React component and UI tasks
`library-utility`	Standalone utility functions and modules
`database-operation`	Database queries, aggregations, migrations
`docker-infrastructure`	Dockerfiles, Compose, deployment configs
`bug-fix`	Find and fix bugs in existing code
`planning`	Produce written plans, architecture docs, or investigation reports
