Video walkthrough: https://youtu.be/rVIvr8eDdVM
60-second overview: https://youtu.be/Rv5FMktWmzE
Open-source agent evaluation harness: run agents against datasets, capture traces, score them with a rubric-based LLM judge, and review regressions in a web dashboard.
AgentEval Lab is a local-first harness for testing LLM agents against structured datasets. You define tasks in YAML — each with a prompt, expected outcome, and rubric — point the CLI at an agent file, and get back a SQLite database of structured traces and per-axis rubric scores produced by a separate LLM judge. No external infrastructure required.
The web dashboard gives you three views: a runs list, a step-by-step trace inspector that shows every thought and tool call, and a compare view that diffs two runs side by side and flags any rubric axis that regressed by one point or more. The goal is to make "does this agent actually work?" a repeatable, auditable question rather than a vibe check.
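As a rough illustration of the task shape described above (a prompt, an expected outcome, and a rubric), here is a minimal Python sketch. The class name, field names, and the `validate` helper are assumptions made for this example; they are not the project's actual YAML schema or loader.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Task:
    """One dataset entry. Field names are illustrative, not the real schema."""
    task_id: str
    prompt: str
    expected: str
    # Maps a rubric axis name to a description of what the judge should check.
    rubric: dict[str, str] = field(default_factory=dict)


def validate(task: Task) -> list[str]:
    """Return a list of problems; an empty list means the task looks usable."""
    problems = []
    if not task.prompt.strip():
        problems.append("prompt is empty")
    if not task.expected.strip():
        problems.append("expected outcome is empty")
    if not task.rubric:
        problems.append("rubric has no axes")
    return problems


# Hypothetical task: the axis names below are made up for the example.
example = Task(
    task_id="t_001",
    prompt="What is 17 * 23?",
    expected="391",
    rubric={
        "correctness": "Final answer equals 391",
        "tool_use": "Uses the calculator tool instead of guessing",
    },
)
print(validate(example))
```

Checking tasks like this before a run would catch an empty prompt or a missing rubric early, before any judge calls are made.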
git clone https://github.com/RitikPatill/agent-eval-lab.git
cd agent-eval-lab
# Install the Python package and dev tools (requires Python 3.11+)
pip install -e ".[dev]"
# Set your Anthropic API key — required for the LLM judge
export ANTHROPIC_API_KEY=sk-...
# Seed the DB with two pre-built demo runs and start both servers
# Requires Node 18+ for the Next.js dashboard
bash record_demo.sh
The script installs Node dependencies, seeds SQLite with a baseline run (r_001) and a regression run (r_002), and starts the FastAPI backend on port 8000 and the Next.js dashboard on port 3000. Press Ctrl-C to stop both servers.
Run any dataset against any agent from the CLI:
aelab run datasets/research_qa.yaml --agent agents/v1_researcher.py
# → prints a run id, e.g. r_abc123
Open http://localhost:3000/runs to see the new run. Click it to walk through the trace: each row is one agent step showing the thought, the tool called, and the judge's score with a written justification.
To compare two runs, navigate to http://localhost:3000/compare?a=r_001&b=r_002. Tasks where the second run scored lower on any rubric axis receive a REGRESSION badge, with the judge justification shown inline so you know why the score dropped.
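The regression rule used by the compare view (flag an axis when the second run drops by one point or more) can be sketched in a few lines of Python. This is an illustration under assumptions, not the code in aelab/regression.py: the function names, the integer score scale, and the axis names are all made up for the example.

```python
def score_deltas(baseline: dict[str, int], candidate: dict[str, int]) -> dict[str, int]:
    """Per-axis delta (candidate minus baseline) for axes scored in both runs."""
    return {
        axis: candidate[axis] - baseline[axis]
        for axis in baseline
        if axis in candidate
    }


def regressed_axes(
    baseline: dict[str, int],
    candidate: dict[str, int],
    threshold: int = 1,
) -> list[str]:
    """Axes whose score dropped by at least `threshold` points, sorted by name."""
    deltas = score_deltas(baseline, candidate)
    return sorted(axis for axis, delta in deltas.items() if delta <= -threshold)


# Hypothetical per-axis scores for one task in two runs.
run_a = {"correctness": 5, "tool_use": 4, "clarity": 3}
run_b = {"correctness": 3, "tool_use": 4, "clarity": 4}

print(score_deltas(run_a, run_b))
print(regressed_axes(run_a, run_b))
```

Keeping this logic pure (plain dicts in, plain values out) makes it easy to unit-test without a database or a judge call.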
The FastAPI backend also exposes interactive docs at http://localhost:8000/docs.
flowchart LR
CLI[aelab CLI] --> Harness[Agent Harness]
Harness -->|tool calls| Tools[calc · web-fetch · file-read]
Harness -->|trace| DB[(SQLite)]
Harness --> Judge[LLM Judge]
Judge -->|rubric scores| DB
DB --> API[FastAPI]
API --> UI[Next.js Dashboard]
UI -->|runs / traces / compare| User((User))
Datasets[YAML datasets] --> Harness
See docs/architecture.md for a detailed design walkthrough.
agent-eval-lab/
  aelab/                  # Python package: core logic
    cli.py                # `aelab` entry-point (Typer)
    agent.py              # ReAct loop: thought → tool call → observation
    tools/                # calculator, web-fetch, file-read implementations
    runner.py             # orchestrates dataset → agent → judge → db
    judge.py              # LLM-as-judge with structured JSON output
    regression.py         # pure score-delta and regression-flag logic
    db.py                 # SQLite schema and query helpers
    api.py                # FastAPI app: /api/runs, /api/traces, /api/compare
  agents/                 # example agent configs (v1 baseline, v2 regression)
  datasets/               # research_qa.yaml, tool_use.yaml
  scripts/                # seed_demo.py: idempotent DB seeder
  web/                    # Next.js 14 + Tailwind dashboard
    src/app/runs/         # runs list with regression badges
    src/app/compare/      # side-by-side score delta view
    src/components/       # CompareTable, ScoreDelta, RunSelectorForm
  docs/                   # architecture.md, quickstart.md, screenshot.png, demo.gif
  tests/                  # pytest: scaffold, regression logic, API integration
  record_demo.sh          # end-to-end demo: seed → start servers → print URLs
  pyproject.toml
- v0.2: Multi-judge ensembling. Run N independent judge calls per trace, aggregate scores by mean, and surface the standard deviation in the dashboard to flag statistically uncertain results
- v0.3: OpenTelemetry export. Map trace events to OTEL spans and add `--otel-endpoint` to `aelab run` so traces are visible in Jaeger, Zipkin, or Grafana Tempo
- v0.4: Langfuse sink. Add `--sink langfuse` to post completed runs and rubric scores to a Langfuse instance alongside production traces
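The aggregation planned for v0.2 (mean per axis plus a standard deviation to flag uncertain results) might look roughly like the sketch below. Nothing here exists in the project yet: the function names, the use of the sample standard deviation, and the 1.0 cutoff are assumptions chosen for illustration.

```python
from statistics import mean, stdev


def aggregate_judgements(judgements: list[dict[str, float]]) -> dict[str, dict[str, float]]:
    """Combine N judge calls into a per-axis mean and sample standard deviation."""
    summary = {}
    for axis in judgements[0]:
        scores = [judgement[axis] for judgement in judgements]
        summary[axis] = {
            "mean": mean(scores),
            # stdev needs at least two data points; treat one judge as zero spread.
            "stdev": stdev(scores) if len(scores) > 1 else 0.0,
        }
    return summary


def uncertain_axes(summary: dict[str, dict[str, float]], max_stdev: float = 1.0) -> list[str]:
    """Axes where the judges disagree by more than the (assumed) cutoff."""
    return sorted(axis for axis, stats in summary.items() if stats["stdev"] > max_stdev)


# Hypothetical scores from three independent judge calls on one trace.
calls = [
    {"correctness": 4, "clarity": 5},
    {"correctness": 4, "clarity": 2},
    {"correctness": 4, "clarity": 5},
]
summary = aggregate_judgements(calls)
print(summary)
print(uncertain_axes(summary))
```

With only a handful of judge calls the standard deviation is a coarse signal, so a cutoff like this is better read as a prompt to inspect the justifications than as a statistical test.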
MIT — see LICENSE.
Built autonomously by autodev, a multi-agent orchestrator I designed. Each commit in this repo was authored by me; the implementation work was performed by Sonnet under the orchestrator's control. Read the orchestrator's README to see how.
