A curated list of benchmarks, harnesses, leaderboards, and tools for evaluating AI coding agents (CLI-based and autonomous).
Why this list? Coding agents (Claude Code, Codex, Aider, etc.) are multiplying quickly, but comparing them objectively is hard. This list gathers public benchmarks, eval harnesses, and related resources in one place.
The foundational datasets that define "what to test" for coding agents.
| Project | Stars | Description |
|---|---|---|
| SWE-bench | 5.1k | The original benchmark: real GitHub issues from popular Python repos. ICLR 2024 Oral. Includes SWE-bench Verified (a human-validated subset of 500 instances) and SWE-bench Multimodal. |
| Multi-SWE-bench | 337 | Multilingual extension of SWE-bench covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. 1.6k+ instances across 7 languages. |
| SWE-bench-Live | 194 | Microsoft. NeurIPS 2025. A live benchmark that continuously collects new GitHub issues to prevent data contamination. |
| SWE-bench-Mutated | 4 | Microsoft/CAIN 2026. Rewrites SWE-bench prompts via LLM to create mutation-hardened variants for more realistic agent evaluation. |
| SWE-EVO | 46 | Benchmark for evaluating coding agents on autonomous software evolution tasks (not just isolated bug fixes). |
| FrontierSWE | 127 | Ultra long-horizon coding agent benchmark testing implementation, performance engineering, and ML research tasks. |
Tools that run the benchmarks and collect results.
| Project | Stars | Description |
|---|---|---|
| SWE-agent | 19.5k | Princeton NLP. The reference agent+evaluator that takes a GitHub issue and tries to fix it. Now includes EnIGMA mode for cybersecurity CTF challenges. |
| coding-agent-eval (cae) | 1 | Public, reproducible benchmark for CLI coding agents (Claude Code, Codex, Aider) on SWE-bench Verified. Static leaderboard, cost/time/token tracking. |
| Workshop | 869 | Give your coding agent the power to write and run agent evals. Agentic eval framework. |
| SanityHarness | 227 | Lightweight, universal harness compatible with any coding agent. Evaluates over a broad set of tasks. |
| SanityBoard | 22 | Leaderboard website for SanityHarness results. |
| Strands Evaluation | 134 | Comprehensive evaluation framework for AI agents and LLM apps, ranging from simple output validation to complex multi-agent interaction analysis and trajectory evaluation. |
| EvalMonkey | 38 | CLI for agent builders to benchmark & chaos test AI agents. Text, Voice, Code supported. |
| agentic-coding-tool-eval | 41 | Simple framework to compare agentic coding tools head-to-head. |
| agent-eval-harness | 1 | Live, open-source benchmark for comparing AI coding agents on real GitHub issues. Auto-updated. |
| halton/coding-agent-eval | n/a | Evaluation framework comparing Claude CLI, GitHub Copilot CLI, and Gemini CLI on coding tasks. |
| kortix-ai/swe | 5 | SWE-bench runner with Docker-based evaluation. |
Live or static sites that rank coding agents.
| Project | Stars | Description |
|---|---|---|
| ai-agent-benchmark | 25 | Comprehensive comparison of 80+ AI coding agents. SWE-bench leaderboard with pricing. Covers Devin, Cursor, Claude Code, Codex, etc. |
| cae leaderboard | n/a | Static leaderboard from coding-agent-eval. Shows pass rate, cost, time, and tokens per agent per task. |
| SanityBoard | 22 | Web-based leaderboard for SanityHarness results. |
The coding agents these benchmarks measure.
| Project | Stars | Description |
|---|---|---|
| Claude Code | 131k | Anthropic's agentic coding tool. Lives in your terminal, understands your codebase. |
| Codex | 90k | OpenAI's lightweight coding agent that runs in your terminal. |
| Aider | 46k | AI pair programming in your terminal. Long-standing open-source coding agent. |
| SWE-agent | 19.5k | Agent that automatically solves GitHub issues using an LM. Serves as both tool and benchmark. |
Academic and community-maintained collections.
| Project | Stars | Description |
|---|---|---|
| Awesome-Repo-Level-Code-Generation | 304 | Must-read papers on repository-level code generation & issue resolution. |
| Awesome-Code-as-Agent-Harness-Papers | 353 | Curated papers on code-as-agent harnesses, benchmarks (Terminal-Bench, AppWorld, etc.). |
| Awesome-Issue-Solving | 9 | Survey: Agentic Software Issue Resolution with Large Language Models. |
| Awesome-LLM-SWE-Bench | 5 | Reading list on LLM for SWE-bench style tasks. |
- Anthropic: Demystifying Evals for AI Agents - How to structure agent evals
- LiveCodeBench - Contamination-free, continuously updated code evaluation benchmark
- LiveBench - ICLR 2025 Spotlight. Continuously updated LLM benchmark
- Terminal-Bench - Benchmarking agents on hard, realistic CLI tasks (arXiv 2026)
- SWE-PolyBench - Multi-language benchmark for repository-level evaluation of coding agents
Found a project that's missing? Open an issue or PR!
Criteria for inclusion:
- Directly related to evaluating or benchmarking coding agents
- Public GitHub repo or well-known paper with public dataset
- Not a general LLM benchmark unless it has a specific coding-agent track
Last updated: 2026-06-08