Awesome Coding Agent Evaluation

A curated list of benchmarks, harnesses, leaderboards, and tools for evaluating AI coding agents (CLI-based and autonomous).

Why this list? Coding agents (Claude Code, Codex, Aider, etc.) are proliferating quickly, but comparing them objectively is hard. This list collects public benchmarks, eval harnesses, and related resources in one place.

📊 Benchmarks & Datasets

The foundational datasets that define "what to test" for coding agents.

Project	Stars	Description
SWE-bench	5.1k	The original benchmark: real GitHub issues from popular Python repos. ICLR 2024 Oral. Includes SWE-bench Verified (a human-validated subset of 500 instances) and SWE-bench Multimodal.
Multi-SWE-bench	337	Multilingual extension of SWE-bench covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. 1.6k+ instances across 7 languages.
SWE-bench-Live	194	Microsoft. NeurIPS 2025. A live benchmark that continuously collects new GitHub issues to prevent data contamination.
SWE-bench-Mutated	4	Microsoft/CAIN 2026. Rewrites SWE-bench prompts via LLM to create mutation-hardened variants for more realistic agent evaluation.
SWE-EVO	46	Benchmark for evaluating coding agents on autonomous software evolution tasks (not just isolated bug fixes).
FrontierSWE	127	Ultra long-horizon coding agent benchmark testing implementation, performance engineering, and ML research tasks.

🔧 Evaluation Harnesses & Tools

Tools that run the benchmarks and collect results.

Project	Stars	Description
SWE-agent	19.5k	Princeton NLP. The reference agent+evaluator that takes a GitHub issue and tries to fix it. Now includes EnIGMA mode for cybersecurity CTF challenges.
coding-agent-eval (cae)	1	Public, reproducible benchmark for CLI coding agents (Claude Code, Codex, Aider) on SWE-bench Verified. Static leaderboard, cost/time/token tracking.
Workshop	869	Give your coding agent the power to write and run agent evals. Agentic eval framework.
SanityHarness	227	Lightweight, universal harness compatible with any coding agent. Evaluates over a broad set of tasks.
SanityBoard	22	Leaderboard website for SanityHarness results.
Strands Evaluation	134	Comprehensive evaluation framework for AI agents and LLM apps — from simple output validation to complex multi-agent interaction analysis and trajectory evaluation.
EvalMonkey	38	CLI for agent builders to benchmark & chaos test AI agents. Text, Voice, Code supported.
agentic-coding-tool-eval	41	Simple framework to compare agentic coding tools head-to-head.
agent-eval-harness	1	Live, open-source benchmark for comparing AI coding agents on real GitHub issues. Auto-updated.
halton/coding-agent-eval	—	Evaluation framework comparing Claude CLI, GitHub Copilot CLI, and Gemini CLI on coding tasks.
kortix-ai/swe	5	SWE-bench runner with Docker-based evaluation.

🏆 Leaderboards & Comparisons

Live or static sites that rank coding agents.

Project	Stars	Description
ai-agent-benchmark	25	Comprehensive comparison of 80+ AI coding agents. SWE-bench leaderboard with pricing. Covers Devin, Cursor, Claude Code, Codex, etc.
cae leaderboard	—	Static leaderboard site from coding-agent-eval. Shows pass rate, cost, time, and tokens per agent per task.
SanityBoard	22	Web-based leaderboard for SanityHarness results.
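
The leaderboards above report metrics such as pass rate, cost, time, and tokens per agent per task. As a rough illustration of how per-task results could be rolled up into one row per agent, here is a minimal sketch. The `TaskResult` schema, the `summarize` function, and the agent names are hypothetical, invented for this example; they do not reflect the actual data format of coding-agent-eval, SanityHarness, or any other project listed here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """One agent run on one task (hypothetical schema, not any tool's real format)."""
    agent: str
    task_id: str
    passed: bool
    cost_usd: float
    seconds: float
    tokens: int


def summarize(results):
    """Roll per-task results up into one summary row per agent."""
    by_agent = defaultdict(list)
    for r in results:
        by_agent[r.agent].append(r)

    rows = []
    for agent, runs in by_agent.items():
        n = len(runs)
        rows.append({
            "agent": agent,
            "tasks": n,
            "pass_rate": sum(r.passed for r in runs) / n,
            "total_cost_usd": sum(r.cost_usd for r in runs),
            "mean_seconds": sum(r.seconds for r in runs) / n,
            "total_tokens": sum(r.tokens for r in runs),
        })
    # Rank by pass rate (highest first), breaking ties by lower total cost.
    rows.sort(key=lambda row: (-row["pass_rate"], row["total_cost_usd"]))
    return rows


if __name__ == "__main__":
    # Made-up demo data; agent names and numbers are placeholders.
    demo = [
        TaskResult("agent-a", "task-1", True, 0.50, 120.0, 40_000),
        TaskResult("agent-a", "task-2", False, 0.25, 90.0, 30_000),
        TaskResult("agent-b", "task-1", True, 1.00, 200.0, 80_000),
        TaskResult("agent-b", "task-2", True, 0.50, 100.0, 50_000),
    ]
    for row in summarize(demo):
        print(row)
```

Ranking by pass rate alone hides large differences in spend, which is why the sketch keeps cost, time, and tokens alongside it and uses cost only as a tie-breaker.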

🤖 The Agents Being Evaluated

The coding agents these benchmarks measure.

Project	Stars	Description
Claude Code	131k	Anthropic's agentic coding tool. Lives in your terminal, understands your codebase.
Codex	90k	OpenAI's lightweight coding agent that runs in your terminal.
Aider	46k	AI pair programming in your terminal. Long-standing open-source coding agent.
SWE-agent	19.5k	Agent that automatically solves GitHub issues using an LM. Serves as both tool and benchmark.

📚 Surveys & Awesome Lists

Academic and community-maintained collections.

Project	Stars	Description
Awesome-Repo-Level-Code-Generation	304	Must-read papers on repository-level code generation & issue resolution.
Awesome-Code-as-Agent-Harness-Papers	353	Curated papers on code-as-agent harnesses, benchmarks (Terminal-Bench, AppWorld, etc.).
Awesome-Issue-Solving	9	Survey: Agentic Software Issue Resolution with Large Language Models.
Awesome-LLM-SWE-Bench	5	Reading list on LLM for SWE-bench style tasks.

🔗 Related Resources

Anthropic: Demystifying Evals for AI Agents — How to structure agent evals
LiveCodeBench — Contamination-free, continuously updated code evaluation benchmark
LiveBench — ICLR 2025 Spotlight. Continuously updated LLM benchmark
Terminal-Bench — Benchmarking agents on hard, realistic CLI tasks (arXiv 2026)
SWE-PolyBench — Multi-language benchmark for repository-level evaluation of coding agents

Contributing

Found a project that's missing? Open an issue or PR!

Criteria for inclusion:

Directly related to evaluating or benchmarking coding agents
Public GitHub repo or well-known paper with public dataset
Not a general LLM benchmark unless it has a specific coding-agent track

Last updated: 2026-06-08
