ttxs69/awesome-coding-agent-eval

Awesome Coding Agent Evaluation

A curated list of benchmarks, harnesses, leaderboards, and tools for evaluating AI coding agents (CLI-based and autonomous).

Why this list? Coding agents (Claude Code, Codex, Aider, etc.) are multiplying quickly, but comparing them objectively is hard. This list gathers public benchmarks, eval harnesses, and related resources in one place.


📊 Benchmarks & Datasets

The foundational datasets that define "what to test" for coding agents.

| Project | Stars | Description |
|---|---|---|
| SWE-bench | 5.1k | The original benchmark: real GitHub issues from popular Python repos. ICLR 2024 Oral. Includes SWE-bench Verified (a human-validated subset of 500 instances) and SWE-bench Multimodal. |
| Multi-SWE-bench | 337 | Multilingual extension of SWE-bench covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. 1.6k+ instances across 7 languages. |
| SWE-bench-Live | 194 | Microsoft. NeurIPS 2025. A live benchmark that continuously collects new GitHub issues to prevent data contamination. |
| SWE-bench-Mutated | 4 | Microsoft/CAIN 2026. Rewrites SWE-bench prompts via LLM to create mutation-hardened variants for more realistic agent evaluation. |
| SWE-EVO | 46 | Benchmark for evaluating coding agents on autonomous software evolution tasks (not just isolated bug fixes). |
| FrontierSWE | 127 | Ultra long-horizon coding agent benchmark testing implementation, performance engineering, and ML research tasks. |

🔧 Evaluation Harnesses & Tools

Tools that run the benchmarks and collect results.

| Project | Stars | Description |
|---|---|---|
| SWE-agent | 19.5k | Princeton NLP. The reference agent+evaluator that takes a GitHub issue and tries to fix it. Now includes EnIGMA mode for cybersecurity CTF challenges. |
| coding-agent-eval (cae) | 1 | Public, reproducible benchmark for CLI coding agents (Claude Code, Codex, Aider) on SWE-bench Verified. Static leaderboard, cost/time/token tracking. |
| Workshop | 869 | Give your coding agent the power to write and run agent evals. Agentic eval framework. |
| SanityHarness | 227 | Lightweight, universal harness compatible with any coding agent. Evaluates over a broad set of tasks. |
| SanityBoard | 22 | Leaderboard website for SanityHarness results. |
| Strands Evaluation | 134 | Comprehensive evaluation framework for AI agents and LLM apps, from simple output validation to complex multi-agent interaction analysis and trajectory evaluation. |
| EvalMonkey | 38 | CLI for agent builders to benchmark & chaos test AI agents. Text, Voice, Code supported. |
| agentic-coding-tool-eval | 41 | Simple framework to compare agentic coding tools head-to-head. |
| agent-eval-harness | 1 | Live, open-source benchmark for comparing AI coding agents on real GitHub issues. Auto-updated. |
| halton/coding-agent-eval | - | Evaluation framework comparing Claude CLI, GitHub Copilot CLI, and Gemini CLI on coding tasks. |
| kortix-ai/swe | 5 | SWE-bench runner with Docker-based evaluation. |

πŸ† Leaderboards & Comparisons

Live or static sites that rank coding agents.

| Project | Stars | Description |
|---|---|---|
| ai-agent-benchmark | 25 | Comprehensive comparison of 80+ AI coding agents. SWE-bench leaderboard with pricing. Covers Devin, Cursor, Claude Code, Codex, etc. |
| cae leaderboard | - | Live static leaderboard from coding-agent-eval. Shows pass rate, cost, time, tokens per agent per task. |
| SanityBoard | 22 | Web-based leaderboard for SanityHarness results. |
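
The metrics these leaderboards report (pass rate, cost, time, tokens) come down to a simple aggregation over per-task results. The sketch below is illustrative only and is not taken from any project listed here: the `TaskResult` record, its field names, the `summarize` helper, and the demo values are all hypothetical, and the ranking rule (pass rate first, then lower cost) is an assumption rather than any leaderboard's actual policy.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One agent's attempt at one task (hypothetical record layout)."""
    agent: str
    task_id: str
    passed: bool
    cost_usd: float
    seconds: float
    tokens: int


def summarize(results):
    """Aggregate per-task results into one leaderboard row per agent."""
    by_agent = defaultdict(list)
    for r in results:
        by_agent[r.agent].append(r)

    rows = []
    for agent, runs in by_agent.items():
        n = len(runs)
        rows.append({
            "agent": agent,
            "tasks": n,
            "pass_rate": sum(r.passed for r in runs) / n,
            "total_cost_usd": sum(r.cost_usd for r in runs),
            "mean_seconds": sum(r.seconds for r in runs) / n,
            "total_tokens": sum(r.tokens for r in runs),
        })
    # Assumed ranking: higher pass rate first, then lower total cost.
    rows.sort(key=lambda row: (-row["pass_rate"], row["total_cost_usd"]))
    return rows


if __name__ == "__main__":
    # Made-up demo data with placeholder agent names, not real results.
    demo = [
        TaskResult("agent-a", "task-1", True, 0.50, 120.0, 40_000),
        TaskResult("agent-a", "task-2", False, 0.25, 60.0, 20_000),
        TaskResult("agent-b", "task-1", True, 0.75, 90.0, 55_000),
        TaskResult("agent-b", "task-2", True, 0.25, 30.0, 25_000),
    ]
    for row in summarize(demo):
        print(row)
```

Keeping the raw per-task records and deriving the rows from them makes it possible to re-rank by a different metric (for example cost per solved task) without rerunning the agents.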

🤖 The Agents Being Evaluated

The coding agents these benchmarks measure.

| Project | Stars | Description |
|---|---|---|
| Claude Code | 131k | Anthropic's agentic coding tool. Lives in your terminal, understands your codebase. |
| Codex | 90k | OpenAI's lightweight coding agent that runs in your terminal. |
| Aider | 46k | AI pair programming in your terminal. Long-standing open-source coding agent. |
| SWE-agent | 19.5k | Agent that automatically solves GitHub issues using an LM. Serves as both tool and benchmark. |

📚 Surveys & Awesome Lists

Academic and community-maintained collections.

| Project | Stars | Description |
|---|---|---|
| Awesome-Repo-Level-Code-Generation | 304 | Must-read papers on repository-level code generation & issue resolution. |
| Awesome-Code-as-Agent-Harness-Papers | 353 | Curated papers on code-as-agent harnesses, benchmarks (Terminal-Bench, AppWorld, etc.). |
| Awesome-Issue-Solving | 9 | Survey: Agentic Software Issue Resolution with Large Language Models. |
| Awesome-LLM-SWE-Bench | 5 | Reading list on LLM for SWE-bench style tasks. |

🔗 Related Resources


Contributing

Found a project that's missing? Open an issue or PR!

Criteria for inclusion:

  • Directly related to evaluating or benchmarking coding agents
  • Public GitHub repo or well-known paper with public dataset
  • Not a general LLM benchmark unless it has a specific coding-agent track

Last updated: 2026-06-08
