Regression testing for AI agents. Snapshot behavior, diff tool calls, and catch regressions in CI. Works with LangGraph, CrewAI, OpenAI, Anthropic.
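A minimal sketch of that snapshot-and-diff loop in plain Python; the snapshots/ directory, the run ID, and the trace format here are illustrative assumptions, not any specific framework's API:

```python
import difflib
import json
from pathlib import Path

SNAP_DIR = Path("snapshots")  # hypothetical location for stored baselines


def snapshot_tool_calls(run_id: str, tool_calls: list[dict]) -> None:
    """Serialize an agent run's tool calls to a stable, diffable form."""
    SNAP_DIR.mkdir(exist_ok=True)
    text = json.dumps(tool_calls, indent=2, sort_keys=True)
    (SNAP_DIR / f"{run_id}.json").write_text(text)


def diff_against_baseline(run_id: str, tool_calls: list[dict]) -> list[str]:
    """Return a unified diff between the stored baseline and the new run."""
    baseline = (SNAP_DIR / f"{run_id}.json").read_text().splitlines()
    current = json.dumps(tool_calls, indent=2, sort_keys=True).splitlines()
    return list(difflib.unified_diff(baseline, current, "baseline", "current", lineterm=""))


if __name__ == "__main__":
    baseline_run = [{"tool": "search", "args": {"query": "agent benchmarks"}}]
    snapshot_tool_calls("ticket-triage", baseline_run)  # first run: record baseline

    new_run = [{"tool": "search", "args": {"query": "agent leaderboards"}}]
    delta = diff_against_baseline("ticket-triage", new_run)
    if delta:
        print("\n".join(delta))
        raise SystemExit(1)  # regression: the tool-call trace drifted, fail CI
```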
Benchmarking the gap between AI agent hype and architecture. Three agent archetypes, 73-point performance spread, stress testing, network resilience, and ensemble coordination analysis with statistical validation.
Deterministic runtime for agent evaluation
A curated collection of the world’s most advanced benchmark datasets for evaluating Large Language Model (LLM) Agents.
University for AI agents. 92 courses, 4400+ scenarios, any model via OpenRouter. Auto-training loops generate per-model SKILL.md documents. Works with Claude Code, OpenClaw, Cursor, Windsurf. No fine-tuning required.
Deterministic evaluation environment for AI code reviewers covering bugs, security (OWASP), and architecture via FastAPI + OpenEnv.
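A hedged sketch of what such a deterministic environment could look like over FastAPI; the routes, the seeded task, and the recall-style scoring below are assumptions for illustration, not the repo's actual OpenEnv interface:

```python
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

# Hypothetical fixed task set: determinism comes from serving the same
# code snippet and expected findings for a given task_id on every run.
TASKS = {
    "sqli-001": {
        "code": 'query = f"SELECT * FROM users WHERE id = {user_id}"',
        "expected_findings": ["sql_injection"],  # OWASP A03: Injection
    },
}


class Review(BaseModel):
    task_id: str
    findings: list[str]


@app.get("/task/{task_id}")
def get_task(task_id: str) -> dict:
    if task_id not in TASKS:
        raise HTTPException(status_code=404, detail="unknown task")
    return {"task_id": task_id, "code": TASKS[task_id]["code"]}


@app.post("/review")
def score_review(review: Review) -> dict:
    if review.task_id not in TASKS:
        raise HTTPException(status_code=404, detail="unknown task")
    expected = set(TASKS[review.task_id]["expected_findings"])
    found = expected & set(review.findings)
    # Simple recall score: fraction of planted issues the reviewer caught.
    return {"score": len(found) / len(expected)}
```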
The open benchmark for AI agent task execution. Claude Code vs Gemini CLI — who wins? Live leaderboard inside.
🧠 Discover and evaluate advanced benchmark datasets for Large Language Model agents to enhance performance assessment in real-world tasks.
A community catalog of autonomous agents and bundles certified by passing TraceCore deterministic episode runs in public CI
OWLViz: An Open-World Benchmark for Visual Question Answering
OpenEnv benchmark for broken ELT/ETL pipeline repair, online recovery, and temporal orchestration.
AI Arena is a competitive evaluation framework where multiple AI agents answer the same set of questions under identical conditions. Their performance is scored, ranked, and tracked over time using two complementary metrics: AIQ and ELO.
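The Elo half of that scoring follows the standard rating update; below is a sketch with an illustrative K-factor (AIQ is repo-specific and not modeled here):

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """One head-to-head Elo update. score_a is 1.0 if agent A wins,
    0.5 for a tie, 0.0 for a loss; agent B receives the complement."""
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    new_a = r_a + k * (score_a - expected_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b


# Example: a 1500-rated agent beats a 1600-rated agent on a shared question.
print(elo_update(1500, 1600, 1.0))  # winner gains ~20 points, loser drops ~20
```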
🤖 Benchmark AI agent capabilities, bridging the gap between hype and reality with clear metrics and insights for informed development decisions.
Benchmark autonomous AI agents by measuring their reasoning and competitive skills with dynamic, continuous scoring using AIQ and ELO metrics.