An implementation of Anthropic's paper and essay on "A statistical approach to model evaluations."
An AI agent that evolves strategies through automated overnight self-play. A generic framework with a GEPA-inspired feedback loop and Elo tracking.
Create your self-hosted, open-source Operator model.
LOAB: A benchmark for evaluating LLM agents on end-to-end mortgage lending operations under real regulatory constraints.
Portable evaluation bundles for agents and agent-shaped workflows: bounded, reproducible, regression-aware proof surfaces for quality claims.
Build a private evaluation dataset to optimize your organization's token costs.
Reproducible evaluation harness for hidden coordination variables in multi-agent LLM systems.
A Multi-Agent System for Cross-Checking Phishing URLs.
The node-level tracing library for agentic software.
Legal Action Boundary Eval (LABE): public proxy eval for legal AI workflows at the action boundary
An alpha-stage benchmark for repo continuation intelligence.
Horizon-Eval: evaluation-integrity framework for portable long-horizon agent benchmarks, with QA gates, trajectory auditing, replayable run bundles, and safety-gap analysis.
A practical workbench for prompt, model, and mocked workflow evaluation with repeatable benchmarks, structured graders, and agent episode traces.
Trace-to-eval control plane that turns production failures into promptfoo-ready eval packs.
Reasoning quality analysis for autonomous agents — detect silent reasoning failures using optimal transport, information theory, and algorithmic complexity.
Automate AI agent behavior tuning with human feedback, test small mutations, and keep what improves performance over time