An implementation of Anthropic's paper and essay on "A statistical approach to model evaluations."
An AI agent that evolves strategies through automated overnight self-play. A generic framework with a GEPA-inspired feedback loop and Elo tracking.
Create your self-hosted, open-source Operator model.
LOAB: A benchmark for evaluating LLM agents on end-to-end mortgage lending operations under real regulatory constraints.
Portable evaluation bundles for agents and agent-shaped workflows: bounded, reproducible, regression-aware proof surfaces for quality claims.
Build a private evaluation dataset to optimize your organization's token costs.
Reproducible evaluation harness for hidden coordination variables in multi-agent LLM systems.
A Multi-Agent System for Cross-Checking Phishing URLs.
The node-level tracing library for agentic software.
Legal Action Boundary Eval (LABE): public proxy eval for legal AI workflows at the action boundary
An alpha-stage benchmark for repo continuation intelligence.
Horizon-Eval: evaluation-integrity framework for portable long-horizon agent benchmarks, with QA gates, trajectory auditing, replayable run bundles, and safety-gap analysis.
A practical workbench for prompt, model, and mocked workflow evaluation with repeatable benchmarks, structured graders, and agent episode traces.
Trace-to-eval control plane that turns production failures into promptfoo-ready eval packs.
Reasoning quality analysis for autonomous agents — detect silent reasoning failures using optimal transport, information theory, and algorithmic complexity.
Automate AI agent behavior tuning with human feedback, test small mutations, and keep what improves performance over time