Evaluate any AI guardrail on 80+ safety benchmarks — jailbreaks, toxicity, prompt injection — with one command.
Updated Jun 11, 2026 - Python
Local Codex MCP harness: contracts, persistent RAG memory, raw traces, verification records, governance policy and PASS/FLAG/BLOCK audits, observability reports, harness profiles, eval runs, Meta-Harness-lite promotion records, natural-language harness specs, MCP resources/prompts, multi-client installer, and completion gates.
Evaluate AI agents with Unix-style pipeline commands. Schema-driven adapters for any CLI agent, trajectory capture, pass@k metrics, and multi-run comparison.
LLM-powered clinical extraction + structured evals. Prompt strategies, hallucination detection, and per-field F1 scoring.
Deterministic synthetic two-party conversation corpus generator for testing AI scoring systems.
Production-minded LLM eval harness for safety, reliability, cost, and latency analysis.
Prototype adapter for CLI agents to play Crusader Kings III through a constrained, auditable CK3 mod bridge.
Document processing pipeline engine — adapters, contracts, domain packs, eval harnesses
Verification-native local coding agent runtime with eval gates, memory, subagents, and model profiles.
Codex-native OZM skill pack for AI coding agent governance, agentic coding loops, claim ceilings, and AGENTS.md-aware workflows.
Can Computer-Use Agents manage buy-side procurement operations? A benchmark across live e-commerce, with multi-model adapters (Northstar, OpenAI, Claude, Gemini), Kernel-hosted browsers, and Harbor/ATIF-v1.6 trajectory export.
Production-style LLM evaluation harness for structured clinical extraction — compares prompt strategies across accuracy, cost, and hallucination.
YAML-driven evaluation harness for WhatsApp RAG bots
Open evaluation harness for mental health LLM responses. 5 clinically-grounded rubrics, LLM-as-judge with bias controls, crisis-detection routing to 988 protocols.
Tolerant apply_patch for LLM-generated diffs, plus an eval harness for code agents.
Form ADV Part 2A intelligence + peer benchmarking — LangGraph, hybrid retrieval, eval harness
agent-convergence-scorer is a CLI and Python library that scores how lexically similar N agent or LLM outputs are: exact-match rate, Jaccard token overlap, divergence point, and a composite convergence score over any list of runs. An eval primitive for measuring reproducibility and fan-out collapse. Lexical, not semantic. Zero deps.
Slack Q&A bot template — ingests Slack threads + markdown docs into Pinecone, then an n8n workflow retrieves, re-ranks on metadata, and posts answers back in-thread. Includes contextual retrieval and an eval harness.
Side-by-side eval harness for video-understanding models — retrieval, reasoning, and structured extraction — with an LLM-as-judge and cost-aware scoring. A Solutions-Architect reference scaffold.
An eval and observability cockpit for coding agents. It runs policy-controlled coding agents in sandboxed toy repos with tool-use traces and MCP tools, compares harness policies, scores recovery and safety behavior with Python evals, and supports CI-gated verification and Braintrust/Weave export.
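Several entries above mention lexical agreement metrics; the agent-convergence-scorer description, for example, names exact-match rate and Jaccard token overlap. The sketch below is one way such metrics could be computed over a list of runs. It is not that library's code or API: the function names, the whitespace tokenization, and the pairwise averaging are all assumptions made for illustration.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap of the whitespace-token sets of two strings (assumed tokenization)."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 1.0  # assumption: two empty outputs count as identical
    return len(set_a & set_b) / len(set_a | set_b)


def exact_match_rate(outputs: list[str]) -> float:
    """Fraction of output pairs that are string-identical (hypothetical definition)."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0  # fewer than two runs: nothing can disagree
    return sum(a == b for a, b in pairs) / len(pairs)


def mean_pairwise_jaccard(outputs: list[str]) -> float:
    """Average Jaccard overlap across all pairs of outputs (hypothetical definition)."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Three runs of the same prompt; two agree exactly, one differs by a token.
runs = ["the cat sat", "the cat sat", "the dog sat"]
print(exact_match_rate(runs))
print(mean_pairwise_jaccard(runs))
```

Both scores are purely lexical: two outputs that say the same thing in different words would still score low, which matches the "lexical, not semantic" caveat in the entry.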