# eval-harness

Here are 25 public repositories matching this topic...

Virtue-Research / guard-eval-harness

Evaluate any AI guardrail on 80+ safety benchmarks — jailbreaks, toxicity, prompt injection — with one command.

cli benchmark ai-safety guardrails llm-evaluation llm-safety safety-evaluation eval-harness

Updated Jun 11, 2026
Python

chapzin / codex-harness-mcp

Local Codex MCP harness: contracts, persistent RAG memory, raw traces, verification records, governance policy and PASS/FLAG/BLOCK audits, observability reports, harness profiles, eval runs, Meta-Harness-lite promotion records, natural-language harness specs, MCP resources/prompts, multi-client installer, and completion gates.

mcp observability governance codex ai-agents rag model-context-protocol agent-ops codex-cli agent-harness eval-harness skills-sh harness-engineering

Updated May 12, 2026
JavaScript

plaited / agent-eval-harness

Evaluate AI agents with Unix-style pipeline commands. Schema-driven adapters for any CLI agent, trajectory capture, pass@k metrics, and multi-run comparison.

cli typescript grader ai-agents bun jsonl llm-evaluation agent-evaluation unix-pipeline agent-comparison trajectory-capture eval-harness pass-at-k headless-adapter

Updated May 7, 2026
TypeScript
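
For readers unfamiliar with the pass@k metric mentioned in this entry, here is a minimal Python sketch of the commonly used unbiased estimator: given n sampled runs of which c passed, it estimates the probability that at least one of k runs passes. The function name `pass_at_k` and its signature are illustrative assumptions. This is not taken from the plaited / agent-eval-harness code, which may compute the metric differently.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n runs with c passes (hypothetical helper).

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    a random subset of k runs contains no passing run.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failures exist, so every k-subset contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Example inputs are made up for illustration.
    print(pass_at_k(n=10, c=3, k=1))
    print(pass_at_k(n=10, c=3, k=5))
```

With k = 1 the estimate reduces to the plain pass rate c / n, which is a quick sanity check on any implementation.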

adityaanand0001 / healos-ai-agent

LLM-powered clinical extraction + structured evals. Prompt strategies, hallucination detection, and per-field F1 scoring.

typescript nextjs postgresql hono bun llm anthropic drizzle-orm eval-harness

Updated May 1, 2026
TypeScript
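
As an illustration of what per-field F1 scoring can look like for structured extraction, the sketch below scores each field separately across paired gold and predicted records. The record layout, the exact-match comparison, and the name `per_field_f1` are all assumptions made for this example. The repository itself may match values more leniently or define the counts differently.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def per_field_f1(gold: List[Record], pred: List[Record]) -> Dict[str, float]:
    """Per-field F1 over paired records (hypothetical helper).

    Assumed convention: a field counts as a true positive when the
    predicted value is present and equals the gold value; a wrong or
    spurious prediction is a false positive; a gold value that was not
    reproduced is a false negative.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    fields = sorted({key for rec in gold + pred for key in rec})
    scores: Dict[str, float] = {}
    for field in fields:
        tp = fp = fn = 0
        for g, p in zip(gold, pred):
            gv, pv = g.get(field), p.get(field)
            if pv is not None and pv == gv:
                tp += 1
            else:
                if pv is not None:
                    fp += 1  # predicted something wrong or spurious
                if gv is not None:
                    fn += 1  # gold value was missed
        denom = 2 * tp + fp + fn
        # Assumed convention: nothing to extract and nothing extracted is 1.0.
        scores[field] = 2 * tp / denom if denom else 1.0
    return scores


if __name__ == "__main__":
    # Made-up example records, not real clinical data.
    gold = [{"drug": "aspirin", "dose": "81mg"}, {"drug": "ibuprofen", "dose": None}]
    pred = [{"drug": "aspirin", "dose": "80mg"}, {"drug": "ibuprofen", "dose": "200mg"}]
    print(per_field_f1(gold, pred))
```

Scoring per field rather than per record makes it visible when a model is reliable on some fields and weak on others, and a value predicted where the gold record has none (a possible hallucination) shows up as a false positive on that field.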

ResonantIQ / resonantforge

Deterministic synthetic two-party conversation corpus generator for testing AI scoring systems.

synthetic-data llm-evaluation customer-intelligence eval-harness corpus-generation

Updated Jun 10, 2026
Python

codychampion / llm-eval-workbench

Production-minded LLM eval harness for safety, reliability, cost, and latency analysis.

python evaluation observability model-evaluation ai-safety red-teaming llm-evals eval-harness safety-evals

Updated May 25, 2026
Python

Kleptobyte / AGI-CK3

Prototype adapter for CLI agents to play Crusader Kings III through a constrained, auditable CK3 mod bridge.

prototype ai-agents ck3 crusader-kings-3 eval-harness game-agents

Updated Jun 10, 2026
Python

qte77 / doc-pipeline-engine

Document processing pipeline engine — adapters, contracts, domain packs, eval harnesses

pipeline document-processing rag document-ai contract-first llm eval-harness

Updated Jun 11, 2026
Python

2830500285 / omni-agent

Verification-native local coding agent runtime with eval gates, memory, subagents, and model profiles.

Updated May 16, 2026
TypeScript

OZ-50 / ozm-codex-agent-governance-skills

Codex-native OZM skill pack for AI coding agent governance, agentic coding loops, claim ceilings, and AGENTS.md-aware workflows.

Updated May 30, 2026
Python

neverSettles / opencua_hackathon

Can Computer-Use Agents manage buy-side procurement operations? A benchmark across live e-commerce, with multi-model adapters (Northstar, OpenAI, Claude, Gemini), Kernel-hosted browsers, and Harbor/ATIF-v1.6 trajectory export.

benchmark kernel procurement cua lightcone computer-use-agent eval-harness tzafon-northstar openai-computer-use

Updated May 10, 2026
Python

shashidharReddy866 / llm-evaluation-system

Production-style LLM evaluation harness for structured clinical extraction — compares prompt strategies across accuracy, cost, and hallucination.

nlp json-schema nextjs model-evaluation hono structured-output few-shot-learning ai-evaluation prompt-engineering anthropic llm-evaluation hallucination-detection llm-reliability eval-harness prompt-comparison

Updated May 1, 2026
TypeScript

sarteta / whatsapp-rag-eval-kit

YAML-driven evaluation harness for WhatsApp RAG bots

python yaml ai twilio chatbot evaluation whatsapp observability rag llm eval-harness

Updated Apr 29, 2026
Python

KarmaEnchanter / mental-health-llm-eval

Open evaluation harness for mental health LLM responses. 5 clinically-grounded rubrics, LLM-as-judge with bias controls, crisis-detection routing to 988 protocols.

psychology cbt ai-safety conversational-ai clinical-ai cohen-kappa ollama llm-evaluation llm-as-judge mental-health-ai ai-eval inter-rater-reliability eval-harness lifeline-988 open-source-eval

Updated May 29, 2026
Python

Judysonnen / patchwise

Tolerant apply_patch for LLM-generated diffs, plus an eval harness for code agents.

code-generation llm-agents eval-harness apply-patch

Updated May 6, 2026
Python

rscolling / adv-lens

Form ADV Part 2A intelligence + peer benchmarking — LangGraph, hybrid retrieval, eval harness

compliance claude wealth-management rag pydantic fastapi langfuse langgraph eval-harness

Updated Apr 28, 2026
Python

hermes-labs-ai / agent-convergence-scorer

agent-convergence-scorer is a CLI and Python library that scores how lexically similar N agent or LLM outputs are: exact-match rate, Jaccard token overlap, divergence point, and a composite convergence score over any list of runs. An eval primitive for measuring reproducibility and fan-out collapse. Lexical, not semantic. Zero deps.

cli benchmark consistency evaluation similarity multi-agent convergence reproducibility agents jaccard divergence llm llm-evaluation ai-reliability eval-harness agent-eval

Updated Jun 7, 2026
Python
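
To make the lexical metrics named in this entry concrete, here is a minimal Python sketch of Jaccard token overlap and exact-match rate over a list of outputs. The whitespace tokenization, lowercasing, and the function names are assumptions for illustration only. This is not the agent-convergence-scorer API, and its composite score and divergence point are not reproduced here.

```python
from itertools import combinations
from typing import List


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap of lowercase whitespace tokens (assumed tokenization)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty outputs are treated as identical
    return len(ta & tb) / len(ta | tb)


def mean_pairwise_jaccard(outputs: List[str]) -> float:
    """Average Jaccard overlap over all unordered pairs of outputs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0  # zero or one output: nothing to disagree with
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def exact_match_rate(outputs: List[str]) -> float:
    """Fraction of unordered pairs whose outputs are string-identical."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Made-up outputs from three hypothetical runs.
    runs = ["the cat sat", "the cat sat", "the cat ran"]
    print(mean_pairwise_jaccard(runs), exact_match_rate(runs))
```

Because the comparison is purely token-based, two paraphrases with the same meaning can score low, which matches the entry's own caveat that the measure is lexical, not semantic.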

da-troll / rag-sherpa-template

Slack Q&A bot template — ingests Slack threads + markdown docs into Pinecone, then an n8n workflow retrieves, re-ranks on metadata, and posts answers back in-thread. Includes contextual retrieval and an eval harness.

starter-template openai slack-bot knowledge-base pinecone rag n8n vector-database llm retrieval-augmented-generation llamaparse contextual-retrieval chatbot-template eval-harness

Updated Jun 9, 2026
Python

stephenpadgett1 / video-understanding-eval-harness

Side-by-side eval harness for video-understanding models — retrieval, reasoning, and structured extraction — with an LLM-as-judge and cost-aware scoring. A Solutions-Architect reference scaffold.

pegasus clip video-understanding claude video-ai anthropic llm-as-judge twelve-labs eval-harness marengo

Updated May 30, 2026
Python

WillLewis / agent-harness-environment

An eval and observability cockpit for coding agents. It runs policy-controlled coding agents in sandboxed toy repos with tool-use traces and MCP tools, compares harness policies, scores recovery and safety behavior with Python evals, and supports CI-gated verification and Braintrust/Weave export.

pytest cursor ai-agents weights-and-biases trace-analysis braintrust mcp-tools agent-evaluation eval-harness

Updated Jun 9, 2026
Python
