Benchmarking the gap between AI agent hype and architecture. Three agent archetypes, 73-point performance spread, stress testing, network resilience, and ensemble coordination analysis with statistical validation.
AI agent evaluation framework for multi-participant coordination tasks. Built with LangGraph, custom MCP tools, and LLM-as-a-Judge evaluation. MSc dissertation project (University of Edinburgh, 2025).
A comprehensive benchmarking platform for CPT, ICD-10, and HCPCS coding questions. Evaluates multiple AI models on medical coding expertise through iterative consensus-building to identify the most reliable models for healthcare applications.
Comprehensive multi-IDE AI model benchmarking framework supporting Cursor, Windsurf, VSCode, and other IDEs, with automated testing and performance comparison capabilities.
🔬 Research Project: An automated framework to generate, configure, and evaluate multi-agent AI crews for financial modeling using a Meta-Agent pipeline. The study compares the performance of dynamically synthesized multi-agent systems (MAS) against manually defined expert benchmarks in financial risk contexts.
Standalone open-source verifier for MBX v2, AiBenchLab's tamper-evident benchmark export format. Three dependencies, zero network access; it reproduces the SHA-256 content hash to confirm an .mbx.json file hasn't been altered since export.
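A minimal sketch of that kind of check, assuming the export stores its payload under a `content` key and the recorded hash under `content_hash` (hypothetical names; the actual MBX v2 field names and canonicalization rules may differ):

```python
import hashlib
import json
import sys

def verify_export(path: str) -> bool:
    """Recompute the SHA-256 hash of the export's content payload and
    compare it with the hash recorded at export time."""
    with open(path, "r", encoding="utf-8") as f:
        export = json.load(f)

    # Serialize the payload deterministically (sorted keys, no extra
    # whitespace) so the hash is reproducible across machines.
    canonical = json.dumps(
        export["content"], sort_keys=True, separators=(",", ":")
    ).encode("utf-8")

    return hashlib.sha256(canonical).hexdigest() == export["content_hash"]

if __name__ == "__main__":
    ok = verify_export(sys.argv[1])
    print("OK" if ok else "TAMPERED")
    sys.exit(0 if ok else 1)
```

The deterministic serialization step is the load-bearing part: without a fixed key order and whitespace convention, the same logical content could hash to different values on different machines, and the tamper check would produce false alarms.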