Benchmarking the gap between AI agent hype and architecture. Three agent archetypes, 73-point performance spread, stress testing, network resilience, and ensemble coordination analysis with statistical validation.
AI agent evaluation framework for multi-participant coordination tasks. Built with LangGraph, custom MCP tools, and LLM-as-a-Judge evaluation. MSc dissertation project (University of Edinburgh, 2025).
A comprehensive benchmarking platform for CPT, ICD-10, and HCPCS coding questions. Evaluates multiple AI models on medical coding expertise through iterative consensus-building to identify the most reliable models for healthcare applications.
Comprehensive multi-IDE AI model benchmarking framework supporting Cursor, Windsurf, VSCode, and other IDEs, with automated testing and performance comparison capabilities.
🔬 Research Project: An automated framework to generate, configure, and evaluate multi-agent AI crews for financial modeling using a Meta-Agent pipeline. The study compares the performance of dynamically synthesized multi-agent systems (MAS) against manually defined expert benchmarks in financial risk contexts.
Standalone open-source verifier for MBX v2, AiBenchLab's tamper-evident benchmark export format. Three dependencies, zero network access; it reproduces the SHA-256 content hash to confirm an .mbx.json file hasn't been altered since export.
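A minimal sketch of that kind of check, assuming the export stores its payload under a `content` key and the recorded hash under `content_hash` (hypothetical names; the actual MBX v2 field names and canonicalization rules may differ):

```python
import hashlib
import json
import sys

def verify_export(path: str) -> bool:
    """Recompute the SHA-256 hash of the export's content payload and
    compare it with the hash recorded at export time."""
    with open(path, "r", encoding="utf-8") as f:
        export = json.load(f)

    # Serialize the payload deterministically (sorted keys, no extra
    # whitespace) so the hash is reproducible across machines.
    canonical = json.dumps(
        export["content"], sort_keys=True, separators=(",", ":")
    ).encode("utf-8")

    return hashlib.sha256(canonical).hexdigest() == export["content_hash"]

if __name__ == "__main__":
    ok = verify_export(sys.argv[1])
    print("OK" if ok else "TAMPERED")
    sys.exit(0 if ok else 1)
```

The deterministic serialization step is the load-bearing part: without a fixed key order and whitespace convention, the same logical content could hash to different values on different machines, and the tamper check would produce false alarms.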