Agents that run the same in evaluation and in production — closing the gap between the two.
The eval ↔ prod gap · Agents · Parity · Adapt a new agent · Contributing
Most teams build an agent for production (a CLI, a TUI, an app) and then re-implement or approximate it to evaluate it — a different harness, a different scaffold, different tool plumbing. So the benchmark measures something other than what ships, and the numbers don't transfer.
This repo's premise is the opposite: one agent, used both ways. Every agent here runs in production (an interactive TUI, the Vercel AI SDK, a coding CLI) and as an evaluation harness on BenchFlow over ACP — with no reimplementation. What you benchmark is what you ship. That's the gap we're closing.
And not just coding agents: mini-swe is a coding harness, but the Vercel AI SDK agents are general, tool-using agent frameworks (build any agent), and BenchFlow evaluations span well beyond code. The repo is a home for agents of any kind that you want to both ship and benchmark.
| Family | Agents | Eval on BenchFlow |
|---|---|---|
| mini-swe | mini-swe-agent behind opencode's TUI (mini-swe-code) + an ACP shim (mini-swe-acp) | ✅ stable — faithful SWE-agent harness (>74% SWE-bench verified) |
| ai-sdk | the Vercel AI SDK agent surface: ToolLoopAgent (acp) and HarnessAgent × {pi, codex, claude-code} | mixed: acp ✅ (parity byte-verified), harness-pi ✅ (file tasks), codex/claude-code 🧪 (need a Vercel sandbox). Per-agent maturity in ai-sdk/README. |
| omnigent | Databricks Omnigent pi meta-harness, the first non-ACP agent here: rides BenchFlow's Session path via a session_factory, shelling omnigent run inside the sandbox | ✅ reward 1.0 on hello-world and the real citation-check research task (DeepSeek/Daytona x86_64). Needs a BenchFlow with the session-factory seam; see omnigent/README. |
Each agent is a self-contained package: a production runtime + a thin adapter
(ACP, or BenchFlow's non-ACP Session path for omnigent) registered via the
public register_agent extension point.
The point isn't just "it runs in both places" — it's that it behaves the same
in both. For ai-sdk/acp this is verified at the wire level: driven inside
BenchFlow vs. standalone, the upstream model request is byte-identical (same
system+user prompt, tools, sampling params) apart from neutral gateway artifacts;
tool-use, file output, reward, and finish reason match. BenchFlow provides the
environment and captures the trajectory — it does not perturb the agent. The
adaptation-parity skill automates this check;
methodology in docs/parity.md.
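
The wire-level comparison can be pictured as follows: take the upstream model request captured inside BenchFlow and the one captured standalone, drop the fields that only the gateway touches, and compare the canonical bytes. This is a minimal sketch under assumptions, not the repo's parity_diff.py: the function names are hypothetical, and the keys in `NEUTRAL_KEYS` are placeholders for whatever "neutral gateway artifacts" appear in your captures.

```python
import json

# Placeholder names for fields a gateway might add or rewrite (assumption).
NEUTRAL_KEYS = frozenset({"request_id", "user", "metadata"})


def canonical_bytes(request: dict, neutral_keys=NEUTRAL_KEYS) -> bytes:
    """Serialize a captured request deterministically, minus neutral top-level keys."""
    kept = {k: v for k, v in request.items() if k not in neutral_keys}
    text = json.dumps(kept, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def diff_requests(inside: dict, standalone: dict, neutral_keys=NEUTRAL_KEYS) -> list[str]:
    """Return the top-level keys whose values differ; an empty list means parity."""
    missing = object()  # sentinel so an absent key differs from any present value
    keys = (set(inside) | set(standalone)) - set(neutral_keys)
    return sorted(
        k for k in keys if inside.get(k, missing) != standalone.get(k, missing)
    )
```

Comparing canonical bytes gives a single pass/fail answer, while the key-level diff points at which part of the request (prompt, tools, sampling params) drifted when the check fails.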
Honesty matters more than a green checkmark. Toy tasks (a single file write) pass easily and prove little; real eval workloads, with input files, real toolchains (`pytest`, network), and skills, expose the gaps. No agent here is "verified" beyond its quickstart; e.g. `harness-pi` passes hello-world but not the real SkillsBench `citation-check`. We need more tasks, of more variants, run end-to-end. Each package README states plainly what it has and hasn't been run against.
- Adapt: write an ACP server + `register.py` (docs/adaptation.md). Scaffold from `ai-sdk/acp`: `python skills/adaptation-parity/scripts/scaffold_ai_sdk_agent.py <name>`.
- Verify parity: inside vs. standalone, with the skill's `acp_capture.mjs` + `parity_diff.py` (docs/parity.md).
Drive an agent interactively (mini-swe-code):
```bash
cd mini-swe-code
uv venv .venv && source .venv/bin/activate
uv pip install -e ".[opencode]"
export ANTHROPIC_API_KEY="<your-key>"   # or OPENAI_API_KEY / GEMINI_API_KEY / ...
mkdir -p /tmp/mini-swe-scratch
mini-opencode --attach --cwd /tmp/mini-swe-scratch
```

Note: the agent executes commands locally without confirmation inside --cwd, so point it at a scratch directory.
Benchmark an agent (mini-swe-acp):
```bash
pip install "mini-swe-acp @ git+https://github.com/benchflow-ai/agents#subdirectory=mini-swe-acp"
```

```python
import mini_swe_acp  # registers "mini-swe" (aliases: mini, minisweagent, mini-swe-agent)
from benchflow import SDK

# Top-level await: run this inside an async context (a notebook cell or an async function).
await SDK().run(task_path="...", agent="mini-swe", model="openai/gpt-4o-mini")
```

Per-agent setup, design notes, and caveats live in each package's README: mini-swe-code · mini-swe-acp · ai-sdk/*.
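
Because the run call above is awaited, a plain script needs an event loop around it. The sketch below is one way to do that while running several tasks in sequence. It assumes only what the call above shows (a coroutine accepting `task_path`, `agent`, and `model` keywords); `run_tasks` is a hypothetical helper, and the return value of `run` is treated as opaque because this README does not describe it.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def run_tasks(
    run: Callable[..., Awaitable[Any]],
    task_paths: list[str],
    agent: str,
    model: str,
) -> dict[str, Any]:
    """Await `run` once per task path, in order, and collect whatever it returns."""
    results: dict[str, Any] = {}
    for path in task_paths:
        results[path] = await run(task_path=path, agent=agent, model=model)
    return results


if __name__ == "__main__":
    # Stand-in coroutine so the sketch runs without BenchFlow installed.
    async def fake_run(**kwargs: Any) -> str:
        return f"ran {kwargs['task_path']} with {kwargs['agent']}"

    print(asyncio.run(run_tasks(fake_run, ["task-a", "task-b"], "mini-swe", "openai/gpt-4o-mini")))
```

To drive a real agent, the assumption is that `SDK().run` can be passed in place of `fake_run` after importing `mini_swe_acp`, since it is awaited with the same keywords in the snippet above.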
```
mini-swe-code/   mini-swe-agent distribution + opencode TUI (CLIs: mini, mini-opencode)
mini-swe-acp/    mini-swe-agent as a BenchFlow ACP agent
ai-sdk/          Vercel AI SDK agents: acp, harness-pi, harness-codex, harness-claude-code
omnigent/        Databricks Omnigent pi meta-harness as a non-ACP (session-factory) BenchFlow agent
skills/          adaptation-parity skill: adapt an agent + verify eval/prod parity
docs/            adaptation.md, parity.md
.github/         CI: per-family tests (path-filtered), ruff lint, markdown link check
```
Each package builds, tests, and ships independently; add a new agent as a new
package + a per-package CI workflow (or extend the ai-sdk matrix).
Issues and PRs welcome — see CONTRIBUTING.md. High-value now:
- More benchmark tasks, of more variants: input-file / real-toolchain (`pytest`/build) / skill-based, run end-to-end against each agent to find and close eval↔prod behavior gaps.
- New agent integrations (any production agent + a thin ACP adapter; scaffold + verify with the adaptation-parity skill).
- Parity reports: same agent, inside-BenchFlow vs. standalone, audited.
| Path | License |
|---|---|
| repository root, mini-swe-acp/, ai-sdk/, omnigent/, skills/ | Apache-2.0 |
| mini-swe-code/ | MIT (upstream mini-swe-agent license, kept verbatim) |
- mini-swe-agent and the SWE-bench / SWE-agent team. If useful in research, cite their SWE-agent paper.
- Vercel AI SDK: the toolkit behind the `ai-sdk` agents.
- opencode: the TUI that makes the agent a pleasure to drive.
- Agent Client Protocol: the editor/agent-agnostic protocol the eval shims speak.