BenchFlow Agents

Agents that run the same in evaluation and in production — closing the gap between the two.

The eval ↔ prod gap · Agents · Parity · Adapt a new agent · Contributing

The eval ↔ prod gap

Most teams build an agent for production (a CLI, a TUI, an app) and then re-implement or approximate it to evaluate it — a different harness, a different scaffold, different tool plumbing. So the benchmark measures something other than what ships, and the numbers don't transfer.

This repo's premise is the opposite: one agent, used both ways. Every agent here runs in production (an interactive TUI, the Vercel AI SDK, a coding CLI) and as an evaluation harness on BenchFlow over ACP — with no reimplementation. What you benchmark is what you ship. That's the gap we're closing.

And not just coding agents: mini-swe is a coding harness, but the Vercel AI SDK agents are general, tool-using agent frameworks (build any agent), and BenchFlow evaluations span well beyond code. The repo is a home for agents of any kind that you want to both ship and benchmark.

Agents

Family	Agents	Eval on BenchFlow
mini-swe	mini-swe-agent behind opencode's TUI (mini-swe-code) + an ACP shim (mini-swe-acp)	✅ stable — faithful SWE-agent harness (>74% SWE-bench verified)
ai-sdk	the Vercel AI SDK agent surface — `ToolLoopAgent` (acp) and `HarnessAgent` × {pi, codex, claude-code}	mixed — `acp` ✅ (parity byte-verified), `harness-pi` ✅ (file tasks), `codex`/`claude-code` 🧪 (need a Vercel sandbox). Per-agent maturity in ai-sdk/README.
omnigent	Databricks Omnigent `pi` meta-harness — the first non-ACP agent here: rides BenchFlow's Session path via a `session_factory`, shelling `omnigent run` inside the sandbox	✅ reward 1.0 on hello-world and the real `citation-check` research task (DeepSeek/Daytona x86_64). Needs a BenchFlow with the session-factory seam — see omnigent/README.

Each agent is a self-contained package: a production runtime + a thin adapter (ACP, or BenchFlow's non-ACP Session path for omnigent) registered via the public register_agent extension point.

Parity: the same agent in both

The point isn't just "it runs in both places" — it's that it behaves the same in both. For ai-sdk/acp this is verified at the wire level: driven inside BenchFlow vs. standalone, the upstream model request is byte-identical (same system+user prompt, tools, sampling params) apart from neutral gateway artifacts; tool-use, file output, reward, and finish reason match. BenchFlow provides the environment and captures the trajectory — it does not perturb the agent. The adaptation-parity skill automates this check; methodology in docs/parity.md.

Honesty matters more than a green checkmark. Toy tasks (a single file write) pass easily and prove little; real eval workloads — input files, real toolchains (pytest, network), skills — expose the gaps. No agent here is "verified" beyond its quickstart; e.g. harness-pi passes hello-world but not the real SkillsBench citation-check. We need more tasks, of more variants, run end-to-end. Each package README states plainly what it has and hasn't been run against.

Adapt & verify a new agent

Adapt — write an ACP server + register.py (docs/adaptation.md). Scaffold from ai-sdk/acp: python skills/adaptation-parity/scripts/scaffold_ai_sdk_agent.py <name>.
Verify parity — inside vs. standalone, with the skill's acp_capture.mjs + parity_diff.py (docs/parity.md).

Quickstarts

Drive an agent interactively (mini-swe-code):

cd mini-swe-code
uv venv .venv && source .venv/bin/activate
uv pip install -e ".[opencode]"
export ANTHROPIC_API_KEY="<your-key>"   # or OPENAI_API_KEY / GEMINI_API_KEY / ...
mkdir -p /tmp/mini-swe-scratch
mini-opencode --attach --cwd /tmp/mini-swe-scratch

Note

The agent executes commands locally without confirmation inside --cwd — point it at a scratch directory.

Benchmark an agent (mini-swe-acp):

pip install "mini-swe-acp @ git+https://github.com/benchflow-ai/agents#subdirectory=mini-swe-acp"

import mini_swe_acp  # registers "mini-swe" (aliases: mini, minisweagent, mini-swe-agent)
from benchflow import SDK
await SDK().run(task_path="...", agent="mini-swe", model="openai/gpt-4o-mini")

Per-agent setup, design notes, and caveats live in each package's README: mini-swe-code · mini-swe-acp · ai-sdk/*.

Repository layout

mini-swe-code/    mini-swe-agent distribution + opencode TUI (CLIs: mini, mini-opencode)
mini-swe-acp/     mini-swe-agent as a BenchFlow ACP agent
ai-sdk/           Vercel AI SDK agents: acp, harness-pi, harness-codex, harness-claude-code
omnigent/         Databricks Omnigent pi meta-harness as a non-ACP (session-factory) BenchFlow agent
skills/           adaptation-parity skill — adapt an agent + verify eval/prod parity
docs/             adaptation.md, parity.md
.github/          CI: per-family tests (path-filtered), ruff lint, markdown link check

Each package builds, tests, and ships independently; add a new agent as a new package + a per-package CI workflow (or extend the ai-sdk matrix).

Contributing

Issues and PRs welcome — see CONTRIBUTING.md. High-value now:

More benchmark tasks, of more variants — input-file / real-toolchain (pytest/build) / skill-based — run end-to-end against each agent to find and close eval↔prod behavior gaps.
New agent integrations (any production agent + a thin ACP adapter; scaffold + verify with the adaptation-parity skill).
Parity reports: same agent, inside-BenchFlow vs. standalone, audited.

License

Path	License
repository root, `mini-swe-acp/`, `ai-sdk/`, `omnigent/`, `skills/`	Apache-2.0
`mini-swe-code/`	MIT (upstream mini-swe-agent license, kept verbatim)

Acknowledgments

mini-swe-agent and the SWE-bench / SWE-agent team. If useful in research, cite their SWE-agent paper.
Vercel AI SDK — the toolkit behind the ai-sdk agents.
opencode — the TUI that makes the agent a pleasure to drive.
Agent Client Protocol — the editor/agent-agnostic protocol the eval shims speak.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BenchFlow Agents

The eval ↔ prod gap

Agents

Parity: the same agent in both

Adapt & verify a new agent

Quickstarts

Repository layout

Contributing

License

Acknowledgments

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
.github		.github
ai-sdk		ai-sdk
docs		docs
mini-swe-acp		mini-swe-acp
mini-swe-code		mini-swe-code
omnigent		omnigent
skills/adaptation-parity		skills/adaptation-parity
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

BenchFlow Agents

The eval ↔ prod gap

Agents

Parity: the same agent in both

Adapt & verify a new agent

Quickstarts

Repository layout

Contributing

License

Acknowledgments

About

Topics

Resources

License

Code of conduct

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages