feat(acp-registry): classify + adapt the ACP registry; wire + verify Qwen Code#5
feat(acp-registry): classify + adapt the ACP registry; wire + verify Qwen Code#5xdotli wants to merge 5 commits into
Conversation
…Qwen Code Add an `acp-registry` package mapping all 36 agents in the public Agent Client Protocol registry onto BenchFlow — which can run as faithful, model-enforced evals, and how. Every registry agent already speaks ACP, so adapting one is a thin registration (install + launch + env_mapping), not a server. - Registry-driven catalog (`catalog.py`, single source of truth) classifying every agent: native (5, already a benchflow built-in), wired (1, qwen-code), catalog (20, BYO-redirectable with exact per-agent recipes), vendor-locked (9, can't route through the gateway), out-of-scope (1). `AGENTS.md` is generated from it; CI fails if they drift. - Qwen Code wired and verified end-to-end on DeepSeek via Daytona: reward 1.0 on hello-world AND on the real skillsbench/citation-check task (33 tool calls) — the same task ai-sdk/harness-pi could not do (just-bash FS). - Verification surfaced a general benchflow gap: ACP agents validate model ids against their own catalogs and reject the benchmark id over ACP (-32603). The fix is a small `acp_model_via_env` flag (proposed; patch in the PR); register.py enables it via feature-detection and warns on builds without it. - `refresh_registry.py` diffs the live registry vs the vendored snapshot. - Docs: root README agent catalog + layout, CONTRIBUTING, docs/adaptation.md; per-package CI workflow (tests + ruff + AGENTS.md-in-sync gate). 50 key-free tests; ruff clean.
Experiment results — Qwen Code, end-to-end on DaytonaDriver: imported the shipped
|
| repository="https://github.com/QwenLM/qwen-code", | ||
| distribution=NPX, | ||
| package="@qwen-code/qwen-code@0.18.0", | ||
| acp_args="--acp", |
There was a problem hiding this comment.
Qwen launch omits skills flag
Medium Severity
The wired qwen-code entry sets acp_args to only --acp, while the vendored registry snapshot lists both --acp and --experimental-skills. register() builds the launch command from that field, so skill-dependent workloads (including the cited skillsbench/citation-check verification) may not run the same way as the official ACP distribution.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit a634c41. Configure here.
…rors from adversarial audit Acting on "make sure all agents work": live-verified BYO agents on DeepSeek/Daytona, wired the ones that genuinely run, and corrected the catalog from an adversarial verification pass. Verified working (live, end-to-end): - qwen-code (npx): reward 1.0 on hello-world AND real skillsbench/citation-check; also via the LiteLLM gateway with usage captured (provider_response). - goose (binary): NEW — Block's agent, per-arch Linux binary downloaded from the registry snapshot, all-env routing (GOOSE_PROVIDER/OPENAI_HOST/OPENAI_BASE_PATH/ OPENAI_API_KEY/GOOSE_MODEL). reward 1.0 hello-world via the shipped package; real citation-check ran clean (reward 0.0 — agent/model didn't solve, not an integration failure). Adds binary-install support to register.py. Probed, not yet wireable (honest findings, recorded in the catalog): - dirac: DOES speak ACP (registry's --acp correct; a README-only audit claim of "no ACP" was wrong), but closes stdout mid-run (pipe_closed). - github-copilot-cli: -32000 Authentication required on 1.0.61 — BYOK not honored in ACP mode. Catalog accuracy fixes (from the adversarial audit; all 4 headline claims held): - kimi license MIT->Apache-2.0; nova proprietary->MIT - kilo binary->npx (@kilocode/cli); vtcode acp_args ""->"acp"; junie ""->"--acp true" - poolside model_via env->flag; fast-agent env (OPENAI__BASE_URL nested, not plain) - dimcode recipe (interactive /connect + sqlite, no provider-add CLI) 53 key-free tests; ruff clean; AGENTS.md regenerated (CI-gated in sync). Tally: wired 2 / catalog 19 / native 5 / vendor-locked 9 / out-of-scope 1.
"Make all agents work" — live verification sweepRan candidate ACP agents through the shipped package (
goose is now a second wired agent. It required adding binary-install Catalog corrections from the adversarial auditThe verification workflow's catalog auditor caught 9 real errors in the What's NOT done (honest)
So: 2 of the BYO tier are wired+verified, 2 more probed with concrete findings, |
| _BIN_PREFIX = "/opt/benchflow/acp" | ||
| _SNAPSHOT = json.loads( | ||
| (Path(__file__).parents[2] / "registry.snapshot.json").read_text() | ||
| ) |
There was a problem hiding this comment.
Snapshot missing from wheel
High Severity
register.py loads registry.snapshot.json at import via Path(__file__).parents[2], but the wheel only ships src/acp_registry and does not bundle that file (unlike ai-sdk’s force-include for server.mjs). A normal pip install from the README leaves the path under site-packages, so import acp_registry fails before register() runs.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit d6a9a8a. Configure here.
| f"{_extract_cmd(x86, dest)}; " | ||
| f"chmod +x {bin_path} 2>/dev/null || true; " | ||
| f"[ -x {bin_path} ]" | ||
| ).replace("$BF_ARCHIVE", "/tmp/bf-acp-archive") |
There was a problem hiding this comment.
Binary extract uses x86 format
Medium Severity
_binary_install always runs _extract_cmd(x86, dest) after downloading the per-arch URL into the same archive path. Extraction is chosen from the x86_64 URL’s extension, not from the archive actually fetched on aarch64, so mixed formats per architecture can make ARM installs fail even when the download succeeds.
Reviewed by Cursor Bugbot for commit d6a9a8a. Configure here.
Ran every BYO-tier agent through the pipeline (spec-extraction -> install -> launch -> task) on DeepSeek/Daytona. Honest outcome: 2 work end-to-end (qwen-code, goose); the other 14 each hit a concrete, confirmed blocker. README "Live verification" is now a full 19-row truth table. Catalog reasons corrected from the deep spec research + live probes: - stakpak: HARD-BLOCK — ACP path model->provider routing defect on v0.3.88 (source-traced); advertises a validated model option (static models.dev catalog). - kimi: HARD-BLOCK — mandatory interactive OAuth, no headless path. - autohand: HARD-BLOCK — not headless-configurable for an arbitrary endpoint. - poolside: HARD-BLOCK — ACP path not headless-wirable to a custom endpoint. - vtcode: installs+launches, pipe_closed at the ACP session (closest binary config-file agent; base_url needs the /v1 suffix) — recorded as known_issue. The remaining⚠️ agents (vtcode/kilo/codebuddy/deepagents/crow-cli/mistral-vibe/ minion-code/junie/dirac) install + launch but fail at the ACP session — each a bounded per-agent fix now that the pipeline is built. ❌ agents (cline/dimcode/ github-copilot-cli/grok-build) hit auth/interactive/gated-install blockers. 53 tests; ruff clean; AGENTS.md in sync. Tally unchanged (wired 2).
Full BYO-tier sweep — every agent run through the pipelineBuilt a reusable pipeline (parallel spec-extraction → probe-gen →
Honest takeaways
|
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
There are 4 total unresolved issues (including 3 from previous reviews).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 8639dfc. Configure here.
| f'mkdir -p {dest}; curl -fsSL "$U" -o "$BF_ARCHIVE"; ' | ||
| f"{_extract_cmd(x86, dest)}; " | ||
| f"chmod +x {bin_path} 2>/dev/null || true; " | ||
| f"[ -x {bin_path} ]" |
There was a problem hiding this comment.
Raw binary wrong install path
Medium Severity
For extensionless registry archives, _extract_cmd copies the download to dest/ as the temp filename, while the install script verifies [ -x {dest}/{bin_name}]. When bin_name does not match the downloaded artifact name (e.g. versioned raw Linux binaries), install fails after curl succeeds.
Reviewed by Cursor Bugbot for commit 8639dfc. Configure here.
…t fix fan-out Fanned out one subagent per⚠️ agent to root-cause its ACP-session failure from the rollout artifacts + upstream source and return a corrected wiring spec, then re-probed live on DeepSeek/Daytona. That turned deepagents green. - deepagents (npx) NOW wired+verified: reward 1.0 hello-world via the shipped package; real citation-check ran ~10 tool calls then an in-run -32603. Fix: its provider SDK (@langchain/openai) must be installed alongside (new `npm_extra` spec field) and the model passed as `--model openai:<model>` (LangChain colon form), with base URL/key via OPENAI_* env_mapping. - register.py: support `npm_extra` for npx agents; broaden the `acp_model_via_env` enablement to ANY wired agent without ACP set_model (model is env/flag/config owned) — not just model_via=="env". - README truth table updated: 3 verified (qwen-code, goose, deepagents). The other fix-spec re-probes shifted failure modes but still fail at install/launch/runtime (rc=127 / rc=1); cline stays blocked (needs a separate `cline auth` step). Each remaining⚠️ is a bounded per-agent task; ❌ are real blockers. 56 tests; ruff clean; AGENTS.md in sync. Tally: wired 3 / catalog 18 / native 5 / vendor-locked 9 / out-of-scope 1.
Per-agent fix fan-out — deepagents joins the verified set (now 3)Fanned out one subagent per
Verified end-to-end: 3 — |
A second per-agent fix fan-out (subagents root-causing from rollout artifacts + upstream source) plus live re-probes. Net: pushed several⚠️ agents much closer, but only confirmed package-green ones stay wired. - kilo: PROBE-green (reward 1.0 hello-world) when its kilo.jsonc is written at INSTALL time, but the package's generic LAUNCH-time config-write regressed (pipe_closed) — kept as catalog with the precise finding (benchflow's launch context doesn't run the python config-write reliably; kilo's model id can't ride its {env:} substitution since it's a config key). Removed the unproven launch-time config-write infra rather than ship a mechanism no wired agent uses. - Round-2 root-causes recorded in the truth table: vtcode's rc=127 was the dynamic loader failing to find the bundled libghostty-vt.so (fixed via ldconfig) + a missing key mapping (now rc=255 at provider init); cline's -32000 cleared by a `cline auth` prelude (now runs, timed out); junie needs a JVM; crow-cli is really py3.14+uvx+crow-mcp; etc. Verified package-green stays 3: qwen-code, goose, deepagents. 56 tests; ruff clean; AGENTS.md in sync. Tally: wired 3 / catalog 18 / native 5 / vendor-locked 9 / oos 1.


What
Adds an
acp-registrypackage that maps all 36 agents in the publicAgent Client Protocol registry
onto BenchFlow — answering, for each, can it run as a faithful, model-enforced
eval, and how?
Every registry agent already speaks ACP over stdio, so adapting one isn't writing
a server (unlike
ai-sdk/) — it's a thin registration: install, launch in ACPmode, and route the model through BenchFlow's gateway via
env_mapping. The hardpart is that whether routing is even possible splits the registry, and most
coding-vendor CLIs are locked to their own backend.
The catalog (single source of truth:
catalog.py→ generatedAGENTS.md)claude-agent-acp,codex-acp,gemini,opencode,pi-acp) — not shadowedRouting facts (BYO-vs-locked, env vars, api_protocol) were researched per-agent
from upstream docs/source on 2026-06-13; each entry cites its source.
Qwen Code — wired and verified end-to-end
Run on DeepSeek (
deepseek-v4-flash) via Daytona, importing the shippedacp_registry.register():hello-world(toy sanity)skillsbench/citation-check(real)citation-checkships an input file (/root/test.bib) + a skill — the samereal task
ai-sdk/harness-picould not do (its just-bash sandbox hides taskfiles). qwen-code runs in a real Daytona sandbox, sees the file, and solves it.
(Full trajectory in a comment below.)
The BenchFlow gap this surfaced (and the proposed fix)
Verifying qwen-code exposed a general issue, not a qwen quirk:
The model is already delivered out-of-band via
OPENAI_MODEL, so the fix is tonot drive it over ACP. That's a small, general BenchFlow change — an
acp_model_via_envregistry flag that skips ACP model configuration entirely:This package enables the flag via feature-detection: on a BenchFlow build
that has it, qwen-code runs (verified above); on a build without it,
register()logs a clear warning and qwen-code fails at model configuration. The
pyprojectdep note and
AGENTS.mdstate this plainly. (Patch verified locally; happy toopen it as a separate BenchFlow PR.)
Also
scripts/refresh_registry.pydiffs the live registry vs the vendored snapshot(flags added/removed/bumped agents to reclassify) — never auto-edits classifications.
docs/adaptation.mdupdated.AGENTS.md-in-sync gate.Honesty bar
wired= registers + routes by construction;verified= actually run, on theexact tasks named (qwen-code only). The 📋 catalog recipes come from upstream
docs/source, not from a run — a tested-on-paper starting point.
A structural-quality (thermo-nuclear) review was run on the package; its
should-fix items (a latent shell-prefix bug, a fragile badge re-parse, the
dependency-floor honesty gap) are all addressed in this PR.
Note
Medium Risk
Large new surface area (binary downloads, shell install/launch, BenchFlow registry integration) and a dependency on proposed
acp_model_via_envfor qwen-code; tests are mostly key-free contract checks rather than live sandbox runs.Overview
Adds the
acp-registrypackage: a vendored ACP registry snapshot, acatalog.pyclassification of all 36 agents (native / wired / catalog / vendor-locked / out-of-scope), generatedAGENTS.md, andregister()that installs and registers BenchFlow agents viaenv_mapping(npx or per-arch binary) without a custom ACP server.Three agents are wired:
qwen-code,goose, anddeepagents, with install/launch commands and gateway routing profiles.register()setsacp_model_via_envwhen BenchFlow supports it so agents that own the model via env/flags are not configured over ACP (avoids-32603when agents validate ACPmodeloptions). Maintainer scriptsrefresh_registry.pyandgen_agents_md.pyplus CI (pytest, ruff, AGENTS.md drift check) ship with the package.Root README, CONTRIBUTING, and
docs/adaptation.mddocument the new family and ACP-native adaptation patterns.Reviewed by Cursor Bugbot for commit c32ca6a. Bugbot is set up for automated code reviews on this repo. Configure here.