feat(acp-registry): classify + adapt the ACP registry; wire + verify Qwen Code by xdotli · Pull Request #5 · benchflow-ai/agents

xdotli · 2026-06-13T16:53:03Z

What

Adds an acp-registry package that maps all 36 agents in the public
Agent Client Protocol registry
onto BenchFlow — answering, for each, can it run as a faithful, model-enforced
eval, and how?

Every registry agent already speaks ACP over stdio, so adapting one isn't writing
a server (unlike ai-sdk/) — it's a thin registration: install, launch in ACP
mode, and route the model through BenchFlow's gateway via env_mapping. The hard
part is that whether routing is even possible splits the registry, and most
coding-vendor CLIs are locked to their own backend.
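
To make the "thin registration" idea concrete, here is a minimal sketch of what such a record could look like. The `AgentSpec` class, its field names, the `resolve_env` helper, and the gateway URL are hypothetical illustrations, not the package's actual API. The `OPENAI_BASE_URL` and `OPENAI_API_KEY` names are assumptions; only `OPENAI_MODEL`, the install command, and `qwen --acp` appear in this PR.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AgentSpec:
    """Hypothetical registration record; field names are illustrative only."""

    name: str
    install_cmd: str  # runs once inside the sandbox
    launch_cmd: str   # starts the agent in ACP mode over stdio
    # Maps a gateway setting name to the env var the agent reads.
    env_mapping: dict[str, str] = field(default_factory=dict)


def resolve_env(spec: AgentSpec, gateway: dict[str, str]) -> dict[str, str]:
    """Translate gateway settings into the env vars the agent expects."""
    return {
        env_var: gateway[key]
        for key, env_var in spec.env_mapping.items()
        if key in gateway
    }


qwen = AgentSpec(
    name="qwen-code",
    install_cmd="npm i -g @qwen-code/qwen-code@0.18.0",
    launch_cmd="qwen --acp",
    env_mapping={
        "base_url": "OPENAI_BASE_URL",  # assumed env var name
        "api_key": "OPENAI_API_KEY",    # assumed env var name
        "model": "OPENAI_MODEL",
    },
)

env = resolve_env(
    qwen,
    {"base_url": "http://gateway.example/v1", "api_key": "sk-test", "model": "deepseek-v4-flash"},
)
```

The point of the sketch is that no server code is involved: the record carries two shell commands and a mapping, and the model id travels in the environment.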

The catalog (single source of truth: `catalog.py` → generated `AGENTS.md`)

| Tier | Count | Meaning |
|---|---|---|
| 🟦 native | 5 | already a BenchFlow built-in (`claude-agent-acp`, `codex-acp`, `gemini`, `opencode`, `pi-acp`); not shadowed |
| ✅ wired | 1 | registered here; routes correctly by construction (qwen-code) |
| 📋 catalog | 20 | BYO-redirectable; held back by a config-file writer / binary installer / uvx bootstrap / model-id format, each with the exact recipe |
| 🔒 vendor-locked | 9 | authenticate only to their vendor backend; can't enforce the benchmark model |
| ➖ out-of-scope | 1 | not an LLM coding/eval agent (an agent marketplace) |

Routing facts (BYO-vs-locked, env vars, api_protocol) were researched per-agent
from upstream docs/source on 2026-06-13; each entry cites its source.
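
As a sanity check on the tier table, the counts can be tallied against the registry size. This is a minimal sketch; the `Tier` enum and `check_tally` helper are hypothetical names, and the real `catalog.py` may be structured differently.

```python
from collections import Counter
from enum import Enum


class Tier(Enum):
    NATIVE = "native"
    WIRED = "wired"
    CATALOG = "catalog"
    VENDOR_LOCKED = "vendor-locked"
    OUT_OF_SCOPE = "out-of-scope"


# Counts as stated in the PR description at the time of the first push.
EXPECTED = {
    Tier.NATIVE: 5,
    Tier.WIRED: 1,
    Tier.CATALOG: 20,
    Tier.VENDOR_LOCKED: 9,
    Tier.OUT_OF_SCOPE: 1,
}


def check_tally(tiers: list[Tier], registry_size: int = 36) -> Counter:
    """Count agents per tier and fail loudly if any agent is unclassified."""
    counts = Counter(tiers)
    total = sum(counts.values())
    if total != registry_size:
        raise ValueError(f"classified {total} agents, registry has {registry_size}")
    return counts


# Build a flat list of tier labels, one per agent, from the stated counts.
all_tiers = [tier for tier, n in EXPECTED.items() for _ in range(n)]
counts = check_tally(all_tiers)
```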

Qwen Code — wired and verified end-to-end

Run on DeepSeek (deepseek-v4-flash) via Daytona, importing the shipped
acp_registry.register():

| Task | Result |
|---|---|
| `hello-world` (toy sanity) | ✅ reward 1.0: 1 tool call, file written |
| `skillsbench/citation-check` (real) | ✅ reward 1.0: 33 tool calls, 68 steps, no errors |

citation-check ships an input file (/root/test.bib) + a skill — the same
real task ai-sdk/harness-pi could not do (its just-bash sandbox hides task
files). qwen-code runs in a real Daytona sandbox, sees the file, and solves it.
(Full trajectory in a comment below.)

The BenchFlow gap this surfaced (and the proposed fix)

Verifying qwen-code exposed a general issue, not a qwen quirk:

ACP agents commonly advertise a model session config option and validate
its value against their own model list. BenchFlow's capability-first dispatch
tries to set the benchmark's model id through it — which the agent rejects
with ACP -32603 (seen with both a gateway alias and a bare deepseek-v4-flash).
The agent never runs.

The model is already delivered out-of-band via OPENAI_MODEL, so the fix is to
not drive it over ACP. That's a small, general BenchFlow change — an
acp_model_via_env registry flag that skips ACP model configuration entirely:

# benchflow/agents/registry.py — AgentConfig
+    acp_model_via_env: bool = False  # model delivered via env_mapping, never over ACP

# benchflow/acp/runtime.py — _configure_acp_session
-    if model and _model_selection_owned_by_env(agent, model, agent_env):
+    if model and getattr(agent_cfg, "acp_model_via_env", False):
+        logger.info(f"Skipping ACP model configuration for {agent} — acp_model_via_env")
+    elif model and _model_selection_owned_by_env(agent, model, agent_env):
         logger.info(f"Skipping ACP model configuration for {agent} — launch/env config owns model selection")

This package enables the flag via feature-detection: on a BenchFlow build
that has it, qwen-code runs (verified above); on a build without it, register()
logs a clear warning and qwen-code fails at model configuration. The pyproject
dep note and AGENTS.md state this plainly. (Patch verified locally; happy to
open it as a separate BenchFlow PR.)
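
The feature-detection step could be sketched as follows. This assumes `AgentConfig` is a dataclass, which the diff above suggests but does not confirm. `OldConfig`, `NewConfig`, `supports_flag`, and `build_kwargs` are hypothetical stand-ins for illustration.

```python
import dataclasses
import logging

logger = logging.getLogger("acp_registry")

FLAG = "acp_model_via_env"


def supports_flag(config_cls: type, flag: str = FLAG) -> bool:
    """True if the config dataclass declares the given field."""
    if not dataclasses.is_dataclass(config_cls):
        return False
    return any(f.name == flag for f in dataclasses.fields(config_cls))


def build_kwargs(config_cls: type, base: dict) -> dict:
    """Add the flag only on builds that know about it; warn otherwise."""
    kwargs = dict(base)
    if supports_flag(config_cls):
        kwargs[FLAG] = True
    else:
        logger.warning(
            "%s lacks %s; agents that validate ACP model options may fail",
            config_cls.__name__,
            FLAG,
        )
    return kwargs


@dataclasses.dataclass
class OldConfig:  # stand-in for a BenchFlow build without the flag
    name: str


@dataclasses.dataclass
class NewConfig:  # stand-in for a build with the proposed flag
    name: str
    acp_model_via_env: bool = False


new_cfg = NewConfig(**build_kwargs(NewConfig, {"name": "qwen-code"}))
old_cfg = OldConfig(**build_kwargs(OldConfig, {"name": "qwen-code"}))
```

Detecting the field rather than pinning a version keeps the package importable on older builds, at the cost of a runtime failure that the warning has to explain.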

Also

- scripts/refresh_registry.py diffs the live registry vs the vendored snapshot (flags added/removed/bumped agents to reclassify); never auto-edits classifications.
- Root README agent catalog + layout, CONTRIBUTING, docs/adaptation.md updated.
- Per-package CI workflow: 50 key-free tests + ruff + an AGENTS.md-in-sync gate.
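
The added/removed/bumped diff can be illustrated with a short sketch. It assumes both registries reduce to a name-to-version mapping; the real snapshot schema and the actual `refresh_registry.py` logic may differ, and the agent names below are placeholders.

```python
def diff_registry(snapshot: dict[str, str], live: dict[str, str]) -> dict[str, list[str]]:
    """Report agents that were added, removed, or version-bumped.

    Both inputs map agent name -> version string (an assumed shape).
    Nothing is reclassified here; the result is only a to-do list.
    """
    added = sorted(live.keys() - snapshot.keys())
    removed = sorted(snapshot.keys() - live.keys())
    bumped = sorted(
        name for name in live.keys() & snapshot.keys() if live[name] != snapshot[name]
    )
    return {"added": added, "removed": removed, "bumped": bumped}


# Placeholder data for illustration only.
vendored = {"agent-a": "1.0.0", "agent-b": "2.1.0", "agent-c": "0.3.0"}
upstream = {"agent-a": "1.0.0", "agent-b": "2.2.0", "agent-d": "0.1.0"}
report = diff_registry(vendored, upstream)
```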

Honesty bar

wired = registers + routes by construction; verified = actually run, on the
exact tasks named (qwen-code only). The 📋 catalog recipes come from upstream
docs/source, not from a run — a tested-on-paper starting point.

A structural-quality (thermo-nuclear) review was run on the package; its
should-fix items (a latent shell-prefix bug, a fragile badge re-parse, the
dependency-floor honesty gap) are all addressed in this PR.

Note

Medium Risk
Large new surface area (binary downloads, shell install/launch, BenchFlow registry integration) and a dependency on proposed acp_model_via_env for qwen-code; tests are mostly key-free contract checks rather than live sandbox runs.

Overview
Adds the acp-registry package: a vendored ACP registry snapshot, a catalog.py classification of all 36 agents (native / wired / catalog / vendor-locked / out-of-scope), generated AGENTS.md, and register() that installs and registers BenchFlow agents via env_mapping (npx or per-arch binary) without a custom ACP server.

Three agents are wired: qwen-code, goose, and deepagents, with install/launch commands and gateway routing profiles. register() sets acp_model_via_env when BenchFlow supports it so agents that own the model via env/flags are not configured over ACP (avoids -32603 when agents validate ACP model options). Maintainer scripts refresh_registry.py and gen_agents_md.py plus CI (pytest, ruff, AGENTS.md drift check) ship with the package.

Root README, CONTRIBUTING, and docs/adaptation.md document the new family and ACP-native adaptation patterns.

^{Reviewed by Cursor Bugbot for commit c32ca6a. Bugbot is set up for automated code reviews on this repo. Configure here.}

…Qwen Code

Add an `acp-registry` package mapping all 36 agents in the public Agent Client Protocol registry onto BenchFlow: which can run as faithful, model-enforced evals, and how. Every registry agent already speaks ACP, so adapting one is a thin registration (install + launch + env_mapping), not a server.

- Registry-driven catalog (`catalog.py`, single source of truth) classifying every agent: native (5, already a benchflow built-in), wired (1, qwen-code), catalog (20, BYO-redirectable with exact per-agent recipes), vendor-locked (9, can't route through the gateway), out-of-scope (1). `AGENTS.md` is generated from it; CI fails if they drift.
- Qwen Code wired and verified end-to-end on DeepSeek via Daytona: reward 1.0 on hello-world AND on the real skillsbench/citation-check task (33 tool calls), the same task ai-sdk/harness-pi could not do (just-bash FS).
- Verification surfaced a general benchflow gap: ACP agents validate model ids against their own catalogs and reject the benchmark id over ACP (-32603). The fix is a small `acp_model_via_env` flag (proposed; patch in the PR); register.py enables it via feature-detection and warns on builds without it.
- `refresh_registry.py` diffs the live registry vs the vendored snapshot.
- Docs: root README agent catalog + layout, CONTRIBUTING, docs/adaptation.md; per-package CI workflow (tests + ruff + AGENTS.md-in-sync gate).

50 key-free tests; ruff clean.

xdotli · 2026-06-13T16:53:49Z

Experiment results — Qwen Code, end-to-end on Daytona

Driver: imported the shipped acp_registry.register() (not a hand-copy) →
benchflow.runtime.run(Agent("qwen-code", "deepseek/deepseek-v4-flash"), …).
Provider: DeepSeek (OpenAI-compatible), direct (usage_tracking="off").
Install in-sandbox: Node 22.14 bootstrap → npm i -g @qwen-code/qwen-code@0.18.0
→ launch qwen --acp. All via the package's own _install_cmd/_launch_cmd.

`skillsbench/citation-check` (the real one) — ✅ reward 1.0

reward:          {'reward': 1.0}
model:           deepseek/deepseek-v4-flash    agent: qwen-code
tool calls:      33  (19 completed, 14 failed-and-recovered)
steps:           68  (user_message 1, agent_thought 17, agent_message 17, tool_call 33)
error:           None
timing (s):      agent_setup 3.5 · agent_execution 150.0 · verifier 23.0 · total 198.2

First tool call in the trajectory: ReadFile: test.bib — i.e. the agent
sees /root/test.bib, the task's input file. This is the exact task
ai-sdk/harness-pi failed (its just-bash sandbox hides task files); qwen-code
runs in a real Daytona sandbox, reads the file, verifies the citations across 33
tool calls (including failed-then-retried web lookups), writes /root/answer.json,
and the verifier passes.

`hello-world` (toy sanity) — ✅ reward 1.0

reward: {'reward': 1.0}   tool calls: 1   steps: 6   error: None

The one caveat (already in the PR body)

Both runs required BenchFlow's proposed acp_model_via_env flag. Without it,
qwen-code fails at session setup:

ACP session/set_config_option failed for agent=qwen-code config=model
  value=benchflow-deepseek-deepseek-v4-flash: ACP error -32603: Internal error   # gateway alias
ACP session/set_config_option failed for agent=qwen-code config=model
  value=deepseek-v4-flash: ACP error -32603: Internal error                       # bare model id

qwen-code advertises a model config option and validates it against its own
model list, so BenchFlow's capability-first dispatch can't set a DeepSeek id over
ACP. With acp_model_via_env (model delivered via OPENAI_MODEL instead), both
tasks pass as shown. register() enables the flag by feature-detection and warns
on builds without it.
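
The failure and the workaround can be modelled with a small stub. The method name `session/set_config_option` and the `-32603` code come from the log above; `-32603` is the standard JSON-RPC 2.0 "Internal error" code. The request parameter names, the model list, and both helper functions are assumptions made for illustration, not the ACP wire format or qwen-code's actual validation code.

```python
from typing import Optional

INTERNAL_ERROR = -32603  # JSON-RPC 2.0 "Internal error"


def handle_set_config_option(request: dict, known_models: set[str]) -> dict:
    """Stub agent: reject any model id that is not in its own list."""
    params = request["params"]
    if params["config"] == "model" and params["value"] not in known_models:
        return {
            "jsonrpc": "2.0",
            "id": request["id"],
            "error": {"code": INTERNAL_ERROR, "message": "Internal error"},
        }
    return {"jsonrpc": "2.0", "id": request["id"], "result": {}}


def configure_model(model: str, known_models: set[str], model_via_env: bool) -> Optional[dict]:
    """Stub client: skip ACP model configuration when the env owns the model."""
    if model_via_env:
        return None  # nothing is sent; the agent reads OPENAI_MODEL instead
    request = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "session/set_config_option",
        "params": {"config": "model", "value": model},  # assumed param shape
    }
    return handle_set_config_option(request, known_models)


agent_models = {"placeholder-model-a", "placeholder-model-b"}  # hypothetical list
rejected = configure_model("deepseek-v4-flash", agent_models, model_via_env=False)
skipped = configure_model("deepseek-v4-flash", agent_models, model_via_env=True)
```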

Honesty note: usage_source is unavailable here because I ran direct (usage
tracking off) to isolate the routing question from the gateway. Gateway +
usage-capture should work the same way it does for ai-sdk/acp (the alias now
rides OPENAI_MODEL rather than ACP), but I have not re-verified that path,
so I'm not claiming it.

cursor · 2026-06-13T16:55:25Z

+        repository="https://github.com/QwenLM/qwen-code",
+        distribution=NPX,
+        package="@qwen-code/qwen-code@0.18.0",
+        acp_args="--acp",


Qwen launch omits skills flag

Medium Severity

The wired qwen-code entry sets acp_args to only --acp, while the vendored registry snapshot lists both --acp and --experimental-skills. register() builds the launch command from that field, so skill-dependent workloads (including the cited skillsbench/citation-check verification) may not run the same way as the official ACP distribution.

Additional Locations (1)

acp-registry/src/acp_registry/register.py#L44-L45

^{Reviewed by Cursor Bugbot for commit a634c41. Configure here.}
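
A drift check of this kind could be a few lines in the key-free test suite. The sketch below assumes the snapshot exposes the launch arguments as a list of strings; `missing_launch_args` is a hypothetical helper, not a function in the package.

```python
import shlex


def missing_launch_args(wired_acp_args: str, snapshot_args: list[str]) -> list[str]:
    """Return snapshot launch args that the wired entry does not pass."""
    wired = set(shlex.split(wired_acp_args))
    return [arg for arg in snapshot_args if arg not in wired]


# Values taken from the review comment above.
gap = missing_launch_args("--acp", ["--acp", "--experimental-skills"])
```

Whether an omitted flag is a bug or a deliberate choice is a judgment call; a check like this only makes the difference visible.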

…rors from adversarial audit

Acting on "make sure all agents work": live-verified BYO agents on DeepSeek/Daytona, wired the ones that genuinely run, and corrected the catalog from an adversarial verification pass.

Verified working (live, end-to-end):
- qwen-code (npx): reward 1.0 on hello-world AND real skillsbench/citation-check; also via the LiteLLM gateway with usage captured (provider_response).
- goose (binary): NEW. Block's agent, per-arch Linux binary downloaded from the registry snapshot, all-env routing (GOOSE_PROVIDER/OPENAI_HOST/OPENAI_BASE_PATH/OPENAI_API_KEY/GOOSE_MODEL). reward 1.0 hello-world via the shipped package; real citation-check ran clean (reward 0.0: agent/model didn't solve, not an integration failure). Adds binary-install support to register.py.

Probed, not yet wireable (honest findings, recorded in the catalog):
- dirac: DOES speak ACP (registry's --acp correct; a README-only audit claim of "no ACP" was wrong), but closes stdout mid-run (pipe_closed).
- github-copilot-cli: -32000 Authentication required on 1.0.61; BYOK not honored in ACP mode.

Catalog accuracy fixes (from the adversarial audit; all 4 headline claims held):
- kimi license MIT->Apache-2.0; nova proprietary->MIT
- kilo binary->npx (@kilocode/cli); vtcode acp_args ""->"acp"; junie ""->"--acp true"
- poolside model_via env->flag; fast-agent env (OPENAI__BASE_URL nested, not plain)
- dimcode recipe (interactive /connect + sqlite, no provider-add CLI)

53 key-free tests; ruff clean; AGENTS.md regenerated (CI-gated in sync). Tally: wired 2 / catalog 19 / native 5 / vendor-locked 9 / out-of-scope 1.

xdotli · 2026-06-13T17:37:09Z

"Make all agents work" — live verification sweep

Ran candidate ACP agents through the shipped package (acp_registry.register())
on DeepSeek (deepseek-v4-flash) via Daytona. Honest truth table:

| Agent | Dist | Result |
|---|---|---|
| `qwen-code` | npx | ✅ verified: reward 1.0 hello-world & real citation-check (33 tools); gateway run captured usage (`provider_response`, 45,097 in/116 out) |
| `goose` | binary | ✅ verified (runs): reward 1.0 hello-world; real citation-check ran clean (`tools=2`, no error) but reward 0.0; agent/model didn't solve it, not an integration failure |
| `dirac` | npx | ⚠️ speaks ACP (registry's `--acp` is correct; an earlier README-only audit calling it "no ACP" was wrong) and entered the loop (1 tool), then closed stdout mid-run (`pipe_closed`) |
| `github-copilot-cli` | npx | ❌ `-32000 Authentication required` on `@github/copilot@1.0.61`; BYOK not honored in ACP mode |

goose is now a second wired agent. It required adding binary-install
support to the package (per-arch Linux download from the vendored registry
snapshot + a clean cd dir && env ./bin acp launch) — verified the shipped
_binary_install/_launch_cmd produce a working agent, not just a hand-written probe.
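
For the all-env routing, the five goose variables named in the commit message have to be derived from a single gateway base URL. The sketch below shows one way that split could work. The `"openai"` provider value and the `chat/completions` suffix on `OPENAI_BASE_PATH` are assumptions about how goose interprets these variables, the gateway URL is a placeholder, and `goose_env` is a hypothetical helper rather than the package's code.

```python
from urllib.parse import urlsplit


def goose_env(base_url: str, api_key: str, model: str) -> dict[str, str]:
    """Split an OpenAI-compatible base URL into goose's host/path env vars."""
    parts = urlsplit(base_url)
    host = f"{parts.scheme}://{parts.netloc}"
    prefix = parts.path.strip("/")
    # Assumption: goose expects the full completions path, not just the prefix.
    base_path = f"{prefix}/chat/completions" if prefix else "chat/completions"
    return {
        "GOOSE_PROVIDER": "openai",  # assumed provider id for OpenAI-compatible hosts
        "OPENAI_HOST": host,
        "OPENAI_BASE_PATH": base_path,
        "OPENAI_API_KEY": api_key,
        "GOOSE_MODEL": model,
    }


goose_vars = goose_env("https://gateway.example/v1", "sk-test", "deepseek-v4-flash")
```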

Catalog corrections from the adversarial audit

The verification workflow's catalog auditor caught 9 real errors in the
research-derived catalog (all 4 headline claims still held — refuted:false ×4).
Fixed in this push: kimi license MIT→Apache-2.0, nova proprietary→MIT, kilo
binary→npx, vtcode acp_args→acp, junie→--acp true, poolside model_via→flag,
fast-agent env (OPENAI__BASE_URL nested), dimcode recipe (interactive/sqlite),
dirac discrepancy resolved empirically.

What's NOT done (honest)

- Config-file tier (stakpak/vtcode/crow-cli/kimi/mistral-vibe/kilo/autohand/codebuddy/cline): each needs a config-file writer before a fair run; not yet probed.
- uvx (fast-agent/minion-code), proprietary-gated install (junie/poolside/grok-build), colon-model (deepagents), interactive-config (dimcode): recipes in AGENTS.md, not probed.
- Vendor-locked (9) + the marketplace entry cannot run as model-enforced evals at all; stated plainly per entry.

So: 2 of the BYO tier are wired+verified, 2 more probed with concrete findings,
the rest have recipes. Happy to continue the config-file tier (each is a
config-writer + a live run).

cursor · 2026-06-13T17:38:22Z

+_BIN_PREFIX = "/opt/benchflow/acp"
+_SNAPSHOT = json.loads(
+    (Path(__file__).parents[2] / "registry.snapshot.json").read_text()
+)


Snapshot missing from wheel

High Severity

register.py loads registry.snapshot.json at import via Path(__file__).parents[2], but the wheel only ships src/acp_registry and does not bundle that file (unlike ai-sdk’s force-include for server.mjs). A normal pip install from the README leaves the path under site-packages, so import acp_registry fails before register() runs.

Additional Locations (1)

acp-registry/pyproject.toml#L34-L36

^{Reviewed by Cursor Bugbot for commit d6a9a8a. Configure here.}
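
One common way to avoid a path that only resolves in a source checkout is to load the file as package data through `importlib.resources`. This is a minimal sketch, assuming the snapshot is moved inside the `acp_registry` package directory and included in the wheel by the build backend. It is not the fix the PR actually shipped.

```python
import json
from importlib import resources


def load_snapshot(package: str = "acp_registry", name: str = "registry.snapshot.json") -> dict:
    """Read a JSON data file that ships inside the installed package.

    Unlike Path(__file__).parents[2], this resolves relative to the
    package itself, so it works the same from a checkout and from a wheel.
    """
    text = resources.files(package).joinpath(name).read_text(encoding="utf-8")
    return json.loads(text)
```

Deferring the call until `register()` runs, rather than loading at import time, would also turn a missing file into a clear error at the point of use.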

cursor · 2026-06-13T17:38:22Z

+        f"{_extract_cmd(x86, dest)}; "
+        f"chmod +x {bin_path} 2>/dev/null || true; "
+        f"[ -x {bin_path} ]"
+    ).replace("$BF_ARCHIVE", "/tmp/bf-acp-archive")


Binary extract uses x86 format

Medium Severity

_binary_install always runs _extract_cmd(x86, dest) after downloading the per-arch URL into the same archive path. Extraction is chosen from the x86_64 URL’s extension, not from the archive actually fetched on aarch64, so mixed formats per architecture can make ARM installs fail even when the download succeeds.

^{Reviewed by Cursor Bugbot for commit d6a9a8a. Configure here.}
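
A per-architecture install script can pick the extraction command from the URL that is actually downloaded. The sketch below is a hypothetical rewrite, not the package's `_binary_install`. The function names, the supported suffixes, and the example URLs are all illustrative.

```python
import shlex


def extract_cmd(url: str, archive: str, dest: str) -> str:
    """Choose the extraction command from this URL's own extension."""
    a, d = shlex.quote(archive), shlex.quote(dest)
    if url.endswith((".tar.gz", ".tgz")):
        return f"tar -xzf {a} -C {d}"
    if url.endswith(".zip"):
        return f"unzip -o {a} -d {d}"
    raise ValueError(f"unsupported archive format: {url}")


def install_script(urls_by_arch: dict[str, str], archive: str, dest: str) -> str:
    """Emit a shell snippet that downloads and extracts per `uname -m`."""
    arms = []
    for arch, url in urls_by_arch.items():
        fetch = f"curl -fsSL {shlex.quote(url)} -o {shlex.quote(archive)}"
        # Download and extraction live in the same case arm, so the
        # format always matches the archive that was fetched.
        arms.append(f"{arch}) {fetch} && {extract_cmd(url, archive, dest)} ;;")
    arms.append('*) echo "unsupported arch" >&2; exit 1 ;;')
    return f'mkdir -p {shlex.quote(dest)}; case "$(uname -m)" in ' + " ".join(arms) + " esac"


script = install_script(
    {
        "x86_64": "https://example.com/agent-x86_64.tar.gz",
        "aarch64": "https://example.com/agent-aarch64.zip",
    },
    "/tmp/bf-acp-archive",
    "/opt/benchflow/acp/agent",
)
```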

Ran every BYO-tier agent through the pipeline (spec-extraction -> install -> launch -> task) on DeepSeek/Daytona. Honest outcome: 2 work end-to-end (qwen-code, goose); the other 14 each hit a concrete, confirmed blocker. README "Live verification" is now a full 19-row truth table.

Catalog reasons corrected from the deep spec research + live probes:
- stakpak: HARD-BLOCK. ACP path model->provider routing defect on v0.3.88 (source-traced); advertises a validated model option (static models.dev catalog).
- kimi: HARD-BLOCK. Mandatory interactive OAuth, no headless path.
- autohand: HARD-BLOCK. Not headless-configurable for an arbitrary endpoint.
- poolside: HARD-BLOCK. ACP path not headless-wirable to a custom endpoint.
- vtcode: installs+launches, pipe_closed at the ACP session (closest binary config-file agent; base_url needs the /v1 suffix); recorded as known_issue.

The remaining ⚠️ agents (vtcode/kilo/codebuddy/deepagents/crow-cli/mistral-vibe/minion-code/junie/dirac) install + launch but fail at the ACP session; each is a bounded per-agent fix now that the pipeline is built. ❌ agents (cline/dimcode/github-copilot-cli/grok-build) hit auth/interactive/gated-install blockers.

53 tests; ruff clean; AGENTS.md in sync. Tally unchanged (wired 2).

xdotli · 2026-06-13T18:01:08Z

Full BYO-tier sweep — every agent run through the pipeline

Built a reusable pipeline (parallel spec-extraction → probe-gen →
live-verify on Daytona/DeepSeek) and ran all 20 BYO-tier agents through it.
Honest outcome: 2 work end-to-end, 14 hit a concrete confirmed blocker (4 of
those are upstream/auth dead-ends found by source-level research).

| Agent | Result |
|---|---|
| `qwen-code` | ✅ verified: reward 1.0 hello-world + real citation-check; gateway usage captured |
| `goose` | ✅ verified-runs: reward 1.0 hello-world; real task ran clean (reward 0.0, agent didn't solve) |
| `vtcode` | ⚠️ installs+launches, `pipe_closed` at ACP session (base_url needs `/v1`) |
| `dirac` | ⚠️ speaks ACP, 1 tool, then `pipe_closed` mid-run |
| `kilo` | ⚠️ installs clean, `pipe_closed` at ACP launch |
| `codebuddy-code` | ⚠️ installs clean, `pipe_closed` at ACP launch |
| `deepagents` | ⚠️ installs clean, ACP `-32603` (colon `provider:model` wiring) |
| `crow-cli` | ⚠️ Python 3.14 + uvx + `crow-mcp` subprocess (not a static binary), `pipe_closed` |
| `mistral-vibe` | ⚠️ `rc=127` at launch (runtime dep) |
| `minion-code` | ⚠️ `rc=127` (uvx launch) |
| `junie` | ⚠️ `rc=127` (JetBrains; needs a JVM) |
| `cline` | ❌ `-32000`: needs a separate `cline auth` step (+ base-URL bugs) |
| `dimcode` | ❌ `-32000`: interactive-only `/connect` + sqlite |
| `github-copilot-cli` | ❌ `-32000`: BYOK not honored in ACP on 1.0.61 |
| `grok-build` | ❌ install `rc=1`: xAI-gated download |
| `stakpak` | ❌ source-traced: ACP model→provider routing defect on v0.3.88 |
| `kimi` | ❌ mandatory interactive OAuth |
| `autohand` | ❌ not headless-configurable for an arbitrary endpoint |
| `poolside` | ❌ ACP path not headless-wirable to a custom endpoint |

⚠️ = installs + launches but the ACP session fails — a bounded per-agent
fix now (exact config schema / base-URL suffix / runtime dep), not open research.
❌ = a real blocker (auth gate, interactive-only, gated install, or upstream
defect).
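
The failure signals in the table (exit codes, ACP error codes, `pipe_closed`) can be bucketed mechanically before a human decides between ⚠️ and ❌. The sketch below is a hypothetical triage helper: the label strings are invented, and reading `rc=127` as a missing command or runtime dependency follows the usual shell convention rather than anything the pipeline is known to do.

```python
from typing import Optional


def classify_probe(
    install_rc: int,
    launch_rc: Optional[int] = None,
    acp_error: Optional[int] = None,
    pipe_closed: bool = False,
) -> str:
    """Map raw probe signals to a coarse failure bucket (labels are illustrative)."""
    if install_rc != 0:
        return "install-failed"
    if launch_rc == 127:  # shell convention: command not found
        return "launch-missing-dependency"
    if launch_rc not in (None, 0):
        return "launch-failed"
    if acp_error == -32000:
        return "auth-required"
    if acp_error == -32603:
        return "acp-internal-error"
    if pipe_closed:
        return "pipe-closed"
    return "ran"


# Signals as reported in the truth table above.
results = {
    "grok-build": classify_probe(install_rc=1),
    "junie": classify_probe(install_rc=0, launch_rc=127),
    "cline": classify_probe(install_rc=0, launch_rc=0, acp_error=-32000),
    "deepagents": classify_probe(install_rc=0, launch_rc=0, acp_error=-32603),
    "kilo": classify_probe(install_rc=0, launch_rc=0, pipe_closed=True),
}
```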

Honest takeaways

- The pipeline genuinely runs every agent; 2 are production-wired (qwen-code npx, goose binary). The binary-install + acp_model_via_env machinery from this work is what makes adding the ⚠️ ones a finite task.
- A real, recurring finding: many ACP agents advertise a validated model option (reject the benchmark's model id) and/or aren't headless-configurable, which is why a thin "it speaks ACP" assumption doesn't translate to "runs as an eval." Each entry now says exactly why.
- I did not fake green: the 14 non-working agents are reported with their precise failure mode, and the 9 vendor-locked + 1 marketplace remain unrunnable by design.

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

There are 4 total unresolved issues (including 3 from previous reviews).


^{Reviewed by Cursor Bugbot for commit 8639dfc. Configure here.}

cursor · 2026-06-13T18:03:20Z

+        f'mkdir -p {dest}; curl -fsSL "$U" -o "$BF_ARCHIVE"; '
+        f"{_extract_cmd(x86, dest)}; "
+        f"chmod +x {bin_path} 2>/dev/null || true; "
+        f"[ -x {bin_path} ]"


Raw binary wrong install path

Medium Severity

For extensionless registry archives, _extract_cmd copies the download to dest/ as the temp filename, while the install script verifies [ -x {dest}/{bin_name}]. When bin_name does not match the downloaded artifact name (e.g. versioned raw Linux binaries), install fails after curl succeeds.

^{Reviewed by Cursor Bugbot for commit 8639dfc. Configure here.}
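
For extensionless downloads, one option is to copy the file straight to the expected binary path instead of into `dest/` under its temporary name. The sketch below is a hypothetical helper, not the package's `_extract_cmd`; the suffix list and the example URL are illustrative.

```python
import posixpath
import shlex
from typing import Optional

ARCHIVE_SUFFIXES = (".tar.gz", ".tgz", ".tar.bz2", ".tar.xz", ".zip")


def place_raw_binary_cmd(url: str, archive: str, dest: str, bin_name: str) -> Optional[str]:
    """For a raw (non-archive) download, install it as dest/bin_name.

    Returns None when the URL looks like an archive, so the caller can
    fall back to a normal extraction step.
    """
    if url.endswith(ARCHIVE_SUFFIXES):
        return None
    target = posixpath.join(dest, bin_name)
    # `install` copies the file and sets the mode in one step, so the
    # later `[ -x dest/bin_name ]` check sees the right name and permissions.
    return f"install -m 0755 {shlex.quote(archive)} {shlex.quote(target)}"


cmd = place_raw_binary_cmd(
    "https://example.com/agent-1.2.3-linux-x64",
    "/tmp/bf-acp-archive",
    "/opt/benchflow/acp/agent",
    "agent",
)
```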

…t fix fan-out

Fanned out one subagent per ⚠️ agent to root-cause its ACP-session failure from the rollout artifacts + upstream source and return a corrected wiring spec, then re-probed live on DeepSeek/Daytona. That turned deepagents green.

- deepagents (npx) NOW wired+verified: reward 1.0 hello-world via the shipped package; real citation-check ran ~10 tool calls then an in-run -32603. Fix: its provider SDK (@langchain/openai) must be installed alongside (new `npm_extra` spec field) and the model passed as `--model openai:<model>` (LangChain colon form), with base URL/key via OPENAI_* env_mapping.
- register.py: support `npm_extra` for npx agents; broaden the `acp_model_via_env` enablement to ANY wired agent without ACP set_model (model is env/flag/config owned), not just model_via=="env".
- README truth table updated: 3 verified (qwen-code, goose, deepagents).

The other fix-spec re-probes shifted failure modes but still fail at install/launch/runtime (rc=127 / rc=1); cline stays blocked (needs a separate `cline auth` step). Each remaining ⚠️ is a bounded per-agent task; ❌ are real blockers.

56 tests; ruff clean; AGENTS.md in sync. Tally: wired 3 / catalog 18 / native 5 / vendor-locked 9 / out-of-scope 1.

xdotli · 2026-06-13T18:49:48Z

Per-agent fix fan-out — deepagents joins the verified set (now 3)

Fanned out one subagent per ⚠️ agent to root-cause its ACP-session failure from the
recorded rollout artifacts + upstream source, returned corrected wiring specs, and
re-probed live. Net delta:

- ✅ deepagents now wired+verified: reward 1.0 hello-world via the shipped package. Root cause was not the model-option (my earlier guess): deepagents-acp lazily init_chat_models a provider SDK the npm package doesn't bundle, so @langchain/openai must be installed alongside (new npm_extra field), and the model is passed as --model openai:<model>. On real citation-check it ran ~10 tools then an in-run -32603 (verified-runs, not verified-solves).
- The other 9 corrected specs shifted failure modes but still fail at install/launch/runtime (rc=127 / rc=1); cline stays blocked (needs a separate cline auth step). Each is now a bounded per-agent task, not open research.
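
The deepagents wiring comes down to two small string builders: one that appends extra npm packages to the install command and one that formats the model in colon form. The sketch below is illustrative. The function names are hypothetical, and `deepagents-acp` is used as the package name only because that is how the comment above refers to it.

```python
import shlex
from typing import Sequence


def npm_install_cmd(package: str, npm_extra: Sequence[str] = ()) -> str:
    """Install the agent plus any provider SDKs it loads lazily."""
    pkgs = [package, *npm_extra]
    return "npm i -g " + " ".join(shlex.quote(p) for p in pkgs)


def model_flag(model: str, provider: str = "openai") -> list[str]:
    """Format the model as `provider:model`, the colon form described above."""
    return ["--model", f"{provider}:{model}"]


install = npm_install_cmd("deepagents-acp", ["@langchain/openai"])
flag = model_flag("deepseek-v4-flash")
```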

Verified end-to-end: 3 — qwen-code (npx), goose (binary), deepagents
(npx). Full truth table in the README. Generalization shipped: register.py now
supports binary installs, npm_extra, and acp_model_via_env for any non-set_model
agent — the rails for wiring the rest.

A second per-agent fix fan-out (subagents root-causing from rollout artifacts + upstream source) plus live re-probes. Net: pushed several ⚠️ agents much closer, but only confirmed package-green ones stay wired.

- kilo: PROBE-green (reward 1.0 hello-world) when its kilo.jsonc is written at INSTALL time, but the package's generic LAUNCH-time config-write regressed (pipe_closed). Kept as catalog with the precise finding (benchflow's launch context doesn't run the python config-write reliably; kilo's model id can't ride its {env:} substitution since it's a config key). Removed the unproven launch-time config-write infra rather than ship a mechanism no wired agent uses.
- Round-2 root-causes recorded in the truth table: vtcode's rc=127 was the dynamic loader failing to find the bundled libghostty-vt.so (fixed via ldconfig) + a missing key mapping (now rc=255 at provider init); cline's -32000 cleared by a `cline auth` prelude (now runs, timed out); junie needs a JVM; crow-cli is really py3.14+uvx+crow-mcp; etc.

Verified package-green stays 3: qwen-code, goose, deepagents. 56 tests; ruff clean; AGENTS.md in sync. Tally: wired 3 / catalog 18 / native 5 / vendor-locked 9 / oos 1.

xdotli wants to merge 5 commits into main from add-acp-registry-agents
