feat(agents): Add MiMo Code (mimo) ACP agent by Yiminnn · Pull Request #679 · benchflow-ai/benchflow

Yiminnn · 2026-06-11T22:19:38Z

Summary

Adds mimo (Xiaomi MiMo Code, npm @mimo-ai/cli) as a first-class ACP agent, layered on the v0.6.0 agent stack (#665).

MiMo Code is an OpenCode fork that ships a native mimo acp JSON-RPC stdio server — its initialize handshake reports agentInfo.name="OpenCode" — so this lands as a registry-only change mirroring the existing opencode entry, exactly as the registry docstring prescribes ("adding a new agent is a registry-only change"). No core edits, no shim, no if agent == "mimo" special cases.

Changes

src/benchflow/agents/registry.py — AGENTS["mimo"]:
- install_cmd=_js_agent_install("mimo", "@mimo-ai/cli@0.1.0") (pinned; isolated /opt/benchflow node prefix)
- launch_cmd=_js_agent_launch("mimo", "acp"), protocol="acp", acp_model_format="provider/model" (models.dev ids, OpenCode lineage)
- env_mapping maps both BENCHFLOW_PROVIDER_BASE_URL→OPENAI_BASE_URL and BENCHFLOW_PROVIDER_API_KEY→OPENAI_API_KEY (codex-acp precedent) so the non-proxy path needs no core edit
- web tools disabled via ~/.config/mimocode/mimocode.json (tools.webfetch=false, opencode precedent)
tests/integration/configs/mimo.yaml — integration config; defaults to mimo/mimo-auto, MiMo's free no-account channel (works headless in-sandbox; the runtime skips the LiteLLM proxy and delivers the native model id via ACP set_model)
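
For orientation, the registry entry described above might look roughly like the sketch below. `AgentSpec` and the bodies of the two helpers are hypothetical stand-ins so the snippet runs on its own; only the helper names, their arguments, and the field values come from this PR, and the real entry type in `registry.py` may carry more fields.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AgentSpec:
    """Hypothetical stand-in for the registry's real entry type."""

    install_cmd: str
    launch_cmd: str
    protocol: str
    acp_model_format: str
    env_mapping: dict[str, str] = field(default_factory=dict)


def _js_agent_install(name: str, package: str) -> str:
    # Stand-in body: the real helper emits a shell install command that
    # targets the isolated /opt/benchflow node prefix.
    return f"<install {package} for {name} under /opt/benchflow>"


def _js_agent_launch(name: str, *args: str) -> str:
    # Stand-in body: the real helper may resolve the binary inside the prefix.
    return " ".join((name, *args))


AGENTS = {
    "mimo": AgentSpec(
        install_cmd=_js_agent_install("mimo", "@mimo-ai/cli@0.1.0"),
        launch_cmd=_js_agent_launch("mimo", "acp"),
        protocol="acp",
        acp_model_format="provider/model",
        # Both provider settings are mapped, so the non-proxy path
        # needs no special case in core code.
        env_mapping={
            "BENCHFLOW_PROVIDER_BASE_URL": "OPENAI_BASE_URL",
            "BENCHFLOW_PROVIDER_API_KEY": "OPENAI_API_KEY",
        },
    )
}
```

The point of the shape is that everything agent-specific lives in one dictionary entry; nothing outside the registry has to know the name "mimo".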

Evidence (live on a GCP VM, Daytona sandbox, v0.6.0rc6)

tests/test_registry_invariants.py + tests/test_agent_registry.py: 171 passed (the new entry is auto-covered by the executable registry schema, incl. the JS-isolation invariants)
ruff check / ruff format --check / ty check: clean; generated install_cmd passes dash -n
Stdio ACP handshake verified: initialize → protocolVersion 1, agentInfo "OpenCode"
Live rollouts on mimo/mimo-auto (SkillsBench task.md tasks): 4/4 healthy (tools>0, tokens>0, reward non-None), 2 solved — jax-computing-basics r=1.0 (8 tools, 26k tok), threejs-to-obj r=1.0 (6 tools); data-to-d3, weighted-gdp-calc healthy fails
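
The handshake check in the evidence list can be pictured with the minimal sketch below. It assumes newline-delimited JSON-RPC 2.0 over stdio and sends only `protocolVersion`; a real client may include capabilities and other fields. The function names are hypothetical, and `sample_reply` is an illustrative message shaped after the values reported above, not captured output.

```python
import json


def build_initialize_request(request_id: int = 1) -> str:
    """Serialize a JSON-RPC 2.0 `initialize` request as one newline-terminated line."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "initialize",
        "params": {"protocolVersion": 1},
    }
    return json.dumps(message) + "\n"


def check_initialize_response(line: str) -> tuple[int, str]:
    """Return (protocolVersion, agentInfo.name) from a reply line, or raise on error."""
    reply = json.loads(line)
    if "error" in reply:
        raise RuntimeError(f"initialize failed: {reply['error']}")
    result = reply["result"]
    return result["protocolVersion"], result["agentInfo"]["name"]


# Illustrative reply only; a real one would be read from the agent's stdout.
sample_reply = json.dumps(
    {
        "jsonrpc": "2.0",
        "id": 1,
        "result": {"protocolVersion": 1, "agentInfo": {"name": "OpenCode"}},
    }
)
```

In a live check, the request line would be written to the stdin of the `mimo acp` process and the first reply line passed to `check_initialize_response`.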

Notes for reviewers

The free mimo-auto channel needs no credentials; for MiMo's flagship models, xiaomi/mimo-v2.5-pro routes through the existing xiaomi provider (XIAOMI_API_KEY/XIAOMI_BASE_URL).
Proxy-alias models (e.g. openai/benchflow-…) are rejected by the opencode family's models.dev validation (ProviderModelNotFoundError) — this affects opencode identically and predates this PR; the integration config therefore defaults to the native channel.

How to test

uv run python -m pytest tests/test_registry_invariants.py tests/test_agent_registry.py -q
uv run bench agent show mimo
uv run bench eval create --tasks-dir <tasks> --include jax-computing-basics \
  --agent mimo --model mimo/mimo-auto --sandbox daytona --skill-mode no-skill \
  --concurrency 1 --jobs-dir jobs/mimo-smoke

xdotli · 2026-06-12T08:08:33Z

Reviewing for the v0.6 release: the agent registry change itself looks good, but a review (including a trace of the integration-suite wiring) found three gaps in tests/integration/run.sh. Together they would keep mimo from running in the default suite and would cause a config-audit mismatch. Flagging so this can land cleanly:

mimo missing from ALL_AGENTS (run.sh:49-58) — a no-arg run.sh never runs the new agent, so the shipped mimo.yaml is dead config in the normal suite path.
model_for_agent has no mimo case (run.sh:41-47 + check_results.py:728-730) — run.sh ignores the YAML model: and falls to the default gemini-3.1-flash-lite-preview, while config.json records xiaomi/mimo-v2.5-pro from the YAML → guaranteed model-mismatch audit failure.
has_creds_for has no mimo case (run.sh:95-109) — it falls to has_gemini_key/GEMINI_API_KEY, but mimo needs XIAOMI_API_KEY/XIAOMI_BASE_URL (per the YAML comment), so the suite green-lights launching mimo with the wrong credential gate and the eval then fails closed in resolve_agent_env.

Each is a small run.sh/model_for_agent/has_creds_for addition. Happy to help wire these if useful — leaving the PR to you since it's yours.

MiMo Code (Xiaomi) is an OpenCode fork that ships a native `mimo acp` JSON-RPC stdio server (its initialize handshake reports agentInfo.name="OpenCode"), so it registers as a first-class ACP agent with a registry-only change mirroring opencode:

- AGENTS["mimo"]: _js_agent_install("mimo", "@mimo-ai/cli@0.1.0") into the isolated /opt/benchflow prefix, launch "mimo acp", acp_model_format=provider/model, env_mapping maps BOTH base_url+api_key (codex-acp precedent) so the non-proxy path needs no core edit.
- tests/integration/configs/mimo.yaml: defaults to mimo/mimo-auto, MiMo's free no-account channel (works headless in-sandbox; benchflow skips the LiteLLM proxy and sends the native model id via ACP set_model).

Validated on the symphony VM (daytona, v0.6.0rc6): registry invariants + agent registry 171 passed, ruff/ty clean. Live rollouts on mimo-auto: 4/4 healthy, 2 solved (jax-computing-basics r=1.0/8 tools, threejs-to-obj r=1.0/6 tools; data-to-d3 + weighted-gdp-calc healthy fails).

The committed `model: mimo/mimo-auto` does not run:

- there is no `mimo` provider registered (only `xiaomi`), so the model resolves to no provider; and
- `mimo-auto` is not a served model: the endpoint returns `400 "Not supported model mimo-auto"`.

Use the existing `xiaomi` provider with a real model id; set XIAOMI_BASE_URL + XIAOMI_API_KEY to point at the MiMo endpoint.

Validated live on Daytona (agent=mimo, weighted-gdp-calc), graded by the agent-judge integration suite. Findings:

- benchflow wiring is correct: the proxy injects OPENAI_BASE_URL=<proxy> + OPENAI_API_KEY=<master_key> for the mimo agent (its env_mapping maps both base_url and api_key), and the agent-judge realness gate behaves correctly.
- REMAINING BLOCKER (agent-side, not benchflow): @mimo-ai/cli@0.1.0 completes the ACP handshake and accepts the model, then ends the turn with zero model requests (total_requests=0, 0 tool calls, empty agent log), so every rollout is empty and the realness gate rejects it. The MiMo Code CLI needs its model resolution debugged before this suite can pass; this config fix is necessary but not sufficient.

@xdotli

Review follow-ups for #679 (thanks @xdotli), rebased onto latest main:

- BLOCKER: non-proxy modelId was mangled. For --model xiaomi/mimo-v2.5-pro, strip_provider_prefix removed the registered xiaomi/ prefix and the models.dev heuristics fell through to anthropic/mimo-v2.5-pro (nonexistent route). Added the ("mimo","xiaomi") heuristic so the id passes through as xiaomi/<model>, the MiMo CLI catalog form. Proxy path untouched (openai/benchflow-<alias>); both shapes pinned in test_litellm_hardening.
- run.sh (the 3 review gaps): mimo in ALL_AGENTS, model_for_agent emits xiaomi/mimo-v2.5-pro verbatim (check_results compares by string equality), has_creds_for gates on XIAOMI_API_KEY + XIAOMI_BASE_URL (both required; resolve_agent_env fails closed without either).
- Conventional per-agent tests: mimo web-policy setup_cmd writes .config/mimocode/mimocode.json with tools.webfetch=false (+ snippet-list coverage in test_internet_policy).
- docs/integration-tests.md: agent count 8→9 + XIAOMI creds row; registry comment documents both modelId shapes.

…points

MiMo Code pins its native models.dev xiaomi provider to the standard platform endpoint, so token-plan/regional keys (XIAOMI_BASE_URL) fail with "Invalid API Key". Ship the override through the existing CredentialFile hook: when XIAOMI_API_KEY is present, write ~/.config/mimocode/mimocode.json with provider.xiaomi.options pointing at {env:XIAOMI_BASE_URL}/{env:XIAOMI_API_KEY}. The CLI resolves the env references at runtime, so no secret is materialized in the file.

mimo.yaml runs with usage_tracking off: the CLI validates model ids against its own catalog and rejects LiteLLM proxy aliases (openai/benchflow-*), so the eval must send the native xiaomi/<model> id directly (kept intact by the ("mimo","xiaomi") heuristic).

Live-verified on Daytona (citation-check, legacy layout):

- mimo + xiaomi/mimo-v2.5-pro: PASS reward=1.0, 12 tool calls
- control openhands + deepseek-v4-flash: PASS reward=1.0, 27 tool calls

Yiminnn · 2026-06-13T00:16:10Z

@xdotli all three gaps closed, plus two deeper blockers your model-spec fix surfaced — and the lane is now live-verified end to end. Also rebased onto latest main (001da3e) and retargeted the PR there, so the real CI gate runs.

Your 3 run.sh gaps (ea92d688):

mimo added to ALL_AGENTS
model_for_agent emits xiaomi/mimo-v2.5-pro verbatim (check_results compares by string equality)
has_creds_for gates on XIAOMI_API_KEY and XIAOMI_BASE_URL (both required — resolve_agent_env fails closed without either)
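
run.sh is a shell script; purely as an illustration, the two per-agent gates can be restated in Python as below. This is a sketch under simplifying assumptions: every agent other than mimo is collapsed onto the Gemini default mentioned in the review, whereas the real script has its own cases for other agents.

```python
from collections.abc import Mapping

# Default model named in the review comment for agents without their own case.
DEFAULT_MODEL = "gemini-3.1-flash-lite-preview"


def model_for_agent(agent: str) -> str:
    """Model id the suite passes to the eval.

    It has to equal the YAML `model:` string exactly, because the
    result check compares the two by string equality.
    """
    if agent == "mimo":
        return "xiaomi/mimo-v2.5-pro"
    return DEFAULT_MODEL


def has_creds_for(agent: str, env: Mapping[str, str]) -> bool:
    """Decide whether the suite should launch `agent` given the environment."""
    if agent == "mimo":
        # Both variables are required; a missing one means the run is skipped
        # up front instead of failing later during environment resolution.
        return bool(env.get("XIAOMI_API_KEY")) and bool(env.get("XIAOMI_BASE_URL"))
    return bool(env.get("GEMINI_API_KEY"))
```

Gating on both variables keeps the skip decision consistent with the fail-closed behavior described for resolve_agent_env.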

Blocker 1 — non-proxy modelId mangling (ea92d688): strip_provider_prefix removed the registered xiaomi/ prefix and the models.dev heuristics fell through to anthropic/mimo-v2.5-pro (nonexistent route, silent except a log warning). Fixed with a ("mimo","xiaomi") heuristic entry; both shapes pinned in test_litellm_hardening (xiaomi/... passes through; proxy aliases still format to openai/benchflow-*).
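
To make the mangling concrete, here is a toy model of the non-proxy path. It is not the benchflow implementation: `format_acp_model`, the constants, and the substring matching are assumptions chosen to reproduce the behavior described above, and the proxy-alias path is deliberately left out.

```python
REGISTERED_PROVIDERS = ("xiaomi",)  # only the provider named in this PR
HEURISTICS = (("mimo", "xiaomi"),)  # the entry added by the fix
FALLBACK_PROVIDER = "anthropic"     # where unmatched ids ended up before the fix


def strip_provider_prefix(model: str) -> str:
    """Drop a leading `<provider>/` when the provider is registered."""
    provider, sep, rest = model.partition("/")
    return rest if sep and provider in REGISTERED_PROVIDERS else model


def format_acp_model(model: str, heuristics=HEURISTICS) -> str:
    """Rebuild a `provider/model` id for the agent (hypothetical helper name)."""
    bare = strip_provider_prefix(model)
    for needle, provider in heuristics:
        if needle in bare:
            return f"{provider}/{bare}"
    return f"{FALLBACK_PROVIDER}/{bare}"


# With the heuristic the id round-trips; without it, it lands on the fallback.
fixed = format_acp_model("xiaomi/mimo-v2.5-pro")
broken = format_acp_model("xiaomi/mimo-v2.5-pro", heuristics=())
```

Under these assumptions `fixed` keeps the `xiaomi/` prefix while `broken` reproduces the `anthropic/mimo-v2.5-pro` route from the bug report.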

Blocker 2 — fixed-endpoint auth (f51d340c): the CLI pins its native xiaomi provider to the standard platform endpoint, so token-plan/regional keys fail with "Invalid API Key". Shipped via the existing CredentialFile hook: when XIAOMI_API_KEY is set, benchflow writes ~/.config/mimocode/mimocode.json overriding provider.xiaomi.options with {env:XIAOMI_BASE_URL}/{env:XIAOMI_API_KEY} references (resolved by the CLI at runtime — no secret materialized in the file). mimo.yaml runs with usage_tracking: "off" since the CLI rejects LiteLLM proxy aliases (openai/benchflow-*) against its catalog — the native id must go through directly.
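
A minimal sketch of the override file follows. The `provider.xiaomi.options` path and the `{env:...}` references come from the description above; the option key names `baseURL` and `apiKey` are assumed from OpenCode-style configs and may differ in the MiMo CLI. The function names are hypothetical, and any other settings stored in the same file (such as the web-tool policy) are omitted.

```python
import json
from pathlib import Path


def build_mimocode_config() -> dict:
    """Provider override that refers to env vars instead of embedding values."""
    return {
        "provider": {
            "xiaomi": {
                "options": {
                    # Key names are an assumption; the values are literal
                    # placeholders that the CLI is said to resolve at runtime.
                    "baseURL": "{env:XIAOMI_BASE_URL}",
                    "apiKey": "{env:XIAOMI_API_KEY}",
                }
            }
        }
    }


def write_mimocode_config(home: Path) -> Path:
    """Write the override under <home>/.config/mimocode/mimocode.json."""
    path = home / ".config" / "mimocode" / "mimocode.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(build_mimocode_config(), indent=2) + "\n")
    return path
```

Because the file holds only references, it reads the same whatever key is in the environment, which is what keeps the secret out of the sandbox filesystem.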

Live evidence (Daytona, citation-check, legacy layout):

mimo + xiaomi/mimo-v2.5-pro: PASS, reward 1.0, 12 tool calls (model id arrives intact, token-plan endpoint authenticated)
control openhands + deepseek/deepseek-v4-flash: PASS, reward 1.0, 27 tool calls

Also added the conventional per-agent tests (web-policy setup_cmd writes mimocode.json tools.webfetch=false + snippet coverage) and the docs updates (integration-tests.md 8→9 agents + creds row).

One heads-up beyond this PR: skillsbench upstream just converted all tasks to native task.md (64395c1, skillsbench@1.2 draft) — main's legacy-only discovery can't see those packages, which makes landing the v0.6 task.md line more pressing. The live proofs above used the pre-migration legacy layout.

xdotli mentioned this pull request Jun 11, 2026

sync: bring v0.6-integration into release/v0.6.0 (ATIF/ADP runtime emission + router docs/tests + smoke gate) #680

Merged

Yiminnn mentioned this pull request Jun 11, 2026

feat(eval): Capture silent provider API errors as unhealthy results #682

Merged

Yiminnn and others added 3 commits June 12, 2026 23:47

Yiminnn force-pushed the feat/mimo-agent branch from ed6de55 to ea92d68 Compare June 12, 2026 23:53

Yiminnn changed the base branch from release/v0.6.0 to main June 12, 2026 23:53

Yiminnn wants to merge 4 commits into main from feat/mimo-agent
