mini-computer-agent: minimal model-agnostic computer-use agent #6

Open
xdotli wants to merge 1 commit into main from feat/mini-computer-acp

@xdotli xdotli commented Jun 13, 2026

mini-computer-agent

A minimal, model-agnostic computer-use agent — the computer-use analog of mini-swe-agent. ~175 lines, one dependency (litellm).

What it is

One loop: screenshot (scrot) → vision model (any, via litellm) → one JSON action → xdotool, repeat. No vendor computer-use tool, no separate grounding model, no tool-calling, and no benchflow/protocol code in the agent itself — it's a pure run callable. BenchFlow's generic ACP serve wraps it (separate benchflow change), so it needs no per-agent shim.
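The loop described above can be sketched roughly as follows. This is an illustration only, not the package's code: `run_loop`, `capture`, `ask_model`, and `execute` are hypothetical names, the `{"action": "done"}` stop signal is an assumption, and the screenshot, model call, and xdotool step are injected as callables so the sketch runs without scrot, litellm, or an X display.

```python
import json
from typing import Callable


def run_loop(
    task: str,
    capture: Callable[[], bytes],            # hypothetical: returns PNG bytes (e.g. via scrot)
    ask_model: Callable[[str, bytes], str],  # hypothetical: returns the model's raw text reply
    execute: Callable[[dict], None],         # hypothetical: performs one action (e.g. via xdotool)
    max_steps: int = 20,
) -> list[dict]:
    """Observe, ask for one JSON action, act, and repeat until 'done' or max_steps."""
    history: list[dict] = []
    for _ in range(max_steps):
        screenshot = capture()
        reply = ask_model(task, screenshot)
        action = json.loads(reply)  # assumes the reply is a clean JSON object
        history.append(action)
        if action.get("action") == "done":  # assumed stop signal
            break
        execute(action)
    return history
```

Keeping the three side effects behind callables is one way to keep such a loop free of protocol or harness code, in the spirit of the "pure run callable" described here.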

Load-bearing detail: coordinate scaling

Vision models emit coordinates normalized to [0,1000], not pixels. core.scale() maps them back (x_px = x/1000 · W). Without this, every click lands ~3× off and looks like the model "can't ground"; with it, grounding is solid. (e.g. 195/508/820 → 250/650/1050 px.)
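A minimal sketch of this scaling, assuming nothing about the real signature of core.scale(): `scale_point` is a hypothetical name, and clamping to the screen bounds is an added assumption rather than something the PR states.

```python
def scale_point(x: float, y: float, width: int, height: int) -> tuple[int, int]:
    """Map [0,1000]-normalized coordinates to pixel coordinates."""
    px = round(x / 1000 * width)   # x_px = x/1000 * W
    py = round(y / 1000 * height)  # y_px = y/1000 * H
    # Assumption: clamp so a value of 1000 still lands on the last valid pixel.
    return min(max(px, 0), width - 1), min(max(py, 0), height - 1)
```

With a screen width of 1280 (an assumption, though it is consistent with the example numbers above), normalized x values 195, 508, and 820 map to 250, 650, and 1050 pixels.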

Validation (BenchFlow + Daytona, gemini-3.5-flash, ACP-served)

7 / 8 overall; single-target grounding 6 / 6.

| Task | Reward |
| --- | --- |
| click-grounding (synthetic) | 1.0 |
| MiniWoB++ click-button / link / dialog / option / tab | 1.0 each |
| MiniWoB++ enter-text | 1.0 (after a 1-line focus-fix) |
| MiniWoB++ click-checkboxes | −0.33 (open: multi-select under-reading) |

Demonstrated dogfood → fix → verify: enter-text flailed on field focus (13 steps); one line of prompt guidance (focus-then-type, verify-before-submit) → clean pass (7 steps).

Contents

mini-computer-agent/ package (core, README, tests, pyproject) + CI workflows.

Follow-ups

  • click-checkboxes multi-select thoroughness is the lone open task failure.
  • Consumed by BenchFlow via its generic ACP serve; the registry entry pip-installs this package.

Note

Low Risk
New isolated package and CI only; no changes to existing agents beyond an extra lint job step.

Overview
Introduces mini-computer-agent, a new standalone package that implements a minimal desktop GUI loop: screenshot → vision model (litellm) → one JSON action → xdotool, exposed as run(task, model, on_step=...). The agent stays protocol-free (BenchFlow wraps it elsewhere); it depends only on litellm plus system tools scrot / xdotool.
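To make the "one JSON action → xdotool" step concrete, here is a hedged sketch that turns an action into an xdotool command line. The action schema (`"action"`, `"x"`, `"y"`, `"text"`, `"key"`) and the name `xdotool_argv` are hypothetical; only the xdotool subcommands themselves (`mousemove`, `click`, `type`, `key`) are real.

```python
def xdotool_argv(action: dict, width: int, height: int) -> list[str]:
    """Build an xdotool argv list for one action (hypothetical action schema)."""
    kind = action.get("action")
    if kind == "click":
        # Coordinates arrive [0,1000]-normalized and are scaled to pixels first.
        x = round(action["x"] / 1000 * width)
        y = round(action["y"] / 1000 * height)
        return ["xdotool", "mousemove", str(x), str(y), "click", "1"]
    if kind == "type":
        return ["xdotool", "type", str(action["text"])]
    if kind == "key":
        return ["xdotool", "key", str(action["key"])]
    raise ValueError(f"unsupported action: {kind!r}")
```

The resulting list could be passed to `subprocess.run(argv, check=True)` on a machine with xdotool installed and an X display available.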

The core adds [0,1000] → pixel coordinate scaling before clicks (vision-model convention), tolerant JSON action parsing, and prompt guidance for focus-then-type / verify-before-done. Hermetic tests cover parsing, scaling, and PNG dimension reads without a model or sandbox.
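One way tolerant action parsing might look is sketched below. It is not the package's parser: `parse_action` is a hypothetical name, and the approach (scan for the first decodable JSON object and ignore surrounding prose or Markdown fences) is an assumption about what "tolerant" means here.

```python
import json


def parse_action(reply: str) -> dict:
    """Return the first JSON object found in a model reply, ignoring surrounding text."""
    decoder = json.JSONDecoder()
    start = reply.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores the rest.
            obj, _ = decoder.raw_decode(reply, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = reply.find("{", start + 1)
    raise ValueError("no JSON object found in model reply")
```

Raising on an unparseable reply, rather than guessing, leaves the retry-or-abort decision to the calling loop.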

CI is extended with ruff lint/format for the new package and a path-filtered pytest workflow (test-mini-computer-agent.yaml), aligned with existing mini-swe-agent workflows.

Reviewed by Cursor Bugbot for commit 902f16a.

… analog)

Pure screenshot -> any vision model (litellm) -> one JSON action -> xdotool,
in ~150 lines with no protocol/harness code. Coordinates are [0,1000]-normalized
and scaled to pixels (raw use mis-clicks ~3x). Hermetic tests (parse/scale/PNG)
+ lint and pytest CI. Validated end-to-end on benchflow+Daytona with
gemini-3.5-flash: 7/8 tasks, single-target grounding 6/6.
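The PNG part of those hermetic tests implies reading image dimensions without a model or sandbox. How the package does this is not shown on this page; the sketch below is one dependency-free way, reading width and height from the IHDR chunk defined by the PNG format. `png_size` is a hypothetical name.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data: bytes) -> tuple[int, int]:
    """Read (width, height) from the IHDR chunk at the start of a PNG byte string."""
    # Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type "IHDR",
    # then big-endian 4-byte width and 4-byte height.
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG image")
    width, height = struct.unpack(">II", data[16:24])
    return width, height
```

Because only the first 24 bytes are inspected, a test can build a header by hand instead of shipping a real screenshot.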

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

elif kind == "scroll":
    dy = int(action.get("dy", 0))
    for _ in range(max(1, abs(dy))):
        _xdotool("click", "4" if dy < 0 else "5")

Zero scroll still scrolls down

Low Severity

In the scroll branch, max(1, abs(dy)) forces at least one wheel click when dy is 0 or omitted (defaults to 0), so a no-op scroll becomes one downward scroll. That contradicts the prompt’s signed dy semantics and can move the UI when the model intended no scrolling.
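One possible fix is to drop the `max(1, ...)` floor so that `dy == 0` produces no wheel clicks. The sketch below isolates that logic in a hypothetical helper, `scroll_buttons`, which is not part of the PR; it returns the X11 button ids to click rather than calling xdotool directly.

```python
def scroll_buttons(dy: int) -> list[str]:
    """Return one X11 wheel-button id per scroll notch; empty when dy == 0."""
    button = "4" if dy < 0 else "5"  # 4 = wheel up, 5 = wheel down
    return [button] * abs(dy)
```

In the quoted branch this would amount to iterating `for button in scroll_buttons(dy): _xdotool("click", button)`, which keeps the signed `dy` semantics and makes a zero scroll a true no-op.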
