mini-computer-agent: minimal model-agnostic computer-use agent #6
Open
xdotli wants to merge 1 commit into
… analog) Pure screenshot -> any vision model (litellm) -> one JSON action -> xdotool, in ~150 lines with no protocol/harness code. Coordinates are [0,1000]-normalized and scaled to pixels (raw use mis-clicks ~3x). Hermetic tests (parse/scale/PNG) + lint and pytest CI. Validated end-to-end on benchflow+Daytona with gemini-3.5-flash: 7/8 tasks, single-target grounding 6/6.
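As a rough illustration of the coordinate handling mentioned in the commit message, the sketch below maps a [0,1000]-normalized point to pixels. The name `scale_point`, the clamping step, and the 1280x720 screen used in the usage line are assumptions made for illustration; the package's own `core.scale()` may have a different signature and behavior.

```python
def scale_point(x_norm: float, y_norm: float, width: int, height: int) -> tuple[int, int]:
    """Map a [0,1000]-normalized point to pixel coordinates (hypothetical helper)."""
    # x_px = x / 1000 * W, and likewise for y with the screen height.
    x_px = round(x_norm * width / 1000)
    y_px = round(y_norm * height / 1000)
    # Assumption: clamp so a value of exactly 1000 (or slightly out of range)
    # still lands on a valid on-screen pixel.
    x_px = min(max(x_px, 0), width - 1)
    y_px = min(max(y_px, 0), height - 1)
    return x_px, y_px


if __name__ == "__main__":
    # Hypothetical 1280x720 screen; the centre of the normalized space.
    print(scale_point(500, 500, 1280, 720))
```

Passing the raw normalized values straight to the pointer tool would treat them as pixels, which is the mis-click failure the commit message describes.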
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 902f16a.
    elif kind == "scroll":
        dy = int(action.get("dy", 0))
        for _ in range(max(1, abs(dy))):
            _xdotool("click", "4" if dy < 0 else "5")
Zero scroll still scrolls down
Low Severity
In the scroll branch, max(1, abs(dy)) forces at least one wheel click when dy is 0 or omitted (defaults to 0), so a no-op scroll becomes one downward scroll. That contradicts the prompt’s signed dy semantics and can move the UI when the model intended no scrolling.
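One possible way to address this, sketched under assumptions: derive the number of wheel clicks from abs(dy) alone, so that a zero or missing dy yields no clicks at all. The helper name `scroll_clicks` is hypothetical and not part of the PR; the button mapping ("4" for negative dy, "5" otherwise) is taken from the snippet under review.

```python
def scroll_clicks(action: dict) -> list[str]:
    """Return one xdotool button name per wheel notch (hypothetical helper)."""
    dy = int(action.get("dy", 0))
    button = "4" if dy < 0 else "5"
    # [button] * 0 is an empty list, so dy == 0 (or an omitted dy) is a true no-op
    # instead of being forced up to one click by max(1, ...).
    return [button] * abs(dy)


# In the agent's scroll branch this could be used roughly as:
#     for button in scroll_clicks(action):
#         _xdotool("click", button)
```

Returning the planned clicks as a list keeps the logic testable without an X server; replacing `range(max(1, abs(dy)))` with `range(abs(dy))` in place would have the same effect.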


mini-computer-agent
A minimal, model-agnostic computer-use agent: the computer-use analog of mini-swe-agent. ~175 lines, one dependency (litellm).

What it is
One loop: screenshot (scrot) → vision model (any, via litellm) → one JSON action → xdotool, repeat. No vendor computer-use tool, no separate grounding model, no tool-calling, and no benchflow/protocol code in the agent itself; it's a pure run() callable. BenchFlow's generic ACP serve wraps it (separate benchflow change), so it needs no per-agent shim.

Load-bearing detail: coordinate scaling
Vision models emit coordinates normalized to [0,1000], not pixels. core.scale() maps them back (x_px = x/1000 · W). Without this, every click lands ~3× off and looks like the model "can't ground"; with it, grounding is solid (e.g. 195/508/820 → 250/650/1050 px).

Validation (BenchFlow + Daytona, gemini-3.5-flash, ACP-served)
7 / 8 overall; single-target grounding 6 / 6.
Demonstrated dogfood → fix → verify: enter-text flailed on field focus (13 steps); one line of prompt guidance (focus-then-type, verify-before-submit) → clean pass (7 steps).

Contents
mini-computer-agent/ package (core, README, tests, pyproject) + CI workflows.

Follow-ups
click-checkboxes multi-select thoroughness is the lone open task failure.

Note
Low Risk
New isolated package and CI only; no changes to existing agents beyond an extra lint job step.
Overview
Introduces mini-computer-agent, a new standalone package that implements a minimal desktop GUI loop: screenshot → vision model (litellm) → one JSON action → xdotool, exposed as run(task, model, on_step=...). The agent stays protocol-free (BenchFlow wraps it elsewhere); it depends only on litellm plus system tools scrot / xdotool.
The core adds [0,1000] → pixel coordinate scaling before clicks (vision-model convention), tolerant JSON action parsing, and prompt guidance for focus-then-type / verify-before-done. Hermetic tests cover parsing, scaling, and PNG dimension reads without a model or sandbox.
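To make "tolerant JSON action parsing" concrete, here is a minimal sketch of one way such a parser could work: scan the model reply for the first position where a JSON object decodes cleanly, ignoring any prose around it. The name `parse_action` and the "action"/"dy" keys in the usage line are illustrative assumptions; the package's real parser and action schema may differ.

```python
import json
from typing import Optional


def parse_action(reply: str) -> Optional[dict]:
    """Return the first JSON object found in a model reply, or None (hypothetical helper)."""
    decoder = json.JSONDecoder()
    for start, ch in enumerate(reply):
        if ch != "{":
            continue
        try:
            # raw_decode parses one JSON value starting at `start` and
            # tolerates trailing text after it.
            obj, _end = decoder.raw_decode(reply, start)
        except json.JSONDecodeError:
            continue  # not valid JSON from this brace; try the next one
        if isinstance(obj, dict):
            return obj
    return None


if __name__ == "__main__":
    # Illustrative reply with chatter around the action object.
    print(parse_action('I will scroll up. {"action": "scroll", "dy": -2} Done.'))
```

Because it needs only the standard library, a parser like this can be covered by hermetic tests in the same spirit as the ones the summary describes, with no model or sandbox involved.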
CI is extended with ruff lint/format for the new package and a path-filtered pytest workflow (test-mini-computer-agent.yaml), aligned with existing mini-swe-agent workflows.