CanpGrid

Adaptive Recursive Image Grid for Multimodal Agents

CanpGrid is a progressive image observation tool for multimodal agents. It generates guide-lined zoom views, supports recursive region inspection, and resolves structured spatial references back to original image regions or points.

CanpGrid does not perform clicking, UI automation, object detection, OCR, task execution, or UI memory. It is only a visual observation and spatial referencing layer.

Features

Adaptive image grids
Guide-lined zoom views
Recursive region inspection
Grid, ruler, and hybrid overlays
Grid intersection coordinate system
Region path to original bbox
Flexible point_spec protocol
Candidate point preview images
Palette-guided color snap choices
CLI and Python API
Agent-friendly JSON outputs
Calibration-ready design

Concept

Original image
-> guide-lined grid view
-> zoom selected cell
-> guide-lined local view
-> resolve region / resolve point

The model observes annotated images and describes regions using structured grid paths. CanpGrid maps those paths back to original-image coordinates.

Installation

python -m pip install -e ".[dev]"

CanpGrid requires Python 3.10 or newer.

Python API

from canpgrid import (
    extract_color_choices,
    create_cell_ruler_view,
    create_grid_view,
    preview_point,
    resolve_point,
    resolve_region,
    zoom_region,
)

view = create_grid_view(
    "examples/sample.png",
    grid_size=[12, 7],
    overlay_mode="grid",
    out_dir="outputs",
)

levels = [{"grid_size": [12, 7], "cell": [6, 2]}]

zoomed = zoom_region(
    "examples/sample.png",
    levels,
    next_grid_size=[8, 6],
    overlay_mode="hybrid",
    ruler_config={"tick_x": 16, "tick_y": 16},
    out_dir="outputs",
)

region = resolve_region("examples/sample.png", levels)

cell_ruler = create_cell_ruler_view(
    "examples/sample.png",
    levels,
    grid_size=[8, 6],
    cell=[3, 4],
    ruler_config={"tick_x": 10, "tick_y": 10},
    zoom_factor=4,
    out_dir="outputs",
)

point = resolve_point(
    "examples/sample.png",
    levels,
    {
        "type": "cell_ruler_point",
        "grid_size": [8, 6],
        "cell": [3, 4],
        "x": 3,
        "y": 6,
        "ruler_size": [10, 10],
    },
)

color_choices = extract_color_choices(
    "examples/sample.png",
    bbox=point["final_region_bbox_on_original"],
    palette_size=8,
)

preview = preview_point(
    "examples/sample.png",
    levels,
    {
        "type": "cell_ruler_point",
        "grid_size": [8, 6],
        "cell": [3, 4],
        "x": 3,
        "y": 6,
        "ruler_size": [10, 10],
    },
    preview_on="both",
    marker_style="ring_crosshair_inset",
    out_dir="outputs",
)

create_grid_view, zoom_region, and create_cell_ruler_view always return an annotated_image_path. The bbox metadata is companion data, not the main output.

create_cell_ruler_view highlights one selected grid cell and draws a finer ruler inside it, so an agent can say "cell [3, 4], horizontal tick 3, vertical tick 6" without spending another observation turn on zooming.

preview_point creates a non-executing focus preview so an agent can inspect a candidate point before confirming or adjusting it. Use preview_on="both" when the local view is highly zoomed: the current-view preview checks precision, and the original-image preview keeps the global UI context visible.

extract_color_choices turns a local crop into a small c1, c2, ... palette so weaker models can choose a color ID for color_snap_point instead of inventing a precise hex value.
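To make the "cell [3, 4], horizontal tick 3, vertical tick 6" reference concrete, here is a minimal sketch of how such a reference could map to a pixel in the local view. It is not CanpGrid's implementation: the function name cell_ruler_point_to_xy is hypothetical, and it assumes uniformly sized cells and ruler ticks that start at 0 on the cell's left/top edge and end at ruler_size on the right/bottom edge.

```python
def cell_ruler_point_to_xy(view_size, grid_size, cell, x, y, ruler_size):
    """Map a cell plus ruler ticks to a pixel position in the local view.

    Hypothetical helper; assumes uniform cells and evenly spaced ticks.
    """
    cols, rows = grid_size
    col, row = cell
    if not (0 <= col < cols and 0 <= row < rows):
        raise ValueError(f"cell {cell} is outside grid {grid_size}")
    if not (0 <= x <= ruler_size[0] and 0 <= y <= ruler_size[1]):
        raise ValueError(f"tick ({x}, {y}) is outside ruler {ruler_size}")

    cell_w = view_size[0] / cols
    cell_h = view_size[1] / rows
    # Start at the cell's top-left corner, then move by whole ticks.
    px = col * cell_w + x * cell_w / ruler_size[0]
    py = row * cell_h + y * cell_h / ruler_size[1]
    return (px, py)


# An 800x600 local view with an 8x6 grid has 100x100 pixel cells.
point = cell_ruler_point_to_xy((800, 600), (8, 6), (3, 4), 3, 6, (10, 10))
print(point)
```

The result is in local-view pixels. Mapping it back to the original image still requires scaling by the region that the levels path selects, which is what resolve_point is for.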

CLI

Create a grid observation view:

canpgrid grid examples/sample.png --density medium --out outputs/

Use an explicit grid:

canpgrid grid examples/sample.png --grid-size 12x7 --out outputs/

Use a ruler overlay:

canpgrid grid examples/sample.png --overlay-mode ruler --detail-mode fine --ruler-size 16x16 --out outputs/

Zoom a selected region:

canpgrid zoom examples/sample.png --levels '[{"grid_size":[12,7],"cell":[6,2]}]' --out outputs/

Add a fine ruler inside a selected cell without another zoom step:

canpgrid cell-ruler examples/sample.png --levels '[{"grid_size":[12,7],"cell":[6,2]}]' --grid-size 8x6 --cell 3x4 --ruler-size 10x10 --zoom-factor 4 --out outputs/

Resolve a region:

canpgrid resolve-region examples/sample.png --levels '[{"grid_size":[12,7],"cell":[6,2]},{"grid_size":[8,6],"cell":[3,4]}]'

Resolve a point:

canpgrid resolve-point examples/sample.png --levels '[{"grid_size":[12,7],"cell":[6,2]}]' --point-spec '{"type":"normalized_point","value":["1/2","1/2"]}'

Preview a candidate point:

canpgrid preview-point examples/sample.png --levels '[{"grid_size":[12,7],"cell":[6,2]}]' --point-spec '{"type":"normalized_point","value":["1/2","1/2"]}' --preview-on both --marker-style ring_crosshair_inset

All CLI commands emit JSON.
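The normalized_point spec used in the resolve-point and preview-point commands passes its coordinates as fraction strings such as "1/2". The sketch below shows one way such a value could be interpreted inside a region bbox. The helper name normalized_point_to_xy, the (left, top, right, bottom) bbox order, and the 0..1 range check are assumptions for illustration, not CanpGrid's documented behavior.

```python
from fractions import Fraction


def normalized_point_to_xy(bbox, value):
    """Place a normalized point inside bbox = (left, top, right, bottom).

    Hypothetical helper; each entry of `value` may be a fraction string
    like "1/2", a decimal string, or a number.
    """
    left, top, right, bottom = bbox
    fx, fy = (Fraction(str(v)) for v in value)
    if not (0 <= fx <= 1 and 0 <= fy <= 1):
        raise ValueError(f"normalized value {value} is outside 0..1")
    x = left + float(fx) * (right - left)
    y = top + float(fy) * (bottom - top)
    return (x, y)


# The center of a 100x100 region whose top-left corner is at (600, 200).
print(normalized_point_to_xy((600, 200, 700, 300), ["1/2", "1/2"]))
```

Parsing through Fraction keeps inputs like "1/3" exact until the final conversion to pixels.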

Protocol Summary

CanpGrid coordinates are based on grid intersections. A grid_size of [12, 7] means 12 columns and 7 rows. Intersections run from [0, 0] to [12, 7]; cells run from [0, 0] to [11, 6], and each cell is addressed by its top-left intersection.
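The intersection and cell convention can be illustrated with a short sketch. The helper names below are hypothetical, and the code assumes evenly spaced grid lines on a canvas of known pixel size; CanpGrid's own grids may be spaced differently.

```python
def intersection_to_pixel(grid_size, intersection, canvas_size):
    """Pixel position of a grid intersection [ix, iy] (hypothetical helper)."""
    cols, rows = grid_size
    ix, iy = intersection
    if not (0 <= ix <= cols and 0 <= iy <= rows):
        raise ValueError(f"intersection {intersection} is outside {grid_size}")
    width, height = canvas_size
    return (width * ix / cols, height * iy / rows)


def cell_to_bbox(grid_size, cell, canvas_size):
    """Bbox (left, top, right, bottom) of the cell addressed by its top-left intersection."""
    cols, rows = grid_size
    col, row = cell
    if not (0 <= col < cols and 0 <= row < rows):
        raise ValueError(f"cell {cell} is outside {grid_size}")
    left, top = intersection_to_pixel(grid_size, (col, row), canvas_size)
    right, bottom = intersection_to_pixel(grid_size, (col + 1, row + 1), canvas_size)
    return (left, top, right, bottom)


# On a 1200x700 canvas, a [12, 7] grid has 100x100 pixel cells.
print(intersection_to_pixel((12, 7), (12, 7), (1200, 700)))  # bottom-right corner
print(cell_to_bbox((12, 7), (6, 2), (1200, 700)))
```

Intersections accept indices up to and including the grid size, while cells stop one short, which is why [12, 7] is a valid intersection but not a valid cell.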

Recursive zoom paths are represented as levels:

{
  "levels": [
    {"grid_size": [12, 7], "cell": [6, 2]},
    {"grid_size": [8, 6], "cell": [3, 4]}
  ]
}

Each level is relative to the local canvas produced by the previous level.
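Because each level is relative to the previous one, a path can be resolved by repeatedly narrowing a bbox. The following is a minimal sketch of that idea, not the code behind resolve_region: resolve_levels is a hypothetical name, and uniform cell sizes are assumed.

```python
def resolve_levels(image_size, levels):
    """Narrow the full image down to the bbox selected by a levels path.

    Hypothetical helper; returns (left, top, right, bottom) in
    original-image pixels.
    """
    left, top = 0.0, 0.0
    width, height = float(image_size[0]), float(image_size[1])
    for level in levels:
        cols, rows = level["grid_size"]
        col, row = level["cell"]
        if not (0 <= col < cols and 0 <= row < rows):
            raise ValueError(f"cell {level['cell']} is outside {level['grid_size']}")
        # The selected cell becomes the canvas for the next level.
        cell_w, cell_h = width / cols, height / rows
        left += col * cell_w
        top += row * cell_h
        width, height = cell_w, cell_h
    return (left, top, left + width, top + height)


levels = [
    {"grid_size": [12, 7], "cell": [6, 2]},
    {"grid_size": [8, 6], "cell": [3, 4]},
]
# On a 960x672 image the first level selects an 80x96 cell,
# and the second level selects a 10x16 cell inside it.
print(resolve_levels((960, 672), levels))
```

An empty levels list returns the whole image, which matches the idea that the path only ever narrows the current region.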

Point specs include:

normalized_point
anchor_offset
ruler_point
ruler_offset
hybrid_point
cell_ruler_point
subgrid_point
color_snap_point

See docs/protocol.md and docs/point-spec.md for details. See docs/preview-point.md for point preview and self-check.

color_snap_point lets an agent resolve a coarse base point, then snap to a nearby pixel with a chosen color by nearest search or directional ray scan. This helps when a model can pick the right UI part but cannot place the final point precisely enough by ruler ticks alone. When color choices are available, the agent can provide target_color_id plus color_choices, and CanpGrid resolves the ID to the actual color before snapping.
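The nearest-search variant of this idea can be sketched in plain Python. Everything below is an illustration under assumptions: the function names, the {"id": ..., "rgb": [...]} shape of a color choice, the square search window, and the per-channel tolerance are not taken from CanpGrid's protocol; see docs/point-spec.md for the real fields.

```python
def resolve_target_color(target_color_id, color_choices):
    """Look up the RGB value for a color ID (assumed shape: id + rgb)."""
    for choice in color_choices:
        if choice["id"] == target_color_id:
            return tuple(choice["rgb"])
    raise KeyError(target_color_id)


def snap_to_color(pixels, base, target_rgb, max_radius=5, tolerance=0):
    """Return the pixel nearest to `base` that matches `target_rgb`, or None.

    `pixels` is a list of rows of (r, g, b) tuples; `base` is (x, y).
    The search covers a square window of half-width `max_radius`.
    """
    height, width = len(pixels), len(pixels[0])
    bx, by = base
    best, best_dist = None, None
    for y in range(max(0, by - max_radius), min(height, by + max_radius + 1)):
        for x in range(max(0, bx - max_radius), min(width, bx + max_radius + 1)):
            # A pixel matches when every channel is within the tolerance.
            if any(abs(c - t) > tolerance for c, t in zip(pixels[y][x], target_rgb)):
                continue
            dist = (x - bx) ** 2 + (y - by) ** 2
            if best_dist is None or dist < best_dist:
                best, best_dist = (x, y), dist
    return best


# A 5x5 white crop with a single blue pixel at x=3, y=1.
white, blue = (255, 255, 255), (0, 0, 255)
pixels = [[white] * 5 for _ in range(5)]
pixels[1][3] = blue
choices = [{"id": "c1", "rgb": [255, 255, 255]}, {"id": "c2", "rgb": [0, 0, 255]}]

target = resolve_target_color("c2", choices)
print(snap_to_color(pixels, (1, 1), target))
```

Returning None when nothing matches lets the caller fall back to the coarse base point instead of snapping to an unrelated pixel.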

Calibration Potential

CanpGrid Core does not call models. Future CanpGrid Calibration work can compare model localization accuracy across no overlay, grid, ruler, and hybrid modes, then produce a Model Visual Profile for each model. It can also compute overlay_gain against a no-guide baseline.
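The text does not define overlay_gain precisely. One plausible reading, shown here as a sketch, is the difference between a mode's hit rate and the no-overlay hit rate. The function names, the "none" baseline key, and the sample values are made up for illustration and are not benchmark results.

```python
def hit_rate(hits):
    """Fraction of trials where the predicted point landed on the target."""
    return sum(hits) / len(hits)


def overlay_gain(results_by_mode, baseline="none"):
    """Hit-rate difference of each overlay mode against the baseline mode.

    Assumed definition: gain = hit_rate(mode) - hit_rate(baseline).
    """
    base = hit_rate(results_by_mode[baseline])
    return {
        mode: hit_rate(hits) - base
        for mode, hits in results_by_mode.items()
        if mode != baseline
    }


# Illustrative per-trial outcomes (True = hit); not measured data.
results = {
    "none": [True, False, False, False],
    "grid": [True, True, False, False],
    "ruler": [True, True, True, False],
    "hybrid": [True, True, True, True],
}
print(overlay_gain(results))
```

A per-model table of these gains would be one possible ingredient of a Model Visual Profile.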

Demo

python examples/demo.py

Demo outputs are saved to outputs/demo/.

Version Observation Report

python examples/codex_baseline_report.py

This generates outputs/codex_baseline_report/index.html, an HTML report that shows a Codex-baseline localization trace for a small UI action scenario with checkboxes, text fields, and buttons. It is a progress artifact for release review; it identifies candidate image-space positions only and does not execute real clicks or add UI automation to Core.

WeChat-Like DOM Benchmark

MOONSHOT_API_KEY=... python examples/kimi_wechat_dom_benchmark.py

This generates outputs/wechat_dom_benchmark/index.html, a reproducible WeChat-like UI benchmark with fixture HTML, exact target bboxes, global grid views, selected-cell ruler examples, and point previews. The script always uses real Kimi API calls; if MOONSHOT_API_KEY is missing, it fails instead of generating an offline model-test report. It compares direct model coordinates against the two-step CanpGrid cell-ruler workflow and records token usage.

Task Action Benchmark

python examples/task_action_benchmark.py \
  --chat-image path/to/wechat-chat-list.jpg \
  --contacts-image path/to/wechat-contacts.jpg \
  --providers kimi,mimo \
  --out-dir outputs/task_action_benchmark_wechat

This generates a goal-conditioned benchmark report. Each sample is a UI screenshot plus a user request such as "search for someone", "add a friend", or "send hello to this contact". The model must return exactly one next click focus point for the current screen. Human target rectangles are used only for scoring; CanpGrid still does not execute clicks.

Interaction Dataset Workbench

Open tools/annotation_workbench.html in a browser to create manual clickable region annotations from any app screenshot. Upload a screenshot, draw clickable boxes, set labels and roles, then export a canpgrid.interactions.v1 JSON manifest.

The annotation boxes are only ground-truth tolerance regions. Model benchmarks ask for one click focus point per visible interactive control, not for predicted bounding boxes. If a model returns a bbox anyway, the benchmark evaluates its center point and renders only point markers in prediction maps. In the CanpGrid-assisted pass, the model can return grid_cell plus cell_point; the benchmark resolves that structured point into original-image pixels before scoring.
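The scoring rule described above can be sketched as a point-in-box test with a center-point fallback for predictions that arrive as a bbox. The helper names, the "point" and "bbox" keys, and the (left, top, right, bottom) box order are assumptions for illustration; the benchmark's actual schema is described in docs/interaction-dataset.md.

```python
def to_point(prediction):
    """Reduce a prediction to one click focus point.

    Assumed shape: {"point": [x, y]} or {"bbox": [left, top, right, bottom]}.
    """
    if "point" in prediction:
        x, y = prediction["point"]
        return (x, y)
    left, top, right, bottom = prediction["bbox"]
    # A bbox prediction is scored by its center point only.
    return ((left + right) / 2, (top + bottom) / 2)


def is_hit(prediction, target_bbox):
    """True when the focus point falls inside the annotated tolerance box."""
    x, y = to_point(prediction)
    left, top, right, bottom = target_bbox
    return left <= x <= right and top <= y <= bottom


target = (100, 30, 180, 60)
print(is_hit({"point": [105, 42]}, target))         # point inside the box
print(is_hit({"bbox": [90, 20, 130, 50]}, target))  # center (110, 35) inside
print(is_hit({"point": [50, 42]}, target))          # point left of the box
```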

Run a real model benchmark against those annotations:

MOONSHOT_API_KEY=... python examples/interaction_benchmark.py \
  --image path/to/screenshot.png \
  --annotations path/to/annotations.canpgrid.json

The benchmark asks the model to identify all visible interactive click points and classifies errors as missed interactives, false positives, localization errors, semantic mismatches, duplicates, or correct detections. See docs/interaction-dataset.md.

By default it also runs an object-inventory-first pass: the model first lists unique clickable objects, then CanpGrid renders a multi-block context sheet around each object's rough grid cell and asks for at most one final point per object_id. This makes duplicate clicks and edge-of-cell localization mistakes much easier to diagnose.

Tests

pytest
ruff check .

License

MIT License

Branding

Part of the CANPAI open agent infrastructure.
