Adaptive Recursive Image Grid for Multimodal Agents
CanpGrid is a progressive image observation tool for multimodal agents. It generates guide-lined zoom views, supports recursive region inspection, and resolves structured spatial references back to original image regions or points.
CanpGrid does not perform clicking, UI automation, object detection, OCR, task execution, or UI memory. It is only a visual observation and spatial referencing layer.
- Adaptive image grids
- Guide-lined zoom views
- Recursive region inspection
- Grid, ruler, and hybrid overlays
- Grid intersection coordinate system
- Region path to original bbox
- Flexible point_spec protocol
- Candidate point preview images
- Palette-guided color snap choices
- CLI and Python API
- Agent-friendly JSON outputs
- Calibration-ready design
Original image
-> guide-lined grid view
-> zoom selected cell
-> guide-lined local view
-> resolve region / resolve point
The model observes annotated images and describes regions using structured grid paths. CanpGrid maps those paths back to original-image coordinates.
python -m pip install -e ".[dev]"

CanpGrid requires Python 3.10 or newer.
from canpgrid import (
    extract_color_choices,
    create_cell_ruler_view,
    create_grid_view,
    preview_point,
    resolve_point,
    resolve_region,
    zoom_region,
)

view = create_grid_view(
    "examples/sample.png",
    grid_size=[12, 7],
    overlay_mode="grid",
    out_dir="outputs",
)

levels = [{"grid_size": [12, 7], "cell": [6, 2]}]

zoomed = zoom_region(
    "examples/sample.png",
    levels,
    next_grid_size=[8, 6],
    overlay_mode="hybrid",
    ruler_config={"tick_x": 16, "tick_y": 16},
    out_dir="outputs",
)

region = resolve_region("examples/sample.png", levels)

cell_ruler = create_cell_ruler_view(
    "examples/sample.png",
    levels,
    grid_size=[8, 6],
    cell=[3, 4],
    ruler_config={"tick_x": 10, "tick_y": 10},
    zoom_factor=4,
    out_dir="outputs",
)

point = resolve_point(
    "examples/sample.png",
    levels,
    {
        "type": "cell_ruler_point",
        "grid_size": [8, 6],
        "cell": [3, 4],
        "x": 3,
        "y": 6,
        "ruler_size": [10, 10],
    },
)

color_choices = extract_color_choices(
    "examples/sample.png",
    bbox=point["final_region_bbox_on_original"],
    palette_size=8,
)

preview = preview_point(
    "examples/sample.png",
    levels,
    {
        "type": "cell_ruler_point",
        "grid_size": [8, 6],
        "cell": [3, 4],
        "x": 3,
        "y": 6,
        "ruler_size": [10, 10],
    },
    preview_on="both",
    marker_style="ring_crosshair_inset",
    out_dir="outputs",
)

create_grid_view, zoom_region, and create_cell_ruler_view always return
an annotated_image_path.
The bbox metadata is companion data, not the main output.
create_cell_ruler_view highlights one selected grid cell and draws a finer
ruler inside it, so an agent can say "cell [3, 4], horizontal tick 3,
vertical tick 6" without spending another observation turn on zooming.
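The tick-based reference above can be turned into pixel coordinates with simple proportional arithmetic. The following is a minimal standalone sketch, not CanpGrid's implementation: the function name `cell_ruler_point_to_pixel` is hypothetical, and it assumes ticks divide the cell evenly, with tick 0 on the cell's left or top edge and the last tick on its right or bottom edge.

```python
from fractions import Fraction


def cell_ruler_point_to_pixel(canvas_size, grid_size, cell, tick, ruler_size):
    """Map a ruler tick inside one grid cell to canvas pixel coordinates.

    Hypothetical helper. Assumes integer inputs and evenly spaced ticks.
    """
    # Size of one grid cell on the canvas, kept exact with Fraction.
    cell_w = Fraction(canvas_size[0], grid_size[0])
    cell_h = Fraction(canvas_size[1], grid_size[1])
    # Cell origin plus the fractional offset given by the tick index.
    x = cell_w * cell[0] + cell_w * Fraction(tick[0], ruler_size[0])
    y = cell_h * cell[1] + cell_h * Fraction(tick[1], ruler_size[1])
    return float(x), float(y)


# "cell [3, 4], horizontal tick 3, vertical tick 6" on an 800x600 local canvas
print(cell_ruler_point_to_pixel((800, 600), (8, 6), (3, 4), (3, 6), (10, 10)))
# (330.0, 460.0)
```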
preview_point creates a non-executing focus preview so an agent can inspect a
candidate point before confirming or adjusting it. Use preview_on="both" when
the local view is highly zoomed: the current-view preview checks precision, and
the original-image preview keeps the global UI context visible.
extract_color_choices turns a local crop into a small c1, c2, ... palette
so weaker models can choose a color ID for color_snap_point instead of
inventing a precise hex value.
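To illustrate the idea of a small ID-keyed palette, here is a rough sketch that ranks the colors in a crop by frequency. It is an assumption-laden stand-in: `build_color_choices` is a hypothetical name, the input is a plain list of RGB tuples rather than an image file, and the real extract_color_choices may quantize or cluster colors differently.

```python
from collections import Counter


def build_color_choices(pixels, palette_size=8):
    """Return {"c1": "#rrggbb", ...} for the most frequent colors in pixels.

    Hypothetical helper: pixels is an iterable of (r, g, b) tuples.
    """
    most_common = Counter(pixels).most_common(palette_size)
    return {
        f"c{i}": "#{:02x}{:02x}{:02x}".format(*rgb)
        for i, (rgb, _count) in enumerate(most_common, start=1)
    }


crop = [(255, 0, 0)] * 3 + [(0, 0, 255)] * 2 + [(0, 255, 0)]
print(build_color_choices(crop, palette_size=2))
# {'c1': '#ff0000', 'c2': '#0000ff'}
```

A model can then answer with "c2" instead of a hex string, which narrows its choice to colors that actually occur in the crop.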
Create a grid observation view:
canpgrid grid examples/sample.png --density medium --out outputs/

Use an explicit grid:
canpgrid grid examples/sample.png --grid-size 12x7 --out outputs/

Use a ruler overlay:
canpgrid grid examples/sample.png --overlay-mode ruler --detail-mode fine --ruler-size 16x16 --out outputs/

Zoom a selected region:
canpgrid zoom examples/sample.png --levels '[{"grid_size":[12,7],"cell":[6,2]}]' --out outputs/

Add a fine ruler inside a selected cell without another zoom step:
canpgrid cell-ruler examples/sample.png --levels '[{"grid_size":[12,7],"cell":[6,2]}]' --grid-size 8x6 --cell 3x4 --ruler-size 10x10 --zoom-factor 4 --out outputs/

Resolve a region:
canpgrid resolve-region examples/sample.png --levels '[{"grid_size":[12,7],"cell":[6,2]},{"grid_size":[8,6],"cell":[3,4]}]'

Resolve a point:
canpgrid resolve-point examples/sample.png --levels '[{"grid_size":[12,7],"cell":[6,2]}]' --point-spec '{"type":"normalized_point","value":["1/2","1/2"]}'

Preview a candidate point:
canpgrid preview-point examples/sample.png --levels '[{"grid_size":[12,7],"cell":[6,2]}]' --point-spec '{"type":"normalized_point","value":["1/2","1/2"]}' --preview-on both --marker-style ring_crosshair_inset

All CLI commands emit JSON.
CanpGrid uses grid intersections. A grid_size of [12, 7] means 12 columns
and 7 rows. Intersections run from [0, 0] to [12, 7]; cells run from
[0, 0] to [11, 6] and are addressed by their top-left intersection.
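The addressing rule above implies a direct mapping from a cell to a bounding box. A minimal sketch, assuming cells are given as [column, row] and that the canvas is divided evenly; `cell_bbox` is a hypothetical name, and any pixel rounding CanpGrid applies is not modeled here.

```python
from fractions import Fraction


def cell_bbox(canvas_bbox, grid_size, cell):
    """Return (left, top, right, bottom) of one cell inside canvas_bbox.

    Hypothetical helper. The cell is named by its top-left intersection.
    """
    left, top, right, bottom = canvas_bbox
    cols, rows = grid_size
    col, row = cell
    if not (0 <= col < cols and 0 <= row < rows):
        raise ValueError(f"cell {cell} is outside grid {grid_size}")
    width = Fraction(right - left)
    height = Fraction(bottom - top)
    return (
        left + width * col / cols,
        top + height * row / rows,
        left + width * (col + 1) / cols,
        top + height * (row + 1) / rows,
    )


# Cell [6, 2] of a 12x7 grid on a 1200x700 image
print([float(v) for v in cell_bbox((0, 0, 1200, 700), (12, 7), (6, 2))])
# [600.0, 200.0, 700.0, 300.0]
```

The last valid cell, [11, 6], ends exactly at intersection [12, 7], which is the bottom-right corner of the canvas.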
Recursive zoom paths are represented as levels:
{
  "levels": [
    {"grid_size": [12, 7], "cell": [6, 2]},
    {"grid_size": [8, 6], "cell": [3, 4]}
  ]
}

Each level is relative to the local canvas produced by the previous level.
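Because each level is relative to the previous one, resolving a path amounts to shrinking a box step by step. The sketch below shows that arithmetic on its own; `resolve_levels` is a hypothetical helper, not the library's resolve_region, and it returns exact fractions without applying any pixel rounding.

```python
from fractions import Fraction


def resolve_levels(image_size, levels):
    """Walk a list of {"grid_size", "cell"} levels down to an original-image bbox."""
    left, top = Fraction(0), Fraction(0)
    width, height = Fraction(image_size[0]), Fraction(image_size[1])
    for level in levels:
        cols, rows = level["grid_size"]
        col, row = level["cell"]
        if not (0 <= col < cols and 0 <= row < rows):
            raise ValueError(f"cell {level['cell']} is outside grid {level['grid_size']}")
        # The selected cell becomes the canvas for the next level.
        width /= cols
        height /= rows
        left += width * col
        top += height * row
    return (left, top, left + width, top + height)


levels = [
    {"grid_size": [12, 7], "cell": [6, 2]},
    {"grid_size": [8, 6], "cell": [3, 4]},
]
print([round(float(v), 2) for v in resolve_levels((1200, 700), levels)])
# [637.5, 266.67, 650.0, 283.33]
```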
Point specs include:
- normalized_point
- anchor_offset
- ruler_point
- ruler_offset
- hybrid_point
- cell_ruler_point
- subgrid_point
- color_snap_point
See docs/protocol.md and docs/point-spec.md for details. See docs/preview-point.md for point preview and self-check.
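As an illustration of the simplest spec type, here is a sketch of how a normalized_point with fraction strings such as "1/2" could be mapped into a region. The interpretation is an assumption (values are taken as fractions of the final region's width and height), and `resolve_normalized_point` is a hypothetical helper; docs/point-spec.md remains the authoritative description.

```python
from fractions import Fraction


def resolve_normalized_point(region_bbox, point_spec):
    """Map a normalized_point spec onto a (left, top, right, bottom) region."""
    if point_spec.get("type") != "normalized_point":
        raise ValueError("this sketch only handles normalized_point")
    left, top, right, bottom = region_bbox
    # Fraction accepts strings like "1/2" or "0.25" as well as numbers.
    fx, fy = (Fraction(v) for v in point_spec["value"])
    if not (0 <= fx <= 1 and 0 <= fy <= 1):
        raise ValueError("normalized values must lie in [0, 1]")
    x = left + (right - left) * fx
    y = top + (bottom - top) * fy
    return float(x), float(y)


spec = {"type": "normalized_point", "value": ["1/2", "1/2"]}
print(resolve_normalized_point((600, 200, 700, 300), spec))
# (650.0, 250.0)
```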
color_snap_point lets an agent resolve a coarse base point, then snap to a
nearby pixel with a chosen color by nearest search or directional ray scan. This
helps when a model can pick the right UI part but cannot place the final point
precisely enough by ruler ticks alone.
When color choices are available, the agent can provide target_color_id plus
color_choices, and CanpGrid resolves the ID to the actual color before
snapping.
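The nearest-search variant of this snapping step can be sketched in a few lines. This is a simplified illustration under stated assumptions: the image is a plain list of rows of RGB tuples, color_choices is assumed to map IDs such as "c1" to hex strings, the names `hex_to_rgb` and `snap_to_color` are hypothetical, and the directional ray scan is not shown.

```python
def hex_to_rgb(hex_color):
    """Convert "#rrggbb" to an (r, g, b) tuple."""
    return tuple(int(hex_color[i:i + 2], 16) for i in (1, 3, 5))


def snap_to_color(pixels, base, target_rgb, max_radius=5, tolerance=0):
    """Return the (x, y) nearest to base whose color matches, or None."""
    height, width = len(pixels), len(pixels[0])
    bx, by = base
    best, best_dist = None, None
    # Search a square window around the base point, clipped to the image.
    for y in range(max(0, by - max_radius), min(height, by + max_radius + 1)):
        for x in range(max(0, bx - max_radius), min(width, bx + max_radius + 1)):
            if all(abs(a - b) <= tolerance for a, b in zip(pixels[y][x], target_rgb)):
                dist = (x - bx) ** 2 + (y - by) ** 2
                if best_dist is None or dist < best_dist:
                    best, best_dist = (x, y), dist
    return best


white, red = (255, 255, 255), (255, 0, 0)
image = [[white] * 5 for _ in range(5)]
image[1][4] = red  # x=4, y=1
image[3][1] = red  # x=1, y=3
color_choices = {"c1": "#ffffff", "c2": "#ff0000"}
print(snap_to_color(image, (2, 2), hex_to_rgb(color_choices["c2"])))
# (1, 3)
```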
CanpGrid Core does not call models. Future CanpGrid Calibration work can compare
model localization accuracy across no overlay, grid, ruler, and hybrid modes,
then produce a Model Visual Profile for each model. It can also compute
overlay_gain against a no-guide baseline.
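Since the calibration work is described as future, the exact definition of overlay_gain is not fixed here. One plausible reading, shown purely as an assumption, is the difference in localization accuracy between an overlay mode and the no-guide baseline; the mode names and the `overlay_gain` function below are hypothetical.

```python
def overlay_gain(results, baseline="none"):
    """Return accuracy(mode) - accuracy(baseline) for every non-baseline mode.

    results maps a mode name to a list of booleans (True = localized correctly).
    """
    accuracy = {mode: sum(hits) / len(hits) for mode, hits in results.items()}
    return {
        mode: acc - accuracy[baseline]
        for mode, acc in accuracy.items()
        if mode != baseline
    }


trials = {
    "none": [True, False, False, False],
    "grid": [True, True, False, False],
    "hybrid": [True, True, True, False],
}
print(overlay_gain(trials))
# {'grid': 0.25, 'hybrid': 0.5}
```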
python examples/demo.py

Demo outputs are saved to outputs/demo/.

python examples/codex_baseline_report.py

This generates outputs/codex_baseline_report/index.html, an HTML report that
shows a Codex-baseline localization trace for a small UI action scenario with
checkboxes, text fields, and buttons. It is a progress artifact for release
review; it identifies candidate image-space positions only and does not execute
real clicks or add UI automation to Core.
MOONSHOT_API_KEY=... python examples/kimi_wechat_dom_benchmark.py

This generates outputs/wechat_dom_benchmark/index.html, a reproducible
WeChat-like UI benchmark with fixture HTML, exact target bboxes, global grid
views, selected-cell ruler examples, and point previews. The script always uses
real Kimi API calls; if MOONSHOT_API_KEY is missing, it fails instead of
generating an offline model-test report. It compares direct model coordinates
against the two-step CanpGrid cell-ruler workflow and records token usage.
python examples/task_action_benchmark.py \
--chat-image path/to/wechat-chat-list.jpg \
--contacts-image path/to/wechat-contacts.jpg \
--providers kimi,mimo \
--out-dir outputs/task_action_benchmark_wechat

This generates a goal-conditioned benchmark report. Each sample is a UI screenshot plus a user request such as "search for someone", "add a friend", or "send hello to this contact". The model must return exactly one next click focus point for the current screen. Human target rectangles are used only for scoring; CanpGrid still does not execute clicks.
Open tools/annotation_workbench.html in a browser to create manual clickable
region annotations from any app screenshot. Upload a screenshot, draw clickable
boxes, set labels and roles, then export a canpgrid.interactions.v1 JSON
manifest.
The annotation boxes are only ground-truth tolerance regions. Model benchmarks
ask for one click focus point per visible interactive control, not for predicted
bounding boxes. If a model returns a bbox anyway, the benchmark evaluates its
center point and renders only point markers in prediction maps.
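The scoring rule described above reduces to a point-in-rectangle test, with a bbox prediction collapsed to its center first. A minimal sketch, assuming predictions carry either a "point" or a "bbox" key; those field names and the helpers `focus_point` and `is_hit` are hypothetical and not taken from the benchmark's schema.

```python
def focus_point(prediction):
    """Return the click focus point, falling back to the bbox center."""
    if "point" in prediction:
        x, y = prediction["point"]
        return float(x), float(y)
    left, top, right, bottom = prediction["bbox"]
    return (left + right) / 2, (top + bottom) / 2


def is_hit(point, target_bbox):
    """True when the point lies inside the ground-truth tolerance region."""
    x, y = point
    left, top, right, bottom = target_bbox
    return left <= x <= right and top <= y <= bottom


target = (15, 25, 25, 35)
print(is_hit(focus_point({"bbox": [10, 20, 30, 40]}), target))
# True
print(is_hit(focus_point({"point": [5, 5]}), target))
# False
```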
In the CanpGrid-assisted pass, the model can return grid_cell plus
cell_point; the benchmark resolves that structured point into original-image
pixels before scoring.
Run a real model benchmark against those annotations:
MOONSHOT_API_KEY=... python examples/interaction_benchmark.py \
--image path/to/screenshot.png \
--annotations path/to/annotations.canpgrid.json

The benchmark asks the model to identify all visible interactive click points and classifies errors as missed interactives, false positives, localization errors, semantic mismatches, duplicates, or correct detections. See docs/interaction-dataset.md.
By default it also runs an object-inventory-first pass: the model first lists
unique clickable objects, then CanpGrid renders a multi-block context sheet
around each object's rough grid cell and asks for at most one final point per
object_id. This makes duplicate clicks and edge-of-cell localization mistakes
much easier to diagnose.
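A reduced version of the error classification can be sketched with plain point-in-box matching. This sketch covers only four of the categories named above (correct, duplicate, false positive, missed); localization errors and semantic mismatches need distance thresholds and labels that are not modeled here, and `classify_predictions` is a hypothetical helper rather than the benchmark's code.

```python
from collections import Counter


def classify_predictions(predictions, annotations):
    """Count outcomes for predicted (x, y) points against annotation bboxes."""
    counts = Counter()
    matched = set()
    for x, y in predictions:
        # Index of the first annotation box containing the point, if any.
        hit = next(
            (i for i, (l, t, r, b) in enumerate(annotations) if l <= x <= r and t <= y <= b),
            None,
        )
        if hit is None:
            counts["false_positive"] += 1
        elif hit in matched:
            counts["duplicate"] += 1
        else:
            matched.add(hit)
            counts["correct"] += 1
    counts["missed"] = len(annotations) - len(matched)
    return dict(counts)


boxes = [(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 50, 50)]
points = [(5, 5), (6, 6), (100, 100), (25, 25)]
print(classify_predictions(points, boxes))
# {'correct': 2, 'duplicate': 1, 'false_positive': 1, 'missed': 1}
```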
pytest
ruff check .

MIT License
Part of the CANPAI open agent infrastructure.