CanpGrid

Adaptive Recursive Image Grid for Multimodal Agents

CanpGrid is a progressive image observation tool for multimodal agents. It generates guide-lined zoom views, supports recursive region inspection, and resolves structured spatial references back to original image regions or points.

CanpGrid does not perform clicking, UI automation, object detection, OCR, task execution, or UI memory. It is only a visual observation and spatial referencing layer.

Features

  • Adaptive image grids
  • Guide-lined zoom views
  • Recursive region inspection
  • Grid, ruler, and hybrid overlays
  • Grid intersection coordinate system
  • Region path to original bbox
  • Flexible point_spec protocol
  • Candidate point preview images
  • Palette-guided color snap choices
  • CLI and Python API
  • Agent-friendly JSON outputs
  • Calibration-ready design

Concept

Original image
-> guide-lined grid view
-> zoom selected cell
-> guide-lined local view
-> resolve region / resolve point

The model observes annotated images and describes regions using structured grid paths. CanpGrid maps those paths back to original-image coordinates.

Installation

python -m pip install -e ".[dev]"

CanpGrid requires Python 3.10 or newer.

Python API

from canpgrid import (
    extract_color_choices,
    create_cell_ruler_view,
    create_grid_view,
    preview_point,
    resolve_point,
    resolve_region,
    zoom_region,
)

view = create_grid_view(
    "examples/sample.png",
    grid_size=[12, 7],
    overlay_mode="grid",
    out_dir="outputs",
)

levels = [{"grid_size": [12, 7], "cell": [6, 2]}]

zoomed = zoom_region(
    "examples/sample.png",
    levels,
    next_grid_size=[8, 6],
    overlay_mode="hybrid",
    ruler_config={"tick_x": 16, "tick_y": 16},
    out_dir="outputs",
)

region = resolve_region("examples/sample.png", levels)

cell_ruler = create_cell_ruler_view(
    "examples/sample.png",
    levels,
    grid_size=[8, 6],
    cell=[3, 4],
    ruler_config={"tick_x": 10, "tick_y": 10},
    zoom_factor=4,
    out_dir="outputs",
)

# Describe the candidate point once: cell [3, 4] of the 8x6 grid,
# horizontal tick 3 and vertical tick 6 on a 10x10 ruler inside that cell.
point_spec = {
    "type": "cell_ruler_point",
    "grid_size": [8, 6],
    "cell": [3, 4],
    "x": 3,
    "y": 6,
    "ruler_size": [10, 10],
}

# Map the structured point back to original-image coordinates.
point = resolve_point("examples/sample.png", levels, point_spec)

# Build a small palette from the region the point falls in.
color_choices = extract_color_choices(
    "examples/sample.png",
    bbox=point["final_region_bbox_on_original"],
    palette_size=8,
)

# Render a non-executing preview of the same point on both views.
preview = preview_point(
    "examples/sample.png",
    levels,
    point_spec,
    preview_on="both",
    marker_style="ring_crosshair_inset",
    out_dir="outputs",
)

create_grid_view, zoom_region, and create_cell_ruler_view always return an annotated_image_path. The bbox metadata is companion data, not the main output.

create_cell_ruler_view highlights one selected grid cell and draws a finer ruler inside it, so an agent can say "cell [3, 4], horizontal tick 3, vertical tick 6" without spending another observation turn on zooming.

preview_point creates a non-executing focus preview so an agent can inspect a candidate point before confirming or adjusting it. Use preview_on="both" when the local view is highly zoomed: the current-view preview checks precision, and the original-image preview keeps the global UI context visible.

extract_color_choices turns a local crop into a small c1, c2, ... palette so weaker models can choose a color ID for color_snap_point instead of inventing a precise hex value.

CLI

Create a grid observation view:

canpgrid grid examples/sample.png --density medium --out outputs/

Use an explicit grid:

canpgrid grid examples/sample.png --grid-size 12x7 --out outputs/

Use a ruler overlay:

canpgrid grid examples/sample.png --overlay-mode ruler --detail-mode fine --ruler-size 16x16 --out outputs/

Zoom a selected region:

canpgrid zoom examples/sample.png --levels '[{"grid_size":[12,7],"cell":[6,2]}]' --out outputs/

Add a fine ruler inside a selected cell without another zoom step:

canpgrid cell-ruler examples/sample.png --levels '[{"grid_size":[12,7],"cell":[6,2]}]' --grid-size 8x6 --cell 3x4 --ruler-size 10x10 --zoom-factor 4 --out outputs/

Resolve a region:

canpgrid resolve-region examples/sample.png --levels '[{"grid_size":[12,7],"cell":[6,2]},{"grid_size":[8,6],"cell":[3,4]}]'

Resolve a point:

canpgrid resolve-point examples/sample.png --levels '[{"grid_size":[12,7],"cell":[6,2]}]' --point-spec '{"type":"normalized_point","value":["1/2","1/2"]}'

Preview a candidate point:

canpgrid preview-point examples/sample.png --levels '[{"grid_size":[12,7],"cell":[6,2]}]' --point-spec '{"type":"normalized_point","value":["1/2","1/2"]}' --preview-on both --marker-style ring_crosshair_inset

All CLI commands emit JSON.
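Because the commands emit JSON, an agent harness can drive the CLI from Python and parse the result. The following is a minimal sketch, not part of CanpGrid: the helper names build_zoom_argv and run_canpgrid are hypothetical, and it assumes the JSON arrives as a single document on stdout. The subcommand and flags are the ones shown in the zoom example above.

```python
import json
import subprocess
from typing import Any, Dict, List


def build_zoom_argv(image: str, levels: List[Dict[str, Any]], out_dir: str) -> List[str]:
    """Build the argv for `canpgrid zoom` (hypothetical helper)."""
    # Compact separators reproduce the JSON style used in the CLI examples.
    levels_json = json.dumps(levels, separators=(",", ":"))
    return ["canpgrid", "zoom", image, "--levels", levels_json, "--out", out_dir]


def run_canpgrid(argv: List[str]) -> Any:
    """Run a canpgrid command and parse its output.

    Assumption: the command prints one JSON document to stdout.
    """
    completed = subprocess.run(argv, capture_output=True, text=True, check=True)
    return json.loads(completed.stdout)


argv = build_zoom_argv(
    "examples/sample.png",
    [{"grid_size": [12, 7], "cell": [6, 2]}],
    "outputs/",
)
```

Passing the levels as a list element of argv, instead of a shell string, avoids the quoting issues that the single-quoted JSON in the shell examples works around.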

Protocol Summary

CanpGrid uses grid intersections. A grid_size of [12, 7] means 12 columns and 7 rows. Intersections run from [0, 0] to [12, 7]; cells run from [0, 0] to [11, 6] and are addressed by their top-left intersection.
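The indexing rule above can be written out as a short sketch. This is an illustration in plain Python, not CanpGrid's own code; the function names are hypothetical.

```python
from typing import List, Tuple

Index = Tuple[int, int]


def intersections(grid_size: List[int]) -> List[Index]:
    """All grid intersections, row by row, from [0, 0] to [cols, rows]."""
    cols, rows = grid_size
    return [(x, y) for y in range(rows + 1) for x in range(cols + 1)]


def cells(grid_size: List[int]) -> List[Index]:
    """All cells, each named by its top-left intersection."""
    cols, rows = grid_size
    return [(x, y) for y in range(rows) for x in range(cols)]


def cell_corners(cell: Index) -> Tuple[Index, Index]:
    """Top-left and bottom-right intersections that bound a cell."""
    x, y = cell
    return (x, y), (x + 1, y + 1)


grid = [12, 7]
all_points = intersections(grid)  # 13 columns x 8 rows of intersections
all_cells = cells(grid)           # 12 columns x 7 rows of cells
```

A grid with 12 columns has 13 vertical guide lines, which is why intersections reach index 12 while cells stop at index 11.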

Recursive zoom paths are represented as levels:

{
  "levels": [
    {"grid_size": [12, 7], "cell": [6, 2]},
    {"grid_size": [8, 6], "cell": [3, 4]}
  ]
}

Each level is relative to the local canvas produced by the previous level.
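To make the recursion concrete, here is a minimal sketch of how a levels path could be folded into a bbox on the original image. It assumes evenly divided cells and a (left, top, right, bottom) bbox in pixels; both are assumptions for illustration, and the real resolve_region may round, pad, or format its output differently.

```python
from typing import Any, Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # assumed (left, top, right, bottom)


def cell_bbox(parent: BBox, grid_size: List[int], cell: List[int]) -> BBox:
    """Bbox of one cell when `parent` is split evenly into cols x rows."""
    left, top, right, bottom = parent
    cols, rows = grid_size
    cx, cy = cell
    if not (0 <= cx < cols and 0 <= cy < rows):
        raise ValueError(f"cell {cell} is outside grid {grid_size}")
    cell_w = (right - left) / cols
    cell_h = (bottom - top) / rows
    return (
        left + cx * cell_w,
        top + cy * cell_h,
        left + (cx + 1) * cell_w,
        top + (cy + 1) * cell_h,
    )


def resolve_levels(image_size: Tuple[int, int], levels: List[Dict[str, Any]]) -> BBox:
    """Apply each level to the region produced by the previous one."""
    width, height = image_size
    bbox: BBox = (0.0, 0.0, float(width), float(height))
    for level in levels:
        bbox = cell_bbox(bbox, level["grid_size"], level["cell"])
    return bbox


levels = [
    {"grid_size": [12, 7], "cell": [6, 2]},
    {"grid_size": [8, 6], "cell": [3, 4]},
]
# Hypothetical 960x420 image, chosen so every cell edge is a whole pixel.
region = resolve_levels((960, 420), levels)
```

On the 960x420 example, the first level selects an 80x60 pixel cell and the second level selects a 10x10 pixel cell inside it.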

Point specs include:

  • normalized_point
  • anchor_offset
  • ruler_point
  • ruler_offset
  • hybrid_point
  • cell_ruler_point
  • subgrid_point
  • color_snap_point

See docs/protocol.md and docs/point-spec.md for details. See docs/preview-point.md for point preview and self-check.
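As an illustration of the simplest spec, the sketch below resolves a normalized_point such as {"type": "normalized_point", "value": ["1/2", "1/2"]} against a region. It assumes the two values are fractions of the final region's width and height and that the region is a (left, top, right, bottom) bbox; docs/point-spec.md is the authority on the actual semantics, and the function names here are hypothetical.

```python
from fractions import Fraction
from typing import Sequence, Tuple, Union

BBox = Tuple[float, float, float, float]  # assumed (left, top, right, bottom)
Ratio = Union[str, int, float]


def parse_ratio(value: Ratio) -> Fraction:
    """Parse "1/2", "0.25", 0.5, or 1 into an exact fraction in [0, 1]."""
    ratio = Fraction(value)
    if not 0 <= ratio <= 1:
        raise ValueError(f"ratio {value!r} is outside [0, 1]")
    return ratio


def resolve_normalized_point(bbox: BBox, value: Sequence[Ratio]) -> Tuple[float, float]:
    """Place a point at the given fractions of the region's width and height."""
    left, top, right, bottom = bbox
    u, v = (parse_ratio(part) for part in value)
    return (left + float(u) * (right - left), top + float(v) * (bottom - top))


region = (480.0, 120.0, 560.0, 180.0)
center = resolve_normalized_point(region, ["1/2", "1/2"])
```

String fractions let a model write "1/3" exactly instead of a rounded decimal.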

color_snap_point lets an agent resolve a coarse base point, then snap to a nearby pixel with a chosen color by nearest search or directional ray scan. This helps when a model can pick the right UI part but cannot place the final point precisely enough by ruler ticks alone. When color choices are available, the agent can provide target_color_id plus color_choices, and CanpGrid resolves the ID to the actual color before snapping.
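The nearest-search variant can be sketched in plain Python. This is not CanpGrid's implementation: the color_choices shape (a list of {"id", "hex"} entries), the square search window, and the max_radius and tolerance parameters are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

RGB = Tuple[int, int, int]
Point = Tuple[int, int]


def hex_to_rgb(value: str) -> RGB:
    value = value.lstrip("#")
    return (int(value[0:2], 16), int(value[2:4], 16), int(value[4:6], 16))


def resolve_color_id(color_id: str, color_choices: List[Dict[str, str]]) -> RGB:
    """Look up a palette ID such as "c2" (assumed entry shape: id + hex)."""
    for choice in color_choices:
        if choice["id"] == color_id:
            return hex_to_rgb(choice["hex"])
    raise KeyError(color_id)


def snap_nearest(
    pixels: List[List[RGB]],
    base: Point,
    target: RGB,
    max_radius: int = 5,
    tolerance: int = 30,
) -> Optional[Point]:
    """Closest pixel to `base` inside a square window whose color is
    within `tolerance` of `target` on every channel, or None."""
    bx, by = base
    height, width = len(pixels), len(pixels[0])
    best: Optional[Point] = None
    best_dist: Optional[int] = None
    for y in range(max(0, by - max_radius), min(height, by + max_radius + 1)):
        for x in range(max(0, bx - max_radius), min(width, bx + max_radius + 1)):
            if any(abs(c - t) > tolerance for c, t in zip(pixels[y][x], target)):
                continue
            dist = (x - bx) ** 2 + (y - by) ** 2
            if best_dist is None or dist < best_dist:
                best, best_dist = (x, y), dist
    return best


# Toy 5x5 white image with one blue pixel at x=3, y=1.
pixels = [[(255, 255, 255)] * 5 for _ in range(5)]
pixels[1][3] = (0, 0, 255)
choices = [{"id": "c1", "hex": "#ffffff"}, {"id": "c2", "hex": "#0000ff"}]
snapped = snap_nearest(pixels, (1, 1), resolve_color_id("c2", choices))
```

Returning None when nothing matches lets the caller fall back to the unsnapped base point.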

Calibration Potential

CanpGrid Core does not call models. Future CanpGrid Calibration work can compare model localization accuracy across no overlay, grid, ruler, and hybrid modes, then produce a Model Visual Profile for each model. It can also compute overlay_gain against a no-guide baseline.
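The text does not define overlay_gain. One plausible reading, sketched below purely as an assumption, is the difference between a mode's hit rate and the hit rate with no overlay. The "none" key and the toy hit lists are placeholders, not measurements.

```python
from typing import Dict, List


def hit_rate(hits: List[bool]) -> float:
    """Fraction of localization attempts that landed inside the target."""
    return sum(hits) / len(hits)


def overlay_gain(results: Dict[str, List[bool]], baseline: str = "none") -> Dict[str, float]:
    """Hit-rate difference of each overlay mode against the baseline mode.

    Assumed definition: gain = hit_rate(mode) - hit_rate(baseline).
    """
    base = hit_rate(results[baseline])
    return {
        mode: hit_rate(hits) - base
        for mode, hits in results.items()
        if mode != baseline
    }


# Placeholder outcomes for illustration only; not benchmark data.
toy_results = {
    "none": [True, False, False, True],
    "grid": [True, True, False, True],
    "hybrid": [True, True, True, True],
}
gains = overlay_gain(toy_results)
```

A positive gain would mean the overlay helped that model; a negative gain would mean the guides hurt it.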

Demo

python examples/demo.py

Demo outputs are saved to outputs/demo/.

Version Observation Report

python examples/codex_baseline_report.py

This generates outputs/codex_baseline_report/index.html, an HTML report that shows a Codex-baseline localization trace for a small UI action scenario with checkboxes, text fields, and buttons. It is a progress artifact for release review; it identifies candidate image-space positions only and does not execute real clicks or add UI automation to Core.

WeChat-Like DOM Benchmark

MOONSHOT_API_KEY=... python examples/kimi_wechat_dom_benchmark.py

This generates outputs/wechat_dom_benchmark/index.html, a reproducible WeChat-like UI benchmark with fixture HTML, exact target bboxes, global grid views, selected-cell ruler examples, and point previews. The script always uses real Kimi API calls; if MOONSHOT_API_KEY is missing, it fails instead of generating an offline model-test report. It compares direct model coordinates against the two-step CanpGrid cell-ruler workflow and records token usage.

Task Action Benchmark

python examples/task_action_benchmark.py \
  --chat-image path/to/wechat-chat-list.jpg \
  --contacts-image path/to/wechat-contacts.jpg \
  --providers kimi,mimo \
  --out-dir outputs/task_action_benchmark_wechat

This generates a goal-conditioned benchmark report. Each sample is a UI screenshot plus a user request such as "search for someone", "add a friend", or "send hello to this contact". The model must return exactly one next click focus point for the current screen. Human target rectangles are used only for scoring; CanpGrid still does not execute clicks.

Interaction Dataset Workbench

Open tools/annotation_workbench.html in a browser to create manual clickable region annotations from any app screenshot. Upload a screenshot, draw clickable boxes, set labels and roles, then export a canpgrid.interactions.v1 JSON manifest.

The annotation boxes are only ground-truth tolerance regions. Model benchmarks ask for one click focus point per visible interactive control, not for predicted bounding boxes. If a model returns a bbox anyway, the benchmark evaluates its center point and renders only point markers in prediction maps. In the CanpGrid-assisted pass, the model can return grid_cell plus cell_point; the benchmark resolves that structured point into original-image pixels before scoring.
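The scoring rule described above, a point counts if it lands inside the annotated tolerance box, and a returned bbox is reduced to its center, can be sketched as follows. The prediction shape ("point" or "bbox" keys) and the (left, top, right, bottom) box layout are assumptions for this example, not the benchmark's actual schema.

```python
from typing import Any, Dict, Sequence, Tuple

Point = Tuple[float, float]


def bbox_center(bbox: Sequence[float]) -> Point:
    """Center of an assumed (left, top, right, bottom) box."""
    left, top, right, bottom = bbox
    return ((left + right) / 2, (top + bottom) / 2)


def prediction_point(prediction: Dict[str, Any]) -> Point:
    """Use the predicted point if present, else fall back to the bbox center."""
    if "point" in prediction:
        x, y = prediction["point"]
        return (float(x), float(y))
    return bbox_center(prediction["bbox"])


def is_hit(point: Point, target: Sequence[float]) -> bool:
    """True when the point lies inside the ground-truth tolerance box."""
    left, top, right, bottom = target
    x, y = point
    return left <= x <= right and top <= y <= bottom


target_box = (15, 25, 25, 35)
from_bbox = prediction_point({"bbox": [10, 20, 30, 40]})
from_point = prediction_point({"point": [0, 0]})
```

Reducing every prediction to a single point keeps scoring identical for models that answer with a point and models that answer with a box.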

Run a real model benchmark against those annotations:

MOONSHOT_API_KEY=... python examples/interaction_benchmark.py \
  --image path/to/screenshot.png \
  --annotations path/to/annotations.canpgrid.json

The benchmark asks the model to identify all visible interactive click points and classifies errors as missed interactives, false positives, localization errors, semantic mismatches, duplicates, or correct detections. See docs/interaction-dataset.md.

By default it also runs an object-inventory-first pass: the model first lists unique clickable objects, then CanpGrid renders a multi-block context sheet around each object's rough grid cell and asks for at most one final point per object_id. This makes duplicate clicks and edge-of-cell localization mistakes much easier to diagnose.

Tests

pytest
ruff check .

License

MIT License

Branding

Part of the CANPAI open agent infrastructure.
