Browser automation tool (headless, persistent profiles, sidecar, opt-in) #16

Description

@mattmezza

Summary

Give the agent the ability to read JS-heavy pages and perform actions on the user's behalf via a
headless browser, while keeping the core container lean and the capability quarantined.

Approach

Follow the existing "CLI/tool handles protocol complexity" pattern:

  • tools/browser.py wrapping headless Playwright with a small verb set: goto, read,
    screenshot, click, fill, submit.
  • Persistent authenticated profiles (a user-data-dir per persona/site under data/): log in
    once, then reuse the real cookies/session. This is the highest-leverage reliability measure.
  • Run Chromium in a sidecar compose service, not the main image, to keep the core small.
  • Register in the optional-tools registry (core/tools.py), disabled by default (same shape as
    the gh integration); advertised to the model only when enabled.
  • A browser.md skill documents the verbs and conventions.
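
A minimal sketch of what tools/browser.py could look like, assuming Playwright's sync API. The
profile path (data/browser-profiles/<persona>/<host>), the function names, and the choice to
implement submit as pressing Enter in a field are all illustrative assumptions, not existing code.
Playwright is imported lazily so the module still loads where the browser is not installed; in the
proposed design this code would run inside the sidecar.

```python
from pathlib import Path
from urllib.parse import urlparse

PROFILE_ROOT = Path("data") / "browser-profiles"  # assumed location under data/
VERBS = ("goto", "read", "screenshot", "click", "fill", "submit")


def profile_dir(persona: str, url: str) -> Path:
    """One user-data-dir per persona/site, so a login is reused across runs."""
    host = urlparse(url).hostname or "unknown"
    return PROFILE_ROOT / persona / host


def run(persona: str, url: str, steps: list[tuple]) -> list:
    """Run (verb, *args) steps against url in a headless persistent profile."""
    # Validate first, so a bad verb fails before any browser is started.
    for verb, *_ in steps:
        if verb not in VERBS:
            raise ValueError(f"unknown verb: {verb}")

    from playwright.sync_api import sync_playwright  # lazy: optional dependency

    results = []
    with sync_playwright() as p:
        ctx = p.chromium.launch_persistent_context(
            str(profile_dir(persona, url)), headless=True
        )
        try:
            page = ctx.new_page()
            page.goto(url)
            for verb, *args in steps:
                if verb == "goto":
                    page.goto(args[0])
                elif verb == "read":
                    results.append(page.inner_text("body"))
                elif verb == "screenshot":
                    page.screenshot(path=args[0])
                    results.append(args[0])
                elif verb == "click":
                    page.click(args[0])
                elif verb == "fill":
                    page.fill(args[0], args[1])
                elif verb == "submit":
                    # Assumption: submitting means pressing Enter in the given field.
                    page.press(args[0], "Enter")
        finally:
            ctx.close()
    return results
```

For example, run("work", "https://example.com", [("read",), ("screenshot", "page.png")]) would
return the page text followed by the screenshot path.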

Distinctions worth encoding

  • Browser-as-renderer (load + read/screenshot): low detection, broadly reliable.
  • Browser-as-actor (log in + click + submit): higher detection, MFA, ToS exposure; gate behind
    the permission engine.
  • Prefer an existing API/CLI over the browser whenever one exists; browser automation is a last
    resort.
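
The renderer/actor split could be encoded as a small decision function in front of the verbs. This
is a sketch only: the rule format, the example domains, and the decide name are hypothetical, and
the real check would live in the existing permission engine.

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

READ_VERBS = {"goto", "read", "screenshot"}  # browser-as-renderer
WRITE_VERBS = {"click", "fill", "submit"}    # browser-as-actor

# Hypothetical per-domain rules; the first matching pattern wins.
RULES = [
    ("*.example-bank.com", "deny"),
    ("*", "allow"),
]


def decide(verb: str, url: str, rules=RULES) -> str:
    """Return 'allow', 'ask' or 'deny' for one verb on one URL."""
    host = urlparse(url).hostname or ""
    # Domains with no matching rule are denied.
    policy = next((p for pattern, p in rules if fnmatch(host, pattern)), "deny")
    if policy == "deny":
        return "deny"
    if verb in READ_VERBS:
        return "allow"  # loading and reading needs no approval
    if verb in WRITE_VERBS:
        return "ask"    # state-changing verbs always go through approval
    return "deny"       # unknown verbs are refused
```

With these rules, decide("read", "https://example.org/a") returns "allow" and
decide("submit", "https://example.org/a") returns "ask".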

Known limitations (set expectations)

Sites behind major bot-management / anti-automation services, or sites that present interactive
challenges, may block headless automation. Persistent authenticated sessions mitigate the common
cases but not the hardest tier. Residential proxies / challenge-solvers are explicitly out of scope.

UX & product

  • The login/auth flow is the key UX problem — design it explicitly: (a) import an existing
    logged-in session, (b) a guided "log in on a trusted device, then we reuse the session" flow, or
    (c) a hosted interactive browser view in the admin UI for the one-time login. Capture the chosen
    flow as a sub-task; do not assume the naive case.
  • Admin UI: enable/disable toggle (off by default), a per-domain permission-rule editor, saved
    profiles with auth status, and a "test" action — responsive/touch-friendly at phone width,
    reusing consistent toggle + list + approval components.
  • On the go (Telegram): the agent sends screenshots so the user follows along on their
    phone; state-changing actions use the existing inline approve/deny flow with consistent button
    conventions.
  • Mobile-first: watching and approving a browser action from Telegram (with a screenshot) is a
    first-class path; full logs live in the web UI.
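
For illustration, the screenshot-plus-buttons message could be built as below. The Telegram Bot
API field names (chat_id, caption, reply_markup, inline_keyboard, callback_data) are real; the
approval_request helper and the browser:approve:<id> callback format are assumptions, and the
existing approve/deny flow should be reused rather than duplicated.

```python
import json


def approval_request(chat_id: int, action_id: str, summary: str) -> dict:
    """Build the sendPhoto form fields (minus the photo file) for an approval prompt."""
    keyboard = {
        "inline_keyboard": [[
            {"text": "Approve", "callback_data": f"browser:approve:{action_id}"},
            {"text": "Deny", "callback_data": f"browser:deny:{action_id}"},
        ]]
    }
    # Telegram limits callback_data to 64 bytes.
    for row in keyboard["inline_keyboard"]:
        for button in row:
            if len(button["callback_data"].encode()) > 64:
                raise ValueError("callback_data exceeds Telegram's 64-byte limit")
    return {
        "chat_id": chat_id,
        "caption": summary[:1024],  # photo captions are capped at 1024 characters
        "reply_markup": json.dumps(keyboard),  # sent as a JSON-serialized string
    }
```

The screenshot itself would be attached as the photo field of the same sendPhoto request.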

Setup & onboarding

  • Disabled by default; surfaced as an optional wizard step that, when enabled, stands up the
    sidecar and prompts for the per-domain rules.
  • A clear "what works / what may be blocked" note in the UI sets expectations up front.

Acceptance criteria

  • A page can be loaded and read/screenshotted headlessly.
  • A simple authenticated action works against a site using a persisted profile.
  • A first-time site login can be completed through a documented, mobile-followable flow; the user
    can watch and approve browser actions from Telegram via screenshots + buttons.
  • The admin browser settings are usable at phone width.
  • The capability is invisible when disabled; writes require approval when enabled.
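
The "invisible when disabled" criterion could be checked against a registry shaped roughly like
this. It is a sketch under assumptions: the actual structure of core/tools.py is not shown in this
issue, so OptionalTool, register, advertised_tools, and call are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class OptionalTool:
    name: str
    description: str
    handler: Callable[..., object]
    enabled: bool = False  # opt-in: off until explicitly switched on


REGISTRY: dict[str, OptionalTool] = {}


def register(tool: OptionalTool) -> None:
    REGISTRY[tool.name] = tool


def advertised_tools() -> list[dict]:
    """Only enabled tools are described to the model."""
    return [
        {"name": t.name, "description": t.description}
        for t in REGISTRY.values()
        if t.enabled
    ]


def call(name: str, *args, **kwargs):
    """Refuse calls to unknown or disabled tools, even if the model names them."""
    tool = REGISTRY.get(name)
    if tool is None or not tool.enabled:
        raise LookupError(f"tool not available: {name}")
    return tool.handler(*args, **kwargs)
```

A test for the criterion would register the browser tool, assert that advertised_tools() omits it
and that call("browser") raises, then enable it and assert the opposite.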

Related

  • Shares the sandbox sidecar with: pi.dev coding harness.
  • Screenshot reading on non-vision models depends on: vision fallback.
  • Profiles/sessions are a natural fit for: secrets vault.

Labels

in-progress (someone is working on this), new (new addition), todo (planned / not yet started)
