Summary
Give the agent the ability to read JS-heavy pages and perform actions on the user's behalf via a
headless browser, while keeping the core container lean and the capability quarantined.
Approach
Follow the existing "CLI/tool handles protocol complexity" pattern:
- A tools/browser.py wrapping headless Playwright with a small verb set: goto, read,
screenshot, click, fill, submit.
- Persistent authenticated profiles (a user-data-dir per persona/site under data/): log in
once, then reuse the real cookies/session. Highest-leverage reliability lever.
- Run Chromium in a sidecar compose service, not the main image, to keep the core small.
- Register in the optional-tools registry (core/tools.py), disabled by default (same shape as
the gh integration); advertised to the model only when enabled.
- A browser.md skill documents the verbs and conventions.
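As a rough illustration of the wrapper described in the list above, the verb set could be dispatched over a single persistent Playwright context. This is only a sketch under stated assumptions: the data/browser-profiles layout, the profile_dir and run names, and the reading of "submit" as pressing Enter in a field are all hypothetical, and Chromium is launched locally here for brevity rather than in the sidecar service.

```python
from pathlib import Path

# Verb set named in the proposal.
VERBS = ("goto", "read", "screenshot", "click", "fill", "submit")

# Hypothetical location under data/; the real layout is not specified.
DATA_DIR = Path("data") / "browser-profiles"


def _slug(value: str) -> str:
    """Make a string safe to use as a single path component."""
    cleaned = "".join(c if c.isalnum() or c in "-._" else "_" for c in value)
    return cleaned.strip(".") or "_"


def profile_dir(persona: str, site: str) -> Path:
    """Return the user-data-dir for one persona/site pair."""
    return DATA_DIR / _slug(persona) / _slug(site)


def run(persona: str, site: str, steps: list) -> list:
    """Execute (verb, *args) steps in one persistent context and collect results."""
    # Validate first, so a bad verb fails before any browser is started.
    for verb, *_ in steps:
        if verb not in VERBS:
            raise ValueError(f"unknown verb: {verb}")

    # Imported lazily so the module loads even where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    results = []
    with sync_playwright() as p:
        # The persistent context stores cookies/session in the profile directory.
        ctx = p.chromium.launch_persistent_context(
            str(profile_dir(persona, site)), headless=True
        )
        page = ctx.new_page()
        for verb, *args in steps:
            if verb == "goto":
                page.goto(args[0])
                results.append(page.url)
            elif verb == "read":
                results.append(page.inner_text("body"))
            elif verb == "screenshot":
                page.screenshot(path=args[0])
                results.append(args[0])
            elif verb == "click":
                page.click(args[0])
                results.append(None)
            elif verb == "fill":
                page.fill(args[0], args[1])
                results.append(None)
            elif verb == "submit":
                # Assumption: "submit" means pressing Enter in the given field.
                page.press(args[0], "Enter")
                results.append(None)
        ctx.close()
    return results
```

A call such as run("alice", "example.com", [("goto", "https://example.com"), ("read",)]) would reuse whatever session is already stored in that profile directory.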
Distinctions worth encoding
- Browser-as-renderer (load + read/screenshot): low detection, broadly reliable.
- Browser-as-actor (log in + click + submit): higher detection, MFA, ToS exposure; gate behind
the permission engine.
- Prefer an existing API/CLI over the browser whenever one exists; browser automation is a last
resort.
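One way to encode the renderer/actor distinction above is to classify each verb and let per-domain rules only tighten the default. The sketch below is hypothetical: the Rule type, the decide function, the glob-style domain patterns, and the "strictest decision wins" policy are assumptions for illustration, not the existing permission engine's API.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from urllib.parse import urlsplit

RENDER_VERBS = {"goto", "read", "screenshot"}  # browser-as-renderer
ACTOR_VERBS = {"click", "fill", "submit"}      # browser-as-actor

# Higher number means stricter.
STRICTNESS = {"allow": 0, "ask": 1, "deny": 2}


@dataclass(frozen=True)
class Rule:
    domain: str    # glob pattern, e.g. "*.example.com" (assumed syntax)
    decision: str  # "allow", "ask", or "deny"


def decide(verb: str, url: str, rules: list) -> str:
    """Return "allow", "ask", or "deny" for one verb against one URL."""
    if verb in RENDER_VERBS:
        decision = "allow"
    elif verb in ACTOR_VERBS:
        decision = "ask"  # state-changing verbs always need approval
    else:
        return "deny"     # unknown verbs are rejected outright

    host = urlsplit(url).hostname or ""
    for rule in rules:
        # A matching rule can tighten the default but never loosen it.
        if fnmatch(host, rule.domain) and STRICTNESS[rule.decision] > STRICTNESS[decision]:
            decision = rule.decision
    return decision
```

Under this policy an "allow" rule for a domain still leaves click, fill, and submit at "ask", which keeps writes behind approval regardless of how the per-domain rules are edited.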
Known limitations (set expectations)
Sites behind major bot-management / anti-automation services, or interactive challenges, may block
headless automation. Persistent authenticated sessions mitigate the common cases but not the hardest
tier. Residential proxies / challenge-solvers are explicitly out of scope.
UX & product
- The login/auth flow is the key UX problem — design it explicitly: (a) import an existing
logged-in session, (b) a guided "log in on a trusted device, then we reuse the session" flow, or
(c) a hosted interactive browser view in the admin UI for the one-time login. Capture the chosen
flow as a sub-task; do not assume the naive case.
- Admin UI: enable/disable toggle (off by default), a per-domain permission-rule editor, saved
profiles with auth status, and a "test" action — responsive/touch-friendly at phone width,
reusing consistent toggle + list + approval components.
- On the go (Telegram): the agent sends screenshots so the user follows along on their
phone; state-changing actions use the existing inline approve/deny flow with consistent button
conventions.
- Mobile-first: watching and approving a browser action from Telegram (with a screenshot) is a
first-class path; full logs live in the web UI.
Setup & onboarding
- Disabled by default; surfaced as an optional wizard step that, when enabled, stands up the
sidecar and prompts for the per-domain rules.
- A clear "what works / what may be blocked" note in the UI sets expectations up front.
Acceptance criteria
- A page can be loaded and read/screenshotted headlessly.
- A simple authenticated action works against a site using a persisted profile.
- A first-time site login can be completed through a documented, mobile-followable flow; the user
can watch and approve browser actions from Telegram via screenshots + buttons.
- The admin browser settings are usable at phone width.
- The capability is invisible when disabled; writes require approval when enabled.
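The "invisible when disabled" criterion can be checked against a registry that only advertises enabled tools. The following is a minimal sketch, assuming a registry shaped roughly like this; OptionalTool, ToolRegistry, and their methods are hypothetical names and do not describe the actual contents of core/tools.py.

```python
from dataclasses import dataclass


@dataclass
class OptionalTool:
    name: str
    description: str
    enabled: bool = False  # optional tools start disabled


class ToolRegistry:
    def __init__(self) -> None:
        self._tools = {}

    def register(self, tool: OptionalTool) -> None:
        self._tools[tool.name] = tool

    def set_enabled(self, name: str, enabled: bool) -> None:
        self._tools[name].enabled = enabled

    def advertised(self) -> list:
        """Names of the tools the model is told about: enabled ones only."""
        return [tool.name for tool in self._tools.values() if tool.enabled]
```

Registering the browser tool leaves advertised() empty until set_enabled("browser", True) is called, so a disabled browser never appears in the tool list shown to the model.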
Related
- Shares the sandbox sidecar with: pi.dev coding harness.
- Screenshot reading on non-vision models depends on: vision fallback.
- Profiles/sessions are a natural fit for: secrets vault.