Skip to content

Action vocabulary additions (send_keys, screenshot, upload_file, dropdown_options) #436

@lmorchard

Description

@lmorchard

Current state

Pilo's web action vocabulary (packages/core/src/tools/webActionTools.ts) covers the core interactions: click, fill, select, hover, check, uncheck, focus, enter, wait, goto, back, forward, extract, done, abort. Plus webSearch, Tabstack tools, and request_user_data are gated tools.

Common interaction patterns the agent cannot currently perform without workarounds:

  • Keyboard shortcuts: Escape to dismiss modals, Tab/Shift+Tab to navigate focus, Ctrl+A to select all, ArrowDown/ArrowUp for autocomplete navigation, etc. The only existing keyboard tool is enter which presses Enter on a specific ref.
  • On-demand screenshot: screenshots are only captured at snapshot time when vision: true. The agent has no way to request a fresh screenshot mid-step (e.g., "did that click cause a visual change I should look at?").
  • File upload: form file inputs are common. Pilo has no tool to populate them.
  • Dropdown option enumeration: when selecting from a <select>, the agent currently has to guess the value or click-to-open the dropdown, see the options, then click an option. A direct "what options does this select have?" tool is cheaper.

The gap

Each missing tool has real-world cases:

  • send_keys — sites that don't expose a "close modal" button (rely on Escape), forms that require keyboard navigation (combobox + ArrowDown + Enter to select), Ctrl+S to save in web editors. The system prompt currently mentions Escape (prompts.ts:339) and arrow keys but the agent can't use them.
  • screenshot — diagnostic tool for the agent itself ("the click didn't seem to do anything, let me see"), and a debug surface for users watching task replays. Currently the screenshot is bundled with each snapshot (vision mode) and is wasted on snapshots where vision isn't useful.
  • upload_file — any task involving "upload your resume", "attach a screenshot", "import a CSV" requires a file upload tool.
  • dropdown_options — closes a common multi-step pattern (open dropdown, read options, close, then select) into a single zero-cost tool call.

Proposed scope

Four small tool additions. Each can ship independently.

A. send_keys tool

send_keys: tool({
  description:
    "Press a keyboard key or combination on the currently focused element. " +
    "Examples: 'Escape' to dismiss a modal, 'Tab' to move focus, 'Control+a' " +
    "to select all, 'ArrowDown' for combobox navigation.",
  inputSchema: z.object({
    keys: z.string().describe("Key name or combination (e.g., 'Escape', 'Control+a', 'Shift+Tab')"),
    ref: z.string().optional()
      .describe("If provided, focus this element first before pressing keys"),
  }),
  execute: async ({ keys, ref }) => {
    return performActionWithValidation(PageAction.SendKeys, context, ref, keys);
  },
}),

Implementation in playwrightBrowser.ts: page.keyboard.press(keys) (Playwright understands Control+a syntax natively).

B. screenshot tool

screenshot: tool({
  description:
    "Take a screenshot of the current page. The image will be included in the next " +
    "page snapshot. Use sparingly — most decisions can be made from the accessibility " +
    "tree alone.",
  inputSchema: z.object({
    fullPage: z.boolean().default(false)
      .describe("Capture the entire page including content below the fold"),
    ref: z.string().optional()
      .describe("If provided, screenshot only this element"),
  }),
  execute: async ({ fullPage, ref }) => {
    const buf = await context.browser.getScreenshot({ withMarks: true, fullPage, ref });
    // Emit BROWSER_SCREENSHOT_CAPTURED event
    // Attach to next user message? Or include in the action result?
    // Probably: include in the next snapshot, and emit immediately.
  },
}),

The exact threading (where the screenshot data ends up in messages) needs design. Simplest: the action result includes a screenshotAvailable: true flag; the next snapshot user message includes the captured image. Vision mode behavior is unchanged.

C. upload_file tool

upload_file: tool({
  description:
    "Upload a file to a file input element. The file path must be a local filesystem path " +
    "or a URL that the agent has been authorized to fetch.",
  inputSchema: z.object({
    ref: z.string().describe("Element reference of the file input (or its container)"),
    path: z.string().describe("Local file path or pre-authorized URL"),
  }),
  execute: async ({ ref, path }) => {
    return performActionWithValidation(PageAction.UploadFile, context, ref, path);
  },
}),

Implementation via Playwright's locator.setInputFiles(path). If the ref isn't a file input directly, find the nearest descendant file input. Validate the file exists and is non-empty before passing to Playwright.

Security considerations: this tool can read arbitrary local files. Gate behind a WebAgentOptions.allowFileUpload?: { allowedPaths?: string[] } option so deployments can constrain it. Default: disabled.

D. dropdown_options tool

dropdown_options: tool({
  description:
    "List the options available in a dropdown (<select>) or ARIA menu. " +
    "Free and fast — use before select() to verify the option you want is present.",
  inputSchema: z.object({
    ref: z.string().describe("Element reference of the dropdown"),
  }),
  execute: async ({ ref }) => {
    return performActionWithValidation(PageAction.DropdownOptions, context, ref);
  },
}),

Implementation: locate the element, check if it's a <select> (return its <option> text + value) or has role="combobox"/role="menu" (return descendants with role="option" or role="menuitem"). Return up to N options (cap at 200 to prevent runaway).

Implementation notes

  • All four tools follow the existing performActionWithValidation pattern for consistency.
  • Each requires a new PageAction.* enum value and corresponding handler in playwrightBrowser.ts.
  • Each needs a tool example added to buildToolExamples in prompts.ts.
  • upload_file security is non-trivial. Default to disabled; opt-in via config; document the security model.
  • screenshot adds latency and image-content to the conversation. Keep its description discouraging overuse ("Use sparingly").
  • These can ship as four separate PRs or one. The simpler ones (send_keys, dropdown_options) are 2-4 hours each. screenshot and upload_file are 1 day each.

Acceptance criteria

  • All four tools exist in webActionTools.ts (upload_file gated by config), have tool descriptions in prompts, and are exercised in tests.
  • send_keys works for single keys and combinations (Control+a, Shift+Tab).
  • screenshot produces an image attached to the next snapshot user message; emits the right events.
  • upload_file validates the file exists; respects the allow-list when configured; gracefully rejects when not enabled.
  • dropdown_options returns options for both <select> and ARIA menus.
  • The system prompt now references send_keys for autocomplete/escape patterns (where the current text says "use keyboard navigation" without a tool to do it).

Effort estimate

3-4 days for all four, including tests and prompt updates. Each is independent enough to split into separate PRs if useful.

Related issues

Pairs with the zero-LLM exploration tools issue (dropdown_options is conceptually similar to find_elements). The screenshot tool benefits from the scroll-position context work (knowing what's visible helps decide whether to screenshot).

Files likely affected

  • packages/core/src/tools/webActionTools.ts
  • packages/core/src/browser/ariaBrowser.ts (PageAction enum)
  • packages/core/src/browser/playwrightBrowser.ts (handlers)
  • packages/core/src/prompts.ts (tool examples)
  • packages/core/src/config/defaults.ts (upload allow-list)
  • packages/core/test/

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions