Action vocabulary additions (send_keys, screenshot, upload_file, dropdown_options)

## Current state

Pilo's web action vocabulary (`packages/core/src/tools/webActionTools.ts`) covers the core interactions: click, fill, select, hover, check, uncheck, focus, enter, wait, goto, back, forward, extract, done, abort. Plus webSearch, Tabstack tools, and `request_user_data` are gated tools.

Common interaction patterns the agent **cannot** currently perform without workarounds:

- **Keyboard shortcuts**: Escape to dismiss modals, Tab/Shift+Tab to navigate focus, Ctrl+A to select all, ArrowDown/ArrowUp for autocomplete navigation, etc. The only existing keyboard tool is `enter` which presses Enter on a specific ref.
- **On-demand screenshot**: screenshots are only captured at snapshot time when `vision: true`. The agent has no way to request a fresh screenshot mid-step (e.g., "did that click cause a visual change I should look at?").
- **File upload**: form file inputs are common. Pilo has no tool to populate them.
- **Dropdown option enumeration**: when selecting from a `<select>`, the agent currently has to guess the value or click-to-open the dropdown, see the options, then click an option. A direct "what options does this select have?" tool is cheaper.

## The gap

Each missing tool has real-world cases:

- **`send_keys`** — sites that don't expose a "close modal" button (rely on Escape), forms that require keyboard navigation (combobox + ArrowDown + Enter to select), Ctrl+S to save in web editors. The system prompt currently mentions Escape (`prompts.ts:339`) and arrow keys but the agent can't use them.
- **`screenshot`** — diagnostic tool for the agent itself ("the click didn't seem to do anything, let me see"), and a debug surface for users watching task replays. Currently the screenshot is bundled with each snapshot (vision mode) and is wasted on snapshots where vision isn't useful.
- **`upload_file`** — any task involving "upload your resume", "attach a screenshot", "import a CSV" requires a file upload tool.
- **`dropdown_options`** — closes a common multi-step pattern (open dropdown, read options, close, then select) into a single zero-cost tool call.

## Proposed scope

Four small tool additions. Each can ship independently.

### A. `send_keys` tool

```ts
send_keys: tool({
  description:
    "Press a keyboard key or combination on the currently focused element. " +
    "Examples: 'Escape' to dismiss a modal, 'Tab' to move focus, 'Control+a' " +
    "to select all, 'ArrowDown' for combobox navigation.",
  inputSchema: z.object({
    keys: z.string().describe("Key name or combination (e.g., 'Escape', 'Control+a', 'Shift+Tab')"),
    ref: z.string().optional()
      .describe("If provided, focus this element first before pressing keys"),
  }),
  execute: async ({ keys, ref }) => {
    return performActionWithValidation(PageAction.SendKeys, context, ref, keys);
  },
}),
```

Implementation in `playwrightBrowser.ts`: `page.keyboard.press(keys)` (Playwright understands `Control+a` syntax natively).

### B. `screenshot` tool

```ts
screenshot: tool({
  description:
    "Take a screenshot of the current page. The image will be included in the next " +
    "page snapshot. Use sparingly — most decisions can be made from the accessibility " +
    "tree alone.",
  inputSchema: z.object({
    fullPage: z.boolean().default(false)
      .describe("Capture the entire page including content below the fold"),
    ref: z.string().optional()
      .describe("If provided, screenshot only this element"),
  }),
  execute: async ({ fullPage, ref }) => {
    const buf = await context.browser.getScreenshot({ withMarks: true, fullPage, ref });
    // Emit BROWSER_SCREENSHOT_CAPTURED event
    // Attach to next user message? Or include in the action result?
    // Probably: include in the next snapshot, and emit immediately.
  },
}),
```

The exact threading (where the screenshot data ends up in `messages`) needs design. Simplest: the action result includes a `screenshotAvailable: true` flag; the next snapshot user message includes the captured image. Vision mode behavior is unchanged.

### C. `upload_file` tool

```ts
upload_file: tool({
  description:
    "Upload a file to a file input element. The file path must be a local filesystem path " +
    "or a URL that the agent has been authorized to fetch.",
  inputSchema: z.object({
    ref: z.string().describe("Element reference of the file input (or its container)"),
    path: z.string().describe("Local file path or pre-authorized URL"),
  }),
  execute: async ({ ref, path }) => {
    return performActionWithValidation(PageAction.UploadFile, context, ref, path);
  },
}),
```

Implementation via Playwright's `locator.setInputFiles(path)`. If the ref isn't a file input directly, find the nearest descendant file input. Validate the file exists and is non-empty before passing to Playwright.

Security considerations: this tool can read arbitrary local files. Gate behind a `WebAgentOptions.allowFileUpload?: { allowedPaths?: string[] }` option so deployments can constrain it. Default: disabled.

### D. `dropdown_options` tool

```ts
dropdown_options: tool({
  description:
    "List the options available in a dropdown (<select>) or ARIA menu. " +
    "Free and fast — use before select() to verify the option you want is present.",
  inputSchema: z.object({
    ref: z.string().describe("Element reference of the dropdown"),
  }),
  execute: async ({ ref }) => {
    return performActionWithValidation(PageAction.DropdownOptions, context, ref);
  },
}),
```

Implementation: locate the element, check if it's a `<select>` (return its `<option>` text + value) or has `role="combobox"`/`role="menu"` (return descendants with `role="option"` or `role="menuitem"`). Return up to N options (cap at 200 to prevent runaway).

## Implementation notes

- All four tools follow the existing `performActionWithValidation` pattern for consistency.
- Each requires a new `PageAction.*` enum value and corresponding handler in `playwrightBrowser.ts`.
- Each needs a tool example added to `buildToolExamples` in `prompts.ts`.
- `upload_file` security is non-trivial. Default to disabled; opt-in via config; document the security model.
- `screenshot` adds latency and image-content to the conversation. Keep its description discouraging overuse ("Use sparingly").
- These can ship as four separate PRs or one. The simpler ones (`send_keys`, `dropdown_options`) are 2-4 hours each. `screenshot` and `upload_file` are 1 day each.

## Acceptance criteria

- All four tools exist in `webActionTools.ts` (`upload_file` gated by config), have tool descriptions in prompts, and are exercised in tests.
- `send_keys` works for single keys and combinations (`Control+a`, `Shift+Tab`).
- `screenshot` produces an image attached to the next snapshot user message; emits the right events.
- `upload_file` validates the file exists; respects the allow-list when configured; gracefully rejects when not enabled.
- `dropdown_options` returns options for both `<select>` and ARIA menus.
- The system prompt now references send_keys for autocomplete/escape patterns (where the current text says "use keyboard navigation" without a tool to do it).

## Effort estimate

3-4 days for all four, including tests and prompt updates. Each is independent enough to split into separate PRs if useful.

## Related issues

Pairs with the zero-LLM exploration tools issue (`dropdown_options` is conceptually similar to `find_elements`). The `screenshot` tool benefits from the scroll-position context work (knowing what's visible helps decide whether to screenshot).

## Files likely affected

- `packages/core/src/tools/webActionTools.ts`
- `packages/core/src/browser/ariaBrowser.ts` (PageAction enum)
- `packages/core/src/browser/playwrightBrowser.ts` (handlers)
- `packages/core/src/prompts.ts` (tool examples)
- `packages/core/src/config/defaults.ts` (upload allow-list)
- `packages/core/test/`


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Action vocabulary additions (send_keys, screenshot, upload_file, dropdown_options) #436

Current state

The gap

Proposed scope

A. `send_keys` tool

B. `screenshot` tool

C. `upload_file` tool

D. `dropdown_options` tool

Implementation notes

Acceptance criteria

Effort estimate

Related issues

Files likely affected

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Action vocabulary additions (send_keys, screenshot, upload_file, dropdown_options) #436

Description

Current state

The gap

Proposed scope

A. send_keys tool

B. screenshot tool

C. upload_file tool

D. dropdown_options tool

Implementation notes

Acceptance criteria

Effort estimate

Related issues

Files likely affected

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

A. `send_keys` tool

B. `screenshot` tool

C. `upload_file` tool

D. `dropdown_options` tool