feat(core): add page exploration tools, structured extract by lmorchard · Pull Request #446 · mozilla/pilo

lmorchard · 2026-05-13T22:48:58Z

Summary

Adds two zero-LLM page-inspection tools (search_page, find_elements) so the agent doesn't pay an LLM round-trip for inventory questions ("is X on the page?", "how many Y are there?").
Extends extract with an optional outputSchema (JSON Schema) so the agent can request structured JSON output instead of markdown text it then has to re-parse.
Implemented across both browser backends — PlaywrightBrowser (CLI/server, with iframe iteration) and ExtensionBrowser (top-frame only, matching the extension's existing getTreeWithRefs scope).
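The "is X on the page?" question can be illustrated without a browser. The following is a minimal sketch, assuming the page text has already been flattened to a string; the PR's real `search_page` walks DOM text nodes with a TreeWalker and attaches `nearestRef`. The names `TextMatch`, `searchText`, and `contextChars` are hypothetical and not taken from the PR.

```typescript
// Hypothetical result shape: where the match starts, plus the surrounding text.
interface TextMatch {
  index: number;
  context: string;
}

// Case-insensitive substring search that returns each hit with some context.
// Caveat: toLowerCase() can change string length for a few Unicode characters,
// which this sketch ignores.
function searchText(text: string, query: string, contextChars = 20): TextMatch[] {
  const matches: TextMatch[] = [];
  if (query.length === 0) return matches;

  const haystack = text.toLowerCase();
  const needle = query.toLowerCase();
  let from = 0;

  while (true) {
    const index = haystack.indexOf(needle, from);
    if (index === -1) break;
    const start = Math.max(0, index - contextChars);
    const end = Math.min(text.length, index + needle.length + contextChars);
    matches.push({ index, context: text.slice(start, end) });
    from = index + needle.length; // continue after this hit (non-overlapping)
  }
  return matches;
}
```

An empty result answers "is X on the page?" with "no", and `matches.length` answers "how many Y are there?", both without an LLM call.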

Design Decisions

Dedicated AriaBrowser methods rather than routing through PageAction + performAction. Typed input/output, no string-encoded options. Mirrors extract's existing bypass of performAction.
New inspectionTools.ts factory alongside webActionTools/searchTools/tabstackTools/interactiveToolSet. Matches the per-concern file pattern; signals read-only intent.
Refs via the existing data-pilo-ref DOM attribute. The aria-tree code already sets this attribute during tree generation; new tools resolve nearestRef via el.closest('[data-pilo-ref]') and withinRef lookups via document.querySelector('[data-pilo-ref="..."]'). No changes to the aria-tree bundle.
Frame iteration matches each backend's existing aria-tree behavior. Playwright iterates same-origin + accessible cross-origin frames and tags per-frame matches with frameUrl; Extension is top-frame only (so frameUrl is always undefined in extension results). Forcing parity in either direction would require either dropping Playwright's frame coverage or adding allFrames: true machinery to the extension.
Wiring is unconditional. These tools have no API key / callback / provider dependency, so they're always in the agent's tool set — unlike searchTools (gated on search service), tabstackTools (gated on API key), and interactiveToolSet (gated on callback).
Structured extract routes through generateObjectWithRetry with the AI SDK's jsonSchema() helper. Markdown branch (no outputSchema) is byte-identical to its pre-PR behavior. Complementary to tabstack_extract_json — that one is for off-page URL fetches via the Tabstack API; this is for the current page using the configured LLM provider.
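The ref-resolution decision above can be sketched as follows. This is a minimal illustration rather than the PR's code: `RefElement`, `RefRoot`, `nearestRef`, and `findByRef` are hypothetical names, the interfaces are modeled on the DOM's `closest`, `getAttribute`, and `querySelector`, and the token check assumes refs are simple identifiers.

```typescript
// Minimal structural view of the DOM methods the lookup needs.
interface RefElement {
  closest(selector: string): RefElement | null;
  getAttribute(name: string): string | null;
}

interface RefRoot {
  querySelector(selector: string): RefElement | null;
}

const REF_ATTR = "data-pilo-ref";

// Walk up from a match to the closest element (itself included) that carries a ref.
function nearestRef(el: RefElement): string | undefined {
  const holder = el.closest(`[${REF_ATTR}]`);
  return holder?.getAttribute(REF_ATTR) ?? undefined;
}

// Resolve a withinRef value to the root of the subtree to search.
// Assumption: refs are simple tokens, so anything else is rejected instead of escaped.
function findByRef(root: RefRoot, ref: string): RefElement | null {
  if (!/^[A-Za-z0-9_-]+$/.test(ref)) return null;
  return root.querySelector(`[${REF_ATTR}="${ref}"]`);
}
```

Because both helpers only read an attribute that already exists in the DOM, nothing in the aria-tree bundle has to change for them to work.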

Changes

Core (packages/core/src/):

tools/inspectionTools.ts (new) — createInspectionTools factory exposing search_page and find_elements.
browser/ariaBrowser.ts — new types (SearchPageOptions/Match/Result, FindElementsOptions/Match/Result) and two new interface methods (searchPage, findElements).
browser/playwrightBrowser.ts — searchPage and findElements implementations with cross-frame iteration mirroring getTreeWithRefsImpl.
tools/webActionTools.ts — extract extended with optional outputSchema.
utils/retry.ts — new generateObjectWithRetry + shared retryDriver<T> helper extracted from both wrappers. NoObjectGeneratedError is now non-retryable.
webAgent.ts — createInspectionTools instantiated unconditionally; search_page and find_elements added to the pageChanged exempt list.
prompts.ts — three new TOOL_STRINGS entries (searchPage, findElements, extended extract); three buildToolExamples lines; one best-practices bullet in actionLoopSystemPromptTemplate.
core.ts — new type exports.
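The retry change in `utils/retry.ts` can be pictured with a small sketch. It is written under assumptions: the local `NoObjectGeneratedError` class stands in for the AI SDK error of the same name so the example stays self-contained, the default of three attempts is inferred from the "3×" figure in the commit message, and the `RetryHooks` shape shows only a `validateResult` hook.

```typescript
// Stand-in for the AI SDK's NoObjectGeneratedError (assumption: the real code imports it).
class NoObjectGeneratedError extends Error {
  constructor(message: string) {
    super(message);
    Object.setPrototypeOf(this, NoObjectGeneratedError.prototype);
    this.name = "NoObjectGeneratedError";
  }
}

// Schema-validation failures tend to repeat, so retrying them only multiplies cost.
function isRetryableError(err: unknown): boolean {
  if (err instanceof NoObjectGeneratedError) return false;
  return true; // assumption: this sketch treats every other error as retryable
}

interface RetryHooks<T> {
  validateResult(result: T): boolean;
}

// Shared driver: the text and object wrappers would differ only in `call` and `hooks`.
async function retryDriver<T>(
  call: () => Promise<T>,
  hooks: RetryHooks<T>,
  maxAttempts = 3,
): Promise<T> {
  let lastError: unknown = new Error("retryDriver: no attempts were made");
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const result = await call();
      if (hooks.validateResult(result)) return result;
      lastError = new Error(`attempt ${attempt} returned an invalid result`);
    } catch (err) {
      if (!isRetryableError(err)) throw err; // fail fast, no further LLM calls
      lastError = err;
    }
  }
  throw lastError;
}
```

With this shape, a `NoObjectGeneratedError` thrown on the first attempt propagates immediately instead of triggering two more paid calls.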

Extension (packages/extension/src/):

background/ExtensionBrowser.ts — searchPage and findElements implementations via browser.scripting.executeScript, top frame only.

Tests: +1305 across packages (core 719, cli 221, server 88, extension 277). New inspectionTools.test.ts (446 lines) and new describe blocks in playwrightBrowser.test.ts, ExtensionBrowser.test.ts, retry.test.ts, webActionTools.test.ts. Includes NoObjectGeneratedError non-retry coverage.

Test Plan

pnpm run check passes (typecheck + format:check + 1305 tests across all packages)
gitleaks detect — no leaks
Manual smoke: run pnpm pilo run "<task>" against a static page (e.g., a Wikipedia article) and a dynamic page (e.g., a documentation search results page) and confirm:
- search_page returns matches with nearestRef populated
- find_elements with a CSS selector returns elements with auto-resolved href/src
- find_elements with withinRef scopes correctly
- extract without outputSchema still returns markdown
- extract with outputSchema returns a JSON object matching the schema
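For the "auto-resolved href/src" check in the smoke test, it may help to see what resolution to an absolute URL looks like. This is a hedged sketch using the standard `URL` constructor; the function name `resolveAttribute` and the choice to return the raw value when parsing fails are assumptions, not the PR's implementation.

```typescript
// Resolve href/src values against the page URL; pass other attributes through unchanged.
function resolveAttribute(name: string, value: string, baseUrl: string): string {
  if (name !== "href" && name !== "src") return value;
  try {
    return new URL(value, baseUrl).href;
  } catch {
    // Assumption: an unparseable value is returned as-is rather than dropped.
    return value;
  }
}

// Usage: a root-relative link on a Wikipedia article.
const absolute = resolveAttribute("href", "/wiki/CSS", "https://en.wikipedia.org/wiki/HTML");
console.log(absolute); // https://en.wikipedia.org/wiki/CSS
```

When checking `find_elements` output manually, every `href` and `src` value should already look like the logged one: fully qualified, with scheme and host.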

References

Spec: docs/dev-sessions/2026-05-13-1319-page-exploration-tools/spec.md (not committed to git — session artifact)
Plan: docs/dev-sessions/2026-05-13-1319-page-exploration-tools/plan.md (not committed to git — session artifact)
Closes #432 (Add zero-LLM page exploration tools: search_page, find_elements, structured extract)

🤖 Generated with Claude Code

Copilot

Pull request overview

Adds two zero-LLM page-inspection tools (search_page, find_elements) and extends extract with optional outputSchema for structured JSON output, implemented across both Playwright and Extension browser backends.

Changes:

New search_page and find_elements tools (via new inspectionTools.ts factory and new AriaBrowser methods), with cross-frame iteration in Playwright and top-frame-only in the Extension.
extract extended with optional outputSchema routed through a new generateObjectWithRetry helper, with NoObjectGeneratedError treated as non-retryable.
retry.ts refactored to share a retryDriver<T> between text and object wrappers; prompts updated with new tool examples and best-practices guidance.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated no comments.


| File | Description |
| --- | --- |
| `packages/core/src/browser/ariaBrowser.ts` | New types and interface methods for `searchPage`/`findElements`. |
| `packages/core/src/browser/playwrightBrowser.ts` | Playwright implementations with main + sub-frame iteration. |
| `packages/core/src/tools/inspectionTools.ts` | New factory exposing the two zero-LLM tools. |
| `packages/core/src/tools/webActionTools.ts` | `extract` extended with structured-output branch. |
| `packages/core/src/utils/retry.ts` | Shared retry driver + new `generateObjectWithRetry`; `NoObjectGeneratedError` non-retryable. |
| `packages/core/src/webAgent.ts` | Wires inspection tools unconditionally; adds them to `pageChanged` exempt list. |
| `packages/core/src/prompts.ts` | Tool descriptions, examples, best-practices line. |
| `packages/core/src/core.ts` | Exports new types. |
| `packages/extension/src/background/ExtensionBrowser.ts` | Extension implementations via `scripting.executeScript`. |
| `packages/core/test/**`, `packages/extension/test/ExtensionBrowser.test.ts` | New unit tests covering both backends, retry, and structured extract. |


Makes three additions to the agent's tool surface, plus a refactor of the retry layer:

- `search_page`: zero-LLM text search of the current page via a TreeWalker. Returns matches with surrounding context and the nearest `data-pilo-ref` ancestor (`nearestRef`) so the agent can chain directly into `click`/`fill` without paying for an `extract` round-trip.
- `find_elements`: zero-LLM CSS-selector query. Optional `withinRef` scopes the query to an aria-tree subtree. Returns each match's tag, text, requested attributes (`href`/`src` auto-resolved to absolute URLs), and `nearestRef`.
- `extract({outputSchema})`: optional JSON Schema argument routes the existing extract through the AI SDK's `generateObject` (via the new `generateObjectWithRetry`) and returns `data: object` instead of `extractedData: string`. The markdown branch behavior is byte-identical to the prior implementation when `outputSchema` is absent.

Implemented across both browser backends:

- Playwright iterates same-origin + accessible cross-origin frames and tags per-frame matches with `frameUrl`, matching the existing aria-tree behavior.
- Extension is top-frame only (matches `ExtensionBrowser.getTreeWithRefs`), so `frameUrl` is always undefined in extension results.

Wiring is unconditional: these are pure DOM primitives with no API key / callback / provider dependency. They live in a new `inspectionTools.ts` factory, alongside `webActionTools` / `searchTools` / `tabstackTools` / `interactiveToolSet` in `webAgent.ts`. `search_page` and `find_elements` are added to the `pageChanged` exempt list.

Refs are resolved via the existing `data-pilo-ref` DOM attribute that `ariaSnapshot.ts` already sets during tree generation, so no changes are needed to the aria-tree bundle.

Refactor: extracted a shared `retryDriver<T>` from `generateTextWithRetry` and `generateObjectWithRetry`. The two public wrappers become thin call sites via `validateResult` and `getFinishReason` hooks. Net reduction in `retry.ts` line count.

Also: `NoObjectGeneratedError` is now non-retryable in `isRetryableError`, preventing 3× cost amplification on schema-validation failures.

Tests: +1305 across core/cli/server/extension (+24 search_page block, +30 find_elements block, +8 structured extract + new retry block, plus MockBrowser stubs and a new `NoObjectGeneratedError` non-retry case).

Closes #432

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reframes the inspection-tool guidance around "trust the snapshot first; escalate only when needed" rather than aggressively pushing the new tools in all cases. Iter1 of the local prompt-tuning loop over-steered the agent into calling search_page/find_elements to "confirm" what was already visible in the aria-tree snapshot, costing +24% input tokens vs baseline.

Changes:

- Best-practices block: lead with "default: trust the snapshot" and introduce inspection tools as escalations for cases the snapshot doesn't cover (truncated content, values buried in long page text, attributes at scale).
- extract.description: clarify that extract is for cases the snapshot doesn't already answer; explicit "do not pass empty {}" warning.
- search_page.description: scoped to "when the snapshot doesn't show the answer"; added concrete alternate-spelling guidance.
- find_elements.description: scoped to truncated snapshots, large attribute extraction, and subtree enumeration via withinRef.
- outputSchema description: explicit "REQUIRED with a real schema; {} is NOT valid".

Relaxed three description-string test assertions from .toBe / .toContain to .toMatch so iteration on description copy doesn't break tests.

Local 5-task micro-eval (gemini-2.5-flash, vertex, chrome, headless): total input tokens 390,971 (baseline) → 226,301 (iter2), -42%. Biggest single win: search_page_lookup task (218K → 73K, -66%); the agent now answers "CSS1 published 1996" from the snapshot directly. Sticky remainder: the model still passes outputSchema:{} instead of a real schema.

Tool wiring, types, and behavior are unchanged.

…ples

Iter3 of the local prompt-tuning loop. Two targeted nudges on top of the iter2 snapshot-first framing:

- searchPage.description: require at least one zero-match recovery attempt (variant spelling, regex word-boundary, etc.) before answering "no". A single zero-match search is explicitly NOT a final answer.
- extract.outputSchema description: three copy-and-adapt one-line schema examples (single object, list of items, boolean+reason) plus an explicit "STOP and write out the shape before calling extract."

Local 5-task micro-eval (gemini-2.5-flash, vertex, chrome, headless): total input tokens 226,301 (iter2) → 251,515 (iter3), +11%. The regression is entirely from search_page_presence: the agent now correctly tries both spellings before concluding (the answer is correctly "No"; the page truly doesn't mention Beautiful Soup). vs baseline: -36%.

outputSchema effectiveness remains unexercised: the agent skipped extract on the structured-data task because the HN snapshot already contained the answer. A task where the snapshot is genuinely insufficient is needed to evaluate the new outputSchema guidance.
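The token deltas quoted in the iter2 and iter3 commit messages can be re-derived from the raw totals. A small sketch, assuming whole-percent rounding; the helper name `pctChange` is hypothetical.

```typescript
// Percent change from `before` to `after`, rounded to the nearest whole percent.
function pctChange(before: number, after: number): number {
  return Math.round(((after - before) / before) * 100);
}

const baseline = 390_971; // total input tokens, baseline
const iter2 = 226_301; // total input tokens, iter2
const iter3 = 251_515; // total input tokens, iter3

console.log(pctChange(baseline, iter2)); // -42
console.log(pctChange(iter2, iter3)); // 11
console.log(pctChange(baseline, iter3)); // -36
```

The three results match the -42%, +11%, and -36% figures reported for the micro-eval.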

Empty commit to fire `evals/**` workflow after switching the eval pipeline's PILO_PW_CDP_ENDPOINT from bundled-browser (default fallback) to Browserless. Several iter3 failures were navCount=1 / "Execution context destroyed" patterns consistent with bundled-browser flakiness in the Argo pod environment, not prompt regressions. This run isolates the prompt changes from the browser stack.

Adds a runtime guard to the extract tool: when outputSchema is provided but evaluates to {} (no keys), short-circuit before any LLM call and return a recoverable error instructing the agent to either fill in a real schema or omit the argument.

Why: three rounds of prompt iteration could not stop gemini-2.5-flash from passing outputSchema:{} when asked for structured output. Across two CI eval runs (iter3 bundled + iter3 Browserless, 60 tasks total), zero extract calls included a real JSON Schema; 3-4 calls per run passed {}, which gives no validation and is functionally identical to omitting the argument. Prompt-only enforcement has reached a model-capability ceiling. The guard surfaces the issue as a tool error so the agent can self-correct mid-task.

Behavior:

- outputSchema undefined → markdown branch (unchanged)
- outputSchema with real keys → generateObject branch (unchanged)
- outputSchema = {} → recoverable error with instructions; no getMarkdown(), no LLM call, no token spend

Also updates the outputSchema description so the agent knows the rejection is enforced at runtime rather than a soft prompt-level preference.

Tests: +1 covering the empty-schema rejection (no LLM/browser calls, returns success:false / isRecoverable:true with a guiding error). Existing extract tests unchanged (720 / 720 passing).

The previous run (pilo-batch-github-eval-khcjt) had 0/30 passes because the GKE pilo-secrets bundle was reset to stubs/empty values in the ~23-minute gap between two consecutive evals — likely another local make cloud-secrets invocation from a different .env state. This commit retriggers the eval against ad84c4b + 1bc9d4a (runtime guard for empty extract outputSchema) with the correct secret.

The hard rejection from the previous commit (be609b6) caused two task failures on the 100-task CI eval (Google Map #4 and ESPN #0): gemini-2.5-flash passes outputSchema:{}, sees the recoverable error, retries with outputSchema:{} again, and after 5 consecutive errors the agent layer aborts the whole task.

Soften the guard: when outputSchema is non-null but has no keys, silently treat it as if it were omitted (fall through to the markdown branch). An empty {} schema gave no validation anyway: the structured branch with an empty schema is indistinguishable from the markdown branch. The fall-through is logged via an AGENT_STATUS event so the downgrade is visible in traces.

Updated the outputSchema prompt copy: "an empty {} provides no validation and is silently downgraded to a markdown extract" instead of "will be REJECTED with a recoverable error".

Test: updated to assert the markdown branch IS called and the status event IS emitted when outputSchema:{} is passed. Previously asserted the recoverable-error shape; that behavior is gone.
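The softened guard amounts to a three-way decision on `outputSchema`. The following is a minimal sketch of that decision only, with hypothetical names (`JsonSchema`, `ExtractPlan`, `planExtract`); in the PR the downgrade is reported through an AGENT_STATUS event, which is represented here by a plain `downgraded` flag.

```typescript
type JsonSchema = Record<string, unknown>;

type ExtractPlan =
  | { branch: "markdown"; downgraded: boolean }
  | { branch: "structured"; schema: JsonSchema };

// Decide which extract branch to run for a given outputSchema argument.
function planExtract(outputSchema?: JsonSchema | null): ExtractPlan {
  // No schema supplied: the unchanged markdown branch.
  if (outputSchema == null) return { branch: "markdown", downgraded: false };

  // Empty {}: no validation is possible, so fall through to markdown and flag it.
  if (Object.keys(outputSchema).length === 0) {
    return { branch: "markdown", downgraded: true };
  }

  // A schema with real keys: the structured (generateObject) branch.
  return { branch: "structured", schema: outputSchema };
}
```

Treating `{}` as a downgrade rather than an error keeps the agent out of the five-consecutive-errors abort path described above, while the flag preserves visibility in traces.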

lmorchard requested a review from Copilot May 13, 2026 22:49

Copilot started reviewing on behalf of lmorchard May 13, 2026 22:49

Copilot AI reviewed May 13, 2026


lmorchard marked this pull request as draft May 14, 2026 15:23

lmorchard and others added 6 commits May 15, 2026 16:04

lmorchard force-pushed the feat/page-exploration-tools branch from e17b78b to b6a40ab Compare May 15, 2026 23:05

lmorchard wants to merge 7 commits into main from feat/page-exploration-tools