feat(core): add page exploration tools, structured extract #446

Draft
lmorchard wants to merge 7 commits into main from feat/page-exploration-tools

@lmorchard
Collaborator

Summary

  • Adds two zero-LLM page-inspection tools (search_page, find_elements) so the agent doesn't pay an LLM round-trip for inventory questions ("is X on the page?", "how many Y are there?").
  • Extends extract with an optional outputSchema (JSON Schema) so the agent can request structured JSON output instead of markdown text it then has to re-parse.
  • Implemented across both browser backends — PlaywrightBrowser (CLI/server, with iframe iteration) and ExtensionBrowser (top-frame only, matching the extension's existing getTreeWithRefs scope).

Design Decisions

  • Dedicated AriaBrowser methods rather than routing through PageAction + performAction. Typed input/output, no string-encoded options. Mirrors extract's existing bypass of performAction.
  • New inspectionTools.ts factory alongside webActionTools/searchTools/tabstackTools/interactiveToolSet. Matches the per-concern file pattern; signals read-only intent.
  • Refs via the existing data-pilo-ref DOM attribute. The aria-tree code already sets this attribute during tree generation; new tools resolve nearestRef via el.closest('[data-pilo-ref]') and withinRef lookups via document.querySelector('[data-pilo-ref="..."]'). No changes to the aria-tree bundle.
  • Frame iteration matches each backend's existing aria-tree behavior. Playwright iterates same-origin + accessible cross-origin frames and tags per-frame matches with frameUrl; Extension is top-frame only (so frameUrl is always undefined in extension results). Forcing parity in either direction would require either dropping Playwright's frame coverage or adding allFrames: true machinery to the extension.
  • Wiring is unconditional. These tools have no API key / callback / provider dependency, so they're always in the agent's tool set — unlike searchTools (gated on search service), tabstackTools (gated on API key), and interactiveToolSet (gated on callback).
  • Structured extract routes through generateObjectWithRetry with the AI SDK's jsonSchema() helper. Markdown branch (no outputSchema) is byte-identical to its pre-PR behavior. Complementary to tabstack_extract_json — that one is for off-page URL fetches via the Tabstack API; this is for the current page using the configured LLM provider.
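
The ref-resolution decision above can be illustrated with a minimal sketch. It relies only on the `data-pilo-ref` attribute and the `closest` / `querySelector` selectors named in that bullet. `ElementLike`, `nearestRef`, and `refSelector` are hypothetical names, and the structural interface stands in for a DOM `Element` so the snippet also runs outside a browser. It is not the PR's actual code.

```typescript
// Minimal structural stand-in for a DOM Element (hypothetical; a real
// Element satisfies this shape).
interface ElementLike {
  closest(selector: string): ElementLike | null;
  getAttribute(name: string): string | null;
}

const REF_ATTR = "data-pilo-ref";

// Walk up from a matched element to the closest ancestor (or the element
// itself) that carries a ref, and return that ref if there is one.
function nearestRef(el: ElementLike): string | undefined {
  const owner = el.closest(`[${REF_ATTR}]`);
  return owner?.getAttribute(REF_ATTR) ?? undefined;
}

// Build the selector for a withinRef lookup. Backslashes and double quotes
// are escaped so the ref cannot break out of the attribute value.
function refSelector(ref: string): string {
  const escaped = ref.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
  return `[${REF_ATTR}="${escaped}"]`;
}
```

In a browser, `nearestRef(someElement)` would give the ref to hand to a follow-up `click` or `fill`, and `document.querySelector(refSelector(ref))` would locate the root for a `withinRef`-scoped query.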

Changes

Core (packages/core/src/):

  • tools/inspectionTools.ts (new) — createInspectionTools factory exposing search_page and find_elements.
  • browser/ariaBrowser.ts — new types (SearchPageOptions/Match/Result, FindElementsOptions/Match/Result) and two new interface methods (searchPage, findElements).
  • browser/playwrightBrowser.ts — searchPage and findElements implementations with cross-frame iteration mirroring getTreeWithRefsImpl.
  • tools/webActionTools.ts — extract extended with optional outputSchema.
  • utils/retry.ts — new generateObjectWithRetry + shared retryDriver<T> helper extracted from both wrappers. NoObjectGeneratedError is now non-retryable.
  • webAgent.ts — createInspectionTools instantiated unconditionally; search_page and find_elements added to the pageChanged exempt list.
  • prompts.ts — three new TOOL_STRINGS entries (searchPage, findElements, extended extract); three buildToolExamples lines; one best-practices bullet in actionLoopSystemPromptTemplate.
  • core.ts — new type exports.

Extension (packages/extension/src/):

  • background/ExtensionBrowser.ts — searchPage and findElements implementations via browser.scripting.executeScript, top frame only.

Tests: 1305 total across packages (core 719, cli 221, server 88, extension 277). New inspectionTools.test.ts (446 lines) and new describe blocks in playwrightBrowser.test.ts, ExtensionBrowser.test.ts, retry.test.ts, webActionTools.test.ts. Includes NoObjectGeneratedError non-retry coverage.

Test Plan

  • pnpm run check passes (typecheck + format:check + 1305 tests across all packages)
  • gitleaks detect — no leaks
  • Manual smoke: run pnpm pilo run "<task>" against a static page (e.g., a Wikipedia article) and a dynamic page (e.g., a documentation search results page) and confirm:
    • search_page returns matches with nearestRef populated
    • find_elements with a CSS selector returns elements with auto-resolved href/src
    • find_elements with withinRef scopes correctly
    • extract without outputSchema still returns markdown
    • extract with outputSchema returns a JSON object matching the schema
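
For the `find_elements` check above, "auto-resolved href/src" means relative attribute values come back as absolute URLs. A minimal sketch of that resolution with the standard `URL` constructor follows. `resolveUrlAttributes` and its parameters are hypothetical and only illustrate the expected behavior.

```typescript
// Attributes whose values are URLs and should be made absolute.
const URL_ATTRS = new Set(["href", "src"]);

function resolveUrlAttributes(
  attrs: Record<string, string>,
  baseUrl: string,
): Record<string, string> {
  const out: Record<string, string> = {};
  for (const name of Object.keys(attrs)) {
    const value = attrs[name];
    if (!URL_ATTRS.has(name)) {
      out[name] = value; // non-URL attributes pass through unchanged
      continue;
    }
    try {
      out[name] = new URL(value, baseUrl).href;
    } catch {
      out[name] = value; // leave unparseable values untouched
    }
  }
  return out;
}
```

During the manual smoke test, a link such as `/wiki/CSS` found on a Wikipedia article should therefore appear with the full `https://en.wikipedia.org/...` prefix in the tool result.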

References

🤖 Generated with Claude Code

Contributor

Copilot AI left a comment


Pull request overview

Adds two zero-LLM page-inspection tools (search_page, find_elements) and extends extract with optional outputSchema for structured JSON output, implemented across both Playwright and Extension browser backends.

Changes:

  • New search_page and find_elements tools (via new inspectionTools.ts factory and new AriaBrowser methods), with cross-frame iteration in Playwright and top-frame-only in the Extension.
  • extract extended with optional outputSchema routed through a new generateObjectWithRetry helper, with NoObjectGeneratedError treated as non-retryable.
  • retry.ts refactored to share a retryDriver<T> between text and object wrappers; prompts updated with new tool examples and best-practices guidance.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| packages/core/src/browser/ariaBrowser.ts | New types and interface methods for searchPage/findElements. |
| packages/core/src/browser/playwrightBrowser.ts | Playwright implementations with main + sub-frame iteration. |
| packages/core/src/tools/inspectionTools.ts | New factory exposing the two zero-LLM tools. |
| packages/core/src/tools/webActionTools.ts | extract extended with structured-output branch. |
| packages/core/src/utils/retry.ts | Shared retry driver + new generateObjectWithRetry; NoObjectGeneratedError non-retryable. |
| packages/core/src/webAgent.ts | Wires inspection tools unconditionally; adds them to pageChanged exempt list. |
| packages/core/src/prompts.ts | Tool descriptions, examples, best-practices line. |
| packages/core/src/core.ts | Exports new types. |
| packages/extension/src/background/ExtensionBrowser.ts | Extension implementations via scripting.executeScript. |
| packages/core/test/**, packages/extension/test/ExtensionBrowser.test.ts | New unit tests covering both backends, retry, and structured extract. |


lmorchard marked this pull request as draft May 14, 2026 15:23

lmorchard and others added 6 commits May 15, 2026 16:04

Makes three additions to the agent's tool surface, plus a refactor of the
retry layer:

- `search_page`: zero-LLM text search of the current page via a TreeWalker.
  Returns matches with surrounding context and the nearest `data-pilo-ref`
  ancestor (`nearestRef`) so the agent can chain directly into `click`/`fill`
  without paying for an `extract` round-trip.

- `find_elements`: zero-LLM CSS-selector query. Optional `withinRef` scopes
  the query to an aria-tree subtree. Returns each match's tag, text,
  requested attributes (`href`/`src` auto-resolved to absolute URLs), and
  `nearestRef`.

- `extract({outputSchema})`: optional JSON Schema argument routes the existing
  extract through the AI SDK's `generateObject` (via the new
  `generateObjectWithRetry`) and returns `data: object` instead of
  `extractedData: string`. The markdown branch behavior is byte-identical
  to the prior implementation when `outputSchema` is absent.
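
A minimal sketch of the `search_page` matching step described in the list above. The TreeWalker pass needs a DOM, so it is abstracted here as an input array of text chunks. `TextChunk`, `PageMatch`, `searchChunks`, and the `contextChars` default are hypothetical names and values, not the PR's implementation.

```typescript
// One text node as a TreeWalker pass might yield it (hypothetical shape).
interface TextChunk {
  text: string;
  nearestRef?: string;
}

interface PageMatch {
  match: string;
  context: string;
  nearestRef?: string;
}

// Case-insensitive substring search that keeps some surrounding context.
// Assumption: lowercasing preserves string length, which holds for ASCII.
function searchChunks(
  chunks: TextChunk[],
  query: string,
  contextChars = 40,
): PageMatch[] {
  const needle = query.toLowerCase();
  const results: PageMatch[] = [];
  if (needle.length === 0) return results;
  for (const chunk of chunks) {
    const haystack = chunk.text.toLowerCase();
    let from = 0;
    while (true) {
      const at = haystack.indexOf(needle, from);
      if (at === -1) break;
      const end = at + needle.length;
      results.push({
        match: chunk.text.slice(at, end),
        context: chunk.text.slice(Math.max(0, at - contextChars), end + contextChars),
        nearestRef: chunk.nearestRef,
      });
      from = end; // continue after this match, so matches never overlap
    }
  }
  return results;
}
```

Because every match carries the ref of its nearest referenced ancestor, the agent can act on a hit directly instead of asking `extract` where the text lives.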

Implemented across both browser backends:
- Playwright iterates same-origin + accessible cross-origin frames and tags
  per-frame matches with `frameUrl`, matching the existing aria-tree behavior.
- Extension is top-frame only (matches `ExtensionBrowser.getTreeWithRefs`),
  so `frameUrl` is always undefined in extension results.
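
The per-frame tagging described above could look roughly like the following. `FrameRun` and `mergeFrameMatches` are hypothetical, and whether main-frame matches carry a `frameUrl` is an assumption made here so that top-frame results look the same on both backends.

```typescript
interface Match {
  text: string;
  nearestRef?: string;
  frameUrl?: string;
}

// Result of running the in-page query in one frame. `matches` is null when
// the frame could not be accessed and is skipped (hypothetical convention).
interface FrameRun {
  url: string;
  isMainFrame: boolean;
  matches: Match[] | null;
}

function mergeFrameMatches(runs: FrameRun[]): Match[] {
  const merged: Match[] = [];
  for (const run of runs) {
    if (run.matches === null) continue; // inaccessible frame
    for (const m of run.matches) {
      // Assumption: only sub-frame matches are tagged with frameUrl.
      merged.push(run.isMainFrame ? { ...m } : { ...m, frameUrl: run.url });
    }
  }
  return merged;
}
```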

Wiring is unconditional — these are pure DOM primitives with no API key /
callback / provider dependency. They live in a new `inspectionTools.ts`
factory, alongside `webActionTools` / `searchTools` / `tabstackTools` /
`interactiveToolSet` in `webAgent.ts`. `search_page` and `find_elements`
are added to the `pageChanged` exempt list.
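
To make the gating contrast concrete, here is a sketch of a tool-set assembly in which the inspection tools are always present and the other groups depend on configuration. `AgentDeps`, `assembleToolSet`, and the two `*_placeholder` tool names are hypothetical. Only `search_page`, `find_elements`, and `tabstack_extract_json` are names taken from this PR.

```typescript
type Tool = { description: string };
type ToolSet = Record<string, Tool>;

// Hypothetical dependency bag; each optional field gates one tool group.
interface AgentDeps {
  searchServiceConfigured?: boolean;
  tabstackApiKey?: string;
  onUserInput?: (question: string) => string;
}

function assembleToolSet(deps: AgentDeps): ToolSet {
  // Inspection tools need no key, callback, or provider: always wired.
  const tools: ToolSet = {
    search_page: { description: "zero-LLM text search of the current page" },
    find_elements: { description: "zero-LLM CSS-selector query" },
  };
  if (deps.searchServiceConfigured) {
    tools["search_placeholder"] = { description: "gated on a search service" };
  }
  if (deps.tabstackApiKey) {
    tools["tabstack_extract_json"] = { description: "gated on an API key" };
  }
  if (deps.onUserInput) {
    tools["interactive_placeholder"] = { description: "gated on a callback" };
  }
  return tools;
}
```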

Refs are resolved via the existing `data-pilo-ref` DOM attribute that
`ariaSnapshot.ts` already sets during tree generation, so no changes are
needed to the aria-tree bundle.

Refactor: extracted a shared `retryDriver<T>` from `generateTextWithRetry`
and `generateObjectWithRetry`. The two public wrappers become thin call
sites via `validateResult` and `getFinishReason` hooks. Net reduction in
`retry.ts` line count.

Also: `NoObjectGeneratedError` is now non-retryable in `isRetryableError`,
preventing 3× cost amplification on schema-validation failures.
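
A simplified sketch of how a shared driver with `validateResult` and `getFinishReason` hooks, plus a non-retryable error class, might fit together. The hook names come from the commit message. `StandInNoObjectError` is a local stand-in for the AI SDK error so the snippet is self-contained, and the attempt limit of 3 is inferred from the "3×" figure above. The real `retry.ts` may differ in every other detail.

```typescript
// Local stand-in for the AI SDK's NoObjectGeneratedError (hypothetical).
class StandInNoObjectError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "StandInNoObjectError";
  }
}

// Schema-validation failures are treated as final: retrying them would
// mostly multiply cost without changing the outcome.
function isRetryableError(err: unknown): boolean {
  return !(err instanceof Error && err.name === "StandInNoObjectError");
}

interface RetryHooks<T> {
  validateResult: (result: T) => boolean;
  getFinishReason: (result: T) => string | undefined;
}

async function retryDriver<T>(
  call: () => Promise<T>,
  hooks: RetryHooks<T>,
  maxAttempts = 3,
): Promise<T> {
  let lastError: unknown = new Error("retryDriver: no attempts made");
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const result = await call();
      if (hooks.validateResult(result)) return result;
      const reason = hooks.getFinishReason(result) ?? "unknown";
      lastError = new Error(`invalid result (finishReason=${reason})`);
    } catch (err) {
      if (!isRetryableError(err)) throw err; // fail fast, no further attempts
      lastError = err;
    }
  }
  throw lastError;
}
```

With this shape, the text and object wrappers would differ only in the `call` they pass and in the two hooks.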

Tests: 1305 total across core/cli/server/extension (+24 search_page block, +30
find_elements block, +8 structured extract + new retry block, plus
MockBrowser stubs and a new `NoObjectGeneratedError` non-retry case).

Closes #432

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reframes the inspection-tool guidance around "trust the snapshot first;
escalate only when needed" rather than aggressively pushing the new tools
in all cases. Iter1 of the local prompt-tuning loop over-steered the
agent into calling search_page/find_elements to "confirm" what was
already visible in the aria-tree snapshot, costing +24% input tokens vs
baseline.

Changes:

- Best-practices block: lead with "default: trust the snapshot" and
  introduce inspection tools as escalations for cases the snapshot
  doesn't cover (truncated content, values buried in long page text,
  attributes at scale).
- extract.description: clarify that extract is for cases the snapshot
  doesn't already answer; explicit "do not pass empty {}" warning.
- search_page.description: scoped to "when the snapshot doesn't show the
  answer"; added concrete alternate-spelling guidance.
- find_elements.description: scoped to truncated snapshots, large
  attribute extraction, and subtree enumeration via withinRef.
- outputSchema description: explicit "REQUIRED with a real schema; {}
  is NOT valid".

Relaxed three description-string test assertions from .toBe / .toContain
to .toMatch so iteration on description copy doesn't break tests.

Local 5-task micro-eval (gemini-2.5-flash, vertex, chrome, headless):
total input tokens 390,971 (baseline) → 226,301 (iter2), -42%. Biggest
single win: search_page_lookup task (218K → 73K, -66%) — agent now
answers "CSS1 published 1996" from the snapshot directly. Sticky
remainder: model still passes outputSchema:{} instead of a real schema.

Tool wiring, types, and behavior are unchanged.

…ples

Iter3 of the local prompt-tuning loop. Two targeted nudges on top of the
iter2 snapshot-first framing:

- searchPage.description: require at least one zero-match recovery
  attempt (variant spelling, regex word-boundary, etc.) before answering
  "no". A single zero-match search is explicitly NOT a final answer.

- extract.outputSchema description: three copy-and-adapt one-line schema
  examples (single object, list of items, boolean+reason) plus an
  explicit "STOP and write out the shape before calling extract."

Local 5-task micro-eval (gemini-2.5-flash, vertex, chrome, headless):
total input tokens 226,301 (iter2) → 251,515 (iter3), +11%. The
regression is entirely from search_page_presence — agent now correctly
tries both spellings before concluding (the answer is correctly "No";
the page truly doesn't mention Beautiful Soup). vs baseline: -36%.
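
The percentages quoted in these micro-eval notes can be re-derived from the raw token totals. The helper name `percentChange` is only illustrative. The three totals are the ones stated in the commit messages.

```typescript
// Relative change from `before` to `after`, rounded to a whole percent.
function percentChange(before: number, after: number): number {
  return Math.round(((after - before) / before) * 100);
}

const baseline = 390971; // total input tokens, baseline
const iter2 = 226301;    // total input tokens, iter2
const iter3 = 251515;    // total input tokens, iter3

console.log(percentChange(baseline, iter2)); // -42
console.log(percentChange(iter2, iter3));    // 11
console.log(percentChange(baseline, iter3)); // -36
```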

outputSchema effectiveness remains unexercised: the agent skipped
extract on the structured-data task because the HN snapshot already
contained the answer. A task where the snapshot is genuinely
insufficient is needed to evaluate the new outputSchema guidance.

Empty commit to fire `evals/**` workflow after switching the eval
pipeline's PILO_PW_CDP_ENDPOINT from bundled-browser (default fallback)
to Browserless. Several iter3 failures were navCount=1 / "Execution
context destroyed" patterns consistent with bundled-browser flakiness
in the Argo pod environment, not prompt regressions. This run isolates
the prompt changes from the browser stack.

Adds a runtime guard to the extract tool: when outputSchema is provided
but evaluates to {} (no keys), short-circuit before any LLM call and
return a recoverable error instructing the agent to either fill in a
real schema or omit the argument.

Why: three rounds of prompt iteration could not stop gemini-2.5-flash
from passing outputSchema:{} when asked for structured output. Across
two CI eval runs (iter3 bundled + iter3 Browserless, 60 tasks total),
zero extract calls included a real JSON Schema — 3-4 calls per run
passed {} which gives no validation and is functionally identical to
omitting the argument. Prompt-only enforcement has reached a model-
capability ceiling. The guard surfaces the issue as a tool error so
the agent can self-correct mid-task.

Behavior:
- outputSchema undefined → markdown branch (unchanged)
- outputSchema with real keys → generateObject branch (unchanged)
- outputSchema = {} → recoverable error with instructions; no
  getMarkdown(), no LLM call, no token spend
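
The three cases above amount to a small routing function that runs before any browser or LLM work. The sketch below assumes the result fields `success` and `isRecoverable` mentioned in this commit's test note. `ExtractRoute`, `routeExtract`, and the error wording are hypothetical.

```typescript
type ExtractRoute =
  | { kind: "markdown" }
  | { kind: "structured"; schema: Record<string, unknown> }
  | { kind: "error"; success: false; isRecoverable: true; error: string };

function routeExtract(outputSchema?: Record<string, unknown>): ExtractRoute {
  // No schema: existing markdown branch.
  if (outputSchema === undefined) return { kind: "markdown" };
  // Empty schema: reject before getMarkdown() or any LLM call.
  if (Object.keys(outputSchema).length === 0) {
    return {
      kind: "error",
      success: false,
      isRecoverable: true,
      error: "outputSchema is empty ({}): pass a real JSON Schema or omit the argument.",
    };
  }
  // Real schema: structured (generateObject) branch.
  return { kind: "structured", schema: outputSchema };
}
```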

Also updates the outputSchema description so the agent knows the
rejection is enforced at runtime rather than a soft prompt-level
preference.

Tests: +1 covering the empty-schema rejection (no LLM/browser calls,
returns success:false / isRecoverable:true with a guiding error).
Existing extract tests unchanged (720 / 720 passing).

The previous run (pilo-batch-github-eval-khcjt) had 0/30 passes
because the GKE pilo-secrets bundle was reset to stubs/empty values
in the ~23-minute gap between two consecutive evals — likely another
local make cloud-secrets invocation from a different .env state.

This commit retriggers the eval against ad84c4b + 1bc9d4a (runtime
guard for empty extract outputSchema) with the correct secret.

lmorchard force-pushed the feat/page-exploration-tools branch from e17b78b to b6a40ab May 15, 2026 23:05

The hard rejection from the previous commit (be609b6) caused two task
failures on the 100-task CI eval (Google Map #4 and ESPN #0):
gemini-2.5-flash passes outputSchema:{}, sees the recoverable error,
retries with outputSchema:{} again, and after 5 consecutive errors
the agent layer aborts the whole task.

Soften the guard: when outputSchema is non-null but has no keys,
silently treat it as if it were omitted (fall through to the markdown
branch). An empty {} schema gave no validation anyway — the structured
branch with an empty schema is indistinguishable from the markdown
branch. The fall-through is logged via an AGENT_STATUS event so the
downgrade is visible in traces.
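
A sketch of the softened behavior, under the assumption that status events are emitted through a simple callback. `StatusEmitter`, `chooseExtractBranch`, and the message text are hypothetical. Only the `AGENT_STATUS` event name and the fall-through rule come from this commit.

```typescript
type StatusEmitter = (event: string, message: string) => void;

function chooseExtractBranch(
  outputSchema: Record<string, unknown> | null | undefined,
  emit: StatusEmitter,
): "markdown" | "structured" {
  if (outputSchema == null) return "markdown";
  if (Object.keys(outputSchema).length === 0) {
    // Downgrade quietly for the agent, but leave a trace for humans.
    emit("AGENT_STATUS", "extract: empty outputSchema treated as omitted");
    return "markdown";
  }
  return "structured";
}
```

Unlike a hard rejection, this path never returns an error to the agent, so a model that keeps sending `{}` cannot accumulate consecutive tool errors from this guard.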

Updated the outputSchema prompt copy: "an empty {} provides no
validation and is silently downgraded to a markdown extract" instead
of "will be REJECTED with a recoverable error".

Test: updated to assert the markdown branch IS called and the status
event IS emitted when outputSchema:{} is passed. Previously asserted
the recoverable-error shape; that behavior is gone.

Development

Successfully merging this pull request may close these issues.

Add zero-LLM page exploration tools (search_page, find_elements, structured extract)

2 participants