feat(core): add page exploration tools, structured extract#446
Draft
lmorchard wants to merge 7 commits into
Pull request overview
Adds two zero-LLM page-inspection tools (search_page, find_elements) and extends extract with optional outputSchema for structured JSON output, implemented across both Playwright and Extension browser backends.
Changes:
- New `search_page` and `find_elements` tools (via new `inspectionTools.ts` factory and new `AriaBrowser` methods), with cross-frame iteration in Playwright and top-frame-only in the Extension.
- `extract` extended with optional `outputSchema` routed through a new `generateObjectWithRetry` helper, with `NoObjectGeneratedError` treated as non-retryable.
- `retry.ts` refactored to share a `retryDriver<T>` between text and object wrappers; prompts updated with new tool examples and best-practices guidance.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `packages/core/src/browser/ariaBrowser.ts` | New types and interface methods for searchPage/findElements. |
| `packages/core/src/browser/playwrightBrowser.ts` | Playwright implementations with main + sub-frame iteration. |
| `packages/core/src/tools/inspectionTools.ts` | New factory exposing the two zero-LLM tools. |
| `packages/core/src/tools/webActionTools.ts` | extract extended with structured-output branch. |
| `packages/core/src/utils/retry.ts` | Shared retry driver + new generateObjectWithRetry; NoObjectGeneratedError non-retryable. |
| `packages/core/src/webAgent.ts` | Wires inspection tools unconditionally; adds them to pageChanged exempt list. |
| `packages/core/src/prompts.ts` | Tool descriptions, examples, best-practices line. |
| `packages/core/src/core.ts` | Exports new types. |
| `packages/extension/src/background/ExtensionBrowser.ts` | Extension implementations via scripting.executeScript. |
| `packages/core/test/**`, `packages/extension/test/ExtensionBrowser.test.ts` | New unit tests covering both backends, retry, and structured extract. |
Makes three additions to the agent's tool surface, plus a refactor of the
retry layer:
- `search_page`: zero-LLM text search of the current page via a TreeWalker.
Returns matches with surrounding context and the nearest `data-pilo-ref`
ancestor (`nearestRef`) so the agent can chain directly into `click`/`fill`
without paying for an `extract` round-trip.
- `find_elements`: zero-LLM CSS-selector query. Optional `withinRef` scopes
the query to an aria-tree subtree. Returns each match's tag, text,
requested attributes (`href`/`src` auto-resolved to absolute URLs), and
`nearestRef`.
- `extract({outputSchema})`: optional JSON Schema argument routes the existing
extract through the AI SDK's `generateObject` (via the new
`generateObjectWithRetry`) and returns `data: object` instead of
`extractedData: string`. The markdown branch behavior is byte-identical
to the prior implementation when `outputSchema` is absent.
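A minimal sketch of the two zero-LLM ideas above, under stated assumptions: the real `search_page` walks DOM text nodes with a TreeWalker, whereas this stand-in searches a plain string so it runs outside a browser. The `TextMatch` shape, `searchText`, and `resolveUrlAttribute` are hypothetical names for illustration; only `nearestRef` and `frameUrl` come from the description.

```typescript
// Hypothetical match shape; only `nearestRef` and `frameUrl` are names from the PR text.
interface TextMatch {
  text: string; // matched substring as it appears in the page text
  context: string; // the match plus up to `contextChars` characters on each side
  index: number; // offset of the match within the searched text
  nearestRef?: string;
  frameUrl?: string;
}

// Case-insensitive literal search over a plain string (stand-in for per-text-node search).
// Assumes lowercasing does not change string length, which holds for ASCII text.
function searchText(haystack: string, query: string, contextChars = 20): TextMatch[] {
  const matches: TextMatch[] = [];
  if (query.length === 0) return matches;
  const lowerHaystack = haystack.toLowerCase();
  const lowerQuery = query.toLowerCase();
  let from = 0;
  while (true) {
    const index = lowerHaystack.indexOf(lowerQuery, from);
    if (index === -1) break;
    const start = Math.max(0, index - contextChars);
    const end = Math.min(haystack.length, index + query.length + contextChars);
    matches.push({
      text: haystack.slice(index, index + query.length),
      context: haystack.slice(start, end),
      index,
    });
    from = index + query.length;
  }
  return matches;
}

// Resolve a possibly relative href/src against the page URL; return unparseable values unchanged.
function resolveUrlAttribute(value: string, pageUrl: string): string {
  try {
    return new URL(value, pageUrl).href;
  } catch {
    return value;
  }
}
```

In this sketch the agent-facing result would carry `context` so a match can be judged without a follow-up `extract`, and `href`/`src` values come back absolute so they can be used directly.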
Implemented across both browser backends:
- Playwright iterates same-origin + accessible cross-origin frames and tags
per-frame matches with `frameUrl`, matching the existing aria-tree behavior.
- Extension is top-frame only (matches `ExtensionBrowser.getTreeWithRefs`),
so `frameUrl` is always undefined in extension results.
Wiring is unconditional — these are pure DOM primitives with no API key /
callback / provider dependency. They live in a new `inspectionTools.ts`
factory, alongside `webActionTools` / `searchTools` / `tabstackTools` /
`interactiveToolSet` in `webAgent.ts`. `search_page` and `find_elements`
are added to the `pageChanged` exempt list.
Refs are resolved via the existing `data-pilo-ref` DOM attribute that
`ariaSnapshot.ts` already sets during tree generation, so no changes are
needed to the aria-tree bundle.
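The ref lookup can be sketched roughly as follows. `ElementLike` is a hypothetical structural stand-in for the two DOM `Element` methods involved, so the example runs outside a browser; `nearestRef` as a function name and `refSelector` are illustrative, not the PR's actual helpers.

```typescript
// Minimal stand-in for the DOM Element methods used here (assumption, not the real type).
interface ElementLike {
  closest(selector: string): ElementLike | null;
  getAttribute(name: string): string | null;
}

const REF_ATTRIBUTE = "data-pilo-ref";

// Return the ref of the element itself or of its nearest ancestor that carries the attribute.
function nearestRef(el: ElementLike): string | undefined {
  const owner = el.closest(`[${REF_ATTRIBUTE}]`);
  return owner?.getAttribute(REF_ATTRIBUTE) ?? undefined;
}

// Build the attribute selector for looking up a subtree root by ref (hypothetical helper).
function refSelector(ref: string): string {
  // Escape backslashes and double quotes so the ref is safe inside a quoted attribute value.
  const escaped = ref.replace(/\\/g, "\\\\").replace(/"/g, '\\"');
  return `[${REF_ATTRIBUTE}="${escaped}"]`;
}
```

In a browser context, a real `Element` satisfies `ElementLike`, and the selector from `refSelector` would be passed to `document.querySelector` to find the `withinRef` root.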
Refactor: extracted a shared `retryDriver<T>` from `generateTextWithRetry`
and `generateObjectWithRetry`. The two public wrappers become thin call
sites via `validateResult` and `getFinishReason` hooks. Net reduction in
`retry.ts` line count.
Also: `NoObjectGeneratedError` is now non-retryable in `isRetryableError`,
preventing 3× cost amplification on schema-validation failures.
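A simplified sketch of the shared-driver idea, assuming a synchronous attempt function and a default of three attempts (inferred from the "3×" figure above). The real wrappers presumably await LLM calls and may add backoff; the `RetryHooks` shape is hypothetical, and the local `NoObjectGeneratedError` class is a dependency-free stand-in for the AI SDK error of the same name.

```typescript
// Local stand-in for the AI SDK error class, so the sketch has no dependencies.
class NoObjectGeneratedError extends Error {
  constructor(message = "no object generated") {
    super(message);
    this.name = "NoObjectGeneratedError";
  }
}

// Hooks that let text and object wrappers share one loop (hypothetical shape).
interface RetryHooks<T> {
  validateResult: (result: T) => boolean; // false means "try again"
  getFinishReason: (result: T) => string | undefined;
}

function isRetryableError(err: unknown): boolean {
  // A schema-validation failure is unlikely to succeed on retry, so retrying only multiplies cost.
  if (err instanceof Error && err.name === "NoObjectGeneratedError") return false;
  return true; // simplification: every other error is treated as transient here
}

function retryDriver<T>(attempt: () => T, hooks: RetryHooks<T>, maxAttempts = 3): T {
  let lastError: unknown = new Error("retryDriver: no attempts made");
  for (let i = 1; i <= maxAttempts; i++) {
    try {
      const result = attempt();
      if (hooks.validateResult(result)) return result;
      const reason = hooks.getFinishReason(result) ?? "unknown";
      lastError = new Error(`invalid result (finishReason: ${reason})`);
    } catch (err) {
      if (!isRetryableError(err)) throw err; // fail fast, no further attempts
      lastError = err;
    }
  }
  throw lastError;
}
```

With this split, each public wrapper only supplies its own `attempt` and hooks, which is one way the two call sites could stay thin.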
Tests: +1305 across core/cli/server/extension (+24 search_page block, +30
find_elements block, +8 structured extract + new retry block, plus
MockBrowser stubs and a new `NoObjectGeneratedError` non-retry case).
Closes #432
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reframes the inspection-tool guidance around "trust the snapshot first;
escalate only when needed" rather than aggressively pushing the new tools
in all cases. Iter1 of the local prompt-tuning loop over-steered the
agent into calling search_page/find_elements to "confirm" what was
already visible in the aria-tree snapshot, costing +24% input tokens vs
baseline.
Changes:
- Best-practices block: lead with "default: trust the snapshot" and
introduce inspection tools as escalations for cases the snapshot
doesn't cover (truncated content, values buried in long page text,
attributes at scale).
- extract.description: clarify that extract is for cases the snapshot
doesn't already answer; explicit "do not pass empty {}" warning.
- search_page.description: scoped to "when the snapshot doesn't show the
answer"; added concrete alternate-spelling guidance.
- find_elements.description: scoped to truncated snapshots, large
attribute extraction, and subtree enumeration via withinRef.
- outputSchema description: explicit "REQUIRED with a real schema; {}
is NOT valid".
Relaxed three description-string test assertions from .toBe / .toContain
to .toMatch so iteration on description copy doesn't break tests.
Local 5-task micro-eval (gemini-2.5-flash, vertex, chrome, headless):
total input tokens 390,971 (baseline) → 226,301 (iter2), -42%. Biggest
single win: search_page_lookup task (218K → 73K, -66%) — agent now
answers "CSS1 published 1996" from the snapshot directly. Sticky
remainder: model still passes outputSchema:{} instead of a real schema.
Tool wiring, types, and behavior are unchanged.
…ples

Iter3 of the local prompt-tuning loop. Two targeted nudges on top of the iter2 snapshot-first framing:
- searchPage.description: require at least one zero-match recovery attempt (variant spelling, regex word-boundary, etc.) before answering "no". A single zero-match search is explicitly NOT a final answer.
- extract.outputSchema description: three copy-and-adapt one-line schema examples (single object, list of items, boolean+reason) plus an explicit "STOP and write out the shape before calling extract."

Local 5-task micro-eval (gemini-2.5-flash, vertex, chrome, headless): total input tokens 226,301 (iter2) → 251,515 (iter3), +11%. The regression is entirely from search_page_presence: the agent now correctly tries both spellings before concluding (the answer is correctly "No"; the page truly doesn't mention Beautiful Soup). vs baseline: -36%.

outputSchema effectiveness remains unexercised: the agent skipped extract on the structured-data task because the HN snapshot already contained the answer. A task where the snapshot is genuinely insufficient is needed to evaluate the new outputSchema guidance.
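For illustration, the three schema shapes named above (single object, list of items, boolean+reason) might look like the following. These are hypothetical examples, not the strings shipped in the prompt; the property names are invented.

```typescript
// Loose alias for a JSON Schema document (assumption; the real code may use a stricter type).
type JsonSchema = { [key: string]: unknown };

// Single object: one record with named fields.
const singleObjectSchema: JsonSchema = {
  type: "object",
  properties: { title: { type: "string" }, year: { type: "number" } },
  required: ["title", "year"],
};

// List of items: an array wrapped in an object so the top level stays an object.
const listOfItemsSchema: JsonSchema = {
  type: "object",
  properties: {
    items: {
      type: "array",
      items: {
        type: "object",
        properties: { name: { type: "string" }, url: { type: "string" } },
        required: ["name"],
      },
    },
  },
  required: ["items"],
};

// Boolean + reason: a yes/no answer with a short justification.
const booleanWithReasonSchema: JsonSchema = {
  type: "object",
  properties: { answer: { type: "boolean" }, reason: { type: "string" } },
  required: ["answer", "reason"],
};
```

Any of these would be passed as the `outputSchema` argument to `extract`; each has real keys, unlike the empty `{}` the model tends to send.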
Empty commit to fire `evals/**` workflow after switching the eval pipeline's PILO_PW_CDP_ENDPOINT from bundled-browser (default fallback) to Browserless. Several iter3 failures were navCount=1 / "Execution context destroyed" patterns consistent with bundled-browser flakiness in the Argo pod environment, not prompt regressions. This run isolates the prompt changes from the browser stack.
Adds a runtime guard to the extract tool: when outputSchema is provided
but evaluates to {} (no keys), short-circuit before any LLM call and
return a recoverable error instructing the agent to either fill in a
real schema or omit the argument.
Why: three rounds of prompt iteration could not stop gemini-2.5-flash
from passing outputSchema:{} when asked for structured output. Across
two CI eval runs (iter3 bundled + iter3 Browserless, 60 tasks total),
zero extract calls included a real JSON Schema — 3-4 calls per run
passed {} which gives no validation and is functionally identical to
omitting the argument. Prompt-only enforcement has reached a model-
capability ceiling. The guard surfaces the issue as a tool error so
the agent can self-correct mid-task.
Behavior:
- outputSchema undefined → markdown branch (unchanged)
- outputSchema with real keys → generateObject branch (unchanged)
- outputSchema = {} → recoverable error with instructions; no
getMarkdown(), no LLM call, no token spend
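The three cases above can be sketched as a small classifier that runs before any browser or LLM work. `classifyExtractCall`, `ExtractArgs`, and `GuardResult` are hypothetical names; only the `success: false` / `isRecoverable: true` result shape is taken from the description, and the error wording is invented.

```typescript
type ExtractArgs = { outputSchema?: Record<string, unknown> };

type GuardResult =
  | { kind: "markdown" }
  | { kind: "structured"; schema: Record<string, unknown> }
  | { kind: "rejected"; success: false; isRecoverable: true; error: string };

// Decide which branch an extract call takes without touching the page or the model.
function classifyExtractCall(args: ExtractArgs): GuardResult {
  const schema = args.outputSchema;
  if (schema === undefined) return { kind: "markdown" };
  if (Object.keys(schema).length === 0) {
    // Empty schema: short-circuit with a recoverable error that tells the agent how to fix the call.
    return {
      kind: "rejected",
      success: false,
      isRecoverable: true,
      error: "outputSchema is empty; pass a real JSON Schema or omit the argument",
    };
  }
  return { kind: "structured", schema };
}
```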
Also updates the outputSchema description so the agent knows the
rejection is enforced at runtime rather than a soft prompt-level
preference.
Tests: +1 covering the empty-schema rejection (no LLM/browser calls,
returns success:false / isRecoverable:true with a guiding error).
Existing extract tests unchanged (720 / 720 passing).
The previous run (pilo-batch-github-eval-khcjt) had 0/30 passes because the GKE pilo-secrets bundle was reset to stubs/empty values in the ~23-minute gap between two consecutive evals — likely another local make cloud-secrets invocation from a different .env state. This commit retriggers the eval against ad84c4b + 1bc9d4a (runtime guard for empty extract outputSchema) with the correct secret.
e17b78b to b6a40ab
The hard rejection from the previous commit (be609b6) caused two task failures on the 100-task CI eval (Google Map #4 and ESPN #0): gemini-2.5-flash passes outputSchema:{}, sees the recoverable error, retries with outputSchema:{} again, and after 5 consecutive errors the agent layer aborts the whole task.

Soften the guard: when outputSchema is non-null but has no keys, silently treat it as if it were omitted (fall through to the markdown branch). An empty {} schema gave no validation anyway, so the structured branch with an empty schema is indistinguishable from the markdown branch. The fall-through is logged via an AGENT_STATUS event so the downgrade is visible in traces.

Updated the outputSchema prompt copy: "an empty {} provides no validation and is silently downgraded to a markdown extract" instead of "will be REJECTED with a recoverable error".

Test: updated to assert the markdown branch IS called and the status event IS emitted when outputSchema:{} is passed. Previously asserted the recoverable-error shape; that behavior is gone.
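A minimal sketch of the softened guard, assuming a callback stands in for the AGENT_STATUS event. `resolveOutputSchema` and `StatusEmitter` are hypothetical names and the status message text is invented.

```typescript
// Stand-in for emitting an AGENT_STATUS event (assumption: the real emitter has a richer payload).
type StatusEmitter = (message: string) => void;

// Normalize the schema argument: an empty {} is downgraded to "no schema" and the downgrade is reported.
function resolveOutputSchema(
  outputSchema: Record<string, unknown> | null | undefined,
  emitStatus: StatusEmitter,
): Record<string, unknown> | undefined {
  if (outputSchema == null) return undefined;
  if (Object.keys(outputSchema).length === 0) {
    emitStatus("extract: empty outputSchema ignored, using markdown extract");
    return undefined;
  }
  return outputSchema;
}
```

A caller would take the markdown branch whenever this returns `undefined`, so a model that keeps sending `{}` still gets a usable result instead of accumulating errors.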
Summary
- Adds two zero-LLM page-inspection tools (`search_page`, `find_elements`) so the agent doesn't pay an LLM round-trip for inventory questions ("is X on the page?", "how many Y are there?").
- Extends `extract` with an optional `outputSchema` (JSON Schema) so the agent can request structured JSON output instead of markdown text it then has to re-parse.
- Implemented in both `PlaywrightBrowser` (CLI/server, with iframe iteration) and `ExtensionBrowser` (top-frame only, matching the extension's existing `getTreeWithRefs` scope).

Design Decisions
- Direct `AriaBrowser` methods rather than routing through `PageAction` + `performAction`. Typed input/output, no string-encoded options. Mirrors `extract`'s existing bypass of `performAction`.
- New `inspectionTools.ts` factory alongside `webActionTools` / `searchTools` / `tabstackTools` / `interactiveToolSet`. Matches the per-concern file pattern; signals read-only intent.
- Refs resolved via the existing `data-pilo-ref` DOM attribute. The aria-tree code already sets this attribute during tree generation; new tools resolve `nearestRef` via `el.closest('[data-pilo-ref]')` and `withinRef` lookups via `document.querySelector('[data-pilo-ref="..."]')`. No changes to the aria-tree bundle.
- Frame coverage differs by backend: Playwright iterates frames and tags per-frame matches with `frameUrl`; Extension is top-frame only (so `frameUrl` is always undefined in extension results). Forcing parity in either direction would require either dropping Playwright's frame coverage or adding `allFrames: true` machinery to the extension.
- Unconditional wiring, unlike `searchTools` (gated on search service), `tabstackTools` (gated on API key), and `interactiveToolSet` (gated on callback).
- Structured `extract` routes through `generateObjectWithRetry` with the AI SDK's `jsonSchema()` helper. Markdown branch (no `outputSchema`) is byte-identical to its pre-PR behavior. Complementary to `tabstack_extract_json`: that one is for off-page URL fetches via the Tabstack API; this is for the current page using the configured LLM provider.

Changes
Core (`packages/core/src/`):
- `tools/inspectionTools.ts` (new): `createInspectionTools` factory exposing `search_page` and `find_elements`.
- `browser/ariaBrowser.ts`: new types (`SearchPageOptions/Match/Result`, `FindElementsOptions/Match/Result`) and two new interface methods (`searchPage`, `findElements`).
- `browser/playwrightBrowser.ts`: `searchPage` and `findElements` implementations with cross-frame iteration mirroring `getTreeWithRefsImpl`.
- `tools/webActionTools.ts`: `extract` extended with optional `outputSchema`.
- `utils/retry.ts`: new `generateObjectWithRetry` + shared `retryDriver<T>` helper extracted from both wrappers. `NoObjectGeneratedError` is now non-retryable.
- `webAgent.ts`: `createInspectionTools` instantiated unconditionally; `search_page` and `find_elements` added to the `pageChanged` exempt list.
- `prompts.ts`: three new `TOOL_STRINGS` entries (`searchPage`, `findElements`, extended `extract`); three `buildToolExamples` lines; one best-practices bullet in `actionLoopSystemPromptTemplate`.
- `core.ts`: new type exports.

Extension (`packages/extension/src/`):
- `background/ExtensionBrowser.ts`: `searchPage` and `findElements` implementations via `browser.scripting.executeScript`, top frame only.

Tests: +1305 across packages (core 719, cli 221, server 88, extension 277). New `inspectionTools.test.ts` (446 lines) and new `describe` blocks in `playwrightBrowser.test.ts`, `ExtensionBrowser.test.ts`, `retry.test.ts`, `webActionTools.test.ts`. Includes `NoObjectGeneratedError` non-retry coverage.

Test Plan
- `pnpm run check` passes (typecheck + format:check + 1305 tests across all packages)
- `gitleaks detect`: no leaks
- `pnpm pilo run "<task>"` against a static page (e.g., a Wikipedia article) and a dynamic page (e.g., a documentation search results page) and confirm:
  - `search_page` returns matches with `nearestRef` populated
  - `find_elements` with a CSS selector returns elements with auto-resolved `href`/`src`
  - `find_elements` with `withinRef` scopes correctly
  - `extract` without `outputSchema` still returns markdown
  - `extract` with `outputSchema` returns a JSON object matching the schema

References
- `docs/dev-sessions/2026-05-13-1319-page-exploration-tools/spec.md` (not committed to git; session artifact)
- `docs/dev-sessions/2026-05-13-1319-page-exploration-tools/plan.md` (not committed to git; session artifact)

🤖 Generated with Claude Code