Add zero-LLM page exploration tools (search_page, find_elements, structured extract) #432

@lmorchard

Current state

Pilo's only on-page information-extraction tool is extract (packages/core/src/tools/webActionTools.ts:310-358), which is LLM-powered:

extract: tool({
  description: "Extract specific data from the current page for later reference",
  inputSchema: z.object({
    description: z.string(),
  }),
  execute: async ({ description }) => {
    const markdown = await context.browser.getMarkdown();
    const prompt = buildExtractionPrompt(description, markdown);
    const extractResponse = await generateTextWithRetry({...}, { maxAttempts: 3 });
    // ... returns extractedData as markdown string ...
  },
})

Every extract call:

  • Converts the whole page to markdown (via Turndown, playwrightBrowser.ts:668-696)
  • Sends ~5000 tokens to an LLM
  • Retries up to 3 times on failure
  • Returns markdown text the agent then has to parse/interpret

The agent has no cheaper alternative for simpler questions ("is the word 'logout' on this page?" / "how many product cards are there?" / "what's the URL of the link with text 'Privacy Policy'?"). Every such question costs an extract LLM round trip.

The gap

Three related capability gaps:

  1. No zero-LLM page text search: for "does the page contain X?" the agent must call extract with a descriptive query and pay LLM cost and latency.
  2. No zero-LLM element query: "how many <article> elements are there?" or "what are the hrefs of links in <nav>?" also requires an extract call, with the same cost.
  3. extract returns markdown only — when the agent wants structured data (a list of 10 items each with { name, price, url }), it has to parse the markdown back out, which is fragile. The Vercel AI SDK supports generateObject for structured output; Pilo's extract doesn't use it.

Proposed scope

A. Add search_page tool

search_page: tool({
  description:
    "Search the current page content for text matching a pattern. " +
    "Returns matches with surrounding context. Free and fast — prefer this over " +
    "extract() when you know what text to look for.",
  inputSchema: z.object({
    pattern: z.string(),
    regex: z.boolean().default(false),
    caseSensitive: z.boolean().default(false),
    contextChars: z.number().min(0).max(500).default(80),
    maxResults: z.number().min(1).max(50).default(10),
  }),
  execute: async ({ pattern, regex, caseSensitive, contextChars, maxResults }) => {
    return performActionWithValidation(
      PageAction.SearchPage,
      context,
      undefined,
      JSON.stringify({ pattern, regex, caseSensitive, contextChars, maxResults }),
    );
  },
}),

Implementation in playwrightBrowser.ts via page.evaluate:

const matches = await this.page!.evaluate(({ pattern, regex, caseSensitive, contextChars, maxResults }) => {
  const re = regex
    ? new RegExp(pattern, caseSensitive ? "g" : "gi")
    : new RegExp(pattern.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"), caseSensitive ? "g" : "gi");
  // Walk text nodes via TreeWalker, accumulating offsets
  // Return array of { match, contextBefore, contextAfter, element selector hint }
}, { ... });
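A rough sketch of the matching logic the snippet above leaves as comments, written as a pure function over a string so it can be unit-tested without a browser. The names searchText, SearchOptions, and SearchMatch are hypothetical, and the sketch assumes the in-page code has already concatenated the text nodes into one string (it does not produce the element selector hint). It uses a RegExp.exec loop and skips zero-length matches so a pattern like a* cannot loop forever.

```typescript
// Hypothetical helper names; not part of Pilo today.
interface SearchOptions {
  pattern: string;
  regex?: boolean;
  caseSensitive?: boolean;
  contextChars?: number;
  maxResults?: number;
}

interface SearchMatch {
  match: string;
  index: number;
  contextBefore: string;
  contextAfter: string;
}

// Escape regex metacharacters so a literal pattern matches verbatim.
function escapeRegExp(literal: string): string {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function searchText(text: string, options: SearchOptions): SearchMatch[] {
  const {
    pattern,
    regex = false,
    caseSensitive = false,
    contextChars = 80,
    maxResults = 10,
  } = options;
  const source = regex ? pattern : escapeRegExp(pattern);
  // Throws SyntaxError for an invalid user-supplied regex; callers should catch it.
  const re = new RegExp(source, caseSensitive ? "g" : "gi");
  const results: SearchMatch[] = [];
  let m: RegExpExecArray | null;
  while (results.length < maxResults && (m = re.exec(text)) !== null) {
    if (m[0].length === 0) {
      re.lastIndex += 1; // step past an empty match to avoid an infinite loop
      continue;
    }
    const start = m.index;
    const end = start + m[0].length;
    results.push({
      match: m[0],
      index: start,
      contextBefore: text.slice(Math.max(0, start - contextChars), start),
      contextAfter: text.slice(end, end + contextChars),
    });
  }
  return results;
}
```

In the real handler this logic would have to live inside the page.evaluate callback (or be serialized into it), since functions defined in Node are not visible in the page.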

B. Add find_elements tool

find_elements: tool({
  description:
    "Find elements on the page by CSS selector. Returns matching elements with their " +
    "text and attributes. Free and fast — useful for inventory queries like " +
    "'how many product cards are there?' before deciding to extract().",
  inputSchema: z.object({
    selector: z.string(),
    attributes: z.array(z.string()).optional()
      .describe("Specific attributes to include (e.g., ['href', 'data-id'])"),
    maxResults: z.number().min(1).max(100).default(20),
    includeText: z.boolean().default(true),
  }),
  execute: async ({ selector, attributes, maxResults, includeText }) => {
    return performActionWithValidation(
      PageAction.FindElements,
      context,
      undefined,
      JSON.stringify({ selector, attributes, maxResults, includeText }),
    );
  },
}),

Implementation runs document.querySelectorAll in-page, returns { tag, text, attributes } per match. Resolve src/href to absolute URLs.
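One possible shape for the per-match summary, sketched against a structural subset of the DOM Element interface so it can be tested with plain objects. ElementLike, ElementSummary, summarizeElements, and toAbsoluteUrl are hypothetical names. The default of returning href and src when attributes is omitted is an assumption, not something this issue specifies.

```typescript
// Structural subset of the DOM Element interface (hypothetical, for testability).
interface ElementLike {
  tagName: string;
  textContent: string | null;
  getAttribute(name: string): string | null;
}

interface ElementSummary {
  tag: string;
  text?: string;
  attributes: Record<string, string>;
}

// Attributes whose values are resolved against the page URL.
const URL_ATTRIBUTES = ["href", "src"];

// Resolve a possibly relative URL; leave malformed values untouched.
function toAbsoluteUrl(value: string, baseUrl: string): string {
  try {
    return new URL(value, baseUrl).href;
  } catch {
    return value;
  }
}

function summarizeElements(
  elements: ElementLike[],
  baseUrl: string,
  options: { attributes?: string[]; maxResults?: number; includeText?: boolean } = {},
): ElementSummary[] {
  // Assumption: default to href/src when the caller names no attributes.
  const { attributes = URL_ATTRIBUTES, maxResults = 20, includeText = true } = options;
  return elements.slice(0, maxResults).map((el) => {
    const attrs: Record<string, string> = {};
    for (const name of attributes) {
      const value = el.getAttribute(name);
      if (value === null) continue; // attribute not present on this element
      attrs[name] = URL_ATTRIBUTES.indexOf(name) >= 0 ? toAbsoluteUrl(value, baseUrl) : value;
    }
    const summary: ElementSummary = { tag: el.tagName.toLowerCase(), attributes: attrs };
    if (includeText) {
      // Collapse whitespace so the agent sees compact text.
      summary.text = (el.textContent ?? "").trim().replace(/\s+/g, " ");
    }
    return summary;
  });
}
```

In the browser, the elements argument would be Array.from(document.querySelectorAll(selector)) and baseUrl would be document.baseURI. It may also be worth returning the total match count alongside the truncated list, so "how many X are there?" stays answerable when there are more than maxResults matches.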

C. Add optional outputSchema to extract

Extend the existing tool:

extract: tool({
  description:
    "Extract data from the current page. If outputSchema is provided, returns structured " +
    "data matching the schema. Else returns markdown text.",
  inputSchema: z.object({
    description: z.string(),
    outputSchema: z.record(z.string(), z.unknown()).optional()
      .describe("JSON Schema describing the desired output structure"),
  }),
  execute: async ({ description, outputSchema }) => {
    const markdown = await context.browser.getMarkdown();
    if (outputSchema) {
      const zodSchema = jsonSchemaToZod(outputSchema);
      const { object } = await generateObjectWithRetry({
        ...providerConfig,
        prompt: buildExtractionPrompt(description, markdown),
        schema: zodSchema,
      }, { maxAttempts: 3 });
      return { success: true, action: "extract", description, data: object };
    } else {
      // existing markdown path
    }
  },
}),

generateObjectWithRetry is a thin wrapper around generateObject from the AI SDK following the same retry pattern as generateTextWithRetry.
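The retry pattern could be factored into a generic helper that both wrappers share. This is only a sketch: withRetry and RetryOptions are hypothetical names, and the actual signature and backoff behaviour of generateTextWithRetry are not shown in this issue, so the fixed delay here is an assumption.

```typescript
// Hypothetical generic retry helper; not Pilo's actual retry.ts implementation.
interface RetryOptions {
  maxAttempts: number;
  delayMs?: number; // fixed delay between attempts (assumption; real code may back off)
}

async function withRetry<T>(
  operation: (attempt: number) => Promise<T>,
  options: RetryOptions,
): Promise<T> {
  const { maxAttempts, delayMs = 0 } = options;
  if (maxAttempts < 1) {
    throw new RangeError("maxAttempts must be at least 1");
  }
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation(attempt);
    } catch (error) {
      lastError = error;
      // Wait before the next attempt, but not after the final one.
      if (attempt < maxAttempts && delayMs > 0) {
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  // Every attempt failed: surface the last error to the caller.
  throw lastError;
}
```

With a helper like this, generateObjectWithRetry(config, opts) would reduce to withRetry(() => generateObject(config), opts), and generateTextWithRetry could be expressed the same way.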

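The agent passes outputSchema as JSON in the tool call, so it arrives as a JSON Schema object, not a Zod schema. There are two ways to handle it: (1) write a small JSON Schema → Zod converter and pass the result as schema, or (2, simpler) keep outputSchema typed as z.record(z.string(), z.unknown()), describe it in the prompt, call generateObject({ output: 'no-schema' }), and validate the returned JSON against the schema afterwards. Option 2 is the most flexible. The AI SDK also exports a jsonSchema() helper that wraps a raw JSON Schema for generateObject; if it works with the SDK version Pilo uses, it may avoid the converter entirely.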

D. Update prompts

In prompts.ts:163-210 (buildToolExamples):

- search_page({"pattern": "logout"}) - Search page text. Free, fast.
- find_elements({"selector": "a.nav-link"}) - Query elements by CSS selector. Free, fast.
- extract({"description": "...", "outputSchema": {...}}) - Extract data. Use outputSchema
  for structured output (lists of items, key-value pairs, etc.).

Add to best practices:

For inventory questions ("how many X are there?", "is Y on the page?"), prefer
find_elements or search_page — they are free and instant. Reserve extract for
cases where you need synthesized or structured data the page doesn't expose directly.

Implementation notes

  • These tools run via performActionWithValidation for consistency in error handling and event emission, even though they aren't "actions" in the traditional sense (no DOM mutation). The naming is a bit off but consistent with the existing pattern.
  • search_page regex compilation can throw SyntaxError on bad patterns — return { success: false, error: "...", isRecoverable: true } rather than crashing.
  • find_elements selector can throw DOMException on bad selectors — same treatment.
  • Both tools are read-only and idempotent, so their results should never set pageChanged: true.
  • The result shapes are not the standard ActionResult; consider extending the type or adding a discriminated union. Worth a small refactor.
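The error-handling and result-shape notes above could be combined into one discriminated union plus a small wrapper. This is a sketch under assumptions: InspectionResult and runInspection are hypothetical names, and the existing ActionResult type is not shown in this issue, so the fields here only mirror the ones this issue mentions. The wrapper treats errors named SyntaxError as recoverable, which covers a bad regex from the RegExp constructor and, in a browser, the DOMException that querySelectorAll throws for an invalid selector.

```typescript
// Hypothetical result type for read-only inspection tools.
type InspectionResult<T> =
  | { success: true; action: string; data: T }
  | { success: false; action: string; error: string; isRecoverable: boolean };

function runInspection<T>(action: string, run: () => T): InspectionResult<T> {
  try {
    return { success: true, action, data: run() };
  } catch (err) {
    const name = err instanceof Error ? err.name : "";
    const message = err instanceof Error ? err.message : String(err);
    // A bad pattern or selector is the agent's mistake and can be retried with
    // corrected input; anything else is reported as non-recoverable (assumption).
    return { success: false, action, error: message, isRecoverable: name === "SyntaxError" };
  }
}
```

Because success is the discriminant, callers get data only after checking success === true. One caveat: an error thrown inside page.evaluate crosses the Playwright boundary before it reaches Node, and its name may not survive, so the classification is probably best done in-page.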

Acceptance criteria

  • search_page and find_elements are available in webActionTools, with the right tool descriptions and prompt examples.
  • extract accepts an optional outputSchema and returns structured data when provided.
  • Tests in packages/core/test/ cover: text search (literal and regex), CSS query for various selectors, bad-pattern error handling, structured extract with a schema.
  • A manual eval on a small task set (e.g., "find the number of pricing tiers on this page" / "extract the top 5 product names and prices") shows the new tools reduce LLM calls per task.

Effort estimate

2-4 days. The two zero-LLM tools are quick (1 day each). The outputSchema work depends on how clean the JSON Schema → Zod path is.

Related issues

Pairs with the action-vocabulary-additions issue (both expand tool capabilities). Related to the modal/viewport-context issue (those tools also benefit from a clearer page model).

Files likely affected

  • packages/core/src/tools/webActionTools.ts (or new tools/inspectionTools.ts)
  • packages/core/src/browser/ariaBrowser.ts (PageAction enum)
  • packages/core/src/browser/playwrightBrowser.ts (handlers)
  • packages/core/src/prompts.ts (tool examples + best practices)
  • packages/core/src/utils/retry.ts (add generateObjectWithRetry)
  • packages/core/test/

Metadata

Labels: enhancement (New feature or request)