Current state
Pilo's only on-page information-extraction tool is extract (packages/core/src/tools/webActionTools.ts:310-358), which is LLM-powered:
extract: tool({
  description: "Extract specific data from the current page for later reference",
  inputSchema: z.object({
    description: z.string(),
  }),
  execute: async ({ description }) => {
    const markdown = await context.browser.getMarkdown();
    const prompt = buildExtractionPrompt(description, markdown);
    const extractResponse = await generateTextWithRetry({...}, { maxAttempts: 3 });
    // ... returns extractedData as markdown string ...
  },
})
Every extract call:
- Converts the whole page to markdown (via Turndown, playwrightBrowser.ts:668-696)
- Sends ~5000 tokens to an LLM
- Retries up to 3 times on failure
- Returns markdown text the agent then has to parse/interpret
The agent has no cheaper alternative for simpler questions ("is the word 'logout' on this page?" / "how many product cards are there?" / "what's the URL of the link with text 'Privacy Policy'?"). Every such question costs an extract LLM round trip.
The gap
Three related capability gaps:
- No zero-LLM page text search — for "does the page contain X?" the agent must call extract with a descriptive query and pay LLM cost + latency.
- No zero-LLM element query — for "how many <article> elements are there?" or "what are the hrefs of links in <nav>?" — same story.
- extract returns markdown only — when the agent wants structured data (a list of 10 items each with { name, price, url }), it has to parse the markdown back out, which is fragile. The Vercel AI SDK supports generateObject for structured output; Pilo's extract doesn't use it.
Proposed scope
A. Add search_page tool
search_page: tool({
  description:
    "Search the current page content for text matching a pattern. " +
    "Returns matches with surrounding context. Free and fast — prefer this over " +
    "extract() when you know what text to look for.",
  inputSchema: z.object({
    pattern: z.string(),
    regex: z.boolean().default(false),
    caseSensitive: z.boolean().default(false),
    contextChars: z.number().min(0).max(500).default(80),
    maxResults: z.number().min(1).max(50).default(10),
  }),
  execute: async ({ pattern, regex, caseSensitive, contextChars, maxResults }) => {
    return performActionWithValidation(
      PageAction.SearchPage,
      context,
      undefined,
      JSON.stringify({ pattern, regex, caseSensitive, contextChars, maxResults }),
    );
  },
}),
Implementation in playwrightBrowser.ts via page.evaluate:
const matches = await this.page!.evaluate(
  ({ pattern, regex, caseSensitive, contextChars, maxResults }) => {
    const re = regex
      ? new RegExp(pattern, caseSensitive ? "g" : "gi")
      : new RegExp(pattern.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"), caseSensitive ? "g" : "gi");
    // Walk text nodes via TreeWalker, accumulating offsets
    // Return array of { match, contextBefore, contextAfter, element selector hint }
  },
  { ... },
);
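The sketch above elides the matching loop. Pulled out of the DOM, the core logic (literal-pattern escaping, case handling, context slicing, result capping) can be unit-tested on a plain string. A possible shape, with searchText as a hypothetical helper name:

```typescript
// Core matching logic for search_page, independent of the DOM. In the real
// tool this would run inside page.evaluate over text-node content; here it
// operates on a plain string so the behavior is easy to unit-test.

interface SearchOptions {
  regex?: boolean;
  caseSensitive?: boolean;
  contextChars?: number;
  maxResults?: number;
}

interface SearchMatch {
  match: string;
  index: number;
  contextBefore: string;
  contextAfter: string;
}

function searchText(text: string, pattern: string, opts: SearchOptions = {}): SearchMatch[] {
  const { regex = false, caseSensitive = false, contextChars = 80, maxResults = 10 } = opts;
  // Literal patterns are escaped so characters like "." or "(" match themselves.
  const source = regex ? pattern : pattern.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const re = new RegExp(source, caseSensitive ? "g" : "gi");
  const results: SearchMatch[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null && results.length < maxResults) {
    results.push({
      match: m[0],
      index: m.index,
      contextBefore: text.slice(Math.max(0, m.index - contextChars), m.index),
      contextAfter: text.slice(m.index + m[0].length, m.index + m[0].length + contextChars),
    });
    // Guard against zero-length matches (e.g. pattern "a*") looping forever.
    if (m[0].length === 0) re.lastIndex++;
  }
  return results;
}
```

The in-page version would additionally track which text node each offset falls in, to produce the element selector hint.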
B. Add find_elements tool
find_elements: tool({
  description:
    "Find elements on the page by CSS selector. Returns matching elements with their " +
    "text and attributes. Free and fast — useful for inventory queries like " +
    "'how many product cards are there?' before deciding to extract().",
  inputSchema: z.object({
    selector: z.string(),
    attributes: z.array(z.string()).optional()
      .describe("Specific attributes to include (e.g., ['href', 'data-id'])"),
    maxResults: z.number().min(1).max(100).default(20),
    includeText: z.boolean().default(true),
  }),
  execute: async ({ selector, attributes, maxResults, includeText }) => {
    return performActionWithValidation(
      PageAction.FindElements,
      context,
      undefined,
      JSON.stringify({ selector, attributes, maxResults, includeText }),
    );
  },
}),
Implementation runs document.querySelectorAll in-page, returns { tag, text, attributes } per match. Resolve src/href to absolute URLs.
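The URL resolution is worth pinning down. A hedged sketch of that post-processing step, with resolveUrls as a hypothetical helper, using the WHATWG URL constructor to absolutize href/src against the page URL:

```typescript
// Hypothetical post-processing for find_elements results: given the raw
// { tag, text, attributes } entries collected in-page, resolve href/src
// values against the page URL so the agent always sees absolute links.

interface ElementInfo {
  tag: string;
  text?: string;
  attributes: Record<string, string>;
}

function resolveUrls(elements: ElementInfo[], pageUrl: string): ElementInfo[] {
  return elements.map((el) => {
    const attributes = { ...el.attributes };
    for (const name of ["href", "src"]) {
      const value = attributes[name];
      if (value !== undefined) {
        try {
          // new URL(relative, base) yields an absolute URL; values that are
          // already absolute pass through unchanged.
          attributes[name] = new URL(value, pageUrl).href;
        } catch {
          // Leave values that don't parse as URLs untouched.
        }
      }
    }
    return { ...el, attributes };
  });
}
```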
C. Add optional outputSchema to extract
Extend the existing tool:
extract: tool({
  description:
    "Extract data from the current page. If outputSchema is provided, returns structured " +
    "data matching the schema. Otherwise returns markdown text.",
  inputSchema: z.object({
    description: z.string(),
    outputSchema: z.record(z.string(), z.unknown()).optional()
      .describe("JSON Schema describing the desired output structure"),
  }),
  execute: async ({ description, outputSchema }) => {
    const markdown = await context.browser.getMarkdown();
    if (outputSchema) {
      const zodSchema = jsonSchemaToZod(outputSchema);
      const { object } = await generateObjectWithRetry({
        ...providerConfig,
        prompt: buildExtractionPrompt(description, markdown),
        schema: zodSchema,
      }, { maxAttempts: 3 });
      return { success: true, action: "extract", description, data: object };
    } else {
      // existing markdown path
    }
  },
}),
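To make the intended call shape concrete, here is a hypothetical invocation for a product-listing task; the field names (products/name/price/url) are illustrative, not part of Pilo's API:

```typescript
// Example arguments an agent might pass to the extended extract tool:
// a plain JSON Schema object describing the structure it wants back.
const exampleExtractArgs = {
  description: "the top 5 products with their names, prices, and URLs",
  outputSchema: {
    type: "object",
    properties: {
      products: {
        type: "array",
        items: {
          type: "object",
          properties: {
            name: { type: "string" },
            price: { type: "string" },
            url: { type: "string" },
          },
          required: ["name", "price"],
        },
      },
    },
    required: ["products"],
  },
};
```

The top level is an object (not a bare array) so it satisfies the z.record input schema.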
generateObjectWithRetry is a thin wrapper around generateObject from the AI SDK, following the same retry pattern as generateTextWithRetry.
We either need a small JSON Schema → Zod converter, or (simpler) no converter at all: since tool inputs arrive as plain JSON, accept outputSchema as z.record(z.string(), z.unknown()), run generateObject in { output: 'no-schema' } mode with the schema described in the prompt, and validate the returned object against the schema afterwards. Given that tool schemas are already Zod, this is the most flexible path.
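The retry half of that wrapper can be sketched independently of the AI SDK call. A minimal skeleton follows (withRetry is a hypothetical name, and the backoff values are assumptions, not Pilo's actual policy):

```typescript
// Generic retry skeleton for generateObjectWithRetry. The real wrapper would
// forward its arguments to generateObject from the AI SDK; only the retry
// loop and exponential backoff are sketched here.

async function withRetry<T>(
  fn: () => Promise<T>,
  { maxAttempts = 3, baseDelayMs = 250 }: { maxAttempts?: number; baseDelayMs?: number } = {},
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        // Exponential backoff between attempts: 250ms, 500ms, 1000ms, ...
        await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** (attempt - 1)));
      }
    }
  }
  throw lastError;
}
```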
D. Update prompts
In prompts.ts:163-210 (buildToolExamples):
- search_page({"pattern": "logout"}) - Search page text. Free, fast.
- find_elements({"selector": "a.nav-link"}) - Query elements by CSS selector. Free, fast.
- extract({"description": "...", "outputSchema": {...}}) - Extract data. Use outputSchema for structured output (lists of items, key-value pairs, etc.).
Add to best practices:
For inventory questions ("how many X are there?", "is Y on the page?"), prefer
find_elements or search_page — they are free and instant. Reserve extract for
cases where you need synthesized or structured data the page doesn't expose directly.
Implementation notes
- These tools run via performActionWithValidation for consistency in error handling and event emission, even though they aren't "actions" in the traditional sense (no DOM mutation). The naming is a bit off but consistent with the existing pattern.
- search_page regex compilation can throw SyntaxError on bad patterns — return { success: false, error: "...", isRecoverable: true } rather than crashing.
- find_elements selector can throw DOMException on bad selectors — same treatment.
- Both tools should be safe and idempotent — no pageChanged: true.
- The result shapes are not the standard ActionResult; consider extending the type or adding a discriminated union. Worth a small refactor.
Acceptance criteria
- search_page and find_elements are available in webActionTools, with the right tool descriptions and prompt examples.
- extract accepts an optional outputSchema and returns structured data when provided.
- Tests in packages/core/test/ cover: text search (literal and regex), CSS query for various selectors, bad-pattern error handling, structured extract with a schema.
- A manual eval on a small task set (e.g., "find the number of pricing tiers on this page" / "extract the top 5 product names and prices") shows the new tools reduce LLM calls per task.
Effort estimate
2-4 days. The two zero-LLM tools are quick (1 day each). The outputSchema work depends on how clean the JSON Schema → Zod path is.
Related issues
Pairs with the action-vocabulary-additions issue (both expand tool capabilities). Related to the modal/viewport-context issue (those tools also benefit from a clearer page model).
Files likely affected
- packages/core/src/tools/webActionTools.ts (or new tools/inspectionTools.ts)
- packages/core/src/browser/ariaBrowser.ts (PageAction enum)
- packages/core/src/browser/playwrightBrowser.ts (handlers)
- packages/core/src/prompts.ts (tool examples + best practices)
- packages/core/src/utils/retry.ts (add generateObjectWithRetry)
- packages/core/test/