Current state
Pilo's ariaTree (packages/core/src/browser/ariaTree/ariaSnapshot.ts) generates a tree of all CSS-visible, accessible elements with getBoundingClientRect() width and height > 0. Two consequences:
1. No modal-aware filtering
When a modal/dialog is open on top of the page, the underlying page's elements also get refs in the snapshot. Example: a "Sign in" modal covers a checkout page. The snapshot contains both the modal's <input>/buttons AND the underlying page's "Buy now" / "Continue" / etc. buttons. The LLM has to guess which layer to interact with.
The system prompt papers over this with guidance (prompts.ts:339):
Clear obstructing modals/popups first
But the model often clicks the wrong layer first because the underlying elements look like valid targets in the tree.
2. No viewport / scroll-position context
The snapshot includes elements far below the fold — anything with non-zero size is in the tree, regardless of whether the user can see it. The model has no signal for:
- "Is element E47 currently visible, or 3000px down?"
- "Has the page been scrolled? How much room is below?"
- "Is there content above I might need to scroll up to?"
The LLM uses cues like "this is at the top of the tree, so probably near the top of the page" — which is mostly true but breaks for position: fixed headers, sticky sidebars, modal portals that render at the end of <body> etc.
The gap
Modal blindness is a real source of agent failure: the model fills the wrong form, clicks the wrong button, or gets errors about elements being obscured. Pilo's BROWSER_ACTION_COMPLETED errors reporting "Element is not visible" or "obstructed by another element" trace back to this.
Lack of scroll context makes scroll decisions blind: the model doesn't know when to scroll or how far. If a scroll tool is added (see separate issue), it lands without supporting context.
Proposed scope
A. Modal-aware occlusion in ariaTree
During tree generation in ariaSnapshot.ts, detect "modal-like" elements:
function isModalLike(element: Element, role: string, props: Record<string, unknown>): boolean {
  // ARIA: explicit dialog/alertdialog with aria-modal="true"
  if ((role === "dialog" || role === "alertdialog") && element.getAttribute("aria-modal") === "true") {
    return true;
  }
  // HTML <dialog> opened with showModal(), which makes it match the :modal pseudo-class
  if (element.tagName === "DIALOG" && (element as HTMLDialogElement).open && element.matches(":modal")) {
    return true;
  }
  // Heuristic: large fixed/absolute overlay covering >70% of the viewport in both dimensions
  const style = window.getComputedStyle(element);
  if (style.position === "fixed" || style.position === "absolute") {
    const rect = element.getBoundingClientRect();
    if (rect.width > window.innerWidth * 0.7 && rect.height > window.innerHeight * 0.7) {
      // ...and has a high z-index OR is the last child of <body> (portal heuristic)
      const zIndex = parseInt(style.zIndex, 10); // NaN for "auto", which fails the comparison below
      const isBodyPortal = element.parentElement === document.body && document.body.lastElementChild === element;
      if (zIndex > 1000 || isBodyPortal) return true;
    }
  }
  return false;
}
When a modal-like element is detected during tree generation:
- Mark the modal's subtree normally (refs assigned).
- Mark non-modal subtrees with an [obscured] property OR drop their refs entirely (so the LLM can't target them).
- Add a note at the top of the YAML output:
# A modal is open. Only modal elements have refs.
The [obscured] approach is preferred because it preserves the structural context — the model sees that other elements exist but understands they're not interactable right now.
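The marking step can be sketched as a pure function over an already-built tree. This is a minimal sketch, not Pilo's implementation: SketchNode, isModal, applyModalOcclusion, and the dropRefs flag are hypothetical names, and the real AriaNode shape in types.ts may differ. Ancestors of the modal are left unmarked so the path to it stays intact.

```typescript
// Hypothetical, simplified node shape; the real AriaNode may differ.
interface SketchNode {
  role: string;
  name?: string;
  ref?: string;
  isModal?: boolean; // assumed to be set during the tree walk by isModalLike
  obscured?: boolean;
  children: SketchNode[];
}

// True if the subtree rooted at `node` is, or contains, a modal.
function containsModal(node: SketchNode): boolean {
  return node.isModal === true || node.children.some(containsModal);
}

// Copies a subtree, marking every node obscured and optionally removing refs.
function markObscured(node: SketchNode, dropRefs: boolean): SketchNode {
  const copy: SketchNode = {
    ...node,
    obscured: true,
    children: node.children.map((child) => markObscured(child, dropRefs)),
  };
  if (dropRefs) delete copy.ref;
  return copy;
}

// Returns a new tree in which everything outside the modal subtree is obscured.
// If no modal is present, the tree is returned unchanged.
function applyModalOcclusion(root: SketchNode, dropRefs: boolean): SketchNode {
  if (!containsModal(root)) return root;
  const visit = (node: SketchNode): SketchNode => {
    if (node.isModal) return node; // modal subtree stays fully interactable
    if (containsModal(node)) {
      // Ancestor of the modal: keep it unmarked, recurse into children.
      return { ...node, children: node.children.map(visit) };
    }
    return markObscured(node, dropRefs);
  };
  return visit(root);
}
```

Passing dropRefs = false corresponds to the [obscured]-only variant; dropRefs = true corresponds to removing refs so the model cannot target the underlying page.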
B. Viewport / scroll context in the per-step user message
Extend getTreeWithRefs (or the snapshot wrapper in webAgent.ts:779-868) to also return:
{
  yaml: string;
  viewport: {
    scrollY: number;
    docHeight: number;
    viewportHeight: number;
    pagesAbove: number; // floor(scrollY / viewportHeight)
    pagesBelow: number; // max(0, floor((docHeight - scrollY - viewportHeight) / viewportHeight))
    atTop: boolean;
    atBottom: boolean;
  };
}
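A minimal sketch of how these fields could be derived from three raw measurements. The function name computeViewportMetrics and the tolerancePx parameter are assumptions for illustration, not existing Pilo code; the clamping guards against pages shorter than the viewport, where the raw subtraction would go negative.

```typescript
interface ViewportMetrics {
  scrollY: number;
  docHeight: number;
  viewportHeight: number;
  pagesAbove: number;
  pagesBelow: number;
  atTop: boolean;
  atBottom: boolean;
}

// Derives page-position fields from raw scroll measurements.
// tolerancePx (hypothetical) absorbs sub-pixel scroll offsets.
function computeViewportMetrics(
  scrollY: number,
  docHeight: number,
  viewportHeight: number,
  tolerancePx = 1,
): ViewportMetrics {
  const safeViewport = Math.max(1, viewportHeight); // avoid division by zero
  const above = Math.max(0, scrollY);
  const remaining = Math.max(0, docHeight - scrollY - viewportHeight);
  return {
    scrollY,
    docHeight,
    viewportHeight,
    pagesAbove: Math.floor(above / safeViewport),
    pagesBelow: Math.floor(remaining / safeViewport),
    atTop: above <= tolerancePx,
    atBottom: remaining <= tolerancePx,
  };
}
```

In the page context, the inputs would plausibly come from window.scrollY, document.documentElement.scrollHeight, and window.innerHeight; which source Pilo should use for document height is left open here.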
In the page-snapshot prompt template (prompts.ts:475-500), insert before the tree:
Page position: {pagesAbove} viewport(s) above, {pagesBelow} viewport(s) below.
{% if atTop %}You are at the top of the page.{% endif %}
{% if atBottom %}You are at the bottom of the page.{% endif %}
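For illustration, the same template logic expressed as a plain function. This is a sketch under the assumption that the prompt is assembled in TypeScript; formatPagePosition and PagePosition are hypothetical names, not existing exports of prompts.ts.

```typescript
interface PagePosition {
  pagesAbove: number;
  pagesBelow: number;
  atTop: boolean;
  atBottom: boolean;
}

// Renders the page-position lines that would precede the tree in the prompt.
function formatPagePosition(p: PagePosition): string {
  const lines = [
    `Page position: ${p.pagesAbove} viewport(s) above, ${p.pagesBelow} viewport(s) below.`,
  ];
  if (p.atTop) lines.push("You are at the top of the page.");
  if (p.atBottom) lines.push("You are at the bottom of the page.");
  return lines.join("\n");
}
```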
C. Optional: surface offscreen interactive elements as hints
When some interactive elements (buttons, inputs, links) are below the viewport, surface a short hint:
Page position: 0 viewports above, 5 viewports below.
There are interactive elements below the viewport not shown in detail. Scroll down to see them.
Initially do not list specifics — just signal their presence. A future refinement could surface accessible names of below-fold interactive elements with [offscreen] markers, but that grows the prompt and is not always useful.
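A minimal sketch of the presence-only hint. It assumes each element has been summarized with its role and its viewport-relative top edge (as getBoundingClientRect() reports it); RectSummary, INTERACTIVE_ROLES, and offscreenHint are hypothetical names, and the role list is an illustrative subset rather than a definitive one.

```typescript
// Hypothetical summary of one element: ARIA role plus viewport-relative top edge.
interface RectSummary {
  role: string;
  top: number;
}

// Illustrative subset of interactive roles; the real list would need review.
const INTERACTIVE_ROLES = new Set(["button", "link", "textbox", "checkbox", "combobox"]);

// Returns the hint line if any interactive element starts below the viewport,
// otherwise null. Specific elements are deliberately not listed.
function offscreenHint(items: RectSummary[], viewportHeight: number): string | null {
  const hasBelow = items.some(
    (item) => INTERACTIVE_ROLES.has(item.role) && item.top >= viewportHeight,
  );
  return hasBelow
    ? "There are interactive elements below the viewport not shown in detail. Scroll down to see them."
    : null;
}
```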
Implementation notes
- The modal detection runs in-page (same context as ariaTree). The bundle in bundle.ts needs to be regenerated after adding isModalLike to ariaSnapshot.ts; that happens via scripts/bundle-aria-tree.ts at build time.
- The heuristic-based modal detection (fixed/absolute + large + high z-index) is fragile. ARIA-based detection (role="dialog" + aria-modal="true") is reliable when present. Most modern UI libraries set these correctly; some legacy/custom-built modals don't. Combine both.
- Some pages have "soft modals" (a panel that takes focus but doesn't strictly block the rest of the page). The agent should still be able to interact with the rest. Conservative rule: only trigger occlusion behavior on aria-modal="true" or a <dialog> that matches :modal. Heuristic-based detection emits a comment but does NOT mark non-modal elements obscured (just a warning the model can use).
- Viewport metrics computation is cheap. The data flows through getTreeWithRefs → addPageSnapshot → buildPageSnapshotPrompt.
- Testing modal occlusion: pick 2-3 real sites with modals (a cookie banner, a sign-in modal, a confirmation dialog). Verify the snapshot reflects the modal-only state when one is open.
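The conservative rule from the notes above can be sketched as a small decision table. ModalDetection, OcclusionPlan, and planOcclusion are hypothetical names, and the wording of the heuristic warning comment is an assumption; only the ARIA-case comment text comes from this proposal.

```typescript
// How the modal was detected, if at all (hypothetical classification).
type ModalDetection = "aria" | "heuristic" | "none";

interface OcclusionPlan {
  markObscured: boolean; // whether non-modal subtrees get [obscured]
  headerComment: string | null; // comment line placed at the top of the YAML
}

// Maps a detection kind to the snapshot behavior described in the notes.
function planOcclusion(detection: ModalDetection): OcclusionPlan {
  switch (detection) {
    case "aria":
      // Reliable signal (aria-modal="true" or a <dialog> matching :modal).
      return {
        markObscured: true,
        headerComment: "# A modal is open. Only modal elements have refs.",
      };
    case "heuristic":
      // Fragile signal: warn only. The comment wording here is an assumption.
      return {
        markObscured: false,
        headerComment: "# A modal-like overlay may be open.",
      };
    case "none":
      return { markObscured: false, headerComment: null };
  }
}
```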
Acceptance criteria
- ariaTree detects ARIA modals (role="dialog" + aria-modal="true") and native modal dialogs (<dialog> matching :modal); their non-modal siblings carry [obscured] markers OR are dropped (decide based on benchmark).
- Heuristic-detected modals emit a comment at the top of the snapshot (non-modal elements keep their refs, but the model has the signal).
- Per-step snapshot prompt includes scroll position context (pagesAbove, pagesBelow, atTop, atBottom).
- Tests in packages/core/test/ cover: ARIA modal occlusion, heuristic modal warning, viewport metrics on scrolled and unscrolled pages.
- Manual smoke test: agent on a page with an open modal correctly interacts only with the modal.
Effort estimate
2-3 days. Modal detection is the larger piece; viewport context is a few hours.
Related issues
Pairs with the scroll-action issue (scroll context becomes actionable when the agent has a scroll tool). Independent of the others.
Files likely affected
packages/core/src/browser/ariaTree/ariaSnapshot.ts (modal detection during tree walk)
packages/core/src/browser/ariaTree/types.ts (extended AriaNode props)
packages/core/src/browser/ariaBrowser.ts (return type for getTreeWithRefs)
packages/core/src/browser/playwrightBrowser.ts (viewport metrics return)
packages/core/src/webAgent.ts (addPageSnapshot)
packages/core/src/prompts.ts (page-snapshot template)
packages/core/test/