Commit 4e8436d

author
DavertMik
committed
Merge branch 'fix/more-tests-res-improv'
# Conflicts:
#   src/ai/researcher.ts
2 parents 503797b + 33f732d commit 4e8436d

26 files changed

Lines changed: 1095 additions & 134 deletions

CHANGELOG.md

Lines changed: 26 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,31 @@
11
# Changelog
22

3+
## 2026-04-17
4+
5+
### Configuration
6+
- **`ai.agents.researcher.focusSections`** — List of CSS selectors that narrow research to a specific element when present on the page. If any selector matches, the researcher maps only that element instead of the whole page — useful for apps that open a focused panel (modal, drawer, detail view) on top of the main layout.
7+
```javascript
ai: {
  agents: {
    researcher: {
      focusSections: ['[role="dialog"]', '.drawer-open', '#focused-panel'],
    },
  },
}
```
16+
17+
### Changes
18+
- [Tester] Detects modals and dialogs that appear mid-test and extends the page UI map with their controls — including overlays that don't expose `role="dialog"` (a "Close X" button is enough to recognize them), so the next tool call has selectors for the overlay.
19+
- [Researcher] New overlay analysis appends a section for each newly opened dialog/modal under the page's "Extended Research" heading and caches the result, so revisiting the same page skips the work.
20+
- ExploreCommand: The "Generated:" hints printed at the end of an explore session now list only the test files written during this run, not every file already sitting in `output/tests/`.
21+
- [Researcher] When the model's response gets truncated by context limits, the researcher now retries by splitting research into one request per section (focus, main, sidebar, etc.) and merging the results, instead of a single focused-retry prompt.
22+
- [Researcher] Honors the new `focusSections` config — if any configured CSS selector is present on the page, the researcher limits its UI map to that element rather than the full page.
23+
- [Tester] Past experience is no longer inlined into every tester turn. Instead, a compact table of contents (file tags plus section headings) is injected, and the agent fetches specific sections on demand via the new `learn_experience` tool. This cuts tester token usage on pages with accumulated experience.
24+
- [Pilot] Receives the same experience table of contents when tools are enabled and can pull full sections via `learn_experience`.
25+
- [Captain] The interactive web mode now exposes the `learn_experience` tool alongside `see`, `context`, and `visualClick`, so TUI-driven sessions can read past experience on demand.
26+
- [Planner] Rewrote the `normal`, `curious`, and `psycho` planning styles to rank scenarios by outcome strength: **data change > state change > UI-only**. Normal style now asks for complete commit flows over "form appears" checks, curious style treats an untested control as covered only when the scenario built around it reaches a data or state change (and refuses to merge a variation with a dismissal ending), and psycho style now attacks every reachable control in the same scenario with a different strange value instead of isolating one control per scenario.
27+
- Experience Tracker: New `getExperienceTableOfContents` / `getExperienceSection` API backs the TOC-based experience flow; sections are addressed by a short file tag (A, B, ...) and a 1-based section index.
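
The addressing scheme in the last entry (a file tag plus a 1-based section index) can be illustrated with a minimal sketch. The data shape and the helper names `buildToc` and `getSection` are hypothetical and are not the signatures of `getExperienceTableOfContents` / `getExperienceSection`; the sketch also assumes at most 26 files so that single-letter tags suffice.

```javascript
// Hypothetical data shape: each experience file has a name and a list of sections.
const files = [
  { name: 'login.md', sections: [{ heading: 'Sign in', body: 'Use the email field first.' }] },
  {
    name: 'users.md',
    sections: [
      { heading: 'Create user', body: 'Role select is required.' },
      { heading: 'Delete user', body: 'Confirm in the modal.' },
    ],
  },
];

// File tags are A, B, ... in file order (assumption: no more than 26 files).
const tagFor = (fileIndex) => String.fromCharCode(65 + fileIndex);

// Build the compact table of contents: one tag line per file, then numbered headings.
function buildToc(experienceFiles) {
  return experienceFiles
    .map((file, i) => {
      const headings = file.sections.map((s, j) => `  ${j + 1}. ${s.heading}`);
      return [`${tagFor(i)}: ${file.name}`, ...headings].join('\n');
    })
    .join('\n');
}

// Fetch one section body by file tag and 1-based section index; null if not found.
function getSection(experienceFiles, tag, sectionIndex) {
  const file = experienceFiles[tag.charCodeAt(0) - 65];
  const section = file && file.sections[sectionIndex - 1];
  return section ? section.body : null;
}

console.log(buildToc(files));
console.log(getSection(files, 'B', 2));
```

Only the table of contents would be sent with every turn; a section body is looked up when the agent asks for it by tag and index.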
28+
329
## 2026-04-15
430

531
### CLI Changes

docs/agents.md

Lines changed: 1 addition & 1 deletion
@@ -169,7 +169,7 @@ Agents share context through:
169169

170170
1. **State Manager** — Tracks current page, URL, navigation history
171171
2. **Research Results** — Structured page analysis available to Planner and Tester
172-
3. **Experience Files** — Learned patterns shared across sessions
172+
3. **Experience Files** — Learned patterns shared across sessions. Injected as a compact table of contents (file tags + section headings) rather than full bodies; agents pull individual sections on demand via the `learn_experience` tool.
173173
4. **Knowledge Files** — Domain knowledge you provide
174174

175175
Each agent maintains minimal context to keep costs down. They request specific information when needed rather than carrying full conversation history.

docs/configuration.md

Lines changed: 1 addition & 0 deletions
@@ -211,6 +211,7 @@ The researcher agent supports all standard agent options plus additional options
211211
|--------|------|-------------|
212212
| `excludeSelectors` | `string[]` | CSS selectors for containers to exclude |
213213
| `includeSelectors` | `string[]` | CSS selectors for containers to always explore |
214+
| `focusSections` | `string[]` | CSS selectors that narrow research to a matching element when present (e.g. an open modal or drawer). First match wins. |
214215
| `stopWords` | `string[]` | Words to filter out (replaces defaults if provided) |
215216
| `maxElementsToExplore` | `number` | Maximum elements to explore per page (default: 10) |
216217

docs/planner.md

Lines changed: 11 additions & 3 deletions
@@ -45,11 +45,19 @@ Each time the Planner generates scenarios, it applies a **style** — a testing
4545

4646
### Built-in Styles
4747

48+
All three built-in styles rank scenarios by **outcome strength**, from strongest to weakest:
49+
50+
1. **Data change** — a record is created, edited, deleted; a setting is persisted; a message is sent; a job is triggered.
51+
2. **State change** — a route change, a filter or sort actually applied to real data, a mode or auth change the app remembers.
52+
3. **UI-only change** — something opens, closes, is cancelled, is hovered, is toggled for display. The application never registers anything new.
53+
54+
Scenarios ending in category 1 or 2 are preferred. Category 3 is only proposed when the UI-only behaviour itself has a verifiable side effect (a warning prompt, a persisted draft, a badge appearing).
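
As an illustration only, the ordering above can be expressed as a sort key. The `outcome` field and the `rankByOutcome` helper are hypothetical; the Planner applies this ranking through its prompt, not through code like this.

```javascript
// Lower number = stronger outcome, following the three categories above.
const OUTCOME_RANK = { data: 1, state: 2, ui: 3 };

function rankByOutcome(scenarios) {
  // Copy before sorting so the input array is left untouched.
  return [...scenarios].sort((a, b) => OUTCOME_RANK[a.outcome] - OUTCOME_RANK[b.outcome]);
}

const ranked = rankByOutcome([
  { title: 'Open and close the filter panel', outcome: 'ui' },
  { title: 'Create a new user', outcome: 'data' },
  { title: 'Sort the list by date', outcome: 'state' },
]);

console.log(ranked.map((s) => s.title));
```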
55+
4856
| Style | Focus | What it generates |
4957
|-------|-------|-------------------|
50-
| **normal** | Complete user workflows | CRUD operations, form submissions, filter+verify flows. Each test changes application state. Distributes tests across all feature areas. |
51-
| **psycho** | Invalid and extreme inputs | Empty submissions, 10000-character strings, special characters, SQL injection, wrong formats, boundary values, incompatible combinations. Finds what breaks. |
52-
| **curious** | Coverage gaps | Cross-references previous test results with page research to find untested controls. Exercises every select option, checkbox state, and skipped form field. Fills gaps, not repeats. |
58+
| **normal** | Complete user workflows | CRUD operations, full commit flows, filter+verify flows — each test ends in a data change or state change. UI-only tests (tab switching, pagination, view toggles) come last and only when data- and state-changing coverage is done. Distributes tests across feature areas. |
59+
| **psycho** | Invalid and extreme inputs | Attacks **every reachable control in the same scenario** with a different strange value — empty, 10000 chars, unicode, SQL, script tags, invalid formats, conflicting toggles, out-of-range dates — then commits. Scenarios that enter bad data and cancel are rejected: the application never received the payload. |
60+
| **curious** | Coverage gaps | Cross-references previous test results with page research to find untested controls. An untested control is only considered covered when the scenario built around it reaches a data or state change. Variation scenarios and dismissal/UI-only scenarios are kept separate — the planner will not merge them by appending a cancel at the end. |
5361

5462
### How Cycling Works
5563

docs/researcher.md

Lines changed: 30 additions & 0 deletions
@@ -43,6 +43,7 @@ ai: {
4343
| `model` | `string` | - | Override default model for Researcher |
4444
| `systemPrompt` | `string` | - | Additional instructions appended to the research prompt |
4545
| `sections` | `string[]` | all sections | Page sections to identify (order = priority) |
46+
| `focusSections` | `string[]` | `[]` | CSS selectors that narrow research to a matching element when present (first match wins). Useful for apps that open a modal, drawer, or detail panel on top of the main layout — the researcher will map only that element instead of the whole page. |
4647
| `excludeSelectors` | `string[]` | `[]` | CSS selectors to exclude from deep exploration |
4748
| `includeSelectors` | `string[]` | `[]` | CSS selectors to always explore (second pass) |
4849
| `stopWords` | `string[]` | defaults | Words to filter during deep exploration (replaces defaults) |
@@ -366,6 +367,35 @@ ai: {
366367
}
367368
```
368369

370+
### Focus on a Single Element
371+
372+
When your app opens a modal, drawer, or detail panel on top of the main layout, you usually want the researcher to map only that overlay, not the page behind it. `focusSections` is a list of CSS selectors — the first one that matches on the current page wins, and the researcher limits its UI map to that element:
373+
374+
```javascript
ai: {
  agents: {
    researcher: {
      focusSections: [
        '[role="dialog"]', // open modal
        '.drawer-open', // expanded side drawer
        '#focused-panel', // your app's detail panel
      ],
    },
  },
}
```
387+
388+
When none of the selectors match, the researcher falls back to mapping the whole page as usual.
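
The first-match-wins rule and the whole-page fallback can be sketched as follows. This is not the researcher's implementation: `pickFocusSelector` is a hypothetical helper, and `hasMatch` stands in for whatever page query the researcher actually runs.

```javascript
// Return the first configured selector that is present on the page, or null.
function pickFocusSelector(focusSections, hasMatch) {
  for (const selector of focusSections) {
    if (hasMatch(selector)) return selector; // first match wins
  }
  return null; // nothing matched: the caller maps the whole page
}

// Fake page that contains an open drawer and a detail panel, but no dialog.
const present = new Set(['.drawer-open', '#focused-panel']);

const chosen = pickFocusSelector(
  ['[role="dialog"]', '.drawer-open', '#focused-panel'],
  (selector) => present.has(selector),
);

console.log(chosen);
```

Because order decides the winner, it is worth listing the most specific overlay selectors first.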
389+
390+
### Handling Truncated Responses
391+
392+
The researcher produces a lot of output for busy pages. If the model's response gets cut off at `maxTokens`, Explorbot automatically retries by splitting the work into one request per section (focus, main, sidebar, etc.) and merging the results. The retry needs no configuration; you will usually only notice it in the logs.
393+
394+
If you see this happening often, consider:
395+
- lowering reasoning effort (see [Low Reasoning Effort](#low-reasoning-effort) below),
396+
- pinning the researcher to a non-reasoning model with a larger output window,
397+
- or narrowing the scope with `focusSections`.
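
The split-and-merge retry described in this section might look roughly like the sketch below. It rests on assumptions: `askModel` is a hypothetical function that returns `{ text, truncated }`, and merging is shown as simple concatenation in section order; Explorbot's actual logic may differ.

```javascript
// Try one request for all sections; on truncation, fall back to one request per section.
async function researchWithSplitRetry(sections, askModel) {
  const whole = await askModel(sections);
  if (!whole.truncated) return whole.text;

  // The response was cut off: ask once per section and merge in the original order.
  const parts = [];
  for (const section of sections) {
    const part = await askModel([section]);
    parts.push(part.text);
  }
  return parts.join('\n\n');
}

// Fake model that truncates any request covering more than one section.
const fakeModel = async (requested) => ({
  text: requested.map((name) => `## ${name}`).join('\n'),
  truncated: requested.length > 1,
});

researchWithSplitRetry(['focus', 'main', 'sidebar'], fakeModel).then((merged) => {
  console.log(merged);
});
```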
398+
369399
### Custom Stop Words
370400

371401
```javascript

docs/test-plans.md

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
1+
# Test Plans
2+
3+
A test plan is a markdown file containing a suite of scenarios for the Tester to execute. Explorbot generates plans via the [Planner](./planner.md), but the format is plain markdown — you can hand-write plans, edit generated ones, or check them into version control.
4+
5+
Plans are saved to `output/plans/` by default and loaded by the same parser whether they were generated or authored manually.
6+
7+
The format is a dialect of the [Testomat.io classical markdown format](https://docs.testomat.io/project/import-export/export-tests/classical-tests-markdown-format/), extended with a `### Prerequisite` block so Explorbot knows which page to open before executing each test.
8+
9+
## Format
10+
11+
```markdown
<!-- suite -->
# Plan Title

### Prerequisite

* URL: /relative-path

<!-- test
priority: critical
-->
# Scenario written as a user-facing sentence

## Steps
* First step in plain language
* Second step

## Expected
* First expected outcome
* Second expected outcome
```
32+
33+
A single file may contain multiple suites — each begins with its own `<!-- suite -->` marker and parses as an independent plan.
34+
35+
## Elements
36+
37+
### `<!-- suite -->`
38+
39+
Marks the start of a plan. The `#` heading on the following line becomes the plan's title. Follows the Testomat.io convention of HTML-comment metadata blocks.
40+
41+
### `### Prerequisite`
42+
43+
Holds the suite-level URL as a single bullet:
44+
45+
```
* URL: /relative-path
```
48+
49+
**The URL is required — without it, tests in the suite will not be executed.** It must be **relative** to the configured base URL (start with `/`), so the same plan runs against staging, production, or a local dev server without edits.
50+
51+
Every test in the suite inherits this URL as its start page. Explorbot navigates to it before running each scenario.
52+
53+
### `<!-- test priority: … -->`
54+
55+
Opens a test block. Valid priorities: `critical`, `important`, `high`, `normal`, `low`. Defaults to `normal` if omitted. See [Test Priorities](./planner.md#test-priorities) for what each level means.
56+
57+
### `#` Scenario heading
58+
59+
A single `#` heading inside a test block is the scenario description. Write it as a business outcome, not a click path.
60+
61+
### `## Steps`
62+
63+
Bulleted list (`* `) of planned actions in plain language. The Tester reads these as guidance, not a strict script — it may adapt them to what it actually sees on the page. A step may span multiple lines by indenting continuation lines with 2 spaces.
64+
65+
Unlike the Testomat.io classical format — which inlines `*Expected*:` inside each step — Explorbot separates actions and outcomes into distinct `## Steps` and `## Expected` sections.
66+
67+
### `## Expected`
68+
69+
Bulleted list (`* `) of expected outcomes. Each outcome should describe a **verifiable change** — a data change, state change, or a UI change with a side effect. See the Planner's [outcome-strength guidance](./planner.md#built-in-styles) for what counts.
70+
71+
The Tester marks a test as passing only when every expected outcome has been verified.
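
To make the elements above concrete, here is a minimal parser sketch for a single suite. It is not Explorbot's parser: `parsePlan` is a hypothetical name, multi-line steps and multiple suites are not handled, and throwing on a missing or absolute URL is a choice made for this sketch only.

```javascript
function parsePlan(markdown) {
  const plan = { title: null, url: null, tests: [] };
  let test = null; // test block currently being filled
  let list = null; // array that "* " bullets are appended to
  let inMeta = false; // true while inside a <!-- test ... --> comment

  for (const raw of markdown.split('\n')) {
    const line = raw.trim();
    if (line.startsWith('<!-- test')) {
      // A new test block starts; priority defaults to "normal".
      test = { priority: 'normal', title: null, steps: [], expected: [] };
      plan.tests.push(test);
      list = null;
      inMeta = true;
    }
    if (inMeta) {
      const priority = line.match(/priority:\s*(\w+)/);
      if (priority) test.priority = priority[1];
      if (line.endsWith('-->')) inMeta = false;
      continue;
    }
    if (line.startsWith('# ')) {
      // The first "#" heading is the plan title; later ones are scenario titles.
      if (test) test.title = line.slice(2);
      else plan.title = line.slice(2);
    } else if (line === '## Steps') {
      list = test ? test.steps : null;
    } else if (line === '## Expected') {
      list = test ? test.expected : null;
    } else if (!test && line.startsWith('* URL:')) {
      plan.url = line.slice('* URL:'.length).trim();
    } else if (list && line.startsWith('* ')) {
      list.push(line.slice(2));
    }
  }
  if (!plan.url || !plan.url.startsWith('/')) {
    throw new Error('Prerequisite URL is required and must be relative');
  }
  return plan;
}

const sample = [
  '<!-- suite -->',
  '# User management',
  '',
  '### Prerequisite',
  '',
  '* URL: /users',
  '',
  '<!-- test',
  'priority: critical',
  '-->',
  '# Admin creates a user',
  '',
  '## Steps',
  '* Open the new user form',
  '* Fill in all fields and save',
  '',
  '## Expected',
  '* The new user appears in the list',
].join('\n');

const plan = parsePlan(sample);
console.log(plan.url, plan.tests[0].priority, plan.tests[0].steps.length);
```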
72+
73+
## See Also
74+
75+
- [Planner](./planner.md) — how plans are generated
76+
- [Commands](./commands.md): `/plan`, `/explore`, `explorbot plan`
77+
- [Rerun](./rerun.md) — re-executing generated tests

rules/planner/styles/curious.md

Lines changed: 18 additions & 5 deletions
@@ -1,5 +1,12 @@
11
Detect new valid paths that previous tests missed. Prioritize mining experience and research together before inventing abstract scenarios.
22

3+
Rank every scenario you build by the **strength of its outcome**, from strongest to weakest:
4+
1. **Data change** — the backend, storage, or persisted state registers a difference (a record is created, edited, or deleted; a setting is persisted; a message is sent; a job is triggered; an item is shared or exported).
5+
2. **State change** — the application moves to a different addressable or remembered state (route or URL change, a filter or sort actually applied to real data, a mode or auth change that the application remembers, the page showing a different underlying dataset).
6+
3. **UI change only** — a control opens, closes, is cancelled, is dismissed, is hovered, is toggled for display only, or the view expands/collapses without the application registering anything new.
7+
8+
Prefer scenarios whose ending falls into category 1. Propose a category 2 scenario when no category 1 outcome is reachable for the control under test. Propose a category 3 scenario last, and only when the UI-only behaviour itself has a verifiable side effect worth checking (a warning prompt, a persisted draft, a state rollback, a badge appearing). A page may expose several paths that reach a data or state change — different buttons, different menus, different keyboard shortcuts, different confirmation flows. Pick whichever path reaches category 1 or 2; do not assume a single "primary action" exists.
9+
310
When <previously_tested_flows> is present, treat it as the ground truth for what already worked:
411
- List items under Successful Flow describe the path that was executed
512
- Lines in blockquotes (lines starting with >) are discoveries: extra fields, side panels, conditional UI, inputs called out during that run
@@ -11,7 +18,7 @@ When <previously_tested_flows> is NOT present, use <tested_scenarios> as the gro
1118
Read the step lines for each test to understand which controls were actually interacted with.
1219
Identify elements from <page_research> that appear in NO test steps — these are coverage gaps.
1320

14-
Cross-read with <page_research>: for each form and Extended Research subsection, compare against those flows. Which text inputs, selects, checkboxes, toggles, and side controls were skipped or touched once with a single value? Prefer filling those gaps over repeating the same path.
21+
Cross-read with <page_research>: for each section and Extended Research subsection, compare against those flows. Which text inputs, selects, checkboxes, toggles, and side controls were skipped or touched once with a single value? Prefer filling those gaps over repeating the same path.
1522

1623
The Type column in <page_research> tables shows the ARIA role of each element.
1724
Cross-reference these types with the steps listed in <tested_scenarios> or <previously_tested_flows>:
@@ -24,16 +31,22 @@ Coverage gaps to look for:
2431
- Action buttons that were never clicked as part of a complete workflow
2532
- Dependent UI: controls that appear or change based on another control's value
2633

27-
When proposing tests for forms, prefer filling ALL visible fields — not just required ones.
34+
A coverage gap for an untested control is only **closed** when the scenario built around it reaches a data change or state change. A scenario that exercises the untested control but ends in a UI-only outcome does not close the gap — the application never registered the variation, so nothing distinguishes that scenario from not running it at all.
35+
36+
Exercising an untested control and testing a UI-only dismissal (cancel, close, navigate away, discard) are **two different categories of scenario**. Do not merge them by appending a dismissal ending to a variation scenario — the variation loses its value because the system never receives it. A dismissal or UI-only ending deserves its own dedicated scenario only when that dismissal itself has a verifiable side effect.
37+
38+
When multiple inputs or configurable controls contribute to the same outcome, prefer scenarios that configure **several of them together** before triggering the data or state change, rather than touching one control in isolation and ending there.
2839
Vary input strategies: try short values, multi-word values, edge-of-valid values.
29-
When a form has sections, tabs, or conditional panels, propose tests that exercise each section.
30-
If a control has downstream effects (e.g., selecting a type reveals extra fields), build a test around that interaction chain.
40+
When sections, tabs, or conditional panels exist, exercise each section.
41+
When a control has downstream effects (selecting one option reveals extra fields, toggling one setting enables another), build the scenario around that interaction chain — and still end it in a data or state change.
3142

3243
Combinatorial coverage (valid data only):
3344
- For each select or equivalent, ensure each option is exercised in at least one scenario, or one scenario whose steps walk through distinct options in sequence if that fits the task constraints better
3445
- Exercise each checkbox or binary control in both states when behavior can differ
3546
- Combine checkboxes and related toggles in small sets (pairs or triples) when they plausibly change validation, visible sections, or outcomes — avoid exploding into huge Cartesian products
3647

37-
When heavy forms are not the focus, still pursue: unvisited state transitions, follow-ups after creates (share, export, duplicate), alternative routes to the same goal, preconditions that unlock UI, and visible controls never clicked.
48+
Each proposed combination must be exercised in a scenario that reaches a data change or state change. Combinations that only change the UI and never reach a registerable outcome do not count as coverage — the system never distinguishes them from each other.
49+
50+
When the page is not heavy on inputs, still pursue: unvisited state transitions, follow-ups after data-changing operations (share, export, duplicate, re-open), alternative paths to the same data change, preconditions that unlock new data-changing actions, and visible controls never clicked. Again, prioritise scenarios whose ending falls into category 1 or 2.
3851

3952
Skip the Menu/Navigation section — we are testing THIS page.
