Commit 043bd3f

docs(agent): add config authoring workflow guide (#305)
1 parent 2d64c47 commit 043bd3f

1 file changed

Lines changed: 213 additions & 0 deletions

AGENTS.md

@@ -0,0 +1,213 @@
# html2rss-configs Agent Guide

This repo owns the curated YAML config set for `html2rss`.

Primary goal: add or repair configs that are stable, shippable, and easy to verify. Prefer a narrow, clean surface over a broad noisy one.

## Scope

- Source of truth here: `lib/html2rss/configs/`.
- Do not hand-edit generated schema output.
- Keep config work separate from downstream docs, web, or example changes unless the task explicitly includes them.

## Defaults

- Use the registrable domain folder, not a subdomain folder, unless there is a strong existing reason.
- Start from the cleanest article list the site offers, not the marketing homepage by default.
- Prefer stable list/detail extraction over extracting every possible field.
- If the site only becomes reliable on a narrower path, use that narrower path.
- Omit brittle fields. If dates or descriptions are low quality, leave them out.
- Set `enhance: false` when enhancement pulls in page chrome, duplicate cards, or unrelated links.

## Surface Selection

Prefer these surfaces first:

- dedicated newsroom or blog archive pages
- category pages with one repeated card structure
- stable subpaths like `/blog/latest` or `/blog/everything/`

Avoid these unless they are the only workable option:

- homepages with hero content mixed with promos
- pages that combine multiple unrelated card systems
- infinite-scroll surfaces unless Browserless is already clearly required
- localized or geo-redirecting entry pages when a stable non-localized path exists

## Selector Strategy

Start with the smallest useful selector set:

- `items`
- `title`
- `url`

Add fields only when they are clean:

- `description`
- `published_at`
- `author`
- `categories`

Useful patterns:

- Prefer the repeated article card itself as `items`, especially when it is a single anchor.
- Anchor on article URLs or stable path fragments instead of generic headings.
- Keep selectors item-local when possible.
- Do not add complexity to recover weak optional fields.
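
The smallest useful selector set above could look like the following when written to disk. This is a sketch under assumptions, not a config from this repo: `example.com`, the `/blog/latest` path, the CSS selectors, and the file name `blog.yml` are hypothetical, and the key names (`channel.url`, `selectors.items.selector`, `extractor`) reflect a general understanding of the html2rss config format that should be confirmed with `html2rss validate`. The schema modeline is left as a placeholder comment because its exact form is not stated in this guide.

```bash
#!/usr/bin/env bash

# Hypothetical scratch location; in this repo the file would live under
# lib/html2rss/configs/<registrable-domain>/.
config_dir="$(mktemp -d)/example.com"
mkdir -p "$config_dir"
config_file="$config_dir/blog.yml"

# Quoted heredoc delimiter so nothing inside the YAML is expanded by the shell.
cat > "$config_file" <<'YAML'
# (schema modeline goes here; copy it from an existing config in this repo)
channel:
  url: https://example.com/blog/latest   # assumed clean article list
selectors:
  items:
    selector: "article.post-card"        # hypothetical repeated card boundary
  title:
    selector: "h2"                       # item-local, relative to each card
  url:
    selector: "a"
    extractor: "href"                    # take the link target, not its text
YAML

echo "wrote $config_file"
```

Optional fields such as `description` or `published_at` are left out on purpose; they would be added only after the three core selectors produce a clean feed.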

## Chrome MCP

Use Chrome MCP when the static HTML is unclear, the page is hydrated, or Faraday fetch returns zero items while the browser shows a valid list.

Recommended sequence:

1. Open the target URL.
2. Take an accessibility snapshot.
3. Identify the exact repeated item boundary.
4. Confirm the title and URL live inside that boundary.
5. Record the final URL if the page redirects by locale or renders a different surface than expected.

## Browserless

Use Browserless when:

- the page is JS-rendered
- Faraday fetch returns zero items but Chrome shows a valid repeated list
- the site is bot-sensitive enough that static fetch is unreliable

Local Browserless notes:

- `html2rss-web` exposes a local endpoint at `ws://127.0.0.1:4002`
- Browserless fetch tests require `BROWSERLESS_IO_WEBSOCKET_URL`
- custom websocket endpoints also require `BROWSERLESS_IO_API_TOKEN`

Do not default the whole repo to Browserless. Use it only for configs that need it.
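
The environment rules in the notes above can be checked before a Browserless fetch run with a small guard. The function name `browserless_env_ok` is hypothetical, and the sketch assumes that any endpoint other than the local `ws://127.0.0.1:4002` counts as custom and that the local endpoint works without a token; the notes imply this but do not state it outright.

```bash
#!/usr/bin/env bash

# Local endpoint named in the notes above.
LOCAL_BROWSERLESS_URL="ws://127.0.0.1:4002"

# Hypothetical helper: returns 0 when the Browserless environment looks usable,
# 1 otherwise (with a reason on stderr).
browserless_env_ok() {
  if [ -z "${BROWSERLESS_IO_WEBSOCKET_URL:-}" ]; then
    echo "BROWSERLESS_IO_WEBSOCKET_URL is not set" >&2
    return 1
  fi
  # Assumption: anything other than the local endpoint is a custom endpoint.
  if [ "$BROWSERLESS_IO_WEBSOCKET_URL" != "$LOCAL_BROWSERLESS_URL" ] &&
     [ -z "${BROWSERLESS_IO_API_TOKEN:-}" ]; then
    echo "custom endpoint requires BROWSERLESS_IO_API_TOKEN" >&2
    return 1
  fi
  return 0
}
```

A typical use would be `browserless_env_ok && bundle exec rspec --tag fetch ...`, so a missing variable fails fast instead of surfacing as a confusing fetch error.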

## Command Assumptions

Assume the `html2rss` CLI is available on `PATH` when working against the sibling core repo.

- Use `html2rss ...` in examples and one-off validation commands.
- If the CLI is not installed globally in the current environment, run the equivalent command from the sibling `html2rss/` checkout, typically `bundle exec exe/html2rss ...`.
- In this repo, keep using `make ...` and `bundle exec rspec ...` because those are the implemented entrypoints.
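
The fallback rule above can be wrapped so the same call works in both environments. The names `html2rss_cmd` and `h2r` are hypothetical, and the sketch assumes the sibling checkout sits at `../html2rss` relative to the current directory.

```bash
#!/usr/bin/env bash

# Hypothetical helper: print the command that would run the core CLI.
html2rss_cmd() {
  if command -v html2rss >/dev/null 2>&1; then
    echo "html2rss"
  else
    # Fallback from the notes above; only valid inside the sibling checkout.
    echo "bundle exec exe/html2rss"
  fi
}

# Hypothetical wrapper: always run from the sibling checkout in a subshell,
# so the caller's working directory is left untouched.
h2r() {
  # Intentionally unquoted so "bundle exec exe/html2rss" splits into words.
  (cd ../html2rss && $(html2rss_cmd) "$@")
}
```

Because `h2r` changes directory before running, config paths passed to it should be absolute, for example `h2r validate /abs/path/to/config.yml`.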

## Fast Path

1. Find the cleanest stable candidate URL.
2. Inspect the DOM in Chrome MCP before writing selectors.
3. Create the YAML with the schema modeline and minimal selectors.
4. Validate the single file with the core CLI.
5. Generate a live feed with the core CLI.
6. Tighten selectors until the feed output is clean.
7. Run repo validation and non-fetch tests.
8. Run the appropriate fetch lane:
   - plain fetch for static or Faraday-backed configs
   - Browserless fetch for JS-heavy or Browserless-backed configs

## Quality Gate

For every new or changed config, verify in this order.

1. Single-file runtime validation in the core repo:

```bash
cd ../html2rss
html2rss validate /abs/path/to/config.yml
```

2. Single-file live feed generation in the core repo:

```bash
cd ../html2rss
html2rss feed /abs/path/to/config.yml
```

3. Repo-wide validation in this repo:

```bash
make validate
```

4. Repo non-fetch tests in this repo:

```bash
make test
```

5. Focused fetch verification:

- Faraday-backed candidate:

```bash
bundle exec rspec --tag fetch --example 'example.com/feed.yml' spec/html2rss/configs_dynamic_spec.rb
```

- Browserless-backed candidate:

```bash
BROWSERLESS_IO_WEBSOCKET_URL=ws://127.0.0.1:4002 \
BROWSERLESS_IO_API_TOKEN=... \
bundle exec rspec --tag fetch --example 'example.com/feed.yml' spec/html2rss/configs_dynamic_spec.rb
```

6. If fetch still fails, decide explicitly whether:

- selectors are wrong
- the page needs Browserless
- the chosen surface is too noisy or too dynamic
- the candidate should be downgraded or dropped
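
Steps 1 to 5 of the gate can be chained so the run stops at the first failing step. This is a sketch, not an entrypoint this repo provides: `run` and `quality_gate` are hypothetical names, it assumes `html2rss` is on `PATH` (so the `cd ../html2rss` step is skipped) and that it is invoked from this repo's root, and it covers only the Faraday-backed fetch lane. Setting `DRY_RUN=1` prints the commands instead of executing them.

```bash
#!/usr/bin/env bash

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical wrapper around the gate above.
#   $1: absolute path to the config file
#   $2: rspec example name, such as 'example.com/feed.yml'
quality_gate() {
  local config=$1 example=$2
  # The && chain stops at the first step that exits non-zero.
  run html2rss validate "$config" &&
    run html2rss feed "$config" &&
    run make validate &&
    run make test &&
    run bundle exec rspec --tag fetch --example "$example" \
      spec/html2rss/configs_dynamic_spec.rb
}
```

For example, `DRY_RUN=1 quality_gate /abs/path/to/config.yml 'example.com/feed.yml'` lists the five commands in gate order without touching the network.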

## Runtime Debugging

Use the core CLI as the authority for single-config debugging. The quickest loop is:

1. `validate`
2. `feed`
3. inspect the RSS for zero items, nav/footer leakage, duplicates, relative URLs, or noisy descriptions
4. adjust selectors
5. rerun

If Browserless works but Faraday does not, keep the config narrow and classify it as Browserless-backed instead of trying to rescue it with brittle tweaks.
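
Part of the inspection in step 3 can be mechanized with standard text tools. The helper below is a rough sketch with a hypothetical name, `check_feed`. It assumes the feed has been saved to a file (for instance by redirecting the output of `html2rss feed`, assuming the CLI prints the RSS to stdout) and that links appear as plain `<link>…</link>` elements. It counts the channel-level `<link>` along with item links, and it cannot detect nav/footer leakage or noisy descriptions, which still need a manual read.

```bash
#!/usr/bin/env bash

# Hypothetical helper: summarize common feed problems in a saved RSS file.
check_feed() {
  local feed_file=$1
  local item_count links

  # Count <item> occurrences regardless of how the XML is laid out on lines.
  item_count=$(grep -o '<item>' "$feed_file" | wc -l | tr -d ' ')

  # Extract the text of every <link> element, one per line.
  links=$(grep -o '<link>[^<]*</link>' "$feed_file" |
    sed -e 's#<link>##' -e 's#</link>##')

  echo "items: $item_count"
  # Links that do not start with http:// or https:// are treated as relative.
  echo "relative links: $(printf '%s\n' "$links" | grep -v '^$' | grep -Evc '^https?://')"
  # Any link value that appears more than once counts as one duplicate.
  echo "duplicate links: $(printf '%s\n' "$links" | grep -v '^$' | sort | uniq -d | wc -l | tr -d ' ')"
}
```

A zero item count points at the `items` selector or the fetch strategy, while relative or duplicate links usually point at the `url` selector or an item boundary that is too wide.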

## Auto-Source

Use `auto` for reconnaissance, not as proof that a config is ready.

```bash
cd ../html2rss
html2rss auto 'https://example.com'
```

Use it to:

- discover likely repeated item selectors
- compare Faraday and Browserless behavior quickly
- decide whether a site belongs in the curated set at all

Do not ship raw auto-sourced output without manual tightening.

## Drop Or Downgrade

Drop or defer when:

- the page stays noisy after reasonable selector tightening
- the site already offers first-party RSS and this config adds little curated value
- the page depends on unstable interaction flows that are not worth encoding

Downgrade when:

- a narrower subpath is much cleaner than the flagship page
- the config is acceptable without descriptions or dates
- month-level dates are the best the source offers

## Reporting

When finishing config work, report:

- files changed
- accepted configs
- downgraded configs and why
- dropped or deferred candidates and why
- commands actually run
- residual risks, especially selector drift, localization dependence, or Browserless dependence
