| 1 | +# html2rss-configs Agent Guide |
| 2 | + |
| 3 | +This repo owns the curated YAML config set for `html2rss`. |
| 4 | + |
| 5 | +Primary goal: add or repair configs that are stable, shippable, and easy to verify. Prefer a narrow, clean surface over a broad noisy one. |
| 6 | + |
| 7 | +## Scope |
| 8 | + |
| 9 | +- Source of truth here: `lib/html2rss/configs/`. |
| 10 | +- Do not hand-edit generated schema output. |
| 11 | +- Keep config work separate from downstream docs, web, or example changes unless the task explicitly includes them. |
| 12 | + |
| 13 | +## Defaults |
| 14 | + |
| 15 | +- Use the registrable domain folder, not a subdomain folder, unless there is a strong existing reason. |
| 16 | +- Start from the cleanest article list the site offers rather than defaulting to the marketing homepage. |
| 17 | +- Prefer stable list/detail extraction over extracting every possible field. |
| 18 | +- If the site only becomes reliable on a narrower path, use that narrower path. |
| 19 | +- Omit brittle fields. If dates or descriptions are low quality, leave them out. |
| 20 | +- Set `enhance: false` when enhancement pulls in page chrome, duplicate cards, or unrelated links. |
| 21 | + |
| 22 | +## Surface Selection |
| 23 | + |
| 24 | +Prefer these surfaces first: |
| 25 | + |
| 26 | +- dedicated newsroom or blog archive pages |
| 27 | +- category pages with one repeated card structure |
| 28 | +- stable subpaths like `/blog/latest` or `/blog/everything/` |
| 29 | + |
| 30 | +Avoid these unless they are the only workable option: |
| 31 | + |
| 32 | +- homepages with hero content mixed with promos |
| 33 | +- pages that combine multiple unrelated card systems |
| 34 | +- infinite-scroll surfaces, except where Browserless is already clearly required |
| 35 | +- localized or geo-redirecting entry pages when a stable non-localized path exists |
| 36 | + |
| 37 | +## Selector Strategy |
| 38 | + |
| 39 | +Start with the smallest useful selector set: |
| 40 | + |
| 41 | +- `items` |
| 42 | +- `title` |
| 43 | +- `url` |
| 44 | + |
| 45 | +Add fields only when they are clean: |
| 46 | + |
| 47 | +- `description` |
| 48 | +- `published_at` |
| 49 | +- `author` |
| 50 | +- `categories` |
| 51 | + |
| 52 | +Useful patterns: |
| 53 | + |
| 54 | +- Prefer the repeated article card itself as `items`, especially when it is a single anchor. |
| 55 | +- Anchor on article URLs or stable path fragments instead of generic headings. |
| 56 | +- Keep selectors item-local when possible. |
| 57 | +- Do not add complexity to recover weak optional fields. |
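A minimal starting config under these guidelines might look like the sketch below. The `channel`/`selectors` layout, the `selector` and `extractor` keys, every CSS selector, and the `example.com` URL are illustrative assumptions rather than values taken from this repo, so check them against the generated schema. The schema modeline is left out because its exact path is not given here.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical scratch location; real configs live under lib/html2rss/configs/.
config_dir="$(mktemp -d)/example.com"
mkdir -p "$config_dir"

# Smallest useful set: items, title, url.
# All CSS values below are made-up placeholders for illustration.
cat > "$config_dir/blog.yml" <<'YAML'
channel:
  url: https://example.com/blog/latest
selectors:
  items:
    selector: "article.post-card"    # the repeated article card
  title:
    selector: "h3"                   # item-local, resolved inside each card
  url:
    selector: "a[href*='/blog/']"    # anchored on a stable path fragment
    extractor: "href"
YAML

echo "wrote $config_dir/blog.yml"
```

Optional fields such as `description` or `published_at` would be added only after this minimal version produces a clean feed.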
| 58 | + |
| 59 | +## Chrome MCP |
| 60 | + |
| 61 | +Use Chrome MCP when the static HTML is unclear, the page is hydrated client-side, or a Faraday fetch returns zero items while the browser shows a valid list. |
| 62 | + |
| 63 | +Recommended sequence: |
| 64 | + |
| 65 | +1. Open the target URL. |
| 66 | +2. Take an accessibility snapshot. |
| 67 | +3. Identify the exact repeated item boundary. |
| 68 | +4. Confirm the title and URL live inside that boundary. |
| 69 | +5. Record the final URL if the page redirects by locale or renders a different surface than expected. |
| 70 | + |
| 71 | +## Browserless |
| 72 | + |
| 73 | +Use Browserless when: |
| 74 | + |
| 75 | +- the page is JS-rendered |
| 76 | +- Faraday fetch returns zero items but Chrome shows a valid repeated list |
| 77 | +- the site is bot-sensitive enough that static fetch is unreliable |
| 78 | + |
| 79 | +Local Browserless notes: |
| 80 | + |
| 81 | +- `html2rss-web` exposes a local endpoint at `ws://127.0.0.1:4002` |
| 82 | +- Browserless fetch tests require `BROWSERLESS_IO_WEBSOCKET_URL` |
| 83 | +- custom websocket endpoints also require `BROWSERLESS_IO_API_TOKEN` |
| 84 | + |
| 85 | +Do not default the whole repo to Browserless. Use it only for configs that need it. |
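Before running a Browserless fetch lane, it can help to fail early when the environment is incomplete. The helper below is a hypothetical sketch: the function name is invented, and it assumes every explicitly set endpoint should be treated as custom (and therefore needs a token), which matches the fetch example that passes both variables.

```bash
#!/usr/bin/env bash

# Hypothetical pre-flight check for the two Browserless variables.
# Assumption: any explicitly configured endpoint counts as "custom",
# so a token is always expected alongside the websocket URL.
check_browserless_env() {
  local url="${BROWSERLESS_IO_WEBSOCKET_URL:-}"
  local token="${BROWSERLESS_IO_API_TOKEN:-}"

  if [ -z "$url" ]; then
    echo "missing BROWSERLESS_IO_WEBSOCKET_URL" >&2
    return 1
  fi
  if [ -z "$token" ]; then
    echo "missing BROWSERLESS_IO_API_TOKEN for $url" >&2
    return 1
  fi
  echo "browserless env ok: $url"
}
```

A typical use would be `check_browserless_env && bundle exec rspec --tag fetch ...`, so the spec run is skipped when either variable is absent.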
| 86 | + |
| 87 | +## Command Assumptions |
| 88 | + |
| 89 | +Assume the `html2rss` CLI is available on `PATH` when working against the sibling core repo. |
| 90 | + |
| 91 | +- Use `html2rss ...` in examples and one-off validation commands. |
| 92 | +- If the CLI is not installed globally in the current environment, run the equivalent command from the sibling `html2rss/` checkout, typically `bundle exec exe/html2rss ...`. |
| 93 | +- In this repo, keep using `make ...` and `bundle exec rspec ...` because those are the implemented entrypoints. |
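The fallback described above can be wrapped in a small shell function. This is a sketch only: `h2r` and `CORE_DIR` are hypothetical names, and the sibling checkout is assumed to sit at `../html2rss`.

```bash
#!/usr/bin/env bash

# Hypothetical wrapper: prefer a global `html2rss`, otherwise fall back to
# the sibling checkout. CORE_DIR is an assumed override, default ../html2rss.
h2r() {
  if command -v html2rss >/dev/null 2>&1; then
    html2rss "$@"
  else
    # Runs in a subshell so the caller's working directory is unchanged.
    (cd "${CORE_DIR:-../html2rss}" && bundle exec exe/html2rss "$@")
  fi
}
```

Because the fallback changes directory before running, pass absolute config paths (for example `h2r validate /abs/path/to/config.yml`) so the path resolves the same way in both branches.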
| 94 | + |
| 95 | +## Fast Path |
| 96 | + |
| 97 | +1. Find the cleanest stable candidate URL. |
| 98 | +2. Inspect the DOM in Chrome MCP before writing selectors. |
| 99 | +3. Create the YAML with the schema modeline and minimal selectors. |
| 100 | +4. Validate the single file with the core CLI. |
| 101 | +5. Generate a live feed with the core CLI. |
| 102 | +6. Tighten selectors until the feed output is clean. |
| 103 | +7. Run repo validation and non-fetch tests. |
| 104 | +8. Run the appropriate fetch lane: |
| 105 | + - plain fetch for static or Faraday-backed configs |
| 106 | + - Browserless fetch for JS-heavy or Browserless-backed configs |
| 107 | + |
| 108 | +## Quality Gate |
| 109 | + |
| 110 | +For every new or changed config, verify in this order. |
| 111 | + |
| 112 | +1. Single-file runtime validation in the core repo: |
| 113 | + |
| 114 | +```bash |
| 115 | +cd ../html2rss |
| 116 | +html2rss validate /abs/path/to/config.yml |
| 117 | +``` |
| 118 | + |
| 119 | +2. Single-file live feed generation in the core repo: |
| 120 | + |
| 121 | +```bash |
| 122 | +cd ../html2rss |
| 123 | +html2rss feed /abs/path/to/config.yml |
| 124 | +``` |
| 125 | + |
| 126 | +3. Repo-wide validation in this repo: |
| 127 | + |
| 128 | +```bash |
| 129 | +make validate |
| 130 | +``` |
| 131 | + |
| 132 | +4. Repo non-fetch tests in this repo: |
| 133 | + |
| 134 | +```bash |
| 135 | +make test |
| 136 | +``` |
| 137 | + |
| 138 | +5. Focused fetch verification: |
| 139 | + |
| 140 | +- Faraday-backed candidate: |
| 141 | + |
| 142 | +```bash |
| 143 | +bundle exec rspec --tag fetch --example 'example.com/feed.yml' spec/html2rss/configs_dynamic_spec.rb |
| 144 | +``` |
| 145 | + |
| 146 | +- Browserless-backed candidate: |
| 147 | + |
| 148 | +```bash |
| 149 | +BROWSERLESS_IO_WEBSOCKET_URL=ws://127.0.0.1:4002 \ |
| 150 | +BROWSERLESS_IO_API_TOKEN=... \ |
| 151 | +bundle exec rspec --tag fetch --example 'example.com/feed.yml' spec/html2rss/configs_dynamic_spec.rb |
| 152 | +``` |
| 153 | + |
| 154 | +6. If fetch still fails, decide explicitly whether: |
| 155 | + |
| 156 | +- selectors are wrong |
| 157 | +- the page needs Browserless |
| 158 | +- the chosen surface is too noisy or too dynamic |
| 159 | +- the candidate should be downgraded or dropped |
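The ordered steps above can be chained so the first failure stops the run. The script below is a sketch under assumptions: `run`, `quality_gate`, and `DRY_RUN` are hypothetical names, the `html2rss` CLI is assumed to be on `PATH`, and by default the script only prints the commands instead of executing them.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical gate runner. DRY_RUN=1 (the default here) only prints each
# command; set DRY_RUN=0 to execute them, stopping at the first failure.
run() {
  printf '+ %s\n' "$*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

quality_gate() {
  local config="$1"    # absolute path to the config under test
  local example="$2"   # spec example name, e.g. example.com/feed.yml

  run html2rss validate "$config"
  run html2rss feed "$config"
  run make validate
  run make test
  run bundle exec rspec --tag fetch --example "$example" \
    spec/html2rss/configs_dynamic_spec.rb
}

quality_gate /abs/path/to/config.yml example.com/feed.yml
```

For a Browserless-backed candidate, export the two `BROWSERLESS_IO_*` variables before calling `quality_gate` with `DRY_RUN=0`.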
| 160 | + |
| 161 | +## Runtime Debugging |
| 162 | + |
| 163 | +Use the core CLI as the authority for single-config debugging. The quickest loop is: |
| 164 | + |
| 165 | +1. `validate` |
| 166 | +2. `feed` |
| 167 | +3. inspect the RSS for zero items, nav/footer leakage, duplicates, relative URLs, or noisy descriptions |
| 168 | +4. adjust selectors |
| 169 | +5. rerun |
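Part of the inspection step can be automated with a rough text check on a saved feed. The sketch below assumes the feed was written to a file (for example by redirecting the `feed` command's output, if it prints to stdout) and that each `<item>` and `<link>` element sits on its own line. `inspect_feed` is a hypothetical helper and a smoke test, not an XML parser.

```bash
#!/usr/bin/env bash

# Hypothetical smoke test for a saved RSS file.
# Reports item count, links that start with "/", and repeated link values.
inspect_feed() {
  local feed="$1"
  local items relative dupes

  # grep -c counts matching lines; `|| true` keeps a zero count from
  # being treated as a failure.
  items="$(grep -c '<item>' "$feed" || true)"
  relative="$(grep -c '<link>/' "$feed" || true)"
  dupes="$(grep -o '<link>[^<]*</link>' "$feed" | sort | uniq -d | wc -l | tr -d ' ')"

  echo "items=$items relative_links=$relative duplicate_links=$dupes"
}
```

Zero items, any relative links, or any duplicate links are signals to go back to step 4 and adjust selectors. Nav/footer leakage and noisy descriptions still need a manual read of the output.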
| 170 | + |
| 171 | +If Browserless works but Faraday does not, keep the config narrow and classify it as Browserless-backed instead of trying to rescue it with brittle tweaks. |
| 172 | + |
| 173 | +## Auto-Source |
| 174 | + |
| 175 | +Use `auto` for reconnaissance, not as proof that a config is ready. |
| 176 | + |
| 177 | +```bash |
| 178 | +cd ../html2rss |
| 179 | +html2rss auto 'https://example.com' |
| 180 | +``` |
| 181 | + |
| 182 | +Use it to: |
| 183 | + |
| 184 | +- discover likely repeated item selectors |
| 185 | +- compare Faraday and Browserless behavior quickly |
| 186 | +- decide whether a site belongs in the curated set at all |
| 187 | + |
| 188 | +Do not ship raw auto-sourced output without manual tightening. |
| 189 | + |
| 190 | +## Drop Or Downgrade |
| 191 | + |
| 192 | +Drop or defer when: |
| 193 | + |
| 194 | +- the page stays noisy after reasonable selector tightening |
| 195 | +- the site already offers first-party RSS and this config adds little curated value |
| 196 | +- the page depends on unstable interaction flows that are not worth encoding |
| 197 | + |
| 198 | +Downgrade when: |
| 199 | + |
| 200 | +- a narrower subpath is much cleaner than the flagship page |
| 201 | +- the config is acceptable without descriptions or dates |
| 202 | +- month-level dates are the best the source offers |
| 203 | + |
| 204 | +## Reporting |
| 205 | + |
| 206 | +When finishing config work, report: |
| 207 | + |
| 208 | +- files changed |
| 209 | +- accepted configs |
| 210 | +- downgraded configs and why |
| 211 | +- dropped or deferred candidates and why |
| 212 | +- commands actually run |
| 213 | +- residual risks, especially selector drift, localization dependence, or Browserless dependence |