| title | Selectors |
|---|---|
| description | The selectors scraper gives you fine-grained control over content extraction using CSS selectors. |
The selectors scraper gives you fine-grained control over content extraction using CSS selectors.
A valid RSS item requires at least a `title` or a `description`.
At a minimum, you need an `items` selector to define the list of articles and a `title` selector for the article titles.

    channel:
      url: "https://example.com"
    selectors:
      items:
        selector: ".article"
      title:
        selector: "h1"

To simplify configuration, html2rss can automatically extract the `title`, `url`, and `image` from each item. This feature is enabled by default.

    selectors:
      items:
        selector: ".article"
        enhance: true # default: true

You can control the order of items in your feed:

    selectors:
      items:
        selector: ".article"
        order: "reverse" # Reverse the order of items (newest first)

Available options:

- `"reverse"`: Reverses the order of items (useful when the website shows oldest items first)
- Default: Items appear in the order they are found on the page
html2rss can follow a single `rel="next"` pagination chain when you configure `selectors.items.pagination.max_pages`.

    channel:
      url: "https://example.com/news"
    selectors:
      items:
        selector: "article"
        pagination:
          max_pages: 3
      title:
        selector: "h1"
      url:
        selector: "a"
        extractor: "href"

Behavior:

- `max_pages` is the total page budget for the item selector chain, including the initial page.
- `max_pages` is capped by the system request ceiling of 10 pages per feed build.
- Pagination follows strict `link[rel~="next"]` or `a[rel~="next"]` targets only.
- Follow-up pages use the current page's effective origin after redirects.
- Pagination stops when there is no next link, a page repeats, or the shared request budget is exhausted.
- The same request safeguards apply to pagination and Browserless navigation, including timeout limits, redirect limits, response-size guards, and private-network denial.
While you can define any named selector, only the following are used in the final RSS feed:
| RSS 2.0 Tag | html2rss Name | Notes |
|---|---|---|
| `title` | `title` | |
| `description` | `description` | |
| `link` | `url` | |
| `author` | `author` | |
| `category` | `categories` | |
| `guid` | `guid` | |
| `enclosure` | `enclosure` | |
| `pubDate` | `published_at` | |
| `comments` | `comments` | |
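As a rough illustration of the mapping above, the following sketch defines several of the recognized names in one configuration. The URL and the CSS classes (`.post`, `.summary`, `.byline`) are placeholders chosen for this example, not taken from a real site.

```yaml
channel:
  url: "https://example.com/blog" # placeholder URL
selectors:
  items:
    selector: ".post" # hypothetical class wrapping each article
  title:
    selector: "h2" # mapped to <title>
  description:
    selector: ".summary" # mapped to <description>
  url:
    selector: "a"
    extractor: "href" # mapped to <link>
  author:
    selector: ".byline" # mapped to <author>
```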
Each selector can be configured with the following options:
| Name | Description |
|---|---|
| `selector` | The CSS selector for the target element. |
| `extractor` | The extractor to use for this selector. |
| `attribute` | The attribute name (required for the `attribute` extractor). |
| `static` | The static value (required for the `static` extractor). |
| `post_process` | A list of post-processors to apply to the value. |
Extractors define how to get the value from a selected element.
- `text`: The inner text of the element (default).
- `html`: The outer HTML of the element.
- `href`: The value of the `href` attribute.
- `attribute`: The value of a specified attribute.
- `static`: A static value.
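The sketch below shows how the extractors might be combined in one configuration. The class names, the `data-url` attribute, and the author string are invented for illustration. It also assumes that a selector using the `static` extractor does not need a `selector` key, which is worth confirming against the html2rss reference.

```yaml
selectors:
  items:
    selector: ".article"
  title:
    selector: "h1" # no extractor given, so the default `text` applies
  description:
    selector: ".teaser" # hypothetical class
    extractor: "html" # keeps the element's outer HTML
  url:
    selector: "a.permalink" # hypothetical class
    extractor: "attribute"
    attribute: "data-url" # hypothetical attribute holding the link
  author:
    extractor: "static" # assumption: no CSS selector is required here
    static: "Example Newsroom" # placeholder value
```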
Post-processors manipulate the extracted value.
- `gsub`: Performs a global substitution on a string.
- `html_to_markdown`: Converts HTML to Markdown.
- `markdown_to_html`: Converts Markdown to HTML.
- `parse_time`: Parses a string into a `Time` object.
- `parse_uri`: Resolves a relative URL against `channel.url` and returns the normalized URL string.
- `sanitize_html`: Sanitizes HTML to prevent security vulnerabilities.
- `substring`: Extracts a substring from a string.
- `template`: Creates a new string from a template and other selector values. Use `%{self}` for the current selector value.
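A possible way to chain post-processors is sketched below. This page does not list the per-processor options, so the keys `name`, `pattern`, `replacement`, and `string` are assumptions and should be checked against the html2rss reference. The `price` selector and the CSS classes are hypothetical.

```yaml
selectors:
  items:
    selector: ".article"
  price:
    selector: ".price" # helper selector, referenced by the template below
  title:
    selector: "h1"
    post_process:
      - name: "gsub" # strip a prefix from the extracted text
        pattern: "Sponsored: " # assumed option name
        replacement: "" # assumed option name
      - name: "template" # runs on the gsub result
        string: "%{self} (%{price})" # assumed option name
  published_at:
    selector: "time"
    extractor: "attribute"
    attribute: "datetime"
    post_process:
      - name: "parse_time" # turns the attribute string into a Time
```

Because `post_process` is a list, the processors would presumably run in the order given, so `%{self}` in the template refers to the title after the substitution.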
Always use the `sanitize_html` post-processor for any HTML content to prevent security risks.
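In practice this advice might look like the following minimal sketch, which pairs the `html` extractor with `sanitize_html`. The `.content` class is a placeholder, and the `name` key is assumed to be how a post-processor is referenced.

```yaml
selectors:
  items:
    selector: ".article"
  description:
    selector: ".content" # hypothetical class holding the article body
    extractor: "html" # raw HTML from the page, so it needs sanitizing
    post_process:
      - name: "sanitize_html" # assumed key for naming the post-processor
```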
To add categories to an item, provide a list of selector names to the `categories` selector.

    selectors:
      genre:
        selector: ".genre"
      branch:
        selector: ".branch"
      categories:
        - genre
        - branch

To create a custom GUID for an item, provide a list of selector names to the `guid` selector.

    selectors:
      title:
        selector: "h1"
      url:
        selector: "a"
        extractor: "href"
      guid:
        - url

To add an enclosure (e.g., an image, audio, or video file) to an item, use the `enclosure` selector to specify the URL of the file.

    selectors:
      items:
        selector: ".post"
      title:
        selector: "h2"
      enclosure:
        selector: "audio"
        extractor: "attribute"
        attribute: "src"
        content_type: "audio/mp3"

For detailed documentation on the Ruby API, see the official YARD documentation.