title	Selectors
description	The selectors scraper gives you fine-grained control over content extraction using CSS selectors.

The selectors scraper gives you fine-grained control over content extraction using CSS selectors.

A valid RSS item requires at least a title or a description.

Basic Configuration

At a minimum, you need an items selector to define the list of articles and a title selector (or a description selector) so that each item satisfies the requirement above.

channel:
  url: "https://example.com"
selectors:
  items:
    selector: ".article"
  title:
    selector: "h1"

Automatic Item Enhancement

To simplify configuration, html2rss can automatically extract the title, url, and image from each item. This feature is enabled by default.

selectors:
  items:
    selector: ".article"
    enhance: true # default: true
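
If the automatic extraction picks the wrong elements, one option is to switch it off and define the fields explicitly. The following is a minimal sketch, assuming that enhance: false disables the feature (the inverse of the documented default). The .article, h2, and a selectors are placeholders for the target page's markup.

```yaml
selectors:
  items:
    selector: ".article"
    enhance: false # assumption: disables automatic title/url/image extraction
  title:
    selector: "h2" # placeholder selector
  url:
    selector: "a" # placeholder selector
    extractor: "href"
```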

Item Ordering

You can control the order of items in your feed:

selectors:
  items:
    selector: ".article"
    order: "reverse" # Reverse the order in which items were found on the page

Available options:

- "reverse": Reverses the order of items (useful when the website shows the oldest items first).
- Default (option omitted): Items appear in the order they are found on the page.

Paginated Feeds

html2rss can follow a single rel="next" pagination chain when you configure selectors.items.pagination.max_pages.

channel:
  url: "https://example.com/news"
selectors:
  items:
    selector: "article"
    pagination:
      max_pages: 3
  title:
    selector: "h1"
  url:
    selector: "a"
    extractor: "href"

Behavior:

- max_pages is the total page budget for the item selector chain, including the initial page.
- max_pages is capped by the system request ceiling of 10 pages per feed build.
- Pagination follows strict link[rel~="next"] or a[rel~="next"] targets only.
- Follow-up pages use the current page's effective origin after redirects.
- Pagination stops when there is no next link, a page repeats, or the shared request budget is exhausted.
- The same request safeguards apply to pagination and Browserless navigation, including timeout limits, redirect limits, response-size guards, and private-network denial.
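
To make the page budget concrete, the sketch below annotates how a few max_pages values would play out under the rules listed above. The URL and selectors are placeholders, and the annotated outcomes are derived from those rules rather than from observed runs.

```yaml
channel:
  url: "https://example.com/news" # placeholder URL
selectors:
  items:
    selector: "article"
    pagination:
      # 1  -> only the initial page (the budget includes it)
      # 3  -> the initial page plus at most two rel="next" follow-ups
      # 25 -> limited to 10 by the request ceiling described above
      max_pages: 3
  title:
    selector: "h1"
```

In every case fewer pages may be fetched, since the chain ends early when no next link is found or a page repeats.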

RSS 2.0 Selectors

While you can define any named selector, only the following are used in the final RSS feed:

| RSS 2.0 Tag | `html2rss` Name | Notes |
| --- | --- | --- |
| `title` | `title` | |
| `description` | `description` | |
| `link` | `url` | |
| `author` | `author` | |
| `category` | `categories` | |
| `guid` | `guid` | |
| `enclosure` | `enclosure` | |
| `pubDate` | `published_at` | |
| `comments` | `comments` | ⚠️ Not currently implemented |
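
A selector whose name is not in this table can still be useful as a helper whose value feeds another selector. The sketch below is a hedged illustration: price is a hypothetical helper name, the CSS selectors are placeholders, and the name and string keys of the post-processor entry are assumptions about its option names.

```yaml
selectors:
  items:
    selector: ".product"
  price: # helper selector: has no RSS tag of its own
    selector: ".price"
  title: # emitted as <title>
    selector: "h2"
    post_process:
      - name: "template"
        string: "%{self} (%{price})" # assumption: the option is named "string"
  url: # emitted as <link>
    selector: "a"
    extractor: "href"
```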

Selector Options

Each selector can be configured with the following options:

| Name | Description |
| --- | --- |
| `selector` | The CSS selector for the target element. |
| `extractor` | The extractor to use for this selector. |
| `attribute` | The attribute name (required for `attribute` extractor). |
| `static` | The static value (required for `static` extractor). |
| `post_process` | A list of post-processors to apply to the value. |

Extractors

Extractors define how to get the value from a selected element.

- text: The inner text of the element (default).
- html: The outer HTML of the element.
- href: The value of the href attribute.
- attribute: The value of a specified attribute.
- static: A static value.
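
The sketch below uses each extractor once, so the differences are visible side by side. The CSS selectors, the datetime attribute, and the "Example Newsroom" value are placeholders for illustration, not part of any real site.

```yaml
selectors:
  items:
    selector: ".article"
  title:
    selector: "h2" # no extractor given, so the default (text) applies
  description:
    selector: ".body"
    extractor: "html" # outer HTML of the matched element
  url:
    selector: "a"
    extractor: "href" # value of the href attribute
  published_at:
    selector: "time"
    extractor: "attribute"
    attribute: "datetime" # required by the attribute extractor
  author:
    extractor: "static"
    static: "Example Newsroom" # required by the static extractor
```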

Post-Processors

Post-processors manipulate the extracted value.

- gsub: Performs a global substitution on a string.
- html_to_markdown: Converts HTML to Markdown.
- markdown_to_html: Converts Markdown to HTML.
- parse_time: Parses a string into a Time object.
- parse_uri: Resolves a relative URL against channel.url and returns the normalized URL string.
- sanitize_html: Sanitizes HTML to prevent security vulnerabilities.
- substring: Extracts a substring from a string.
- template: Creates a new string from a template and other selector values. Use %{self} for the current selector value.

Always use the sanitize_html post-processor for any HTML content to prevent security risks.
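
As a rough sketch of how post-processors attach to selectors, the configuration below applies one to each of three fields. The name, pattern, and replacement keys are assumptions about the option names and should be checked against the post-processor reference. The CSS selectors and the "Breaking: " prefix are placeholders.

```yaml
selectors:
  items:
    selector: ".article"
  title:
    selector: "h2"
    post_process:
      - name: "gsub"
        pattern: "Breaking: " # assumption: literal text to replace
        replacement: ""
  description:
    selector: ".body"
    extractor: "html"
    post_process:
      - name: "sanitize_html" # recommended for any HTML content
  published_at:
    selector: "time"
    extractor: "attribute"
    attribute: "datetime"
    post_process:
      - name: "parse_time" # turns the extracted string into a Time object
```

Because post_process is a list, several entries can presumably be chained on one selector, each receiving the previous entry's result.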

Advanced Usage

Custom GUID

To create a custom GUID for an item, provide a list of selector names to the guid selector.

selectors:
  title:
    selector: "h1"
  url:
    selector: "a"
    extractor: "href"
  guid:
    - url

Enclosures

To add an enclosure (e.g., an image, audio, or video file) to an item, use the enclosure selector to specify the URL of the file.

selectors:
  items:
    selector: ".post"
  title:
    selector: "h2"
  enclosure:
    selector: "audio"
    extractor: "attribute"
    attribute: "src"
    content_type: "audio/mpeg" # registered MIME type for MP3 audio

For detailed documentation on the Ruby API, see the official YARD documentation.
