title: Selectors
description: The selectors scraper gives you fine-grained control over content extraction using CSS selectors.

The selectors scraper gives you fine-grained control over content extraction using CSS selectors.

A valid RSS item requires at least a title or a description.

Basic Configuration

At a minimum, you need an items selector to define the list of articles and a title selector for the article titles.

channel:
  url: "https://example.com"
selectors:
  items:
    selector: ".article"
  title:
    selector: "h1"

Automatic Item Enhancement

To simplify configuration, html2rss can automatically extract the title, url, and image from each item. This feature is enabled by default.

selectors:
  items:
    selector: ".article"
    enhance: true # default: true
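
If the automatically extracted values are not the ones you want, enhancement can presumably be switched off and the fields defined by hand. A minimal sketch, assuming each .article element contains an h2 heading and a link (both are hypothetical markup, not taken from a real site):

```yaml
selectors:
  items:
    selector: ".article"
    enhance: false # turn off automatic title/url/image extraction
  title:
    selector: "h2" # hypothetical: the heading inside each .article
  url:
    selector: "a" # hypothetical: the first link inside each .article
    extractor: "href"
```

With enhance disabled, only the selectors you define contribute values to each item.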

Item Ordering

You can control the order of items in your feed:

selectors:
  items:
    selector: ".article"
    order: "reverse" # Reverse the page order (e.g. when the site lists oldest items first)

Available options:

  • "reverse": Reverses the order of items (useful when the website shows oldest items first)
  • Default: Items appear in the order they are found on the page

Paginated Feeds

html2rss can follow a single rel="next" pagination chain when you configure selectors.items.pagination.max_pages.

channel:
  url: "https://example.com/news"
selectors:
  items:
    selector: "article"
    pagination:
      max_pages: 3
  title:
    selector: "h1"
  url:
    selector: "a"
    extractor: "href"

Behavior:

  • max_pages is the total page budget for the item selector chain, including the initial page.
  • max_pages is capped by the system request ceiling of 10 pages per feed build.
  • Pagination follows only link[rel~="next"] or a[rel~="next"] targets; no other links are treated as pagination.
  • Follow-up pages use the current page's effective origin after redirects.
  • Pagination stops when there is no next link, a page repeats, or the shared request budget is exhausted.
  • The same request safeguards apply to pagination and Browserless navigation, including timeout limits, redirect limits, response-size guards, and private-network denial.
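
To make the page budget concrete, the sketch below annotates a configuration with the requests it would allow under the rules above. The URL and the page layout are hypothetical.

```yaml
channel:
  url: "https://example.com/news" # page 1 (counts toward max_pages)
selectors:
  items:
    selector: "article"
    pagination:
      # Budget of 2 pages in total: the initial page plus at most one
      # page reached through a rel="next" link. A rel="next" link on
      # the second page is not followed because the budget is spent.
      max_pages: 2
  title:
    selector: "h1"
```

Values above 10 gain nothing, since the request ceiling caps the budget at 10 pages per feed build.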

RSS 2.0 Selectors

While you can define any named selector, only the following are used in the final RSS feed:

| RSS 2.0 Tag | html2rss Name | Notes |
| --- | --- | --- |
| title | title | |
| description | description | |
| link | url | |
| author | author | |
| category | categories | |
| guid | guid | |
| enclosure | enclosure | |
| pubDate | published_at | |
| comments | comments | ⚠️ Not currently implemented |
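
As an illustration of the name mapping, the following sketch defines four of the selectors listed above. The CSS class names (.article, .summary, .byline) are hypothetical and would need to match the target site's markup.

```yaml
selectors:
  items:
    selector: ".article"
  title: # becomes <title>
    selector: "h2"
  description: # becomes <description>
    selector: ".summary"
  url: # becomes <link>
    selector: "a"
    extractor: "href"
  author: # becomes <author>
    selector: ".byline"
```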

Selector Options

Each selector can be configured with the following options:

| Name | Description |
| --- | --- |
| selector | The CSS selector for the target element. |
| extractor | The extractor to use for this selector. |
| attribute | The attribute name (required for attribute extractor). |
| static | The static value (required for static extractor). |
| post_process | A list of post-processors to apply to the value. |

Extractors

Extractors define how to get the value from a selected element.

  • text: The inner text of the element (default).
  • html: The outer HTML of the element.
  • href: The value of the href attribute.
  • attribute: The value of a specified attribute.
  • static: A static value.
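
The sketch below combines several extractors in one configuration. The element names and the author value are hypothetical, and it assumes the static extractor does not need a selector of its own, which is not stated above.

```yaml
selectors:
  items:
    selector: ".article"
  title:
    selector: "h2" # no extractor given, so the default (text) applies
  url:
    selector: "a"
    extractor: "href" # value of the href attribute
  enclosure:
    selector: "img"
    extractor: "attribute"
    attribute: "src" # required when using the attribute extractor
  author:
    extractor: "static"
    static: "Example Newsroom" # hypothetical fixed value for every item
```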

Post-Processors

Post-processors manipulate the extracted value.

  • gsub: Performs a global substitution on a string.
  • html_to_markdown: Converts HTML to Markdown.
  • markdown_to_html: Converts Markdown to HTML.
  • parse_time: Parses a string into a Time object.
  • parse_uri: Resolves a relative URL against channel.url and returns the normalized URL string.
  • sanitize_html: Sanitizes HTML to prevent security vulnerabilities.
  • substring: Extracts a substring from a string.
  • template: Creates a new string from a template and other selector values. Use %{self} for the current selector value.

Always use the sanitize_html post-processor for any HTML content to prevent security risks.
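
A minimal sketch of the sanitizing advice above. It assumes that each entry in the post_process list identifies its post-processor with a name key; that key is not shown in this section, so check it against the post-processor reference. The .body class is hypothetical.

```yaml
selectors:
  items:
    selector: ".article"
  title:
    selector: "h2"
  description:
    selector: ".body"
    extractor: "html" # raw HTML, so it should be sanitized
    post_process:
      - name: "sanitize_html" # assumed key name for selecting the post-processor
```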

Advanced Usage

Categories

To add categories to an item, provide a list of selector names to the categories selector.

selectors:
  genre:
    selector: ".genre"
  branch:
    selector: ".branch"
  categories:
    - genre
    - branch

Custom GUID

To create a custom GUID for an item, provide a list of selector names to the guid selector.

selectors:
  title:
    selector: "h1"
  url:
    selector: "a"
    extractor: "href"
  guid:
    - url
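
Because guid takes a list, more than one selector can be named. The sketch below lists both url and title; how html2rss combines the values into the final GUID is not described here, so treat this as an illustration of the syntax only.

```yaml
selectors:
  title:
    selector: "h1"
  url:
    selector: "a"
    extractor: "href"
  guid:
    - url
    - title # additional selector name contributing to the GUID
```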

Enclosures

To add an enclosure (e.g., an image, audio, or video file) to an item, use the enclosure selector to specify the URL of the file.

selectors:
  items:
    selector: ".post"
  title:
    selector: "h2"
  enclosure:
    selector: "audio"
    extractor: "attribute"
    attribute: "src"
    content_type: "audio/mpeg"

For detailed documentation on the Ruby API, see the official YARD documentation.