Add OECD.org scraper #32

Open
Copilot wants to merge 2 commits into main from copilot/add-scraper-file-oecd-org

Copilot AI commented Mar 9, 2026

Adds scrapers/oecd.org.js to scrape news cards from https://www.oecd.org/.

Key details

  • listSelector: div.card---theme (triple-dash class, exact from Webscraper JSON config)
  • parse: extracts title + absolute link from first <a>; returns null if either is missing; resolves relative URLs against https://www.oecd.org
  • Image: lazy-load fallback chain (src → data-src → data-lazy-src → data-original), discarding data: base64 placeholders
  • Date: parseOecdDate handles DD/MM/YYYY (uses Date.UTC to avoid a timezone day-shift) and named-month strings like "9 March 2026"; it falls back to the current time if the string cannot be parsed
  • content: empty string — no description selector in JSON config
  • creator: 'OECD'
  • No domain property (not used by the engine)

Original prompt

Task

Create a new scraper file scrapers/oecd.org.js for https://www.oecd.org/.

robots.txt clearance

The OECD robots.txt (https://www.oecd.org/robots.txt) only disallows /content/dam/oecd/ and /adobe/dynamicmedia/deliver/ for generic crawlers. The homepage https://www.oecd.org/ is allowed. ✅

Webscraper JSON config (source of truth)

{
  "_id": "oecd-main-insights",
  "startUrl": ["https://www.oecd.org/"],
  "selectors": [
    {
      "elementLimit": 0,
      "id": "elem",
      "multiple": true,
      "parentSelectors": ["_root"],
      "scroll": false,
      "selector": "div.card---theme",
      "type": "SelectorElement"
    },
    {
      "id": "title",
      "multiple": false,
      "multipleType": "singleColumn",
      "parentSelectors": ["elem"],
      "regex": "",
      "selector": "a",
      "type": "SelectorText",
      "version": 2
    },
    {
      "id": "link",
      "linkType": "linkFromHref",
      "multiple": false,
      "parentSelectors": ["elem"],
      "selector": "a",
      "type": "SelectorLink",
      "version": 2
    },
    {
      "id": "image",
      "multiple": false,
      "multipleType": "singleColumn",
      "parentSelectors": ["elem"],
      "selector": "img",
      "type": "SelectorImage",
      "version": 2
    },
    {
      "id": "date",
      "multiple": false,
      "multipleType": "singleColumn",
      "parentSelectors": ["elem"],
      "regex": "",
      "selector": ".card__date",
      "type": "SelectorText",
      "version": 2
    }
  ]
}

Implementation requirements

Selectors

  • listSelector: div.card---theme — note the triple dash --- in the class name (this is exact, straight from the JSON)
  • title: text of the first a inside the card
  • link: href of the first a inside the card — resolve to absolute URL using https://www.oecd.org as base if relative
  • image: img inside the card; apply the lazy-load fallback pattern src → data-src → data-lazy-src → data-original → null, discarding data: base64 placeholders
  • date: .card__date text — OECD typically uses formats like "9 March 2026" or "09/03/2026". Parse robustly:
    • Try DD/MM/YYYY regex first
    • Fall back to new Date(str) for named-month formats like "9 March 2026"
    • Fall back to new Date().toISOString() if unparseable
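
The image fallback chain in the list above can be sketched as follows. This is a minimal illustration, not the code in the PR: pickImage and IMAGE_ATTRS are hypothetical names, and the helper assumes the img attributes have already been read into a plain object (how the engine exposes DOM attributes is not shown in this prompt).

```javascript
// Attribute names to try, in priority order (taken from the prompt above).
const IMAGE_ATTRS = ['src', 'data-src', 'data-lazy-src', 'data-original'];

// Hypothetical helper: `attrs` is assumed to be a plain object of <img>
// attributes, e.g. { src: '...', 'data-src': '...' }.
function pickImage(attrs) {
    if (!attrs) return null;
    for (const name of IMAGE_ATTRS) {
        const value = String(attrs[name] || '').trim();
        // Skip empty values and inline base64 placeholders ("data:" URIs).
        if (value && !value.startsWith('data:')) return value;
    }
    return null; // no usable image attribute found
}
```

For a card whose src holds a base64 placeholder and whose data-src holds the real path, this returns the data-src value; with no usable attribute it returns null.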

Return object

return {
    title,
    link,
    enforcedImage: img,
    content,   // empty string "" — no description selector in JSON config
    pubDate,
    creator: 'OECD'
};

Return null if title or link is missing/empty.
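
One way to combine the return object with the null rule is sketched below. buildItem is a hypothetical helper name used only for illustration; it assumes the fields have already been extracted from the card, and the actual scraper may inline this logic in parse.

```javascript
// Hypothetical helper: builds the item the prompt describes, or returns
// null when the title or link is missing or blank.
function buildItem({ title, link, img = null, pubDate }) {
    const cleanTitle = String(title || '').trim();
    const cleanLink = String(link || '').trim();
    if (!cleanTitle || !cleanLink) return null;

    return {
        title: cleanTitle,
        link: cleanLink,
        enforcedImage: img,
        content: '', // no description selector in the JSON config
        pubDate,
        creator: 'OECD'
    };
}
```

Trimming before the check means a card whose anchor contains only whitespace is treated as having no title.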

File location

scrapers/oecd.org.js

Reference implementation pattern

Follow the same structure as the existing scrapers in the repo, especially scrapers/icann.org.js and scrapers/coe.int.js. The file must use module.exports = { listSelector, parse }.

Do NOT add a domain property — it is not used by the engine for scrapers.

The fetch_news.js engine automatically resolves relative URLs, but it is safer to make the link absolute in the scraper itself using new URL(rawLink, 'https://www.oecd.org').href.
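
A small sketch of that resolution step, using the standard URL constructor. resolveLink and BASE_URL are hypothetical names; returning null for an empty or malformed href is an assumption that matches the "return null if link is missing" rule above.

```javascript
const BASE_URL = 'https://www.oecd.org';

// Hypothetical helper: resolves a raw href against the OECD origin.
// Absolute URLs pass through unchanged; relative ones get the base applied.
function resolveLink(rawLink) {
    if (!rawLink) return null;
    try {
        return new URL(rawLink, BASE_URL).href;
    } catch (err) {
        return null; // href could not be parsed as a URL
    }
}
```

With this, resolveLink('/en/topics.html') yields 'https://www.oecd.org/en/topics.html', while an href that is already absolute is returned as-is.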

Date parsing helper

function parseOecdDate(str) {
    if (!str) return new Date().toISOString();

    // Try DD/MM/YYYY (European slash format)
    const slashMatch = str.match(/^(\d{1,2})\/(\d{1,2})\/(\d{4})$/);
    if (slashMatch) {
        const day   = parseInt(slashMatch[1], 10);
        const month = parseInt(slashMatch[2], 10) - 1;
        const year  = parseInt(slashMatch[3], 10);
        return new Date(year, month, day, 12, 0, 0).toISOString();
    }

    // Fallback: JS Date parse ("9 March 2026", "2026-03-09", etc.)
    const d = new Date(str);
    if (!isNaN(d.getTime())) return d.toISOString();

    return new Date().toISOString();
}
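
The helper above builds the DD/MM/YYYY date in local time, whereas the PR summary says the implementation uses Date.UTC. A sketch of that UTC variant follows; parseOecdDateUtc is a hypothetical name and this is not necessarily the exact code in the PR. Note that new Date(str) for non-ISO strings such as "9 March 2026" is implementation-defined, so the named-month fallback depends on the runtime.

```javascript
// Sketch of a UTC-based variant of the date helper (hypothetical name).
function parseOecdDateUtc(str) {
    const text = String(str || '').trim();
    if (!text) return new Date().toISOString();

    // DD/MM/YYYY: build the timestamp at 12:00 UTC so the calendar day
    // does not depend on the machine's timezone.
    const m = text.match(/^(\d{1,2})\/(\d{1,2})\/(\d{4})$/);
    if (m) {
        const day = Number(m[1]);
        const month = Number(m[2]) - 1; // JS months are 0-based
        const year = Number(m[3]);
        return new Date(Date.UTC(year, month, day, 12, 0, 0)).toISOString();
    }

    // Fallback: let the runtime parse named-month or ISO strings.
    const d = new Date(text);
    if (!isNaN(d.getTime())) return d.toISOString();

    // Last resort: current time.
    return new Date().toISOString();
}
```

For example, parseOecdDateUtc('09/03/2026') returns '2026-03-09T12:00:00.000Z' regardless of the local timezone.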

This pull request was created from Copilot chat.


Co-authored-by: susannaanas <4725416+susannaanas@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Create new scraper file for oecd.org" to "Add OECD.org scraper" on Mar 9, 2026
@susannaanas marked this pull request as ready for review on March 9, 2026, 13:30