refactor: remove content-heuristic paywall detection (ADR 028) #226

Merged: forcingfx merged 1 commit into main from claude/remove-paywall-detection-QuJ5A on Jun 5, 2026
@forcingfx

Owner

The per-publisher content detectors (src/core/extractor/paywall-detectors/)
guessed at paywalls from page HTML — phrase lists ("Subscribe to read",
"Already a subscriber?") plus a 600-char body-too-short threshold. They were
unreliable in the worst direction: free articles whose nav/footer/newsletter
chrome shipped those industry-standard phrases were flagged as paywalled
(issue #211). The #211 fix layered an "extract-first, only consult heuristics
on a thin result" guard on top to compensate for a signal that was guessing.

Replace the guess with the publisher's own answer: a paywall is recognized
only by a gated HTTP status (401/402/403/451) on the anonymous /api/page
fetch. That status is the publisher telling us the content is gated — no
phrase lists, no per-publisher upkeep, no false positives from page chrome.
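The status-only rule above can be sketched in a few lines. This is a minimal illustration, not the code from the PR: the names `GATED_STATUSES` and `isGatedStatus` are hypothetical, and only the four status codes come from the description.

```typescript
// The four statuses the description treats as "gated" on the anonymous fetch.
const GATED_STATUSES: ReadonlySet<number> = new Set([401, 402, 403, 451]);

// Hypothetical helper (name is illustrative, not taken from the codebase).
// The page body is never inspected, so nav/footer phrases cannot trigger it.
function isGatedStatus(status: number): boolean {
  return GATED_STATUSES.has(status);
}

console.log(isGatedStatus(402)); // true
console.log(isGatedStatus(200)); // false
```

Because the check takes only a status code, a free article whose footer says "Subscribe to read" has no way to be flagged, which is the #211 failure mode this change removes.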

What changed:

  • Delete src/core/extractor/paywall-detectors/ (default/nytimes/economist
    detectors, registry, visible-text, detectPaywall, PaywallDetector).
  • Keep the certain pieces — PaywallVerdict shape + publisherHost() — in a new
    src/core/extractor/paywall.ts.
  • extraction-store: a 200 response is never content-flagged. Whatever Defuddle
    extracts is the article, and an empty extraction is a plain failure. Drop the
    extract-first MIN_ARTICLE_CHARS guard (no longer needed without heuristics).
  • handlePaywalledFetch: session-expired is now inferred from an authenticated
    retry that still fails to extract, replacing the second content-heuristic check.
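The decision flow described in the last two bullets might look roughly like the sketch below. Everything here is an assumption made for illustration: the `FetchResult` and `Outcome` types, both function names, and the choice to treat other non-200 statuses as plain failures are not taken from the PR. The sketch also takes already-extracted text as input rather than calling Defuddle.

```typescript
// Hypothetical shapes, for illustration only.
interface FetchResult {
  status: number;
  extractedText: string; // whatever the extractor returned; "" if nothing
}

type Outcome =
  | { kind: "article"; text: string }
  | { kind: "extraction-failed" }
  | { kind: "paywalled" }
  | { kind: "session-expired" };

const GATED = new Set([401, 402, 403, 451]);

// Anonymous fetch: only the HTTP status can mark a page as paywalled.
function classifyAnonymous(r: FetchResult): Outcome {
  if (GATED.has(r.status)) return { kind: "paywalled" };
  const text = r.extractedText.trim();
  // A 200 is never content-flagged: any extracted text is the article,
  // and an empty extraction is a plain failure.
  if (r.status === 200 && text.length > 0) return { kind: "article", text };
  // Assumption: other non-gated statuses are also treated as plain failures.
  return { kind: "extraction-failed" };
}

// Authenticated retry, attempted only after a "paywalled" verdict:
// if the signed-in fetch still yields nothing, infer an expired session.
function classifyAuthenticatedRetry(r: FetchResult): Outcome {
  const text = r.extractedText.trim();
  return text.length > 0 ? { kind: "article", text } : { kind: "session-expired" };
}
```

Note that neither function has a length threshold or a phrase list; short articles and articles containing subscription wording pass through as ordinary content.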

The extension authenticated-retry path, paywallMap, getPaywallVerdict, and the
reader-pane PaywallPrompt are unchanged — they now trigger off HTTP status.

Key files: src/core/extractor/paywall.ts (new), src/stores/extraction-store.ts,
docs/decisions/028-remove-paywall-content-heuristics.md (new), feature 019 +
ADR 020 amended.

Tests: deleted the detector unit tests; rewrote extraction-store-paywall to
trigger the gated/retry paths via HTTP status and assert no content false
positives (incl. #211 regressions). Full suite green (3743 passing), tsc clean.

@vercel

vercel Bot commented Jun 5, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| feedzero | Ready | Preview, Comment | Jun 5, 2026 4:40pm |

forcingfx merged commit 6ae8747 into main on Jun 5, 2026
15 checks passed
forcingfx deleted the claude/remove-paywall-detection-QuJ5A branch on June 15, 2026 02:27
forcingfx restored the claude/remove-paywall-detection-QuJ5A branch on June 15, 2026 02:30
2 participants