refactor: remove content-heuristic paywall detection (ADR 028) #230

Open
forcingfx wants to merge 2 commits into main from claude/remove-paywall-detection-QuJ5A

Conversation

@forcingfx

Owner

Why

The per-publisher content paywall detectors (src/core/extractor/paywall-detectors/) guessed at paywalls from page HTML — phrase lists ("Subscribe to read", "Already a subscriber?") plus a 600-char body-too-short threshold, with NYT/Economist specializations. The signal was unreliable in the worst direction: free articles whose nav/footer/newsletter chrome shipped those industry-standard phrases were flagged as paywalled (issue #211). The #211 fix had already layered an "extract-first, only consult heuristics on a thin result" guard on top to compensate for a signal that was fundamentally guessing.
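The failure mode described above can be reproduced with a deliberately simplified sketch. This is not the removed detector code: `PAYWALL_PHRASES`, `naivePhraseDetector`, and `freeArticle` are hypothetical names, and the real detectors also applied a body-length threshold and per-publisher rules that are omitted here.

```typescript
// Simplified illustration only; this is not the removed detector code.
const PAYWALL_PHRASES = ["subscribe to read", "already a subscriber?"];

// Hypothetical helper: flags a page if any phrase appears anywhere in its text.
function naivePhraseDetector(pageText: string): boolean {
  const haystack = pageText.toLowerCase();
  return PAYWALL_PHRASES.some((phrase) => haystack.includes(phrase));
}

// A free, full-length article whose footer carries newsletter chrome.
const freeArticle =
  "Long article body. ".repeat(200) + "Footer: Already a subscriber? Sign in.";

console.log(naivePhraseDetector(freeArticle)); // true: a false positive
```

The phrase match cannot tell article text from page chrome, which is the kind of false positive reported in #211.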

What changed

Replace the guess with the publisher's own answer: a paywall is recognized only by a gated HTTP status (401/402/403/451) on the anonymous /api/page fetch. No phrase lists, no per-publisher upkeep, no false positives from page chrome.
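As a minimal sketch of the status-only rule, assuming the check is a simple set membership test: the names `GATED_STATUSES` and `isGatedStatus` are illustrative and do not come from the PR.

```typescript
// Hypothetical names; only the four status codes are taken from the PR description.
const GATED_STATUSES: ReadonlySet<number> = new Set([401, 402, 403, 451]);

// True only when the publisher's response status itself says the content is gated.
function isGatedStatus(status: number): boolean {
  return GATED_STATUSES.has(status);
}
```

Under this rule a 200 response is never treated as paywalled, whatever its body contains, and a 404 or 500 is an ordinary failure rather than a paywall signal.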

  • Deleted src/core/extractor/paywall-detectors/ (default/nytimes/economist detectors, registry, visible-text, detectPaywall, PaywallDetector) + their two test files.
  • Kept the certain pieces — PaywallVerdict shape + publisherHost() — in a new src/core/extractor/paywall.ts.
  • extraction-store.ts: a 200 response is never content-flagged; whatever Defuddle extracts is the article, and an empty extraction is a plain failure. Dropped the MIN_ARTICLE_CHARS extract-first guard (unneeded without heuristics). session-expired is now inferred from an authenticated retry that yields no extractable article, replacing the second content-heuristic check.
  • Unchanged: the extension authenticated-retry path, paywallMap, getPaywallVerdict, and the reader-pane PaywallPrompt — they now trigger off HTTP status.
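The decision flow in the list above might look roughly like the sketch below. Every name here is a hypothetical stand-in: `Outcome` is not the PR's `PaywallVerdict`, `hostOf` only guesses at what `publisherHost()` does, `Extract` stands in for the Defuddle step, and treating other non-200 statuses as plain failures is an assumption.

```typescript
// All names below are hypothetical stand-ins, not the PR's actual types or functions.
type Extract = (html: string) => string; // stand-in for the Defuddle extraction step

type Outcome =
  | { kind: "article"; text: string }
  | { kind: "failed" }
  | { kind: "paywalled"; host: string }
  | { kind: "session-expired"; host: string };

const GATED = new Set([401, 402, 403, 451]);

// Assumed behaviour for illustration; the real publisherHost() may normalise differently.
function hostOf(url: string): string {
  return new URL(url).hostname;
}

// Anonymous fetch: only the HTTP status can mark a page as paywalled.
function classifyAnonymousFetch(
  url: string,
  status: number,
  html: string,
  extract: Extract,
): Outcome {
  if (GATED.has(status)) return { kind: "paywalled", host: hostOf(url) };
  if (status !== 200) return { kind: "failed" }; // assumption: other statuses are plain failures
  const text = extract(html).trim();
  return text.length > 0 ? { kind: "article", text } : { kind: "failed" };
}

// Authenticated retry: a retry that yields nothing extractable is read as an expired session.
function classifyAuthenticatedRetry(url: string, html: string, extract: Extract): Outcome {
  const text = extract(html).trim();
  return text.length > 0
    ? { kind: "article", text }
    : { kind: "session-expired", host: hostOf(url) };
}
```

With this shape, a 200 page whose footer says "Already a subscriber?" is classified as an article as long as extraction returns text, since the body is never inspected for gating phrases.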

Scope note

This is the surgical removal: only the unreliable content heuristics are gone. HTTP-status paywall gating and the (flag-gated, dormant) browser-extension authenticated-fetch plumbing are retained.

Docs

  • New ADR 028 (docs/decisions/028-remove-paywall-content-heuristics.md).
  • Amended ADR 020 + feature 019 to reflect HTTP-status-only recognition.
  • Fixed a stale protocol.ts comment.

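Tests

  • Deleted the detector unit tests.
  • Rewrote extraction-store-paywall to trigger the gated/retry paths via HTTP status and assert no content false positives (incl. #211 regressions).
  • Full suite green (3743 passing), tsc clean.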

Closes #211.

https://claude.ai/code/session_01M2QkatKS68qcK1irnGJitw


Generated by Claude Code

claude and others added 2 commits June 4, 2026 04:56
The per-publisher content detectors (src/core/extractor/paywall-detectors/)
guessed at paywalls from page HTML — phrase lists ("Subscribe to read",
"Already a subscriber?") plus a 600-char body-too-short threshold. They were
unreliable in the worst direction: free articles whose nav/footer/newsletter
chrome shipped those industry-standard phrases were flagged as paywalled
(issue #211). The #211 fix layered an "extract-first, only consult heuristics
on a thin result" guard on top to compensate for a signal that was guessing.

Replace the guess with the publisher's own answer: a paywall is recognized
only by a gated HTTP status (401/402/403/451) on the anonymous /api/page
fetch. That status is the publisher telling us the content is gated — no
phrase lists, no per-publisher upkeep, no false positives from page chrome.

What changed:
- Delete src/core/extractor/paywall-detectors/ (default/nytimes/economist
  detectors, registry, visible-text, detectPaywall, PaywallDetector).
- Keep the certain pieces — PaywallVerdict shape + publisherHost() — in a new
  src/core/extractor/paywall.ts.
- extraction-store: a 200 response is never content-flagged; whatever Defuddle
  extracts is the article, empty extraction is a plain failure. Drop the
  extract-first MIN_ARTICLE_CHARS guard (no longer needed without heuristics).
- handlePaywalledFetch: session-expired is now inferred from an authenticated
  retry that won't extract, replacing the second content-heuristic check.

The extension authenticated-retry path, paywallMap, getPaywallVerdict, and the
reader-pane PaywallPrompt are unchanged — they now trigger off HTTP status.

Key files: src/core/extractor/paywall.ts (new), src/stores/extraction-store.ts,
docs/decisions/028-remove-paywall-content-heuristics.md (new), feature 019 +
ADR 020 amended.

Tests: deleted the detector unit tests; rewrote extraction-store-paywall to
trigger the gated/retry paths via HTTP status and assert no content false
positives (incl. #211 regressions). Full suite green (3743 passing), tsc clean.
@vercel

vercel Bot commented Jun 15, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Deployment | Actions | Updated (UTC)
feedzero | Ready | Preview, Comment | Jun 15, 2026 2:44am

Development

Successfully merging this pull request may close these issues.

bug: Articles incorrectly flagged as paywalled & settings menu doesn't display the installed version