refactor: remove content-heuristic paywall detection (ADR 028) by forcingfx · Pull Request #226 · forcingfx/feedzero

forcingfx · 2026-06-05T16:40:13Z

The per-publisher content detectors (src/core/extractor/paywall-detectors/)
guessed at paywalls from page HTML — phrase lists ("Subscribe to read",
"Already a subscriber?") plus a 600-char body-too-short threshold. They were
unreliable in the worst direction: free articles whose nav/footer/newsletter
chrome shipped those industry-standard phrases were flagged as paywalled
(issue #211). The #211 fix layered an "extract-first, only consult heuristics
on a thin result" guard on top to compensate for a signal that was guessing.

Replace the guess with the publisher's own answer: a paywall is recognized
only by a gated HTTP status (401/402/403/451) on the anonymous /api/page
fetch. That status is the publisher telling us the content is gated — no
phrase lists, no per-publisher upkeep, no false positives from page chrome.
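
As a minimal sketch of the status-only rule described above: the four status codes come from this description, while `GATED_STATUSES` and `isGatedStatus` are hypothetical names chosen for illustration and may not match the code in src/core/extractor/paywall.ts.

```typescript
// Hypothetical sketch: only the status codes are taken from the PR text.
// The identifiers below are illustrative, not the repository's actual names.
const GATED_STATUSES: ReadonlySet<number> = new Set([401, 402, 403, 451]);

// Returns true when the anonymous fetch status signals gated content.
// Page HTML is never inspected; the status code is the only signal.
function isGatedStatus(status: number): boolean {
  return GATED_STATUSES.has(status);
}
```

Under this rule a 404 or 500 is an ordinary fetch failure rather than a paywall, and a 200 is never treated as gated regardless of what the body says.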

What changed:

- Delete src/core/extractor/paywall-detectors/ (default/nytimes/economist
  detectors, registry, visible-text, detectPaywall, PaywallDetector).
- Keep the certain pieces (PaywallVerdict shape + publisherHost()) in a new
  src/core/extractor/paywall.ts.
- extraction-store: a 200 response is never content-flagged; whatever Defuddle
  extracts is the article, empty extraction is a plain failure. Drop the
  extract-first MIN_ARTICLE_CHARS guard (no longer needed without heuristics).
- handlePaywalledFetch: session-expired is now inferred from an authenticated
  retry that won't extract, replacing the second content-heuristic check.
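
The resulting decision flow can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in extraction-store: the outcome types and the functions `classifyAnonymousFetch` and `classifyAuthenticatedRetry` are hypothetical, and the extractor output is modeled as a plain string (or null when nothing was extracted).

```typescript
// Hypothetical sketch of the post-ADR-028 flow; all names are illustrative.
const GATED: ReadonlySet<number> = new Set([401, 402, 403, 451]);

type AnonymousOutcome =
  | { kind: "article"; text: string }
  | { kind: "paywalled"; status: number }
  | { kind: "failed"; reason: string };

type RetryOutcome =
  | { kind: "article"; text: string }
  | { kind: "session-expired" };

function hasContent(extracted: string | null): extracted is string {
  return extracted !== null && extracted.trim().length > 0;
}

// Anonymous fetch: only the HTTP status can produce a paywall verdict.
function classifyAnonymousFetch(
  status: number,
  extracted: string | null,
): AnonymousOutcome {
  if (GATED.has(status)) return { kind: "paywalled", status };
  if (status === 200) {
    // A 200 is never content-flagged: whatever was extracted is the article,
    // and an empty extraction is a plain failure, not a paywall.
    return hasContent(extracted)
      ? { kind: "article", text: extracted }
      : { kind: "failed", reason: "empty extraction" };
  }
  return { kind: "failed", reason: `unexpected status ${status}` };
}

// Authenticated retry: if it still yields nothing, infer an expired session.
function classifyAuthenticatedRetry(extracted: string | null): RetryOutcome {
  return hasContent(extracted)
    ? { kind: "article", text: extracted }
    : { kind: "session-expired" };
}
```

In this sketch, a free article whose footer contains "Subscribe to read" and arrives with a 200 is classified as an article, which is the #211 case the old phrase lists got wrong.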

The extension authenticated-retry path, paywallMap, getPaywallVerdict, and the
reader-pane PaywallPrompt are unchanged — they now trigger off HTTP status.

Key files: src/core/extractor/paywall.ts (new), src/stores/extraction-store.ts,
docs/decisions/028-remove-paywall-content-heuristics.md (new), feature 019 +
ADR 020 amended.

Tests: deleted the detector unit tests; rewrote extraction-store-paywall to
trigger the gated/retry paths via HTTP status and assert no content false
positives (incl. #211 regressions). Full suite green (3743 passing), tsc clean.

vercel · 2026-06-05T16:40:19Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
feedzero	Ready	Preview, Comment	Jun 5, 2026 4:40pm

forcingfx merged commit 6ae8747 into main Jun 5, 2026
15 checks passed

forcingfx deleted the claude/remove-paywall-detection-QuJ5A branch June 15, 2026 02:27

forcingfx restored the claude/remove-paywall-detection-QuJ5A branch June 15, 2026 02:30

forcingfx merged 1 commit into main from claude/remove-paywall-detection-QuJ5A
