refactor: remove content-heuristic paywall detection (ADR 028) by forcingfx · Pull Request #230 · forcingfx/feedzero

forcingfx · 2026-06-15T02:33:32Z

Why

The per-publisher content paywall detectors (src/core/extractor/paywall-detectors/) guessed at paywalls from page HTML — phrase lists ("Subscribe to read", "Already a subscriber?") plus a 600-char body-too-short threshold, with NYT/Economist specializations. The signal was unreliable in the worst direction: free articles whose nav/footer/newsletter chrome shipped those industry-standard phrases were flagged as paywalled (issue #211). The #211 fix had already layered an "extract-first, only consult heuristics on a thin result" guard on top to compensate for a signal that was fundamentally guessing.

What changed

Replace the guess with the publisher's own answer: a paywall is recognized only by a gated HTTP status (401/402/403/451) on the anonymous /api/page fetch. No phrase lists, no per-publisher upkeep, no false positives from page chrome.
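A minimal sketch of status-only recognition, for illustration. The field names on `PaywallVerdict`, the `publisherHost()` behaviour (stripping a leading `www.`), and the `verdictFromStatus` helper are assumptions made for this example; the actual definitions in src/core/extractor/paywall.ts may differ.

```typescript
// Hypothetical shape: the real PaywallVerdict in paywall.ts may carry other fields.
type PaywallVerdict = { paywalled: boolean; host: string; status: number };

// The gated statuses named in this PR for the anonymous /api/page fetch.
const GATED_STATUSES: ReadonlySet<number> = new Set([401, 402, 403, 451]);

// Assumption: the host is normalised by dropping a leading "www.".
function publisherHost(url: string): string {
  return new URL(url).hostname.replace(/^www\./, "");
}

// Hypothetical helper: the verdict depends only on the HTTP status,
// never on the page body.
function verdictFromStatus(url: string, status: number): PaywallVerdict {
  return {
    paywalled: GATED_STATUSES.has(status),
    host: publisherHost(url),
    status,
  };
}
```

Under these assumptions a 402 from `https://www.example.com/a` yields a paywalled verdict for host `example.com`, while a 200 or a 404 does not, whatever phrases the page chrome contains.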

Deleted src/core/extractor/paywall-detectors/ (default/nytimes/economist detectors, registry, visible-text, detectPaywall, PaywallDetector) + their two test files.
Kept the certain pieces — PaywallVerdict shape + publisherHost() — in a new src/core/extractor/paywall.ts.
extraction-store.ts: a 200 response is never content-flagged; whatever Defuddle extracts is the article, an empty extraction is a plain failure. Dropped the MIN_ARTICLE_CHARS extract-first guard (unneeded without heuristics). session-expired is now inferred from an authenticated retry that won't extract, replacing the second content-heuristic check.
Unchanged: the extension authenticated-retry path, paywallMap, getPaywallVerdict, and the reader-pane PaywallPrompt — they now trigger off HTTP status.
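The resulting decision flow can be sketched roughly as follows. All names here (`ExtractionOutcome`, `classifyAnonymous`, `classifyAuthenticatedRetry`) are hypothetical and do not come from extraction-store.ts. Treating any non-200, non-gated status as a plain failure is also an assumption of this sketch.

```typescript
// Hypothetical outcome type; the real store state is likely richer.
type ExtractionOutcome =
  | { kind: "article"; content: string }
  | { kind: "failed" }
  | { kind: "paywalled" }
  | { kind: "session-expired" };

const GATED = new Set<number>([401, 402, 403, 451]);

// Anonymous fetch: only the HTTP status can flag a paywall.
// `extracted` stands in for whatever the extractor returned (null if nothing).
function classifyAnonymous(status: number, extracted: string | null): ExtractionOutcome {
  if (GATED.has(status)) return { kind: "paywalled" };
  if (status === 200 && extracted !== null && extracted.trim().length > 0) {
    // A 200 is never content-flagged: whatever was extracted is the article.
    return { kind: "article", content: extracted };
  }
  // Empty extraction (or, by assumption, any other status) is a plain failure.
  return { kind: "failed" };
}

// Authenticated retry after a gated status: if it still will not extract,
// infer that the session has expired.
function classifyAuthenticatedRetry(extracted: string | null): ExtractionOutcome {
  if (extracted !== null && extracted.trim().length > 0) {
    return { kind: "article", content: extracted };
  }
  return { kind: "session-expired" };
}
```

Note that there is no minimum-length check anywhere in this flow, which mirrors the removal of the MIN_ARTICLE_CHARS guard: a short but non-empty extraction is still treated as the article.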

Scope note

This is the surgical removal: only the unreliable content heuristics are gone. HTTP-status paywall gating and the (flag-gated, dormant) browser-extension authenticated-fetch plumbing are retained.

Docs

New ADR 028 (docs/decisions/028-remove-paywall-content-heuristics.md).
Amended ADR 020 + feature 019 to reflect HTTP-status-only recognition.
Fixed a stale protocol.ts comment.

Tests

Deleted the detector unit tests (detect-paywall, visible-text).
Rewrote extraction-store-paywall to drive the gated/retry paths via HTTP status and assert no content false positives (incl. the #211 regressions).
npx tsc --noEmit clean; full suite green (3743 passing, no regressions).

Closes #211.

https://claude.ai/code/session_01M2QkatKS68qcK1irnGJitw

Generated by Claude Code

The per-publisher content detectors (src/core/extractor/paywall-detectors/) guessed at paywalls from page HTML: phrase lists ("Subscribe to read", "Already a subscriber?") plus a 600-char body-too-short threshold. They were unreliable in the worst direction: free articles whose nav/footer/newsletter chrome shipped those industry-standard phrases were flagged as paywalled (issue #211). The #211 fix layered an "extract-first, only consult heuristics on a thin result" guard on top to compensate for a signal that was guessing.

Replace the guess with the publisher's own answer: a paywall is recognized only by a gated HTTP status (401/402/403/451) on the anonymous /api/page fetch. That status is the publisher telling us the content is gated: no phrase lists, no per-publisher upkeep, no false positives from page chrome.

What changed:

- Delete src/core/extractor/paywall-detectors/ (default/nytimes/economist detectors, registry, visible-text, detectPaywall, PaywallDetector).
- Keep the certain pieces, the PaywallVerdict shape and publisherHost(), in a new src/core/extractor/paywall.ts.
- extraction-store: a 200 response is never content-flagged; whatever Defuddle extracts is the article, and an empty extraction is a plain failure. Drop the extract-first MIN_ARTICLE_CHARS guard (no longer needed without heuristics).
- handlePaywalledFetch: session-expired is now inferred from an authenticated retry that won't extract, replacing the second content-heuristic check.

The extension authenticated-retry path, paywallMap, getPaywallVerdict, and the reader-pane PaywallPrompt are unchanged; they now trigger off HTTP status.

Key files: src/core/extractor/paywall.ts (new), src/stores/extraction-store.ts, docs/decisions/028-remove-paywall-content-heuristics.md (new), feature 019 + ADR 020 amended.

Tests: deleted the detector unit tests; rewrote extraction-store-paywall to trigger the gated/retry paths via HTTP status and assert no content false positives (incl. #211 regressions).

Full suite green (3743 passing), tsc clean.

vercel · 2026-06-15T02:44:17Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
feedzero	Ready	Preview, Comment	Jun 15, 2026 2:44am

claude and others added 2 commits June 4, 2026 04:56

Merge branch 'main' into claude/remove-paywall-detection-QuJ5A

0c4cff3

vercel Bot deployed to Preview June 15, 2026 02:44

forcingfx wants to merge 2 commits into main from claude/remove-paywall-detection-QuJ5A
