refactor: remove content-heuristic paywall detection (ADR 028) #230
Open
forcingfx wants to merge 2 commits into
The per-publisher content detectors (src/core/extractor/paywall-detectors/)
guessed at paywalls from page HTML — phrase lists ("Subscribe to read",
"Already a subscriber?") plus a 600-char body-too-short threshold. They were
unreliable in the worst direction: free articles whose nav/footer/newsletter
chrome shipped those industry-standard phrases were flagged as paywalled
(issue #211). The #211 fix layered an "extract-first, only consult heuristics
on a thin result" guard on top to compensate for a signal that was guessing.
Replace the guess with the publisher's own answer: a paywall is recognized
only by a gated HTTP status (401/402/403/451) on the anonymous /api/page
fetch. That status is the publisher telling us the content is gated — no
phrase lists, no per-publisher upkeep, no false positives from page chrome.
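A minimal sketch of the status-only rule described above, assuming nothing beyond the four status codes listed. The names `GATED_STATUSES` and `isGatedStatus` are hypothetical, chosen for illustration, and are not taken from the codebase:

```typescript
// Hypothetical helper: only the four status codes come from the description
// above; the identifiers are illustrative, not the project's actual API.
const GATED_STATUSES: ReadonlySet<number> = new Set([401, 402, 403, 451]);

/** True when the publisher's HTTP status says the content is gated. */
function isGatedStatus(status: number): boolean {
  return GATED_STATUSES.has(status);
}

// Page content never enters the decision: a 200 is not gated no matter
// which phrases appear in the nav, footer, or newsletter chrome.
const examples = [200, 401, 402, 403, 404, 451].map(
  (status) => `${status}: ${isGatedStatus(status) ? "gated" : "not gated"}`
);
console.log(examples.join("\n"));
```

The point of the design is that the check takes the status alone as input, so there is no body text for it to misread.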
What changed:
- Delete src/core/extractor/paywall-detectors/ (default/nytimes/economist
detectors, registry, visible-text, detectPaywall, PaywallDetector).
- Keep the certain pieces — PaywallVerdict shape + publisherHost() — in a new
src/core/extractor/paywall.ts.
- extraction-store: a 200 response is never content-flagged; whatever Defuddle
extracts is the article, empty extraction is a plain failure. Drop the
extract-first MIN_ARTICLE_CHARS guard (no longer needed without heuristics).
- handlePaywalledFetch: session-expired is now inferred from an authenticated
retry that won't extract, replacing the second content-heuristic check.
The extension authenticated-retry path, paywallMap, getPaywallVerdict, and the
reader-pane PaywallPrompt are unchanged — they now trigger off HTTP status.
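The resulting decision flow can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in extraction-store.ts: the type and function names are hypothetical, the extracted text is passed in as a plain string rather than coming from Defuddle, and treating non-gated, non-200 statuses as plain failures is an assumption the description does not spell out.

```typescript
// Hypothetical types and names; only the rules themselves come from the text.
type AnonymousOutcome =
  | { kind: "article"; text: string }
  | { kind: "paywalled"; status: number }
  | { kind: "failed"; reason: string };

type RetryOutcome =
  | { kind: "article"; text: string }
  | { kind: "session-expired" };

const GATED = new Set([401, 402, 403, 451]);

/** Classify the anonymous fetch from HTTP status plus whatever was extracted. */
function classifyAnonymousFetch(status: number, extracted: string | null): AnonymousOutcome {
  // Gated status: the publisher says the content is behind a paywall.
  if (GATED.has(status)) return { kind: "paywalled", status };
  // Assumption: any other non-200 status is an ordinary fetch failure.
  if (status !== 200) return { kind: "failed", reason: `HTTP ${status}` };
  // A 200 is never content-flagged: whatever was extracted is the article,
  // and an empty extraction is a plain failure, not a paywall.
  const text = (extracted ?? "").trim();
  if (text.length === 0) return { kind: "failed", reason: "empty extraction" };
  return { kind: "article", text };
}

/** After an authenticated retry, a result that won't extract implies an expired session. */
function classifyAuthenticatedRetry(extracted: string | null): RetryOutcome {
  const text = (extracted ?? "").trim();
  return text.length > 0 ? { kind: "article", text } : { kind: "session-expired" };
}
```

Under this sketch, a short free article that contains "Subscribe to read" in its chrome and returns 200 is classified as an article, which is the #211 case the change is meant to fix.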
Key files: src/core/extractor/paywall.ts (new), src/stores/extraction-store.ts,
docs/decisions/028-remove-paywall-content-heuristics.md (new), feature 019 +
ADR 020 amended.
Tests: deleted the detector unit tests; rewrote extraction-store-paywall to
trigger the gated/retry paths via HTTP status and assert no content false
positives (incl. #211 regressions). Full suite green (3743 passing), tsc clean.
Why
The per-publisher content paywall detectors (src/core/extractor/paywall-detectors/) guessed at paywalls from page HTML: phrase lists ("Subscribe to read", "Already a subscriber?") plus a 600-char body-too-short threshold, with NYT/Economist specializations. The signal was unreliable in the worst direction: free articles whose nav/footer/newsletter chrome shipped those industry-standard phrases were flagged as paywalled (issue #211). The #211 fix had already layered an "extract-first, only consult heuristics on a thin result" guard on top to compensate for a signal that was fundamentally guessing.

What changed
Replace the guess with the publisher's own answer: a paywall is recognized only by a gated HTTP status (401/402/403/451) on the anonymous /api/page fetch. No phrase lists, no per-publisher upkeep, no false positives from page chrome.
- Deleted src/core/extractor/paywall-detectors/ (default/nytimes/economist detectors, registry, visible-text, detectPaywall, PaywallDetector) + their two test files.
- Kept the certain pieces, the PaywallVerdict shape + publisherHost(), in a new src/core/extractor/paywall.ts.
- extraction-store.ts: a 200 response is never content-flagged; whatever Defuddle extracts is the article, an empty extraction is a plain failure. Dropped the MIN_ARTICLE_CHARS extract-first guard (unneeded without heuristics).
- handlePaywalledFetch: session-expired is now inferred from an authenticated retry that won't extract, replacing the second content-heuristic check.
- Unchanged: the extension authenticated-retry path, paywallMap, getPaywallVerdict, and the reader-pane PaywallPrompt; they now trigger off HTTP status.

Scope note
This is the surgical removal: only the unreliable content heuristics are gone. HTTP-status paywall gating and the (flag-gated, dormant) browser-extension authenticated-fetch plumbing are retained.

Docs
- New ADR 028 (docs/decisions/028-remove-paywall-content-heuristics.md).
- protocol.ts comment.

Tests
- Deleted the detector unit tests (detect-paywall, visible-text).
- Rewrote extraction-store-paywall to drive the gated/retry paths via HTTP status and assert no content false positives (incl. the #211 regressions).
- npx tsc --noEmit clean; full suite green (3743 passing, no regressions).

Closes #211.
https://claude.ai/code/session_01M2QkatKS68qcK1irnGJitw
Generated by Claude Code