refactor: remove content-heuristic paywall detection (ADR 028) #226
Merged
The per-publisher content detectors (src/core/extractor/paywall-detectors/)
guessed at paywalls from page HTML — phrase lists ("Subscribe to read",
"Already a subscriber?") plus a 600-char body-too-short threshold. They were
unreliable in the worst direction: free articles whose nav/footer/newsletter
chrome shipped those industry-standard phrases were flagged as paywalled
(issue #211). The #211 fix layered an "extract-first, only consult heuristics
on a thin result" guard on top to compensate for a signal that was guessing.
Replace the guess with the publisher's own answer: a paywall is recognized
only by a gated HTTP status (401/402/403/451) on the anonymous /api/page
fetch. That status is the publisher telling us the content is gated — no
phrase lists, no per-publisher upkeep, no false positives from page chrome.
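
As a rough illustration of the status-only rule above, a minimal sketch follows. GATED_STATUSES, StatusVerdict, and verdictFromStatus are hypothetical names chosen for this example; the real PaywallVerdict shape in src/core/extractor/paywall.ts may differ.

```typescript
// Sketch only: GATED_STATUSES, StatusVerdict, and verdictFromStatus are
// hypothetical names, not the identifiers used in src/core/extractor/paywall.ts.
const GATED_STATUSES: ReadonlySet<number> = new Set<number>([401, 402, 403, 451]);

// Assumed shape for illustration; the actual PaywallVerdict is not shown in this PR text.
interface StatusVerdict {
  paywalled: boolean;
  host: string;
  status: number;
}

// Derives the verdict from the HTTP status alone; the response body is never inspected.
function verdictFromStatus(host: string, status: number): StatusVerdict {
  return { paywalled: GATED_STATUSES.has(status), host, status };
}
```

Because the function never sees the page HTML, phrases such as "Subscribe to read" in nav or footer chrome cannot influence the result.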
What changed:
- Delete src/core/extractor/paywall-detectors/ (default/nytimes/economist
  detectors, registry, visible-text, detectPaywall, PaywallDetector).
- Keep the certain pieces (PaywallVerdict shape + publisherHost()) in a new
  src/core/extractor/paywall.ts.
- extraction-store: a 200 response is never content-flagged; whatever Defuddle
  extracts is the article, and an empty extraction is a plain failure. Drop the
  extract-first MIN_ARTICLE_CHARS guard (no longer needed without heuristics).
- handlePaywalledFetch: session-expired is now inferred from an authenticated
  retry that fails to extract, replacing the second content-heuristic check.
The extension authenticated-retry path, paywallMap, getPaywallVerdict, and the
reader-pane PaywallPrompt are unchanged — they now trigger off HTTP status.
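
The resulting decision flow can be sketched roughly as below. This is an illustration under stated assumptions, not the code in extraction-store.ts: Outcome, FetchResult, classifyAnonymous, and classifyRetry are hypothetical names, the extractor is passed in as a plain function standing in for Defuddle, and treating other non-200 statuses as a plain failure is an assumption.

```typescript
// Sketch only: all names here are hypothetical stand-ins for the store logic.
const GATED = new Set<number>([401, 402, 403, 451]);

type Outcome =
  | { kind: "article"; text: string }
  | { kind: "failed" }
  | { kind: "paywalled"; status: number }
  | { kind: "session-expired"; status: number };

interface FetchResult {
  status: number;
  html: string;
}

// Stand-in for the real extractor (Defuddle in this PR); returns article text.
type Extract = (html: string) => string;

// Anonymous fetch: only a gated status marks a paywall. A 200 body is never
// content-flagged; an empty extraction is a plain failure.
function classifyAnonymous(res: FetchResult, extract: Extract): Outcome {
  if (GATED.has(res.status)) return { kind: "paywalled", status: res.status };
  // Assumption: any other non-200 status is treated as a plain failure.
  const text = res.status === 200 ? extract(res.html).trim() : "";
  return text.length > 0 ? { kind: "article", text } : { kind: "failed" };
}

// Authenticated retry after a gated status: if it still yields no article,
// infer that the session has expired.
function classifyRetry(gatedStatus: number, retryHtml: string, extract: Extract): Outcome {
  const text = extract(retryHtml).trim();
  return text.length > 0
    ? { kind: "article", text }
    : { kind: "session-expired", status: gatedStatus };
}
```

Under this sketch, a free article whose chrome contains "Subscribe to read" classifies as an article, which is the #211 case the old heuristics got wrong.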
Key files: src/core/extractor/paywall.ts (new), src/stores/extraction-store.ts,
docs/decisions/028-remove-paywall-content-heuristics.md (new), feature 019 +
ADR 020 amended.
Tests: deleted the detector unit tests; rewrote extraction-store-paywall to
trigger the gated/retry paths via HTTP status and assert no content false
positives (incl. #211 regressions). Full suite green (3743 passing), tsc clean.