diff --git a/docs/decisions/020-browser-extension-surface.md b/docs/decisions/020-browser-extension-surface.md index a3cfdc53..18450f95 100644 --- a/docs/decisions/020-browser-extension-surface.md +++ b/docs/decisions/020-browser-extension-surface.md @@ -2,7 +2,7 @@ ## Status -Accepted (2026-05-21). +Accepted (2026-05-21). **Amended by [ADR 028](028-remove-paywall-content-heuristics.md) (2026-06-04):** the per-publisher *content* detectors referenced below (`src/core/extractor/paywall-detectors/`) were removed for being unreliable (false positives on free articles, issue #211). Paywall recognition is now keyed solely off the publisher's gated HTTP status on the anonymous fetch. The extension-as-transport decision itself stands; only the "detection lives in per-publisher detectors" mechanism changed. ## Context diff --git a/docs/decisions/028-remove-paywall-content-heuristics.md b/docs/decisions/028-remove-paywall-content-heuristics.md new file mode 100644 index 00000000..05f23ddd --- /dev/null +++ b/docs/decisions/028-remove-paywall-content-heuristics.md @@ -0,0 +1,81 @@ +# ADR 028: Remove Content-Heuristic Paywall Detection + +## Status + +Accepted (2026-06-04). Amends [ADR 020](020-browser-extension-surface.md). + +## Context + +Feature 019 (authenticated full-text fetching) shipped a content-heuristic +paywall detector in `src/core/extractor/paywall-detectors/`. It inspected the +fetched page HTML and flagged an article as paywalled when either: + +- the body matched an industry-wide or per-publisher "Subscribe to read" / + "Already a subscriber?" phrase list (`default-detector`, `nytimes`, + `economist`), or +- the visible text was shorter than a 600-character threshold + (`body-too-short`, via `visible-text.ts`). + +The heuristic was unreliable in the direction that hurts most: **false +positives on free articles.** Many free articles ship the exact phrases the +detector keyed on in their nav/footer/newsletter chrome (Wired, NYT, lttlabs, +…). 
Issue #211 was filed when fully-readable free articles rendered a "Paywalled +article" prompt instead of their content. The #211 fix layered an +"extract-first, only consult heuristics on a thin result" guard on top — adding +complexity to compensate for a signal that was guessing in the first place. + +A phrase list is a guess about another company's HTML. It needs per-publisher +upkeep ("publishers change paywall HTML twice a year"), and every guess is a +chance to gate a free article. Meanwhile there is a signal that is not a guess: +when a publisher refuses an anonymous request, it returns a gated HTTP status +(401 Unauthorized, 402 Payment Required, 403 Forbidden, 451). That status is +the publisher *telling us* the content is gated. + +## Decision + +Remove the content-heuristic detectors entirely. Recognize a paywall **only** +from a gated HTTP status (401/402/403/451) on the anonymous `/api/page` fetch. + +- Delete `src/core/extractor/paywall-detectors/` (default/nytimes/economist + detectors, the registry, `visible-text.ts`, `detectPaywall`, and the + `PaywallDetector` interface). +- Keep the small, certain pieces — the `PaywallVerdict` shape and + `publisherHost()` — in a new `src/core/extractor/paywall.ts`. +- A 200 response is never re-classified as paywalled from its content: whatever + Defuddle extracts is the article; an empty extraction is a plain failure. +- The extension authenticated-retry path is retained, now triggered by the + gated status. A retry that still yields no readable article is treated as + `session-expired` (the cookie has expired) — previously this was a second + content-heuristic check on the retried HTML; it is now the absence of a + readable extraction. + +The extension plumbing, `paywallMap`, `getPaywallVerdict`, and the reader-pane +`PaywallPrompt` are unchanged. 
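As a reading aid for the Decision above, here is a minimal sketch of what status-only recognition could look like. It is an illustration under stated assumptions, not the code in `src/core/extractor/paywall.ts`: `PaywallVerdict` and `publisherHost()` follow the shapes shown elsewhere in this diff, while `GATED_STATUSES`, `verdictFromStatus`, and the `http-<status>` reason string are hypothetical names invented for the example.

```typescript
// Sketch only. Names other than PaywallVerdict and publisherHost are hypothetical.
type PaywallVerdict =
  | { paywalled: false; publisher: string | null }
  | { paywalled: true; publisher: string | null; reason: string };

// The statuses the ADR treats as "the publisher says this is gated".
const GATED_STATUSES: ReadonlySet<number> = new Set([401, 402, 403, 451]);

// Mirrors the publisherHost() helper shown in this diff:
// lower-cased hostname with a leading "www." stripped, null when unparseable.
function publisherHost(rawUrl: string): string | null {
  try {
    const host = new URL(rawUrl).hostname.toLowerCase();
    return host.startsWith("www.") ? host.slice(4) : host;
  } catch {
    return null;
  }
}

// Hypothetical helper: derive a verdict from the anonymous fetch status alone.
// The page body is never inspected, so a 200 can never be flagged as paywalled.
function verdictFromStatus(status: number, url: string): PaywallVerdict {
  const publisher = publisherHost(url);
  if (GATED_STATUSES.has(status)) {
    // The reason format is an assumption made for this example.
    return { paywalled: true, publisher, reason: `http-${status}` };
  }
  return { paywalled: false, publisher };
}
```

Because the function takes no HTML argument, the #211 class of false positive cannot be reintroduced by accident: there is nothing for a phrase list to match against.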
+ +## Consequences + +**Positive** + +- No more false positives from page chrome — the #211 class of bug is gone by + construction, and the extract-first guard it required is removed too. +- Zero per-publisher upkeep. HTTP status needs no phrase lists; adding a + publisher is mostly verifying its anonymous fetch is gated and its + authenticated retry extracts. +- Less code and a smaller bundle (eight detector modules + their tests removed). + +**Negative / accepted trade-offs** + +- A publisher that serves a *200 with a paywall stub* (rather than a gated + status) now reads as an empty extraction → a plain "extraction found nothing" + failure, not a paywall prompt. We judged the soft-paywall miss less harmful + than gating free articles, and the reader still offers "Open original". +- `session-expired` is now inferred from a non-extracting authenticated retry + rather than a positive content match, so a genuinely empty (but free) page + fetched through the extension would surface the session-expired copy. This + only occurs on the already-gated path for an authorized publisher. 
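The trade-offs above can be made concrete with a small decision function. This is a hedged sketch of the flow the ADR describes, assuming the store has already collected the anonymous status, the anonymous extraction, the authorization state, and the retry extraction. The `Attempt` and `Outcome` types and the `classify` function are hypothetical and do not appear in the codebase; the extension network-error fallback is left out for brevity.

```typescript
// Sketch only. Attempt, Outcome, and classify are hypothetical names.
type Outcome =
  | { kind: "article"; content: string }
  | { kind: "extraction-failed" } // plain failure, no paywall prompt
  | { kind: "paywalled" }         // gated status, no authenticated retry possible
  | { kind: "session-expired" };  // gated status, authenticated retry did not extract

interface Attempt {
  anonymousStatus: number;         // status of the anonymous /api/page fetch
  anonymousExtract: string | null; // extracted article from the anonymous HTML, if any
  authorized: boolean;             // extension present and publisher authorized
  retryExtract: string | null;     // extracted article from the authenticated retry, if any
}

const GATED = [401, 402, 403, 451];

function classify(a: Attempt): Outcome {
  if (!GATED.includes(a.anonymousStatus)) {
    // Non-gated response: content is never re-classified as paywalled.
    // A 200 paywall stub therefore lands in "extraction-failed".
    return a.anonymousExtract
      ? { kind: "article", content: a.anonymousExtract }
      : { kind: "extraction-failed" };
  }
  if (!a.authorized) {
    return { kind: "paywalled" };
  }
  // Gated and authorized: the absence of a readable extraction after the
  // retry is what gets reported as session-expired.
  return a.retryExtract
    ? { kind: "article", content: a.retryExtract }
    : { kind: "session-expired" };
}
```

Both accepted trade-offs are visible here: the soft-paywall miss is the `extraction-failed` branch on a non-gated status, and the inferred `session-expired` is the final branch, which is only reachable on the already-gated, authorized path.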
+ +## References + +- Feature 019: `docs/features/019-authenticated-fetch.md` +- ADR 020: `docs/decisions/020-browser-extension-surface.md` +- Issue #211 (free articles wrongly flagged paywalled) diff --git a/docs/features/019-authenticated-fetch.md b/docs/features/019-authenticated-fetch.md index cf87f819..1823f3c7 100644 --- a/docs/features/019-authenticated-fetch.md +++ b/docs/features/019-authenticated-fetch.md @@ -8,15 +8,25 @@ |---|---|---| | 1 | Web-app `protocol.ts` + MV3 extension scaffold + `ping` handshake | ✅ Shipped (`ade8970`, `9688f2d`) | | 2 | `fetch-article` round-trip with cookies, permission gate, scheme guard | ✅ Shipped (`f6ff5eb`) | -| 3 | Paywall detectors, `authorize-publisher` protocol message, extension store, reader-pane prompt, extraction-store wiring | ✅ Shipped (slices on `claude/paywall-extension-feature-hR15P`) | +| 3 | HTTP-status paywall gating, `authorize-publisher` protocol message, extension store, reader-pane prompt, extraction-store wiring | ✅ Shipped (slices on `claude/paywall-extension-feature-hR15P`) | + +> **Update (ADR 028):** The content-heuristic paywall detectors (phrase lists + +> body-length thresholds, formerly `src/core/extractor/paywall-detectors/`) +> were **removed** — they false-positived on free articles (issue #211). The +> only paywall signal FeedZero now trusts is the publisher's own gated HTTP +> status (401/402/403/451) on the anonymous `/api/page` fetch. The verdict +> shape (`PaywallVerdict`) + `publisherHost()` now live in +> `src/core/extractor/paywall.ts`. The extension retry, `paywallMap`, and the +> reader-pane prompt are unchanged. 
| 4 | Settings tab listing authorized publishers; session-expired auto-refresh; Firefox parity | ⏳ Next | | 5 | Chrome Web Store + Firefox AMO distribution; Safari path | ⏳ | ### Shipping gate — `VITE_EXTENSION_ENABLED` -Paywall **detection** and the **"Open original"** fallback ship now: hitting a -paywalled article shows a clean "Paywalled article → Open original" card -instead of a broken extraction. The **extension CTAs** (Install the FeedZero +Paywall **gating** (by HTTP status) and the **"Open original"** fallback ship +now: a publisher that refuses the anonymous fetch (401/402/403/451) shows a +clean "Paywalled article → Open original" card instead of a broken extraction. +The **extension CTAs** (Install the FeedZero extension, Authorize ``, session-expired sign-in) are gated behind `isExtensionEnabled()` (`src/core/extension/extension-enabled.ts`), which reads `VITE_EXTENSION_ENABLED` and **defaults off**. The boot-time `detect()` ping is @@ -89,10 +99,10 @@ content-script.js │ window.postMessage(response, origin) ▼ Page receives response via protocol.ts `fetchArticle()` - │ Web app: paywall detector → Defuddle → reader + │ Web app: Defuddle → reader (re-extract; empty ⇒ session-expired) ``` -The extension is pure transport. **Paywall detection, content extraction, and rendering all stay on the web-app side** so the extension's surface stays small (currently ~3KB bundled) and the detection logic versions with the web app, not the user's local extension. +The extension is pure transport. **Paywall gating, content extraction, and rendering all stay on the web-app side** so the extension's surface stays small (currently ~3KB bundled). Gating is by the publisher's HTTP status, not page content (ADR 028). ### Files @@ -102,25 +112,23 @@ The extension is pure transport. 
**Paywall detection, content extraction, and re |------|------| | `src/core/extension/protocol.ts` | Message envelope types (`OutboundMessage`, `InboundMessage`), `ping()` for detection, `fetchArticle(url)` for the authenticated fetch, `authorizePublisher(domain)` for the runtime host-permission grant. Origin-pinned `window.postMessage` transport with `requestId` correlation and timeout. | -#### Web app — `src/core/extractor/paywall-detectors/` +#### Web app — `src/core/extractor/paywall.ts` -| File | Role | +| Export | Role | |------|------| -| `types.ts` | `PaywallVerdict` discriminated union + `PaywallDetector` interface. | -| `host.ts` | `publisherHost(url)` — canonical publisher host with leading `www.` stripped. | -| `visible-text.ts` | `visibleTextLength(html)` — crude tag-stripping length heuristic, sync, dep-free. | -| `default-detector.ts` | Substring scan over industry-wide paywall phrases + body-too-short fallback (600-char threshold). | -| `nytimes.ts` | Publisher-specific detector for `nytimes.com` and its subdomains (e.g. `cooking.nytimes.com`). | -| `economist.ts` | Publisher-specific detector for `economist.com`. | -| `registry.ts` | Ordered first-match registry. | -| `index.ts` | Registers detectors; exports `detectPaywall(html, url): PaywallVerdict`. | +| `PaywallVerdict` | Verdict shape recorded for a gated article (`paywalled: true`, `publisher`, `reason`). | +| `publisherHost(url)` | Canonical publisher host with leading `www.` stripped; null when unparseable. | + +> The content-heuristic detectors (`paywall-detectors/`) were removed in ADR +> 028. There is no `detectPaywall(html, url)` anymore — a paywall is recognized +> only by the publisher's gated HTTP status on the anonymous fetch. #### Web app — stores + UI | File | Role | |------|------| | `src/stores/extension-store.ts` | Zustand mirror of extension presence + per-publisher grants. 
`status: "unknown" \| "installed" \| "absent"`, `authorizedDomains[]`, `detect()`, `requestPublisherAccess(domain)`, `isAuthorized(domain)`. | -| `src/stores/extraction-store.ts` | On every `/api/page` response, runs `detectPaywall`. If gated + authorized for the publisher, retries via `fetchArticle()`; if still gated marks `session-expired`. Surfaces `paywallMap` for the reader pane. | +| `src/stores/extraction-store.ts` | A gated `/api/page` status (401/402/403/451) is the paywall signal. If gated + authorized for the publisher, retries via `fetchArticle()`; a retry that still won't extract marks `session-expired`. Surfaces `paywallMap` for the reader pane. | | `src/components/reader/paywall-prompt.tsx` | Four-state reader-pane affordance: install-extension, authorize-``, session-expired, fallback "Open original". | | `src/app.tsx` (`AppInit`) | Calls `useExtensionStore.getState().detect()` once at boot so the prompt picks the right CTA without per-render pings. | @@ -147,11 +155,10 @@ The extension is pure transport. **Paywall detection, content extraction, and re |------|----------| | `tests/core/extension/protocol.test.ts` | 13 cases: ping round-trip / timeout / requestId mismatch / protocol-version envelope / origin filter, fetchArticle success / failure-reason forwarding / URL forwarded / timeout, authorizePublisher grant / decline / domain forwarded / timeout. Uses a `fakeExtension` helper that stands in for the content script. | | `tests/extension/handlers.test.ts` | 15 cases: ping happy path, malformed/non-FeedZero messages rejected, response-typed messages rejected (echo-loop guard), wrong protocol version rejected, fetch happy path / no-permission short-circuit / blocked-scheme / network-error wraps throws / malformed fetch (no url), authorize-publisher grant / decline / runtime throw / missing-domain / scheme-or-path domain rejected. All IO mocked via `HandlerContext`. 
| -| `tests/core/extractor/paywall-detectors/detect-paywall.test.ts` | 10 cases: NYT phrase-match across www / cooking subdomain, NYT false-negative on long body, default phrase-match, default body-too-short, null publisher for unparseable URL, verdict shape. | | `tests/stores/extension-store.test.ts` | 8 cases: detect installed / absent / repeated calls, requestPublisherAccess grant / decline / timeout, dedupe on re-grant, isAuthorized reflection. | -| `tests/stores/extraction-store-paywall.test.ts` | 8 cases: paywall verdict on absent extension, skip extension fetch when unauthorized, no-op for clean articles, authenticated retry success, session-expired on still-gated retry, fallback verdict on extension network-error, `getPaywallVerdict` selector. | -| `tests/components/reader/paywall-prompt.test.tsx` | 8 cases: install affordance, open-original fallback, authorize-button shown / disabled in-flight / clicking calls store, quiet stub during unknown probe, session-expired sign-in link, null-publisher collapse. | -| `tests/components/reader/reader-panel-paywall.test.tsx` | 3 cases: prompt renders only in extracted view with a verdict, session-expired copy surfaces correctly. | +| `tests/stores/extraction-store-paywall.test.ts` | A 200 response is never content-flagged (incl. #211 regressions); gated-status verdict on absent extension; authenticated retry success; session-expired when the retry won't extract; fallback verdict on extension network-error; `getPaywallVerdict` selector. | +| `tests/components/reader/paywall-prompt.test.tsx` | Install affordance, open-original fallback, authorize-button shown / disabled in-flight / clicking calls store, quiet stub during unknown probe, session-expired sign-in link, null-publisher collapse. | +| `tests/components/reader/reader-panel-paywall.test.tsx` | Prompt renders only in extracted view with a verdict; session-expired copy surfaces correctly. | End-to-end manual smoke test in `extension/README.md`. 
Real-extension Playwright test (`tests/e2e/extension.spec.ts`) is Phase 4 work. @@ -159,7 +166,7 @@ End-to-end manual smoke test in `extension/README.md`. Real-extension Playwright - **Browser extension, not server-side credential storage.** A FeedZero-hosted "store your NYT cookies with us, encrypted" path was rejected up front — it would make FeedZero a credentials-storage target and require cookie refreshes, both of which conflict with the no-data-leaves-browser principle. The extension is the only shape that keeps credentials in the place they already live (the user's browser session) and routes the authenticated fetch through the same place. -- **Extension is pure transport.** Paywall detection lives in the web app under `src/core/extractor/paywall-detectors/` (Phase 3). Per-publisher detector logic versions with FeedZero releases — users don't need to update the extension every time NYT ships a new paywall variant. The extension itself stays small (~3KB) and rarely changes. +- **Extension is pure transport.** Paywall recognition lives in the web app, keyed off the publisher's gated HTTP status (ADR 028). The earlier per-publisher content detectors were removed for being unreliable; HTTP status needs no per-publisher upkeep. The extension itself stays small (~3KB) and rarely changes. - **Per-publisher `optional_host_permissions`, no global access at install.** The extension manifest declares `host_permissions: []` and only `optional_host_permissions: ["https://*/*"]` as the reservoir. Each publisher is granted via `chrome.permissions.request` on user action ("Authorize nytimes.com" in the reader pane). This means: (a) the install prompt says "needs no special permissions," (b) the user retains per-domain control, (c) revoking is `chrome.permissions.remove` per domain — no need to uninstall. @@ -178,8 +185,8 @@ A fresh session continuing this work should read: 1. This doc (`docs/features/019-authenticated-fetch.md`). 2. 
`docs/decisions/020-browser-extension-surface.md` for the why. 3. `src/core/extension/protocol.ts` and `extension/src/handlers.ts` — page <-> extension wire format. -4. `src/core/extractor/paywall-detectors/index.ts` — where to add a new publisher. -5. `src/stores/extension-store.ts` and `src/stores/extraction-store.ts` — orchestration. +4. `src/stores/extraction-store.ts` — gated-status recognition + retry orchestration. +5. `src/stores/extension-store.ts` — extension presence + per-publisher grants. 6. `src/components/reader/paywall-prompt.tsx` — the four-state UI. 7. `extension/README.md` for the smoke-test procedure. @@ -188,7 +195,7 @@ A fresh session continuing this work should read: - **Settings tab listing authorized publishers** — read from `useExtensionStore.authorizedDomains` + a per-domain "Revoke" button that calls a new `feedzero/revoke-publisher` protocol message routing to `chrome.permissions.remove`. Mirror in chrome.storage so the popup can render the same list when the page is not open. - **Session-expired auto-refresh** — when the user clicks "Open `` to sign in", the reader pane could subscribe to `visibilitychange` and auto-retry the fetch on tab return. Today the user must manually toggle "Full text" off and on again. - **Firefox parity** — Firefox's MV3 differs from Chrome's in `optional_host_permissions` semantics. Verify the install / authorize flow on Firefox Beta; document any divergence in `extension/README.md`. -- **Additional publishers** — at minimum WSJ, FT, Economist, Bloomberg, Atlantic, New Yorker. Each is a new file in `src/core/extractor/paywall-detectors/` registered in `index.ts`. Take care with Bloomberg (anti-bot CAPTCHA) — may need per-publisher header overrides on the extension's `fetchUrl`. +- **Additional publishers** — at minimum WSJ, FT, Economist, Bloomberg, Atlantic, New Yorker. 
With content detectors gone (ADR 028), "supporting" a publisher is mostly verifying its anonymous fetch returns a gated status and that the authenticated retry extracts. Take care with Bloomberg (anti-bot CAPTCHA) — may need per-publisher header overrides on the extension's `fetchUrl`. - **Real-extension Playwright test** — `tests/e2e/extension.spec.ts`. Boot Chromium with `--load-extension=extension/dist`, open the reader on a fixture NYT page, click "Authorize", assert the prompt disappears and Defuddle output renders. Will need a stub HTTP endpoint that serves both the paywalled and authenticated variants based on a cookie. ### Open questions for Phase 4 @@ -202,6 +209,6 @@ A fresh session continuing this work should read: - Mobile: Chrome on Android works (limited install UX); Firefox Mobile works; iOS Safari requires a stub iOS app — deferred to v2. - Cookie expiry is detected reactively, not proactively. We see the paywall stub and tell the user to refresh. -- Each publisher's paywall changes break the *parse*, not the fetch. Per-publisher detectors need updates twice a year (rough industry norm). +- Gating now keys off the publisher's HTTP status, not page content, so a publisher restyling its paywall HTML no longer breaks recognition. A publisher that serves a 200 stub instead of a gated status will read as an empty extraction (plain failure), not a paywall prompt — an accepted trade-off for dropping the false-positive-prone content heuristics (ADR 028). - No background prefetch — articles are extracted on-click only. Background prefetch is a post-MVP design pass; it raises rate-limit and credential-burn concerns that need their own threat-model. - Self-hosters: the extension hardcodes `my.feedzero.app` + `feedzero.app` + `localhost:3000` as content-script origins today. A configurable FeedZero origin (per the plan) is a Phase 4 polish item. 
diff --git a/src/core/extension/protocol.ts b/src/core/extension/protocol.ts index 266d1853..9ee8a4e9 100644 --- a/src/core/extension/protocol.ts +++ b/src/core/extension/protocol.ts @@ -43,8 +43,8 @@ export type PingResponse = { /** * The extension's reply for a fetch-article request. `ok: true` carries the * raw HTML the extension received with the user's session; the web app then - * runs paywall detection + Defuddle on it. `ok: false` reasons are - * extension-side failures only — paywall detection is the web app's job. + * runs Defuddle on it (a retry that won't extract is treated as a paywall + * session-expired). `ok: false` reasons are extension-side failures only. */ export type FetchArticleResponse = | { diff --git a/src/core/extractor/paywall-detectors/default-detector.ts b/src/core/extractor/paywall-detectors/default-detector.ts deleted file mode 100644 index c42047ee..00000000 --- a/src/core/extractor/paywall-detectors/default-detector.ts +++ /dev/null @@ -1,45 +0,0 @@ -import type { PaywallDetector, PaywallVerdict } from "./types.ts"; -import { publisherHost } from "./host.ts"; -import { visibleTextLength } from "./visible-text.ts"; - -/** - * Phrases that publishers across the industry use in their paywall stubs. - * The list is conservative — false positives turn a free article into a - * "we think this is paywalled" prompt, which is a worse UX than missing a - * gate. Case-insensitive substring match. - */ -const PAYWALL_PHRASES = [ - "subscribe to read", - "subscribe to continue", - "already a subscriber?", - "this article is for subscribers", - "this story is for subscribers", - "to continue reading this article", - "create a free account to keep reading", -]; - -/** - * Below this many visible characters, the page is almost certainly a stub - * rather than the full article. Default-detector threshold only; individual - * publisher detectors may use their own. 
- */ -const MIN_BODY_LENGTH = 600; - -export const defaultDetector: PaywallDetector = { - name: "default", - publisher: null, - matches: () => true, - detect(html, url): PaywallVerdict { - const publisher = publisherHost(url); - const lower = html.toLowerCase(); - for (const phrase of PAYWALL_PHRASES) { - if (lower.includes(phrase)) { - return { paywalled: true, publisher, reason: "phrase-match" }; - } - } - if (visibleTextLength(html) < MIN_BODY_LENGTH) { - return { paywalled: true, publisher, reason: "body-too-short" }; - } - return { paywalled: false, publisher }; - }, -}; diff --git a/src/core/extractor/paywall-detectors/economist.ts b/src/core/extractor/paywall-detectors/economist.ts deleted file mode 100644 index 07ccd1c3..00000000 --- a/src/core/extractor/paywall-detectors/economist.ts +++ /dev/null @@ -1,49 +0,0 @@ -import type { PaywallDetector, PaywallVerdict } from "./types.ts"; - -/** - * The Economist paywall detector. Matches the recurring CTA strings shown - * inside the subscribe block on `economist.com` (and its `www.` subdomain). - * As with NYT, we keep the phrase list small and high-confidence — false - * positives surface as "Authorize " prompts on free articles, - * which is a worse UX than a missed gate. 
- */ -const ECONOMIST_PHRASES = [ - "subscribe to the economist", - "to continue reading this article you need to subscribe", - "subscribe to continue", - "get unlimited access to economist.com", -]; - -const ECONOMIST_HOST_SUFFIXES = ["economist.com"]; - -function isEconomistHost(host: string): boolean { - return ECONOMIST_HOST_SUFFIXES.some( - (suffix) => host === suffix || host.endsWith(`.${suffix}`), - ); -} - -export const economistDetector: PaywallDetector = { - name: "economist", - publisher: "economist.com", - matches(url) { - try { - const host = new URL(url).hostname.toLowerCase(); - return isEconomistHost(host); - } catch { - return false; - } - }, - detect(html): PaywallVerdict { - const lower = html.toLowerCase(); - for (const phrase of ECONOMIST_PHRASES) { - if (lower.includes(phrase)) { - return { - paywalled: true, - publisher: "economist.com", - reason: "economist-cta", - }; - } - } - return { paywalled: false, publisher: "economist.com" }; - }, -}; diff --git a/src/core/extractor/paywall-detectors/host.ts b/src/core/extractor/paywall-detectors/host.ts deleted file mode 100644 index 401721bf..00000000 --- a/src/core/extractor/paywall-detectors/host.ts +++ /dev/null @@ -1,14 +0,0 @@ -/** - * Extract a canonical publisher host (no leading "www.") from a URL string. - * Returns null when the URL cannot be parsed; the reader pane should treat a - * null publisher as "we cannot offer authorize-publisher UI for this article" - * and fall back to the install-extension prompt. - */ -export function publisherHost(rawUrl: string): string | null { - try { - const host = new URL(rawUrl).hostname.toLowerCase(); - return host.startsWith("www.") ? 
host.slice(4) : host; - } catch { - return null; - } -} diff --git a/src/core/extractor/paywall-detectors/index.ts b/src/core/extractor/paywall-detectors/index.ts deleted file mode 100644 index 30396622..00000000 --- a/src/core/extractor/paywall-detectors/index.ts +++ /dev/null @@ -1,25 +0,0 @@ -import { paywallRegistry } from "./registry.ts"; -import { defaultDetector } from "./default-detector.ts"; -import { nytimesDetector } from "./nytimes.ts"; -import { economistDetector } from "./economist.ts"; -import type { PaywallVerdict } from "./types.ts"; - -paywallRegistry.register(nytimesDetector); -paywallRegistry.register(economistDetector); -paywallRegistry.register(defaultDetector); - -/** - * Inspect fetched HTML for a paywall. Picks the first publisher-specific - * detector that claims the URL; falls through to the default detector - * otherwise. Returns a `PaywallVerdict` — see `./types.ts` for the shape. - * - * Caller (extraction-store) uses the verdict to decide: - * paywalled=false → render anonymous HTML as today - * paywalled=true → ask the extension to refetch with credentials - */ -export function detectPaywall(html: string, url: string): PaywallVerdict { - const detector = paywallRegistry.findDetector(url) ?? defaultDetector; - return detector.detect(html, url); -} - -export type { PaywallVerdict, PaywallDetector } from "./types.ts"; diff --git a/src/core/extractor/paywall-detectors/nytimes.ts b/src/core/extractor/paywall-detectors/nytimes.ts deleted file mode 100644 index f8c9835f..00000000 --- a/src/core/extractor/paywall-detectors/nytimes.ts +++ /dev/null @@ -1,49 +0,0 @@ -import type { PaywallDetector, PaywallVerdict } from "./types.ts"; - -/** - * NYT-specific paywall detector. Targets the recurring CTA strings the NYT - * site renders inside the gate component on `nytimes.com` and `cooking.nytimes.com`. - * We deliberately do not match on a single phrase; NYT has shipped multiple - * variants in the last year. 
The intersection of "already a subscriber?" with - * any of the gate-class names below has been stable. - */ -const NYT_PHRASES = [ - "already a subscriber?", - "subscribe to read", - "create a free account to keep reading", - "you have been granted access", -]; - -const NYT_HOST_SUFFIXES = ["nytimes.com"]; - -function isNytHost(host: string): boolean { - return NYT_HOST_SUFFIXES.some( - (suffix) => host === suffix || host.endsWith(`.${suffix}`), - ); -} - -export const nytimesDetector: PaywallDetector = { - name: "nytimes", - publisher: "nytimes.com", - matches(url) { - try { - const host = new URL(url).hostname.toLowerCase(); - return isNytHost(host); - } catch { - return false; - } - }, - detect(html): PaywallVerdict { - const lower = html.toLowerCase(); - for (const phrase of NYT_PHRASES) { - if (lower.includes(phrase)) { - return { - paywalled: true, - publisher: "nytimes.com", - reason: "nyt-cta", - }; - } - } - return { paywalled: false, publisher: "nytimes.com" }; - }, -}; diff --git a/src/core/extractor/paywall-detectors/registry.ts b/src/core/extractor/paywall-detectors/registry.ts deleted file mode 100644 index 638f872d..00000000 --- a/src/core/extractor/paywall-detectors/registry.ts +++ /dev/null @@ -1,23 +0,0 @@ -import type { PaywallDetector } from "./types.ts"; - -/** - * Ordered registry of paywall detectors. Order matters: the first detector - * whose `matches(url)` returns true gets to make the call. Default detector - * is registered last and matches every URL. 
- */ -class PaywallDetectorRegistry { - private detectors: PaywallDetector[] = []; - - register(detector: PaywallDetector): void { - this.detectors.push(detector); - } - - findDetector(url: string): PaywallDetector | null { - for (const detector of this.detectors) { - if (detector.matches(url)) return detector; - } - return null; - } -} - -export const paywallRegistry = new PaywallDetectorRegistry(); diff --git a/src/core/extractor/paywall-detectors/types.ts b/src/core/extractor/paywall-detectors/types.ts deleted file mode 100644 index 4e9d9b80..00000000 --- a/src/core/extractor/paywall-detectors/types.ts +++ /dev/null @@ -1,32 +0,0 @@ -/** - * A paywall detector inspects fetched HTML and decides whether the article - * appears gated. Detectors are pure functions — no DOM, no network — so the - * same module runs in the web app, in tests, and (someday) in a service - * worker without a window. - * - * Each detector either claims a URL via `matches(url)` or punts to the next - * one in the chain. The default detector matches every URL and runs as a - * fallback after publisher-specific detectors decline. - */ - -export type PaywallVerdict = - | { - paywalled: false; - publisher: string | null; - } - | { - paywalled: true; - publisher: string | null; - reason: string; - }; - -export interface PaywallDetector { - /** Human-readable name for debugging. */ - name: string; - /** Publisher-stable identifier returned in the verdict (e.g. "nytimes.com"). */ - publisher: string | null; - /** Whether this detector claims responsibility for the given URL. */ - matches(url: string): boolean; - /** Inspect the html; return a verdict. 
*/ - detect(html: string, url: string): PaywallVerdict; -} diff --git a/src/core/extractor/paywall-detectors/visible-text.ts b/src/core/extractor/paywall-detectors/visible-text.ts deleted file mode 100644 index de2e22ba..00000000 --- a/src/core/extractor/paywall-detectors/visible-text.ts +++ /dev/null @@ -1,33 +0,0 @@ -/** - * Crude visible-text length for a paywall heuristic. Strips scripts, styles, - * and tag markup; collapses whitespace. We never render the result — only - * its length matters. The detectors use this to flag a stub article (a page - * that fetched OK but whose body is tiny because the bulk is behind a gate). - * - * NOT a sanitizer. The output is a `number`, never HTML; the only consumer - * is `length < THRESHOLD` in default-detector.ts. The script/style regexes - * tolerate every closing-tag variant a tolerant HTML parser would accept - * (``, ``, ``) — both because CodeQL's - * `js/bad-tag-filter` flags any narrower form, and because real publisher - * HTML occasionally serves them. If a stray `"; - expect(visibleTextLength(html)).toBe("visible".length); - }); - - it("strips ` did not match - // ``. If the regex regresses, the script body leaks into the - // visible-text count and inflates length past the threshold. - const html = "

x

"; - expect(visibleTextLength(html)).toBe(1); - }); - - it("strips `. - const html = "

x

"; - expect(visibleTextLength(html)).toBe(1); - }); - - it("strips