2 changes: 1 addition & 1 deletion docs/decisions/020-browser-extension-surface.md
Expand Up @@ -2,7 +2,7 @@

## Status

Accepted (2026-05-21).
Accepted (2026-05-21). **Amended by [ADR 028](028-remove-paywall-content-heuristics.md) (2026-06-04):** the per-publisher *content* detectors referenced below (`src/core/extractor/paywall-detectors/`) were removed for being unreliable (false positives on free articles, issue #211). Paywall recognition is now keyed solely off the publisher's gated HTTP status on the anonymous fetch. The extension-as-transport decision itself stands; only the "detection lives in per-publisher detectors" mechanism changed.

## Context

81 changes: 81 additions & 0 deletions docs/decisions/028-remove-paywall-content-heuristics.md
@@ -0,0 +1,81 @@
# ADR 028: Remove Content-Heuristic Paywall Detection

## Status

Accepted (2026-06-04). Amends [ADR 020](020-browser-extension-surface.md).

## Context

Feature 019 (authenticated full-text fetching) shipped a content-heuristic
paywall detector in `src/core/extractor/paywall-detectors/`. It inspected the
fetched page HTML and flagged an article as paywalled when either:

- the body matched an industry-wide or per-publisher "Subscribe to read" /
"Already a subscriber?" phrase list (`default-detector`, `nytimes`,
`economist`), or
- the visible text was shorter than a 600-character threshold
(`body-too-short`, via `visible-text.ts`).

The heuristic was unreliable in the direction that hurts most: **false
positives on free articles.** Many free articles ship the exact phrases the
detector keyed on in their nav/footer/newsletter chrome (Wired, NYT, lttlabs,
…). Issue #211 was filed when fully-readable free articles rendered a "Paywalled
article" prompt instead of their content. The #211 fix layered an
"extract-first, only consult heuristics on a thin result" guard on top — adding
complexity to compensate for a signal that was guessing in the first place.

A phrase list is a guess about another company's HTML. It needs per-publisher
upkeep ("publishers change paywall HTML twice a year"), and every guess is a
chance to gate a free article. Meanwhile there is a signal that is not a guess:
when a publisher refuses an anonymous request, it returns a gated HTTP status
(401 Unauthorized, 402 Payment Required, 403 Forbidden, 451 Unavailable For
Legal Reasons). That status is the publisher *telling us* the content is gated.

## Decision

Remove the content-heuristic detectors entirely. Recognize a paywall **only**
from a gated HTTP status (401/402/403/451) on the anonymous `/api/page` fetch.

- Delete `src/core/extractor/paywall-detectors/` (default/nytimes/economist
detectors, the registry, `visible-text.ts`, `detectPaywall`, and the
`PaywallDetector` interface).
- Keep the small, certain pieces — the `PaywallVerdict` shape and
`publisherHost()` — in a new `src/core/extractor/paywall.ts`.
- A 200 response is never re-classified as paywalled from its content: whatever
Defuddle extracts is the article; an empty extraction is a plain failure.
- The extension authenticated-retry path is retained, now triggered by the
gated status. A retry that still yields no readable article is treated as
`session-expired` (the cookie has expired) — previously this was a second
content-heuristic check on the retried HTML; it is now the absence of a
readable extraction.

The extension plumbing, `paywallMap`, `getPaywallVerdict`, and the reader-pane
`PaywallPrompt` are unchanged.
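
The status-only rule can be sketched as below. This is an illustration under stated assumptions, not the code in `src/core/extractor/paywall.ts`: only the four status codes and the `PaywallVerdict` field names (`paywalled`, `publisher`, `reason`) come from this ADR and the feature doc. `isGatedStatus`, `verdictFromStatus`, and the `http-<status>` reason string are hypothetical.

```typescript
// The four statuses this ADR treats as "the publisher says it is gated".
const GATED_STATUSES: ReadonlySet<number> = new Set([401, 402, 403, 451]);

// Field names follow the feature doc; the real set of `reason` values is not
// documented here, so a plain string is assumed.
type PaywallVerdict = {
  paywalled: true;
  publisher: string | null;
  reason: string;
};

// Hypothetical helper: true only for a gated status on the anonymous fetch.
function isGatedStatus(status: number): boolean {
  return GATED_STATUSES.has(status);
}

// Hypothetical helper: a verdict exists only for a gated status. A 200 (or any
// other status) never yields a verdict, whatever the page content says.
function verdictFromStatus(
  status: number,
  publisher: string | null,
): PaywallVerdict | null {
  if (!isGatedStatus(status)) return null;
  return { paywalled: true, publisher, reason: `http-${status}` };
}
```

Because the function never sees the HTML, page chrome such as a "Subscribe to read" footer cannot influence the result, which is the property the #211 regressions depend on.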

## Consequences

**Positive**

- No more false positives from page chrome — the #211 class of bug is gone by
construction, and the extract-first guard it required is removed too.
- Zero per-publisher upkeep. HTTP status needs no phrase lists; adding a
publisher is mostly verifying its anonymous fetch is gated and its
authenticated retry extracts.
- Less code and a smaller bundle (eight detector modules + their tests removed).

**Negative / accepted trade-offs**

- A publisher that serves a *200 with a paywall stub* (rather than a gated
status) now reads as an empty extraction → a plain "extraction found nothing"
failure, not a paywall prompt. We judged the soft-paywall miss less harmful
than gating free articles, and the reader still offers "Open original".
- `session-expired` is now inferred from a non-extracting authenticated retry
rather than a positive content match, so a genuinely empty (but free) page
fetched through the extension would surface the session-expired copy. This
only occurs on the already-gated path for an authorized publisher.

## References

- Feature 019: `docs/features/019-authenticated-fetch.md`
- ADR 020: `docs/decisions/020-browser-extension-surface.md`
- Issue #211 (free articles wrongly flagged paywalled)
59 changes: 33 additions & 26 deletions docs/features/019-authenticated-fetch.md
Expand Up @@ -8,15 +8,25 @@
|---|---|---|
| 1 | Web-app `protocol.ts` + MV3 extension scaffold + `ping` handshake | ✅ Shipped (`ade8970`, `9688f2d`) |
| 2 | `fetch-article` round-trip with cookies, permission gate, scheme guard | ✅ Shipped (`f6ff5eb`) |
| 3 | Paywall detectors, `authorize-publisher` protocol message, extension store, reader-pane prompt, extraction-store wiring | ✅ Shipped (slices on `claude/paywall-extension-feature-hR15P`) |
| 3 | HTTP-status paywall gating, `authorize-publisher` protocol message, extension store, reader-pane prompt, extraction-store wiring | ✅ Shipped (slices on `claude/paywall-extension-feature-hR15P`) |
| 4 | Settings tab listing authorized publishers; session-expired auto-refresh; Firefox parity | ⏳ Next |
| 5 | Chrome Web Store + Firefox AMO distribution; Safari path | ⏳ |

> **Update (ADR 028):** The content-heuristic paywall detectors (phrase lists +
> body-length thresholds, formerly `src/core/extractor/paywall-detectors/`)
> were **removed**: they produced false positives on free articles (issue
> #211). The only paywall signal FeedZero now trusts is the publisher's own
> gated HTTP status (401/402/403/451) on the anonymous `/api/page` fetch. The
> verdict shape (`PaywallVerdict`) + `publisherHost()` now live in
> `src/core/extractor/paywall.ts`. The extension retry, `paywallMap`, and the
> reader-pane prompt are unchanged.

### Shipping gate — `VITE_EXTENSION_ENABLED`

Paywall **detection** and the **"Open original"** fallback ship now: hitting a
paywalled article shows a clean "Paywalled article → Open original" card
instead of a broken extraction. The **extension CTAs** (Install the FeedZero
Paywall **gating** (by HTTP status) and the **"Open original"** fallback ship
now: a publisher that refuses the anonymous fetch (401/402/403/451) shows a
clean "Paywalled article → Open original" card instead of a broken extraction.
The **extension CTAs** (Install the FeedZero
extension, Authorize `<publisher>`, session-expired sign-in) are gated behind
`isExtensionEnabled()` (`src/core/extension/extension-enabled.ts`), which reads
`VITE_EXTENSION_ENABLED` and **defaults off**. The boot-time `detect()` ping is
Expand Down Expand Up @@ -89,10 +99,10 @@ content-script.js
│ window.postMessage(response, origin)
Page receives response via protocol.ts `fetchArticle()`
│ Web app: paywall detector → Defuddle → reader
│ Web app: Defuddle → reader (re-extract; empty ⇒ session-expired)
```

The extension is pure transport. **Paywall detection, content extraction, and rendering all stay on the web-app side** so the extension's surface stays small (currently ~3KB bundled) and the detection logic versions with the web app, not the user's local extension.
The extension is pure transport. **Paywall gating, content extraction, and rendering all stay on the web-app side** so the extension's surface stays small (currently ~3KB bundled). Gating is by the publisher's HTTP status, not page content (ADR 028).

### Files

Expand All @@ -102,25 +112,23 @@ The extension is pure transport. **Paywall detection, content extraction, and re
|------|------|
| `src/core/extension/protocol.ts` | Message envelope types (`OutboundMessage`, `InboundMessage`), `ping()` for detection, `fetchArticle(url)` for the authenticated fetch, `authorizePublisher(domain)` for the runtime host-permission grant. Origin-pinned `window.postMessage` transport with `requestId` correlation and timeout. |
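
The `requestId` correlation and timeout described for `protocol.ts` can be sketched roughly as follows. This is a transport-agnostic illustration, not the real module: `createCorrelator`, `Outbound`, and `Inbound` are hypothetical names, and the actual code posts through an origin-pinned `window.postMessage` rather than an injected `post` callback.

```typescript
// Hypothetical envelope shapes; the real `OutboundMessage` / `InboundMessage`
// types carry more fields (for example a protocol version).
type Outbound = { requestId: string; payload: unknown };
type Inbound = { requestId: string; payload: unknown };

type Pending = {
  resolve: (value: unknown) => void;
  timer: ReturnType<typeof setTimeout>;
};

function createCorrelator(post: (msg: Outbound) => void, timeoutMs: number) {
  const pending = new Map<string, Pending>();
  let nextId = 0;

  return {
    // Send a request and wait for the reply carrying the same requestId.
    request(payload: unknown): Promise<unknown> {
      const requestId = `req-${nextId++}`;
      return new Promise((resolve, reject) => {
        const timer = setTimeout(() => {
          pending.delete(requestId);
          reject(new Error("timeout"));
        }, timeoutMs);
        pending.set(requestId, { resolve, timer });
        post({ requestId, payload });
      });
    },

    // Feed an inbound message in. Returns false when no request is waiting on
    // that requestId (a mismatch, or a reply that arrived after the timeout).
    handle(msg: Inbound): boolean {
      const entry = pending.get(msg.requestId);
      if (!entry) return false;
      clearTimeout(entry.timer);
      pending.delete(msg.requestId);
      entry.resolve(msg.payload);
      return true;
    },
  };
}
```

In a page, `handle` would be called from a `message` event listener after checking that `event.origin` is the expected origin; that check is left out here so the sketch runs without a DOM.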

#### Web app — `src/core/extractor/paywall-detectors/`
#### Web app — `src/core/extractor/paywall.ts`

| File | Role |
| Export | Role |
|------|------|
| `types.ts` | `PaywallVerdict` discriminated union + `PaywallDetector` interface. |
| `host.ts` | `publisherHost(url)` — canonical publisher host with leading `www.` stripped. |
| `visible-text.ts` | `visibleTextLength(html)` — crude tag-stripping length heuristic, sync, dep-free. |
| `default-detector.ts` | Substring scan over industry-wide paywall phrases + body-too-short fallback (600-char threshold). |
| `nytimes.ts` | Publisher-specific detector for `nytimes.com` and its subdomains (e.g. `cooking.nytimes.com`). |
| `economist.ts` | Publisher-specific detector for `economist.com`. |
| `registry.ts` | Ordered first-match registry. |
| `index.ts` | Registers detectors; exports `detectPaywall(html, url): PaywallVerdict`. |
| `PaywallVerdict` | Verdict shape recorded for a gated article (`paywalled: true`, `publisher`, `reason`). |
| `publisherHost(url)` | Canonical publisher host with leading `www.` stripped; null when unparseable. |

> The content-heuristic detectors (`paywall-detectors/`) were removed in ADR
> 028. There is no `detectPaywall(html, url)` anymore — a paywall is recognized
> only by the publisher's gated HTTP status on the anonymous fetch.
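
A minimal sketch of what `publisherHost(url)` is described as doing (strip a leading `www.`, return null when the URL is unparseable). The body below is an assumption built on the standard `URL` parser; the real implementation may differ, and treating an empty host as null is this sketch's own choice.

```typescript
// Sketch only: canonical publisher host, or null when the URL cannot be parsed.
function publisherHost(url: string): string | null {
  try {
    const host = new URL(url).hostname.toLowerCase();
    if (host === "") return null; // e.g. `mailto:` URLs have no host
    return host.startsWith("www.") ? host.slice(4) : host;
  } catch {
    // `new URL` throws a TypeError on unparseable input.
    return null;
  }
}
```

Under this sketch, `https://www.nytimes.com/a` and `https://nytimes.com/a` both map to `nytimes.com`, while `https://cooking.nytimes.com/x` keeps its subdomain.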

#### Web app — stores + UI

| File | Role |
|------|------|
| `src/stores/extension-store.ts` | Zustand mirror of extension presence + per-publisher grants. `status: "unknown" \| "installed" \| "absent"`, `authorizedDomains[]`, `detect()`, `requestPublisherAccess(domain)`, `isAuthorized(domain)`. |
| `src/stores/extraction-store.ts` | On every `/api/page` response, runs `detectPaywall`. If gated + authorized for the publisher, retries via `fetchArticle()`; if still gated marks `session-expired`. Surfaces `paywallMap` for the reader pane. |
| `src/stores/extraction-store.ts` | A gated `/api/page` status (401/402/403/451) is the paywall signal. If gated + authorized for the publisher, retries via `fetchArticle()`; a retry that still won't extract marks `session-expired`. Surfaces `paywallMap` for the reader pane. |
| `src/components/reader/paywall-prompt.tsx` | Four-state reader-pane affordance: install-extension, authorize-`<publisher>`, session-expired, fallback "Open original". |
| `src/app.tsx` (`AppInit`) | Calls `useExtensionStore.getState().detect()` once at boot so the prompt picks the right CTA without per-render pings. |
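
The extraction-store flow in the table above can be read as two small decisions, sketched here as pure functions. All names (`nextStep`, `outcomeOfRetry`, `FetchResult`, `Outcome`) and the `html` field on a successful fetch are hypothetical; the real store wires these decisions into Zustand state and `paywallMap`.

```typescript
// Assumed shape of the extension's reply; the docs only say `ok: true`
// carries the raw HTML and `ok: false` carries an extension-side reason.
type FetchResult = { ok: true; html: string } | { ok: false; reason: string };

type Outcome =
  | { kind: "article"; content: string }
  | { kind: "paywalled"; reason: "gated" | "session-expired" };

type NextStep = "extract" | "prompt" | "retry-via-extension";

const GATED = [401, 402, 403, 451];

// Decision 1: what to do with the anonymous `/api/page` response.
function nextStep(
  status: number,
  publisher: string | null,
  isAuthorized: (publisher: string) => boolean,
): NextStep {
  if (!GATED.includes(status)) return "extract"; // never content-flagged
  if (publisher === null || !isAuthorized(publisher)) return "prompt";
  return "retry-via-extension";
}

// Decision 2: how to read the authenticated retry.
function outcomeOfRetry(
  retry: FetchResult,
  extract: (html: string) => string | null,
): Outcome {
  // Extension-side failure (e.g. network error): fall back to the gated verdict.
  if (!retry.ok) return { kind: "paywalled", reason: "gated" };
  const content = extract(retry.html);
  // A retry that still will not extract is inferred to be an expired session.
  return content
    ? { kind: "article", content }
    : { kind: "paywalled", reason: "session-expired" };
}
```

Keeping the decisions free of I/O is a choice made for this sketch so they can be exercised without a browser; `extract` stands in for the Defuddle call.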

Expand All @@ -147,19 +155,18 @@ The extension is pure transport. **Paywall detection, content extraction, and re
|------|----------|
| `tests/core/extension/protocol.test.ts` | 13 cases: ping round-trip / timeout / requestId mismatch / protocol-version envelope / origin filter, fetchArticle success / failure-reason forwarding / URL forwarded / timeout, authorizePublisher grant / decline / domain forwarded / timeout. Uses a `fakeExtension` helper that stands in for the content script. |
| `tests/extension/handlers.test.ts` | 15 cases: ping happy path, malformed/non-FeedZero messages rejected, response-typed messages rejected (echo-loop guard), wrong protocol version rejected, fetch happy path / no-permission short-circuit / blocked-scheme / network-error wraps throws / malformed fetch (no url), authorize-publisher grant / decline / runtime throw / missing-domain / scheme-or-path domain rejected. All IO mocked via `HandlerContext`. |
| `tests/core/extractor/paywall-detectors/detect-paywall.test.ts` | 10 cases: NYT phrase-match across www / cooking subdomain, NYT false-negative on long body, default phrase-match, default body-too-short, null publisher for unparseable URL, verdict shape. |
| `tests/stores/extension-store.test.ts` | 8 cases: detect installed / absent / repeated calls, requestPublisherAccess grant / decline / timeout, dedupe on re-grant, isAuthorized reflection. |
| `tests/stores/extraction-store-paywall.test.ts` | 8 cases: paywall verdict on absent extension, skip extension fetch when unauthorized, no-op for clean articles, authenticated retry success, session-expired on still-gated retry, fallback verdict on extension network-error, `getPaywallVerdict` selector. |
| `tests/components/reader/paywall-prompt.test.tsx` | 8 cases: install affordance, open-original fallback, authorize-button shown / disabled in-flight / clicking calls store, quiet stub during unknown probe, session-expired sign-in link, null-publisher collapse. |
| `tests/components/reader/reader-panel-paywall.test.tsx` | 3 cases: prompt renders only in extracted view with a verdict, session-expired copy surfaces correctly. |
| `tests/stores/extraction-store-paywall.test.ts` | A 200 response is never content-flagged (incl. #211 regressions); gated-status verdict on absent extension; authenticated retry success; session-expired when the retry won't extract; fallback verdict on extension network-error; `getPaywallVerdict` selector. |
| `tests/components/reader/paywall-prompt.test.tsx` | Install affordance, open-original fallback, authorize-button shown / disabled in-flight / clicking calls store, quiet stub during unknown probe, session-expired sign-in link, null-publisher collapse. |
| `tests/components/reader/reader-panel-paywall.test.tsx` | Prompt renders only in extracted view with a verdict; session-expired copy surfaces correctly. |

End-to-end manual smoke test in `extension/README.md`. Real-extension Playwright test (`tests/e2e/extension.spec.ts`) is Phase 4 work.

## Design decisions

- **Browser extension, not server-side credential storage.** A FeedZero-hosted "store your NYT cookies with us, encrypted" path was rejected up front — it would make FeedZero a credentials-storage target and require cookie refreshes, both of which conflict with the no-data-leaves-browser principle. The extension is the only shape that keeps credentials in the place they already live (the user's browser session) and routes the authenticated fetch through the same place.

- **Extension is pure transport.** Paywall detection lives in the web app under `src/core/extractor/paywall-detectors/` (Phase 3). Per-publisher detector logic versions with FeedZero releases — users don't need to update the extension every time NYT ships a new paywall variant. The extension itself stays small (~3KB) and rarely changes.
- **Extension is pure transport.** Paywall recognition lives in the web app, keyed off the publisher's gated HTTP status (ADR 028). The earlier per-publisher content detectors were removed for being unreliable; HTTP status needs no per-publisher upkeep. The extension itself stays small (~3KB) and rarely changes.

- **Per-publisher `optional_host_permissions`, no global access at install.** The extension manifest declares `host_permissions: []` and only `optional_host_permissions: ["https://*/*"]` as the reservoir. Each publisher is granted via `chrome.permissions.request` on user action ("Authorize nytimes.com" in the reader pane). This means: (a) the install prompt says "needs no special permissions," (b) the user retains per-domain control, (c) revoking is `chrome.permissions.remove` per domain — no need to uninstall.
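
The per-publisher grant can be sketched as below. `chrome.permissions.request` and `chrome.permissions.remove` are the real APIs named above, but everything else is an assumption: the `PermissionsApi` interface is a narrowed stand-in so the sketch compiles without Chrome typings, `originPatternFor` is hypothetical, and the `https://*.<domain>/*` pattern (apex plus subdomains) is a guess at what the extension requests.

```typescript
// The slice of `chrome.permissions` this sketch relies on.
interface PermissionsApi {
  request(permissions: { origins: string[] }): Promise<boolean>;
  remove(permissions: { origins: string[] }): Promise<boolean>;
}

// Accept only a bare domain. Inputs carrying a scheme or a path are rejected,
// mirroring the "scheme-or-path domain rejected" handler test case.
function originPatternFor(domain: string): string | null {
  const bareDomain = /^([a-z0-9-]+\.)+[a-z]{2,}$/i;
  if (!bareDomain.test(domain)) return null;
  return `https://*.${domain.toLowerCase()}/*`;
}

// Ask the browser for access to one publisher; false on decline or bad input.
async function authorize(
  domain: string,
  permissions: PermissionsApi,
): Promise<boolean> {
  const pattern = originPatternFor(domain);
  if (pattern === null) return false;
  return permissions.request({ origins: [pattern] });
}

// Revoke one publisher without uninstalling the extension.
async function revoke(
  domain: string,
  permissions: PermissionsApi,
): Promise<boolean> {
  const pattern = originPatternFor(domain);
  if (pattern === null) return false;
  return permissions.remove({ origins: [pattern] });
}
```

Inside the extension, `permissions` would be `chrome.permissions`; passing it in keeps the sketch testable with a fake.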

Expand All @@ -178,8 +185,8 @@ A fresh session continuing this work should read:
1. This doc (`docs/features/019-authenticated-fetch.md`).
2. `docs/decisions/020-browser-extension-surface.md` for the why.
3. `src/core/extension/protocol.ts` and `extension/src/handlers.ts` — page <-> extension wire format.
4. `src/core/extractor/paywall-detectors/index.ts` — where to add a new publisher.
5. `src/stores/extension-store.ts` and `src/stores/extraction-store.ts` — orchestration.
4. `src/stores/extraction-store.ts` — gated-status recognition + retry orchestration.
5. `src/stores/extension-store.ts` — extension presence + per-publisher grants.
6. `src/components/reader/paywall-prompt.tsx` — the four-state UI.
7. `extension/README.md` for the smoke-test procedure.

Expand All @@ -188,7 +195,7 @@ A fresh session continuing this work should read:
- **Settings tab listing authorized publishers** — read from `useExtensionStore.authorizedDomains` + a per-domain "Revoke" button that calls a new `feedzero/revoke-publisher` protocol message routing to `chrome.permissions.remove`. Mirror in chrome.storage so the popup can render the same list when the page is not open.
- **Session-expired auto-refresh** — when the user clicks "Open `<publisher>` to sign in", the reader pane could subscribe to `visibilitychange` and auto-retry the fetch on tab return. Today the user must manually toggle "Full text" off and on again.
- **Firefox parity** — Firefox's MV3 differs from Chrome's in `optional_host_permissions` semantics. Verify the install / authorize flow on Firefox Beta; document any divergence in `extension/README.md`.
- **Additional publishers** — at minimum WSJ, FT, Economist, Bloomberg, Atlantic, New Yorker. Each is a new file in `src/core/extractor/paywall-detectors/` registered in `index.ts`. Take care with Bloomberg (anti-bot CAPTCHA) — may need per-publisher header overrides on the extension's `fetchUrl`.
- **Additional publishers** — at minimum WSJ, FT, Economist, Bloomberg, Atlantic, New Yorker. With content detectors gone (ADR 028), "supporting" a publisher is mostly verifying its anonymous fetch returns a gated status and that the authenticated retry extracts. Take care with Bloomberg (anti-bot CAPTCHA) — may need per-publisher header overrides on the extension's `fetchUrl`.
- **Real-extension Playwright test** — `tests/e2e/extension.spec.ts`. Boot Chromium with `--load-extension=extension/dist`, open the reader on a fixture NYT page, click "Authorize", assert the prompt disappears and Defuddle output renders. Will need a stub HTTP endpoint that serves both the paywalled and authenticated variants based on a cookie.

### Open questions for Phase 4
Expand All @@ -202,6 +209,6 @@ A fresh session continuing this work should read:

- Mobile: Chrome on Android works (limited install UX); Firefox Mobile works; iOS Safari requires a stub iOS app — deferred to v2.
- Cookie expiry is detected reactively, not proactively: an authenticated retry that still fails to extract is treated as `session-expired`, and we tell the user to sign in again.
- Each publisher's paywall changes break the *parse*, not the fetch. Per-publisher detectors need updates twice a year (rough industry norm).
- Gating now keys off the publisher's HTTP status, not page content, so a publisher restyling its paywall HTML no longer breaks recognition. A publisher that serves a 200 stub instead of a gated status will read as an empty extraction (plain failure), not a paywall prompt — an accepted trade-off for dropping the false-positive-prone content heuristics (ADR 028).
- No background prefetch — articles are extracted on-click only. Background prefetch is a post-MVP design pass; it raises rate-limit and credential-burn concerns that need their own threat-model.
- Self-hosters: the extension hardcodes `my.feedzero.app` + `feedzero.app` + `localhost:3000` as content-script origins today. A configurable FeedZero origin (per the plan) is a Phase 4 polish item.
4 changes: 2 additions & 2 deletions src/core/extension/protocol.ts
Expand Up @@ -43,8 +43,8 @@ export type PingResponse = {
/**
* The extension's reply for a fetch-article request. `ok: true` carries the
* raw HTML the extension received with the user's session; the web app then
* runs paywall detection + Defuddle on it. `ok: false` reasons are
* extension-side failures only — paywall detection is the web app's job.
* runs Defuddle on it (a retry that won't extract is treated as a paywall
* session-expired). `ok: false` reasons are extension-side failures only.
*/
export type FetchArticleResponse =
| {