Reader: 'Passer la publicité' still appearing despite being in JUNK_PATTERNS_TO_REMOVE #586

@mircealungu

Symptom

In the iOS reader, French articles still display "Passer la publicité" as a standalone paragraph between body paragraphs. A screenshot from a recent reader session confirms that it appears verbatim, mid-article, between two otherwise clean French sentences.

This is the French ad-skip prompt (literally "Skip the ad") — leftover from a video/ad placeholder on the source site that the cleaner should be stripping.

What we already know

The exact string is already in the blocklist at api/zeeguu/core/content_cleaning/content_cleaner.py:27:

JUNK_PATTERNS_TO_REMOVE = [
    ...
    # French cookie/ad notices
    "Passer la publicité",
    "La suite après cette publicité",
    ...
]

So the question is why the existing entry isn't catching this instance.

Hypotheses worth checking

  1. Article crawled before the entry was added. Cleaning is applied at crawl time only — old articles in the DB retain the artifact. Confirmable by checking when this entry landed (git blame on line 27) vs the article's published_time / crawl date.

  2. Wiring gap: JUNK_PATTERNS_TO_REMOVE may not flow into sent_filter_set. filter_noise_patterns (line 96) matches against sent_filter_set passed in by the caller, not JUNK_PATTERNS_TO_REMOVE directly. Worth verifying the caller actually unions both lists into the set.

  3. Sentence-tokenization splits the phrase. Matching uses sent_tokenize + exact match on normalize_sent(sent) (.lower().strip()). If the phrase appears with attached punctuation or wrapped inside a longer line on this particular site's HTML, the tokenized sentence won't equal the blocklist entry, even after normalization.

  4. Second cleaning path bypasses the list. The JS readability-server does its own cleanup via SpecificCleanup/. If that path runs at render time instead of crawl time, it has no awareness of this Python list.
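
To make hypotheses 2 and 3 concrete, here is a minimal sketch of exact-match filtering. It is not the actual content_cleaner.py code: normalize_sent follows the (.lower().strip()) description above, while is_junk, wired_set and unwired_set are hypothetical names introduced only for illustration, and the assumption that the caller stores normalized entries in the set is unverified.

```python
# Hypothetical stand-ins for the names used in this issue; the real
# implementations in content_cleaner.py may differ.
JUNK_PATTERNS_TO_REMOVE = [
    "Passer la publicité",
    "La suite après cette publicité",
]


def normalize_sent(sent):
    # Normalization as described above: lowercase, then strip outer whitespace.
    return sent.lower().strip()


def is_junk(sent, sent_filter_set):
    # Exact match of the normalized sentence against the filter set.
    return normalize_sent(sent) in sent_filter_set


# Assumption: a correctly wired caller adds the normalized blocklist entries.
wired_set = {normalize_sent(p) for p in JUNK_PATTERNS_TO_REMOVE}
# Hypothesis 2: the caller never adds JUNK_PATTERNS_TO_REMOVE to the set.
unwired_set = set()

candidates = [
    "Passer la publicité",        # clean standalone sentence
    "Passer la publicité.",       # trailing punctuation
    "Passer la\u00a0publicité",   # non-breaking space inside the phrase
    "Passer la publicité Le ministre a annoncé une réforme.",  # merged line
]

wired_results = [is_junk(c, wired_set) for c in candidates]
unwired_results = [is_junk(c, unwired_set) for c in candidates]

for cand, wired, unwired in zip(candidates, wired_results, unwired_results):
    print(repr(cand), "wired:", wired, "unwired:", unwired)
```

Under these assumptions only the first candidate is removed when the set is wired correctly, and nothing is removed when it is not. Either failure mode would leave the phrase visible in the reader.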

Suggested diagnosis order

  1. Check git log -- zeeguu/core/content_cleaning/content_cleaner.py and compare to article publish/crawl dates for affected URL.
  2. Trace the caller of filter_noise_patterns to confirm JUNK_PATTERNS_TO_REMOVE is actually included in sent_filter_set.
  3. Add a debug print of sent values being compared on a re-clean of an affected article — see whether the candidate sentence is exactly "passer la publicité" or has trailing characters.
  4. If wiring and tokenization are fine, the article most likely predates the blocklist entry; bulk re-clean older articles.
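
For step 3, a debug helper along these lines might make near misses easier to spot than a bare print. This is only a sketch: debug_compare is a hypothetical name, it takes already-tokenized sentences rather than calling the real tokenizer, and it assumes the same .lower().strip() normalization described above.

```python
import unicodedata


def debug_compare(sentences, entry):
    """Describe how each sentence differs from one blocklist entry."""
    target = entry.lower().strip()
    report = []
    for sent in sentences:
        normalized = sent.lower().strip()
        report.append({
            # repr() exposes invisible characters such as '\xa0'.
            "repr": repr(normalized),
            "exact_match": normalized == target,
            # True when the two strings differ only by Unicode composition.
            "nfc_match": unicodedata.normalize("NFC", normalized)
            == unicodedata.normalize("NFC", target),
            # Names of characters that never occur in the blocklist entry.
            "extra_chars": [
                unicodedata.name(ch, hex(ord(ch)))
                for ch in normalized
                if ch not in target
            ],
        })
    return report


sample = [
    "Passer la publicit\u00e9.",       # trailing full stop
    "Passer la\u00a0publicit\u00e9",   # non-breaking space
    "Passer la publicite\u0301",       # decomposed accent (NFD)
    "Passer la publicit\u00e9",        # exact phrase
]
report = debug_compare(sample, "Passer la publicit\u00e9")
for row in report:
    print(row)
```

If an affected article produces a row where exact_match is False but extra_chars is short, that points to hypothesis 3 rather than to a pre-list crawl.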

Out of scope (separate issues)

  • Whether other French ad-prompt variants are still missing from the list (e.g. "Passer cette publicité", different casing/punctuation)
  • Whether to make matching less brittle (substring instead of exact-match) — separate design discussion
