
feat(legacy-archive): italic titles, plain_title cleanup, URL-encode legacy paths #149

Merged
RonanHevenor merged 2 commits into main from legacy-archive-completeness-2
May 7, 2026


@RonanHevenor
Member

Summary

Round 6 of legacy-archive cleanup, following up on PR #148. Three smaller polish items from the audit:

  • 165 WP italic/em/bold titles restored — source post_title carried inline <i>, <em>, <b>, <strong> markup that the original importer dropped when building the Lexical title doc (only plain_title had the tags stripped correctly). Article headlines like "Super MNC fails to live up to predecessor" now render italic.
  • 27 polluted plain_titles cleaned — 22 pipeline rows had numeric HTML entities (&#8217;, &#038;); 5+ poly-online rows had literal <i>...</i> tags. Decoded + tag-stripped in place.
  • 23 legacy_html_url chips URL-encoded — paths with literal spaces (e.g. /archive/wordpress/mirror/2018/09/12/Chris Mooney discusses current state of climate change/) now use %20, so the "View original" chip actually clicks through.
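
As a rough illustration of the second and third items, the plain_title cleanup and the path encoding could look something like the sketch below. The function names (`cleanPlainTitle`, `encodeLegacyPath`, and the helpers) are hypothetical and are not taken from the committed scripts. The sketch assumes that only numeric entities and the four inline tags listed above need handling.

```typescript
// Hypothetical helpers for illustration; not the code in the committed scripts.

/** Decode numeric HTML entities such as &#8217; or &#x2019; into characters. */
function decodeNumericEntities(input: string): string {
  return input
    .replace(/&#x([0-9a-fA-F]+);/g, (_m, hex: string) =>
      String.fromCodePoint(parseInt(hex, 16)))
    .replace(/&#(\d+);/g, (_m, dec: string) =>
      String.fromCodePoint(parseInt(dec, 10)));
}

/** Remove <i>, <em>, <b>, <strong> tags while keeping the text they wrap. */
function stripInlineTags(input: string): string {
  return input.replace(/<\/?(?:i|em|b|strong)\b[^>]*>/gi, "");
}

/** Strip tags first, then decode, so an entity-encoded "<" is never treated as markup. */
function cleanPlainTitle(raw: string): string {
  return decodeNumericEntities(stripInlineTags(raw)).replace(/\s+/g, " ").trim();
}

/** Encode one path segment; decode first so "%20" does not become "%2520". */
function encodeSegment(segment: string): string {
  try {
    return encodeURIComponent(decodeURIComponent(segment));
  } catch {
    // Malformed percent sequence: fall back to encoding the raw text.
    return encodeURIComponent(segment);
  }
}

/** Percent-encode every segment of a path, leaving the "/" separators untouched. */
function encodeLegacyPath(path: string): string {
  return path.split("/").map(encodeSegment).join("/");
}
```

Encoding per segment rather than calling `encodeURIComponent` on the whole path keeps the slashes intact, and the decode-then-encode step makes the function safe to re-run on rows that were already fixed.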

No schema changes. All updates were applied to production directly via the SSH tunnel before this PR landed; the scripts are committed for the historical record + repeatability.

Test plan

  • pnpm typecheck (no new files require it; pre-checked locally)
  • Verified articles 14498 (<i>Super MNC</i>...) and 11493 (Chris Mooney URL) render correctly post-update
  • Confirmed wp_id 7191's neuromarketing letter and the 75 expanded-shortcode articles from PR #148 (feat(legacy-archive): kickers, featured images, slugs, shortlinks) are still intact
  • Post-deploy: visit a known italic-title article and verify the headline renders in italic
  • Post-deploy: click the chip on a legacy_html_url that was previously space-broken

🤖 Generated with Claude Code

…ncode legacy paths

* restore-wp-italic-titles.ts: 165 WP rich-text titles rebuilt to preserve
  the source <i>/<em>/<b>/<strong> formatting (e.g. "<i>Super MNC</i>
  fails to live up to predecessor" now renders italic). plain_title was
  already correct; only the Lexical title node was wrong.
* clean-plain-titles.ts: 27 plain_titles cleaned of residual &#8217;-style
  numeric entities (22 pipeline) and literal <i> tags (poly-online).
* legacy_html_url URL-encoding: 23 chips with literal spaces (Chris
  Mooney discusses... etc.) now use %20 so the chip clicks through to a
  working /archive/ path.
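
A minimal sketch of the title rebuild described in the first bullet, for readers who want the idea without opening the script. It is not the code in restore-wp-italic-titles.ts. The node shape (`TitleTextNode`) and the function name are hypothetical, and the bit values (bold = 1, italic = 2) are assumed to match Lexical's text format flags. Real serialized Lexical text nodes carry additional fields that are omitted here.

```typescript
// Assumed format bits; believed to match Lexical's bold and italic flags.
const FORMAT_BOLD = 1;
const FORMAT_ITALIC = 2;

// Simplified, hypothetical node shape for illustration only.
interface TitleTextNode {
  type: "text";
  text: string;
  format: number;
}

/** Split a title with inline <i>/<em>/<b>/<strong> markup into formatted text runs. */
function titleHtmlToTextNodes(html: string): TitleTextNode[] {
  const nodes: TitleTextNode[] = [];
  const tagPattern = /<(\/?)(i|em|b|strong)\b[^>]*>/gi;
  let italicDepth = 0;
  let boldDepth = 0;
  let cursor = 0;

  const pushText = (text: string): void => {
    if (text.length === 0) return;
    const format =
      (boldDepth > 0 ? FORMAT_BOLD : 0) | (italicDepth > 0 ? FORMAT_ITALIC : 0);
    const last = nodes[nodes.length - 1];
    if (last !== undefined && last.format === format) {
      last.text += text; // merge adjacent runs that share the same formatting
    } else {
      nodes.push({ type: "text", text, format });
    }
  };

  let match: RegExpExecArray | null;
  while ((match = tagPattern.exec(html)) !== null) {
    // Emit the text that precedes this tag using the current formatting state.
    pushText(html.slice(cursor, match.index));
    cursor = match.index + match[0].length;
    const delta = match[1] === "/" ? -1 : 1;
    const tag = (match[2] ?? "").toLowerCase();
    if (tag === "i" || tag === "em") {
      italicDepth = Math.max(0, italicDepth + delta);
    } else {
      boldDepth = Math.max(0, boldDepth + delta);
    }
  }
  pushText(html.slice(cursor)); // trailing text after the last tag
  return nodes;
}
```

Tracking depth counters instead of booleans keeps nested or repeated tags from switching formatting off early. The sketch leaves HTML entities undecoded; a real pass would decode them before building nodes.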
…lit a slash byline

* 11 polytechnic-online articles had the wrong published_date (MM/DD swap
  from manifest parse ambiguity, plus 2 off-by-ones). Slugs adjusted to
  reflect the corrected date; collisions disambiguated with the legacy
  article id suffix. previous_slug retained for the old shape so existing
  links 301.
* normalize-author-casing.ts: 53 author groups had 2+ casings ("James
  Lenze II" vs "JAmes Lenze II" etc.). Picked the most-common form per
  group, renamed 123 article rows to it.
* wp_id 2558's mis-OCR'd "Russell Brown/Paul O'Neil" byline split into
  two write-in author rows.
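
For the author-casing step, the "most-common form per group" rule can be sketched as below. This is an illustration under assumptions, not the contents of normalize-author-casing.ts: `planAuthorRenames` is a hypothetical name, groups are assumed to be keyed by the lowercased, trimmed name, and ties are assumed to go to the first variant seen.

```typescript
/**
 * Given one author name per article row, return a map from each minority
 * casing to the most common casing in its case-insensitive group.
 */
function planAuthorRenames(names: string[]): Map<string, string> {
  // lowercased name -> (exact variant -> number of rows using it)
  const groups = new Map<string, Map<string, number>>();
  for (const raw of names) {
    const name = raw.trim();
    const key = name.toLowerCase();
    let variants = groups.get(key);
    if (variants === undefined) {
      variants = new Map<string, number>();
      groups.set(key, variants);
    }
    variants.set(name, (variants.get(name) ?? 0) + 1);
  }

  const renames = new Map<string, string>();
  groups.forEach((variants) => {
    if (variants.size < 2) return; // only one casing in use: nothing to fix
    let canonical = "";
    let canonicalCount = 0;
    variants.forEach((count, variant) => {
      // Strict ">" means the first-seen variant wins a tie.
      if (count > canonicalCount) {
        canonical = variant;
        canonicalCount = count;
      }
    });
    variants.forEach((_count, variant) => {
      if (variant !== canonical) renames.set(variant, canonical);
    });
  });
  return renames;
}
```

Returning a rename plan rather than mutating rows directly makes it easy to review the proposed changes before applying them to production.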
@RonanHevenor merged commit 0257339 into main on May 7, 2026