fix: adjust FxTwitter facet indices for surrogate-pair emoji #281

Open: Thinkscape wants to merge 1 commit into kepano:main from Thinkscape:fix/fxtwitter-surrogate-indices
Problem

The FxTwitter API returns facet indices counted in Unicode code points (each emoji = 1), but renderTweet() and applyFacets() treat them as UTF-16 code-unit offsets. In JavaScript strings, emoji outside the Basic Multilingual Plane are encoded as surrogate pairs and count as 2.

When a tweet contains emoji (🦞, 🩺, 🔌, 🌐, etc.) before a URL or mention facet, the facet markers land too early in the string, by one code unit per preceding surrogate pair. This produces broken HTML like:

<p>Small maintenance relea<a href="https://t.co/VkjTYkoy7V">se:<br>...</a>oy7V</p>

After Turndown conversion, this becomes garbled markdown:

Small maintenance relea[se:  
https://t.co/VkjTYk](https://t.co/VkjTYkoy7V)oy7V

Real-world example

Tweet: https://x.com/openclaw/status/2052096219233587451

The tweet contains 4 emoji before the t.co URL, causing a 4-code-unit offset between FxTwitter facet indices and JavaScript string positions.
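
To make the offset concrete, here is a minimal sketch using a synthetic string (not the actual tweet text; the URL is a placeholder). It counts the surrogate pairs that precede the URL and derives the code-point position that, per the description above, FxTwitter would report.

```typescript
// Synthetic text (not the actual tweet): four supplementary-plane emoji, then a URL.
const sampleText = "🦞🩺🔌🌐 https://t.co/example";

// Where JavaScript finds the URL, measured in UTF-16 code units.
const utf16Start = sampleText.indexOf("https://");

// Count surrogate pairs before the URL. Each pair is one code point
// that occupies two UTF-16 code units.
let pairsBefore = 0;
for (let i = 0; i < utf16Start; i++) {
  const unit = sampleText.charCodeAt(i);
  if (unit >= 0xd800 && unit <= 0xdbff) pairsBefore++; // high surrogate
}

// The same position expressed in code points.
const codePointStart = utf16Start - pairsBefore;

console.log({ utf16Start, codePointStart, offset: pairsBefore });
```

In this sample the URL starts at code unit 9 but code point 5, so using the code-point index directly with slice() would begin 4 code units too early, inside the emoji run.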

Fix

Added two helper methods to XOembedExtractor:

  • codePointToUtf16Index() — converts a single code-point index to its UTF-16 code-unit equivalent by iterating over characters and counting surrogate pairs
  • adjustFacetIndicesToUtf16() — applies the conversion to all facet indices, with a fast-path returning early when the text contains no surrogate pairs

These are called in renderTweet() before facets are passed to applyFacets().
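
The two helpers could look roughly like the sketch below. This is an illustration of the approach described above, not the code from the PR: the Facet interface is a hypothetical stand-in for the real FxTwitter facet type, and the functions are written as standalone functions rather than methods on XOembedExtractor.

```typescript
// Hypothetical facet shape; the real FxTwitter facet type may carry more fields.
interface Facet {
  indices: [number, number];
}

const isHighSurrogate = (unit: number): boolean => unit >= 0xd800 && unit <= 0xdbff;
const isLowSurrogate = (unit: number): boolean => unit >= 0xdc00 && unit <= 0xdfff;

// Convert a code-point index into the matching UTF-16 code-unit index
// by walking the string and stepping over surrogate pairs as one unit of count.
function codePointToUtf16Index(text: string, codePointIndex: number): number {
  let utf16Index = 0;
  let remaining = codePointIndex;
  while (remaining > 0 && utf16Index < text.length) {
    const isPair =
      isHighSurrogate(text.charCodeAt(utf16Index)) &&
      utf16Index + 1 < text.length &&
      isLowSurrogate(text.charCodeAt(utf16Index + 1));
    utf16Index += isPair ? 2 : 1;
    remaining--;
  }
  return utf16Index;
}

// Apply the conversion to every facet's start and end index.
function adjustFacetIndicesToUtf16(text: string, facets: Facet[]): Facet[] {
  // Fast path: without surrogate code units, both index spaces coincide.
  if (!/[\ud800-\udfff]/.test(text)) return facets;
  return facets.map((facet) => ({
    ...facet,
    indices: [
      codePointToUtf16Index(text, facet.indices[0]),
      codePointToUtf16Index(text, facet.indices[1]),
    ] as [number, number],
  }));
}
```

With the synthetic text "🦞🩺🔌🌐 https://t.co/example", a facet with code-point indices [5, 25] maps to UTF-16 indices [9, 29], which is the range that slice() needs to extract the URL intact.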

Tests

Added tests/x-oembed-surrogates.test.ts with 3 test cases:

  1. Tweet with emoji — uses the real FxTwitter API response for the bug-report tweet, verifies no broken link placement
  2. Tweet without emoji — verifies indices pass through unchanged when no surrogate pairs exist
  3. Markdown output — verifies Turndown produces clean markdown after the fix

All 284 existing tests continue to pass.

FxTwitter returns facet indices counting Unicode code points (emoji = 1),
but JavaScript string operations (indexOf, slice, .length) use UTF-16
code units where surrogate-pair emoji (e.g. 🦞, 🩺, 🔌, 🌐) count as 2.

When a tweet contains emoji before a URL or mention facet, the indices
are off by the number of preceding surrogate pairs, causing links to be
placed at incorrect positions in the text. This produces broken HTML like:

  <p>Small maintenance relea<a href="...">se:<br>...</a>oy7V</p>

instead of:

  <p>Small maintenance release:<br><a href="...">https://t.co/...</a></p>

Fix: add codePointToUtf16Index() and adjustFacetIndicesToUtf16() helpers
to convert facet indices from code-point space to UTF-16 code units before
applying markers. Fast-path returns early when no surrogate pairs exist.