fix: adjust FxTwitter facet indices for surrogate-pair emoji by Thinkscape · Pull Request #281 · kepano/defuddle

Thinkscape · 2026-05-07T03:07:49Z

Problem

The FxTwitter API returns facet indices that count Unicode code points (emoji = 1), but renderTweet() and applyFacets() treat them as UTF-16 code-unit offsets. Emoji outside the Basic Multilingual Plane are encoded as surrogate pairs, so each one counts as 2 in a JavaScript string.
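
To make the mismatch concrete, here is a minimal sketch that compares the two ways of counting. The countCodePoints() helper and the sample string are hypothetical and are not part of this PR.

```typescript
// Hypothetical helper (not in the PR): count Unicode code points by
// treating each high/low surrogate pair as a single character.
function countCodePoints(text: string): number {
  let count = 0;
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    const isHigh = unit >= 0xd800 && unit <= 0xdbff;
    if (isHigh && i + 1 < text.length) {
      const next = text.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) i++; // skip the low surrogate
    }
    count++;
  }
  return count;
}

const sample = "🦞 hi"; // made-up sample text
const codeUnits = sample.length; // 5: the emoji occupies two UTF-16 code units
const codePoints = countCodePoints(sample); // 4: the emoji is one code point
```

Any index that falls after the emoji therefore differs by one between the two counting schemes.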

When a tweet contains emoji (🦞, 🩺, 🔌, 🌐, etc.) before a URL or mention facet, the facet markers are placed at incorrect positions. This produces broken HTML like:

<p>Small maintenance relea<a href="https://t.co/VkjTYkoy7V">se:<br>...</a>oy7V</p>

After Turndown conversion, this becomes garbled markdown:

Small maintenance relea[se:  
https://t.co/VkjTYk](https://t.co/VkjTYkoy7V)oy7V

Real-world example

Tweet: https://x.com/openclaw/status/2052096219233587451

The tweet contains 4 emoji before the t.co URL, causing a 4-code-unit offset between FxTwitter facet indices and JavaScript string positions.
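
The offset arithmetic can be illustrated with a made-up string (not the real tweet text) that also has four surrogate-pair emoji before a URL. All names below are hypothetical.

```typescript
// Hypothetical text, not the real tweet: four surrogate-pair emoji, then a URL.
const tweetText = "🦞🩺🔌🌐 see https://t.co/abc";
const tcoUrl = "https://t.co/abc";

// Where the URL starts in UTF-16 code units, as JavaScript sees it.
const utf16Start = tweetText.indexOf(tcoUrl); // 13

// The same position counted in code points: each of the 4 emoji counts once,
// so the code-point index is 4 smaller.
const codePointStart = utf16Start - 4; // 9

// Using the code-point index as if it were a UTF-16 index starts 4 units early
// and cuts off the last 4 characters of the URL.
const wrong = tweetText.slice(codePointStart, codePointStart + tcoUrl.length); // "see https://t.co"
const right = tweetText.slice(utf16Start, utf16Start + tcoUrl.length); // "https://t.co/abc"
```

The misplaced slice swallows four characters of preceding text and leaves four trailing characters of the URL outside the link, which is the same pattern as the stray "oy7V" in the broken HTML above.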

Fix

Added two helper methods to XOembedExtractor:

codePointToUtf16Index() — converts a single code-point index to its UTF-16 code-unit equivalent by iterating over characters and counting surrogate pairs
adjustFacetIndicesToUtf16() — applies the conversion to all facet indices, with a fast-path returning early when the text contains no surrogate pairs

These are called in renderTweet() before facets are passed to applyFacets().
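
The PR's diff is not reproduced here, so the following is only a sketch of how two such helpers could look. It uses standalone functions rather than methods, and the Facet shape and function signatures are assumptions; the real types and signatures in XOembedExtractor may differ.

```typescript
// Assumed facet shape for illustration; the real FxTwitter/defuddle types may differ.
interface Facet {
  type: string;
  indices: [number, number]; // [start, end], counted in code points
}

// Matches a high surrogate immediately followed by a low surrogate.
const SURROGATE_PAIR = /[\uD800-\uDBFF][\uDC00-\uDFFF]/;

// Convert a code-point index into the equivalent UTF-16 code-unit index
// by walking the string and advancing 2 units for each surrogate pair.
function codePointToUtf16Index(text: string, codePointIndex: number): number {
  let utf16 = 0;
  let seen = 0;
  while (seen < codePointIndex && utf16 < text.length) {
    const unit = text.charCodeAt(utf16);
    const isHigh = unit >= 0xd800 && unit <= 0xdbff;
    const next = utf16 + 1 < text.length ? text.charCodeAt(utf16 + 1) : 0;
    const isPair = isHigh && next >= 0xdc00 && next <= 0xdfff;
    utf16 += isPair ? 2 : 1;
    seen++;
  }
  return utf16;
}

// Apply the conversion to every facet. Returns the input unchanged when the
// text has no surrogate pairs, since both index spaces then coincide.
function adjustFacetIndicesToUtf16(text: string, facets: Facet[]): Facet[] {
  if (!SURROGATE_PAIR.test(text)) return facets; // fast path
  return facets.map((facet) => ({
    ...facet,
    indices: [
      codePointToUtf16Index(text, facet.indices[0]),
      codePointToUtf16Index(text, facet.indices[1]),
    ] as [number, number],
  }));
}

// Usage with made-up text: the URL spans code points 9..25.
const exampleText = "🦞🩺🔌🌐 see https://t.co/abc";
const adjusted = adjustFacetIndicesToUtf16(exampleText, [
  { type: "url", indices: [9, 25] },
]);
// adjusted[0].indices is [13, 29]
```

This sketch rescans from the start of the text for every index, which is simple and cheap for tweet-length strings; a single pass that converts all indices at once would be an alternative for longer inputs.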

Tests

Added tests/x-oembed-surrogates.test.ts with 3 test cases:

Tweet with emoji — uses the real FxTwitter API response for the bug-report tweet, verifies no broken link placement
Tweet without emoji — verifies indices pass through unchanged when no surrogate pairs exist
Markdown output — verifies Turndown produces clean markdown after the fix

All 284 existing tests continue to pass.

FxTwitter returns facet indices counting Unicode code points (emoji = 1), but JavaScript string operations (indexOf, slice, .length) use UTF-16 code units where surrogate-pair emoji (e.g. 🦞, 🩺, 🔌, 🌐) count as 2. When a tweet contains emoji before a URL or mention facet, the indices are off by the number of preceding surrogate pairs, causing links to be placed at incorrect positions in the text.

This produces broken HTML like:

Small maintenance relea<a href="...">se: ...</a>oy7V

instead of:

Small maintenance release: <a href="...">https://t.co/...</a>

Fix: add codePointToUtf16Index() and adjustFacetIndicesToUtf16() helpers to convert facet indices from code-point space to UTF-16 code units before applying markers. Fast-path returns early when no surrogate pairs exist.

Thinkscape wants to merge 1 commit into kepano:main from Thinkscape:fix/fxtwitter-surrogate-indices
