Prototype Unicode addon + xterm-detect comparisons #5584

koal44 · 2026-01-03T22:17:47Z

koal44
Jan 3, 2026

Prototype Unicode addon wired into xterm:
https://github.com/koal44/xterm.js/tree/ucwidth-addon

Depends (temporarily) on:
https://github.com/koal44/uc-width

In uc-width, running:

npm run xterm-detect

executes a ucs-detect-style comparison that runs xterm's existing Unicode providers (6/11/15-grapheme) plus this one, in the same environment.

The uc-width dependency is only there for the prototype. If you want to take this further, I'll restructure it to match whatever upstream approach you prefer (vendored tables, generated tables, etc.).

If anything needs doing to make this fit better with what you want, just ask.

Tyriar · 2026-01-04T14:50:48Z

Tyriar
Jan 4, 2026
Maintainer

Hi, I'm not understanding the proposal here?

koal44 · 2026-01-04T19:39:18Z

koal44
Jan 4, 2026
Author

xterm currently has three Unicode providers: the default (6), 11, and 15-g. 15-g handles things like 👨‍🌾 (ZWJ sequences) that can break rendering on the default provider, but it's still parked as experimental. What I'm offering is a drop-in provider built from Unicode 17 tables (UAX #29 grapheme data plus width tables), with table generation owned by the provider rather than pulled from a third-party dump.

There isn't a clean width standard here, so I aimed for wcwidth lineage and backwards compatibility. The ucs-detect-style runner makes it easy to compare 6/11/15-g/17 in one place, and also against other terminals. Grapheme logic does introduce some interesting corner cases for wcwidth, and there are real diffs vs 15-g, which I'd be happy to review if useful.

7 replies

koal44 Jan 5, 2026
Author

yep, i traced the client-pty socket stream while reproducing the prompt corruption (paste 👨‍🌾, then backspace). here’s what i see:

// paste
in:  ['\x1B[I', '👨‍🌾', '\x1B[O']
out: ['\x1B[?25l', '\x1B[93m👨‍🌾\x1B[?25h']

// backspace
in:  ['\x1B[I', '\x7F', '\x1B[O']
out: ['\x1B[m\x1B[?25l',
      '\x1B[93m\x1B[6;46H👨‍�\x1B[97m\x1B[2m\x1B[3m�\b\x1B[?25h']

after backspace i end up with approximately: man + ZWJ + diamond, then (cursor), then another diamond.

it looks like the delete leaves a dangling ZWJ, and then the backend redraw emits U+FFFD and wigs out. i'll keep digging.

koal44 Jan 6, 2026
Author

If we want grapheme clusters to behave atomically, one approach is to send a DEL burst with a count equal to the cluster's UTF-16 code unit length. This kind of works, but the cursor position is not well behaved when PSReadLine repaints.

// CoreBrowserTerminal._keyDown: expand a single DEL into a DEL burst sized by previous cell text length.
let keyBurst = result.key;

if (result.key === C0.DEL && !event.altKey && !event.ctrlKey && !event.metaKey) {
  const prev = this._findPrevCell();
  if (prev) keyBurst = C0.DEL.repeat(prev.text.length); // UTF-16 code units
}

this._onKey.fire({ key: result.key, domEvent: event });
this._showCursor();
this.coreService.triggerDataEvent(keyBurst, true);

private _findPrevCell(): { y: number; x: number; text: string } | null {
  let y = this.buffer.y;
  let x = this.buffer.x - 1;

  while (y >= 0) {
    const line = this.buffer.lines.get(y);
    if (!line) return null;

    if (x < 0) {
      y--;
      if (y < 0) return null;

      const prevLine = this.buffer.lines.get(y);
      if (!prevLine) return null;

      x = prevLine.length - 1;
      continue;
    }

    const text = line.getString(x);
    if (text) return { y, x, text };

    x--;
  }

  return null;
}
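
To sanity-check the burst sizing in isolation, here is a minimal sketch. It assumes a hypothetical backend that removes exactly one UTF-16 code unit per DEL; applyDels and hasLoneSurrogate are illustrative helpers, not xterm.js or PSReadLine APIs. Under that assumption a single DEL splits the surrogate pair of the last code point, which would be consistent with the U+FFFD in the trace, while a burst of text.length DELs clears the whole cluster.

```typescript
// ASSUMPTION: the backend drops one UTF-16 code unit per DEL. This models the
// behavior inferred in this thread, not PSReadLine's actual implementation.
function applyDels(buffer: string, dels: number): string {
  return buffer.slice(0, Math.max(0, buffer.length - dels));
}

// True if the string contains a surrogate that is not part of a valid pair.
function hasLoneSurrogate(s: string): boolean {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      const next = s.charCodeAt(i + 1); // NaN at end of string
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // valid pair, skip the low surrogate
        continue;
      }
      return true; // high surrogate without a low surrogate
    }
    if (unit >= 0xdc00 && unit <= 0xdfff) return true; // stray low surrogate
  }
  return false;
}

// U+1F468 (man) + U+200D (ZWJ) + U+1F33E (sheaf of rice), as UTF-16 units.
const farmerCluster = '\uD83D\uDC68\u200D\uD83C\uDF3E';

const afterOneDel = applyDels(farmerCluster, 1);
const afterBurst = applyDels(farmerCluster, farmerCluster.length);

console.log(farmerCluster.length);          // UTF-16 code units in the cluster
console.log(hasLoneSurrogate(afterOneDel)); // did a single DEL split a pair?
console.log(afterBurst.length);             // what is left after the burst
```

This only models the backend's edit buffer. It says nothing about where the backend believes the cursor is after the burst, which is where the repaint problem shows up.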

Tyriar Jan 6, 2026
Maintainer

I think it's a conflict between how the OS or shell is handling them and how we are. It's never been clear to me how we're meant to handle this discrepancy. It would be great if we could pick the right unicode addon to use based on prodding the OS, but I'm not sure if that's possible.

jerch Jan 11, 2026
Collaborator

It would be great if we could pick the right unicode addon to use based on prodding the OS...

Yes, this would be the right way forward for 80% of the apps, as most would follow the OS defaults. But there is no good way to derive the OS defaults regarding Unicode. And it still does not help in all cases, as an app can still link its own Unicode version. For that, a Unicode protocol between app and terminal would be nice, but we don't have anything like that.

All in all: Unicode hell, without any clear sign of how to reliably overcome it one day. Seems we are stuck with guessing for a while longer... 😸

koal44 Jan 11, 2026
Author

Here is a small demo tool for exploring column space in different shells and line editors:
https://github.com/koal44/xterm.js/tree/shell-explorer

The basic idea is to jump between HOME and END. This forces responses where we can infer column counts for a test string. For example, in pwsh, 👨‍🌾 occupies 5 columns.
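
A rough sketch of the inference step, assuming the line editor repositions the cursor with absolute CUP sequences (ESC [ row ; col H) like the one in the trace earlier in this thread. parseCup and inferColumns are hypothetical names for illustration and are not taken from the demo; a real editor may use relative moves instead, which this does not handle.

```typescript
// Parse the first CUP sequence (ESC [ row ; col H) found in a chunk of output.
// Missing parameters default to 1, as in ECMA-48.
function parseCup(data: string): { row: number; col: number } | null {
  const match = /\x1b\[(\d*)(?:;(\d*))?H/.exec(data);
  if (!match) return null;
  const rowText = match[1];
  const colText = match[2];
  return {
    row: rowText ? parseInt(rowText, 10) : 1,
    col: colText ? parseInt(colText, 10) : 1,
  };
}

// Infer how many columns the backend thinks a payload occupies from the
// cursor positions it reports after HOME and after END.
// Returns null if either reply has no CUP or the payload wrapped to another row.
function inferColumns(homeReply: string, endReply: string): number | null {
  const home = parseCup(homeReply);
  const end = parseCup(endReply);
  if (!home || !end || home.row !== end.row) return null;
  return end.col - home.col;
}

// Made-up replies for illustration: HOME lands on column 41, END on column 46.
console.log(inferColumns('\x1b[6;41H', '\x1b[?25l\x1b[93m\x1b[6;46H\x1b[?25h'));
```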

Summary of inferred column counts for 4 test grapheme clusters:

Shell (line editor)	🙂	👨‍🌾	👨‍👩‍👧‍👦	🇺🇳
pwsh (PSReadLine)	2	5	11	4
bash (readline)	2	4	8	2
fish (command line editor)	2	4	8	2
zsh (ZLE)	2	10	26	2

pwsh appears to use UTF-16-style column sizing: each count above equals the string's UTF-16 code unit length.
bash, fish, and zsh appear to use some flavor of wcwidth logic. For readline this is confirmed in the source code: https://cgit.git.savannah.gnu.org/cgit/readline.git/tree/display.c
zsh substitutes ZWJ with the textual form "<200d>", so 👨‍🌾 becomes 2 + 6 + 2 = 10.

These results suggest that there are at least two incompatible upstream column models in the wild: a POSIX wcwidth-style model and a Windows UTF-16-style model. A terminal cannot be universally consistent across these without some form of adaptive behavior.

Separately, deletion and cursor movement appear to operate on naive column units rather than grapheme clusters.
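
The three models can be written down in a few lines to check them against the table. This is a sketch only: roughWcwidth is a deliberately tiny stand-in that is valid for these four test strings and nothing else (ZWJ is 0, the U+1F300..U+1FAFF emoji range is 2, everything else is 1), and all function names are made up for illustration.

```typescript
const ZWJ = 0x200d;

// Decode a string into code points without relying on ES2015 string iteration.
function toCodePoints(s: string): number[] {
  const out: number[] = [];
  for (let i = 0; i < s.length; i++) {
    const hi = s.charCodeAt(i);
    if (hi >= 0xd800 && hi <= 0xdbff && i + 1 < s.length) {
      const lo = s.charCodeAt(i + 1);
      if (lo >= 0xdc00 && lo <= 0xdfff) {
        out.push(((hi - 0xd800) << 10) + (lo - 0xdc00) + 0x10000);
        i++;
        continue;
      }
    }
    out.push(hi);
  }
  return out;
}

// ASSUMPTION: toy width function, NOT a real wcwidth table.
function roughWcwidth(cp: number): number {
  if (cp === ZWJ) return 0;
  if (cp >= 0x1f300 && cp <= 0x1faff) return 2;
  return 1; // includes regional indicators U+1F1E6..U+1F1FF
}

// Windows-style model: one column per UTF-16 code unit.
function utf16Columns(s: string): number {
  return s.length;
}

// POSIX-style model: sum of per-code-point wcwidth.
function wcwidthColumns(s: string): number {
  return toCodePoints(s).reduce((sum, cp) => sum + roughWcwidth(cp), 0);
}

// zsh-style model: like wcwidth, but ZWJ is displayed as the 6 characters "<200d>".
function zleColumns(s: string): number {
  return toCodePoints(s).reduce(
    (sum, cp) => sum + (cp === ZWJ ? '<200d>'.length : roughWcwidth(cp)), 0);
}

// The four test clusters, spelled out as UTF-16 escapes.
const samples: string[] = [
  '\uD83D\uDE42',                                                     // slightly smiling face
  '\uD83D\uDC68\u200D\uD83C\uDF3E',                                   // farmer
  '\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC67\u200D\uD83D\uDC66', // family
  '\uD83C\uDDFA\uD83C\uDDF3',                                         // UN flag
];

console.log('utf16  ', samples.map(utf16Columns));
console.log('wcwidth', samples.map(wcwidthColumns));
console.log('zle    ', samples.map(zleColumns));
```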

koal44 · 2026-02-02T22:59:43Z

koal44
Feb 2, 2026
Author

Hey, hey. I didn't disappear. I’m still working on this, and I wanted to give an update.

Here's a small demo that measures an app's Unicode compatibility tables:

https://github.com/koal44/xterm.js/tree/share/width-explorer

There are three widths measured here: COL, MOV, and DEL. COL is the traditional width that xterm.js uses today to stay aligned with the backend. DEL is the number of delete key presses required to eat through a payload, and MOV is the number of cursor movements required to pass over a payload. If we ever want to patch over the app's poor Unicode handling and support atomic grapheme editing at the terminal layer, we need these additional measurements.

The demo can sweep the full Unicode set of about 1.1 million code points. Runtime varies widely from seconds to hours to days, depending on the shell and which measurements are enabled.

// bashCompatTableCompact (partial) [ranges=776; codePoints=1111999]
[
  { start: 0x0020, end: 0x007e, widths: { col: 1, mov: 1, del: 1 } }, // ASCII
  { start: 0x00a0, end: 0x02ff, widths: { col: 1, mov: 1, del: 1 } }, // Latin + misc
  { start: 0x0300, end: 0x036f, widths: { col: 0, mov: 0, del: 0 } }, // combining marks
  { start: 0x0370, end: 0x0482, widths: { col: 1, mov: 1, del: 1 } }, // Greek/Coptic
  ...
]

// pwshCompatTable (partial) [ranges=24; codePoints=89111] (yup, not finished, pwsh is the slowest!)
[
  { start: 0x0020, end: 0x007e, widths: { col: 1, mov: 1, del: 1 } }, // ASCII
  { start: 0x00a0, end: 0x10ff, widths: { col: 1, mov: 1, del: 1 } }, // BMP (mostly)
  { start: 0x1100, end: 0x115f, widths: { col: 2, mov: 1, del: 1 } }, // wide col, narrow mov/del
  { start: 0x1160, end: 0x2328, widths: { col: 1, mov: 1, del: 1 } },
  ...
]
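
For reference, a table in this shape can be queried with a plain binary search over the sorted, non-overlapping ranges. The sketch below is an assumption about how a consumer might use it: the Widths and WidthRange types and the lookupWidths function are illustrative names, not taken from the demo, and the sample data is just the first four bash ranges shown above.

```typescript
interface Widths { col: number; mov: number; del: number; }
interface WidthRange { start: number; end: number; widths: Widths; }

// First four ranges of the bash table above; the real table has many more.
const sampleBashTable: WidthRange[] = [
  { start: 0x0020, end: 0x007e, widths: { col: 1, mov: 1, del: 1 } }, // ASCII
  { start: 0x00a0, end: 0x02ff, widths: { col: 1, mov: 1, del: 1 } }, // Latin + misc
  { start: 0x0300, end: 0x036f, widths: { col: 0, mov: 0, del: 0 } }, // combining marks
  { start: 0x0370, end: 0x0482, widths: { col: 1, mov: 1, del: 1 } }, // Greek/Coptic
];

// Binary search for the range containing cp.
// Assumes ranges are sorted by start and do not overlap.
// Returns undefined for code points that fall in a gap (never measured).
function lookupWidths(table: readonly WidthRange[], cp: number): Widths | undefined {
  let lo = 0;
  let hi = table.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const range = table[mid];
    if (range === undefined) return undefined;
    if (cp < range.start) hi = mid - 1;
    else if (cp > range.end) lo = mid + 1;
    else return range.widths;
  }
  return undefined;
}

console.log(lookupWidths(sampleBashTable, 0x0041)); // 'A'
console.log(lookupWidths(sampleBashTable, 0x0301)); // combining acute accent
console.log(lookupWidths(sampleBashTable, 0x007f)); // DEL, not covered by the sample
```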

Within a given environment, such as POSIX versus Windows, there is strong selective pressure for COL alignment. If column widths did not broadly line up, terminals would constantly break across apps, so that consistency does emerge in practice. Where things diverge much more is MOV and DEL behavior, which varies widely even when COL agrees.

My candid opinion is that apps never should've attempted to handle Unicode themselves and just left it to us. We are now in the awkward position of not being able to query the backend for how it actually behaves, while still being expected to somehow infer its column alignment and related behavior after the fact.

I'm not sure yet how we can reliably know which app we are interacting with, or whether we can safely sneak in measurements during an active session, but I'm assuming something along those lines is possible. I've been working on a proof of concept around this, with the basic idea of a new xterm.js add-on that swaps out the current parser's print handler and replaces it with an updated buffer line and cell data model. That work is not finished yet, and I got pulled into building the measurement demo first, but I'm continuing to explore it here:

https://github.com/koal44/xterm.js/tree/poc/unicode-hell
