Commit 7142f4e

fix(math): split multi-char m:r text per character to match Word (SD-2632) (#2875)
* fix(math): split multi-char m:r text per character to match Word (SD-2632)

  Word's OMML2MML.XSL classifies each character in an m:r run individually: digits group into a single <mn> (with one optional decimal point between digits), operator characters each become their own <mo>, and everything else becomes its own <mi>. SuperDoc was emitting the entire run text as one element, so runs like "→∞" or "x+1" rendered as a single <mi>, losing operator spacing and semantics.

  tokenizeMathText implements the per-character classification. convertMathRun returns a single Element for one-atom runs and a DocumentFragment when multiple atoms are emitted, so siblings flow directly into the parent's <mrow> without an extra wrapper.

  m:fName is the documented exception: Word keeps multi-letter function names like "sin" or "lim" as one <mi> inside the function-name slot. convertFunction routes m:r children through convertMathRunWhole (no splitting), and a new collapseFunctionNameBases pass re-merges the base slot of any structural MathML element (munder/mover/msub/…) that Word nested inside m:fName. Without this, "lim" inside m:limLow would incorrectly split into three <mi>.

  Also drops U+221E (∞) from OPERATOR_CHARS. It is a mathematical constant per Word's XSL, not an operator.

  MathObjectConverter's return type widens from Element|null to Node|null so convertMathRun can return a DocumentFragment. All other converters already return Element, which is assignable to Node, so no other changes are needed.

  Verified against real Word-native fixtures: `→∞` in the limit-tests fixture case 1 now renders as <mi>n</mi><mo>→</mo><mi>∞</mi> (matches Word OMML2MML.XSL byte-for-byte), and nested limits keep their function names intact.

  Ref ECMA-376 §22.1.2.116, Annex L.6.1.13, §22.1.2.58.

* fix(math): group letters inside m:fName mixed runs + cover mmultiscripts (SD-2632)

  Follow-up to /review feedback backed by fresh Word OMML2MML.XSL evidence:
  - `convertMathRunAsFunctionName` (renamed from `convertMathRunWhole` since "whole" no longer fits) now groups consecutive non-digit / non-operator characters into one <mi> while still splitting digits and operators. Word's XSL for `<m:fName><m:r>log_2</m:r></m:fName>` produces `<mi>log</mi><mo>_</mo><mn>2</mn>`, not `<mi>l</mi><mi>o</mi><mi>g</mi>…`.
  - `BASE_BEARING_ELEMENTS` gains `mmultiscripts`. Word emits it when an `m:sPre` sits inside `m:fName`; our base-collapse pass needs to know to merge the first-child <mi> run.
  - CONTRIBUTING.md now documents the widened `Node | null` return type.

  Tests added:
  - Direct `tokenizeMathText` edge cases: `.5` / `5.` / `1.2.3` / `2x+1` / consecutive operators / empty / standalone ∞.
  - m:fName mixed-content: `log_2` stays `<mi>log</mi><mo>_</mo><mn>2</mn>`.
  - Base collapse inside nested `m:sSub` under `m:fName`.
  - Base collapse inside nested `m:sPre` (mmultiscripts) under `m:fName`.
  - Behavior test tightened to pin the full 3-atom sequence for `n→∞`.

  Disputed during review and deferred with evidence:
  - Opus claim that standalone `m:limLow` with "lim" base regresses to italic: Word XSL itself splits "lim" per-char in that shape (with or without m:sty=p), so our output matches Word.
  - Codex claim that Arabic-Indic digits should be `<mn>`: Word XSL also classifies them as `<mi>`, so our behavior matches.
  - Non-BMP surrogate-pair support: edge case in extreme mathematical alphanumerics; Word XSL itself errored on U+1D465. Worth a separate ticket.

* fix(math): iterate tokenizer by code point to preserve surrogate pairs

  Addresses Codex bot review on PR #2875. Astral-plane mathematical alphanumerics (e.g. U+1D465 mathematical italic x, U+1D7D9 mathematical double-struck 1) are UTF-16 surrogate pairs. Walking text by code unit split them into two half-pair <mi> atoms with invalid content. `codePointUnitLength` returns 2 when the current position starts a surrogate pair so tokenizeMathText and tokenizeFunctionNameText step across the full code point.
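The classification rules described in the commit message can be illustrated with a small standalone sketch. It is not the code from this commit: `sketchTokenize` and `SKETCH_OPERATORS` are hypothetical names, the operator set is a tiny assumed subset of the real `OPERATOR_CHARS`, and the sketch walks code points with `Array.from` rather than the `codePointUnitLength` helper.

```typescript
type Atom = { tag: 'mi' | 'mo' | 'mn'; content: string };

// Assumption: a tiny illustrative subset; the real OPERATOR_CHARS set is larger.
const SKETCH_OPERATORS = new Set(['+', '-', '=', '\u2192']);

const isAsciiDigit = (ch: string): boolean => ch >= '0' && ch <= '9';

function sketchTokenize(text: string): Atom[] {
  const atoms: Atom[] = [];
  // Array.from iterates by code point, so a surrogate pair stays one entry.
  const chars = Array.from(text);
  let i = 0;
  while (i < chars.length) {
    const ch = chars[i] ?? '';
    if (isAsciiDigit(ch)) {
      // Group digits, allowing one decimal point that sits between two digits.
      let end = i + 1;
      let sawDot = false;
      while (end < chars.length) {
        const c = chars[end] ?? '';
        if (isAsciiDigit(c)) {
          end++;
        } else if (c === '.' && !sawDot && isAsciiDigit(chars[end + 1] ?? '')) {
          sawDot = true;
          end++;
        } else {
          break;
        }
      }
      atoms.push({ tag: 'mn', content: chars.slice(i, end).join('') });
      i = end;
    } else {
      // Operators and identifiers are always emitted one character at a time.
      atoms.push({ tag: SKETCH_OPERATORS.has(ch) ? 'mo' : 'mi', content: ch });
      i++;
    }
  }
  return atoms;
}

const show = (s: string): string =>
  sketchTokenize(s).map((a) => `<${a.tag}>${a.content}</${a.tag}>`).join('');

console.log(show('n\u2192\u221E')); // <mi>n</mi><mo>→</mo><mi>∞</mi>
console.log(show('2x+1')); // <mn>2</mn><mi>x</mi><mo>+</mo><mn>1</mn>
console.log(show('3.14')); // <mn>3.14</mn>
```

Under these assumptions a run such as U+1D465 followed by `+1` yields three atoms, because the astral character is kept as a single `<mi>` instead of two half-pair atoms.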
1 parent f3bfc71 commit 7142f4e

6 files changed

Lines changed: 576 additions & 39 deletions

packages/layout-engine/painters/dom/src/features/math/CONTRIBUTING.md

Lines changed: 3 additions & 1 deletion
@@ -36,7 +36,9 @@ type MathObjectConverter = (
   doc: Document, // For creating DOM elements
   convertChildren: (children: OmmlJsonNode[]) => DocumentFragment,
   // Recursively converts nested OMML content
-) => Element | null;
+) => Node | null; // Return a single Element for one atom, or a
+                  // DocumentFragment when your converter produces
+                  // multiple sibling elements (see m:r / math-run).
 ```

 `convertChildren` is the important one. Pass it any child elements that contain nested math content (`m:e`, `m:num`, `m:sub`, etc.). It handles everything inside them, including other math objects.

packages/layout-engine/painters/dom/src/features/math/converters/function.ts

Lines changed: 64 additions & 1 deletion
@@ -1,4 +1,5 @@
 import type { MathObjectConverter } from '../types.js';
+import { convertMathRunAsFunctionName } from './math-run.js';

 const MATHML_NS = 'http://www.w3.org/1998/Math/MathML';
 const FUNCTION_APPLY_OPERATOR = '\u2061';
@@ -37,6 +38,56 @@ function forceNormalMathVariant(root: ParentNode): void {
   }
 }

+/**
+ * Structural MathML elements whose FIRST child is the "function-name base"
+ * when nested inside m:fName (e.g. m:limLow → <munder>, m:limUpp → <mover>,
+ * m:sSub → <msub>, etc.). Word's OMML2MML.XSL keeps the base text whole
+ * (e.g. "lim" as one <mi>) even though it splits regular runs per-character.
+ */
+const BASE_BEARING_ELEMENTS = new Set([
+  'munder',
+  'mover',
+  'munderover',
+  'msub',
+  'msup',
+  'msubsup',
+  'mmultiscripts', // m:sPre inside m:fName
+]);
+
+/**
+ * After per-character splitting in convertMathRun, the base of a nested
+ * limit/script inside m:fName comes out as multiple single-char <mi> siblings
+ * wrapped in an <mrow>. Word's XSL keeps that base whole — merge the siblings
+ * back into a single <mi> if they all share the same (or no) mathvariant.
+ */
+function collapseFunctionNameBases(root: ParentNode): void {
+  for (const child of Array.from(root.children)) {
+    if (BASE_BEARING_ELEMENTS.has(child.localName)) {
+      const base = child.children[0];
+      if (base) {
+        collapseMrowToSingleMi(base);
+        collapseFunctionNameBases(base);
+      }
+    } else {
+      collapseFunctionNameBases(child);
+    }
+  }
+}
+
+function collapseMrowToSingleMi(container: Element): void {
+  const children = Array.from(container.children);
+  if (children.length < 2) return;
+  if (!children.every((c) => c.localName === 'mi')) return;
+  const variant = children[0]!.getAttribute('mathvariant');
+  if (!children.every((c) => c.getAttribute('mathvariant') === variant)) return;
+
+  const merged = container.ownerDocument!.createElementNS(MATHML_NS, 'mi');
+  merged.textContent = children.map((c) => c.textContent ?? '').join('');
+  if (variant) merged.setAttribute('mathvariant', variant);
+  container.insertBefore(merged, children[0]!);
+  for (const c of children) c.remove();
+}
+
 /**
  * Convert m:func (function apply) to MathML.
  *
@@ -59,7 +110,19 @@ export const convertFunction: MathObjectConverter = (node, doc, convertChildren)
   const wrapper = doc.createElementNS(MATHML_NS, 'mrow');

   const functionNameRow = doc.createElementNS(MATHML_NS, 'mrow');
-  functionNameRow.appendChild(convertChildren(functionName?.elements ?? []));
+  // m:r children of m:fName stay whole (Word's OMML2MML.XSL keeps multi-letter
+  // function names like "sin" or "lim" as a single <mi>). Non-m:r children —
+  // like a nested m:limLow — go through the normal recursive path.
+  for (const child of functionName?.elements ?? []) {
+    if (child.name === 'm:r') {
+      const atom = convertMathRunAsFunctionName(child, doc);
+      if (atom) functionNameRow.appendChild(atom);
+    } else {
+      const converted = convertChildren([child]);
+      if (converted.childNodes.length > 0) functionNameRow.appendChild(converted);
+    }
+  }
+  collapseFunctionNameBases(functionNameRow);
   forceNormalMathVariant(functionNameRow);

   if (functionNameRow.childNodes.length > 0) {
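
To make the base-collapse pass easier to follow without a DOM, the sketch below models the same idea on plain objects. Everything here is hypothetical: `SketchNode`, `mergeMiRun`, and `collapseBases` are illustrative stand-ins for the DOM-based `collapseMrowToSingleMi` and `collapseFunctionNameBases` in the hunk above, not the commit's code.

```typescript
// Hypothetical plain-object stand-in for a MathML element (not the DOM API).
type SketchNode = { name: string; text?: string; variant?: string; children: SketchNode[] };

const SKETCH_BASE_BEARING = new Set([
  'munder', 'mover', 'munderover', 'msub', 'msup', 'msubsup', 'mmultiscripts',
]);

const leaf = (name: string, text: string): SketchNode => ({ name, text, children: [] });
const row = (...children: SketchNode[]): SketchNode => ({ name: 'mrow', children });

// Merge a run of <mi> siblings into one <mi> when they share the same variant.
function mergeMiRun(container: SketchNode): void {
  const kids = container.children;
  if (kids.length < 2) return;
  if (!kids.every((k) => k.name === 'mi')) return;
  const variant = kids[0]?.variant;
  if (!kids.every((k) => k.variant === variant)) return;
  const merged: SketchNode = { name: 'mi', text: kids.map((k) => k.text ?? '').join(''), children: [] };
  if (variant !== undefined) merged.variant = variant;
  container.children = [merged];
}

// Only the FIRST child (the base) of a base-bearing element is collapsed;
// other elements are walked recursively.
function collapseBases(root: SketchNode): void {
  for (const child of root.children) {
    if (SKETCH_BASE_BEARING.has(child.name)) {
      const base = child.children[0];
      if (base) {
        mergeMiRun(base);
        collapseBases(base);
      }
    } else {
      collapseBases(child);
    }
  }
}

// "lim" split per character in the base of a <munder>, as it would look
// before the collapse pass runs.
const limit: SketchNode = {
  name: 'munder',
  children: [
    row(leaf('mi', 'l'), leaf('mi', 'i'), leaf('mi', 'm')),
    row(leaf('mi', 'n'), leaf('mo', '\u2192'), leaf('mi', '\u221E')),
  ],
};
const fnameRow = row(limit);
collapseBases(fnameRow);
```

After `collapseBases(fnameRow)` the base row holds a single `mi` with text `lim`, while the under-script row keeps its three atoms, since only the first child of a base-bearing element is treated as the function-name base.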

packages/layout-engine/painters/dom/src/features/math/converters/math-run.ts

Lines changed: 179 additions & 33 deletions
@@ -46,8 +46,7 @@ const OPERATOR_CHARS = new Set([
   '\u220C', // ∈, ∉, ∋, ∌
   '\u2211',
   '\u220F', // ∑, ∏
-  '\u221A',
-  '\u221E', // √, ∞
+  '\u221A', // √ (radical sign — prefix operator)
   '\u2227',
   '\u2228',
   '\u2229',
@@ -65,16 +64,70 @@ const OPERATOR_CHARS = new Set([
   '\u2287', // ⊂, ⊃, ⊆, ⊇
 ]);

+type MathAtomTag = 'mi' | 'mo' | 'mn';
+
+function isDigit(ch: string): boolean {
+  return ch >= '0' && ch <= '9';
+}
+
 /**
- * Classify a text string into MathML element type.
- * - All-digit strings → <mn> (number)
- * - Known operators → <mo> (operator)
- * - Everything else → <mi> (identifier)
+ * Length in UTF-16 code units of the code point starting at `text[i]`.
+ * Handles surrogate pairs so astral-plane characters (e.g. mathematical
+ * italic U+1D465) don't get split into two bogus <mi> atoms.
  */
-function classifyMathText(text: string): 'mn' | 'mo' | 'mi' {
-  if (/^\d*\.?\d+$/.test(text)) return 'mn';
-  if (text.length === 1 && OPERATOR_CHARS.has(text)) return 'mo';
-  return 'mi';
+function codePointUnitLength(text: string, i: number): number {
+  const hi = text.charCodeAt(i);
+  if (hi >= 0xd800 && hi <= 0xdbff && i + 1 < text.length) {
+    const lo = text.charCodeAt(i + 1);
+    if (lo >= 0xdc00 && lo <= 0xdfff) return 2;
+  }
+  return 1;
+}
+
+/**
+ * Split a math run's text into MathML atoms, matching Word's OMML2MML.XSL.
+ *
+ * Rules (ECMA-376 §22.1.2.116 example + Annex L.6.1.13):
+ * - Consecutive digits — optionally containing one decimal point between digits —
+ *   group into a single `<mn>`.
+ * - Each recognized operator character becomes its own `<mo>`.
+ * - Every other character becomes its own `<mi>`.
+ *
+ * Example: `"n+1"` → `[<mi>n</mi>, <mo>+</mo>, <mn>1</mn>]`.
+ */
+export function tokenizeMathText(text: string): Array<{ tag: MathAtomTag; content: string }> {
+  const atoms: Array<{ tag: MathAtomTag; content: string }> = [];
+  let i = 0;
+  while (i < text.length) {
+    const step = codePointUnitLength(text, i);
+    const ch = text.slice(i, i + step);
+    if (step === 1 && isDigit(ch)) {
+      let end = i + 1;
+      let sawDot = false;
+      while (end < text.length) {
+        const c = text[end]!;
+        if (isDigit(c)) {
+          end++;
+          continue;
+        }
+        if (c === '.' && !sawDot && end + 1 < text.length && isDigit(text[end + 1]!)) {
+          sawDot = true;
+          end++;
+          continue;
+        }
+        break;
+      }
+      atoms.push({ tag: 'mn', content: text.slice(i, end) });
+      i = end;
+    } else if (step === 1 && OPERATOR_CHARS.has(ch)) {
+      atoms.push({ tag: 'mo', content: ch });
+      i++;
+    } else {
+      atoms.push({ tag: 'mi', content: ch });
+      i += step;
+    }
+  }
+  return atoms;
 }

 /** ECMA-376 m:sty → MathML mathvariant (§22.1.2 math run properties). */
@@ -115,47 +168,140 @@ function resolveMathVariant(rPr: OmmlJsonNode | undefined): string | null {
   return null;
 }

+function extractText(node: OmmlJsonNode): string {
+  let text = '';
+  for (const child of node.elements ?? []) {
+    if (child.name === 'm:t') {
+      for (const tc of child.elements ?? []) {
+        if (tc.type === 'text' && typeof tc.text === 'string') text += tc.text;
+      }
+    }
+  }
+  return text;
+}
+
 /**
- * Convert an m:r (math run) element to MathML.
+ * Convert an m:r (math run) element to MathML atoms.
  *
  * m:r contains:
  * - m:rPr (math run properties: script, style, normal text flag)
  * - m:t (text content)
  * - Optionally w:rPr (WordprocessingML run properties for formatting)
  *
- * The text is classified as <mi>, <mo>, or <mn> based on content.
+ * The run's text is split per-character into `<mi>` / `<mo>` / `<mn>` atoms
+ * per Word's OMML2MML.XSL. For a single-atom run (common case — a one-letter
+ * variable, single operator, or an all-digit number) the converter returns a
+ * single Element. For a multi-atom run (e.g. "→∞", "x+1") it returns a
+ * DocumentFragment whose children become siblings of the parent mrow.
+ *
+ * @spec ECMA-376 §22.1.2.116 (t) — example shows multi-char mixed runs as the
+ * normal authored shape; §22.1.2.58 (lit) implies operators are classified
+ * per-character by default.
  */
 export const convertMathRun: MathObjectConverter = (node, doc) => {
-  const elements = node.elements ?? [];
+  const text = extractText(node);
+  if (!text) return null;

-  // Extract text from m:t children
-  let text = '';
-  for (const child of elements) {
-    if (child.name === 'm:t') {
-      const textChildren = child.elements ?? [];
-      for (const tc of textChildren) {
-        if (tc.type === 'text' && typeof tc.text === 'string') {
-          text += tc.text;
+  const rPr = (node.elements ?? []).find((el) => el.name === 'm:rPr');
+  const variant = resolveMathVariant(rPr);
+  const atoms = tokenizeMathText(text);
+
+  const createAtom = (atom: { tag: MathAtomTag; content: string }): Element => {
+    const el = doc.createElementNS(MATHML_NS, atom.tag);
+    el.textContent = atom.content;
+    // Apply m:rPr-derived variant to every atom in the run. Omitted attribute
+    // means "use the MathML default" (italic for single-char <mi>, normal
+    // for multi-char <mi>/<mo>/<mn>).
+    if (variant) el.setAttribute('mathvariant', variant);
+    return el;
+  };
+
+  if (atoms.length === 1) return createAtom(atoms[0]!);
+
+  const fragment = doc.createDocumentFragment();
+  for (const atom of atoms) fragment.appendChild(createAtom(atom));
+  return fragment;
+};
+
+/**
+ * Tokenize a math run's text for the m:fName context: consecutive non-digit,
+ * non-operator characters stay grouped in one `<mi>` (so "log" in "log_2"
+ * remains a single identifier), while digits still group into `<mn>` and
+ * each operator character is its own `<mo>`.
+ *
+ * Matches Word's OMML2MML.XSL run-internal classification for m:fName
+ * content: `log_2` → `<mi>log</mi><mo>_</mo><mn>2</mn>`.
+ */
+function tokenizeFunctionNameText(text: string): Array<{ tag: MathAtomTag; content: string }> {
+  const atoms: Array<{ tag: MathAtomTag; content: string }> = [];
+  let i = 0;
+  while (i < text.length) {
+    const step = codePointUnitLength(text, i);
+    const ch = text.slice(i, i + step);
+    if (step === 1 && isDigit(ch)) {
+      let end = i + 1;
+      let sawDot = false;
+      while (end < text.length) {
+        const c = text[end]!;
+        if (isDigit(c)) {
+          end++;
+          continue;
+        }
+        if (c === '.' && !sawDot && end + 1 < text.length && isDigit(text[end + 1]!)) {
+          sawDot = true;
+          end++;
+          continue;
         }
+        break;
       }
+      atoms.push({ tag: 'mn', content: text.slice(i, end) });
+      i = end;
+    } else if (step === 1 && OPERATOR_CHARS.has(ch)) {
+      atoms.push({ tag: 'mo', content: ch });
+      i++;
+    } else {
+      // Group consecutive non-digit, non-operator code points into one <mi>.
+      let end = i + step;
+      while (end < text.length) {
+        const s = codePointUnitLength(text, end);
+        const c = text.slice(end, end + s);
+        if (s === 1 && (isDigit(c) || OPERATOR_CHARS.has(c))) break;
+        end += s;
+      }
+      atoms.push({ tag: 'mi', content: text.slice(i, end) });
+      i = end;
     }
   }
+  return atoms;
+}

+/**
+ * Convert an m:r inside m:fName (m:func's function-name slot). Word's
+ * OMML2MML.XSL keeps each letter-sequence whole while still splitting out
+ * digits and operators — so `sin` stays `<mi>sin</mi>`, but `log_2` becomes
+ * `<mi>log</mi><mo>_</mo><mn>2</mn>`.
+ *
+ * Returns a single Element for single-atom runs or a DocumentFragment when
+ * the run emits multiple atoms. Returns null for empty text.
+ */
+export function convertMathRunAsFunctionName(node: OmmlJsonNode, doc: Document): Node | null {
+  const text = extractText(node);
   if (!text) return null;

-  const rPr = elements.find((el) => el.name === 'm:rPr');
+  const rPr = (node.elements ?? []).find((el) => el.name === 'm:rPr');
   const variant = resolveMathVariant(rPr);
-  const tag = classifyMathText(text);
+  const atoms = tokenizeFunctionNameText(text);

-  const el = doc.createElementNS(MATHML_NS, tag);
-  el.textContent = text;
+  const createAtom = (atom: { tag: MathAtomTag; content: string }): Element => {
+    const el = doc.createElementNS(MATHML_NS, atom.tag);
+    el.textContent = atom.content;
+    if (variant) el.setAttribute('mathvariant', variant);
+    return el;
+  };

-  // Apply mathvariant when the spec properties resolve to one. The default
-  // for single-char <mi> is italic and for multi-char <mi>/<mo>/<mn> is
-  // normal — we only set an attribute when m:rPr explicitly specifies it.
-  if (variant) {
-    el.setAttribute('mathvariant', variant);
-  }
+  if (atoms.length === 1) return createAtom(atoms[0]!);

-  return el;
-};
+  const fragment = doc.createDocumentFragment();
+  for (const atom of atoms) fragment.appendChild(createAtom(atom));
+  return fragment;
+}
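
The difference between the regular tokenizer and the m:fName variant is easiest to see on a mixed run such as `log_2`. The following is a simplified sketch rather than the commit's implementation: `sketchTokenizeFunctionName` and `FN_SKETCH_OPERATORS` are hypothetical names, the operator set is an assumed subset (it includes `_` because the commit message reports that Word emits it as `<mo>`), and decimal points inside digit runs are deliberately left out to keep it short.

```typescript
type FnAtom = { tag: 'mi' | 'mo' | 'mn'; content: string };

// Assumption: illustrative operator subset, not the real OPERATOR_CHARS.
const FN_SKETCH_OPERATORS = new Set(['_', '+', '-', '=']);

const isFnDigit = (ch: string): boolean => ch >= '0' && ch <= '9';

function sketchTokenizeFunctionName(text: string): FnAtom[] {
  const chars = Array.from(text); // code-point iteration keeps surrogate pairs whole
  const atoms: FnAtom[] = [];
  let i = 0;
  while (i < chars.length) {
    const ch = chars[i] ?? '';
    if (FN_SKETCH_OPERATORS.has(ch)) {
      // Operators are still emitted one per character.
      atoms.push({ tag: 'mo', content: ch });
      i++;
      continue;
    }
    // Extend the run while the next character is the same kind
    // (digit vs. identifier) and is not an operator.
    const wantDigit = isFnDigit(ch);
    let end = i + 1;
    while (end < chars.length) {
      const c = chars[end] ?? '';
      if (FN_SKETCH_OPERATORS.has(c) || isFnDigit(c) !== wantDigit) break;
      end++;
    }
    atoms.push({ tag: wantDigit ? 'mn' : 'mi', content: chars.slice(i, end).join('') });
    i = end;
  }
  return atoms;
}

const showFn = (s: string): string =>
  sketchTokenizeFunctionName(s).map((a) => `<${a.tag}>${a.content}</${a.tag}>`).join('');

console.log(showFn('sin')); // <mi>sin</mi>
console.log(showFn('log_2')); // <mi>log</mi><mo>_</mo><mn>2</mn>
```

The only behavioral difference from the per-character sketch is the identifier branch, which extends over consecutive letters instead of emitting one `<mi>` per character.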
