Summary
The editor's Lezer grammar cannot parse non-ASCII (e.g. Chinese) participant names,
even though the renderer (ANTLR) accepts them and draws the participant. ZenUML has a
large Chinese-speaking user base, so participant names such as 用户 / 服务 are extremely
common, and today every such name mis-parses in the editor (error node, broken
highlighting, missing from autocomplete).
Steps to reproduce
Type any of these in the editor:
用户 (bare participant)
@Actor 用户 (annotated participant)
用户->服务: 请求 (Chinese message endpoints)
Each produces a Lezer error node.
Expected vs Actual (editor Lezer vs renderer ANTLR oracle)
| Input | Lezer (editor) | ANTLR (renderer) |
| --- | --- | --- |
| 用户 | ERROR | {用户} |
| @Actor 用户 | ERROR | {用户} |
| 用户->服务: 请求 | ERROR | {用户, 服务} |
| Order_Service2 | ok | {Order_Service2} |
| 🚀Rocket | error | {} (ANTLR also rejects: emoji is not a letter; NOT a bug) |
The editor's participant set is meant to be a SUBSET of the ANTLR oracle's
(conformance/oracle.ts); for Unicode names it is empty where the oracle is non-empty.
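The subset relationship can be sketched as a small check. This is an illustration only: `isSubset` and `missingFromEditor` are hypothetical helpers, not functions taken from conformance/oracle.ts, and the two sets are hand-copied from the 用户->服务: 请求 row. Note that an empty editor set still satisfies the subset condition, so a subset-only assertion would probably not flag the Unicode rows by itself; listing the names the editor misses makes the gap visible.

```typescript
// Hypothetical helpers for illustration; not part of conformance/oracle.ts.

// True when every participant the editor found is also in the oracle's set.
function isSubset(editor: Set<string>, oracle: Set<string>): boolean {
  return Array.from(editor).every((name) => oracle.has(name));
}

// Participants the oracle found that the editor did not, in oracle order.
function missingFromEditor(editor: Set<string>, oracle: Set<string>): string[] {
  return Array.from(oracle).filter((name) => !editor.has(name));
}

// Values copied by hand from the "用户->服务: 请求" row of the table.
const oracleSet = new Set<string>(["用户", "服务"]);
const editorSet = new Set<string>(); // Lezer produced only an error node

// The empty set is a subset of anything, so this check passes.
console.log(isSubset(editorSet, oracleSet));

// The difference is what exposes the problem: both names are missing.
console.log(missingFromEditor(editorSet, oracleSet));
```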
Location
web/src/editor/grammar/zenuml.grammar line ~99:
Identifier { $[a-zA-Z_] $[a-zA-Z_0-9]* }
ASCII-only. The renderer's ANTLR ID rule accepts Unicode letters.
Fix sketch
Broaden the Identifier token to accept Unicode letters (matching the ANTLR ID rule as
closely as Lezer allows — at minimum CJK; ideally all \p{L}), while staying a SUBSET of
the oracle (do not accept symbols/emoji the renderer rejects). Regenerate the parser and
keep the conformance corpus green; add Unicode cases to the corpus.
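Before changing the grammar, the intended token shape can be pinned down with a small predicate and a few corpus-style cases. The sketch below assumes the ANTLR ID rule amounts to "a Unicode letter or underscore, followed by letters, ASCII digits, or underscores"; that is an assumption to verify against the actual rule, and `looksLikeIdentifier` and `unicodeCases` are hypothetical names. It checks single identifiers only, not whole statements such as 用户->服务: 请求.

```typescript
// Assumed identifier shape: a Unicode letter or "_" first, then letters,
// ASCII digits, or "_". Verify against the renderer's ANTLR ID rule.
// Built with the RegExp constructor so it does not depend on the
// compiler's configured target for the \p{...} syntax.
const IDENTIFIER = new RegExp("^[\\p{L}_][\\p{L}_0-9]*$", "u");

// Hypothetical helper: would this text be one Identifier token?
function looksLikeIdentifier(text: string): boolean {
  return IDENTIFIER.test(text);
}

// Cases taken from the Expected vs Actual table: [input, should be accepted].
const unicodeCases: Array<[string, boolean]> = [
  ["用户", true],
  ["服务", true],
  ["Order_Service2", true],
  ["🚀Rocket", false], // emoji is not a letter, so it stays rejected
];

unicodeCases.forEach(([input, expected]) => {
  const actual = looksLikeIdentifier(input);
  console.log(`${input}: expected ${expected}, got ${actual}`);
});
```

Keeping the emoji case in the list guards the subset requirement: a broadened token that starts accepting 🚀Rocket would diverge from the renderer.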
Found via the editor-improvement campaign (i18n / Chinese authoring).