Editor grammar rejects non-ASCII (Chinese) participant names that the renderer accepts #809

@MrCoder

Summary

The editor's Lezer grammar cannot parse non-ASCII (e.g. Chinese) participant names,
even though the renderer (ANTLR) accepts them and draws the participant. ZenUML has a
large Chinese-speaking user base, so participant names such as 用户 or 服务 are extremely
common. Today every such name mis-parses in the editor: it becomes an error node, breaks
highlighting, and is missing from autocomplete.

Steps to reproduce

Type any of these in the editor:

  • 用户 (bare participant)
  • @Actor 用户 (annotated participant)
  • 用户->服务: 请求 (Chinese message endpoints)

Each produces a Lezer error node.

Expected vs Actual (editor Lezer vs renderer ANTLR oracle)

Input | Lezer (editor) | ANTLR (renderer)
--- | --- | ---
用户 | ERROR | {用户}
@Actor 用户 | ERROR | {用户}
用户->服务: 请求 | ERROR | {用户, 服务}
Order_Service2 | ok | {Order_Service2}
🚀Rocket | error | {} (ANTLR also rejects: emoji is not a letter; NOT a bug)

The editor's participant set is meant to be a SUBSET of the ANTLR oracle's
(conformance/oracle.ts); for Unicode names it is empty where the oracle is non-empty.
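
To make the subset relationship concrete, here is a minimal sketch. The helper names `extraNames` and `missingNames` are hypothetical and are not taken from conformance/oracle.ts; the participant sets are passed in as plain string arrays.

```typescript
// Hypothetical helpers (not the project's API): compare the participant names
// the editor parser reports against the names the ANTLR oracle reports.

// Names the editor reports that the oracle does not. A non-empty result
// would violate the "editor is a SUBSET of the oracle" invariant.
function extraNames(editor: string[], oracle: string[]): string[] {
  return editor.filter((name) => oracle.indexOf(name) === -1);
}

// Names the oracle reports that the editor failed to recognise. This is the
// gap described in this issue: the invariant still holds, but coverage is lost.
function missingNames(editor: string[], oracle: string[]): string[] {
  return oracle.filter((name) => editor.indexOf(name) === -1);
}

// Row "用户->服务: 请求" from the table: the editor yields no participants.
const editorSet: string[] = [];
const oracleSet: string[] = ["用户", "服务"];

const extras = extraNames(editorSet, oracleSet);    // empty: subset invariant holds
const missing = missingNames(editorSet, oracleSet); // both names are missing
```

An empty editor set trivially satisfies the subset check, so a conformance test that only asserts "no extras" would not flag this bug. Asserting on the missing names for the Unicode corpus cases would.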

Location

web/src/editor/grammar/zenuml.grammar line ~99:

Identifier { $[a-zA-Z_] $[a-zA-Z_0-9]* }

ASCII-only. The renderer's ANTLR ID rule accepts Unicode letters.

Fix sketch

Broaden the Identifier token to accept Unicode letters (matching the ANTLR ID rule as
closely as Lezer allows — at minimum CJK; ideally all \p{L}), while staying a SUBSET of
the oracle (do not accept symbols/emoji the renderer rejects). Regenerate the parser and
keep the conformance corpus green; add Unicode cases to the corpus.
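
As an illustration of the target character class, the sketch below expresses "letter or underscore, then letters, underscores, or ASCII digits" as a TypeScript predicate. This is an assumption about the intended shape: the exact ANTLR ID rule is not quoted in this issue, and the Lezer token would need explicit character ranges rather than this regex. `isIdentifier` is a hypothetical name, useful mainly for generating or checking corpus cases.

```typescript
// Assumed identifier shape (NOT copied from the ANTLR grammar):
//   first char: any Unicode letter (\p{L}) or underscore
//   rest:       Unicode letters, underscore, or ASCII digits 0-9
// Built with `new RegExp` and the "u" flag so Unicode property escapes work.
const IDENTIFIER = new RegExp("^[\\p{L}_][\\p{L}_0-9]*$", "u");

// Hypothetical helper: true if `name` fits the assumed identifier shape.
function isIdentifier(name: string): boolean {
  return IDENTIFIER.test(name);
}

// Candidate corpus cases drawn from the table in this issue.
const corpusCases: Array<[string, boolean]> = [
  ["用户", true],           // CJK letters
  ["服务", true],
  ["Order_Service2", true], // existing ASCII behaviour is preserved
  ["🚀Rocket", false],      // emoji is a symbol, not a letter
];

const mismatches = corpusCases.filter(
  ([name, expected]) => isIdentifier(name) !== expected
);
```

If `mismatches` is empty, the assumed class agrees with the oracle results listed in the table. Whether it stays a subset of the oracle for other scripts would still need to be checked against the real ANTLR rule.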

Found via the editor-improvement campaign (i18n / Chinese authoring).
