[Bug] Java syntax highlighting does not handle unicode escape sequences #5233

@mattgodbolt-molty

Reproducible in vscode.dev or in VS Code Desktop?

  • Not reproducible in vscode.dev or VS Code Desktop

Reproducible in the monaco editor playground?

Monaco Editor Playground Code

monaco.editor.create(document.getElementById('container'), {
	value: '// \\u000a is a line break in Java unicode escapes\n// This comment \\u000a int[] x = {0}; // is actually code\nclass Test {\n    // \\u0048\\u0065\\u006C\\u006C\\u006F\n    public static void main(String[] args) {}\n}',
	language: 'java'
});

Description

Java translates Unicode escape sequences (\uXXXX) in the first step of lexical translation, before the source is split into lines and tokens. As a result, \u000a inside a // comment acts as a real line terminator: the comment ends there, and whatever follows on the same physical line is compiled as code. The Java syntax highlighter doesn't account for this, so what appears to be a comment can hide real code.
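To make the effect concrete, here is a rough sketch of that pre-tokenisation pass in plain JavaScript. The function name `translateUnicodeEscapes` is hypothetical, and the handling is simplified: it follows the usual rule that a backslash only starts an escape when preceded by an even number of backslashes, but it leaves malformed escapes untouched, whereas javac would reject them.

```javascript
// Hypothetical helper: approximates the translation Java applies to
// \uXXXX escapes before tokenisation. A sketch, not javac's implementation.
function translateUnicodeEscapes(source) {
	let out = '';
	let backslashRun = 0; // raw backslashes immediately before position i
	let i = 0;
	while (i < source.length) {
		const ch = source[i];
		if (ch === '\\' && backslashRun % 2 === 0) {
			// One or more 'u' characters may follow the backslash.
			let j = i + 1;
			while (source[j] === 'u') j++;
			const hex = source.slice(j, j + 4);
			if (j > i + 1 && /^[0-9a-fA-F]{4}$/.test(hex)) {
				out += String.fromCharCode(parseInt(hex, 16));
				i = j + 4;
				backslashRun = 0;
				continue;
			}
			// Assumption: malformed escapes are passed through unchanged here.
		}
		backslashRun = ch === '\\' ? backslashRun + 1 : 0;
		out += ch;
		i++;
	}
	return out;
}

// The "comment" from the playground sample becomes two lines.
const translated = translateUnicodeEscapes('// This comment \\u000a int[] x = {0};');
console.log(translated.split('\n').length); // 2
```

After translation, the second line starts with ` int[] x = {0};`, which is no longer inside the comment.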

This is a known Java "feature" that can be used to hide malicious code: https://wh0.github.io/2019/11/16/easter-egg-inspection.html

Ideally the Java tokeniser would process \uXXXX sequences the same way javac does, or at minimum flag them visually.
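As a starting point for the "flag them visually" option, the following is a minimal sketch of a scanner that reports where Unicode escapes occur in a source string. The name `findUnicodeEscapes` and the shape of the returned objects are assumptions made for illustration; nothing here is part of Monaco's API, and wiring the offsets into editor decorations is not shown.

```javascript
// Hypothetical helper: locates \uXXXX escapes so they could be highlighted.
// Returns string offsets (end is exclusive) plus the decoded code unit.
function findUnicodeEscapes(source) {
	const found = [];
	let backslashRun = 0; // raw backslashes immediately before position i
	let i = 0;
	while (i < source.length) {
		if (source[i] === '\\' && backslashRun % 2 === 0) {
			let j = i + 1;
			while (source[j] === 'u') j++;
			const hex = source.slice(j, j + 4);
			if (j > i + 1 && /^[0-9a-fA-F]{4}$/.test(hex)) {
				const codeUnit = parseInt(hex, 16);
				found.push({
					start: i,
					end: j + 4,
					codeUnit,
					// LF and CR end a line in Java, so these are the risky ones.
					isLineTerminator: codeUnit === 0x0a || codeUnit === 0x0d,
				});
				i = j + 4;
				backslashRun = 0;
				continue;
			}
		}
		backslashRun = source[i] === '\\' ? backslashRun + 1 : 0;
		i++;
	}
	return found;
}

const hits = findUnicodeEscapes('// a \\u000a b \\u0048');
console.log(hits.filter((h) => h.isLineTerminator).length); // 1
```

Flagging only the escapes where `isLineTerminator` is true would catch the comment-hiding case while leaving ordinary escaped letters alone, though a stricter policy could flag every escape outside string and character literals.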

Cross-reference: compiler-explorer/compiler-explorer#4223
