Implement ES2025 RegExp pattern modifiers#1520
Closed
andreasrosdal wants to merge 3 commits into
Closed
Conversation
Add support for inline flag modifier groups (?ims:...), (?-ims:...), and mixed forms like (?i-m:...) that locally enable or disable the i (ignoreCase), m (multiline), and s (dotAll) flags for a subpattern. Parsing: - re_parse_term now recognizes (? followed by i/m/s/- as a modifier group, parses the add flags, an optional '-' and remove flags, then ':'. It validates that only i/m/s are used, no flag is duplicated or appears on both sides, and at least one flag is present (so (?-:...) is a SyntaxError while the empty (?:...) remains the plain non-capturing group). The current i/m/s parser state is saved, the modifier applied, the disjunction parsed recursively, and the state restored afterwards so flags only affect the group. The i and s flags are consumed at parse time (case folding and dot semantics), so toggling the parser state handles them directly. Multiline (m) is decided at match time in the original engine, so per group m required moving the decision to parse time. ^ and $ now emit REOP_line_start / REOP_line_end (multiline semantics, matching at any line boundary) when multiline is in effect, or the new REOP_bol / REOP_eol opcodes (absolute string start/end) otherwise. The matcher handles all four unconditionally with no flag check. Because case sensitivity can now differ between a group and the global flag, case folding can no longer be driven by a single match-time flag. Case-insensitive character, range and back reference matches now use dedicated opcodes (char*_ci, range*_ci, back_reference*_ci) that canonicalize the input, while the plain opcodes compare literally. The emitter chooses the variant from the effective ignore_case state, and the bytecode walkers, stack-size computation and dumper handle the new opcodes. Default behavior for patterns without modifiers and for the global i/m/s flags is unchanged. Enable the regexp-modifiers test262 feature in test262.conf. https://claude.ai/code/session_01MhkkobYvut7A4oP4w8eV1b
The new opcodes for RegExp pattern modifiers (char*_ci, bol/eol, back_reference_ci/backward_back_reference_ci, range*_ci) were inserted in the middle of the opcode list, which renumbered every opcode after them. lre-test.c builds bytecode from hardcoded opcode byte values (e.g. 0x0C = REOP_save_start) to exercise the out-of-bounds save-index validation, so the renumbering made that test's bytecode mean something else and the assertion aborted. Move all new opcodes to the end of libregexp-opcode.h so the existing opcode values stay stable. The only adjacency constraint among the new opcodes (backward_back_reference_ci must immediately follow back_reference_ci, used as REOP_back_reference_ci + is_backward_dir) is preserved. All opcodes are referenced by name elsewhere, so moving them is otherwise transparent. https://claude.ai/code/session_01MhkkobYvut7A4oP4w8eV1b
Two CI failures on this branch:
1. `make codegen` left gen/repl.c dirty. repl.js contains regexp literals
whose compiled bytecode changed because the regexp opcodes were
renumbered, but gen/repl.c was never regenerated. Regenerate it so the
CI clean-tree check passes.
2. Enabling the regexp-modifiers test262 feature surfaced three tests that
the engine cannot pass:
- add-ignoreCase-affects-slash-lower-b.js (\b after U+017F)
- add-ignoreCase-affects-slash-lower-p.js (\p{Lu} under i)
- add-ignoreCase-affects-slash-upper-b.js (\B between Z and U+017F)
These are pre-existing limitations: \b/\B (is_word_char) and \p{...}
character classes do not apply Unicode case folding under ignoreCase,
and they fail identically with the global /i flag. They are not
regressions from the modifiers feature, so record them in
test262_errors.txt as known errors (matching how other known
limitations are tracked).
https://claude.ai/code/session_01MhkkobYvut7A4oP4w8eV1b
Contributor
|
Have you gone through the code yourself? |
|
I have looked at the code, it is implemented by Claude. It's only a proposal, I hope it will be useful and that you will like it. Seems like a good idea. Since I am using this quickjs in the browser Nordstjernen.org I am trying to improve the JavaScript engine implementation in multiple ways to improve the browser, this is part of that effort. Feel free to give it a try, as it's the only current web browser using Quickjs |
Contributor
Author
|
Closing, since it's overly complicated |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Add support for inline flag modifier groups (?ims:...), (?-ims:...), and mixed forms like (?i-m:...) that locally enable or disable the i (ignoreCase), m (multiline), and s (dotAll) flags for a subpattern.
Another Claude coded improvement here, I hope it is good.
Parsing:
The i and s flags are consumed at parse time (case folding and dot semantics), so toggling the parser state handles them directly.
Multiline (m) is decided at match time in the original engine, so per group m required moving the decision to parse time. ^ and $ now emit REOP_line_start / REOP_line_end (multiline semantics, matching at any line boundary) when multiline is in effect, or the new REOP_bol / REOP_eol opcodes (absolute string start/end) otherwise. The matcher handles all four unconditionally with no flag check.
Because case sensitivity can now differ between a group and the global flag, case folding can no longer be driven by a single match-time flag. Case-insensitive character, range and back reference matches now use dedicated opcodes (char*_ci, range*_ci, back_reference*_ci) that canonicalize the input, while the plain opcodes compare literally. The emitter chooses the variant from the effective ignore_case state, and the bytecode walkers, stack-size computation and dumper handle the new opcodes.
Default behavior for patterns without modifiers and for the global i/m/s flags is unchanged.
Enable the regexp-modifiers test262 feature in test262.conf.