Commit 9cb6949
fix(lapis): allow non-ASCII characters in advanced queries (#1628)
resolves #1603
## Problem
Non-ASCII characters (umlauts, accented letters, Cyrillic, CJK, etc.) in
unquoted advanced query values were silently dropped by the ANTLR lexer,
producing wrong results with no error. For example:
- `division=Zürich` was parsed as `division=Zrich` → 0 results
- `division.regex=Graubünden` was parsed as `division.regex=Graubnden` →
0 results
Quoted values like `division='Zürich'` already worked correctly, since
the `QUOTED_STRING` lexer rule accepts any character.
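
The dropping behaviour described above can be imitated with a small stand-alone Kotlin sketch. It is not the LAPIS lexer: `isRecognizedBeforeFix` and `lexUnquotedBeforeFix` are hypothetical names, and the accepted character set (ASCII letters and digits only) is an assumption made for illustration. The sketch only shows how skipping unmatched characters turns `Zürich` into `Zrich` without raising an error.

```kotlin
// Hypothetical stand-in for the pre-fix character set of unquoted values.
// Assumption: only ASCII letters and digits are matched by a lexer rule.
fun isRecognizedBeforeFix(c: Char): Boolean =
    c in 'a'..'z' || c in 'A'..'Z' || c in '0'..'9'

// Imitates a lexer that skips every character no rule matches
// instead of failing the whole query.
fun lexUnquotedBeforeFix(value: String): String =
    value.filter(::isRecognizedBeforeFix)

fun main() {
    // \u00FC is the precomposed letter "ü".
    println(lexUnquotedBeforeFix("Z\u00FCrich"))     // prints "Zrich"
    println(lexUnquotedBeforeFix("Graub\u00FCnden")) // prints "Graubnden"
}
```

Because the shortened value is still a syntactically valid query, the request succeeds and simply matches nothing, which is why the failure was easy to miss.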
## Fix
Added a `UNICODE_LETTER` lexer rule (`[\p{Letter}]`) and included it in
the `charOrNumber` parser rule. This makes unquoted values behave
consistently with quoted ones for any Unicode letter. ASCII letters
continue to be matched by the existing `A`–`Z` lexer rules (which take
priority by rule order), so all existing parsing — nucleotide/amino acid
symbols, keywords (`NOT`, `MAYBE`, `ISNULL`, etc.) — is unaffected.
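
The rule-order argument can be sketched in plain Kotlin, outside ANTLR. This is an illustration under assumptions, not the grammar itself: `TokenKind` and `classify` are hypothetical names, and Kotlin's `Char.isLetter()` (true for the Unicode letter categories) stands in for `\p{Letter}`. The ASCII branch comes first, mirroring how the existing `A`-`Z` rules take priority over `UNICODE_LETTER`.

```kotlin
// Hypothetical token kinds for illustration only.
enum class TokenKind { ASCII_LETTER, UNICODE_LETTER, OTHER }

// Branch order mirrors lexer rule order: ASCII letters are claimed first,
// so the Unicode branch only ever sees non-ASCII letters.
fun classify(c: Char): TokenKind = when {
    c in 'a'..'z' || c in 'A'..'Z' -> TokenKind.ASCII_LETTER
    c.isLetter() -> TokenKind.UNICODE_LETTER // stand-in for \p{Letter}
    else -> TokenKind.OTHER
}

fun main() {
    // Z, ü, Cyrillic Zhe, a CJK ideograph, a digit, and a combining diaeresis.
    val samples = listOf('Z', '\u00FC', '\u0416', '\u4E2D', '9', '\u0308')
    for (c in samples) {
        println("U+%04X -> %s".format(c.code, classify(c)))
    }
}
```

One consequence worth noting: a combining mark such as U+0308 belongs to the Mark category, not Letter, so a value typed in decomposed form (`u` followed by U+0308) would not be fully covered by a letter-only rule. Whether LAPIS normalizes input before parsing is not stated in this commit.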
Non-ASCII characters are also now valid in field name and gene/segment
name positions, where they will produce a meaningful "field/gene not
found" error rather than a silent wrong result or a confusing syntax
error.
See also antlr/antlr4#1688 for background.
## PR Checklist
- [x] All necessary documentation has been adapted.
- [x] All necessary changes are explained in the `llms.txt`.
- [x] The implemented feature is covered by an appropriate test.

1 parent 6fcb130 commit 9cb6949
2 files changed
Lines changed: 94 additions & 1 deletion
File tree
- lapis/src
  - main/antlr/org/genspectrum/lapis/model/advancedqueryparser
  - test/kotlin/org/genspectrum/lapis/model
Lines changed: 2 additions & 1 deletion (line 87 modified; new line 125 added)
Lines changed: 92 additions & 0 deletions (added at new lines 344-348, 633-704, 806-815, and 887-891)