Fix: Unicode lower-casing does not preserve length #19
Lasse Borgholt (borgholt) merged 4 commits into main
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #19      +/-   ##
==========================================
- Coverage   93.38%   93.13%   -0.26%
==========================================
  Files           9        9
  Lines         635      641       +6
  Branches      102      104       +2
==========================================
+ Hits          593      597       +4
- Misses         15       17       +2
  Partials       27       27
Pull request overview
Fixes a Unicode edge case where text.lower() can expand a token’s length (breaking ensure_length_preservation) by removing non-spacing mark (Mn) code points after lowercasing.
Changes:
- Add a Unicode-category-based mechanism to remove Mn characters after lowercasing.
- Update basic_normalizer to apply the Mn-removal translation.
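The edge case and the Mn-removal idea described above can be sketched in plain Python. This is only an illustration under assumptions: the helper names `build_mn_removal_table` and `lower_preserving_length` are hypothetical and are not taken from the PR, whose actual implementation is not shown in this conversation.

```python
import sys
import unicodedata


def build_mn_removal_table() -> dict:
    """Hypothetical helper: map every non-spacing mark (category Mn) to None.

    The resulting dict can be passed to str.translate, which deletes any
    code point mapped to None.
    """
    return {
        cp: None
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == "Mn"
    }


# Built once at import time, since scanning all code points is not free.
MN_REMOVAL_TABLE = build_mn_removal_table()


def lower_preserving_length(text: str) -> str:
    """Hypothetical helper: lowercase, then delete all Mn code points."""
    return text.lower().translate(MN_REMOVAL_TABLE)


token = "\u0130stanbul"  # "İstanbul", 8 code points
lowered = token.lower()  # U+0130 lowercases to "i" + U+0307, so 9 code points
fixed = lower_preserving_length(token)  # the combining dot is removed again
```

Note that a table like this also deletes Mn code points that were already present in the input, so on its own it only preserves length for text that contains no combining marks before lowercasing.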
…y string in character classification utils functions
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
Description
- İ with I before lowercasing.
- unidecode translate to the empty string.
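The description mentions İ and I "before lowercasing". One plausible reading is that U+0130 is mapped to a plain I ahead of the call to lower(), so that no combining mark is produced in the first place. The sketch below assumes that reading; the function name `lower_with_dotted_i_fix` is hypothetical and not taken from the PR.

```python
def lower_with_dotted_i_fix(text: str) -> str:
    """Hypothetical helper: swap U+0130 for ASCII "I", then lowercase.

    "I".lower() is the single code point "i", so the replacement keeps the
    output the same length as the input for this character.
    """
    return text.replace("\u0130", "I").lower()


original = "\u0130STANBUL"  # "İSTANBUL", 8 code points
normalized = lower_with_dotted_i_fix(original)
```

Compared with deleting marks after lowercasing, this variant leaves combining marks that were already in the input untouched.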