Fix: Unicode lower-casing does not preserve length by borgholt · Pull Request #19 · corticph/error-align

Lasse Borgholt (borgholt) · 2026-03-29T12:32:58Z

Description

Replace the expanding character İ with I before lowercasing.
Implement a fail-safe for characters that unidecode translates to the empty string.
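
As a minimal sketch of the first item (not the PR's actual code; `length_preserving_lower` is a hypothetical name), the following shows why `str.lower()` can lengthen a string and how replacing İ beforehand avoids it:

```python
# Hypothetical helper for illustration; not taken from error_align.
DOTTED_CAPITAL_I = "\u0130"  # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE


def length_preserving_lower(text: str) -> str:
    """Lowercase text without letting İ expand into two code points."""
    # str.lower() maps İ to "i" followed by U+0307 (combining dot above),
    # so İ is swapped for a plain "I" before lowercasing.
    return text.replace(DOTTED_CAPITAL_I, "I").lower()


word = "\u0130stanbul"                 # "İstanbul", 8 code points
naive = word.lower()                   # 9 code points: "i" + U+0307 + "stanbul"
safe = length_preserving_lower(word)   # 8 code points: "istanbul"
```

The trade-off is that the dot above is discarded, which is acceptable only if the normalizer is meant to fold such distinctions anyway.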

codecov · 2026-03-29T12:33:54Z

Codecov Report

❌ Patch coverage is 77.77778% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.13%. Comparing base (6ae4f14) to head (63b5d3d).
⚠️ Report is 1 commit behind head on main.

Files with missing lines	Patch %	Lines
src/error_align/utils.py	77.77%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #19      +/-   ##
==========================================
- Coverage   93.38%   93.13%   -0.26%     
==========================================
  Files           9        9              
  Lines         635      641       +6     
  Branches      102      104       +2     
==========================================
+ Hits          593      597       +4     
- Misses         15       17       +2     
  Partials       27       27

Copilot

Pull request overview

Fixes a Unicode edge case where text.lower() can expand a token’s length (breaking ensure_length_preservation) by removing non-spacing mark (Mn) code points after lowercasing.

Changes:

Add a Unicode-category-based mechanism to remove Mn characters after lowercasing.
Update basic_normalizer to apply the Mn-removal translation.
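
The category-based approach described here could look roughly like the sketch below. It assumes only the standard-library `unicodedata` module; `lower_and_strip_marks` is a hypothetical name, not a function from `error_align`, and the real change may differ.

```python
import unicodedata


def lower_and_strip_marks(text: str) -> str:
    """Lowercase text, then drop every non-spacing mark (category Mn)."""
    lowered = text.lower()
    # unicodedata.category returns the two-letter general category;
    # "Mn" covers combining marks such as U+0307 (combining dot above).
    return "".join(ch for ch in lowered if unicodedata.category(ch) != "Mn")


expanded = "\u0130".lower()                    # "i" + U+0307
cleaned = lower_and_strip_marks("\u0130")      # "i"
decomposed = lower_and_strip_marks("e\u0301")  # "e": the pre-existing accent is dropped too
```

One caveat of this sketch: it also removes marks that were already present in decomposed input, so it can shorten text that `lower()` alone would have left unchanged in length.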

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Added a translate call to remove non-spacing unicode characters after…

4540176

… lower-casing

Lasse Borgholt (borgholt) requested a review from Arne Nix (ArneNx) March 29, 2026 12:32

Lasse Borgholt (borgholt) self-assigned this Mar 29, 2026

Lasse Borgholt (borgholt) added the patch label Mar 29, 2026

Copilot AI review requested due to automatic review settings March 29, 2026 12:32

Copilot started reviewing on behalf of Lasse Borgholt (borgholt) March 29, 2026 12:33

Copilot AI reviewed Mar 29, 2026

Comment thread src/error_align/utils.py Outdated

Comment thread src/error_align/utils.py Outdated

Comment thread src/error_align/utils.py Outdated

Lasse Borgholt (borgholt) added 2 commits March 30, 2026 11:11

Added fallback to False for characters that transliterate to the empt…

5a8fe71

…y string in character classification utils functions
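
A fallback of the kind this commit message describes might be sketched as follows. Everything here is an assumption for illustration: `is_vowel` is a hypothetical classification function, and the transliterator is passed in as a parameter (in practice it would presumably be `unidecode.unidecode`) so the sketch runs without third-party packages.

```python
from typing import Callable


def is_vowel(char: str, transliterate: Callable[[str], str]) -> bool:
    """Classify a character by its ASCII transliteration (hypothetical helper)."""
    ascii_form = transliterate(char)
    # Fail-safe: a character that transliterates to "" cannot be classified,
    # and indexing ascii_form[0] would raise IndexError, so return False.
    if not ascii_form:
        return False
    return ascii_form[0].lower() in "aeiou"


# Stub transliterators standing in for unidecode in this sketch.
empty_result = is_vowel("\u0307", lambda ch: "")
vowel_result = is_vowel("\u00e9", lambda ch: "e")
consonant_result = is_vowel("\u00e7", lambda ch: "c")
```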

Simplified fix for non-spacing lower-casing characters

d572efa

Lasse Borgholt (borgholt) requested a review from Copilot March 30, 2026 09:50

Copilot started reviewing on behalf of Lasse Borgholt (borgholt) March 30, 2026 09:51

Copilot AI reviewed Mar 30, 2026

Comment thread src/error_align/utils.py

Comment thread src/error_align/utils.py

Comment thread src/error_align/utils.py

Added tests for empty unidecode chars

63b5d3d

Arne Nix (ArneNx) approved these changes Mar 31, 2026

Lasse Borgholt (borgholt) merged commit b1ac420 into main Apr 2, 2026
9 checks passed

Lasse Borgholt (borgholt) merged 4 commits into main from fix-unicode-non-spacing-lower

3 participants