Fix: Unicode lower-casing does not preserve length #19

Merged
Lasse Borgholt (borgholt) merged 4 commits into main from fix-unicode-non-spacing-lower
Apr 2, 2026

Conversation

Collaborator

@borgholt Lasse Borgholt (borgholt) commented Mar 29, 2026

Description

  • Replace expanding character İ with I before lowercasing.
  • Implement a fail-safe for characters that unidecode translates to the empty string.
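
A minimal sketch of the two points above, not the code from this PR. Python's `str.lower()` maps `İ` (U+0130) to `i` followed by U+0307 COMBINING DOT ABOVE, so the lowercased string is one code point longer. The helper names `lower_preserving_length` and `transliterate_with_fallback` are hypothetical, and the fallback policy (keep the original character when the conversion yields an empty string) is an assumption. A stand-in `transliterate` callable is passed in instead of importing unidecode, so the example runs with the standard library only.

```python
from typing import Callable

DOTTED_CAPITAL_I = "\u0130"  # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE


def lower_preserving_length(text: str) -> str:
    """Replace İ with I, then lowercase (hypothetical helper)."""
    # "\u0130".lower() yields "i" + U+0307, i.e. two code points instead of one.
    return text.replace(DOTTED_CAPITAL_I, "I").lower()


def transliterate_with_fallback(text: str, transliterate: Callable[[str], str]) -> str:
    """Transliterate character by character (hypothetical helper).

    Assumption: when a character converts to the empty string, the original
    character is kept so that it does not silently disappear.
    """
    out = []
    for ch in text:
        converted = transliterate(ch)
        out.append(converted if converted else ch)
    return "".join(out)


if __name__ == "__main__":
    word = "\u0130stanbul"
    print(len(word), len(word.lower()))        # 8 9
    print(len(lower_preserving_length(word)))  # 8
```

In this sketch the replacement happens before `lower()`, so the combining dot is never produced in the first place.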

codecov bot commented Mar 29, 2026

Codecov Report

❌ Patch coverage is 77.77778% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.13%. Comparing base (6ae4f14) to head (63b5d3d).
⚠️ Report is 1 commit behind head on main.

Files with missing lines    Patch %    Lines
src/error_align/utils.py    77.77%     2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #19      +/-   ##
==========================================
- Coverage   93.38%   93.13%   -0.26%     
==========================================
  Files           9        9              
  Lines         635      641       +6     
  Branches      102      104       +2     
==========================================
+ Hits          593      597       +4     
- Misses         15       17       +2     
  Partials       27       27              

Contributor

Copilot AI left a comment

Pull request overview

Fixes a Unicode edge case in which text.lower() can expand a token’s length, breaking ensure_length_preservation. The fix removes non-spacing mark (Mn) code points after lowercasing.

Changes:

  • Add a Unicode-category-based mechanism to remove Mn characters after lowercasing.
  • Update basic_normalizer to apply the Mn-removal translation.
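
The Mn-removal approach described in this overview could look roughly like the following. This is a sketch under assumptions, not the code under review: the names `MN_REMOVAL_TABLE` and `lower_without_marks` are hypothetical, and the table is built with the standard library's `unicodedata.category`.

```python
import sys
import unicodedata

# Map every code point in category Mn (non-spacing mark) to None so that
# str.translate() deletes it. The table is built once at import time.
MN_REMOVAL_TABLE = {
    cp: None
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Mn"
}


def lower_without_marks(text: str) -> str:
    """Lowercase, then drop non-spacing marks (hypothetical helper)."""
    return text.lower().translate(MN_REMOVAL_TABLE)


if __name__ == "__main__":
    word = "\u0130stanbul"  # starts with İ (U+0130)
    print(len(word), len(word.lower()), len(lower_without_marks(word)))  # 8 9 8
```

Note that this sketch also strips any combining marks already present in the input (for example `e` followed by U+0301), so it preserves length only when the input contains no Mn code points of its own.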

Comment thread src/error_align/utils.py Outdated
Comment thread src/error_align/utils.py Outdated
Comment thread src/error_align/utils.py Outdated
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.


Comment thread src/error_align/utils.py
Comment thread src/error_align/utils.py
Comment thread src/error_align/utils.py
@borgholt Lasse Borgholt (borgholt) merged commit b1ac420 into main Apr 2, 2026
9 checks passed
3 participants