Normalize curly apostrophes before POS tagging #101

Open
cnrmurphy wants to merge 1 commit into hexgrad:main from cnrmurphy:fix/normalize-apostrophe-issue286

cnrmurphy commented Jun 8, 2026

Fixes #100. Originally reported in hexgrad/kokoro#286.

A curly apostrophe (U+2019) in an n't contraction can cause spaCy to tag n't as punctuation instead of a negation, most reliably when the contraction is at the end of a clause. Misaki then drops that token, so the negation is lost.

Some examples I tested with:

from misaki import en
g2p = en.G2P(british=False, fallback=None)

print(g2p("No, it wasn't.")[0])   # straight '  -> nˈO, ɪt wˈʌzᵊnt.
print(g2p("No, it wasn’t.")[0])   # curly U+2019 -> nˈO, ɪt wʌz.   ("wasn't" -> "was")

The Lexicon already normalizes these quotes in __call__, but that runs after tagging, so it can't influence the tag. This change normalizes curly single quotes to a straight apostrophe in preprocess, before the text reaches spaCy, which gives the expected output in every case I tested.
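
As a rough illustration of the normalization step described above, here is a minimal standalone sketch. The helper name `normalize_apostrophes` and the mapping table are hypothetical and are not taken from the misaki source. Including U+2018 alongside U+2019 is an assumption based on the phrase "curly single quotes". The actual change lives in `preprocess` and may be written differently.

```python
# Hypothetical helper: the name and the exact character set are assumptions
# for illustration, not the code from this PR.
_CURLY_SINGLE_QUOTES = {
    0x2018: "'",  # LEFT SINGLE QUOTATION MARK
    0x2019: "'",  # RIGHT SINGLE QUOTATION MARK
}


def normalize_apostrophes(text: str) -> str:
    """Replace curly single quotes with a straight apostrophe (U+0027)."""
    return text.translate(_CURLY_SINGLE_QUOTES)


# The curly form from the example above, written with an escape for clarity.
print(normalize_apostrophes("No, it wasn\u2019t."))
```

Running this on text before it is handed to the tagger means spaCy only ever sees the straight form. The function is idempotent, so applying it again at lookup time is harmless.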

spaCy does not always tag "n't" correctly when it is written with a curly
apostrophe (U+2019). When the negation is at the end of a sentence, spaCy
tags it as punctuation and the Lexicon then drops it. For example,
"No, it wasn’t." (curly apostrophe) becomes "No, it was." and loses the
negation.

Normalizing curly single quotes to a straight apostrophe in preprocess,
before the text reaches spaCy, lets the tagger see the form it handles
correctly.

This is separate from the normalization the Lexicon does at lookup time:
that one runs after tagging, so it cannot influence the tag, and it is
still needed for the preprocess=False path. Both are kept on purpose.

Development

Successfully merging this pull request may close these issues.

Curly apostrophes (U+2019) cause 'n't' contractions to lose their negation
