Normalize curly apostrophes before POS tagging #101
Open
cnrmurphy wants to merge 1 commit into
spaCy does not always tag "n't" correctly when it is written with a curly apostrophe (U+2019). When the negation is at the end of a sentence, spaCy tags it as punctuation and the Lexicon then drops it. For example, "No, it wasn't." becomes "No, it was." and loses the negation. Normalizing curly single quotes to a straight apostrophe in preprocess, before the text reaches spaCy, lets the tagger see the form it handles correctly. This is separate from the normalization the Lexicon does at lookup time: that one runs after tagging, so it cannot influence the tag, and it is still needed for the preprocess=False path. Both are kept on purpose.
Fixes #100. Originally reported in hexgrad/kokoro#286.
A curly apostrophe (U+2019) in an `n't` contraction can cause spaCy to tag `n't` as punctuation instead of a negation, most reliably when the contraction is at the end of a clause. Misaki then drops that token, so the negation is lost.

Some examples I tested with:
The Lexicon already normalizes these quotes in `__call__`, but that runs after tagging, so it can't influence the tag. This change normalizes curly single quotes to a straight apostrophe in `preprocess`, before the text reaches spaCy, which gives the expected output in every case I tested.
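
For readers who want to see the idea in isolation, here is a minimal sketch of the kind of normalization described above. It is not the code from this PR: the helper name `normalize_apostrophes` and the exact set of characters mapped (U+2018 and U+2019) are assumptions made for illustration, and the real `preprocess` may handle more cases.

```python
# Hypothetical helper, for illustration only. The name and the character set
# are assumptions, not the actual change made in this PR.
CURLY_SINGLE_QUOTES = {
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (the curly apostrophe)
}
_QUOTE_TABLE = str.maketrans(CURLY_SINGLE_QUOTES)


def normalize_apostrophes(text: str) -> str:
    """Replace curly single quotes with a straight apostrophe (U+0027)."""
    return text.translate(_QUOTE_TABLE)


# The contraction now uses the straight apostrophe before any tagger sees it.
print(normalize_apostrophes("No, it wasn\u2019t."))
```

Running the replacement on the raw string, before tokenization, is what matters here: a fix applied after tagging can no longer change which tag the token received.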