fix: unicode is handled properly in strdist#280
Conversation
| } | ||
| _ = stop | ||
| if cut != 0 && stop { | ||
| if cut != 0 && len(b) > 0 && stop { |
There was a problem hiding this comment.
This is not strictly related to unicode but it made the test fail when the second string was empty so I fixed it as well.
For context, stop is meant to represent that the minimum edit cost is greater than the threshold. However, if the second string is empty, stop will not be set to false by the inner loop, which is a bug (see test case).
There was a problem hiding this comment.
Can you put into words why this is the right fix here? Note that this is disabling the cut logic altogether when b is empty.
There was a problem hiding this comment.
Your intuition was right this is only a partial fix. I added several more test cases that were failing previously. I also changed the logic to finish as soon as cut is reached. The issue was the following, stop is computed by iterating over b[i] and seeing if:
- cost of swapping it for
a[i] - cost of inserting
b[i] - cost of removing
a[i] - 0 if
a[i] == b[i]
any of these was less than cut. The problem is that to enter the loop b could not be empty. The calculation above the for loop correctly accounted for this case in lst[0], that is comparing a[:j] to b[:0]. The problem is that we were not using the cost here to compute min.
You were the one devising the algorithm so I hope I got a good understanding, let me know if we need to discuss it in more depth in person.
niemeyer
left a comment
There was a problem hiding this comment.
Thanks for the fix! A question about the added logic.
| } | ||
| _ = stop | ||
| if cut != 0 && stop { | ||
| if cut != 0 && len(b) > 0 && stop { |
There was a problem hiding this comment.
Can you put into words why this is the right fix here? Note that this is disabling the cut logic altogether when b is empty.
While doing more performance work I noticed a couple of bugs in our implementation of distance. I will fix each one of them in a different PR so that the perf work can land. I have added a TODO for the next bug. I don't want to fix both as part of the same PR as it will be much harder to reason about the code, the fix for unicode in itself is pretty easy.
This PR fixes a common bug when handling strings in Go. The for loop:
Is iterating over a string and using
biwhich is the byte offset to store things inlstinstead of the current rune count. In the case where the characters are ASCII it works because each rune is 1 byte but in the case of runes that take more than 1 byte this fails (see the added test case).