196 points ashvardanian | 1 comments
mgaunard No.46287778
In practice you should always normalize your Unicode data; then all you need to do is memcmp + a boundary check.

Interestingly enough, this library doesn't provide grapheme cluster tokenization and/or boundary checking, which is one of the most useful primitives for this.
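
For illustration, a minimal Python sketch of the "normalize, then compare + boundary check" idea. It is not taken from the library under discussion. The function name `find_normalized` is hypothetical, `str.find` stands in for memcmp, and the boundary check is a deliberately crude approximation (reject a match that is immediately followed by a combining mark), not full grapheme cluster segmentation.

```python
import unicodedata


def find_normalized(haystack: str, needle: str) -> int:
    """Return the index of needle in the NFC-normalized haystack, or -1.

    Hypothetical helper: the index refers to the normalized text,
    not to the original input.
    """
    h = unicodedata.normalize("NFC", haystack)
    n = unicodedata.normalize("NFC", needle)
    start = h.find(n)  # plain code-point comparison, the memcmp stand-in
    while start != -1:
        end = start + len(n)
        # Crude boundary check: a match that is followed by a combining
        # mark would split a user-perceived character, so skip it.
        if end == len(h) or unicodedata.combining(h[end]) == 0:
            return start
        start = h.find(n, start + 1)
    return -1


# Decomposed "e" + U+0301 in the text matches a precomposed "é" in the query.
print(find_normalized("cafe\u0301 au lait", "caf\u00e9"))
# "x" + U+0301 has no precomposed form, so only the boundary check
# prevents a bare "x" from matching inside it.
print(find_normalized("x\u0301y", "x"))
```

A real implementation would apply the grapheme cluster rules from the Unicode text segmentation annex on both ends of the match; the combining-class test above only covers the simplest case.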

replies(2): >>46287938 #>>46287993 #
1. orthoxerox No.46287993
In practice the data is not always yours to normalize. You're not going to case-fold your library, but you still want to be able to search it.
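
A rough Python sketch of that situation, assuming the stored titles must stay byte-for-byte as they are and folding happens only at comparison time. The names `fold_key` and `search_titles` and the sample titles are made up for illustration; the folding recipe (NFD, case-fold, NFD again) follows the canonical caseless matching approach shown in the Python `unicodedata`/`str.casefold` documentation.

```python
import unicodedata


def fold_key(s: str) -> str:
    """Build a throwaway comparison key; the input string is not changed."""
    nfd = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", nfd.casefold())


def search_titles(titles: list[str], query: str) -> list[str]:
    """Return the original, unmodified titles whose folded form contains the query."""
    q = fold_key(query)
    return [t for t in titles if q in fold_key(t)]


# Hypothetical library catalogue; the stored strings are left untouched.
library = ["Die Straße nach Süden", "STRASSE ohne Ende", "Moby-Dick"]
print(search_titles(library, "strasse"))
```

Folding every title on every query is wasteful for a large collection; keeping a separate folded index next to the untouched originals would be the usual trade-off.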