←back to thread

199 points ashvardanian | 1 comments | | HN request time: 0.321s | source
Show context
mgaunard ◴[] No.46287778[source]
In practice you should always normalize your Unicode data, then all you need to do is memcmp + boundary check.

Interestingly enough this library doesn't provide grapheme cluster tokenization and/or boundary checking which is one of the most useful primitive for this.

replies(2): >>46287938 #>>46287993 #
stingraycharles ◴[] No.46287938[source]
That’s not practical in many situations, as the normalization alone may very well be more expensive than the search.

If you’re in control of all data representations in your entire stack, then yes of course, but that’s hardly ever the case and different tradeoffs are made at different times (eg storage in UTF-8 because of efficiency, but in-memory representation in UTF-32 because of speed).

replies(1): >>46288010 #
mgaunard ◴[] No.46288010[source]
That doesn't make sense; the search is doing on-the-fly normalization as part of its algorithm, so it cannot be faster than normalization alone.
replies(3): >>46288133 #>>46288181 #>>46288218 #
1. ashvardanian ◴[] No.46288181[source]
I get why it sounds that way, but it’s not actually true.

StringZilla added full Unicode case folding in an earlier release, and had a state-of-the-art exact case-sensitive substring search for years. However, doing a full fold of the entire haystack is significantly slower than the new case-insensitive search path.

The key point is that you don’t need to fully normalize the haystack to correctly answer most substring queries. The search algorithm can rule out the vast majority of positions using cheap, SIMD-friendly probes and only apply fold logic on a very small subset of candidates.

I go into the details in the “Ideation & Challenges in Substring Search” section of the article