(ashvardanian.com)

196 points ashvardanian | 1 comments | 15 Dec 25 16:42 UTC


mgaunard ◴[16 Dec 25 12:38 UTC] No.46287778[source]▶

In practice you should always normalize your Unicode data, then all you need to do is memcmp + boundary check.

Interestingly enough, this library doesn't provide grapheme cluster tokenization or boundary checking, which is one of the most useful primitives for this.
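
For illustration, a minimal Python sketch of the "normalize, then compare plus boundary check" idea from this comment. The function name `find_normalized` is hypothetical and not part of the library under discussion. Python's `str.find` stands in for `memcmp`, and the boundary check is a deliberate simplification: it only rejects a match that is immediately followed by a combining mark, which is far short of full grapheme cluster segmentation.

```python
import unicodedata


def find_normalized(haystack: str, needle: str, form: str = "NFC") -> int:
    """Return the index of needle in the *normalized* haystack, or -1.

    Hypothetical helper: both inputs are normalized to the same form,
    then compared code point by code point (the memcmp step).
    """
    h = unicodedata.normalize(form, haystack)
    n = unicodedata.normalize(form, needle)
    if not n:
        return -1

    start = 0
    while (pos := h.find(n, start)) != -1:
        end = pos + len(n)
        # Simplified boundary check (assumption: not full grapheme
        # segmentation): reject the match if the next code point is a
        # combining mark, since it would modify the last matched character.
        if end == len(h) or unicodedata.combining(h[end]) == 0:
            return pos
        start = pos + 1
    return -1


# Decomposed "e" + U+0301 in the text matches a precomposed query.
print(find_normalized("cafe\u0301 menu", "caf\u00e9"))
# "q" followed by a combining acute has no precomposed form, so a bare
# "q" query is rejected by the boundary check.
print(find_normalized("q\u0301x", "q"))
```

Note that the returned index refers to the normalized text, so mapping it back to the original input would need extra bookkeeping.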

replies(2): >>46287938 #>>46287993 #

1. orthoxerox ◴[16 Dec 25 13:03 UTC] No.46287993[source]▶

>>46287778 #

In practice the data is not always yours to normalize. You're not going to case-fold your library, but you still want to be able to search it.
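
As a rough sketch of that situation, the Python below folds text only at comparison time and leaves the stored titles untouched. The names `caseless_key` and `search_library` and the sample titles are made up for illustration. The fold follows the canonical caseless matching recipe described in the Python documentation for `str.casefold` (NFD, case fold, NFD again); a real engine would do this incrementally rather than re-folding every title per query.

```python
import unicodedata


def caseless_key(text: str) -> str:
    """Fold text for canonical caseless comparison: NFD(casefold(NFD(x)))."""
    nfd = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", nfd.casefold())


def search_library(titles: list[str], query: str) -> list[str]:
    """Return the original, unmodified titles that contain the query."""
    key = caseless_key(query)
    # The stored titles are never rewritten; folding happens on the fly.
    return [title for title in titles if key in caseless_key(title)]


# Hypothetical sample data.
titles = ["Die Straße nach Süden", "STRASSE UND WEG", "Other"]
print(search_library(titles, "strasse"))
```

The cost of this approach is that every search pays for folding the whole collection, which is the work a case-insensitive search routine avoids by matching on the raw bytes directly.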


Full Unicode Search at 50× ICU Speed with AVX‑512