Full Unicode Search at 50× ICU Speed with AVX‑512

(ashvardanian.com)

196 points ashvardanian | 1 comments | 15 Dec 25 16:42 UTC

mgaunard ◴[16 Dec 25 12:38 UTC] No.46287778[source]▶

In practice you should always normalize your Unicode data, then all you need to do is memcmp + boundary check.

Interestingly enough, this library doesn't provide grapheme cluster tokenization or boundary checking, which are among the most useful primitives for this.
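A minimal sketch of the "normalize, then compare plus boundary check" approach described above, using only Python's standard `unicodedata` module. The function name `find_normalized` is hypothetical, and the boundary check is a deliberate simplification: it only rejects matches that are immediately followed by a combining mark, whereas full grapheme cluster segmentation follows UAX #29 and is not available in the standard library.

```python
import unicodedata


def find_normalized(haystack: str, needle: str, form: str = "NFC") -> int:
    """Return the code point index of the first match in the *normalized*
    haystack, or -1 if there is none. (Hypothetical helper, for illustration.)"""
    h = unicodedata.normalize(form, haystack)
    n = unicodedata.normalize(form, needle)
    start = 0
    while True:
        # After normalization, matching is a plain substring comparison.
        i = h.find(n, start)
        if i < 0:
            return -1
        end = i + len(n)
        # Simplified boundary check (an assumption, not full UAX #29):
        # reject a match that is immediately followed by a combining mark,
        # because it would split a user-perceived character.
        if end == len(h) or unicodedata.combining(h[end]) == 0:
            return i
        start = i + 1


# Decomposed and precomposed spellings of "café" compare equal after NFC.
print(find_normalized("cafe\u0301 au lait", "caf\u00e9"))
# "x" followed by a combining acute accent is skipped; the bare "x" matches.
print(find_normalized("ax\u0301 x", "x"))
```

Note that the returned index refers to the normalized text, so mapping it back to an offset in the original input needs extra bookkeeping.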

replies(2): >>46287938 #>>46287993 #

stingraycharles ◴[16 Dec 25 12:57 UTC] No.46287938[source]▶

>>46287778 #

That’s not practical in many situations, as the normalization alone may very well be more expensive than the search.

If you’re in control of all data representations in your entire stack, then yes, of course. But that’s hardly ever the case, and different tradeoffs are made at different times (e.g. storage in UTF-8 for efficiency, but an in-memory representation in UTF-32 for speed).

replies(1): >>46288010 #

mgaunard ◴[16 Dec 25 13:06 UTC] No.46288010[source]▶

>>46287938 #

That doesn't make sense; the search is doing on-the-fly normalization as part of its algorithm, so it cannot be faster than normalization alone.

replies(3): >>46288133 #>>46288181 #>>46288218 #

Const-me ◴[16 Dec 25 13:24 UTC] No.46288218[source]▶

>>46288010 #

> it cannot be faster than normalization alone

Modern processors generally compute far faster than they can load and store bytes from main memory.

Code that does on-the-fly normalization only needs to normalize a small window. If you’re careful, you can even keep that window in registers, which have single-cycle access latency and ridiculously high throughput, on the order of 500 GB/s. Even if you have to store and reload, on-the-fly normalization is likely to handle tiny windows that fit in the in-core L1D cache. The access cost for L1D is about 5 cycles of latency, with equally high throughput, because many modern processors can load two 64-byte vectors and store one vector every cycle.
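To make the "small window" idea concrete, here is a rough scalar sketch. It assumes case folding as a stand-in for normalization, and the names `folded_stream` and `find_casefolded` are hypothetical. It is not the linked library's SIMD algorithm; it only shows that the search can transform input as it goes and keep no more than a needle-sized window, rather than writing a normalized copy of the whole haystack.

```python
from collections import deque


def folded_stream(text):
    """Yield (source_index, folded_char) pairs one at a time.
    A single source character may expand, e.g. 'ß' folds to 'ss'."""
    for index, ch in enumerate(text):
        for folded in ch.casefold():
            yield index, folded


def find_casefolded(haystack: str, needle: str) -> int:
    """Return the index in the ORIGINAL haystack of the character where the
    first case-insensitive match starts, or -1. Naive O(n*m) comparison."""
    target = tuple(needle.casefold())
    if not target:
        return 0
    # The only transformed state kept around is this needle-sized window.
    window = deque(maxlen=len(target))
    for index, folded in folded_stream(haystack):
        window.append((index, folded))
        if len(window) == len(target) and all(
            w[1] == t for w, t in zip(window, target)
        ):
            return window[0][0]
    return -1


print(find_casefolded("Die STRASSE ist lang", "straße"))
print(find_casefolded("Die Straße", "STRASSE"))
```

Because the window is bounded by the needle length, its working set stays tiny regardless of haystack size, which is the property the comment above relies on.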

replies(1): >>46288792 #

mgaunard ◴[16 Dec 25 14:20 UTC] No.46288792[source]▶

>>46288218 #

The author published the throughput of their algorithm: it's one fifth of typical memory bandwidth. (Obviously this benchmark can't go faster than memory, since we're assuming the data is not in cache.)
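The disagreement in this subthread can be framed with a back-of-the-envelope cost model. Every number below is an illustrative placeholder, not a measurement from the article; the only relationship taken from the thread is that the fused search runs at one fifth of memory bandwidth. The function names are hypothetical, and the model assumes each pass is limited by the slower of its compute rate and its memory traffic.

```python
GB = 1e9


def seconds_fused(n_bytes, algo_gbps):
    """One pass that normalizes on the fly while searching."""
    return n_bytes / (algo_gbps * GB)


def seconds_two_pass(n_bytes, normalize_gbps, search_gbps, memory_gbps):
    """Normalize into a new buffer, then run a plain byte search over it.
    Assumes the normalized copy has the same size as the input."""
    # Pass 1 reads n bytes and writes n bytes, so it moves 2n through memory.
    normalize = max(n_bytes / (normalize_gbps * GB),
                    2 * n_bytes / (memory_gbps * GB))
    # Pass 2 reads the normalized copy once.
    search = max(n_bytes / (search_gbps * GB),
                 n_bytes / (memory_gbps * GB))
    return normalize + search


# Placeholder figures: 50 GB/s memory, fused search at one fifth of that,
# normalization assumed equally expensive, plain search memory-bound.
n = 1 * GB
print(seconds_fused(n, algo_gbps=10))
print(seconds_two_pass(n, normalize_gbps=10, search_gbps=50, memory_gbps=50))
```

Under these assumed figures both approaches are dominated by the transformation cost, so the fused search wins a single query only by the cost of the extra pass, while a corpus that is normalized once and queried many times amortizes that cost.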
