
196 points ashvardanian | 2 comments
mgaunard ◴[] No.46287778[source]
In practice you should always normalize your Unicode data, then all you need to do is memcmp + boundary check.

Interestingly enough, this library doesn't provide grapheme cluster tokenization and/or boundary checking, which is one of the most useful primitives for this.
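
A minimal Python sketch of the normalize-then-memcmp idea (the function name is made up, unicodedata’s NFC stands in for whichever form you normalize to, and the trailing combining-mark test is only a crude stand-in for a real UAX #29 grapheme-cluster boundary check):

    import unicodedata

    def normalized_find(haystack: str, needle: str) -> int:
        # Normalize both sides once; after that a plain substring search
        # is the moral equivalent of memcmp.
        h = unicodedata.normalize("NFC", haystack)
        n = unicodedata.normalize("NFC", needle)
        pos = h.find(n)
        while pos != -1:
            end = pos + len(n)
            # Crude boundary check: reject a hit whose next character is a
            # combining mark, so we don't match "e" inside "e" + U+0301.
            if end >= len(h) or unicodedata.combining(h[end]) == 0:
                return pos
            pos = h.find(n, pos + 1)
        return -1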

replies(2): >>46287938 #>>46287993 #
stingraycharles ◴[] No.46287938[source]
That’s not practical in many situations, as the normalization alone may very well be more expensive than the search.

If you’re in control of all data representations in your entire stack, then yes, of course; but that’s hardly ever the case, and different tradeoffs are made at different times (e.g. storage in UTF-8 for space efficiency, but an in-memory representation in UTF-32 for speed).

replies(1): >>46288010 #
mgaunard ◴[] No.46288010[source]
That doesn't make sense; the search is doing on-the-fly normalization as part of its algorithm, so it cannot be faster than normalization alone.
replies(3): >>46288133 #>>46288181 #>>46288218 #
stingraycharles ◴[] No.46288133[source]
It can, because of how CPUs work: the data you just normalized is still in registers and the comparison runs in the same hot code path, rather than in a separate pass over memory.

First normalizing everything and then comparing normalized versions isn’t as fast.

And it also enables “stopping early” once a match has been found or ruled out: you may not actually have to normalize everything.
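
A toy Python sketch of the early-exit point (helper names are made up; splitting at combining-class-0 characters is a simplification of real normalization boundaries and breaks for e.g. Hangul jamo — it’s only meant to show the lazy, stop-at-first-mismatch flow):

    import unicodedata
    from itertools import zip_longest

    def _chunks(s: str):
        # Start a new chunk at each combining-class-0 character, so each
        # chunk can be normalized on its own (simplified boundary rule).
        start = 0
        for i, ch in enumerate(s):
            if i > start and unicodedata.combining(ch) == 0:
                yield s[start:i]
                start = i
        if s:
            yield s[start:]

    def equals_normalized(a: str, b: str) -> bool:
        # Normalize lazily, chunk by chunk, and bail at the first mismatch
        # instead of normalizing both strings up front.
        norm_a = (unicodedata.normalize("NFC", c) for c in _chunks(a))
        norm_b = (unicodedata.normalize("NFC", c) for c in _chunks(b))
        for ca, cb in zip_longest(norm_a, norm_b):
            if ca != cb:
                return False
        return True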

replies(1): >>46288760 #
1. mgaunard ◴[] No.46288760{3}[source]
Running more code per unit of data does not make the code hotter or reduce register pressure; quite the opposite...
replies(1): >>46288861 #
2. stingraycharles ◴[] No.46288861[source]
You’re misunderstanding: you just convert to 32 bits once and reuse that same register all the time.

You’re running the exact same code, but you’re more efficient in the sense of “I immediately use the data for comparison after converting it”, which means it’s likely still in a register or in L1 cache.
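
A rough Python sketch of that data flow (hypothetical helper, assumes well-formed UTF-8 and skips validation; in a compiled implementation the freshly decoded value would sit in a register rather than a Python int):

    def utf8_equals_codepoints(data: bytes, codepoints: list) -> bool:
        # Decode one code point at a time and compare it immediately,
        # instead of materializing a full UTF-32 copy of the input first.
        i, j = 0, 0
        while i < len(data) and j < len(codepoints):
            lead = data[i]
            if lead < 0x80:
                cp, n = lead, 1
            elif lead < 0xE0:
                cp, n = lead & 0x1F, 2
            elif lead < 0xF0:
                cp, n = lead & 0x0F, 3
            else:
                cp, n = lead & 0x07, 4
            for k in range(1, n):
                cp = (cp << 6) | (data[i + k] & 0x3F)
            if cp != codepoints[j]:   # mismatch: stop early
                return False
            i, j = i + n, j + 1
        return i == len(data) and j == len(codepoints)

    # e.g. utf8_equals_codepoints("café".encode(), [ord(c) for c in "café"]) -> True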