Full Unicode Search at 50× ICU Speed with AVX‑512

1. unwind ◴[16 Dec 25 13:28 UTC] No.46288255[source]▶

Very cool and impressive performance.

I was worried by the article's use of U+212A (Kelvin symbol) as sample text, so I had to look it up [1]. I find it confusing when Unicode "shadows" of normal letters exist, and those are of course also dangerous in some cases when they can be mis-interpreted for the letter they look more or less exactly like.

Anyway, according to Wikipedia the dedicated symbol should not be used:

> However, this is a compatibility character provided for compatibility with legacy encodings. The Unicode standard recommends using U+004B K LATIN CAPITAL LETTER K instead; that is, a normal capital K.

That was comforting, to me. :)

[1]: https://en.wikipedia.org/wiki/Kelvin#Orthography

replies(1): >>46288569 #

2. jjmarr ◴[16 Dec 25 14:01 UTC] No.46288569[source]▶

>>46288255 (TP) #

> I find it confusing when Unicode "shadows" of normal letters exist, and those are of course also dangerous in some cases when they can be mis-interpreted for the letter they look more or less exactly like

Isn't this why Unicode normalization exists? This would let you compare Unicode letters and determine if they are canonically equivalent.

replies(2): >>46289094 #>>46289684 #

3. ComputerGuru ◴[16 Dec 25 14:41 UTC] No.46289094[source]▶

>>46288569 #

Normalization wouldn’t address this.

replies(2): >>46289219 #>>46289262 #

4. happytoexplain ◴[16 Dec 25 14:51 UTC] No.46289219{3}[source]▶

>>46289094 #

What do you mean? All four normal forms of the Kelvin 'K' are the Latin 'K', as far as I can tell.
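
A quick way to check this claim is with Python's standard `unicodedata` module. This is only a minimal sketch: the variable names are illustrative, and the result depends on the Unicode data shipped with the interpreter.

```python
import unicodedata

KELVIN = "\u212A"   # KELVIN SIGN
LATIN_K = "\u004B"  # LATIN CAPITAL LETTER K

# Normalize the Kelvin sign under each of the four normalization forms.
forms = ("NFC", "NFD", "NFKC", "NFKD")
normalized = {form: unicodedata.normalize(form, KELVIN) for form in forms}

for form, result in normalized.items():
    # Each result is a single code point, so ord() is safe here.
    print(form, f"U+{ord(result):04X}", result == LATIN_K)
```

The raw strings still differ before normalization, so a plain `==` on unnormalized input would not treat them as the same letter.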

5. nwellnhof ◴[16 Dec 25 14:54 UTC] No.46289262{3}[source]▶

>>46289094 #

The normalization forms NFKC and NFKD, which also handle compatibility equivalence, do.

replies(1): >>46289599 #

6. mananaysiempre ◴[16 Dec 25 15:20 UTC] No.46289599{4}[source]▶

>>46289262 #

A few deprecated characters, including the Kelvin and Ångström symbols, are in fact canonically equivalent to their replacements and not just compatibility equivalent, so plain NFC/NFD is enough. (It’s generally better to avoid NFKC/NFKD normalizations unless you fully understand the implications, as they do lose meaning and at the same time do not account for all possible confusables.)
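
To illustrate the distinction, here is a small sketch using Python's standard `unicodedata` module. The example characters (superscript two and Cyrillic Ka) are picked for illustration and are not from the thread.

```python
import unicodedata

ANGSTROM_SIGN = "\u212B"    # ANGSTROM SIGN (deprecated)
A_WITH_RING = "\u00C5"      # LATIN CAPITAL LETTER A WITH RING ABOVE
SUPERSCRIPT_TWO = "\u00B2"  # SUPERSCRIPT TWO (compatibility mapping to "2")
CYRILLIC_KA = "\u041A"      # CYRILLIC CAPITAL LETTER KA, a lookalike of Latin K

# Canonical equivalence: plain NFC already maps the deprecated sign.
nfc_angstrom = unicodedata.normalize("NFC", ANGSTROM_SIGN)

# Compatibility equivalence: NFC keeps the superscript, NFKC flattens it,
# so "m²" turns into "m2" and the exponent meaning is lost.
nfc_super = unicodedata.normalize("NFC", "m" + SUPERSCRIPT_TWO)
nfkc_super = unicodedata.normalize("NFKC", "m" + SUPERSCRIPT_TWO)

# Confusables: NFKC leaves the Cyrillic lookalike untouched.
nfkc_ka = unicodedata.normalize("NFKC", CYRILLIC_KA)

print(nfc_angstrom == A_WITH_RING)
print(nfc_super, nfkc_super)
print(nfkc_ka == "K")
```

Catching lookalikes such as the Cyrillic letter needs separate confusable data, which no normalization form provides.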

7. Sesse__ ◴[16 Dec 25 15:25 UTC] No.46289684[source]▶

>>46288569 #

It's why the Unicode Collation Algorithm exists.

If you look in allkeys.txt (the base UCA data, used if you don't have language-specific stuff in your comparisons) for the two code points in question, you'll find:

  004B  ; [.2514.0020.0008] # LATIN CAPITAL LETTER K
  212A  ; [.2514.0020.0008] # KELVIN SIGN

The numbers in the brackets are the weights on level 1 (base), level 2 (typically used for accents), and level 3 (typically used for case). So the two compare as identical under the UCA in almost every case, except when you really need a tiebreaker.

Compare e.g.:

  1D424 ; [.2514.0020.0005] # MATHEMATICAL BOLD SMALL K

which would compare equal to those under a case-insensitive, accent-sensitive collation, but _not_ a case-sensitive one (case-sensitive collations are always accent-sensitive, too).
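
The level-by-level comparison can be sketched in Python with the three weight tuples copied from the allkeys.txt lines quoted above. This is a deliberately simplified, single-character model: `WEIGHTS`, `sort_key`, and `equal_at` are hypothetical names, and a real UCA implementation builds sort keys over whole strings and handles contractions, variable weighting, and tailoring.

```python
# (level 1, level 2, level 3) weights, taken from the quoted allkeys.txt lines.
WEIGHTS = {
    "\u004B": (0x2514, 0x0020, 0x0008),      # LATIN CAPITAL LETTER K
    "\u212A": (0x2514, 0x0020, 0x0008),      # KELVIN SIGN
    "\U0001D424": (0x2514, 0x0020, 0x0005),  # MATHEMATICAL BOLD SMALL K
}


def sort_key(ch, strength):
    """Keep only the first `strength` levels of the character's weights."""
    return WEIGHTS[ch][:strength]


def equal_at(a, b, strength):
    """True if `a` and `b` are indistinguishable at the given strength."""
    return sort_key(a, strength) == sort_key(b, strength)


# Strength 2 ignores level 3 (case); strength 3 takes it into account.
print(equal_at("\u004B", "\u212A", 3))      # Latin K vs. Kelvin sign
print(equal_at("\u004B", "\U0001D424", 2))  # Latin K vs. bold small k, case ignored
print(equal_at("\u004B", "\U0001D424", 3))  # same pair, case considered
```

For real comparisons, a library such as ICU applies these rules to full strings; the sketch only shows why truncating the weights at a lower level makes more characters compare equal.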

replies(1): >>46290805 #

8. happytoexplain ◴[16 Dec 25 16:46 UTC] No.46290805{3}[source]▶

>>46289684 #

Are the meanings for the levels for each code point defined somewhere (accent, casing, etc)?

replies(1): >>46291714 #

9. Sesse__ ◴[16 Dec 25 17:51 UTC] No.46291714{4}[source]▶

>>46290805 #

Typically it is defined by the collation. For the default collation, where all the weights are as in the file, it's none/accent/accent+case. But if you go to e.g. Japanese, you can have a fourth level of “kana-sensitive” (which distinguishes between e.g. katakana and hiragana).