Full Unicode Search at 50× ICU Speed with AVX‑512

1. orthoxerox ◴[16 Dec 25 13:17 UTC] No.46288136[source]▶

Is it possible to extend this to support additional transformation rules like Any-Latin;Latin-ASCII? To make it possible to find "Վարդանյան" in a haystack by searching for "vardanyan"?
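To make the request concrete, here is a minimal sketch of what a transform like Any-Latin;Latin-ASCII does before matching: fold case, map letters to Latin, then strip diacritics. It uses only the Python standard library, not ICU or the library under discussion. The `ARMENIAN_TO_LATIN` table and both function names are hypothetical and cover only the letters in this one example.

```python
import unicodedata

# Hypothetical, deliberately tiny Armenian-to-Latin table: it covers only the
# letters needed for the example. A real Any-Latin transform covers far more.
ARMENIAN_TO_LATIN = {
    "վ": "v", "ա": "a", "ր": "r", "դ": "d", "ն": "n", "յ": "y",
}

def to_ascii_key(text: str) -> str:
    """Case-fold, transliterate known letters, then drop combining marks."""
    folded = text.casefold()
    latin = "".join(ARMENIAN_TO_LATIN.get(ch, ch) for ch in folded)
    decomposed = unicodedata.normalize("NFD", latin)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def find_transliterated(haystack: str, needle: str) -> bool:
    """Substring test on the transliterated keys of both strings."""
    return to_ascii_key(needle) in to_ascii_key(haystack)
```

With this table, `find_transliterated("by Վարդանյան", "vardanyan")` matches. The sketch transforms the whole haystack up front, which a fast search kernel would presumably want to avoid.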

replies(1): >>46288259 #

2. ashvardanian ◴[16 Dec 25 13:29 UTC] No.46288259[source]▶

>>46288136 (TP) #

Yes — fuzzy and phonetic matching across languages is part of the roadmap. That space is still poorly standardized, so I wanted to start with something widely understood and well-defined (ICU-style transforms) before layering on more advanced behavior.

Also, as shown in the later tables, the Armenian and Georgian fast paths still have room for improvement. Before introducing higher-level APIs, I need to tighten the existing Armenian kernel and add a dedicated one for Georgian. Georgian is not a true bicameral script, but some of its characters are case-folding targets for older scripts, which currently forces too many fallbacks to the serial path.
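The folding behaviour in question can be inspected with Python's built-in `str.casefold()`, which applies Unicode full case folding. This is only an illustration of the Unicode data, not of the library's kernels; the helper name `fold_report` is hypothetical, and the Georgian Mtavruli result assumes a Python build with Unicode 11 or later data (Python 3.7+).

```python
def fold_report(ch: str) -> dict:
    """Describe how Unicode full case folding treats a single character."""
    folded = ch.casefold()
    return {
        "folded": folded,
        "changes": folded != ch,            # folding maps it to something else
        "expands": len(folded) > len(ch),   # one code point becomes several
    }

# Armenian capital Vew folds to its lowercase form, one-to-one.
armenian_vew = fold_report("\u054e")

# The Armenian ligature Ech-Yiwn folds to two letters, so lengths change.
armenian_ligature = fold_report("\u0587")

# A Georgian Mkhedruli letter is already in folded form...
georgian_mkhedruli = fold_report("\u10d0")

# ...but it is the fold target of the corresponding Mtavruli capital.
georgian_mtavruli = fold_report("\u1c90")
```

So a needle written in Mkhedruli has to match Mtavruli text in the haystack as well, even though everyday Georgian is written without case.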

replies(1): >>46295355 #

3. janc_ ◴[16 Dec 25 22:14 UTC] No.46295355[source]▶

>>46288259 #

Even where transliteration is more or less de facto standardised, it usually depends on the target/host language. Arabic and Russian, for example, are transliterated differently in English, French, German, Dutch, etc.
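A small sketch of that dependence, using the surname Чехов and the conventional English, German and French spellings (Chekhov, Tschechow, Tchekhov). The `TABLES` mapping and the `romanize` function are hypothetical and cover only the five letters in this one word; real romanisation schemes have many more rules and context-dependent cases.

```python
# Hypothetical per-target-language tables, limited to the letters of "Чехов".
TABLES = {
    "en": {"ч": "ch",   "е": "e", "х": "kh", "о": "o", "в": "v"},
    "de": {"ч": "tsch", "е": "e", "х": "ch", "о": "o", "в": "w"},
    "fr": {"ч": "tch",  "е": "e", "х": "kh", "о": "o", "в": "v"},
}

def romanize(text: str, target: str) -> str:
    """Case-fold, then map each letter using the table for the target language."""
    table = TABLES[target]
    return "".join(table.get(ch, ch) for ch in text.casefold())

spellings = {lang: romanize("Чехов", lang) for lang in TABLES}
```

The same source word yields three different search keys, so a query typed by a German speaker would miss text indexed with the English table unless the search normalises for the target language as well.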