
198 points ashvardanian | 3 comments
1. orthoxerox ◴[] No.46288136[source]
Is it possible to extend this to support additional transformation rules like Any-Latin;Latin-ASCII? To make it possible to find "Վարդանյան" in a haystack by searching for "vardanyan"?
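
For illustration, a minimal Python sketch of that idea: romanize both the needle and the haystack, then do a plain substring search. The table `ARMENIAN_TO_LATIN` and the helpers `romanize` and `find_romanized` are hypothetical and cover only the letters in this one example; a real Any-Latin;Latin-ASCII transform (as in ICU) has far more rules.

```python
# Hypothetical, deliberately tiny Armenian-to-Latin table.
# It only covers the letters of the example word, not the full script.
ARMENIAN_TO_LATIN = {
    "ա": "a",
    "դ": "d",
    "յ": "y",
    "ն": "n",
    "ր": "r",
    "վ": "v",
}


def romanize(text: str) -> str:
    """Case-fold the text, then map known Armenian letters to Latin.

    Characters missing from the table pass through unchanged.
    """
    folded = text.casefold()
    return "".join(ARMENIAN_TO_LATIN.get(ch, ch) for ch in folded)


def find_romanized(haystack: str, needle: str) -> bool:
    """Report whether the needle occurs in the haystack after romanizing both."""
    return romanize(needle) in romanize(haystack)


if __name__ == "__main__":
    print(romanize("Վարդանյան"))
    print(find_romanized("Աշոտ Վարդանյան", "vardanyan"))
```

This only answers "is it there?". Reporting the match position in the original haystack would additionally need an offset map from romanized characters back to source characters.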
replies(1): >>46288259 #
2. ashvardanian ◴[] No.46288259[source]
Yes — fuzzy and phonetic matching across languages is part of the roadmap. That space is still poorly standardized, so I wanted to start with something widely understood and well-defined (ICU-style transforms) before layering on more advanced behavior.

Also, as shown in the later tables, the Armenian and Georgian fast paths still have room for improvement. Before introducing higher-level APIs, I need to tighten the existing Armenian kernel and add a dedicated one for Georgian. Georgian is not a true bicameral script, but some of its characters are case-folding targets for older scripts, which currently forces too many fallbacks to the serial path.
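
To make the fast-path/fallback split concrete, here is a rough Python sketch. It is not the library's actual kernel. It assumes a fast path that handles only ASCII non-capitals and the Armenian letters whose capitals (U+0531 to U+0556) sit exactly 0x30 below their lowercase forms, and it treats Python's `str.casefold` as the stand-in "serial path". The names `fold_armenian_fast` and `fold` are hypothetical.

```python
ARMENIAN_UPPER_FIRST = 0x0531  # Ա
ARMENIAN_UPPER_LAST = 0x0556   # Ֆ
ARMENIAN_LOWER_FIRST = 0x0561  # ա
ARMENIAN_LOWER_LAST = 0x0586   # ֆ
ARMENIAN_CASE_OFFSET = 0x30    # lowercase = uppercase + 0x30


def fold_armenian_fast(text: str):
    """Fold with a fixed offset; return None if any character needs the slow path."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ARMENIAN_UPPER_FIRST <= cp <= ARMENIAN_UPPER_LAST:
            out.append(chr(cp + ARMENIAN_CASE_OFFSET))
        elif ARMENIAN_LOWER_FIRST <= cp <= ARMENIAN_LOWER_LAST:
            out.append(ch)
        elif cp < 0x80 and not ch.isupper():
            out.append(ch)  # ASCII non-capitals fold to themselves
        else:
            return None  # e.g. the ligature U+0587, which folds to two characters
    return "".join(out)


def fold(text: str) -> str:
    """Try the fast path first, then fall back to full case folding."""
    fast = fold_armenian_fast(text)
    return fast if fast is not None else text.casefold()
```

In this sketch a single out-of-range character sends the whole string to the fallback, which is the kind of cost a dedicated per-script kernel is meant to avoid.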

replies(1): >>46295355 #
3. janc_ ◴[] No.46295355[source]
Even where transliteration is more or less de facto standardised, it usually depends on the target/host language. For example, Arabic and Russian are transliterated differently in English, French, German, Dutch, etc.