
197 points ashvardanian | 4 comments
andersa ◴[] No.46287769[source]
From a German user's perspective, ICU and your fancy library are incorrect, actually. Mass is not a different casing of Maß; they are different words. Google likely changed this because it didn't do what users wanted.
replies(5): >>46287929 #>>46288240 #>>46288242 #>>46288366 #>>46288467 #
1. pjmlp ◴[] No.46288242[source]
It isn't until it is. How would you write it when ß isn't available on the keyboard?

Which is why we also have to deal with the ue, ae, oe kind of trick, also known as Ersatzschreibweise.

Then German-language users from the de-CH region consider Mass the correct spelling.

Yeah, localization and internationalization are a mess to get right.
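
To make the ß point concrete, here is a minimal Python sketch. `str.casefold()` is the standard library's locale-independent full case folding and it maps ß to ss. The `ersatz` helper and its `ERSATZ` table are hypothetical names made up for this illustration, not part of ICU or of the library under discussion, and the table only covers the common German substitutions.

```python
# Locale-independent folding from the standard library: ß folds to "ss".
def same_ignoring_case(a: str, b: str) -> bool:
    return a.casefold() == b.casefold()

# Hypothetical Ersatzschreibweise table (illustration only).
ERSATZ = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})

def ersatz(text: str) -> str:
    """Rewrite umlauts and ß with their ASCII substitutes."""
    return text.translate(ERSATZ)

if __name__ == "__main__":
    print(same_ignoring_case("Maß", "MASS"))  # folding treats them as equal
    print("Maß".lower() == "mass")            # plain lower() does not
    print(ersatz("Maß"), ersatz("Müller"))
```

Whether "Maß" and "Mass" *should* compare equal is exactly the disagreement upthread; the sketch only shows what the default folding does.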

replies(1): >>46288357 #
2. wat10000 ◴[] No.46288357[source]
Case insensitivity is localized like anything else. I and i are equivalent, right? Not if you’re doing Turkish: then the pairs are I and ı, and İ and i.

In practice you can do pretty well with a universal approach, but it can’t be 100% correct.
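
A rough Python sketch of the difference, assuming the Turkic special case is just "I folds to ı and İ folds to i" applied before the default folding. `fold_default` and `fold_turkic` are hypothetical helper names for this comment, not an API of ICU or any other library.

```python
def fold_default(text: str) -> str:
    """Locale-independent folding via the standard library."""
    return text.casefold()

def fold_turkic(text: str) -> str:
    """Hypothetical Turkish/Azeri-aware folding (illustration only).

    Applies the two Turkic special cases first, then the default folding:
      I (U+0049) -> ı (U+0131)
      İ (U+0130) -> i (U+0069)
    """
    return text.replace("I", "ı").replace("İ", "i").casefold()

if __name__ == "__main__":
    upper, lower = "DİYARBAKIR", "diyarbakır"
    # The default folding does not consider these two spellings equal.
    print(fold_default(upper) == fold_default(lower))
    # The Turkic-aware folding does.
    print(fold_turkic(upper) == fold_turkic(lower))
```

The flip side is that `fold_turkic` no longer treats I and i as the same letter, so it gives the wrong answer for English text. That is the sense in which a single universal rule can't be 100% correct.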

replies(1): >>46288699 #
3. ashvardanian ◴[] No.46288699[source]
This is a very good example! Still, “correct” needs context. You can be 100% “correct with respect to ICU”. It’s definitely not perfect, but it’s the best standard we have. And luckily for me, it also defines the locale-independent rules. I can expand to support locale-specific adjustments in the future, but I’m waiting for adoption to grow before investing even more engineering effort into this feature. Maybe worth opening a GitHub issue for that :)
replies(1): >>46288797 #
4. wat10000 ◴[] No.46288797{3}[source]
Right, nothing wrong with delegating the decision to a bunch of people who have thought long and hard about the best compromise, as long as it’s understood that it’s not perfect.