    196 points ashvardanian | 13 comments
    1. andersa ◴[] No.46287769[source]
    From a German user perspective, ICU and your fancy library are incorrect, actually. Mass is not a different casing of Maß, they are different characters. Google likely changed this because it didn't do what users wanted.
    replies(5): >>46287929 #>>46288240 #>>46288242 #>>46288366 #>>46288467 #
    2. b2ccb2 ◴[] No.46287929[source]
    The confusion likely stems from the relatively new introduction of the capitalized ẞ https://de.wikipedia.org/wiki/Gro%C3%9Fes_%C3%9F

    Maß capitalized is (or used to be) MASS.

    Funnily enough, Mass means one liter of beer (think Oktoberfest).

    replies(2): >>46288188 #>>46288190 #
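The two uppercase forms can be seen side by side with Python's built-in string methods, which follow Unicode's default, locale-independent case mappings. This is only an illustrative sketch; it says nothing about how ICU or the library under discussion is implemented.

```python
# Python's str methods apply Unicode's default (locale-independent) mappings.
# U+00DF is the lowercase sharp s, U+1E9E is the newer capital sharp s.
SHARP_S = "\u00df"          # ß
CAPITAL_SHARP_S = "\u1e9e"  # ẞ

# Traditional uppercasing expands ß to SS.
traditional_upper = ("Ma" + SHARP_S).upper()

# The capital ẞ lowercases back to a single ß ...
capital_lower = ("MA" + CAPITAL_SHARP_S).lower()

# ... but full case folding maps both ß and ẞ to "ss".
capital_folded = ("MA" + CAPITAL_SHARP_S).casefold()

print(traditional_upper)  # MASS
print(capital_lower)      # maß
print(capital_folded)     # mass
```

Note that uppercasing never produces ẞ by default, so the capital letter only shows up if someone typed it explicitly.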
    3. andersa ◴[] No.46288188[source]
    It's strange, because I would expect "maß" as the case insensitive search to match "MASS" in the search text, but "mass" should not match "Maß".
    replies(1): >>46295593 #
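The asymmetric behavior described above can be sketched in Python. The helper names `symmetric_match` and `asymmetric_match` are hypothetical, and the sketch compares whole strings rather than searching inside a text; it is one possible reading of the expectation, not an established algorithm.

```python
def symmetric_match(query: str, text: str) -> bool:
    """Full Unicode case folding: ß, ẞ, ss and SS all collapse to 'ss'."""
    return query.casefold() == text.casefold()


def asymmetric_match(query: str, text: str) -> bool:
    """Accept the text if it equals the query ignoring simple case, or if it
    equals the query's uppercase form (so 'maß' also accepts 'MASS').
    A query written with 'ss' never expands to 'ß'."""
    candidates = {query.lower(), query.upper().lower()}
    return text.lower() in candidates


print(symmetric_match("mass", "Maß"))    # True
print(asymmetric_match("maß", "MASS"))   # True
print(asymmetric_match("mass", "Maß"))   # False
```

The asymmetry comes from expanding only the query: uppercasing "maß" yields "MASS", but no case mapping turns "mass" into "maß".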
    4. looperhacks ◴[] No.46288190[source]
    Both Maß and Mass are valid spellings for a liter of beer ;) Not to be confused with Maß, which just means any measurement, of course.
    5. looperhacks ◴[] No.46288240[source]
    MASS is an allowed casing of Maß, but not the preferred one: https://www.rechtschreibrat.com/DOX/RfdR_Amtliches-Regelwerk... Page 48
    6. pjmlp ◴[] No.46288242[source]
    It isn't, until it is. How would you write it when ß isn't available on the keyboard?

    Which is why we also have to deal with the ue, ae, oe kind of trick, also known as Ersatzschreibweise.

    Then there are German-language users in the de-CH region, who consider Mass the correct spelling.

    Yeah, localization and internationalization are a mess to get right.

    replies(1): >>46288357 #
    7. wat10000 ◴[] No.46288357[source]
    Case insensitivity is localized like anything else. I and i are equivalent, right? Not if you’re doing Turkish, then it’s I and ı, and İ and i.

    In practice you can do pretty well with a universal approach, but it can’t be 100% correct.

    replies(1): >>46288699 #
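The Turkish case can be sketched in Python, whose `str.lower()` and `str.upper()` are locale-independent. The helpers `turkish_lower` and `turkish_upper` are hypothetical names for this illustration; they handle only the four dotted and dotless i letters in precomposed form and are not a full locale-aware implementation.

```python
def turkish_lower(text: str) -> str:
    """Lowercase with Turkish rules: I -> ı (dotless), İ -> i (dotted)."""
    return text.replace("I", "\u0131").replace("\u0130", "i").lower()


def turkish_upper(text: str) -> str:
    """Uppercase with Turkish rules: i -> İ (dotted), ı -> I (dotless)."""
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()


# The default mapping pairs I with i, which is wrong for Turkish text.
print("ISPARTA".lower())            # isparta
print(turkish_lower("ISPARTA"))     # ısparta
print(turkish_upper("istanbul"))    # İSTANBUL

# Default lowercasing of İ yields 'i' plus a combining dot (two code points).
print(len("\u0130".lower()))        # 2
```

Because the same byte sequence needs different answers depending on the language, a universal rule set has to pick one of them, which is the compromise described above.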
    8. Arnt ◴[] No.46288366[source]
    Ah, let's have a long discussion of this.

    Unicode avoids "different" and "same"; https://www.unicode.org/reports/tr15/ uses phrases like "compatibility equivalence" instead.

    The whole thing is complicated, because it actually is complicated in the real world. You can spell the name of Gießen "Giessen" and most Germans consider it correct even if not ideal, but spelling Massachusetts "Maßachusetts" is plainly wrong in German text. The relationship between ß and ss isn't symmetric. Unicode captures that complexity, when you get into the fine details.
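
A small Python sketch may help separate the two mechanisms involved here. It uses the standard `unicodedata` module; the helper `nfkc` is just a hypothetical shorthand. Normalization (the subject of TR15) leaves ß alone, while case folding is the step that maps it to "ss", and that folding is symmetric even though real-world usage is not.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Shorthand for compatibility normalization (NFKC)."""
    return unicodedata.normalize("NFKC", text)


# Compatibility equivalence: the 'fi' ligature U+FB01 normalizes to "fi".
ligature_normalized = nfkc("\ufb01")

# ß has no decomposition, so normalization does not relate it to "ss".
giessen_normalized = nfkc("Gie\u00dfen")

# Case folding is what treats ß and ss alike ...
giessen_folds_equal = "Gie\u00dfen".casefold() == "Giessen".casefold()

# ... and it cannot tell that "Maßachusetts" is simply a misspelling.
wrong_spelling_folds_equal = (
    "Ma\u00dfachusetts".casefold() == "Massachusetts".casefold()
)

print(ligature_normalized)         # fi
print(giessen_normalized)          # Gießen
print(giessen_folds_equal)         # True
print(wrong_spelling_folds_equal)  # True
```

Capturing the asymmetry would need something beyond folding, such as treating ß in the query differently from ß in the text.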

    9. mxmlnkn ◴[] No.46288467[source]
    I never understood why the recommended replacement for ß is ss. It is a ligature of sz (similar to & being a ligature of et) and is even pronounced ess-zet. The only logical replacement would have been sz, and it would have avoided the clash of Masse (mass) and Maße (measurements). Then again, it only affects whether the vowel before it is pronounced short or long, and there are better ways to encode that in written language in the first place.
    replies(1): >>46293516 #
    10. ashvardanian ◴[] No.46288699{3}[source]
    This is a very good example! Still, “correct” needs context. You can be 100% “correct with respect to ICU”. It’s definitely not perfect, but it’s the best standard we have. And luckily for me, it also defines the locale-independent rules. I can expand to support locale-specific adjustments in the future, but I’m waiting for adoption to grow before investing even more engineering effort into this feature. Maybe worth opening a GitHub issue for that :)
    replies(1): >>46288797 #
    11. wat10000 ◴[] No.46288797{4}[source]
    Right, nothing wrong with delegating the decision to a bunch of people who have thought long and hard about the best compromise, as long as it’s understood that it’s not perfect.
    12. adrian_b ◴[] No.46293516[source]
    I agree that writing it "sz" might have created fewer problems.

    However, it is likely that it has never been pronounced "sz", but always "ss" and the habit of writing "sz" for the double consonant may have had the same reason as the writing of "ck" instead of the double "kk".

    13. janc_ ◴[] No.46295593{3}[source]
    I think all of those should be "tentative matches" for each other.