

    238 points GalaxySnail | 11 comments
    1. Macha ◴[] No.40168944[source]
    > And many other popular programming languages, including Node.js, Go, Rust, and Java uses UTF-8 by default.

    Oh, I missed Java moving from UTF-16 to UTF-8.

    replies(3): >>40169012 #>>40169499 #>>40169530 #
    2. PurpleRamen ◴[] No.40169012[source]
    Seems it happened two years ago, with Java 18.
    3. rootext ◴[] No.40169499[source]
    It seems you are mixing two things: the internal string representation and the read/write encoding. Java has never used UTF-16 as the default for the second.
    replies(2): >>40170808 #>>40176650 #
    4. hashmash ◴[] No.40169530[source]
    With Java, the default encoding when converting bytes to strings was originally platform-dependent, but now it's UTF-8. UTF-16 and Latin-1 encodings are (still*) used internally by the String class, and the JVM uses a modified UTF-8 encoding like it always has.

    * The String class originally only used UTF-16 encoding, but since Java 9 it also uses a single-byte-per-character Latin-1 encoding when possible.
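
    To make the internal-versus-external distinction concrete, here is a rough Python sketch of the "Latin-1 when possible, otherwise UTF-16" idea. It is an illustration only, not how the JVM implements it: the function name `compact_encode` is made up, and little-endian byte order is an arbitrary choice for the example.

```python
def compact_encode(text: str) -> tuple[str, bytes]:
    """Hypothetical helper: use one byte per character when every code
    point fits in Latin-1 (U+0000..U+00FF), otherwise fall back to UTF-16."""
    if all(ord(ch) <= 0xFF for ch in text):
        return "latin-1", text.encode("latin-1")
    # Byte order is an assumption made for this sketch.
    return "utf-16-le", text.encode("utf-16-le")


# Internal storage choice for two strings.
latin_name, latin_bytes = compact_encode("caf\u00e9")          # all chars <= U+00FF
wide_name, wide_bytes = compact_encode("caf\u00e9 \u2603")     # snowman forces UTF-16

# The external (I/O) encoding is a separate decision: the same text as UTF-8.
utf8_bytes = "caf\u00e9".encode("utf-8")
```

    The point is that the choice made in `compact_encode` never leaks into what gets written to a file; that is governed by the I/O encoding, which is what the default-charset discussion is about.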

    5. cryptonector ◴[] No.40170808[source]
    Not even on Windows?
    replies(1): >>40171134 #
    6. layer8 ◴[] No.40171134{3}[source]
    No, file I/O on Windows in general doesn’t use UTF-16, but the regional code page, or nowadays UTF-8 if the application decides so.
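
    A small Python sketch of why the code-page-versus-UTF-8 default matters: the same bytes read back differently depending on which encoding the reader assumes. Using cp1252 as the "regional code page" is an assumption for the example; the actual code page depends on the system locale.

```python
original = "caf\u00e9"

# What a UTF-8 writer puts on disk.
utf8_bytes = original.encode("utf-8")

# What a reader that assumes the cp1252 code page sees (mojibake).
misread = utf8_bytes.decode("cp1252")

# What a reader that assumes UTF-8 sees.
roundtrip = utf8_bytes.decode("utf-8")
```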
    replies(1): >>40174570 #
    7. int_19h ◴[] No.40174570{4}[source]
    Depends on what you define as "file I/O", though. NTFS filenames are UTF-16 (or rather UCS-2). As far as file contents go, there isn't really a standard, but FWIW for a long time most Windows apps (Notepad being the canonical example), when asked to save anything as "Unicode", would save it as UTF-16.
    replies(1): >>40174958 #
    8. layer8 ◴[] No.40174958{5}[source]
    I'm talking about the default behavior of Microsoft's C runtime (MSVCRT.DLL) that everyone is/was using.

    UTF-16 text files are rather rare, as is using Notepad's UTF-16 options. The only semi-common use I know of is *.reg files saved from regedit. One issue with UTF-16 is that it has two different serializations (BE and LE), and hence generally requires a BOM to disambiguate.
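
    The two serializations and the role of the BOM can be seen with Python's standard codecs. This is just a sketch using a two-character string:

```python
import codecs

text = "hi"

# The same text in the two UTF-16 byte orders.
little = text.encode("utf-16-le")
big = text.encode("utf-16-be")

# Without a BOM, a reader that guesses the wrong byte order gets different
# (but still valid) characters instead of an error.
misread = little.decode("utf-16-be")

# With a BOM in front, the generic "utf-16" codec detects the byte order
# and strips the BOM from the result.
with_bom = codecs.BOM_UTF16_LE + little
decoded = with_bom.decode("utf-16")
```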

    replies(2): >>40176676 #>>40178858 #
    9. Dwedit ◴[] No.40176650[source]
    Or possibly confusing it with JavaScript, which treats strings as sequences of UTF-16 code units?
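
    For illustration, here is what counting in UTF-16 units means, sketched in Python (which counts code points instead). The helper name `utf16_length` is made up for this example; it mimics what a JavaScript `length` would report.

```python
def utf16_length(text: str) -> int:
    """Hypothetical helper: number of 16-bit UTF-16 units needed for text."""
    return len(text.encode("utf-16-le")) // 2


ascii_text = "abc"
emoji = "\U0001F600"  # outside the Basic Multilingual Plane

ascii_units = utf16_length(ascii_text)   # one unit per character
emoji_units = utf16_length(emoji)        # a surrogate pair: two units
emoji_code_points = len(emoji)           # Python counts one code point
```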
    10. TheCycoONE ◴[] No.40176676{6}[source]
    PowerShell used to output UTF-16 by default on Windows. It might still, but it's been a while since I needed to try.
    11. int_19h ◴[] No.40178858{6}[source]
    Then you're talking about the C stdlib, which, yeah, is meant to use the locale-specific encoding on any platform, so it's not really a Windows thing specifically. But even then someone could use the CRT but call _wfopen() rather than fopen() etc - this was actually not uncommon for Windows software precisely because it let you handle Unicode without having to work with Win32 API directly.

    Microsoft's implementation of fopen() also supports "ccs=..." to open Unicode text files in Unicode, and interestingly "ccs=UNICODE" will get you UTF-16LE, not UTF-8 (but you can do "ccs=UTF-8"). .NET also has this weird naming quirk where Encoding.Unicode is UTF-16, although there at least UTF-8 is the default for all text I/O classes like StreamReader if you don't specify the encoding. Still, many people didn't know better, and so some early .NET software would use UTF-16 for text I/O for no reason other than its developers believing that Encoding.Unicode is obviously what they are supposed to be using to "support Unicode", and so explicitly passing it everywhere.
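
    Since text files on Windows may turn up as UTF-16 or UTF-8 depending on what the writing program thought "Unicode" meant, readers often sniff the BOM before decoding. A minimal Python sketch of that idea follows; the function name `sniff_encoding` and the UTF-8 fallback are assumptions for the example, it ignores UTF-32, and it is not a description of what the CRT or .NET do internally.

```python
import codecs


def sniff_encoding(data: bytes, fallback: str = "utf-8") -> str:
    """Hypothetical helper: pick a Python codec name from a leading BOM."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"   # decodes UTF-8 and drops the BOM
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"      # reads the BOM to choose the byte order
    return fallback          # no BOM: an assumption, not a detection


utf16_file = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
utf8_bom_file = codecs.BOM_UTF8 + "hi".encode("utf-8")
plain_file = "hi".encode("utf-8")

decoded = [
    blob.decode(sniff_encoding(blob))
    for blob in (utf16_file, utf8_bom_file, plain_file)
]
```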