
238 points GalaxySnail | 6 comments
Macha ◴[] No.40168944[source]
> And many other popular programming languages, including Node.js, Go, Rust, and Java uses UTF-8 by default.

Oh, I missed Java moving from UTF-16 to UTF-8.

replies(3): >>40169012 #>>40169499 #>>40169530 #
rootext ◴[] No.40169499[source]
It seems you are mixing up two things: the internal string representation and the read/write encoding. Java has never used UTF-16 as the default for the latter.
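A sketch of the distinction, in Python purely for illustration: the internal representation of a string is an implementation detail, and an encoding only comes into play when bytes cross an I/O boundary.

```python
# The same string, serialized under two different encodings.
s = "café"

utf8_bytes = s.encode("utf-8")       # 5 bytes: 'é' takes two bytes in UTF-8
utf16_bytes = s.encode("utf-16-le")  # 8 bytes: two bytes per UTF-16 code unit

print(len(utf8_bytes), len(utf16_bytes))  # 5 8
# Both byte sequences round-trip to the same string:
assert utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16-le") == s
```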
replies(2): >>40170808 #>>40176650 #
1. cryptonector ◴[] No.40170808[source]
Not even on Windows?
replies(1): >>40171134 #
2. layer8 ◴[] No.40171134[source]
No, file I/O on Windows in general doesn’t use UTF-16, but the regional (ANSI) code page, or nowadays UTF-8 if the application opts in.
replies(1): >>40174570 #
3. int_19h ◴[] No.40174570[source]
Depends on what you define as "file I/O", though. NTFS filenames are UTF-16 (or rather UCS-2). As for file contents, there isn't really a standard, but FWIW, for a long time most Windows apps (Notepad being the canonical example) would save anything as UTF-16 when asked to save it as "Unicode".
replies(1): >>40174958 #
4. layer8 ◴[] No.40174958{3}[source]
I'm talking about the default behavior of Microsoft's C runtime (MSVCRT.DLL) that everyone is/was using.

UTF-16 text files are rather rare, as is using Notepad's UTF-16 options. The only semi-common use I know of is *.reg files saved from regedit. One issue with UTF-16 is that it has two different serializations (BE and LE), and hence generally requires a BOM to disambiguate.
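A quick sketch of the BOM point (in Python, for illustration): the same string has two byte-order serializations, and the BOM (U+FEFF) is what lets a decoder tell them apart.

```python
s = "Ω"  # U+03A9

le = s.encode("utf-16-le")  # b'\xa9\x03'
be = s.encode("utf-16-be")  # b'\x03\xa9'
assert le != be             # same text, two incompatible serializations

# The plain "utf-16" codec prepends a BOM in the platform's native order:
with_bom = s.encode("utf-16")
assert with_bom[:2] in (b"\xff\xfe", b"\xfe\xff")
# With the BOM present, the decoder recovers the string regardless of byte order:
assert with_bom.decode("utf-16") == s
```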

replies(2): >>40176676 #>>40178858 #
5. TheCycoONE ◴[] No.40176676{4}[source]
PowerShell used to output UTF-16 by default on Windows. It might still, but it's been a while since I needed to check.
6. int_19h ◴[] No.40178858{4}[source]
Then you're talking about the C stdlib, which, yeah, is meant to use the locale-specific encoding on any platform, so it's not really a Windows thing specifically. But even then, someone could use the CRT but call _wfopen() rather than fopen(), etc.; this was actually not uncommon for Windows software, precisely because it let you handle Unicode without having to work with the Win32 API directly.
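The "locale-specific encoding" behavior isn't unique to the C stdlib; as a portable illustration, Python's text I/O historically falls back to the same locale-dependent default when no encoding is given (unless UTF-8 mode is enabled):

```python
import locale

# What open(path) uses when you don't pass encoding= (pre-UTF-8-mode):
default = locale.getpreferredencoding(False)
print(default)  # e.g. "UTF-8" on most Linux systems, "cp1252" on many Windows setups
```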

Microsoft's implementation of fopen() also supports "ccs=..." to open Unicode text files in Unicode, and interestingly "ccs=UNICODE" will get you UTF-16LE, not UTF-8 (but you can do "ccs=UTF-8"). .NET also has this weird naming quirk where Encoding.Unicode is UTF-16, although there at least UTF-8 is the default for all text I/O classes like StreamReader if you don't specify the encoding. Still, many people didn't know better, and so some early .NET software would use UTF-16 for text I/O for no reason other than its developers believing that Encoding.Unicode is obviously what they are supposed to be using to "support Unicode", and so explicitly passing it everywhere.
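The behavior is easy to observe in a portable way; this Python sketch (illustration only, not the MS CRT) shows what a "UTF-16 text file" like the ones ccs=UNICODE produces actually looks like on disk: a BOM followed by two-byte code units.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w", encoding="utf-16") as f:
    f.write("A")

with open(path, "rb") as f:
    raw = f.read()

assert raw[:2] in (b"\xff\xfe", b"\xfe\xff")  # 2-byte BOM
assert len(raw) == 4                          # BOM + 2-byte "A"
```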