PEP 686 – Make UTF-8 mode default

1. Macha ◴[26 Apr 24 13:02 UTC] No.40168944[source]▶

>>40168242 (OP) #

> And many other popular programming languages, including Node.js, Go, Rust, and Java uses UTF-8 by default.

Oh, I missed Java moving from UTF-16 to UTF-8.

replies(3): >>40169012 #>>40169499 #>>40169530 #

2. PurpleRamen ◴[26 Apr 24 13:08 UTC] No.40169012[source]▶

>>40168944 (TP) #

Seems it happened two years ago, with Java 18.

3. rootext ◴[26 Apr 24 14:00 UTC] No.40169499[source]▶

>>40168944 (TP) #

It seems you are mixing two things: inner string representation and read/write encoding. Java has never used UTF-16 as default for the second.
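To make the distinction concrete, here is a minimal sketch in Python (chosen only for brevity; the same idea applies to Java). The helper name `write_text` and the sample string are made up for illustration. The in-memory string never changes; only the bytes produced at the I/O boundary depend on the chosen encoding.

```python
import io

def write_text(text: str, encoding: str) -> bytes:
    """Write text through a text-mode wrapper and return the raw bytes produced."""
    raw = io.BytesIO()
    wrapper = io.TextIOWrapper(raw, encoding=encoding, newline="")
    wrapper.write(text)
    wrapper.flush()
    data = raw.getvalue()
    wrapper.detach()  # release the buffer without closing it
    return data

text = "naïve"  # one in-memory string, five characters

utf8_bytes = write_text(text, "utf-8")        # b'na\xc3\xafve'
latin1_bytes = write_text(text, "latin-1")    # b'na\xefve'
utf16_bytes = write_text(text, "utf-16-le")   # two bytes per character here

for name, data in [("utf-8", utf8_bytes), ("latin-1", latin1_bytes), ("utf-16-le", utf16_bytes)]:
    print(name, len(data), data)
```

Whatever the runtime uses internally to store `text` is invisible here; the `encoding` argument alone decides what ends up in the file.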

replies(2): >>40170808 #>>40176650 #

4. hashmash ◴[26 Apr 24 14:01 UTC] No.40169530[source]▶

>>40168944 (TP) #

With Java, the default encoding when converting bytes to strings was originally platform-dependent, but now it's UTF-8. UTF-16 and latin-1 encodings are (still*) used internally by the String class, and the JVM uses a modified UTF-8 encoding like it always has.

* The String class originally only used UTF-16 encoding, but since Java 9 it also uses a single-byte-per-character latin-1 encoding when possible.
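A rough Python sketch of both ideas, not the JVM's actual code. The function names `compact_string_bytes` and `modified_utf8` are hypothetical. The first mimics the "latin-1 when possible, else UTF-16" storage choice; the second follows the two well-known differences of modified UTF-8 from standard UTF-8: U+0000 is written as the two bytes C0 80, and characters outside the BMP are written as two separately encoded surrogates (six bytes instead of four).

```python
def compact_string_bytes(text: str) -> int:
    """Storage size if text is kept as latin-1 when possible, else UTF-16."""
    try:
        return len(text.encode("latin-1"))
    except UnicodeEncodeError:
        return len(text.encode("utf-16-le"))

def modified_utf8(text: str) -> bytes:
    """Encode text in the style of modified UTF-8 (illustrative sketch)."""
    out = bytearray()
    utf16 = text.encode("utf-16-be", "surrogatepass")
    # Walk the UTF-16 code units two bytes at a time.
    for i in range(0, len(utf16), 2):
        unit = (utf16[i] << 8) | utf16[i + 1]
        if 0x01 <= unit <= 0x7F:
            out.append(unit)
        elif unit <= 0x7FF:
            # Includes U+0000, which becomes the two-byte form C0 80.
            out.append(0xC0 | (unit >> 6))
            out.append(0x80 | (unit & 0x3F))
        else:
            # Surrogates land here too, so each half of a pair takes 3 bytes.
            out.append(0xE0 | (unit >> 12))
            out.append(0x80 | ((unit >> 6) & 0x3F))
            out.append(0x80 | (unit & 0x3F))
    return bytes(out)

print(compact_string_bytes("héllo"), compact_string_bytes("h€llo"))
print(modified_utf8("\x00"), modified_utf8("\U0001F600"))
```

For plain BMP text without NUL characters, the output matches ordinary UTF-8.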

5. cryptonector ◴[26 Apr 24 15:39 UTC] No.40170808[source]▶

>>40169499 #

Not even on Windows?

replies(1): >>40171134 #

6. layer8 ◴[26 Apr 24 16:16 UTC] No.40171134{3}[source]▶

>>40170808 #

No, file I/O on Windows in general doesn’t use UTF-16, but the regional code page, or nowadays UTF-8 if the application decides so.
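A small Python sketch of why the code page matters. It assumes cp1252 as the example regional code page (yours may differ). The same bytes on disk read back as different text depending on which encoding the reader assumes.

```python
import locale

data = "é".encode("utf-8")           # the two bytes C3 A9

as_utf8 = data.decode("utf-8")       # read back with the encoding used to write
as_cp1252 = data.decode("cp1252")    # read back as if it were a cp1252 file

print(repr(as_utf8), repr(as_cp1252))

# What this process would use by default for text I/O; the value varies by
# platform and configuration, so no expected output is shown.
print(locale.getpreferredencoding(False))
```

Passing an explicit encoding when opening text files sidesteps the question of which default is in effect.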

replies(1): >>40174570 #

7. int_19h ◴[26 Apr 24 21:34 UTC] No.40174570{4}[source]▶

>>40171134 #

Depends on what you define as "file I/O", though. NTFS filenames are UTF-16 (or rather UCS-2). As for file contents, there isn't really a standard, but FWIW for a long time most Windows apps (Notepad being the canonical example), when asked to save anything as "Unicode", would save it as UTF-16.

replies(1): >>40174958 #

8. layer8 ◴[26 Apr 24 22:21 UTC] No.40174958{5}[source]▶

>>40174570 #

I'm talking about the default behavior of Microsoft's C runtime (MSVCRT.DLL) that everyone is/was using.

UTF-16 text files are rather rare, as is using Notepad's UTF-16 options. The only semi-common use I know of is *.reg files saved from regedit. One issue with UTF-16 is that it has two different serializations (BE and LE), and hence generally requires a BOM to disambiguate.
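The BE/LE ambiguity is easy to see in a few lines of Python. The helper `decode_utf16` is a hypothetical name, and falling back to little-endian when no BOM is present is an assumption made for this sketch, not a rule.

```python
import codecs

text = "hi"
le = text.encode("utf-16-le")   # b'h\x00i\x00'
be = text.encode("utf-16-be")   # b'\x00h\x00i'

def decode_utf16(data: bytes, default: str = "utf-16-le") -> str:
    """Pick the byte order from a leading BOM, else fall back to a default."""
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return data[2:].decode("utf-16-be")
    return data.decode(default)

print(decode_utf16(codecs.BOM_UTF16_BE + be))  # BOM present: decodes correctly
print(repr(decode_utf16(be)))                  # no BOM, wrong guess: garbage text
```

Without the BOM, the big-endian bytes still decode without error under the little-endian guess; they just come out as the wrong characters.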

replies(2): >>40176676 #>>40178858 #

9. Dwedit ◴[27 Apr 24 02:36 UTC] No.40176650[source]▶

>>40169499 #

Or possibly confusing it with JavaScript, which treats strings as sequences of UTF-16 code units?
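A quick Python illustration of what counting in UTF-16 code units means. The helper name `utf16_length` is made up; it computes the number a JavaScript `length` property would report for the same text.

```python
def utf16_length(text: str) -> int:
    """Number of UTF-16 code units needed to represent text."""
    return len(text.encode("utf-16-le")) // 2

emoji = "\U0001F600"  # a single code point outside the BMP

print(len(emoji))           # Python counts code points
print(utf16_length(emoji))  # UTF-16 needs a surrogate pair for it
print(utf16_length("héllo"))
```

Text within the BMP gives the same count either way; the two only diverge for supplementary characters.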

10. TheCycoONE ◴[27 Apr 24 02:43 UTC] No.40176676{6}[source]▶

>>40174958 #

PowerShell used to output UTF-16 by default on Windows. It might still, but it's been a while since I needed to try.

11. int_19h ◴[27 Apr 24 10:27 UTC] No.40178858{6}[source]▶

>>40174958 #

Then you're talking about the C stdlib, which, yeah, is meant to use the locale-specific encoding on any platform, so it's not really a Windows thing specifically. But even then someone could use the CRT but call _wfopen() rather than fopen() etc. - this was actually not uncommon for Windows software precisely because it let you handle Unicode without having to work with the Win32 API directly.

Microsoft's implementation of fopen() also supports "ccs=..." to open Unicode text files in Unicode, and interestingly "ccs=UNICODE" will get you UTF-16LE, not UTF-8 (but you can do "ccs=UTF-8"). .NET also has this weird naming quirk where Encoding.Unicode is UTF-16, although there at least UTF-8 is the default for all text I/O classes like StreamReader if you don't specify the encoding. Still, many people didn't know better, and so some early .NET software would use UTF-16 for text I/O for no reason other than its developers believing that Encoding.Unicode is obviously what they are supposed to be using to "support Unicode", and so explicitly passing it everywhere.