nerdponx
Default text file encoding being platform-dependent always drove me nuts. This is a welcome change.

I also appreciate that they did not attempt to tackle filesystem encoding here, which is a separate issue that drives me nuts in its own right.
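
To make the platform dependence concrete, here is a minimal sketch (not from the thread; assumes CPython 3.7 or later, file name made up) of where the implicit default comes from and how to opt into UTF-8 explicitly:

    import locale
    import sys

    # Until UTF-8 becomes the default, open() with no encoding= argument
    # falls back to the locale's preferred encoding: typically cp1252 on
    # Western Windows setups, usually UTF-8 on modern Linux.
    print(locale.getpreferredencoding(False))

    # PEP 540's opt-in UTF-8 mode (python -X utf8, or PYTHONUTF8=1)
    # forces UTF-8 regardless of locale; sys.flags reports whether it's on.
    print(sys.flags.utf8_mode)

    # The portable habit is to always pass encoding= explicitly.
    with open("notes.txt", "w", encoding="utf-8") as f:
        f.write("explicit is better than implicit\n")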

layer8
Historically it made sense: most software was local-only, and text files were expected to be in the local encoding. Not just platform-dependent, but dependent on the user's preferred locale. This is also how the C standard library operates.

For example, on Unix/Linux, iso-8859-1 was common for Western European languages, and in Europe it became common to switch to iso-8859-15 after the euro was introduced, because it contains the € symbol. UTF-8 only began to work flawlessly in the later aughts; Debian switched to it as the default with the Etch release in 2007.
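
A quick illustration of that switch (my example, not the commenter's): the euro sign encodes in iso-8859-15 but not in iso-8859-1.

    # U+20AC (€) exists in iso-8859-15 (at byte 0xA4) but not in iso-8859-1,
    # which is one reason European locales moved to the -15 variant.
    print("€".encode("iso-8859-15"))      # b'\xa4'

    try:
        "€".encode("iso-8859-1")
    except UnicodeEncodeError as exc:
        print("latin-1 has no euro sign:", exc)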

da_chicken
It's still not that uncommon to see programs on Linux that don't understand multibyte UTF-8 sequences.

It's also true that essentially nothing on Linux supports the UTF-8 byte order mark. Yes, it's meaningless for UTF-8, but it is explicitly allowed by the specification. Since Microsoft software tends to include a BOM in any flavor of Unicode, Linux tools often choke on valid UTF-8 text files produced on Windows.
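
For anyone who hasn't hit this: a small sketch (assumptions mine) of what the BOM looks like on disk and how a BOM-unaware reader trips over it.

    import codecs

    # A UTF-8 file saved by Notepad and friends typically starts with the
    # three-byte BOM EF BB BF.
    data = codecs.BOM_UTF8 + "hello\n".encode("utf-8")
    print(data[:3])                     # b'\xef\xbb\xbf'

    # A BOM-unaware reader decodes it as a leading U+FEFF, so naive checks
    # on the first line quietly fail.
    text = data.decode("utf-8")
    print(text.startswith("hello"))     # False
    print(repr(text[0]))                # '\ufeff'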

nerdponx
Interestingly, Python is one of those programs.

You need to use the special "utf-8-sig" encoding for that, which is not prominently advertised in the documentation (it is mentioned deep inside the "Unicode HOWTO").

I never understood why ignoring this special character requires a totally separate encoding.
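
As a concrete sketch of the behavior being described (file name is made up), reading the same BOM-prefixed file with the two codecs:

    # Simulate a file written by a BOM-emitting tool: "utf-8-sig" prepends
    # the BOM when encoding.
    with open("from_windows.txt", "w", encoding="utf-8-sig") as f:
        f.write("first line\n")

    # Plain "utf-8" leaks the BOM into the data as U+FEFF...
    with open("from_windows.txt", encoding="utf-8") as f:
        print(repr(f.readline()))       # '\ufefffirst line\n'

    # ...while "utf-8-sig" strips it, yielding the text you expected.
    with open("from_windows.txt", encoding="utf-8-sig") as f:
        print(repr(f.readline()))       # 'first line\n'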

duskwuff
> I never understood why ignoring this special character requires a totally separate encoding.

Because the BOM is byte-for-byte indistinguishable from the "real" UTF-8 encoding of U+FEFF (zero-width no-break space). If the plain UTF-8 decoder trimmed that codepoint, strings like "\uFEFF" couldn't be safely round-tripped; if the encoder added it, the output would be invalid in many contexts.
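
To spell out the round-trip problem (my sketch, not duskwuff's):

    # A string that legitimately begins with U+FEFF (zero-width no-break space).
    s = "\ufeffdata"

    # Plain utf-8 round-trips it faithfully.
    assert s.encode("utf-8").decode("utf-8") == s

    # utf-8-sig can't tell that leading U+FEFF from a BOM, so decoding
    # strips it and the round trip loses a character...
    assert s.encode("utf-8").decode("utf-8-sig") == "data"

    # ...and the utf-8-sig encoder prepends a BOM even where none is
    # wanted, e.g. when fragments get concatenated downstream.
    print("data".encode("utf-8-sig"))   # b'\xef\xbb\xbfdata'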