238 points GalaxySnail | 24 comments

nerdponx ◴[] No.40169967[source]
Default text file encoding being platform-dependent always drove me nuts. This is a welcome change.

I also appreciate that they did not attempt to tackle filesystem encoding here, which is a separate issue that drives me nuts, but separately.
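
For concreteness, a quick Python sketch (file name and contents are mine) of how the platform default leaks into open() when UTF-8 mode is off:

    import locale

    # With UTF-8 mode off, open() without an explicit encoding falls
    # back to the locale's preferred encoding, so the same script can
    # read the same file differently on different machines.
    print(locale.getpreferredencoding(False))  # e.g. 'cp1252' on Windows, 'UTF-8' on most Linux

    # Passing encoding= explicitly sidesteps the platform dependence.
    with open("example.txt", "w", encoding="utf-8") as f:
        f.write("explicit is better than implicit")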

replies(4): >>40171063 #>>40171211 #>>40172228 #>>40173633 #
1. layer8 ◴[] No.40171063[source]
Historically it made sense, when most software was local-only, and text files were expected to be in the local encoding. Not just platform-dependent, but user’s preferred locale-dependent. This is also how the C standard library operates.

For example, on Unix/Linux, using iso-8859-1 was common when using Western-European languages, and in Europe it became common to switch to iso-8859-15 after the Euro was introduced, because it contained the € symbol. UTF-8 only began to work flawlessly in the later aughts. Debian switched to it as the default with the Etch release in 2010.
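
The euro sign makes the difference concrete; a quick Python illustration:

    # iso-8859-15 replaced a few iso-8859-1 codepoints, most notably
    # adding the euro sign at 0xA4.
    print("5€".encode("iso-8859-15"))   # b'5\xa4'
    try:
        "5€".encode("iso-8859-1")       # pre-euro charset: no € at all
    except UnicodeEncodeError as e:
        print(e)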

replies(4): >>40172024 #>>40172052 #>>40172183 #>>40177841 #
2. anthk ◴[] No.40172024[source]
Emacs was amazing for that; builtin text encoders/decoders/transcoders for everything.
replies(1): >>40172110 #
3. Dylan16807 ◴[] No.40172052[source]
> Not just platform-dependent, but user’s preferred locale-dependent.

Historically it made sense to be locale-dependent, but even then it was annoying to be platform-dependent.

One is not a subset of the other.

replies(2): >>40172171 #>>40172645 #
4. hollerith ◴[] No.40172110[source]
My experience was that brittleness around text encoding in Emacs (versions 22 and 23 or so) was a constant source of annoyance for years.

IIRC, the main way this brittleness bit me was that every time a buffer containing a non-ASCII character was saved, Emacs would engage me in a conversation (which I found tedious and distracting) about what coding system I would like to use to save the file. I never found a sane way to configure it to avoid such conversations even after spending hours learning about how Emacs does coding systems; I simply had to wait (a year or 3) for a new version of Emacs in which the code for saving buffers worked better.

I think some people like engaging in these conversations with their computers, even though the conversations are very boring and repetitive, and that such conversation-likers are numerous among Emacs users, or at least Emacs maintainers.

replies(1): >>40179122 #
5. hermitdev ◴[] No.40172171[source]
> platform-dependent.

It's 2024 and we still can't all agree on line endings. Mac vs Win vs Unix...
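
Most runtimes paper over it at the I/O layer these days; Python's universal newlines, for instance:

    # In text mode Python translates '\r\n', '\r', and '\n' all to
    # '\n' on read; newline='' turns the translation off.
    with open("endings.txt", "wb") as f:
        f.write(b"unix\nwindows\r\nold mac\r")

    print(open("endings.txt").read().splitlines())       # ['unix', 'windows', 'old mac']
    print(repr(open("endings.txt", newline="").read()))  # raw endings intact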

replies(2): >>40172265 #>>40172368 #
6. da_chicken ◴[] No.40172183[source]
It's still not that uncommon to see programs on Linux not understanding multibyte UTF-8.

It's also true that essentially nothing on Linux supports the UTF-8 byte order mark. Yes, it's meaningless for UTF-8, but it is explicitly allowed in the specifications. Since Microsoft tends to always include a BOM in any flavor of Unicode, this means Linux often chokes on valid UTF-8 text files from Windows systems.
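
Easy to reproduce in Python (file name and contents are made up):

    # Files saved as "UTF-8 with BOM" start with the bytes EF BB BF.
    with open("from_windows.txt", "wb") as f:
        f.write(b"\xef\xbb\xbfkey=value")

    line = open("from_windows.txt", encoding="utf-8").read()
    print(repr(line))              # '\ufeffkey=value'
    print(line.startswith("key"))  # False -- the stray U+FEFF breaks naive parsers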

replies(3): >>40172423 #>>40172441 #>>40176569 #
7. Y-bar ◴[] No.40172265{3}[source]
Mac OS and Unix agreed about twenty years ago to use the same line ending: https://superuser.com/a/439443
replies(1): >>40172390 #
8. Longhanks ◴[] No.40172368{3}[source]
It's 2024; everything but Windows has been UTF-8 with \n line endings for twenty years.
replies(1): >>40174531 #
9. Dylan16807 ◴[] No.40172390{4}[source]
By which time XP was already mid-release, so it was too late to get Windows on board.

It's too bad; with a bit more planning and an earlier realization that Unicode cannot in fact fit into 16 bits, Windows might have used UTF-8 internally.

replies(2): >>40174470 #>>40196503 #
10. nerdponx ◴[] No.40172423[source]
Interestingly, Python is one of those programs.

You need to use the special "utf-8-sig" encoding for that, which is not prominently advertised anywhere in the documentation (but it is stated deep inside the "Unicode HOWTO").

I never understood why ignoring this special character requires a totally separate encoding.
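
For reference, the difference in one snippet:

    raw = b"\xef\xbb\xbfdata"
    print(raw.decode("utf-8"))          # '\ufeffdata' -- BOM kept as a codepoint
    print(raw.decode("utf-8-sig"))      # 'data' -- BOM stripped
    print(b"data".decode("utf-8-sig"))  # 'data' -- a missing BOM is fine too
    print("data".encode("utf-8-sig"))   # b'\xef\xbb\xbfdata' -- BOM added on encode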

replies(1): >>40173468 #
11. tialaramex ◴[] No.40172441[source]
The BOM cases are at best a consequence of trying to use poor quality Windows software to do stuff it's not suited to. It's true that, as Unicode text, a UTF-8 string may validly carry a BOM, but that being true of the text itself doesn't magically change file formats which long pre-date it.

Most obviously shebang (the practice of writing #!/path/to/interpreter at the start of a script) is specifically defined on those first two bytes. It doesn't make any sense to have a BOM here because that's not the format, and inventing a new rule later which says you can do it doesn't make it true, any more than in 2024 the German government can decide Germany didn't invade Poland in 1939; that's not how Time's Arrow works.
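
Concretely (a sketch; the script contents are made up):

    bom_script = b"\xef\xbb\xbf#!/bin/sh\necho hi\n"

    # The shebang check is a literal comparison of the first two
    # bytes; with a BOM they are 0xEF 0xBB rather than '#!'.
    print(bom_script[:2] == b"#!")   # False -- the kernel runs no interpreter
    print(bom_script[3:5] == b"#!")  # True, but nothing skips the BOM for you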

replies(1): >>40175407 #
12. layer8 ◴[] No.40172645[source]
Not sure what you mean by that with regard to encodings. The C APIs were explicitly designed to abstract over that, and together with libraries like iconv it was rather straightforward. You only needed to be aware that there is a difference between internal and external encoding, and maybe decide between char and wchar_t.
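
For what it's worth, the same split is visible in Python: str is the internal form, bytes plus a chosen encoding the external one.

    internal = "naïve"                    # internal: abstract codepoints

    # External: concrete bytes, encoding chosen at the I/O boundary.
    print(internal.encode("utf-8"))       # b'na\xc3\xafve'
    print(internal.encode("iso-8859-1"))  # b'na\xefve'
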
replies(1): >>40175699 #
13. duskwuff ◴[] No.40173468{3}[source]
> I never understood why ignoring this special character requires a totally separate encoding.

Because the BOM is indistinguishable from the "real" UTF-8 encoding of U+FEFF (zero-width no-break space). Trimming that codepoint in the UTF-8 decoder means that some strings like "\uFEFF" can't be safely round-tripped; adding it in the encoder is invalid in many contexts.
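
In concrete terms:

    s = "\ufeffnote"  # legitimately starts with U+FEFF

    # utf-8 round-trips it; utf-8-sig cannot tell the leading
    # codepoint from a BOM and silently drops it.
    print(s.encode("utf-8").decode("utf-8") == s)  # True
    print(s.encode("utf-8").decode("utf-8-sig"))   # 'note'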

14. jmb99 ◴[] No.40174470{5}[source]
Unless I’m mistaken, Rhapsody (released 1997) used LF, not CR. At that point it was pretty clear Mac was moving towards Unix through NeXTSTEP, meaning every OS except windows would be using LF. Microsoft would’ve had around 6 years before the release of XP, and probably would’ve had time to start the transition with Win2K at the end of 1999.
replies(1): >>40177690 #
15. int_19h ◴[] No.40174531{4}[source]
Linux was definitely not uniformly UTF-8 twenty years ago. It was one of the many available locales, but it was still common to use other encodings, and plenty of software didn't handle multibyte well in general.
16. tremon ◴[] No.40175407{3}[source]
> poor quality Windows software to do stuff it's not suited to

Depends how wide your definition of "poor quality" is. All PowerShell files (ps1, psm1, psd1) are assumed to be in the local charset unless they have a byte order mark, in which case they're treated as whatever the BOM says.
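
Roughly this logic, sketched in Python (the function is hypothetical; PowerShell's real rules cover more BOM variants):

    import locale

    def sniff_encoding(raw: bytes) -> str:
        """Hypothetical BOM sniffing: local charset only as a fallback."""
        if raw.startswith(b"\xef\xbb\xbf"):
            return "utf-8-sig"
        if raw.startswith((b"\xff\xfe", b"\xfe\xff")):
            return "utf-16"  # the codec reads endianness from the BOM
        return locale.getpreferredencoding(False)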

replies(1): >>40196450 #
17. Dylan16807 ◴[] No.40175699{3}[source]
Not everything is C, and nothing like that saves you when you move your floppy between computers.
18. thayne ◴[] No.40176569[source]
Really? In my experience it's pretty rare for Linux programs not to understand any multibyte UTF-8 (which would be anything that isn't ASCII). What is somewhat common is failing on codepoints outside the Basic Multilingual Plane (codepoints that don't fit in 16 bits).
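
The distinction in Python terms:

    ch = "\U0001d11e"  # MUSICAL SYMBOL G CLEF, outside the BMP

    print(len(ch))                      # 1 codepoint
    print(len(ch.encode("utf-16-le")))  # 4 bytes: a surrogate pair
    print(len(ch.encode("utf-8")))      # 4 bytes, no special case needed
    print(len("é".encode("utf-8")))     # 2 bytes: multibyte UTF-8 within the BMP
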
19. mixmastamyk ◴[] No.40177690{6}[source]
Every OS except the one that had 95% market share in the late 90s. Apple was only propped up "Weekend at Bernie's" style to appease regulators.
20. andrewshadura ◴[] No.40177841[source]
Etch came out in 2007, not 2010.
replies(1): >>40179280 #
21. anthk ◴[] No.40179122{3}[source]
TBH gVim and most editors had the same prompts on saving, but you could at least configure that in Emacs with M-x customize, and Emacs supported weirdly encoded files on the spot.
22. layer8 ◴[] No.40179280[source]
Ah, I had misremembered, and misread https://www.debian.org/releases/etch/.
23. account42 ◴[] No.40196450{4}[source]
> Depends how wide your definition of "poor quality" is.

This is an example of poor quality software:

> All PowerShell files (ps1, psm1, psd1) are assumed to be in the local charset unless they have a byte order mark, in which case they're treated as whatever the BOM says.

PowerShell is not that old. Assuming the local encoding is inexcusable here.

24. account42 ◴[] No.40196503{5}[source]
> and an earlier realization that Unicode cannot in fact fit into 16 bits

The Unicode Consortium already realized it when they decided on Han unification; they just didn't accept it yet.