238 points GalaxySnail | 34 comments
1. nerdponx ◴[] No.40169967[source]
Default text file encoding being platform-dependent always drove me nuts. This is a welcome change.

I also appreciate that they did not attempt to tackle filesystem encoding here, which is a separate issue that drives me nuts, but separately.
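For the curious, a minimal sketch of what the platform dependence looks like in Python (the -X warn_default_encoding flag is from PEP 597):

    import locale

    # what open() uses when no encoding= is given
    print(locale.getpreferredencoding(False))  # e.g. 'cp1252' on Windows, 'UTF-8' on Linux

    # passing encoding explicitly sidesteps the platform default entirely
    with open("notes.txt", "w", encoding="utf-8") as f:
        f.write("déjà vu\n")

    # running under `python -X warn_default_encoding` emits an EncodingWarning
    # at every call site that silently relies on the default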

replies(4): >>40171063 #>>40171211 #>>40172228 #>>40173633 #
2. layer8 ◴[] No.40171063[source]
Historically it made sense: most software was local-only, and text files were expected to be in the local encoding. Not just platform-dependent, but dependent on the user's preferred locale. This is also how the C standard library operates.

For example, on Unix/Linux, using iso-8859-1 was common when using Western-European languages, and in Europe it became common to switch to iso-8859-15 after the Euro was introduced, because it contained the € symbol. UTF-8 only began to work flawlessly in the later aughts. Debian switched to it as the default with the Etch release in 2010.
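The difference is easy to demonstrate in Python, the language at hand:

    # iso-8859-15 replaced the old currency sign at 0xA4 with the Euro sign
    print("€".encode("iso-8859-15"))   # b'\xa4'
    try:
        "€".encode("iso-8859-1")       # latin-1 predates the Euro
    except UnicodeEncodeError as e:
        print(e)                       # can't encode character '\u20ac'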

replies(4): >>40172024 #>>40172052 #>>40172183 #>>40177841 #
3. avidphantasm ◴[] No.40171211[source]
Yeah, this has bitten me several times as soon as people use the code on Windows.
4. anthk ◴[] No.40172024[source]
Emacs was amazing for that; built-in text encoders/decoders/transcoders for everything.
replies(1): >>40172110 #
5. Dylan16807 ◴[] No.40172052[source]
> Not just platform-dependent, but user’s preferred locale-dependent.

Historically it made sense to be locale-dependent, but even then it was annoying to be platform-dependent.

One is not a subset of the other.

replies(2): >>40172171 #>>40172645 #
6. hollerith ◴[] No.40172110{3}[source]
My experience was that brittleness around text encoding in Emacs (versions 22 and 23 or so) was a constant source of annoyance for years.

IIRC, the main way this brittleness bit me was that every time a buffer containing a non-ASCII character was saved, Emacs would engage me in a tedious, distracting conversation about what coding system I would like to use to save the file. I never found a sane way to configure it to avoid such conversations, even after spending hours learning about how Emacs does coding systems: I simply had to wait (a year or 3) for a new version of Emacs in which the code for saving buffers worked better.

I think some people actually like engaging in these conversations with their computers, boring and repetitive as they are, and that such conversation-likers are numerous among Emacs users, or at least Emacs maintainers.

replies(1): >>40179122 #
7. hermitdev ◴[] No.40172171{3}[source]
> platform-dependent.

It's 2024 and we still can't all agree on line endings. Mac vs Win vs Unix...

replies(2): >>40172265 #>>40172368 #
8. da_chicken ◴[] No.40172183[source]
It's still not that uncommon to see programs on Linux not understanding multibyte UTF-8.

It's also true that essentially nothing on Linux supports the UTF-8 byte order mark. Yes, it's meaningless for UTF-8, but it is explicitly allowed in the specifications. Since Microsoft tends to always include a BOM in any flavor of Unicode, this means Linux often chokes on valid UTF-8 text files from Windows systems.

replies(3): >>40172423 #>>40172441 #>>40176569 #
9. fbdab103 ◴[] No.40172228[source]
A different one that bit me just the other day was implicit line-ending conversion. Local testing on my corporate laptop all went according to plan; deploy to a Linux host, and the downstream application can't consume the output because it requires CRLF.

Just one of those stupid little things you have to remember from time to time. Although why newly written software requires a specific line terminator is a valid question.
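Python's open() exposes control over this via its newline parameter; a minimal sketch:

    # newline="\r\n" forces CRLF on write, regardless of platform
    with open("out.csv", "w", newline="\r\n") as f:
        f.write("a,b\n")                 # stored on disk as b'a,b\r\n'

    # newline="" disables translation on read, preserving the file's terminators
    assert open("out.csv", newline="").read() == "a,b\r\n"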

10. Y-bar ◴[] No.40172265{4}[source]
Mac OS and Unix agreed about twenty years ago to use the same ending: https://superuser.com/a/439443
replies(1): >>40172390 #
11. Longhanks ◴[] No.40172368{4}[source]
It's 2024; everything but Windows has been UTF-8 with \n for twenty years.
replies(1): >>40174531 #
12. Dylan16807 ◴[] No.40172390{5}[source]
By which time XP was already mid-release, so it was too late to get Windows on board.

It's too bad; with a bit more planning and an earlier realization that Unicode cannot in fact fit into 16 bits, Windows might have used UTF-8 internally.

replies(2): >>40174470 #>>40196503 #
13. nerdponx ◴[] No.40172423{3}[source]
Interestingly, Python is one of those programs.

You need to use the special "utf-8-sig" encoding for that, which is not prominently advertised anywhere in the documentation (but it is stated deep inside the "Unicode HOWTO").

I never understood why ignoring this special character requires a totally separate encoding.
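A minimal sketch of the asymmetry, assuming a file written by a typical Windows tool that prepends a BOM:

    # simulate a Windows-authored file: BOM followed by UTF-8 text
    with open("data.txt", "wb") as f:
        f.write(b"\xef\xbb\xbfhello")

    # plain utf-8 keeps the BOM as a leading U+FEFF character
    print(repr(open("data.txt", encoding="utf-8").read()))      # '\ufeffhello'

    # utf-8-sig strips it if present
    print(repr(open("data.txt", encoding="utf-8-sig").read()))  # 'hello'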

replies(1): >>40173468 #
14. tialaramex ◴[] No.40172441{3}[source]
The BOM cases are at best a consequence of trying to use poor quality Windows software to do stuff it's not suited to. It's true that in terms of Unicode text it's valid for a UTF-8 string to have a BOM, but just because that's true in the text itself doesn't magically change file formats which long pre-dated that.

Most obviously shebang (the practice of writing #!/path/to/interpreter at the start of a script) is specifically defined on those first two bytes. It doesn't make any sense to have a BOM here because that's not the format, and inventing a new rule later which says you can do it doesn't make that true, any more than in 2024 the German government can decide Germany didn't invade Poland in 1939; that's not how Time's Arrow works.
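A tiny demonstration of why, with hypothetical file contents:

    # the kernel matches the literal first two bytes of the file against b'#!'
    with_bom = b"\xef\xbb\xbf#!/bin/sh\necho hi\n"
    print(with_bom[:2])   # b'\xef\xbb', not b'#!', so no interpreter is found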

replies(1): >>40175407 #
15. layer8 ◴[] No.40172645{3}[source]
Not sure what you mean by that with regard to encodings. The C APIs were explicitly designed to abstract from that, and together with libraries like iconv it was rather straightforward. You only needed to be aware that there is a difference between internal and external encoding, and maybe decide between char and wchar_t.
replies(1): >>40175699 #
16. duskwuff ◴[] No.40173468{4}[source]
> I never understood why ignoring this special character requires a totally separate encoding.

Because the BOM is indistinguishable from the "real" UTF-8 encoding of U+FEFF (zero-width no-break space). Trimming that codepoint in the UTF-8 decoder means that some strings like "\uFEFF" can't be safely round-tripped; adding it in the encoder is invalid in many contexts.
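A short sketch of that round-trip hazard:

    s = "\ufeff" + "data"
    b = s.encode("utf-8")                  # b'\xef\xbb\xbfdata'
    # a strict utf-8 decoder must preserve the codepoint:
    assert b.decode("utf-8") == s
    # utf-8-sig deliberately treats it as a BOM and drops it:
    assert b.decode("utf-8-sig") == "data"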

17. Dwedit ◴[] No.40173633[source]
With system-default code pages on Windows, it's not only platform-dependent, it's also System Locale dependent.

Windows badly dropped the ball here by not providing a simple opt-in way to make all the Ansi functions (TextOutA, etc.) use the UTF-8 code page until many, many years later, with the application manifest. This should have been a feature introduced in NT4 or Windows 98, not something that was put off until midway through Windows 10's development cycle.

replies(3): >>40174790 #>>40175724 #>>40179961 #
18. jmb99 ◴[] No.40174470{6}[source]
Unless I'm mistaken, Rhapsody (released 1997) used LF, not CR. At that point it was pretty clear the Mac was moving towards Unix through NeXTSTEP, meaning every OS except Windows would be using LF. Microsoft would've had around 6 years before the release of XP, and probably would've had time to start the transition with Win2K at the end of 1999.
replies(1): >>40177690 #
19. int_19h ◴[] No.40174531{5}[source]
Linux was definitely not uniformly UTF-8 twenty years ago. It was one of the many available locales, but it was still common to use other encodings, and plenty of software didn't handle multibyte well in general.
20. sheepscreek ◴[] No.40174790[source]
I suspect that is a symptom of Microsoft being an enormously large organization. Coordinating a change like this that cuts across all apps, services and drivers is monumental. Honestly it is quite refreshing to see them do it with Copilot integration across all things MS. I don’t use it though, just admire the valiant effort and focus it takes to pull off something like this.

Of course - goes without saying, only works when the directive comes from all the way at the top. Otherwise there will be just too many conflicting incentives for any real change to happen.

While I am on this topic, I want to mention Apple. It is absolutely bonkers how they have done exactly this countless times. Like changing their entire platform architecture! It could have opened a can of worms, but they knew what they were doing. Kudos to them.

Also.. (sorry, this is becoming a long post) civil and industrial engineering firms routinely pull off projects like that. But the point I wanted to emphasize is that it's very uncommon in tech, which prides itself on having decentralized and semi-autonomous teams rather than centralized and highly aligned ones.

replies(1): >>40177752 #
21. tremon ◴[] No.40175407{4}[source]
> poor quality Windows software to do stuff it's not suited to

Depends how wide your definition of "poor quality" is. All powershell files (ps1, psm1, psd1) are assumed to be in the local charset unless they have a byte order mark, in which case they're treated as whatever the BOM says.

replies(1): >>40196450 #
22. Dylan16807 ◴[] No.40175699{4}[source]
Not everything is C, and nothing like that saves you when you move your floppy between computers.
23. kevin_thibedeau ◴[] No.40175724[source]
"UCS-2 is enough for anyone"
replies(1): >>40176354 #
24. Dwedit ◴[] No.40176354{3}[source]
UCS-2 is why we have the WTF-8 encoding, which allows unpaired UTF-16 surrogates to survive a round-trip through an 8-bit encoding.

https://simonsapin.github.io/wtf-8/
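Python's "surrogatepass" error handler produces the same kind of bytes; a small sketch:

    lone = "\ud800"                             # an unpaired UTF-16 surrogate
    b = lone.encode("utf-8", "surrogatepass")   # b'\xed\xa0\x80'
    assert b.decode("utf-8", "surrogatepass") == lone
    # strict utf-8 refuses it, as the spec requires:
    try:
        lone.encode("utf-8")
    except UnicodeEncodeError:
        print("rejected by strict utf-8")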

25. thayne ◴[] No.40176569{3}[source]
Really? In my experience it's pretty rare for Linux programs not to understand multibyte UTF-8 at all (which would mean anything that isn't ASCII). What is somewhat common is failing on codepoints outside the Basic Multilingual Plane (codepoints that don't fit in 16 bits).
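Concretely, in Python terms:

    s = "🐍"                                # U+1F40D, outside the BMP
    print(len(s))                           # 1 codepoint
    print(len(s.encode("utf-8")))           # 4 bytes in UTF-8
    print(len(s.encode("utf-16-le")) // 2)  # 2 UTF-16 code units (a surrogate pair)
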
26. mixmastamyk ◴[] No.40177690{7}[source]
Every OS except the one that had 95% market share in the late 90s. Apple was only propped up “Weekend at Bernie’s” style to appease regulators.
27. samus ◴[] No.40177752{3}[source]
> While I am on this topic - I want to mention Apple. It is absolutely bonkers how they have done exactly the is countless times. Like changing your entire platform architecture! It could have been like opening a can of worms but they knew what they were doing. Kudos to them.

Apple has a walled garden approach to managing their ecosystem, and within the confines of their garden they just do what's necessary. AFAIK, Apple doesn't care about the possibility of running binaries from the '90s on a modern stack.

Edit: even though it's expensive, it's possible to conduct such ecosystem-wide changes if you hold all the cards in your hand. Microsoft was able to reengineer the graphical subsystem somewhere between XP and 8. Doing something like this is magnitudes more difficult on Linux (Wayland says hi). Google could maybe do it within their Android corner, but they generally don't give a sh*t about backwards compatibility.

replies(1): >>40199368 #
28. andrewshadura ◴[] No.40177841[source]
Etch came out in 2007, not 2010.
replies(1): >>40179280 #
29. anthk ◴[] No.40179122{4}[source]
TBH GVim and most editors did the same kind of prompting on save, but you could at least configure that in Emacs with M-x customize, and Emacs supported weirdly encoded files on the spot.
30. layer8 ◴[] No.40179280{3}[source]
Ah, I had misremembered, and misread https://www.debian.org/releases/etch/.
31. hprotagonist ◴[] No.40179961[source]
I still see people getting nailed by CP1251.

Recently, I've gotten bitten by UTF-16 (because somewhere along the line, something on a Windows machine generated a file by piping it in PowerShell).
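The usual defensive move is to sniff the BOM before decoding; a minimal sketch (read_text_sniffing_bom is a made-up helper name):

    import codecs

    def read_text_sniffing_bom(path):
        raw = open(path, "rb").read()
        if raw.startswith(codecs.BOM_UTF8):
            return raw.decode("utf-8-sig")
        if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
            return raw.decode("utf-16")  # the utf-16 codec honors the BOM
        return raw.decode("utf-8")       # otherwise assume utf-8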

32. account42 ◴[] No.40196450{5}[source]
> Depends how wide your definition of "poor quality" is.

This is an example of poor quality software:

> All powershell files (ps1, psm1, psd1) are assumed to be in the local charset unless they have a byte order mark, in which case they're treated as whatever the BOM says.

PowerShell is not that old. Assuming the local encoding is inexcusable here.

33. account42 ◴[] No.40196503{6}[source]
> and an earlier realization that Unicode cannot in fact fit into 16 bits

The Unicode consortium already realized it when they decided on Han unification; they just didn't accept it yet.

34. fl0ki ◴[] No.40199368{4}[source]
> Apple has a walled garden approach to managing their ecosystem, and within the confines of their garden they just do what's necessary.

I don't think the walled garden makes much of a difference when it comes to compatibility on, say, macOS. They still have to carefully weigh the ecosystem-wide cost of deprecating old APIs against the ecosystem-wide long-term benefits. Yes the decision remains entirely their own, but a lot of stakeholders indirectly weigh on the decision.

GTK and Qt also make backwards-incompatible new versions as they evolve. The biggest difference here is that in theory someone could keep maintaining the old library code if they decided that updating their application code was always going to be harder. How rarely this actually happens gives weight to the argument that developers can accept occasional API overhauls in exchange for staying on the well-supported low-tech-debt path.

So walled or open made no difference here, even in the open platform, application developers are largely at the mercy of where development effort on libraries and frameworks is going. Nobody can afford to make their own exclusive frameworks to an acceptable standard, and if we want to get away from the technical debt of the 90s then the shared frameworks have to make breaking changes occasionally and strategically.

> AFAIK, Apple doesn't care about the possibilty to run binaries from the '90s on a modern stack.

Definitely, and I don't either. It's kind of a silver lining that Apple wasn't the enterprise heavy-hitter that Microsoft was at the time, because if it had been, its entire culture and landscape would be shaped by it like Microsoft's was. I think we have enough of that in the industry already.

When an old platform is that old, it's really hard to justify making it a seamless subset of the modern platform, and it makes more sense to talk about some form of virtualization. This is where even Windows falls down on both counts. How well modern Windows runs old software is far more variable than people assume until they try it. Anything with 8-bit colors may not work at all.

VirtualBox, qemu, etc. have increasingly poor support for DOS-based Windows (95, 98, ME) because not enough people care about that even in the context of virtualization. After trying every free virtualization option to run some 90s Windows software, I ended up finding that WINE was more compatible with that era than modern Windows is, without any of the jank of running a real Windows in qemu or VirtualBox.

So even with the OS most famous for backwards compatibility, and the enormous technical debt that carries, compatibility has been slowly sliding, to the point of being worse than open source projects with no direct lineage to the same platform and no commercial motives.

It's perfectly justifiable to reset technical debt here, whether walled or open. If people have enough need to run old software, there should be a market of solutions to that problem, yet it generally remains niche or hobbyist, and even the big commercial vendors overestimate how well they're doing it.