←back to thread

238 points GalaxySnail | 1 comments | | HN request time: 0.001s | source
Show context
nerdponx ◴[] No.40169967[source]
Default text file encoding being platform-dependent always drove me nuts. This is a welcome change.

I also appreciate that they did not attempt to tackle filesystem encoding here, which is a separate issue that drives me nuts, but separately.

replies(4): >>40171063 #>>40171211 #>>40172228 #>>40173633 #
layer8 ◴[] No.40171063[source]
Historically it made sense, when most software was local-only, and text files were expected to be in the local encoding. Not just platform-dependent, but user’s preferred locale-dependent. This is also how the C standard library operates.

For example, on Unix/Linux, using iso-8859-1 was common when using Western-European languages, and in Europe it became common to switch to iso-8859-15 after the Euro was introduced, because it contained the € symbol. UTF-8 only began to work flawlessly in the later aughts. Debian switched to it as the default with the Etch release in 2010.

replies(4): >>40172024 #>>40172052 #>>40172183 #>>40177841 #
da_chicken ◴[] No.40172183[source]
It's still not that uncommon to see programs on Linux not understanding multibyte UTF-8.

It's also true that essentially nothing on Linux supports the UTF-8 byte order mark. Yes, it's meaningless for UTF-8, but it is explicitly allowed in the specifications. Since Microsoft tends to always include a BOM in any flavor of Unicode, this means Linux often chokes on valid UTF-8 text files from Windows systems.

replies(3): >>40172423 #>>40172441 #>>40176569 #
1. thayne ◴[] No.40176569[source]
Really? In my experience it's pretty rare for Linux programs not to understand any multibyte utf-8 (which would be anything that isn't ascii). What is somewhat common is failing on code points outside the basic multilingual plane (codepoints that don't fit in 16 bits).