←back to thread

238 points GalaxySnail | 7 comments | | HN request time: 0.601s | source | bottom
Show context
nerdponx ◴[] No.40169967[source]
Default text file encoding being platform-dependent always drove me nuts. This is a welcome change.

I also appreciate that they did not attempt to tackle filesystem encoding here, which is a separate issue that drives me nuts, but separately.

replies(4): >>40171063 #>>40171211 #>>40172228 #>>40173633 #
layer8 ◴[] No.40171063[source]
Historically it made sense, when most software was local-only, and text files were expected to be in the local encoding. Not just platform-dependent, but user’s preferred locale-dependent. This is also how the C standard library operates.

For example, on Unix/Linux, using iso-8859-1 was common when using Western-European languages, and in Europe it became common to switch to iso-8859-15 after the Euro was introduced, because it contained the € symbol. UTF-8 only began to work flawlessly in the later aughts. Debian switched to it as the default with the Etch release in 2010.

replies(4): >>40172024 #>>40172052 #>>40172183 #>>40177841 #
1. da_chicken ◴[] No.40172183[source]
It's still not that uncommon to see programs on Linux not understanding multibyte UTF-8.

It's also true that essentially nothing on Linux supports the UTF-8 byte order mark. Yes, it's meaningless for UTF-8, but it is explicitly allowed in the specifications. Since Microsoft tends to always include a BOM in any flavor of Unicode, this means Linux often chokes on valid UTF-8 text files from Windows systems.

replies(3): >>40172423 #>>40172441 #>>40176569 #
2. nerdponx ◴[] No.40172423[source]
Interestingly, Python is one of those programs.

You need to use the special "utf-8-sig" encoding for that, which is not prominently advertised anywhere in the documentation (but it is stated deep inside the "Unicode HOWTO").

I never understood why ignoring this special character requires a totally separate encoding.

replies(1): >>40173468 #
3. tialaramex ◴[] No.40172441[source]
The BOM cases are at best a consequence of trying to use poor quality Windows software to do stuff it's not suited to. It's true that in terms of Unicode text it's valid for a UTF-8 string to have a BOM, but just because that's true in the text itself doesn't magically change file formats which long pre-dated that.

Most obviously shebang (the practice of writing #!/path/to/interpreter at the start of a script) is specifically defined on those first two bytes. It doesn't make any sense have a BOM here because that's not the format, and inventing a new rule later which says you can do it doesn't make that true, any more than in 2024 the German government can decide Germany didn't invade Poland in 1939, that's not how Time's Arrow works.

replies(1): >>40175407 #
4. duskwuff ◴[] No.40173468[source]
> I never understood why ignoring this special character requires a totally separate encoding.

Because the BOM is indistinguishable from the "real" UTF-8 encoding of U+FEFF (zero-width no-break space). Trimming that codepoint in the UTF-8 decoder means that some strings like "\uFEFF" can't be safely round-tripped; adding it in the encoder is invalid in many contexts.

5. tremon ◴[] No.40175407[source]
poor quality Windows software to do stuff it's not suited to

Depends how wide your definition of "poor quality" is. All powershell files (ps1, psm1, psd1) are assumed to be in the local charset unless they have a byte order mark, in which case they're treated as whatever the BOM says.

replies(1): >>40196450 #
6. thayne ◴[] No.40176569[source]
Really? In my experience it's pretty rare for Linux programs not to understand any multibyte utf-8 (which would be anything that isn't ascii). What is somewhat common is failing on code points outside the basic multilingual plane (codepoints that don't fit in 16 bits).
7. account42 ◴[] No.40196450{3}[source]
> Depends how wide your definition of "poor quality" is.

This is an example of poor quality software:

> All powershell files (ps1, psm1, psd1) are assumed to be in the local charset unless they have a byte order mark, in which case they're treated as whatever the BOM says.

Powershell is not that old. Assuming local encoding is inexcusable here.