238 points GalaxySnail | 8 comments
a-french-anon ◴[] No.40170353[source]
Why not utf-8-sig, though? It handles an optional BOM transparently. I had to fix a script last week that choked on one.
replies(3): >>40170707 #>>40170832 #>>40171048 #
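A minimal sketch of the difference in Python: decoding with plain `utf-8` keeps a leading BOM as a `U+FEFF` character, while `utf-8-sig` strips it if present and is a no-op otherwise.

```python
# Bytes of a file that begins with the UTF-8 BOM (EF BB BF).
data = b"\xef\xbb\xbfhello"

# Plain utf-8 decodes the BOM into a leading U+FEFF character...
print(repr(data.decode("utf-8")))      # '\ufeffhello'

# ...while utf-8-sig strips one optional leading BOM.
print(repr(data.decode("utf-8-sig")))  # 'hello'

# utf-8-sig is also harmless on BOM-less input.
print(repr(b"hello".decode("utf-8-sig")))  # 'hello'
```

That stray `U+FEFF` at the start of the decoded string is typically what "choking on a BOM" looks like in practice.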
shellac ◴[] No.40171048[source]
At this point nothing ought to be inserting BOMs into UTF-8. It's not recommended, and I think choking on it is reasonable behaviour these days.
replies(3): >>40171192 #>>40173969 #>>40178398 #
1. Athas ◴[] No.40171192[source]
Why were BOMs ever allowed for UTF-8?
replies(5): >>40171419 #>>40171452 #>>40172241 #>>40175549 #>>40177110 #
2. plorkyeran ◴[] No.40171419[source]
When UTF-8 was still very much not the default encoding for text files, it was useful to have a way to signal that a file was UTF-8 and not the local system encoding.
3. josefx ◴[] No.40171452[source]
Some editors used them to help detect UTF-8 encoded files. Since U+FEFF also doubles as the zero-width no-break space, the BOM served as a nice easter egg for people who ended up editing their Linux shell scripts with a Windows text editor.
4. da_chicken ◴[] No.40172241[source]
Some algorithms can operate much more easily if they can assume that multi-byte or variable-width characters don't exist. The BOM means that you don't have to scan the entire document to know whether you can do that.
replies(1): >>40196745 #
5. ◴[] No.40175549[source]
6. stubish ◴[] No.40177110[source]
An attempt to store the encoding needed to decode the data with the data, rather than requiring the reader to know it somehow. Your program wouldn't have to care if its source data had been encoded as UTF-8, UTF-16, UTF-32 or some future standard. The usual sort of compromise that comes out of committees, in this case where every committee member wanted to be able to spit their preferred in-memory Unicode string representation to disk with no encoding overhead.
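That encoding-dispatch idea can be sketched as a BOM sniffer. This is an illustrative helper (the name `sniff_bom` is mine, not from the thread), not a robust detector:

```python
def sniff_bom(data: bytes) -> "str | None":
    """Guess an encoding from a leading BOM, if any (illustrative sketch)."""
    # Order matters: the UTF-32 BOMs begin with the UTF-16 ones.
    boms = [
        (b"\xef\xbb\xbf", "utf-8-sig"),
        (b"\xff\xfe\x00\x00", "utf-32-le"),
        (b"\x00\x00\xfe\xff", "utf-32-be"),
        (b"\xff\xfe", "utf-16-le"),
        (b"\xfe\xff", "utf-16-be"),
    ]
    for bom, encoding in boms:
        if data.startswith(bom):
            return encoding
    return None  # no BOM: the reader still has to know the encoding somehow

print(sniff_bom("hi".encode("utf-8-sig")))  # utf-8-sig
print(sniff_bom(b"plain ascii"))            # None
```

The `None` branch is the committee compromise in miniature: a BOM makes the data self-describing, but only for writers who bother to emit one.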
7. account42 ◴[] No.40196745[source]
No, a missing BOM does not guarantee that the file contains no multi-byte sequences, and it never has.
replies(1): >>40201400 #
8. da_chicken ◴[] No.40201400{3}[source]
That isn't what I said.

I said that if a BOM is present, then you know multi-byte characters may be present, so you know it is unsafe to assume that the Nth byte is the Nth code point.

The converse is irrelevant. Without a BOM there is no reliable way to determine a text file's encoding in the first place.