
238 points GalaxySnail | 3 comments
a-french-anon No.40170353
Why not utf-8-sig, though? It handles optional BOMs. I had to fix a script last week that choked on one.
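A minimal sketch of the difference, using Python's codecs (the byte string is just for illustration):

    # Python's "utf-8-sig" codec strips an optional leading BOM on decode,
    # while plain "utf-8" leaves it in the text as U+FEFF.
    data = b"\xef\xbb\xbfhello"               # UTF-8 BOM followed by "hello"

    print(repr(data.decode("utf-8")))         # '\ufeffhello' -- BOM leaks through
    print(repr(data.decode("utf-8-sig")))     # 'hello'       -- BOM stripped
    print(repr(b"hello".decode("utf-8-sig"))) # 'hello'       -- BOM is optional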
shellac No.40171048
At this point nothing ought to be inserting BOMs into UTF-8 text. It's not recommended, and I think choking on one is reasonable behaviour these days.
Athas No.40171192
Why were BOMs ever allowed for UTF-8?
1. da_chicken No.40172241
Some algorithms are much simpler if they can assume that multi-byte or variable-width characters don't exist. The BOM means you don't have to scan the entire document to know whether you can make that assumption.
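A sketch of the kind of shortcut being described (a hypothetical helper, assuming the only BOM of interest is UTF-8's EF BB BF; whether the no-BOM branch is actually safe is what the next reply disputes):

    UTF8_BOM = b"\xef\xbb\xbf"

    def can_use_byte_indexing(path):
        # If a BOM is present, multi-byte sequences may follow, so
        # treating the Nth byte as the Nth character is unsafe and the
        # caller must fall back to a slower, Unicode-aware path.
        with open(path, "rb") as f:
            return f.read(3) != UTF8_BOM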
2. account42 No.40196745
No, a missing BOM does not guarantee that the file contains no multi-byte sequences, and it never has.
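Concretely (plain Python, nothing here writes a BOM):

    data = "naïve".encode("utf-8")   # no BOM anywhere
    print(data)                      # b'na\xc3\xafve' -- multi-byte anyway
    print(len(data), len("naïve"))   # 6 5 -- byte count != character count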
3. da_chicken No.40201400
That isn't what I said.

I said that if a BOM is present, then you know explicitly that multi-byte characters may be present. Therefore, if it's there, you know that assuming the Nth byte is the Nth code point is unsafe.

The converse is irrelevant. Without a BOM, there's never any way to safely determine a text file's encoding in the first place.
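Concretely, a minimal sketch of the byte-versus-code-point mismatch:

    text = "\ufeffcafé"        # BOM followed by "café"
    raw = text.encode("utf-8")

    # The BOM occupies bytes 0-2 and "é" takes two bytes, so byte
    # offsets and code point offsets diverge immediately.
    print(raw[:3])             # b'\xef\xbb\xbf' -- 3 bytes, 1 code point
    print(text[4])             # 'é' -- the 5th code point...
    print(raw[7])              # 169 (0xa9) -- just one byte of its encoding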