
238 points GalaxySnail | 3 comments
a-french-anon No.40170353
Why not utf-8-sig, though? It handles optional BOMs. I had to fix a script last week that choked on one.
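A minimal sketch of the difference, using Python's codecs (the byte string is just for illustration):

    # Python's "utf-8-sig" codec strips an optional leading BOM on decode,
    # while plain "utf-8" leaves it in the text as U+FEFF.
    data = b"\xef\xbb\xbfhello"               # UTF-8 BOM followed by "hello"

    print(repr(data.decode("utf-8")))         # '\ufeffhello' -- BOM leaks through
    print(repr(data.decode("utf-8-sig")))     # 'hello'       -- BOM stripped
    print(repr(b"hello".decode("utf-8-sig"))) # 'hello'       -- BOM is optional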
shellac No.40171048
At this point nothing ought to be inserting BOMs into UTF-8 text. It's not recommended, and I think choking on one is reasonable behaviour these days.
Athas No.40171192
Why were BOMs ever allowed for UTF-8?
1. da_chicken No.40172241
Some algorithms are much simpler if they can assume that multi-byte or variable-width characters don't exist. The BOM means you don't have to scan the entire document to know whether you can make that assumption.
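A sketch of the kind of shortcut being described (a hypothetical helper, assuming the only BOM of interest is UTF-8's EF BB BF; whether the no-BOM branch is actually safe is what the next reply disputes):

    UTF8_BOM = b"\xef\xbb\xbf"

    def can_use_byte_indexing(path):
        # If a BOM is present, multi-byte sequences may follow, so
        # treating the Nth byte as the Nth character is unsafe and the
        # caller must fall back to a slower, Unicode-aware path.
        with open(path, "rb") as f:
            return f.read(3) != UTF8_BOM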
2. account42 No.40196745
No, a missing BOM does not guarantee that the file contains no multi-byte sequences, and it never has.
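Concretely (plain Python, nothing here writes a BOM):

    data = "naïve".encode("utf-8")   # no BOM anywhere
    print(data)                      # b'na\xc3\xafve' -- multi-byte anyway
    print(len(data), len("naïve"))   # 6 5 -- byte count != character count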
3. da_chicken No.40201400
That isn't what I said.

I said that if a BOM is present, then you know explicitly that multi-byte characters may be present. Therefore, if it's there, you know that assuming the Nth byte is the Nth code point is unsafe.

The converse is irrelevant. Without a BOM, there's never any way to safely determine a text file's encoding in the first place.
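Concretely, a minimal sketch of the byte-versus-code-point mismatch:

    text = "\ufeffcafé"        # BOM followed by "café"
    raw = text.encode("utf-8")

    # The BOM occupies bytes 0-2 and "é" takes two bytes, so byte
    # offsets and code point offsets diverge immediately.
    print(raw[:3])             # b'\xef\xbb\xbf' -- 3 bytes, 1 code point
    print(text[4])             # 'é' -- the 5th code point...
    print(raw[7])              # 169 (0xa9) -- just one byte of its encoding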