238 points GalaxySnail | 8 comments
a-french-anon ◴[] No.40170353[source]
Why not utf-8-sig, though? It handles an optional BOM transparently. I had to fix a script last week that choked on one.
replies(3): >>40170707 #>>40170832 #>>40171048 #
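A minimal sketch of the difference in Python: decoding with plain `utf-8` keeps a leading BOM as a `U+FEFF` character, while `utf-8-sig` strips it if present and is a no-op otherwise.

```python
# Bytes of a file that begins with the UTF-8 BOM (EF BB BF).
data = b"\xef\xbb\xbfhello"

# Plain utf-8 decodes the BOM into a leading U+FEFF character...
print(repr(data.decode("utf-8")))      # '\ufeffhello'

# ...while utf-8-sig strips one optional leading BOM.
print(repr(data.decode("utf-8-sig")))  # 'hello'

# utf-8-sig is also harmless on BOM-less input.
print(repr(b"hello".decode("utf-8-sig")))  # 'hello'
```

That stray `U+FEFF` at the start of the decoded string is typically what "choking on a BOM" looks like in practice.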
shellac ◴[] No.40171048[source]
At this point nothing ought to be inserting BOMs into UTF-8. It's not recommended, and I think choking on it is reasonable behaviour these days.
replies(3): >>40171192 #>>40173969 #>>40178398 #
1. Athas ◴[] No.40171192[source]
Why were BOMs ever allowed for UTF-8?
replies(5): >>40171419 #>>40171452 #>>40172241 #>>40175549 #>>40177110 #
2. plorkyeran ◴[] No.40171419[source]
When UTF-8 was still very much not the default encoding for text files, it was useful to have a way to signal that a file was UTF-8 and not the local system encoding.
3. josefx ◴[] No.40171452[source]
Some editors used them to help detect UTF-8 encoded files. Since U+FEFF also doubles as the zero-width no-break space, the BOM served as a nice easter egg for people who ended up editing their Linux shell scripts with a Windows text editor.
4. da_chicken ◴[] No.40172241[source]
Some algorithms can operate much more easily if they can assume that multi-byte or variable-width characters don't exist. The BOM means that you don't have to scan the entire document to know whether you can do that.
replies(1): >>40196745 #
5. ◴[] No.40175549[source]
6. stubish ◴[] No.40177110[source]
An attempt to store the encoding needed to decode the data with the data, rather than requiring the reader to know it somehow. Your program wouldn't have to care if its source data had been encoded as UTF-8, UTF-16, UTF-32 or some future standard. The usual sort of compromise that comes out of committees, in this case where every committee member wanted to be able to spit their preferred in-memory Unicode string representation to disk with no encoding overhead.
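That encoding-dispatch idea can be sketched as a BOM sniffer. This is an illustrative helper (the name `sniff_bom` is mine, not from the thread), not a robust detector:

```python
def sniff_bom(data: bytes) -> "str | None":
    """Guess an encoding from a leading BOM, if any (illustrative sketch)."""
    # Order matters: the UTF-32 BOMs begin with the UTF-16 ones.
    boms = [
        (b"\xef\xbb\xbf", "utf-8-sig"),
        (b"\xff\xfe\x00\x00", "utf-32-le"),
        (b"\x00\x00\xfe\xff", "utf-32-be"),
        (b"\xff\xfe", "utf-16-le"),
        (b"\xfe\xff", "utf-16-be"),
    ]
    for bom, encoding in boms:
        if data.startswith(bom):
            return encoding
    return None  # no BOM: the reader still has to know the encoding somehow

print(sniff_bom("hi".encode("utf-8-sig")))  # utf-8-sig
print(sniff_bom(b"plain ascii"))            # None
```

The `None` branch is the committee compromise in miniature: a BOM makes the data self-describing, but only for writers who bother to emit one.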
7. account42 ◴[] No.40196745[source]
No, a missing BOM does not guarantee that the file contains no multi-byte sequences, and it never has.
replies(1): >>40201400 #
8. da_chicken ◴[] No.40201400{3}[source]
That isn't what I said.

I said that if a BOM is present, then you know multi-byte characters may be present, so you know it is unsafe to assume that the Nth byte is the Nth code point.

The converse is irrelevant. Without a BOM there is no reliable way to determine a text file's encoding in the first place.