PEP 686 – Make UTF-8 mode default

(peps.python.org)

238 points GalaxySnail | 3 comments | 26 Apr 24 11:55 UTC


Euphorbium ◴[26 Apr 24 14:53 UTC] No.40170122[source]▶

>>40168242 (OP) #

I thought it had been the default since Python 3.

replies(2): >>40170798 #>>40173404 #

lucb1e ◴[26 Apr 24 15:39 UTC] No.40170798[source]▶

>>40170122 #

You may be thinking of strings, where the u"" prefix was made obsolete in Python 3. Then again, I just tried Python 2.7, and typing "éķů" prints the UTF-8 bytes for those characters, so I don't actually know what that u prefix ever did. Still, one of the big py2-to-3 changes was strings having an encoding and byte strings being for byte sequences without encodings.

This change seems to be mainly about things like open('filename', mode='r') on Windows, where the default encoding is not UTF-8, so you have to specify open('filename', mode='r', encoding='UTF-8').
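
For illustration, here is a minimal sketch of passing the encoding explicitly so that the result does not depend on the platform default. The helper name `roundtrip_utf8` and the file name `example.txt` are made up for this example; `sys.flags.utf8_mode` and `locale.getpreferredencoding(False)` only report what the current interpreter would use.

```python
import locale
import sys
import tempfile
from pathlib import Path


def roundtrip_utf8(text: str) -> tuple[bytes, str]:
    """Write text as UTF-8, then return the raw bytes and the decoded text."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "example.txt"  # hypothetical file name
        # Explicit encoding: same bytes on Windows, Linux and macOS.
        path.write_text(text, encoding="utf-8")
        raw = path.read_bytes()
        decoded = path.read_text(encoding="utf-8")
    return raw, decoded


raw, decoded = roundtrip_utf8("évènement")
print(raw)
print(decoded == "évènement")

# What this interpreter would pick if no encoding were given.
print("UTF-8 mode enabled:", bool(sys.flags.utf8_mode))
print("Locale encoding:", locale.getpreferredencoding(False))
```

Leaving out `encoding=` makes `open()` fall back to the locale encoding unless UTF-8 mode is on, which is the case the PEP is about.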

replies(2): >>40171035 #>>40171960 #

aktiur ◴[26 Apr 24 16:07 UTC] No.40171035[source]▶

>>40170798 #

> strings having an encoding and byte strings being for byte sequences without encodings

You got it kind of backwards. `str` are sequence of unicode codepoints (not UTF-8, which is a specific encoding for unicode codepoints), without reference to any encoding. `bytes` are arbitrary sequences of octets. If you have a `bytes` object that stands for text, you need to know that it is text and what its encoding is in order to interpret it correctly (by decoding it to `str`).

And if you have a `str` and want to serialize it (for writing or transmitting), you need to choose an encoding, because different encodings produce different `bytes`.

As an example:

  >>> "évènement".encode("utf-8")
  b'\xc3\xa9v\xc3\xa8nement'
  >>> "évènement".encode("latin-1")
  b'\xe9v\xe8nement'
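
The same point holds in the other direction. As a small sketch (the variable names are only for illustration), decoding the same `bytes` with the wrong encoding either silently produces garbage or fails outright:

```python
utf8_bytes = "évènement".encode("utf-8")
latin1_bytes = "évènement".encode("latin-1")

# Correct encoding: the original text comes back.
right = utf8_bytes.decode("utf-8")

# Wrong encoding, no error: every byte is a valid Latin-1 character,
# so each two-byte UTF-8 sequence turns into two unrelated characters.
wrong = utf8_bytes.decode("latin-1")

# Wrong encoding, with an error: these Latin-1 bytes are not valid UTF-8.
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(right == "évènement", len(right))
print(len(wrong))
print(decode_failed)
```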

replies(2): >>40171529 #>>40178954 #

1. chrismorgan ◴[27 Apr 24 10:46 UTC] No.40178954[source]▶

>>40171035 #

> `str` are sequence of unicode codepoints (not UTF-8, which is a specific encoding for unicode codepoints)

It’s worse than that, actually: UTF-8 is a specific encoding for sequences of Unicode scalar values (which means: code points minus the surrogate range U+D800–U+DFFF). Since str is a sequence of Unicode code points, this means you can make strings that cannot be encoded in any standard encoding:

  >>> '\udead'.encode('utf-16')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeEncodeError: 'utf-16' codec can't encode character '\udead' in position 0: surrogates not allowed
  >>> '\ud83d\ude41'.encode('utf-8')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed

Python 3’s strings are a tragedy. They seized defeat from the jaws of victory.
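
For reference, a short sketch of how the standard error handlers treat such a string. The `replace` and `surrogatepass` handlers are part of Python's codec machinery; the variable names are illustrative. Note that the bytes produced with `surrogatepass` are not valid UTF-8, so a strict decoder rejects them.

```python
lone = "\udead"  # a lone surrogate, as in the traceback above

# Strict encoding (the default) refuses surrogates.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# "replace" substitutes a question mark and loses the information.
replaced = lone.encode("utf-8", errors="replace")

# "surrogatepass" encodes the surrogate like any other code point.
passed = lone.encode("utf-8", errors="surrogatepass")

# A strict UTF-8 decoder rejects those bytes...
try:
    passed.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# ...but decoding with the same handler restores the original string.
restored = passed.decode("utf-8", errors="surrogatepass")

print(strict_ok, replaced, passed, strict_decode_ok, restored == lone)
```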

replies(1): >>40196637 #

2. account42 ◴[29 Apr 24 10:33 UTC] No.40196637[source]▶

>>40178954 (TP) #

Maybe we need another PEP that switches the default to WTF-8 [0] aka UTF-8 but let's ignore that a chunk of code points was reserved as surrogates and just encode them like any other code point.

[0] https://simonsapin.github.io/wtf-8/
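
A rough sketch of what that would mean in Python terms, assuming only what the WTF-8 document describes: surrogate pairs are first merged into the supplementary code point they stand for, and any remaining lone surrogate is encoded like an ordinary code point. The function name `encode_wtf8_sketch` is hypothetical; this is not a standard library codec and is not meant as a complete or validated implementation.

```python
import re

# A high surrogate immediately followed by a low surrogate.
_SURROGATE_PAIR = re.compile("[\ud800-\udbff][\udc00-\udfff]")


def _merge_pair(match: re.Match) -> str:
    """Combine a surrogate pair into the code point it represents."""
    high, low = (ord(ch) for ch in match.group())
    return chr(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))


def encode_wtf8_sketch(text: str) -> bytes:
    """Hypothetical WTF-8-style encoder: merge pairs, pass lone surrogates."""
    merged = _SURROGATE_PAIR.sub(_merge_pair, text)
    # "surrogatepass" encodes the remaining lone surrogates as 3-byte sequences.
    return merged.encode("utf-8", errors="surrogatepass")


print(encode_wtf8_sketch("\udead"))
print(encode_wtf8_sketch("\ud83d\ude41"))
```

Merging pairs first matters: passing each half of a pair through `surrogatepass` separately would produce two 3-byte sequences instead of the single 4-byte sequence for the code point.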

replies(1): >>40202858 #

3. chrismorgan ◴[29 Apr 24 19:25 UTC] No.40202858[source]▶

>>40196637 #

My comment was completely unrelated to PEP 686. WTF-8 is emphatically not intended to be used as a file encoding.
