
238 points by GalaxySnail | 1 comment
Euphorbium
I thought it was the default since Python 3.
lucb1e
You may be thinking of strings, where the u"" prefix was made obsolete in Python 3. Then again, trying it on Python 2.7 just now, typing "éķů" results in it printing the UTF-8 bytes for those characters, so I don't actually know what that u prefix ever did. But one of the big py2-to-3 changes was strings having an encoding and byte strings being for byte sequences without encodings.

This change seems to be about things like open('filename', mode='r'), mainly on Windows where the default encoding is not UTF-8, so you'd have to specify open('filename', mode='r', encoding='UTF-8')
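A minimal sketch of that difference (my own example, not from the comment; the file name example.txt is made up): check which encoding open() falls back to, then pass one explicitly:

  import locale, sys

  # Encoding open() uses when no encoding= argument is given
  # (often cp1252 on Windows, usually UTF-8 elsewhere):
  print(locale.getpreferredencoding(False))

  # Nonzero when UTF-8 Mode (PEP 540) is enabled, e.g. via -X utf8:
  print(sys.flags.utf8_mode)

  # Passing encoding= explicitly makes the result platform-independent:
  with open('example.txt', 'w', encoding='utf-8') as f:
      f.write('éķů\n')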

aktiur
> strings having an encoding and byte strings being for byte sequences without encodings

You got it kind of backwards. `str` is a sequence of Unicode code points (not UTF-8, which is a specific encoding for Unicode code points), without reference to any encoding. `bytes` is an arbitrary sequence of octets. If you have some `bytes` object that somehow stands for text, you need to know that it is text and what its encoding is to be able to interpret it correctly (by decoding it to `str`).

And if you have a `str` and want to serialize it (for writing or transmitting), you need to choose an encoding, because different encodings will generate different `bytes`.

As an example:

  >>> "évènement".encode("utf-8")
  b'\xc3\xa9v\xc3\xa8nement'
  >>> "évènement".encode("latin-1")
  b'\xe9v\xe8nement'

chrismorgan
> `str` is a sequence of Unicode code points (not UTF-8, which is a specific encoding for Unicode code points)

It’s worse than that, actually: UTF-8 is a specific encoding for sequences of Unicode scalar values (which means: code points minus the surrogate range U+D800–U+DFFF). Since str is a sequence of Unicode code points, this means you can make strings that cannot be encoded in any standard encoding:

  >>> '\udead'.encode('utf-16')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeEncodeError: 'utf-16' codec can't encode character '\udead' in position 0: surrogates not allowed
  >>> '\ud83d\ude41'.encode('utf-8')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
Python 3’s strings are a tragedy. They seized defeat from the jaws of victory.
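For context, a sketch (mine, not chrismorgan's) of where such lone surrogates tend to come from in practice: CPython's surrogateescape error handler (PEP 383), which it uses to smuggle undecodable bytes from filenames and other OS data through str, produces exactly these unencodable strings:

  >>> b'\xff'.decode('utf-8', errors='surrogateescape')   # e.g. an undecodable byte in a filename
  '\udcff'
  >>> '\udcff'.encode('utf-8')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeEncodeError: 'utf-8' codec can't encode character '\udcff' in position 0: surrogates not allowed
  >>> '\udcff'.encode('utf-8', errors='surrogateescape')  # round-trips back to the original byte
  b'\xff'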
account42
Maybe we need another PEP that switches the default to WTF-8 [0] aka UTF-8 but let's ignore that a chunk of code points was reserved as surrogates and just encode them like any other code point.

[0] https://simonsapin.github.io/wtf-8/
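A rough sketch of that idea (my addition, not something account42 wrote): CPython's surrogatepass error handler already encodes lone surrogates as if they were ordinary code points, which is approximately the WTF-8 behaviour on a per-call basis:

  >>> '\udead'.encode('utf-8', errors='surrogatepass')
  b'\xed\xba\xad'
  >>> b'\xed\xba\xad'.decode('utf-8', errors='surrogatepass')
  '\udead'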

chrismorgan
My comment was completely unrelated to PEP 686. WTF-8 is emphatically not intended to be used as a file encoding.