PEP 686 – Make UTF-8 mode default

(peps.python.org)

238 points GalaxySnail | 1 comments | 26 Apr 24 11:55 UTC

Euphorbium ◴[26 Apr 24 14:53 UTC] No.40170122[source]▶

>>40168242 (OP) #

I thought it had been the default since Python 3.

replies(2): >>40170798 #>>40173404 #

lucb1e ◴[26 Apr 24 15:39 UTC] No.40170798[source]▶

>>40170122 #

You may be thinking of strings, where the u"" prefix was made obsolete in Python 3. Then again, trying on Python 2.7 just now, typing "éķů" results in it printing the UTF-8 bytes for those characters, so I don't actually know what that u prefix ever did. But one of the big py2-to-3 changes was strings having an encoding and byte strings being for byte sequences without encodings.

This change seems to be mainly about things like open('filename', mode='r') on Windows, where the default encoding is not UTF-8, so you'd have to specify open('filename', mode='r', encoding='UTF-8').
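To make the open() point concrete, here is a minimal Python 3 sketch. The scratch file path and the variable names are made up for illustration; the only claim is that passing encoding= explicitly gives the same result on every platform, while omitting it falls back to a locale-dependent default (unless UTF-8 mode is on).

```python
import locale
import os
import sys
import tempfile

text = "éķů"

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# Explicit encoding: behaves the same regardless of platform or locale.
with open(path, mode="w", encoding="utf-8") as f:
    f.write(text)

with open(path, mode="r", encoding="utf-8") as f:
    round_tripped = f.read()

# What open() falls back to when encoding= is omitted.
# On many Windows setups this is a legacy code page rather than UTF-8.
default_encoding = locale.getpreferredencoding(False)

# True when the interpreter runs in UTF-8 mode (-X utf8 or PYTHONUTF8=1).
utf8_mode = bool(sys.flags.utf8_mode)

print(round_tripped == text)
print(default_encoding, utf8_mode)
```

Running the same script with python -X utf8 should report UTF-8 mode as enabled, which is the behaviour the PEP proposes to make the default.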

replies(2): >>40171035 #>>40171960 #

1. jcranmer ◴[26 Apr 24 17:27 UTC] No.40171960[source]▶

>>40170798 #

Python has two types of strings: byte strings (every element is in the range 0-255) and Unicode strings (every element is a Unicode code point). In Python 2.x, "" maps to a byte string and u"" maps to a Unicode string; in Python 3.x, "" maps to a Unicode string and b"" maps to a byte string.

If you typed in "éķů" in Python 2.7, what you get is a string consisting of the bytes 0xC3 0xA9 0xC4 0xB7 0xC5 0xAF, which, if you printed it out and displayed it as UTF-8 (the default of most terminals), would appear to be éķů. But "éķů"[1] would return a byte string of \xa9, which isn't valid UTF-8 on its own and would likely display as garbage.

If you instead had used u"éķů", you'd instead get a string of three Unicode code points, U+00E9 U+0137 U+016F. And u"éķů"[1] would return u"ķ", which is a valid Unicode character.
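The same distinction can be reproduced in Python 3, as a rough sketch: encoding the str to UTF-8 stands in for what the Python 2 "" literal held (assuming UTF-8 input), and the str itself plays the role of u"". The variable names are just for illustration.

```python
text = "éķů"                  # Python 3 str: a sequence of code points
data = text.encode("utf-8")   # bytes: stand-in for the Python 2 "" literal

code_points = [f"U+{ord(ch):04X}" for ch in text]

# Indexing the str yields a whole character.
second_char = text[1]

# Slicing the bytes yields a single byte. (In Python 3, data[1] would be
# the integer 169 rather than a one-byte string, hence the slice.)
second_byte = data[1:2]

# A lone continuation byte is not valid UTF-8 on its own.
try:
    second_byte.decode("utf-8")
    lone_byte_is_valid = True
except UnicodeDecodeError:
    lone_byte_is_valid = False

print(len(text), len(data))
print(code_points)
print(second_byte, lone_byte_is_valid)
```

The lengths differ (3 code points versus 6 bytes), which is why indexing position 1 lands on a full character in one case and in the middle of a character in the other.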
