
238 points GalaxySnail | 3 comments
Myrmornis No.40169014
Hm TIL, I thought that the string encoding argument to .decode() and .encode() was required, but now I see it defaults to "utf-8". Did that change at some point?
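
For instance, a quick check in any recent Python 3 REPL (a minimal sketch; the literal is just illustrative):

    # Both default to UTF-8 when no encoding argument is given.
    b = "héllo".encode()       # same as "héllo".encode("utf-8")
    print(b)                   # b'h\xc3\xa9llo'
    print(b.decode())          # same as b.decode("utf-8"), gives 'héllo'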
replies(2): >>40169160 #>>40170907 #
_ache_ No.40169160
You can verify in the documentation by switching the version.

So ... since 3.2: https://docs.python.org/3.2/library/stdtypes.html#bytes.deco... In 3.1 it was "the default string encoding" (the type str, I guess): https://docs.python.org/3.1/library/stdtypes.html#bytes.deco...

replies(1): >>40171396 #
aktiur No.40171396
> In 3.1 it was the default encoding of string (the type str I guess).

No, what was used was whatever sys.getdefaultencoding() returned, and that was already UTF-8 in 3.1 (I checked the source code).
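
A minimal way to confirm this, then or now:

    import sys

    # The process-wide fallback used by str/bytes conversions when no
    # encoding argument is passed; it has been "utf-8" since early Python 3.
    print(sys.getdefaultencoding())  # utf-8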

At that time, the format used for representing `str` objects in memory depended on whether you used a "narrow" (UTF-16) or "wide" (UTF-32) build of Python.

Fortunately, wide and narrow builds were abandoned in Python 3.3 (PEP 393) in favor of a flexible representation: current Python stores a string as one byte per character if every code point fits in Latin-1 (at most U+00FF), as UCS-2 (UTF-16 without surrogate pairs) if no code point exceeds U+FFFF, and as UTF-32 otherwise. None of that existed in 3.1, where the only choice was between the "narrow" build (UTF-16) and the "wide" build (UTF-32).
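
A rough way to see the flexible representation at work (CPython-specific; the exact byte counts include per-object overhead and vary by version):

    import sys

    # Per-character storage grows with the widest code point present.
    for s in ("a" * 100,            # ASCII: 1 byte per char
              "\u00e9" * 100,       # Latin-1 range: still 1 byte per char
              "\u20ac" * 100,       # BMP above U+00FF: 2 bytes per char
              "\U0001f600" * 100):  # beyond the BMP: 4 bytes per char
        print(len(s), sys.getsizeof(s))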

See this article for a good overview of the history of strings in Python: https://tenthousandmeters.com/blog/python-behind-the-scenes-...

replies(1): >>40171671 #
_ache_ No.40171671
Thank you! The documentation was misleading about the "default encoding of string".
replies(1): >>40174638 #
int_19h No.40174638
The simple thing to remember is that for all versions of Python going back 12 years, there's no such thing as "default encoding of string". A Python string is defined as a sequence of 32-bit Unicode codepoints, and that is how Python code perceives it in all respects. How it is stored internally is an implementation detail that does not affect you.
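
One way to see this (sketched as a CPython 3.3+ session): indexing and iteration always yield whole code points, whatever the internal storage happens to be.

    s = "a\U0001F600b"                   # includes a non-BMP emoji
    print(len(s))                        # 3: one element per code point
    print([hex(ord(c)) for c in s])      # ['0x61', '0x1f600', '0x62']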
replies(1): >>40175904 #
Dylan16807 No.40175904
32 bit specifically?

At its most expansive, Unicode was 31 bits, and UTF-8 in its original design is likewise capable of encoding at most 31 bits.

replies(1): >>40178795 #
int_19h No.40178795
You're right, the docs just say "Unicode codepoints", and standard facilities like "\U..." or chr() will refuse anything above U+10FFFF. However I'm not sure that still holds true when third-party native modules are in the picture.
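
For example (assuming CPython; the exact error text may differ across versions):

    print(repr(chr(0x10FFFF)))   # '\U0010ffff', the highest code point chr() accepts
    try:
        chr(0x110000)            # one past the Unicode range
    except ValueError as err:
        print(err)               # e.g. "chr() arg not in range(0x110000)"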