
238 points GalaxySnail | 3 comments
Myrmornis No.40169014
Hm TIL, I thought that the string encoding argument to .decode() and .encode() was required, but now I see it defaults to "utf-8". Did that change at some point?
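
For instance, a quick check in any recent Python 3 REPL (a minimal sketch; the literal is just illustrative):

    # Both default to UTF-8 when no encoding argument is given.
    b = "héllo".encode()       # same as "héllo".encode("utf-8")
    print(b)                   # b'h\xc3\xa9llo'
    print(b.decode())          # same as b.decode("utf-8"), gives 'héllo'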
replies(2): >>40169160 #>>40170907 #
_ache_ No.40169160
You can verify in the documentation by switching the version.

So ... since 3.2: https://docs.python.org/3.2/library/stdtypes.html#bytes.deco... In 3.1 it was "the default string encoding" (the type str, I guess): https://docs.python.org/3.1/library/stdtypes.html#bytes.deco...

replies(1): >>40171396 #
aktiur No.40171396
> In 3.1 it was the default encoding of string (the type str I guess).

No, what was used was whatever sys.getdefaultencoding() returned, and that was already UTF-8 in 3.1 (I checked the source code).
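
A minimal way to confirm this, then or now:

    import sys

    # The process-wide fallback used by str/bytes conversions when no
    # encoding argument is passed; it has been "utf-8" since early Python 3.
    print(sys.getdefaultencoding())  # utf-8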

At that time, the format used for representing `str` objects in memory depended on whether you used a "narrow" (UTF-16) or "wide" (UTF-32) build of Python.

Fortunately, wide and narrow builds were abandoned in Python 3.3 (PEP 393) in favor of a flexible representation: current Python stores a string as one byte per character if every code point fits in Latin-1 (at most U+00FF), as UCS-2 (UTF-16 without surrogate pairs) if no code point exceeds U+FFFF, and as UTF-32 otherwise. None of that existed in 3.1, where the only choice was between the "narrow" build (UTF-16) and the "wide" build (UTF-32).
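
A rough way to see the flexible representation at work (CPython-specific; the exact byte counts include per-object overhead and vary by version):

    import sys

    # Per-character storage grows with the widest code point present.
    for s in ("a" * 100,            # ASCII: 1 byte per char
              "\u00e9" * 100,       # Latin-1 range: still 1 byte per char
              "\u20ac" * 100,       # BMP above U+00FF: 2 bytes per char
              "\U0001f600" * 100):  # beyond the BMP: 4 bytes per char
        print(len(s), sys.getsizeof(s))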

See this article for a good overview of the history of strings in Python: https://tenthousandmeters.com/blog/python-behind-the-scenes-...

replies(1): >>40171671 #
_ache_ No.40171671
Thank you! The documentation was misleading about the "default encoding of string".
replies(1): >>40174638 #
int_19h No.40174638
The simple thing to remember is that for all versions of Python going back 12 years, there's no such thing as "default encoding of string". A Python string is defined as a sequence of 32-bit Unicode codepoints, and that is how Python code perceives it in all respects. How it is stored internally is an implementation detail that does not affect you.
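
One way to see this (sketched as a CPython 3.3+ session): indexing and iteration always yield whole code points, whatever the internal storage happens to be.

    s = "a\U0001F600b"                   # includes a non-BMP emoji
    print(len(s))                        # 3: one element per code point
    print([hex(ord(c)) for c in s])      # ['0x61', '0x1f600', '0x62']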
replies(1): >>40175904 #
Dylan16807 No.40175904
32 bit specifically?

At its most expansive, Unicode was 31 bits, and UTF-8 in its original design is likewise capable of encoding at most 31 bits.

replies(1): >>40178795 #
int_19h No.40178795
You're right, the docs just say "Unicode codepoints", and standard facilities like "\U..." or chr() will refuse anything above U+10FFFF. However I'm not sure that still holds true when third-party native modules are in the picture.
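
For example (assuming CPython; the exact error text may differ across versions):

    print(repr(chr(0x10FFFF)))   # '\U0010ffff', the highest code point chr() accepts
    try:
        chr(0x110000)            # one past the Unicode range
    except ValueError as err:
        print(err)               # e.g. "chr() arg not in range(0x110000)"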