PEP 686 – Make UTF-8 mode default

(peps.python.org)

238 points GalaxySnail | 1 comments | 26 Apr 24 11:55 UTC | HN request time: 0s | source

Show context

Euphorbium ◴[26 Apr 24 14:53 UTC] No.40170122[source]▶

>>40168242 (OP) #

I thought it was default since python 3.

replies(2): >>40170798 #>>40173404 #

lucb1e ◴[26 Apr 24 15:39 UTC] No.40170798[source]▶

>>40170122 #

You may be thinking of strings where the u"" prefix was made obsolete in python3. Then again, trying on Python 2.7 just now, typing "éķů" results in it printing the UTF-8 bytes for those characters so I don't actually know what that u prefix ever did, but one of the big py2-to-3 changes was strings having an encoding and byte strings being for byte sequences without encodings

This change seems to be about things like open('filename', mode='r') mainly on Windows where the default encoding is not UTF-8 and so you'd have to specify open('filename', mode='r', encoding='UTF-8')

replies(2): >>40171035 #>>40171960 #

aktiur ◴[26 Apr 24 16:07 UTC] No.40171035[source]▶

>>40170798 #

> strings having an encoding and byte strings being for byte sequences without encodings

You got it kind of backwards. `str` are sequence of unicode codepoints (not UTF-8, which is a specific encoding for unicode codepoints), without reference to any encoding. `bytes` are arbitrary sequence of octets. If you have some `bytes` object that somehow stands for text, you need to know that it is text and what its encoding is to be able to interpret it correctly (by decoding it to `str`).

And, if you got a `str` and want to serialize it (for writing or transmitting), you need to choose an encoding, because different encodings will generate different `bytes`.

As an example :

>>> "évènement".encode("utf-8") b'\xc3\xa9v\xc3\xa8nement'

>>> "évènement".encode("latin-1") b'\xe9v\xe8nement'

replies(2): >>40171529 #>>40178954 #

lucb1e ◴[26 Apr 24 16:47 UTC] No.40171529[source]▶

>>40171035 #

> `str` are sequence of unicode codepoints [...] without reference to any encoding

I guess I see it from the programmer's perspective: to handle bytes coming from the disk/network as a string, I need to specify an encoding, so they are (to me) byte sequences with an encoding assigned. Didn't realize strings don't have an encoding in Python's internal string handling but are, instead, something like an array of integers pointing to unicode code points. Not sure if this viewpoint means I am getting it backwards but I can see how that was phrased poorly on my part!

replies(1): >>40172187 #

tialaramex ◴[26 Apr 24 17:47 UTC] No.40172187[source]▶

>>40171529 #

There are two distinct questions here, to which implementations can provide different answers

1. Interface: How can I interact with "string" values, what kind of operations can I perform versus what can't be done ? Methods and Operators provided go here.

2. Representation: What is actually stored (in memory) ? Layout goes here.

So you may have understood (1) for Python, but you were badly off on (2). Now, at some level this doesn't matter, but, for performance obviously the choice of what you should do will depend on (2). Most obviously, if the language represents strings as UTF-8 bytes, then "encoding" a string as UTF-8 will be extremely cheap. Whereas, if the language represents them as UTF-16 code units, the UTF-8 encoding operation will be a little slower.

replies(1): >>40172653 #

lucb1e ◴[26 Apr 24 18:34 UTC] No.40172653[source]▶

>>40172187 #

Alright, but don't leave us hanging: what does Python3 use for (2) that you say I was badly off on? (Or, in actuality, never thought about or meant to make claims about.) Now we still can't make good choices for performance!

https://stackoverflow.com/questions/1838170/what-is-internal... says Python3.3 picks either a one-, two-, or four-byte representation depending on which is the smallest one that can represent all characters in a string. If you have one character in the string that requires >2 bytes to represent, it'll make every character take 4 bytes in memory such that you can have O(1) lookups on arbitrary offsets. The more you know :)

replies(3): >>40172880 #>>40174994 #>>40182518 #

1. aktiur ◴[27 Apr 24 19:05 UTC] No.40182518[source]▶

>>40172653 #

Pre-python 3.2, the format used for representing `str` objects in memory depended on if you used a "narrow" (UTF-16) or "wide" (UTF-32) build of Python.

Fortunately, wide and narrow builds were abandonned in Python 3.2, with a new way of representing strings : current Python will use ASCII if there's no non-ASCII char, UCS-2 –UTF-16 without surrogate pairs — if there is no codepoint higher than U+FFFF, and UTF-32 else.

See this article for a good overview of the history of strings in Python : https://tenthousandmeters.com/blog/python-behind-the-scenes-...

↑