Popular/hot comments

(peps.python.org)

Show context

Euphorbium ◴[26 Apr 24 14:53 UTC] No.40170122[source]▶

>>40168242 (OP) #

I thought it was default since python 3.

replies(2): >>40170798 #>>40173404 #

lucb1e ◴[26 Apr 24 15:39 UTC] No.40170798[source]▶

>>40170122 #

You may be thinking of strings where the u"" prefix was made obsolete in python3. Then again, trying on Python 2.7 just now, typing "éķů" results in it printing the UTF-8 bytes for those characters so I don't actually know what that u prefix ever did, but one of the big py2-to-3 changes was strings having an encoding and byte strings being for byte sequences without encodings

This change seems to be about things like open('filename', mode='r') mainly on Windows where the default encoding is not UTF-8 and so you'd have to specify open('filename', mode='r', encoding='UTF-8')

replies(2): >>40171035 #>>40171960 #

1. aktiur ◴[26 Apr 24 16:07 UTC] No.40171035[source]▶

>>40170798 #

> strings having an encoding and byte strings being for byte sequences without encodings

You got it kind of backwards. `str` are sequence of unicode codepoints (not UTF-8, which is a specific encoding for unicode codepoints), without reference to any encoding. `bytes` are arbitrary sequence of octets. If you have some `bytes` object that somehow stands for text, you need to know that it is text and what its encoding is to be able to interpret it correctly (by decoding it to `str`).

And, if you got a `str` and want to serialize it (for writing or transmitting), you need to choose an encoding, because different encodings will generate different `bytes`.

As an example :

>>> "évènement".encode("utf-8") b'\xc3\xa9v\xc3\xa8nement'

>>> "évènement".encode("latin-1") b'\xe9v\xe8nement'

replies(2): >>40171529 #>>40178954 #

2. lucb1e ◴[26 Apr 24 16:47 UTC] No.40171529[source]▶

>>40171035 (TP) #

> `str` are sequence of unicode codepoints [...] without reference to any encoding

I guess I see it from the programmer's perspective: to handle bytes coming from the disk/network as a string, I need to specify an encoding, so they are (to me) byte sequences with an encoding assigned. Didn't realize strings don't have an encoding in Python's internal string handling but are, instead, something like an array of integers pointing to unicode code points. Not sure if this viewpoint means I am getting it backwards but I can see how that was phrased poorly on my part!

replies(1): >>40172187 #

3. tialaramex ◴[26 Apr 24 17:47 UTC] No.40172187[source]▶

>>40171529 #

There are two distinct questions here, to which implementations can provide different answers

1. Interface: How can I interact with "string" values, what kind of operations can I perform versus what can't be done ? Methods and Operators provided go here.

2. Representation: What is actually stored (in memory) ? Layout goes here.

So you may have understood (1) for Python, but you were badly off on (2). Now, at some level this doesn't matter, but, for performance obviously the choice of what you should do will depend on (2). Most obviously, if the language represents strings as UTF-8 bytes, then "encoding" a string as UTF-8 will be extremely cheap. Whereas, if the language represents them as UTF-16 code units, the UTF-8 encoding operation will be a little slower.

replies(1): >>40172653 #

4. lucb1e ◴[26 Apr 24 18:34 UTC] No.40172653{3}[source]▶

>>40172187 #

Alright, but don't leave us hanging: what does Python3 use for (2) that you say I was badly off on? (Or, in actuality, never thought about or meant to make claims about.) Now we still can't make good choices for performance!

https://stackoverflow.com/questions/1838170/what-is-internal... says Python3.3 picks either a one-, two-, or four-byte representation depending on which is the smallest one that can represent all characters in a string. If you have one character in the string that requires >2 bytes to represent, it'll make every character take 4 bytes in memory such that you can have O(1) lookups on arbitrary offsets. The more you know :)

replies(3): >>40172880 #>>40174994 #>>40182518 #

5. tialaramex ◴[26 Apr 24 18:52 UTC] No.40172880{4}[source]▶

>>40172653 #

Yeah, I started writing about what you found (the answer to (2) for Python) and I realised that's a huge rabbit hole I was venturing down and decided to stop short and post, so, apologies I guess.

6. samus ◴[26 Apr 24 22:27 UTC] No.40174994{4}[source]▶

>>40172653 #

Since Java 9, the Java JRE does something similar: if a string contains only characters in ISO-8859-1 then it is stored as such, else the usual storage format (int16) is used.

7. chrismorgan ◴[27 Apr 24 10:46 UTC] No.40178954[source]▶

>>40171035 (TP) #

> `str` are sequence of unicode codepoints (not UTF-8, which is a specific encoding for unicode codepoints)

It’s worse than that, actually: UTF-8 is a specific encoding for sequences of Unicode scalar values (which means: code points minus the surrogate range U+D800–U+DFFF). Since str is a sequence of Unicode code points, this means you can make strings that cannot be encoded in any standard encoding:

  >>> '\udead'.encode('utf-16')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeEncodeError: 'utf-16' codec can't encode character '\udead' in position 0: surrogates not allowed
  >>> '\ud83d\ude41'.encode('utf-8')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed

Python 3’s strings are a tragedy. They seized defeat from the jaws of victory.

replies(1): >>40196637 #

8. aktiur ◴[27 Apr 24 19:05 UTC] No.40182518{4}[source]▶

>>40172653 #

Pre-python 3.2, the format used for representing `str` objects in memory depended on if you used a "narrow" (UTF-16) or "wide" (UTF-32) build of Python.

Fortunately, wide and narrow builds were abandonned in Python 3.2, with a new way of representing strings : current Python will use ASCII if there's no non-ASCII char, UCS-2 –UTF-16 without surrogate pairs — if there is no codepoint higher than U+FFFF, and UTF-32 else.

See this article for a good overview of the history of strings in Python : https://tenthousandmeters.com/blog/python-behind-the-scenes-...

9. account42 ◴[29 Apr 24 10:33 UTC] No.40196637[source]▶

>>40178954 #

Maybe we need another PEP that switches the default to WTF-8 [0] aka UTF-8 but let's ignore that a chunk of code points was reserved as surrogates and just encode them like any other code point.

[0] https://simonsapin.github.io/wtf-8/

replies(1): >>40202858 #

10. chrismorgan ◴[29 Apr 24 19:25 UTC] No.40202858{3}[source]▶

>>40196637 #

My comment was completely unrelated to PEP 686. WTF-8 is emphatically not intended to be used as a file encoding.

↑

PEP 686 – Make UTF-8 mode default