←back to thread

238 points GalaxySnail | 1 comments | | HN request time: 0.204s | source
Show context
Animats ◴[] No.40174352[source]
Is the internal encoding in CPython UTF-8 yet?

You can index through Python strings with a subscript, but random access is rare enough that it's probably worthwhile to lazily index a string when needed. If you just need to advance or back up by 1, you don't need an index. So an internal representation of UTF-8 is quite possible.

replies(1): >>40175141 #
1. rogerbinns ◴[] No.40175141[source]
The PyUnicode object is what represents a str. If the UTF-8 bytes are ever requested, then a bytes object is created on demand and cached as part of the PyUnicode, being freed with the PyUnicode itself is freed.

Separately from that the codepoints making up the string are stored in a straight forward array allowing random access. The size of each codepoint can be 1, 2, or 4 bytes. When you create a PyUnicode you have to specify the maximum codepoint value which is rounded up to 127, 255, 65535, or 1,114,111. That determines if 1, 2, or 4 bytes is used.

If the maxiumum codepoint value is 127 then that array representation can be used for the UTF-8 directly. So the answer to your question is that many strings are stored as UTF-8 because all the codepoints are <= 127.

Separately from that, advancing through strings should not be done by codepoints anyway. A user perceived character (aka grapheme cluster) is made up of one or more codepoints. For example an e with an accent could be the e codepoint followed by a combining accent codepoint. The phoenix emoji is really the bird emoji, a zero width joiner, and then fire emoji. Some writing systems used by hundreds of millions of people are similar to having consonants, with combining marks to represent vowels.

This - - is 5 codepoints. There is a good blog post diving into it and how various languages report its "length". https://hsivonen.fi/string-length/

Source: I've just finished implementing Unicode TR29 which covers this for a Python C extension.