Oh, I missed Java moving from UTF-16 to UTF-8.
So ... since 3.2: https://docs.python.org/3.2/library/stdtypes.html#bytes.deco... In 3.1 it was the default encoding of string (the type str I guess). https://docs.python.org/3.1/library/stdtypes.html#bytes.deco...
"forget" or possibly simply aren't made well enough aware? I genuinely thought that python would only use UTF-8 for everything unless you explicitly ask it to do otherwise.
These days it's less of an issue, but I would simply never rely on the OS to get this right. Most uses of encodings other than UTF-8 are extremely likely to be unintentional at this point. And if it is intentional, you should be very explicit about it and not rely on weird indirect configuration through the OS that may or may not line up.
So, good change. Anything that breaks over this is probably better off with the simple fix added. And it's not worth leaving everything else as broken as it is with content corruption bugs just waiting to happen.
* The String class originally only used UTF-16 encoding, but since Java 9 it also uses a single-byte-per-character latin-1 encoding when possible.
I also appreciate that they did not attempt to tackle filesystem encoding here, which is a separate issue that drives me nuts, but separately.
This change seems to be about things like open('filename', mode='r') mainly on Windows where the default encoding is not UTF-8 and so you'd have to specify open('filename', mode='r', encoding='UTF-8')
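For example (a minimal sketch; 'data.txt' is just a hypothetical file):

    # On Windows, or under any non-UTF-8 locale, this may silently decode
    # the file as e.g. cp1252 and produce mojibake:
    with open('data.txt', mode='r') as f:
        text = f.read()

    # Being explicit is portable regardless of the OS default:
    with open('data.txt', mode='r', encoding='utf-8') as f:
        text = f.read()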
This is not a reliable way to look up information. It doesn't know when it's wrong.
You got it kind of backwards. `str` objects are sequences of Unicode codepoints (not UTF-8, which is a specific encoding of Unicode codepoints), without reference to any encoding. `bytes` objects are arbitrary sequences of octets. If you have some `bytes` object that somehow stands for text, you need to know that it is text and what its encoding is to be able to interpret it correctly (by decoding it to `str`).
And, if you got a `str` and want to serialize it (for writing or transmitting), you need to choose an encoding, because different encodings will generate different `bytes`.
As an example:

    >>> "évènement".encode("utf-8")
    b'\xc3\xa9v\xc3\xa8nement'
    >>> "évènement".encode("latin-1")
    b'\xe9v\xe8nement'
For example, on Unix/Linux, using iso-8859-1 was common for Western-European languages, and in Europe it became common to switch to iso-8859-15 after the Euro was introduced, because it contained the € symbol. UTF-8 only began to work flawlessly in the later aughts. Debian switched to it as the default with the Etch release in 2007.
No, what was used was sys.getdefaultencoding(), which was already UTF-8 in 3.1 (I checked the source code).
At that time, the format used for representing `str` objects in memory depended on whether you used a "narrow" (UTF-16) or "wide" (UTF-32) build of Python.
Fortunately, wide and narrow builds were abandoned in Python 3.2, with a new way of representing strings: current Python uses one byte per character when every codepoint fits in Latin-1 (with a compact form for pure ASCII), two bytes (UCS-2, i.e. UTF-16 without surrogate pairs) when no codepoint is higher than U+FFFF, and four bytes (UTF-32) otherwise. But that did not exist in 3.1, where you could either use the "narrow" build of Python (which used UTF-16) or the "wide" build (which used UTF-32).
See this article for a good overview of the history of strings in Python : https://tenthousandmeters.com/blog/python-behind-the-scenes-...
`bytes.decode` (and `str.encode`) have used UTF-8 as a default since at least Python 3.
However, the default encoding used for decoding the names of files is `sys.getfilesystemencoding()`, which is also UTF-8 on Windows and macOS, but will vary with the locale on Linux (specifically with CODESET).
Finally, `open` will directly use `locale.getencoding()`.
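A quick way to check which defaults apply on a given system (a sketch; `locale.getencoding()` only exists on 3.11+, so the older `locale.getpreferredencoding(False)` is used here):

    import locale
    import sys

    print(sys.getdefaultencoding())            # default for str.encode()/bytes.decode(): 'utf-8'
    print(sys.getfilesystemencoding())         # default for file *names*
    print(locale.getpreferredencoding(False))  # default for open()'s file *contents*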
I guess I see it from the programmer's perspective: to handle bytes coming from the disk/network as a string, I need to specify an encoding, so they are (to me) byte sequences with an encoding assigned. Didn't realize strings don't have an encoding in Python's internal string handling but are, instead, something like an array of integers pointing to unicode code points. Not sure if this viewpoint means I am getting it backwards but I can see how that was phrased poorly on my part!
If you typed in "éķů" in Python 2.7, what you got was a string consisting of the bytes 0xC3 0xA9 0xC4 0xB7 0xC5 0xAF, which, if you printed it out and it was displayed as UTF-8 (the default of most terminals), would appear to be éķů. But "éķů"[1] would return a byte string of \xa9, which isn't valid UTF-8 and would likely display as garbage.
If you instead had used u"éķů", you'd instead get a string of three Unicode code points, U+00E9 U+0137 U+016F. And u"éķů"[1] would return u"ķ", which is a valid Unicode character.
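In Python 3 the same distinction shows up when indexing str versus bytes (small sketch):

    >>> s = "éķů"
    >>> s[1]                  # str indexing works on code points
    'ķ'
    >>> s.encode("utf-8")[1]  # bytes indexing gives raw byte values
    169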
Historically it made sense to be locale-dependent, but even then it was annoying to be platform-dependent.
One is not a subset of the other.
IIRC, the main way this brittleness bit me was that every time a buffer containing a non-ASCII character was saved, Emacs would engage me in a conversation (which I found tedious and distracting) about what coding system I would like to use to save the file, and I never found a sane way to configure it to avoid such conversations even after spending hours learning about how Emacs does coding systems: I simply had to wait (a year or 3) for a new version of Emacs in which the code for saving buffers worked better.
I think some people like engaging in these conversations with their computers even though the conversations are very boring and repetitive and that such conversation-likers are numerous among Emacs users or at least Emacs maintainers.
It's also true that essentially nothing on Linux supports the UTF-8 byte order mark. Yes, it's meaningless for UTF-8, but it is explicitly allowed in the specifications. Since Microsoft tends to always include a BOM in any flavor of Unicode, this means Linux often chokes on valid UTF-8 text files from Windows systems.
1. Interface: How can I interact with "string" values, what kind of operations can I perform versus what can't be done? Methods and operators provided go here.
2. Representation: What is actually stored (in memory)? Layout goes here.
So you may have understood (1) for Python, but you were badly off on (2). Now, at some level this doesn't matter, but, for performance obviously the choice of what you should do will depend on (2). Most obviously, if the language represents strings as UTF-8 bytes, then "encoding" a string as UTF-8 will be extremely cheap. Whereas, if the language represents them as UTF-16 code units, the UTF-8 encoding operation will be a little slower.
Just one of those stupid little things you have to remember from time to time. Although why newly written software requires a specific line terminator is a valid question.
It's too bad; with a bit more planning and an earlier realization that Unicode cannot in fact fit into 16 bits, Windows might have used UTF-8 internally.
You need to use the special "utf-8-sig" encoding for that, which is not prominently advertised anywhere in the documentation (but it is stated deep inside the "Unicode HOWTO").
I never understood why ignoring this special character requires a totally separate encoding.
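For reference, a minimal sketch of the difference in behaviour:

    >>> data = b"\xef\xbb\xbfhello"   # UTF-8 BOM followed by "hello"
    >>> data.decode("utf-8")
    '\ufeffhello'
    >>> data.decode("utf-8-sig")
    'hello'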
Most obviously shebang (the practice of writing #!/path/to/interpreter at the start of a script) is specifically defined on those first two bytes. It doesn't make any sense to have a BOM here because that's not the format, and inventing a new rule later which says you can do it doesn't make that true, any more than in 2024 the German government can decide Germany didn't invade Poland in 1939; that's not how Time's Arrow works.
https://stackoverflow.com/questions/1838170/what-is-internal... says Python3.3 picks either a one-, two-, or four-byte representation depending on which is the smallest one that can represent all characters in a string. If you have one character in the string that requires >2 bytes to represent, it'll make every character take 4 bytes in memory such that you can have O(1) lookups on arbitrary offsets. The more you know :)
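You can see the 1/2/4-byte choice indirectly with sys.getsizeof (a rough sketch; exact numbers include a fixed object overhead and vary by CPython version):

    import sys

    print(sys.getsizeof("a" * 1000))    # ~1 byte per char (all codepoints <= U+00FF)
    print(sys.getsizeof("Ω" * 1000))    # ~2 bytes per char (all codepoints <= U+FFFF)
    print(sys.getsizeof("😀" * 1000))   # ~4 bytes per char (codepoints above U+FFFF)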
    # string literals in the source file are decoded as utf-8 by default
    from pathlib import Path
    Path("filenames use their own encoding").write_text("file content encoding uses yet another encoding")
The corresponding encodings are:
- utf-8 [tokenize.open]
- sys.getfilesystemencoding() [os.fsencode]
- locale.getpreferredencoding() [open]
Because the BOM is indistinguishable from the "real" UTF-8 encoding of U+FEFF (zero-width no-break space). Trimming that codepoint in the UTF-8 decoder means that some strings like "\uFEFF" can't be safely round-tripped; adding it in the encoder is invalid in many contexts.
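A quick sketch of that indistinguishability:

    >>> "\ufeff".encode("utf-8")   # same bytes as a UTF-8 BOM
    b'\xef\xbb\xbf'
    >>> "x".encode("utf-8-sig")    # the -sig encoder prepends one
    b'\xef\xbb\xbfx'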
Windows badly dropped the ball here by not providing a simple opt-in way to make all the Ansi functions (TextOutA, etc) use the UTF-8 code page, until many many years later with the manifest file. This should have been a feature introduced in NT4 or Windows 98, not something that's put off until midway through Windows 10's development cycle.
Python 2 was charset agnostic, so it always worked, but the "improvement" in Python 3 was not purely an improvement. How do you tell a Python 3 script from a Python 2 script?
* If it contains the string "utf-8", it's Python 3.
* If it only works when your locale is C.UTF-8, it's Python 3.
Needless to say, I welcome this change. The way I understand it, it would "repair" Python 3.
You can index through Python strings with a subscript, but random access is rare enough that it's probably worthwhile to lazily index a string when needed. If you just need to advance or back up by 1, you don't need an index. So an internal representation of UTF-8 is quite possible.
> On Windows 10, this element forces a process to use UTF-8 as the process code page. For more information, see Use the UTF-8 code page. On Windows 10, the only valid value for activeCodePage is UTF-8.
> This element was first added in Windows 10 version 1903 (May 2019 Update). You can declare this property and target/run on earlier Windows builds, but you must handle legacy code page detection and conversion as usual. This element has no attributes.
However, on Win10+, apps themselves can explicitly opt into UTF-8 for all non-widechar Win32 APIs regardless of the current locale/codepage.
Of course, it goes without saying that this only works when the directive comes from all the way at the top. Otherwise there will be just too many conflicting incentives for any real change to happen.
While I am on this topic, I want to mention Apple. It is absolutely bonkers how they have done exactly this countless times, like changing their entire platform architecture! It could have been like opening a can of worms, but they knew what they were doing. Kudos to them.
Also... (sorry, this is becoming a long post) civil and industrial engineering firms routinely pull off projects like that. But the point I wanted to emphasize is that it's very uncommon in tech, which prides itself on having decentralized and semi-autonomous teams versus centralized and highly aligned teams.
UTF-16 text files are rather rare, as is using Notepad's UTF-16 options. The only semi-common use I know of is *.reg files saved from regedit. One issue with UTF-16 is that it has two different serializations (BE and LE), and hence generally requires a BOM to disambiguate.
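For example (Python sketch; the last result assumes a little-endian machine):

    >>> "hi".encode("utf-16-le")
    b'h\x00i\x00'
    >>> "hi".encode("utf-16-be")
    b'\x00h\x00i'
    >>> "hi".encode("utf-16")    # native order plus a BOM to disambiguate
    b'\xff\xfeh\x00i\x00'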
Separately from that, the codepoints making up the string are stored in a straightforward array allowing random access. The size of each codepoint can be 1, 2, or 4 bytes. When you create a PyUnicode you have to specify the maximum codepoint value, which is rounded up to 127, 255, 65535, or 1,114,111. That determines whether 1, 2, or 4 bytes is used.
If the maximum codepoint value is 127 then that array representation can be used as the UTF-8 directly. So the answer to your question is that many strings are stored as UTF-8 because all the codepoints are <= 127.
Separately from that, advancing through strings should not be done by codepoints anyway. A user-perceived character (aka grapheme cluster) is made up of one or more codepoints. For example, an e with an accent could be the e codepoint followed by a combining accent codepoint. The phoenix emoji is really the bird emoji, a zero width joiner, and then the fire emoji. Some writing systems used by hundreds of millions of people work similarly, having consonants with combining marks to represent the vowels.
This - 🤦🏼‍♂️ - is 5 codepoints. There is a good blog post diving into it and how various languages report its "length": https://hsivonen.fi/string-length/
Source: I've just finished implementing Unicode TR29 which covers this for a Python C extension.
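Roughly what that looks like from Python (a sketch; the grapheme count at the end relies on the third-party regex module, which is an assumption here):

    >>> s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"   # facepalm emoji from the linked post
    >>> len(s)                            # codepoints
    5
    >>> len(s.encode("utf-8"))            # UTF-8 bytes
    17
    >>> len(s.encode("utf-16-le")) // 2   # UTF-16 code units
    7
    >>> import regex                      # third-party; \X matches a grapheme cluster
    >>> len(regex.findall(r"\X", s))
    1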
Depends how wide your definition of "poor quality" is. All powershell files (ps1, psm1, psd1) are assumed to be in the local charset unless they have a byte order mark, in which case they're treated as whatever the BOM says.
The most expansive Unicode has ever been was 31 bits, and UTF-8 is also capable of at most 31 bits.
Other APIs, however, like File.WriteAllText, do not write BOM unless you explicitly pass encoding that does so (by returning non-empty preamble).
https://docs.python.org/3/library/codecs.html
The codec you're imagining would also make reading a file and writing it back change the file if it contains a BOM.
Apple has a walled garden approach to managing their ecosystem, and within the confines of their garden they just do what's necessary. AFAIK, Apple doesn't care about the possibility to run binaries from the '90s on a modern stack.
Edit: even though it's expensive, it's possible to conduct such ecosystem-wide changes if you hold all the cards in your hand. Microsoft was able to reengineer the graphical subsystem somewhere between XP and 8. Doing something like this is magnitudes more difficult on Linux (Wayland says hi). Google could maybe do it within their Android corner, but they generally give a sh*t about backwards compatibility.
For UTF-16/32, knowing the endianness doesn't seem to be a frivolous functionality. And in fact, having to use heuristics-based detection via uchardet is a big mess, some kind of header should have been standardized since the start.
Microsoft's implementation of fopen() also supports "ccs=..." to open Unicode text files in Unicode, and interestingly "ccs=UNICODE" will get you UTF-16LE, not UTF-8 (but you can do "ccs=UTF-8"). .NET also has this weird naming quirk where Encoding.Unicode is UTF-16, although there at least UTF-8 is the default for all text I/O classes like StreamReader if you don't specify the encoding. Still, many people didn't know better, and so some early .NET software would use UTF-16 for text I/O for no reason other than its developers believing that Encoding.Unicode is obviously what they are supposed to be using to "support Unicode", and so explicitly passing it everywhere.
It’s worse than that, actually: UTF-8 is a specific encoding for sequences of Unicode scalar values (which means: code points minus the surrogate range U+D800–U+DFFF). Since str is a sequence of Unicode code points, this means you can make strings that cannot be encoded in any standard encoding:
>>> '\udead'.encode('utf-16')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-16' codec can't encode character '\udead' in position 0: surrogates not allowed
>>> '\ud83d\ude41'.encode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
Python 3’s strings are a tragedy. They seized defeat from the jaws of victory.
Recently, I've gotten bit by UTF-16 (because somewhere along the line something on a Windows machine generated a file by piping it in PowerShell).
There are some coding standards for C# that mandate passing the maximum number of parameters to a function and never allow default parameter values to be used. Sometimes this is a big win (it prevents all that Current Culture nonsense when converting between numbers and strings; you need Invariant Culture almost all the time), and other times it introduces bugs (using the wrong value when creating message boxes puts them on the logon desktop instead of the user's screen).
Enforcing an overload of the highest arity of arguments sounds like a really terrible rule to have.
Culture sensitivity is strictly different from locale: it does not act like a C locale (which is unsound) but simply follows the delimiter/date/currency/etc. formats for parsing and formatting.
It is also in many places considered undesirable, as it introduces environment-dependent behavior where it is not expected; hence the analyzer will either suggest that you specify the invariant culture, or alternatively you can specify that for the whole project through the InvariantGlobalization property (to avoid CultureInfo.InvariantCulture spam). This is still orthogonal to text encoding, however.
This is an example of poor quality software:
> All powershell files (ps1, psm1, psd1) are assumed to be in the local charset unless they have a byte order mark, in which case they're treated as whatever the BOM says.
Powershell is not that old. Assuming local encoding is inexcusable here.
I don't think the walled garden makes much of a difference when it comes to compatibility on, say, macOS. They still have to carefully weigh the ecosystem-wide cost of deprecating old APIs against the ecosystem-wide long-term benefits. Yes the decision remains entirely their own, but a lot of stakeholders indirectly weigh on the decision.
GTK and Qt also make backwards-incompatible new versions as they evolve. The biggest difference here is that in theory someone could keep maintaining the old library code if they decided that updating their application code was always going to be harder. How rarely this actually happens gives weight to the argument that developers can accept occasional API overhauls in exchange for staying on the well-supported low-tech-debt path.
So walled or open made no difference here, even in the open platform, application developers are largely at the mercy of where development effort on libraries and frameworks is going. Nobody can afford to make their own exclusive frameworks to an acceptable standard, and if we want to get away from the technical debt of the 90s then the shared frameworks have to make breaking changes occasionally and strategically.
> AFAIK, Apple doesn't care about the possibilty to run binaries from the '90s on a modern stack.
Definitely, and I don't either. It's kind of a silver lining that Apple wasn't the enterprise heavy-hitter that Microsoft was at the time, because if it had been, its entire culture and landscape would be shaped by it like Microsoft's was. I think we have enough of that in the industry already.
When an old platform is that old, it's really hard to justify making it a seamless subset of the modern platform, and it makes more sense to talk about some form of virtualization. This is where even Windows falls down on both counts. How well modern Windows runs old software is far more variable than people assume until they try it. Anything with 8-bit colors may not work at all.
VirtualBox, qemu, etc. have increasingly poor support for DOS-based Windows (95, 98, ME) because not enough people care about that even in the context of virtualization. After trying every free virtualization option to run some 90s Windows software, I ended up finding that WINE was more compatible with that era than modern Windows is, without any of the jank of running a real Windows in qemu or VirtualBox.
So even with the OS most famous for backwards-compatibility and the enormous technical debt that carries, compatibility has been slowly sliding, even worse than open source projects with no direct lineage to the same platform and no commercial motives.
It's perfectly justifiable to reset technical debt here, whether walled or open. If people have enough need to run old software, there should be a market of solutions to that problem, yet it generally remains niche or hobbyist, and even the big commercial vendors overestimate how well they're doing it.
I said that if a BOM is present, then you explicitly know that multi-byte characters are possibly present. Therefore, if it's present you know that assuming that the Nth byte is the Nth code point is unsafe.
The opposite is irrelevant. There's never any way to safely determine a text file's encoding if there is no BOM present.
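In code, that's about all a BOM check can tell you (a sketch using the stdlib codecs constants; sniff_bom is just an illustrative name):

    import codecs

    _BOMS = [
        (codecs.BOM_UTF32_LE, "utf-32-le"),   # check 4-byte BOMs first: b'\xff\xfe' is a prefix of the UTF-32-LE BOM
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]

    def sniff_bom(path):
        with open(path, "rb") as f:
            head = f.read(4)
        for bom, name in _BOMS:
            if head.startswith(bom):
                return name
        return None   # no BOM: the encoding can't be determined from the bytes alone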