

238 points GalaxySnail | 16 comments
a-french-anon ◴[] No.40170353[source]
Why not utf-8-sig, though? It handles optional BOMs. Had to fix a script last week that choked on it.
replies(3): >>40170707 #>>40170832 #>>40171048 #
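The difference the parent comment relies on can be sketched in Python: the `utf-8-sig` codec strips a leading BOM if one is present and is a no-op otherwise, while plain `utf-8` keeps the BOM in the decoded text.

```python
# Sketch: "utf-8-sig" treats the BOM as optional; plain "utf-8" does not strip it.
data_with_bom = b"\xef\xbb\xbfhello"

print(repr(data_with_bom.decode("utf-8")))      # '\ufeffhello' — BOM survives
print(repr(data_with_bom.decode("utf-8-sig")))  # 'hello' — BOM stripped
print(repr(b"hello".decode("utf-8-sig")))       # 'hello' — BOM is optional
```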
1. shellac ◴[] No.40171048[source]
At this point nothing ought to be inserting BOMs in utf-8. It's not recommended, and I think choking on it is reasonable behaviour these days.
replies(3): >>40171192 #>>40173969 #>>40178398 #
2. Athas ◴[] No.40171192[source]
Why were BOMs ever allowed for UTF-8?
replies(5): >>40171419 #>>40171452 #>>40172241 #>>40175549 #>>40177110 #
3. plorkyeran ◴[] No.40171419[source]
When UTF-8 was still very much not the default encoding for text files, it was useful to have a way to signal that a file was UTF-8 and not the local system encoding.
4. josefx ◴[] No.40171452[source]
Some editors used them to help detect UTF-8 encoded files. Since a BOM decodes to a zero-width no-break space character, they also served as a nice easter egg for people who ended up editing their Linux shell scripts with a Windows text editor.
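The easter egg in question can be sketched as follows: a UTF-8 BOM decodes to U+FEFF, which is invisible in most editors but sits in front of the `#!`, so the kernel no longer recognizes the shebang.

```python
# Sketch: a BOM in front of a shebang line. The file no longer starts
# with the bytes "#!", so the kernel won't run it via the interpreter,
# and the offending character is invisible in most editors.
script = b"\xef\xbb\xbf#!/bin/sh\necho hi\n"

print(script.startswith(b"#!"))            # False — the BOM hides the shebang
print(repr(script.decode("utf-8")[0]))     # '\ufeff' — zero-width no-break space
```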
5. da_chicken ◴[] No.40172241[source]
Some algorithms can operate much more easily if they can assume that multi-byte or variable-width characters don't exist. The BOM means you don't have to scan the entire document to know whether you can make that assumption.
replies(1): >>40196745 #
6. Dwedit ◴[] No.40173969[source]
Basically every C# program will insert BOMs into text files by default unless you opt-out.
replies(1): >>40175867 #
7. ◴[] No.40175549[source]
8. neonsunset ◴[] No.40175867[source]
Where did you get that from?
replies(1): >>40176092 #
9. Arnavion ◴[] No.40176092{3}[source]
It's the behavior when using the default `Encoding.UTF8` static. You have to create your own instance as `new UTF8Encoding(false)` if you don't want a BOM.
replies(1): >>40176601 #
10. neonsunset ◴[] No.40176601{4}[source]
This is true for `UTF8Encoding` used as an encoder (e.g. within a transcoding stream, not often used today).

Other APIs, however, like File.WriteAllText, do not write a BOM unless you explicitly pass an encoding that does so (by returning a non-empty preamble).

replies(1): >>40183062 #
11. stubish ◴[] No.40177110[source]
An attempt to store the encoding needed to decode the data with the data, rather than requiring the reader to know it somehow. Your program wouldn't have to care if its source data had been encoded as UTF-8, UTF-16, UTF-32 or some future standard. The usual sort of compromise that comes out of committees, in this case where every committee member wanted to be able to spit their preferred in-memory Unicode string representation to disk with no encoding overhead.
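The scheme the comment describes — sniffing the encoding from a leading BOM — can be sketched as follows. The byte patterns are the standard Unicode signatures; note the match order matters, since the UTF-32-LE signature begins with the UTF-16-LE one.

```python
# Sketch of BOM-based encoding detection. Longest signatures are
# checked first because b"\xff\xfe\x00\x00" (UTF-32-LE) starts with
# b"\xff\xfe" (UTF-16-LE).
BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]

def sniff_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the encoding implied by a leading BOM, else a default."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return default

print(sniff_encoding("hi".encode("utf-8-sig")))  # utf-8
print(sniff_encoding(b"\xff\xfeh\x00i\x00"))     # utf-16-le
```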
12. BoingBoomTschak ◴[] No.40178398[source]
The only reason I used it was to force MSVC to understand my u8"" literals. Should've forced /utf-8 in our build system, in retrospect.

For UTF-16/32, knowing the endianness doesn't seem like frivolous functionality. And in fact, having to use heuristics-based detection via uchardet is a big mess; some kind of header should have been standardized from the start.

13. Dwedit ◴[] No.40183062{5}[source]
I actually did not know that File.WriteAllText/new StreamWriter defaulted to UTF-8 without BOM if no encoding was specified. I always passed in an encoding to those functions, and "Encoding.UTF8" has a BOM by default. Without specifying any encoding, I just assumed it would pick your system locale, because all the default String <-> Number conversion functions will indeed do that.

There are some coding standards for C# that mandate passing the maximum number of parameters to a function and never allow default parameter values to be used. Sometimes this is a big win (it prevents all that Current Culture nonsense when converting between numbers and strings — you need Invariant Culture almost all the time), and other times it introduces bugs (using the wrong value when creating message boxes puts them on the logon desktop instead of the user's screen).

replies(1): >>40183352 #
14. neonsunset ◴[] No.40183352{6}[source]
It's a different overload. Encoding is not an optional parameter: https://learn.microsoft.com/en-us/dotnet/api/system.io.file....

Enforcing the overload with the highest arity sounds like a really terrible rule to have.

Culture-sensitivity is strictly different from a locale, as it does not act like a C locale (which is unsound) but simply follows the delimiter/date/currency/etc. formats for parsing and formatting.

It is also in many places considered undesirable, as it introduces environment-dependent behavior where it is not expected; hence the analyzer will either suggest that you specify invariant culture, or you can specify it project-wide through the InvariantGlobalization property (to avoid CultureInfo.InvariantCulture spam). This is still orthogonal to text encoding, however.

15. account42 ◴[] No.40196745{3}[source]
No, a missing BOM does not guarantee that the file doesn't contain multi-byte code points and never has.
replies(1): >>40201400 #
16. da_chicken ◴[] No.40201400{4}[source]
That isn't what I said.

I said that if a BOM is present, then you explicitly know that multi-byte characters are possibly present. Therefore, if it's present you know that assuming that the Nth byte is the Nth code point is unsafe.

The opposite is irrelevant. There's never any way to safely determine a text file's encoding if there is no BOM present.
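The byte-index vs code-point-index mismatch described above can be sketched in a few lines: once a multi-byte sequence appears, the Nth byte is no longer the Nth code point.

```python
# Sketch: indexing bytes is not the same as indexing code points
# once multi-byte sequences appear.
text = "café"
data = text.encode("utf-8")

print(len(text))      # 4 code points
print(len(data))      # 5 bytes — 'é' encodes as two bytes
print(data[3:4])      # b'\xc3' — only the first byte of é's sequence
```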