
348 points dgl | 4 comments
10000truths ◴[] No.44502931[source]
This is a big problem with using ad-hoc DSLs for config - there's often no formal specification for the grammar, and so the source of truth for parsing is spread between the home-grown serialization implementation and the home-grown deserialization implementation. If they get out of sync (e.g. someone adds new grammar to the parser but forgets to update the writer), you end up with a parser differential, and tick goes the time bomb. The lesson: have one source of truth, and generate everything that relies on it from that.
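A minimal sketch of the "one source of truth" idea, assuming a made-up escaping scheme for config values. The `ESCAPES` table and the `write_value` / `parse_value` names are hypothetical and not taken from any real project; the point is only that the parser's table is derived from the writer's, so the two cannot drift apart.

```python
# Hypothetical escape table: the only place the escaping rules are defined.
ESCAPES = {"\\": "\\\\", "\n": "\\n", "\r": "\\r", "=": "\\="}

# The reverse mapping is derived from ESCAPES, never written by hand.
UNESCAPES = {escaped[1]: raw for raw, escaped in ESCAPES.items()}


def write_value(text):
    """Serialize a value, escaping every character listed in ESCAPES."""
    return "".join(ESCAPES.get(ch, ch) for ch in text)


def parse_value(raw):
    """Parse a serialized value, rejecting escapes the writer never emits."""
    out = []
    chars = iter(raw)
    for ch in chars:
        if ch != "\\":
            out.append(ch)
            continue
        code = next(chars, None)  # None means a dangling backslash
        if code not in UNESCAPES:
            raise ValueError(f"unknown escape: {code!r}")
        out.append(UNESCAPES[code])
    return "".join(out)


sample = "a=b\r\nc\\"
assert parse_value(write_value(sample)) == sample
```

Adding a new escape means editing `ESCAPES` only; both directions pick it up, and a round-trip check like the final assertion can run in CI.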
replies(3): >>44503902 #>>44504346 #>>44507893 #
ajross ◴[] No.44503902[source]
Nitpick: the DSL here ("ini file format") is arguably ad-hoc, but it's extremely common and well-understood, and simple enough to make a common law specification work well enough in practice. The bug here wasn't due to the format. What you're actually complaining about is the hand-coded parser[1] sitting in the middle of a C program like a bomb waiting to go off. And, yes, that nonsense should have died decades ago.

There are places for clever hand code, even in C, even in the modern world. Data interchange is very much not one of them. Just don't do this. If you want .ini, use TOML. Use JSON if you don't. Even YAML is OK. Those with a penchant for pain like XML. And if you have convinced yourself your format must be binary (you're wrong, it doesn't), protobufs are there for you.

But absolutely, positively, never write a parser unless your job title is "programming language author". Use a library for this, even if you don't use libraries for anything else.

[1] Fine fine, lexer. We are nitpicking, after all.

replies(4): >>44504124 #>>44505455 #>>44506841 #>>44508225 #
heisenbit ◴[] No.44504124[source]
How many hand-crafted lexers dealing with LF vs. CR-LF encodings exist? My guess is n > ( number of people who coded > 10 KSLOC ).
replies(1): >>44506869 #
hnlmorg ◴[] No.44506869[source]
I’ve written a fair few lexers in my time. My general approach for CR is to simply ignore the character entirely.

If CR is used correctly on Windows, then its behaviour is already covered by the LF case (as required for POSIX systems), and if CR is used incorrectly then you end up with all kinds of weird edge cases. So you’re much better off just jumping over that character entirely.
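A rough sketch of the "skip CR entirely" approach described above, assuming a character-at-a-time line splitter. The function name `split_lines` is hypothetical and this is not any particular lexer's implementation.

```python
def split_lines(text):
    """Split text on LF, dropping every CR wherever it appears."""
    lines = []
    current = []
    for ch in text:
        if ch == "\r":
            continue  # CR is never significant in this sketch
        if ch == "\n":
            lines.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:  # final line without a trailing LF
        lines.append("".join(current))
    return lines


# CR-LF and LF input produce the same lines.
assert split_lines("a=1\r\nb=2\n") == split_lines("a=1\nb=2\n")
```

One consequence worth noting: a stray CR in the middle of a line is silently removed rather than treated as a line break, so any other program reading the same file needs to make the same choice to avoid disagreeing about where lines end.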

replies(3): >>44507067 #>>44507932 #>>44508402 #
layer8 ◴[] No.44508402[source]
More generally, any textual file format where whitespace is significant at the end of a line is calling for trouble.
replies(1): >>44508631 #
1. hnlmorg ◴[] No.44508631[source]
Maybe. But expecting people to remember a ; (or similar) at the end of lines is going to cause more frequent problems from a UX perspective.

So you’re better off accepting the edge-case problems that whitespace introduces, considering the benefits outweigh the pain.

replies(1): >>44508764 #
2. layer8 ◴[] No.44508764[source]
That’s not what I meant. It’s okay for the line break itself to be significant. But whitespace immediately preceding the line break shouldn’t be significant, due to its general invisibility.
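One way to make trailing whitespace insignificant, sketched under the assumption of a simple `key=value` line format. The `parse_line` helper is hypothetical; it strips spaces, tabs, and CR immediately before the line break so that invisible characters cannot change the parsed value.

```python
def parse_line(line):
    """Parse 'key=value', ignoring invisible whitespace at the end of the line."""
    # Drop the line break first, then any space, tab, or CR that preceded it.
    stripped = line.rstrip("\n").rstrip(" \t\r")
    key, sep, value = stripped.partition("=")
    if not sep:
        raise ValueError(f"missing '=' in line: {line!r}")
    return key, value


# All three spellings of the same line parse identically.
assert parse_line("path=/tmp\n") == parse_line("path=/tmp \t\r\n") == ("path", "/tmp")
```

Whitespace inside the value is left alone here; only the run directly before the line break is discarded.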
replies(1): >>44508819 #
3. hnlmorg ◴[] No.44508819[source]
Is CR considered whitespace? I always thought that was classed as a non-printable control character. But maybe I’m wrong?

Or are you talking about SP preceding CR and/or LF?

replies(1): >>44509582 #
4. layer8 ◴[] No.44509582{3}[source]
Line breaks are considered whitespace, hence CR is considered whitespace. It is also a control character. This is similar to TAB, or indeed LF.

See here for example: https://en.cppreference.com/w/c/string/byte/isspace

Or here for Unicode: https://en.wikipedia.org/wiki/Whitespace_character#Unicode
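The same point can be checked in Python, whose `str.isspace` and `unicodedata.category` are used here as stand-ins for the C `isspace` function linked above (an assumption of this sketch, since the two are defined differently). The `describe` helper is hypothetical.

```python
import string
import unicodedata


def describe(ch):
    """Report whether a character is whitespace and whether it is a control character."""
    return {
        "is_whitespace": ch.isspace(),
        "is_control": unicodedata.category(ch) == "Cc",
    }


# CR, LF, and TAB are both whitespace and control characters.
for ch in "\r\n\t":
    assert describe(ch) == {"is_whitespace": True, "is_control": True}

# A plain space is whitespace but not a control character.
assert describe(" ") == {"is_whitespace": True, "is_control": False}

# Python's ASCII whitespace constant lists CR alongside space, TAB, and LF.
assert "\r" in string.whitespace
```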