
180 points by xnacly | 3 comments
thechao
I like to have my lexers operate on `FILE*`, rather than string-views. This has some real-world performance implications (not good ones), but it does mean I can operate on streams. If the user has a C string, it can easily be wrapped with `funopen()` or `fopencookie()` to provide a `FILE*` adapter layer. (Most of my lexers include one of these, out of the box.)
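
For reference, a minimal sketch of what that adapter can look like with glibc's `fopencookie()` — the `fopen_cstring` name and the cookie struct are illustrative, not from the comment, and the caller's string must outlive the returned `FILE *`:

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Cookie state: the string being read and the current offset. */
typedef struct {
    const char *buf;
    size_t len;
    size_t pos;
} strcookie;

static ssize_t str_read(void *c, char *out, size_t n) {
    strcookie *s = c;
    size_t left = s->len - s->pos;
    if (n > left) n = left;
    memcpy(out, s->buf + s->pos, n);
    s->pos += n;
    return (ssize_t)n;
}

static int str_close(void *c) {
    free(c);
    return 0;
}

/* Wrap a NUL-terminated C string in a read-only FILE*. */
FILE *fopen_cstring(const char *str) {
    strcookie *s = malloc(sizeof *s);
    if (!s) return NULL;
    s->buf = str;
    s->len = strlen(str);
    s->pos = 0;
    cookie_io_functions_t io = { .read = str_read, .close = str_close };
    return fopencookie(s, "r", io);
}
```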

Everything else I stole from Bob Nystrom: I keep a local copy of the token's string in the token, aka `char word[64]`. I try to minimize "decision making" during lexing. Really, at the consumption point we're only interested in an extremely small number of things: (1) does the lexeme start with a letter or a number? (2) is it whitespace, and is that whitespace a newline? or (3) does it look like an operator?
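
A rough sketch of that shape — the `Token`/`next_token` names and the exact character classes are illustrative, not from the comment:

```c
#include <ctype.h>
#include <stdio.h>

typedef enum { TK_NAME, TK_NUMBER, TK_NEWLINE, TK_OP, TK_EOF } TokKind;

typedef struct {
    TokKind kind;
    char    word[64];   /* local copy of the lexeme, as described above */
    int     len;
} Token;

static Token next_token(FILE *in) {
    Token t = { .kind = TK_EOF, .len = 0 };
    int c = fgetc(in);

    /* skip non-newline whitespace */
    while (c == ' ' || c == '\t' || c == '\r') c = fgetc(in);

    if (c == EOF)  return t;
    if (c == '\n') { t.kind = TK_NEWLINE; return t; }

    if (isalpha(c) || c == '_' || isdigit(c)) {
        /* (1) starts with a letter or a number: accumulate the word */
        t.kind = isdigit(c) ? TK_NUMBER : TK_NAME;
        while (c != EOF && (isalnum(c) || c == '_')) {
            if (t.len < 63) t.word[t.len++] = (char)c;
            c = fgetc(in);
        }
        if (c != EOF) ungetc(c, in);
    } else {
        /* (3) anything else: treat it as a (single-char) operator here */
        t.kind = TK_OP;
        t.word[t.len++] = (char)c;
    }
    t.word[t.len] = '\0';
    return t;
}
```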

The only place where I've ever considered goto-threading was in keyword identification. However, if your language keeps keywords to ≤ 8 bytes, you can just bake the keywords into `uint64_t`'s and compare against those values. You can do a crapload of 64b compares/ns.
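
Something like the following, as a sketch — the keyword list is made up, and a real table would precompute the packed constants instead of re-packing on every call:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Pack up to 8 bytes of a lexeme into a zero-padded uint64_t. */
static uint64_t pack8(const char *s, size_t len) {
    uint64_t v = 0;
    if (len > 8) return UINT64_MAX;   /* too long to be a keyword here */
    memcpy(&v, s, len);               /* remaining bytes stay zero */
    return v;
}

/* Hypothetical keyword set; the comparison is a handful of 64-bit equals. */
static bool is_keyword(const char *s, size_t len) {
    static const char *kw[] = { "if", "else", "while", "return", "struct" };
    uint64_t v = pack8(s, len);
    for (size_t i = 0; i < sizeof kw / sizeof kw[0]; i++)
        if (v == pack8(kw[i], strlen(kw[i])))
            return true;
    return false;
}
```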

The next level up (parsing) is slow enough to eat & memoize the lexer's decision making, and it doesn't materially complicate the parser. (In fact, there's a lot of decision making in the parser that would otherwise have to be replicated in the lexer.)

The result, overall, is that you can have a pretty general-purpose lexer that you can reuse for any old C-ish language, and tune to your heart's content, without needing a custom rewrite each time.

tempodox
The tragic thing is that you can't do `fgetwc()` on a `FILE *` produced by `fopencookie()` on Linux. glibc will crash your program deliberately as soon as there is a non-ASCII char in that stream (because, reasons?). But it does work with `funopen()` on a BSD, like macOS. I'm using that to read wide characters from UTF-8 streams.
o11c
Wide characters are best avoided even on platforms where it doesn't mean UTF-16. It's better to stay in UTF-8 mode, and only verify that it's well-formed.
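
A sketch of what "only verify that it's well-formed" can look like, hand-rolled (the function name is mine; it rejects overlong encodings, surrogates, and values past U+10FFFF):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true iff s[0..len) is well-formed UTF-8. */
static bool utf8_valid(const uint8_t *s, size_t len) {
    size_t i = 0;
    while (i < len) {
        uint8_t b = s[i];
        size_t n;                     /* expected sequence length */
        uint32_t cp;
        if (b < 0x80)                { i++; continue; }
        else if ((b & 0xE0) == 0xC0) { n = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { n = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { n = 4; cp = b & 0x07; }
        else return false;            /* stray continuation or invalid lead */
        if (i + n > len) return false;
        for (size_t k = 1; k < n; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        /* reject overlongs, surrogates, and out-of-range values */
        if ((n == 2 && cp < 0x80) || (n == 3 && cp < 0x800) ||
            (n == 4 && cp < 0x10000) || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += n;
    }
    return true;
}
```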
tempodox
But at some point you'll want to know whether the code point you just read satisfies `iswalpha()` or whatever, so you'll have to decode UTF-8 anyway.
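
For instance, the standard-library route looks roughly like this — a sketch, assuming a UTF-8 locale is installed and (as on Linux) a 4-byte `wchar_t`:

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

int main(void) {
    setlocale(LC_CTYPE, "en_US.UTF-8");   /* any installed UTF-8 locale */

    const char *src = "λx";               /* identifier starting with U+03BB */
    mbstate_t st;
    memset(&st, 0, sizeof st);

    wchar_t wc;
    size_t n = mbrtowc(&wc, src, strlen(src), &st);
    if (n == (size_t)-1 || n == (size_t)-2) {
        fputs("malformed UTF-8\n", stderr);
        return 1;
    }
    /* decode one code point, then classify it */
    printf("first code point uses %zu bytes, iswalpha=%d\n",
           n, iswalpha((wint_t)wc));
    return 0;
}
```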
thechao
At the parser level, though, not down in the lexer. I intern unique user-defined strings (just with a hashcons, or whatever the cool kids call it these days). That defers determining the correctness of the UTF-k-ness to "someone else".
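
A minimal sketch of that interning interface — a linear scan instead of a real hash table, and the `Interner`/`intern` names are made up; the point is that tokens carry the returned index, so identifier equality becomes an integer compare:

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    char **names;
    size_t count, cap;
} Interner;

/* Store every distinct spelling once; return its index. */
static size_t intern(Interner *in, const char *word) {
    for (size_t i = 0; i < in->count; i++)
        if (strcmp(in->names[i], word) == 0)
            return i;                       /* already interned */
    if (in->count == in->cap) {
        in->cap = in->cap ? in->cap * 2 : 16;
        in->names = realloc(in->names, in->cap * sizeof *in->names);
    }
    in->names[in->count] = strdup(word);
    return in->count++;
}
```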
tempodox
Figuring out whether a character should become part of a number or a name, for instance, is typical lexer stuff though. For that you have to classify it.