
180 points by xnacly | 3 comments
thechao
I like to have my lexers operate on `FILE*`, rather than string-views. This has some real-world performance implications (not good ones), but it does mean I can operate on streams. If the user has a C string, it can easily be wrapped with `funopen()` or `fopencookie()` to provide a `FILE*` adapter layer. (Most of my lexers include one of these, out of the box.)
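
For reference, a minimal sketch of what that adapter can look like with glibc's `fopencookie()` — the `fopen_cstring` name and the cookie struct are illustrative, not from the comment, and the caller's string must outlive the returned `FILE *`:

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Cookie state: the string being read and the current offset. */
typedef struct {
    const char *buf;
    size_t len;
    size_t pos;
} strcookie;

static ssize_t str_read(void *c, char *out, size_t n) {
    strcookie *s = c;
    size_t left = s->len - s->pos;
    if (n > left) n = left;
    memcpy(out, s->buf + s->pos, n);
    s->pos += n;
    return (ssize_t)n;
}

static int str_close(void *c) {
    free(c);
    return 0;
}

/* Wrap a NUL-terminated C string in a read-only FILE*. */
FILE *fopen_cstring(const char *str) {
    strcookie *s = malloc(sizeof *s);
    if (!s) return NULL;
    s->buf = str;
    s->len = strlen(str);
    s->pos = 0;
    cookie_io_functions_t io = { .read = str_read, .close = str_close };
    return fopencookie(s, "r", io);
}
```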

Everything else I stole from Bob Nystrom: I keep a local copy of the token's string in the token, aka `char word[64]`. I try to minimize "decision making" during lexing. Really, at the consumption point we're only interested in an extremely small number of things: (1) does the lexeme start with a letter or a number? (2) is it whitespace, and is that whitespace a newline? or (3) does it look like an operator?
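
A rough sketch of that shape — the `Token`/`next_token` names and the exact character classes are illustrative, not from the comment:

```c
#include <ctype.h>
#include <stdio.h>

typedef enum { TK_NAME, TK_NUMBER, TK_NEWLINE, TK_OP, TK_EOF } TokKind;

typedef struct {
    TokKind kind;
    char    word[64];   /* local copy of the lexeme, as described above */
    int     len;
} Token;

static Token next_token(FILE *in) {
    Token t = { .kind = TK_EOF, .len = 0 };
    int c = fgetc(in);

    /* skip non-newline whitespace */
    while (c == ' ' || c == '\t' || c == '\r') c = fgetc(in);

    if (c == EOF)  return t;
    if (c == '\n') { t.kind = TK_NEWLINE; return t; }

    if (isalpha(c) || c == '_' || isdigit(c)) {
        /* (1) starts with a letter or a number: accumulate the word */
        t.kind = isdigit(c) ? TK_NUMBER : TK_NAME;
        while (c != EOF && (isalnum(c) || c == '_')) {
            if (t.len < 63) t.word[t.len++] = (char)c;
            c = fgetc(in);
        }
        if (c != EOF) ungetc(c, in);
    } else {
        /* (3) anything else: treat it as a (single-char) operator here */
        t.kind = TK_OP;
        t.word[t.len++] = (char)c;
    }
    t.word[t.len] = '\0';
    return t;
}
```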

The only place where I've ever considered goto-threading was in keyword identification. However, if your language keeps keywords to ≤ 8 bytes, you can just bake the keywords into `uint64_t`'s and compare against those values. You can do a crapload of 64b compares/ns.
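
Something like the following, as a sketch — the keyword list is made up, and a real table would precompute the packed constants instead of re-packing on every call:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Pack up to 8 bytes of a lexeme into a zero-padded uint64_t. */
static uint64_t pack8(const char *s, size_t len) {
    uint64_t v = 0;
    if (len > 8) return UINT64_MAX;   /* too long to be a keyword here */
    memcpy(&v, s, len);               /* remaining bytes stay zero */
    return v;
}

/* Hypothetical keyword set; the comparison is a handful of 64-bit equals. */
static bool is_keyword(const char *s, size_t len) {
    static const char *kw[] = { "if", "else", "while", "return", "struct" };
    uint64_t v = pack8(s, len);
    for (size_t i = 0; i < sizeof kw / sizeof kw[0]; i++)
        if (v == pack8(kw[i], strlen(kw[i])))
            return true;
    return false;
}
```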

The next level up (parsing) is slow enough to eat & memoize the lexer's decision making, and it doesn't materially complicate the parser. (In fact, there's a lot of decision making in the parser that would otherwise have to be replicated in the lexer.)

The result, overall, is that you can have a pretty general-purpose lexer that you can reuse for any old C-ish language, and tune to your heart's content, without needing a custom rewrite each time.

tempodox
The tragic thing is that you can't do `fgetwc()` on a `FILE *` produced by `fopencookie()` on Linux. glibc will crash your program deliberately as soon as there is a non-ASCII char in that stream (because, reasons?). But it does work with `funopen()` on a BSD, like macOS. I'm using that to read wide characters from UTF-8 streams.
o11c
Wide characters are best avoided even on platforms where it doesn't mean UTF-16. It's better to stay in UTF-8 mode, and only verify that it's well-formed.
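
A sketch of what "only verify that it's well-formed" can look like, hand-rolled (the function name is mine; it rejects overlong encodings, surrogates, and values past U+10FFFF):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true iff s[0..len) is well-formed UTF-8. */
static bool utf8_valid(const uint8_t *s, size_t len) {
    size_t i = 0;
    while (i < len) {
        uint8_t b = s[i];
        size_t n;                     /* expected sequence length */
        uint32_t cp;
        if (b < 0x80)                { i++; continue; }
        else if ((b & 0xE0) == 0xC0) { n = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { n = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { n = 4; cp = b & 0x07; }
        else return false;            /* stray continuation or invalid lead */
        if (i + n > len) return false;
        for (size_t k = 1; k < n; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        /* reject overlongs, surrogates, and out-of-range values */
        if ((n == 2 && cp < 0x80) || (n == 3 && cp < 0x800) ||
            (n == 4 && cp < 0x10000) || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
        i += n;
    }
    return true;
}
```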
tempodox
But at some point you'll want to know whether the code point you just read satisfies `iswalpha()` or whatever, so you'll have to decode UTF-8 anyway.
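
For instance, the standard-library route looks roughly like this — a sketch, assuming a UTF-8 locale is installed and (as on Linux) a 4-byte `wchar_t`:

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

int main(void) {
    setlocale(LC_CTYPE, "en_US.UTF-8");   /* any installed UTF-8 locale */

    const char *src = "λx";               /* identifier starting with U+03BB */
    mbstate_t st;
    memset(&st, 0, sizeof st);

    wchar_t wc;
    size_t n = mbrtowc(&wc, src, strlen(src), &st);
    if (n == (size_t)-1 || n == (size_t)-2) {
        fputs("malformed UTF-8\n", stderr);
        return 1;
    }
    /* decode one code point, then classify it */
    printf("first code point uses %zu bytes, iswalpha=%d\n",
           n, iswalpha((wint_t)wc));
    return 0;
}
```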
thechao
At the parser level, though, not down in the lexer. I intern unique user-defined strings (just with a hashcons, or whatever the cool kids call it these days). That defers determining the correctness of the UTF-k-ness to "someone else".
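
A minimal sketch of that interning interface — a linear scan instead of a real hash table, and the `Interner`/`intern` names are made up; the point is that tokens carry the returned index, so identifier equality becomes an integer compare:

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    char **names;
    size_t count, cap;
} Interner;

/* Store every distinct spelling once; return its index. */
static size_t intern(Interner *in, const char *word) {
    for (size_t i = 0; i < in->count; i++)
        if (strcmp(in->names[i], word) == 0)
            return i;                       /* already interned */
    if (in->count == in->cap) {
        in->cap = in->cap ? in->cap * 2 : 16;
        in->names = realloc(in->names, in->cap * sizeof *in->names);
    }
    in->names[in->count] = strdup(word);
    return in->count++;
}
```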
tempodox
Figuring out whether a character should become part of a number or a name, for instance, is typical lexer stuff though. For that you have to classify it.