Strategies for Fast Lexers

(xnacly.me)

180 points xnacly | 1 comments | 14 Jul 25 14:42 UTC


thechao ◴[14 Jul 25 15:24 UTC] No.44561283[source]▶

I like to have my lexers operate on `FILE*`, rather than string-views. This has some real-world performance implications (not good ones); but, it does mean I can operate on streams. If the user has a c-string, the string can be easily wrapped by `funopen()` or `fopencookie()` to provide a `FILE*` adapter layer. (Most of my lexers include one of these, out-of-the-box.)
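
A minimal sketch of the adapter described above, assuming glibc's `fopencookie()` (a GNU extension; BSD systems would use `funopen()` with slightly different callback signatures). The names `str_cookie`, `str_read`, `str_close`, and `fopen_cstring` are hypothetical and not taken from the commenter's code. The wrapped string must outlive the returned `FILE*`.

```c
#define _GNU_SOURCE            /* fopencookie() is a GNU extension */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical cookie: a read cursor over a borrowed C string. */
struct str_cookie {
    const char *data;
    size_t len;
    size_t pos;
};

/* Read callback: copy out up to `size` bytes; returning 0 signals EOF. */
static ssize_t str_read(void *c, char *buf, size_t size)
{
    struct str_cookie *sc = c;
    size_t left = sc->len - sc->pos;

    if (size > left)
        size = left;
    memcpy(buf, sc->data + sc->pos, size);
    sc->pos += size;
    return (ssize_t)size;
}

/* Close callback: free the cookie, not the borrowed string. */
static int str_close(void *c)
{
    free(c);
    return 0;
}

/* Wrap a NUL-terminated string in a read-only FILE*. */
static FILE *fopen_cstring(const char *s)
{
    assert(s != NULL);

    struct str_cookie *sc = malloc(sizeof *sc);
    if (sc == NULL)
        return NULL;
    sc->data = s;
    sc->len = strlen(s);
    sc->pos = 0;

    cookie_io_functions_t io = {
        .read = str_read, .write = NULL, .seek = NULL, .close = str_close
    };
    FILE *fp = fopencookie(sc, "r", io);
    if (fp == NULL)
        free(sc);
    return fp;
}
```

For the plain read-only case, POSIX `fmemopen()` gives a similar result without custom callbacks; the cookie approach is mainly useful when the adapter needs its own state or behavior.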

Everything else, I stole from Bob Nystrom: I keep a local copy of the token's string in the token, aka, `char word[64]`. I try to minimize "decision making" during lexing. Really, at the consumption point we're only interested in an extremely small number of things: (1) does the lexeme start with a letter or a number?; (2) is it whitespace, and is that whitespace a new line?; or, (3) does it look like an operator?
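
One way to read the three-way split above, as a rough sketch rather than the commenter's actual lexer: the token carries its own `char word[64]` copy, and the lexer only decides between word, number, newline, and operator. The names `tok_kind`, `token`, and `next_token` are hypothetical, and silently truncating lexemes longer than 63 bytes is an assumption made here for brevity.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

enum tok_kind { TOK_EOF, TOK_WORD, TOK_NUMBER, TOK_NEWLINE, TOK_OP };

struct token {
    enum tok_kind kind;
    char word[64];              /* local, NUL-terminated copy of the lexeme */
};

/* Read one token from `in`; anything finer-grained is left to the parser. */
static enum tok_kind next_token(FILE *in, struct token *t)
{
    int c;
    size_t n = 0;

    assert(in != NULL && t != NULL);

    /* Skip horizontal whitespace; a newline is reported as its own token. */
    while ((c = fgetc(in)) == ' ' || c == '\t' || c == '\r')
        ;

    if (c == EOF) {
        t->kind = TOK_EOF;
    } else if (c == '\n') {
        t->kind = TOK_NEWLINE;
        t->word[n++] = '\n';
    } else if (isalnum(c) || c == '_') {
        /* The first character decides: digit means number, else word. */
        t->kind = isdigit(c) ? TOK_NUMBER : TOK_WORD;
        do {
            if (n < sizeof t->word - 1)   /* assumption: truncate long lexemes */
                t->word[n++] = (char)c;
            c = fgetc(in);
        } while (c != EOF && (isalnum(c) || c == '_'));
        if (c != EOF)
            ungetc(c, in);                /* one byte of pushback is enough */
    } else {
        /* Anything else is treated as a single-character operator. */
        t->kind = TOK_OP;
        t->word[n++] = (char)c;
    }
    t->word[n] = '\0';
    return t->kind;
}
```

Multi-character operators such as `==` are deliberately not merged here; in this style the parser can combine adjacent operator tokens when it needs to.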

The only place where I've ever considered goto-threading was in keyword identification. However, if your language keeps keywords to ≤ 8 bytes, you can just bake the keywords into `uint64_t`'s and compare against those values. You can do a crapload of 64b compares/ns.
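
The `uint64_t` trick might look roughly like this. It is a sketch under assumptions: the keyword set, the `kw` enum, and the helpers `pack8`, `K8`, and `lookup_keyword` are made up for illustration. Bytes are packed with explicit shifts so the constants do not depend on host endianness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical keyword set for illustration. */
enum kw { KW_NONE, KW_IF, KW_ELSE, KW_WHILE, KW_RETURN, KW_CONTINUE };

/* Pack up to 8 bytes into a zero-padded uint64_t, first byte lowest. */
static uint64_t pack8(const char *s, size_t len)
{
    uint64_t v = 0;

    assert(len <= 8);
    for (size_t i = 0; i < len; i++)
        v |= (uint64_t)(unsigned char)s[i] << (8 * i);
    return v;
}

/* Compile-time counterpart of pack8() for keyword constants. */
#define K8(a, b, c, d, e, f, g, h)                                  \
    ((uint64_t)(a)       | (uint64_t)(b) << 8  |                    \
     (uint64_t)(c) << 16 | (uint64_t)(d) << 24 |                    \
     (uint64_t)(e) << 32 | (uint64_t)(f) << 40 |                    \
     (uint64_t)(g) << 48 | (uint64_t)(h) << 56)

/* Classify a lexeme with one packing pass and integer compares. */
static enum kw lookup_keyword(const char *lexeme, size_t len)
{
    if (len == 0 || len > 8)
        return KW_NONE;         /* too long to be a keyword in this scheme */

    switch (pack8(lexeme, len)) {
    case K8('i', 'f', 0, 0, 0, 0, 0, 0):             return KW_IF;
    case K8('e', 'l', 's', 'e', 0, 0, 0, 0):         return KW_ELSE;
    case K8('w', 'h', 'i', 'l', 'e', 0, 0, 0):       return KW_WHILE;
    case K8('r', 'e', 't', 'u', 'r', 'n', 0, 0):     return KW_RETURN;
    case K8('c', 'o', 'n', 't', 'i', 'n', 'u', 'e'): return KW_CONTINUE;
    default:                                         return KW_NONE;
    }
}
```

If the token already stores its text in a zero-padded fixed buffer, the packing loop can be replaced by a single 8-byte `memcpy`, at the cost of making the constants endian-dependent.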

The next level up (parsing) is slow enough to eat & memoize the decision making of the lexer; and, materially, it doesn't complicate the parser. (In fact: there's a lot of decision making that happens in the parser that'd have to be replicated in the lexer, otherwise.)

The result, overall, is you can have a pretty general-purpose lexer that you can reuse for any old C-ish language, and tune to your heart's content, without needing a custom rewrite, each time.

replies(4): >>44562865 #>>44562886 #>>44565776 #>>44568007 #

teo_zero ◴[15 Jul 25 05:12 UTC] No.44568007[source]▶

>>44561283 #

> I like to have my lexers operate on `FILE*`, rather than string-views. [...] it does mean I can operate on streams.

While I understand the desire to support one input interface for composability, reuse, etc., I can't help wondering why `FILE*`. Isn't reading from a string more "universal"?

> If the user has a c-string, the string can be easily wrapped by `funopen()` or `fopencookie()` to provide a `FILE*` adapter layer.

And if the user has a file, it's easy to read it into memory in advance.

What's the benefit of FILE* over a string?

replies(1): >>44570344 #

trealira ◴[15 Jul 25 12:15 UTC] No.44570344[source]▶

>>44568007 #

Perhaps it's that you never have to read the whole file into memory at once if it's with a `FILE *` rather than a string. I'm not that person, this is just my assumption.

replies(2): >>44570917 #>>44572431 #

tlb ◴[15 Jul 25 13:27 UTC] No.44570917{3}[source]▶

>>44570344 #

There was a time when a file of source code might not fit in memory, or would take up a significant fraction of it. But it hasn't been the case on any developer machine in 20+ years. And the overhead of FILE * accessors like fgetc is substantial. Strings in memory are always going to be faster.

replies(1): >>44571137 #

1. viega ◴[15 Jul 25 13:47 UTC] No.44571137{4}[source]▶

>>44570917 #

Well, the overhead of the stream API is in the noise. If the lexer / parser do not support incremental parsing, it doesn't really matter. But incremental parsing can be important in some situations. For instance, if you're parsing a 1 GB JSON blob, keeping the whole thing in memory at once can easily be an issue. Plus, if you stall waiting for the entire input string, you end up adding to latency, if that matters.
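
To make the incremental point concrete, here is a small sketch of a scanner that is fed fixed-size chunks and carries its state across chunk boundaries, so memory use stays constant regardless of input size. It only counts identifier-like words, which is far simpler than a real JSON lexer; `word_counter`, `wc_feed`, and `count_words_stream` are hypothetical names, and the 4096-byte buffer is an arbitrary choice.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Scanner state that survives between chunks. */
struct word_counter {
    size_t words;               /* words seen so far */
    int in_word;                /* nonzero if the last byte was inside a word */
};

/* Consume one chunk; a word split across two chunks is counted once. */
static void wc_feed(struct word_counter *wc, const char *buf, size_t n)
{
    assert(wc != NULL);

    for (size_t i = 0; i < n; i++) {
        int is_word = isalnum((unsigned char)buf[i]) || buf[i] == '_';

        if (is_word && !wc->in_word)
            wc->words++;        /* transition from non-word to word */
        wc->in_word = is_word;
    }
}

/* Drive the scanner from a stream using a fixed-size buffer. */
static size_t count_words_stream(FILE *in)
{
    struct word_counter wc = { 0, 0 };
    char buf[4096];
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, in)) > 0)
        wc_feed(&wc, buf, n);
    return wc.words;
}
```

The same shape works for sockets or pipes: results are available as soon as each chunk arrives, instead of after the whole input has been buffered.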
