180 points xnacly | 14 comments
1. thechao ◴[] No.44561283[source]
I like to have my lexers operate on `FILE*`, rather than string-views. This has some real-world performance implications (not good ones); but, it does mean I can operate on streams. If the user has a c-string, the string can be easily wrapped by `funopen()` or `fopencookie()` to provide a `FILE*` adapter layer. (Most of my lexers include one of these, out-of-the-box.)
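
A minimal sketch of what that adapter can look like with glibc's `fopencookie()` (the `fstropen` name and the cookie struct are illustrative, not any standard API; the `funopen()` version on BSD is analogous):

```c
#define _GNU_SOURCE              /* fopencookie() is a glibc extension */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>           /* ssize_t */

/* Cookie tracking progress through a caller-owned C string. */
struct str_cookie {
    const char *s;
    size_t pos, len;
};

static ssize_t str_read(void *c, char *buf, size_t size) {
    struct str_cookie *sc = c;
    size_t left = sc->len - sc->pos;
    size_t n = size < left ? size : left;
    memcpy(buf, sc->s + sc->pos, n);
    sc->pos += n;
    return (ssize_t)n;           /* returning 0 signals EOF to stdio */
}

static int str_close(void *c) { free(c); return 0; }

/* Wrap a C string as a read-only FILE*, so the lexer only ever sees streams. */
FILE *fstropen(const char *s) {
    struct str_cookie *sc = malloc(sizeof *sc);
    if (!sc) return NULL;
    *sc = (struct str_cookie){ .s = s, .len = strlen(s) };
    cookie_io_functions_t io = { .read = str_read, .close = str_close };
    return fopencookie(sc, "r", io);
}
```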

Everything else, I stole from Bob Nystrom: I keep a local copy of the token's string in the token, aka, `char word[64]`. I try to minimize "decision making" during lexing. Really, at the consumption point we're only interested in an extremely small number of things: (1) does the lexeme start with a letter or a number?; (2) is it whitespace, and is that whitespace a new line?; or, (3) does it look like an operator?
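
Sketched out, the token and those consumption-point decisions might look something like this (the names are illustrative, not lifted from any particular codebase):

```c
#include <ctype.h>
#include <stdio.h>               /* EOF */

enum tok_kind { TOK_WORD, TOK_NUMBER, TOK_NEWLINE, TOK_SPACE, TOK_OPERATOR, TOK_EOF };

struct token {
    enum tok_kind kind;
    char word[64];               /* local copy of the lexeme, not a pointer into the input */
    int  line, col;              /* coordinates, not pointers */
};

/* The only decisions made at the consumption point. */
static enum tok_kind classify(int c) {
    if (c == EOF)                  return TOK_EOF;
    if (c == '\n')                 return TOK_NEWLINE;
    if (isspace(c))                return TOK_SPACE;
    if (isalpha(c) || c == '_')    return TOK_WORD;
    if (isdigit(c))                return TOK_NUMBER;
    return TOK_OPERATOR;
}
```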

The only place where I've ever considered goto-threading was in keyword identification. However, if your language keeps keywords to ≤ 8 bytes, you can just bake the keywords into `uint64_t`'s and compare against those values. You can do a crapload of 64b compares/ns.
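
For instance, a sketch of the `uint64_t` trick (the keyword list is just an example; zero-padding makes byte order irrelevant as long as both sides are packed the same way):

```c
#include <stdint.h>
#include <string.h>

/* Pack up to 8 bytes of a word into a uint64_t, zero-padded. */
static uint64_t pack8(const char *s, size_t len) {
    uint64_t v = 0;
    memcpy(&v, s, len < 8 ? len : 8);
    return v;
}

/* Keyword recognition becomes a handful of integer compares, no strcmp. */
static int is_keyword(const char *word, size_t len) {
    if (len > 8) return 0;
    uint64_t w = pack8(word, len);
    return w == pack8("return", 6) || w == pack8("while", 5) ||
           w == pack8("if", 2)     || w == pack8("else", 4);
}
```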

The next level up (parsing) is slow enough to eat & memoize the decision making of the lexer; and, materially, it doesn't complicate the parser. (In fact: there's a lot of decision making that happens in the parser that'd have to be replicated in the lexer, otherwise.)

The result, overall, is you can have a pretty general-purpose lexer that you can reuse for any old C-ish language and tune to your heart's content, without needing a custom rewrite each time.

replies(4): >>44562865 #>>44562886 #>>44565776 #>>44568007 #
2. tempodox ◴[] No.44562865[source]
The tragic thing is that you can't do `fgetwc()` on a `FILE *` produced by `fopencookie()` on Linux. glibc will crash your program deliberately as soon as there is a non-ASCII char in that stream (because, reasons?). But it does work with `funopen()` on a BSD, like macOS. I'm using that to read wide characters from UTF-8 streams.
replies(1): >>44562913 #
3. o11c ◴[] No.44562886[source]
Have you considered making your lexer operate in push mode instead?

This does mean you have to worry about partial tokens ... but if you limit yourself to feeding full lines, that mostly goes away.

Besides, for reasonable-size workloads, "read the whole file ahead of time" is usually a win. The only time it's tempting not to do so is for REPLs.
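
For what it's worth, a push-mode interface can be sketched as just declarations (hypothetical names; the caller feeds whole lines and gets tokens back through a callback):

```c
#include <stddef.h>

struct token;                         /* whatever token type the lexer produces */
struct lexer;                         /* opaque lexer state */

typedef void (*token_cb)(void *user, const struct token *tok);

struct lexer *lexer_new(token_cb cb, void *user);
void lexer_push_line(struct lexer *lx, const char *line, size_t len);
void lexer_finish(struct lexer *lx);  /* flush any trailing token, emit EOF */
```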

replies(1): >>44564301 #
4. o11c ◴[] No.44562913[source]
Wide characters are best avoided even on platforms where it doesn't mean UTF-16. It's better to stay in UTF-8 mode, and only verify that it's well-formed.
replies(1): >>44563070 #
5. tempodox ◴[] No.44563070{3}[source]
But at some point you'll want to know whether that code point you read satisfies `iswalpha()` or whatever, so you'll have to decode UTF-8 anyway.
replies(1): >>44564287 #
6. thechao ◴[] No.44564287{4}[source]
At the parser-level, though; not down in the lexer. I intern unique user-defined strings (just with a hashcons or whatever the cool kids call it, these days). That defers the determination of correctness of UTF-kness to "someone else".
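
A sketch of what such an interning table can look like (fixed-size bucket array, illustrative names, allocation-failure handling elided):

```c
#include <stdlib.h>
#include <string.h>

#define INTERN_BUCKETS 1024

struct intern_entry { char *s; struct intern_entry *next; };
static struct intern_entry *interned[INTERN_BUCKETS];

static size_t hash_str(const char *s) {
    size_t h = 5381;                             /* djb2 */
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Each distinct lexeme is stored once; the rest of the front end compares
   interned pointers. The bytes are kept as-is, so UTF-8 validity is deferred
   to whoever consumes them later. */
const char *intern(const char *s) {
    size_t b = hash_str(s) % INTERN_BUCKETS;
    for (struct intern_entry *e = interned[b]; e; e = e->next)
        if (strcmp(e->s, s) == 0) return e->s;
    struct intern_entry *e = malloc(sizeof *e);
    size_t n = strlen(s) + 1;
    e->s = malloc(n);
    memcpy(e->s, s, n);
    e->next = interned[b];
    interned[b] = e;
    return e->s;
}
```
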
replies(1): >>44637982 #
7. thechao ◴[] No.44564301[source]
I agree. But, I also like the discipline of lexing from `FILE*`. I've ended up with cleaner separation of concerns throughout the front-end stack, because I can't dip back into the well, unless I'm thinking very clearly about that operation. For instance, I keep around coordinates of things, rather than pointers, etc.
8. codr7 ◴[] No.44565776[source]
I'd do this in almost any other language than C :)

In C, I like just passing a const char * around as input; this also gives me the ability to return progress and unget chars as an added bonus.

https://github.com/codr7/shi-c/blob/b1d5cb718b7eb166a0a93c77...

9. teo_zero ◴[] No.44568007[source]
> I like to have my lexers operate on `FILE*`, rather than string-views. [...] it does mean I can operate on streams.

While I understand the desire to support one input interface for composability, reuse, etc., I can't help wondering why `FILE*`. Isn't reading from a string more "universal"?

> If the user has a c-string, the string can be easily wrapped by `funopen()` or `fopencookie()` to provide a `FILE*` adapter layer.

And if the user has a file, it's easy to read it into memory in advance.

What's the benefit of FILE* over a string?

replies(1): >>44570344 #
10. trealira ◴[] No.44570344[source]
Perhaps it's that you never have to read the whole file into memory at once if you're working with a `FILE *` rather than a string. I'm not that person, this is just my assumption.
replies(2): >>44570917 #>>44572431 #
11. tlb ◴[] No.44570917{3}[source]
There was a time when a file of source code might not fit in memory, or would take up a significant fraction of it. But that hasn't been the case on any developer machine in 20+ years. And the overhead of FILE * accessors like fgetc is substantial. Strings in memory are always going to be faster.
replies(1): >>44571137 #
12. viega ◴[] No.44571137{4}[source]
Well, the overhead of the stream API is in the noise. If the lexer / parser don't support incremental parsing, it doesn't really matter. But incremental parsing can be important in some situations. For instance, if you're parsing a 1 GB JSON blob, keeping the whole thing in memory at once can easily be an issue. Plus, if you stall waiting for the entire input string, you end up adding to latency, if that matters.
13. cyber_kinetist ◴[] No.44572431{3}[source]
You can just use virtual memory (mmap / VirtualAlloc) to map a file into an address region and get the same effect while still working with plain char * pointers.
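
A POSIX-only sketch of that with `mmap()` (`map_file` is a hypothetical helper; empty files need special-casing because `mmap()` rejects a zero length, and on Windows the equivalent goes through `CreateFileMapping`/`MapViewOfFile`):

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a file read-only and lex it through a plain const char * without ever
   calling read() or fgetc(). Note the mapping is not NUL-terminated, so the
   lexer has to work from (pointer, length) rather than a C string. */
const char *map_file(const char *path, size_t *len_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return NULL; }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                        /* the mapping stays valid after close */
    if (p == MAP_FAILED) return NULL;
    *len_out = (size_t)st.st_size;
    return p;
}
```
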
14. tempodox ◴[] No.44637982{5}[source]
Figuring out whether a character should become part of a number or a name, for instance, is typical lexer stuff though. For that you have to classify it.