←back to thread

180 points xnacly | 1 comments | | HN request time: 0.207s | source
Show context
thechao ◴[] No.44561283[source]
I like to have my lexers operate on `FILE*`, rather than string-views. This has some real-world performance implications (not good ones); but, it does mean I can operate on streams. If the user has a c-string, the string can be easily wrapped by `funopen()` or `fopencookie()` to provide a `FILE*` adapter layer. (Most of my lexers include one of these, out-of-the-box.)

Everything else, I stole from Bob Nystrom: I keep a local copy of the token's string in the token, aka, `char word[64]`. I try to minimize "decision making" during lexing. Really, at the consumption point we're only interested in an extremely small number of things: (1) does the lexeme start with a letter or a number?; (2) is it whitespace, and is that whitespace a new line?; or, (3) does it look like an operator?

The only place where I've ever considered goto-threading was in keyword identification. However, if your language keeps keywords to ≤ 8 bytes, you can just bake the keywords into `uint64_t`'s and compare against those values. You can do a crapload of 64b compares/ns.

The next level up (parsing) is slow enough to eat & memoize the decision making of the lexer; and, materially, it doesn't complicate the parser. (In fact: there's a lot of decision making that happens in the parser that'd have to be replicated in the lexer, otherwise.)

The result, overall, is you can have a pretty general-purpose lexer that you can reuse for a any old C-ish language, and tune to your heart's content, without needing a custom rewrite, each time.

replies(4): >>44562865 #>>44562886 #>>44565776 #>>44568007 #
teo_zero ◴[] No.44568007[source]
> I like to have my lexers operate on `FILE*`, rather than string-views. [...] it does mean I can operate on streams.

While I understand the desire to support one input interface for composability, reuse, etc. I can't help wondering why 'FILE*'. Isn't reading from a string more "universal"?

> If the user has a c-string, the string can be easily wrapped by `funopen()` or `fopencookie()` to provide a `FILE*` adapter layer.

And if the user has a file, it's easy to read it into memory in advance.

What's the benefit of FILE* over a string?

replies(1): >>44570344 #
trealira ◴[] No.44570344[source]
Perhaps it's that you never have to read the whole file into memory at once if it's with a `FILE *` rather than a string. I'm not that person, this is just my assumption.
replies(2): >>44570917 #>>44572431 #
1. cyber_kinetist ◴[] No.44572431[source]
You can just use virtual memory (mmap / VirtualAlloc) to map an address region with a file and get the same effect while just using char* pointers.