
180 points xnacly | 3 comments
sparkie ◴[] No.44562104[source]
As an alternative to the computed gotos, you can use regular functions with the `[[musttail]]` attribute in Clang or GCC to achieve basically the same thing - the call in the tail position is replaced with a `jmp` instruction to the next function rather than to the label, and stack usage remains constant because the current frame is reutililzed for the called function. `musttail` requires that the calling function and callee have the same signature, and a prototype.

You'd replace the JUMP_TARGET macro:

    #define JUMP_TARGET goto *jump_table[(int32_t)l->input.p[l->pos]]
With:

    #ifdef __clang__
    #define musttail [[clang::musttail]]
    #elif defined(__GNUC__)
    #define musttail [[gnu::musttail]]
    #else
    #define musttail
    #endif
    #define JUMP_TARGET musttail return jump_table[(int32_t)l->input.p[l->pos]](l, a, out)
Then move the jump table out to the top level and replace each computed-goto label address (`&&label`) with a function pointer (`&function`).

See diff (untested): https://www.diffchecker.com/V4yH3EyF/

This approach has the advantage that it will work everywhere, not only on compilers that support computed gotos - it just won't be optimized on compilers that don't support `musttail`. (Though it has been proposed to standardize it in a future version of C.)

It might also work better with code navigation tools that show functions, but not labels, and enables modularity as we can split rules over multiple translation units.

Performance-wise it should be basically the same - though it has been argued that it may do better in some cases, because the compiler's register allocator doesn't do a great job in large functions with computed gotos, whereas with the musttail approach each function is a smaller unit and is optimized separately.

replies(4): >>44563630 #>>44567877 #>>44569195 #>>44572729 #
1. Sesse__ ◴[] No.44569195[source]
As an alternative to the computed gotos, you can use switch/case in Clang or GCC to achieve basically the same thing. :-) It becomes a jump table in most cases. (The article claims that a jump table gives smaller code and fewer branch misses, but it doesn't actually give any numbers, and enough of its claims are dubious that I'm not convinced they ever measured.)

https://blog.nelhage.com/post/cpython-tail-call/ has made the rounds a lot recently, and explores this for Python's bytecode interpreter.

replies(1): >>44573205 #
2. sparkie ◴[] No.44573205[source]
The switch misses the point. The compiler isn't smart enough to convert it to direct threading, to the best of my knowledge.

A switch only selects on one character. To continue lexing, you need the switch inside a loop. The compiler might optimize the switch itself into a jump table - but each case then jumps back to the start of the loop, after which the jump table is entered again. That's two branches per character.

The point of direct threading is that there is no loop - you simply jump directly to the handler for the next character at the end of each handler.

replies(1): >>44573284 #
3. Sesse__ ◴[] No.44573284[source]
> The compiler isn't smart enough to convert it to direct-threading, to the best of my knowledge.

If you read the URL I linked to, you will see that it is.

> The point of direct threading is that there is no loop - you simply jump directly to the handler for the next character at the end of each handler.

No, the point of direct threading is that you give the branch predictor more context to work with (effectively, the previous opcode), which was relevant with the branch predictors in typical CPUs 10+ years ago. (Modern ones, ITTAGE-style, have per-branch history also for indirect jumps.)