    180 points by xnacly | 12 comments
    1. norir No.44562212
    Lexing being the major performance bottleneck in a compiler is a great problem to have.
    replies(3): >>44563135 #>>44568294 #>>44568430 #
    2. norskeld No.44563135
    Is lexing ever a bottleneck though? Even if you push for lexing and parsing 10M lines/second [1], I'd argue that semantic analysis and codegen (for AOT-compiled languages) will dominate the timings.

    That said, there's no reason not to squeeze every bit of performance out of it!

    [1]: In this talk about the Carbon language, Chandler Carruth shows and explains some goals/challenges regarding performance: https://youtu.be/ZI198eFghJk?t=1462

    replies(3): >>44563278 #>>44564311 #>>44568469 #
    3. munificent No.44563278
    It depends a lot on the language.

    For a statically typed language, it's very unlikely that the lexer shows up as a bottleneck. Compilation time will likely be dominated by semantic analysis, type checking, and code generation.

    For a dynamically typed language, where there isn't as much for the compiler to do, the lexer might be a more noticeable chunk of compile times. As one of the V8 folks pointed out to me years ago, the lexer is the only part of the compiler that has to operate on every single individual byte of input. Everything else gets the luxury of working at a coarser granularity, so the lexer can be worth optimizing.
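
    To make that per-byte cost concrete, here is a purely illustrative toy scanner (not from V8 or any real compiler): even in this minimal form, every byte of the input has to be inspected at least once just to find token boundaries, while later phases only ever see whole tokens.

        #include <cctype>
        #include <cstddef>
        #include <string_view>

        // Toy scanner: every byte of 'src' is inspected at least once, because
        // any byte could start or end a token. Later phases only see tokens.
        std::size_t count_tokens(std::string_view src) {
            std::size_t count = 0, i = 0;
            while (i < src.size()) {
                unsigned char c = static_cast<unsigned char>(src[i]);
                if (std::isspace(c)) { ++i; continue; }
                ++count;
                if (std::isalnum(c) || c == '_') {
                    // Identifier/number: consume the whole run, byte by byte.
                    while (i < src.size() &&
                           (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_'))
                        ++i;
                } else {
                    ++i;  // single-byte punctuation in this toy version
                }
            }
            return count;
        }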

    replies(1): >>44580487 #
    4. SnowflakeOnIce No.44564311
    A simple hello world in C++ can pull in dozens of megabytes of header files.

    Years back I worked at a C++ shop with a big codebase (hundreds of millions of LOC when you included vendored dependencies). Compile times there were sometimes dominated by parsing speed! Now, I don't remember the exact breakdown of lexing vs parsing, but I did look at it under a profiler.

    It's very easy in C++ projects to structure your code such that a single #include inadvertently drags in hundreds of megabytes of source that all has to be parsed. In such cases, lexing and parsing costs can dominate build times. Precompiled headers help, but not enough...
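
    One easy way to see the effect for yourself (assuming GCC or Clang; the exact numbers depend on the standard library and its version): dump the preprocessed translation unit for a trivial program and measure how much text the parser actually has to chew through.

        // hello.cpp -- a "trivial" program that still drags in a large slice of
        // the standard library before the compiler ever reaches main().
        //
        // To see how much text the parser actually consumes (GCC/Clang):
        //   $ g++ -E -P hello.cpp | wc -c
        #include <iostream>

        int main() { std::cout << "hello, world\n"; }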

    replies(1): >>44564478 #
    5. adev_ No.44564478{3}
    > Now, I don't remember the exact breakdown of lexing vs parsing, but I did look at it under a profiler.

    Lexing, parsing and even type checking are interleaved in most C++ compilers due to the ambiguous nature of many constructs in the language.

    It is very hard to profile only one of these in isolation. And even with compiler built-in instrumentation, the results are not very representative of the work done behind the scenes.

    C++ compilers are amazing machines. They are blazing fast at parsing a language which is a nightmare of ambiguities. And they are like that mainly because of how stupidly verbose and inefficient the C++ include system is.
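
    A textbook illustration of why those phases end up interleaved (hypothetical names, not tied to any particular codebase): the same token sequence parses completely differently depending on what a name turns out to be, so the parser has to lean on name lookup mid-parse.

        template <typename T> struct x {};   // if 'x' is a class template...
        using y = int;

        int main() {
            x<y> z;          // ...this is a declaration of 'z' of type x<y>.
            (void)z;
            // Had 'x' and 'y' been ordinary integer variables instead, the very
            // same tokens would have to parse as the expression (x < y) > z.
            // The parser cannot decide which tree to build without asking the
            // semantic analyzer what 'x' names.
            return 0;
        }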

    replies(1): >>44567241 #
    6. SnowflakeOnIce No.44567241{4}
    > Lexing, parsing and even type checking are interleaved in most C++ compilers due to the ambiguous nature of many constructs in the language.

    > It is very hard to profile only one of these in isolation. And even with compiler built-in instrumentation, the results are not very representative of the work done behind the scenes.

    Indeed, precise cost attribution is difficult or impossible due to how the nature of the language imposes structure on industrial compilers. But that aside, you still easily end up with hundreds of megabytes of source to deal with in each translation unit. I have so many scars from dealing with that...

    replies(1): >>44571239 #
    7. Phil_Latio No.44568294
    You have a point, but getting the easy part (the lexer and its token representation) "right" is something to strive for, because it will also influence the memory and performance characteristics of later stages in the pipeline. So thinking (and writing) about this stuff makes sense.

    Example: in the blog post a single token uses 32 bytes, plus 8 bytes for the pointer indirection in the AST node. That's a lot of memory, cache line misses and indirections.
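
    For contrast, here is one possible compact layout (a sketch of the general technique, not the blog post's actual structs): store a byte offset into the source buffer instead of a pointer, keep the kind in a single byte, and the whole token fits in 8 bytes.

        #include <cstdint>

        // Hypothetical compact token: 8 bytes instead of 32 (+8 for a pointer).
        // The lexeme text is recovered on demand from (offset, length) against
        // the original source buffer, so nothing has to point at heap memory.
        enum class TokenKind : std::uint8_t { Identifier, Number, Plus, LParen /* ... */ };

        struct Token {
            std::uint32_t offset;   // byte offset of the lexeme in the source
            std::uint16_t length;   // lexeme length (rare long tokens can be re-scanned)
            TokenKind     kind;     // 1 byte
            std::uint8_t  flags;    // otherwise-wasted padding, e.g. "preceded by newline"
        };

        static_assert(sizeof(Token) == 8, "eight tokens per 64-byte cache line");

    Even if the exact fields differ, the point stands: the token layout chosen in the lexer ripples through every later stage that stores or walks tokens.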

    8. burnt-resistor No.44568430
    It might be worth throwing away the lexer entirely and using something like parsing with derivatives. Ultimately though, I suspect hand-rolled SIMD parsing is how to make a compiler even faster than either Go or V. Maintenance would suck, because each implementation would be completely architecture- and platform-specific.

    PS: We really need GPSIMD compiler phases for GCC and Clang.
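
    To make the SIMD idea a bit more concrete, here is a minimal and entirely hypothetical SSE2 sketch that classifies 16 input bytes at a time while skipping whitespace; a real implementation would classify identifier characters, digits, string delimiters and so on the same way, and would need per-architecture variants, which is exactly the maintenance burden mentioned above.

        #include <emmintrin.h>   // SSE2 intrinsics
        #include <cstddef>

        // Skip spaces and newlines 16 bytes at a time (tabs/'\r' omitted for brevity).
        // Returns the index of the first non-whitespace byte at or after 'i'.
        std::size_t skip_whitespace_sse2(const char* src, std::size_t len, std::size_t i) {
            const __m128i space = _mm_set1_epi8(' ');
            const __m128i nl    = _mm_set1_epi8('\n');
            while (i + 16 <= len) {
                __m128i chunk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
                __m128i ws    = _mm_or_si128(_mm_cmpeq_epi8(chunk, space),
                                             _mm_cmpeq_epi8(chunk, nl));
                int not_ws = ~_mm_movemask_epi8(ws) & 0xFFFF;   // bit set = not whitespace
                if (not_ws != 0)
                    return i + __builtin_ctz(static_cast<unsigned>(not_ws));
                i += 16;
            }
            while (i < len && (src[i] == ' ' || src[i] == '\n'))   // scalar tail
                ++i;
            return i;
        }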

    9. burnt-resistor No.44568469
    Extending beyond the capabilities of PCHs, there used to be incremental (C?) compilers/IDEs (maybe there still are?) that cached ASTs and were smart enough to invalidate and reparse just the portions of the local AST that changed, based on editor updates. This was back when storage was very, very slow.
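
    A toy sketch of that invalidation idea (names and structure entirely made up for illustration): cache parsed subtrees keyed by the source range they cover, and on an edit throw away only the ones the edit overlaps.

        #include <cstddef>
        #include <map>
        #include <memory>
        #include <utility>

        struct AstNode { /* parser output for one subtree */ };

        struct Range { std::size_t begin, end; };   // half-open byte range in the buffer

        // Cached subtrees, keyed by their start offset in the source buffer.
        std::map<std::size_t, std::pair<Range, std::shared_ptr<AstNode>>> cache;

        // After an editor change covering 'edit', drop every cached subtree whose
        // range overlaps it; everything else can be reused on the next reparse.
        // (A real implementation would also shift ranges that sit after the edit.)
        void invalidate(Range edit) {
            for (auto it = cache.begin(); it != cache.end(); ) {
                const Range& r = it->second.first;
                bool overlaps = r.begin < edit.end && edit.begin < r.end;
                if (overlaps) it = cache.erase(it);
                else          ++it;
            }
        }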
    10. viega No.44571239{5}
    The computational complexity of lexing C++ is linear, but parsing C++ is super-linear, as are many of the later analyses. In practice, lexing is in the noise for almost all compilers.
    replies(1): >>44579186 #
    11. adev_ No.44579186{6}
    Even the lexing is not strictly linear.

    Good luck naively deciding whether '>>' is two closing angle brackets or a right-shift operator in a nested template expression like 'set<vector<int>>'.
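
    Concretely (this is the standard example, nothing exotic): the same two-character sequence has to be treated differently depending on the surrounding parse, and before C++11 you even had to write a space to help the lexer out.

        #include <set>
        #include <vector>

        std::set<std::vector<int>> s;   // '>>' closes two templates here...
        int shifted = 256 >> 2;         // ...and is a right-shift here.
        // Pre-C++11 the first line had to be spelled 'std::set<std::vector<int> > s;'
        // because the lexer greedily produced a single '>>' token.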

    12. norskeld No.44580487{3}
    Ah, yes, that's totally fair. In the case of JS (in browsers) it's sort of a big deal, I suppose, even if the scripts being loaded are not render-blocking: the faster you lex and parse source files, the faster the page becomes interactive.

    P.S. I absolutely loved "Crafting Interpreters" — thank you so much for writing it!