
180 points by xnacly | 1 comment
o11c ◴[] No.44562355[source]
Unfortunately, operating one byte at a time puts a hard limit on performance.

A truly performant lexer needs to jump ahead as far as possible. This likely involves SIMD (or SWAR) since unfortunately the C library fails to provide most of the important interfaces.

As an example that the C library can handle tolerably, while lexing a string, you should repeatedly call `strcspn(input, "\"\\\n")` to skip over chunks of ordinary characters, then only special-case the quote, backslash, newline and (implicit!) NUL after each jump. Be sure to correctly distinguish between an embedded NUL and the one you probably append to represent EOF (or, if streaming [which requires quite a bit more logic], end of current chunk).

Unfortunately, there's a decent chance your implementation of `strcspn` doesn't optimize for the possibility of small sets, and instead constructs a full 256-bit bitset. And even if it does, this strategy won't work for larger sets such as "all characters in an identifier" (you'd actually use `strspn` since this is positive), for which you'll want to take advantage of the characters being adjacent.

Edit: yikes, is this using a hash without checking for collisions?!?

replies(3): >>44562796 #>>44562963 #>>44563056 #
kingstnap ◴[] No.44563056[source]
You can go pretty far processing one byte at a time in hardware. You just keep making the pipeline deeper and pushing the frequency, and then, to combat dependent parsing, you add speculative execution to avoid bubbles.

Eventually you land on recreating the modern CPU.

replies(1): >>44568411 #
1. burnt-resistor ◴[] No.44568411[source]
No, SIMD-optimized string ops are way, way faster.

https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-...