←back to thread

180 points xnacly | 1 comments | | HN request time: 0s | source
Show context
o11c ◴[] No.44562355[source]
Unfortunately, operating a byte at a time means there's a hard limit on performance.

A truly performant lexer needs to jump ahead as far as possible. This likely involves SIMD (or SWAR) since unfortunately the C library fails to provide most of the important interfaces.

As an example that the C library can handle tolerably, while lexing a string, you should repeatedly call `strcspn(input, "\"\\\n")` to skip over chunks of ordinary characters, then only special-case the quote, backslash, newline and (implicit!) NUL after each jump. Be sure to correctly distinguish between an embedded NUL and the one you probably append to represent EOF (or, if streaming [which requires quite a bit more logic], end of current chunk).

Unfortunately, there's a decent chance your implementation of `strcspn` doesn't optimize for the possibility of small sets, and instead constructs a full 256-bit bitset. And even if it does, this strategy won't work for larger sets such as "all characters in an identifier" (you'd actually use `strspn` since this is positive), for which you'll want to take advantage of the characters being adjacent.

Edit: yikes, is this using a hash without checking for collisions?!?

replies(3): >>44562796 #>>44562963 #>>44563056 #
1. dist1ll ◴[] No.44562796[source]
You can get pretty far with a branch per byte, as long as the bulk of the work is done w/ SIMD (like character classification). But yeah, LUT lookup per byte is not recommended.