←back to thread

Parser Combinators Beat Regexes

(entropicthoughts.com)
120 points mooreds | 2 comments | | HN request time: 0.475s | source
Show context
yen223 ◴[] No.43639551[source]
Parser combinators are one of those ideas that is really powerful in practice, but will never be mainstream because it had the bad fortune of coming from the functional programming world. It's a shame too, because parser combinators are what made parsers click for me.

Take the hard problem of parsing a language, break it down into the much simpler problem of reading tokens, then write simple "combinators" (and, or, one_or_many, etc etc) to combine these simple parsers into larger more complex parsers.

You could almost build an entire parser combinator library from scratch from memory, because none of the individual components are complicated.

replies(13): >>43639775 #>>43639805 #>>43639834 #>>43640597 #>>43641009 #>>43641205 #>>43641459 #>>43641675 #>>43642100 #>>43642148 #>>43643853 #>>43644151 #>>43650405 #
mrkeen ◴[] No.43641675[source]
> You could almost build an entire parser combinator library from scratch from memory, because none of the individual components are complicated.

I do this every now and then. Usually in the days leading up to advent of code.

"A parser is a function from an input to either failure or a (remaining input, result) pair" is all you need to remember to build the rest.

replies(1): >>43642360 #
1. idahoduncan ◴[] No.43642360[source]
I do this from time to time as well, although I tend to get hung up on error handling. I would say that it's a simple enough exercise if you don't care too much about reporting parse errors in a meaningful way, although I'd be happy for someone to point out an implementation of error handling that fits into one's head.
replies(1): >>43647129 #
2. debugnik ◴[] No.43647129[source]
Here's the most basic error handling to implement, which I learned from studying FParsec:

1. Keep a collection of errors in the stream. It represents errors at the current position.

2. Clear all errors when input is consumed.

3. When a token-like parser fails without consuming input, push an "expected '<literal>'" error.

4. When a labeled parser fails without consuming input, restore the error set as it was before running it, and push an "expected <label>" error instead.

This will naturally accumulate errors from choice and repetition combinators, as long as you don't have needless backtracking before the choice points.

So this already gives you reasonable errors like "foo:17:4: expected one of expression, assignment or 'end'". And it's easy to extend to nested errors, backtracking information or recoverable nodes.