
Parser Combinators Beat Regexes

(entropicthoughts.com)
120 points by mooreds | 37 comments
1. yen223 ◴[] No.43639551[source]
Parser combinators are one of those ideas that is really powerful in practice, but will never be mainstream because it had the bad fortune of coming from the functional programming world. It's a shame too, because parser combinators are what made parsers click for me.

Take the hard problem of parsing a language, break it down into the much simpler problem of reading tokens, then write simple "combinators" (and, or, one_or_many, etc etc) to combine these simple parsers into larger more complex parsers.

You could almost build an entire parser combinator library from scratch from memory, because none of the individual components are complicated.
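That claim is easy to check. Here is a minimal sketch in Python of exactly the pieces the comment names — and, or, one_or_many — with a parser represented as a function from input to either `None` (failure) or a `(value, remaining input)` pair. All names are illustrative, not from any particular library:

```python
def char(c):
    """Smallest parser: match a single expected character."""
    def parse(s):
        if s and s[0] == c:
            return (c, s[1:])
        return None
    return parse

def seq(p, q):  # "and": run p, then q on what p left over
    def parse(s):
        r1 = p(s)
        if r1 is None:
            return None
        v1, rest = r1
        r2 = q(rest)
        if r2 is None:
            return None
        v2, rest2 = r2
        return ((v1, v2), rest2)
    return parse

def alt(p, q):  # "or": try p, fall back to q on failure
    def parse(s):
        return p(s) or q(s)
    return parse

def one_or_many(p):
    def parse(s):
        r = p(s)
        if r is None:
            return None
        values = []
        while r is not None:
            v, s = r
            values.append(v)
            r = p(s)
        return (values, s)
    return parse

# Example: one or more 'a's followed by a 'b'.
ab = seq(one_or_many(char('a')), char('b'))
print(ab("aaab!"))  # → ((['a', 'a', 'a'], 'b'), '!')
```

None of the individual components is more than a few lines, which is the comment's point: the library is recoverable from memory.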

replies(13): >>43639775 #>>43639805 #>>43639834 #>>43640597 #>>43641009 #>>43641205 #>>43641459 #>>43641675 #>>43642100 #>>43642148 #>>43643853 #>>43644151 #>>43650405 #
2. thaumasiotes ◴[] No.43639775[source]
> Take the hard problem of parsing a language, break it down into the much simpler problem of reading tokens, then write simple "combinators" (and, or, one_or_many, etc etc) to combine these simple parsers into larger more complex parsers.

That's a terrible example for "parser combinators beat regexes"; those are the three regular operations. (Except that you use zero_or_many in preference to one_or_many.)

replies(1): >>43641710 #
3. egonschiele ◴[] No.43639805[source]
Hey! I love parser combinators and wish they were more mainstream. That's why I wrote Tarsec, which is Parsec for TypeScript: https://github.com/egonSchiele/tarsec

I have a short and a long intro, both illustrated: https://github.com/egonSchiele/tarsec/blob/main/tutorials/5-... and https://github.com/egonSchiele/tarsec/blob/main/tutorials/th...

It also has a neat debug mode, and a way to benchmark performance. I've loved using it, hopefully someone reading will use it too! Use this as a chance to learn something cool (parser combinators)!

replies(2): >>43639911 #>>43640797 #
4. skybrian ◴[] No.43639834[source]
Never is a long time. Other ideas from functional language have become mainstream.
replies(1): >>43639896 #
5. yen223 ◴[] No.43639896[source]
I would be very happy to be proven wrong here.

But I view parser combinators in the same lens as lenses, another powerful approach to solving a specific problem that is unlikely to receive mainstream recognition because of all the Haskell.

replies(3): >>43639982 #>>43640004 #>>43643966 #
6. yen223 ◴[] No.43639911[source]
Love the intro articles you wrote!
7. wk_end ◴[] No.43639982{3}[source]
FWIW lenses are lovely in Typescript with monocle-ts, and extremely useful when - speaking of functional ideas hitting the mainstream! - writing Redux reducers to provide updates for immutable data.
replies(1): >>43640499 #
8. BoiledCabbage ◴[] No.43640004{3}[source]
> But I view parser combinators in the same lens as lenses

Hmm, anything else you have in that list I should be looking into? As I also agree they are two of the most interesting things I got from learning Haskell. (Along with type driven reasoning, monoids and functors).

replies(1): >>43640188 #
9. yen223 ◴[] No.43640188{4}[source]
The big one that springs to mind, though it's not a Haskell idea, is algebraic effects and effect handlers
10. ◴[] No.43640499{4}[source]
11. Tabular-Iceberg ◴[] No.43640597[source]
I’ve seen it in practice. I decided to use a Java parser combinator library for one problem. As soon as I was off the project they ripped out my code and replaced it with regexes.

That the input to be parsed wasn't a regular language apparently did not matter.

12. furyofantares ◴[] No.43640797[source]
These are great.

I accidentally invented parser combinators for a project once. I was quite fond of the approach but it ended up being pretty nearly unusable for anyone else on the team. This was problematic because the project was such that we were always adding new things to be parsed. The whole reason I took the approach was that I wanted people to be able to freely add parsers as new requirements came in.

It being my first shot at it, it was for sure not as usable as it could be. But looking at your stuff, which is delightful, I am filled with doubt that my goal was even possible. Your stuff looks about as clean as possible, and yet seems likely to be almost as impenetrable to someone not versed in parsing.

That said, I am gonna look for a chance to use this in appropriate side projects.

replies(2): >>43641927 #>>43645093 #
13. worldsayshi ◴[] No.43641009[source]
I remember logstash having a quite intuitive parser language, called grok I think, where recursive descent was built on top of regex. We should use something like that!
14. fooker ◴[] No.43641205[source]
Parser combinators are already mainstream. All/most real languages are implemented with hand written recursive descent parsers that closely resemble parser combinators, or at the very least borrow a bunch of concepts.

It's the name that's unlikely to become mainstream. That's squarely upon FP folks giving sci-fi names to common sense concepts.

replies(1): >>43642845 #
15. meinersbur ◴[] No.43641459[source]
Flang (LLVM Fortran Compiler) uses parser combinators.

https://github.com/llvm/llvm-project/blob/main/flang/lib/Par...

16. mrkeen ◴[] No.43641675[source]
> You could almost build an entire parser combinator library from scratch from memory, because none of the individual components are complicated.

I do this every now and then. Usually in the days leading up to advent of code.

"A parser is a function from an input to either failure or a (remaining input, result) pair" is all you need to remember to build the rest.
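That one-line definition translates directly into code. A hypothetical Python rendering, keeping mrkeen's order of `(remaining input, result)`:

```python
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")
# mrkeen's definition, literally:
# input -> failure (None) or (remaining input, result)
Parser = Callable[[str], Optional[Tuple[str, T]]]

def literal(word: str) -> Parser[str]:
    """Smallest useful parser: match a fixed string at the front of the input."""
    def parse(s: str):
        if s.startswith(word):
            return (s[len(word):], word)  # (remaining input, result)
        return None                       # failure
    return parse

p = literal("let")
print(p("let x = 1"))  # → (' x = 1', 'let')
print(p("var x = 1"))  # → None
```

Everything else — sequencing, choice, repetition — is built by wrapping functions of this one shape.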

replies(1): >>43642360 #
17. mrkeen ◴[] No.43641710[source]
Can I see the (implied) counter-example?

A regex which turns a language's source code into an AST?

replies(1): >>43642623 #
18. miningape ◴[] No.43641927{3}[source]
Yeah parsers can be tricky "culturally" - as you've found there are many devs who haven't written a recursive descent parser and therefore they lack any meaningful context to understand what's happening under the hood.

It's a similar scenario to if you made your own RegEx and none of the devs on your team knows what it is. It just takes too long to teach them on the job, since there are very few quality resources you can point them towards - so you're stuck teaching everything.

On the other hand, if you have a team full of experienced PL developers doing this could significantly improve their productivity.

19. foldr ◴[] No.43642100[source]
As a sibling points out, recursive descent parsers have been mainstream for a long time. Higher order parser combinators are only moderately useful in an imperative language with regular looping constructs. (For example, the code to parse a list of Xs separated by Ys is a pretty straightforward loop.) A lot of parser combinator libraries also make the mistake of encouraging "short circuit" error handling, which is rarely what you want for a production-grade parser.
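The "straightforward loop" point is easy to make concrete. A hypothetical Python sketch of `sep_by`, using the convention that a parser returns `(result, remaining input)` or `None`:

```python
import re

def sep_by(item, sep):
    """Parse zero or more items separated by sep, as a plain loop."""
    def parse(s):
        values = []
        r = item(s)
        if r is None:
            return ([], s)          # zero items is a valid parse
        value, s = r
        values.append(value)
        while True:
            r = sep(s)
            if r is None:
                return (values, s)
            _, after_sep = r
            r = item(after_sep)
            if r is None:
                return (values, s)  # trailing separator: don't consume it
            value, s = r
            values.append(value)
    return parse

def number(s):
    m = re.match(r"\d+", s)
    return (int(m.group()), s[m.end():]) if m else None

def comma(s):
    return (",", s[1:]) if s.startswith(",") else None

print(sep_by(number, comma)("1,2,3;rest"))  # → ([1, 2, 3], ';rest')
```

No higher-order machinery is required for this case — which is exactly the argument that combinators buy less in languages that already have loops.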
20. conaclos ◴[] No.43642148[source]
I once tried to use parser combinators and I was quite disappointed by the result. I found them harder to read, and harder to handle errors in, than regular recursive descent parsers.
replies(1): >>43646849 #
21. idahoduncan ◴[] No.43642360[source]
I do this from time to time as well, although I tend to get hung up on error handling. I would say that it's a simple enough exercise if you don't care too much about reporting parse errors in a meaningful way, although I'd be happy for someone to point out an implementation of error handling that fits into one's head.
replies(1): >>43647129 #
22. thaumasiotes ◴[] No.43642623{3}[source]
What implied counterexample? Regular expressions define the simplest language type a CS education generally covers. You would expect anything to be more capable.

But you wouldn't prove it by demonstrating that you can recognize regular languages. That's the thing regular expressions are good at!

23. antonvs ◴[] No.43642845[source]
Presumably you're not objecting to the word "parser".

What name would you use for "combinator"? It's a pretty simple word, based on the root "combine".

The common sense concept here is functions whose results depend only on their arguments, which makes them well-suited for being used in "combination" with other combinators, as in `f . g` or `f(g(x))`. That's called function composition, which can be defined as the act of combining functions.

It's a term that has existed in logic for over a century, from before computers existed.

It seems to me that no amount of renaming of terms will fix people's lack of curiosity or lack of desire to learn or understand. That's on them.

replies(1): >>43646757 #
24. wruza ◴[] No.43643853[source]
It's never the origin but the ease of install/use/mentor and compatibility across envs.
25. wruza ◴[] No.43643966{3}[source]
It's unlikely because people don't get why they should jump through so many hoops when you can just

  new = copy(old)
  new.foo.bar.baz = new_baz
  return new
It won't come out of Haskell because it's uniquely a solution to a specific Haskell problem. Mutating a fresh copy through a "lens" when you have `=` is a level of geekage that cannot become mainstream.
replies(2): >>43647066 #>>43649028 #
26. bazoom42 ◴[] No.43644151[source]
> will never be mainstream because it had the bad fortune of coming from the functional programming world

Nah, mainstream developers don't care where an idea comes from as long as it solves the problem and helps them hit the deadline. E.g. LINQ in C# has been broadly adopted.

Regexes are disliked by many developers, so parser combinators should be an easy sell if you can show it solves realistic problems with simpler code.

27. egonschiele ◴[] No.43645093{3}[source]
The idea of functions returning functions makes a lot of people's heads spin! I've struggled to get people using this library too, but I use it in personal work and it is an absolute joy. One of my favorite things I've made.
replies(1): >>43645573 #
28. TuringTest ◴[] No.43645573{4}[source]
You can typically trick them into thinking that you're returning an object with a single method ;-)
replies(1): >>43660500 #
29. fooker ◴[] No.43646757{3}[source]
Every piece of software in existence calls functions and composes functions, there's no reason to attach 'combinator' to everything.

I do not doubt the correctness of the term, just the point of naming it. The fact that people write parsers all the time using these concepts without using the name proves that the name was not a great one.

> It's a term that has existed in logic for over a century, from before computers existed.

Yes, that's a great reason to not overload mathematical terms for random things. Monads and monoids are a great example of when this goes a bit wrong.

>It seems to me that no amount of renaming of terms will fix people's lack of curiosity or lack of desire to learn or understand.

Yes, I agree. Most people are just writing programs and solving problems, and moving on with their lives. It's weird to insist that they call something by a specific name if it's a simple idea that thousands of people have developed independently.

30. fooker ◴[] No.43646849[source]
The trick is to have your grammar or combinator setup to have error productions for all the common errors you can think of.

That way you get nice, targeted error messages.
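One way to read this suggestion, as a sketch: alongside the real alternatives, include a production that recognizes a known common mistake and raises a targeted message instead of a generic failure. All names here are illustrative:

```python
import re

class ParseError(Exception):
    pass

def assignment(s):
    # Real production: "x = 1"-style assignment.
    m = re.match(r"(\w+)\s*=\s*(\d+)", s)
    if m:
        return ((m.group(1), int(m.group(2))), s[m.end():])
    return None

def assignment_with_eqeq(s):
    # Error production: catches the common "x == 1" mistake
    # and reports it specifically.
    m = re.match(r"(\w+)\s*==\s*\d+", s)
    if m:
        raise ParseError(f"'==' is comparison; use '=' to assign to '{m.group(1)}'")
    return None

def statement(s):
    # Try the real productions first, then the error productions.
    for p in (assignment, assignment_with_eqeq):
        r = p(s)
        if r is not None:
            return r
    raise ParseError("expected a statement")

print(statement("x = 1"))  # → (('x', 1), '')
```

`statement("x == 1")` then fails with the targeted message rather than a bare "expected a statement".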

31. kqr ◴[] No.43647066{4}[source]
This is a common misconception. Lenses (or more generally optics) are not limited to targeting single fields. They can target composites, alternatives, etc. They go way beyond a normal accessor dot and can do stuff that cannot be done in mainstream languages.

Some of the other ideas from lenses are becoming mainstream, like the `?.` conditional accessor in e.g. C# and EcmaScript.
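The "more than an accessor dot" point doesn't need any Haskell to demonstrate. A hypothetical mini-lens in Python (not a real library): a lens is a getter/setter pair, and composition produces a first-class value that drills through nested immutable data.

```python
from dataclasses import dataclass, replace

def lens(get, put):
    """A lens is just a (getter, immutable setter) pair."""
    return (get, put)

def compose(outer, inner):
    og, op = outer
    ig, ip = inner
    return (lambda s: ig(og(s)),
            lambda s, v: op(s, ip(og(s), v)))

def over(l, f, s):
    """Apply f at the lens's focus, returning an updated copy of s."""
    get, put = l
    return put(s, f(get(s)))

@dataclass(frozen=True)
class Address:
    city: str

@dataclass(frozen=True)
class User:
    name: str
    address: Address

addr = lens(lambda u: u.address, lambda u, a: replace(u, address=a))
city = lens(lambda a: a.city, lambda a, c: replace(a, city=c))
addr_city = compose(addr, city)  # a reusable path, passable as a value

u = User("ada", Address("london"))
print(over(addr_city, str.upper, u))
# → User(name='ada', address=Address(city='LONDON'))
```

Unlike a literal `new.foo.bar.baz = v`, the composed lens is a value you can store, pass around, and apply to any `User` — and real optics libraries extend the same shape to traversals over many targets and alternatives.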

replies(1): >>43648783 #
32. debugnik ◴[] No.43647129{3}[source]
Here's the most basic error handling to implement, which I learned from studying FParsec:

1. Keep a collection of errors in the stream. It represents errors at the current position.

2. Clear all errors when input is consumed.

3. When a token-like parser fails without consuming input, push an "expected '<literal>'" error.

4. When a labeled parser fails without consuming input, restore the error set as it was before running it, and push an "expected <label>" error instead.

This will naturally accumulate errors from choice and repetition combinators, as long as you don't have needless backtracking before the choice points.

So this already gives you reasonable errors like "foo:17:4: expected one of expression, assignment or 'end'". And it's easy to extend to nested errors, backtracking information or recoverable nodes.
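The four rules above fit in a few dozen lines. A toy sketch (the names `expect`, `label`, `choice` are illustrative, not FParsec's API):

```python
class Stream:
    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.errors = set()   # rule 1: errors at the current position

    def consume(self, n):
        self.pos += n
        self.errors.clear()   # rule 2: clear errors when input is consumed

def expect(lit):
    """Token-like parser for a literal string."""
    def parse(st):
        if st.text.startswith(lit, st.pos):
            st.consume(len(lit))
            return lit
        st.errors.add(f"'{lit}'")   # rule 3: push "expected '<literal>'"
        return None
    return parse

def label(name, p):
    def parse(st):
        saved_pos, saved_errors = st.pos, set(st.errors)
        r = p(st)
        if r is None and st.pos == saved_pos:
            st.errors = saved_errors    # rule 4: restore the error set...
            st.errors.add(name)         # ...and report the label instead
        return r
    return parse

def choice(*parsers):
    def parse(st):
        start = st.pos
        for p in parsers:
            r = p(st)
            if r is not None:
                return r
            if st.pos != start:   # consumed input: commit, stop trying
                return None
        return None
    return parse

stmt = choice(label("expression", expect("(")),
              label("assignment", expect("let")),
              expect("end"))

st = Stream("oops")
stmt(st)
print(f"{st.pos}: expected one of " + ", ".join(sorted(st.errors)))
# → 0: expected one of 'end', assignment, expression
```

The errors accumulate across the alternatives of `choice` with no extra code, which is the "naturally accumulate" property described above.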

33. wruza ◴[] No.43648783{5}[source]
They don't surpass Turing Machine level, so I'm not convinced.

The idea of ?. was there since the beginning of time. You don't have to dig deep to realize that (x ? x.y : nil) and (x ? x : y) sucks. Even gnu C had (x ?: y) for quite a while. It's just that the designers of to-become-mainstream languages all start with the same "density" and ignore the damn obvious, and then rather than fixing it, chew on the same dilemmas they all inevitably introduce. I sometimes wonder if they have written regular code at all or started inventing languages right after school.

34. yen223 ◴[] No.43649028{4}[source]
This comment proves my point about how "all the Haskell" hinders mainstream adoption of a thing. It's very easy for folks to miss the point of the thing!

Lenses and more generally optics are a lot more powerful than just dot-accessors. They are closer in spirit to XPath or jq, except that they are general enough to work on most data structures, not just XML or Json.

replies(1): >>43652398 #
35. zahlman ◴[] No.43650405[source]
> break it down into the much simpler problem of reading tokens, then write simple "combinators" (and, or, one_or_many, etc etc) to combine these simple parsers into larger more complex parsers.

The popular `pyparsing` library for Python (https://pypi.org/project/pyparsing/) expresses these concepts, just written more in traditional OO style rather than functional style (i.e. there are classes like OneOrMany that you instantiate, rather than binding a single-token-matcher argument to lambda or whatever).

36. wruza ◴[] No.43652398{5}[source]
It doesn't. "All the haskell" is your protective decoy against "it's not that useful when you're not chained to the wall". Show me a lens with a bunch of other fp tricks and I'll write dumb inline code for it that does that same thing, but without twisting anyones brain or trying to not miss its thin point.
37. yen223 ◴[] No.43660500{5}[source]
Getting Java flashbacks from this