Most active commenters

yen223(7)
DadBase(4)
fooker(4)
mrkeen(4)
wyager(4)
wruza(4)
thaumasiotes(3)
BoiledCabbage(3)
(3)
kqr(3)

Popular/hot comments

>>43639551 #
>>43639465 #
>>43639374 #
>>43639845 #
>>43639552 #
>>43639896 #
>>43639965 #

Parser Combinators Beat Regexes

(entropicthoughts.com)

1. henning ◴[09 Apr 25 23:01 UTC] No.43638920[source]▶

>>43638422 (OP) #

You can also use simple string search functions for this Advent of Code problem.

I used this approach for implementing orthography rules in a stenography application (implementing rules like `artistic + ly = artistically`) and in the benchmarks I did, a few lines of searching for character indexes was an order of magnitude faster than regexes. Each of my functions is about 3-10 ish lines of simple code compared to a 1-line regex.

You do have to handle cases like the character not being found, but I've had zero issues with the code and, in terms of what actually executes, it is vastly simpler. This is why the saying goes that if you make something twice as fast, you may have done something clever, but if you make it 10x faster, you stopped doing something dumb.

replies(1): >>43641136 #

2. austin-cheney ◴[10 Apr 25 00:10 UTC] No.43639341[source]▶

>>43638422 (OP) #

There are numerous posts, comments, articles and so forth about when to use regex versus using a parser. The general rule is this:

If you need to do a search from a string, such as needle(s) from a hat stack, regex is probably more ideal than a parser. If you need anything more intelligent than a list of search results you probably want a full formal parser.

Most languages allow a form of nested regex that allow for increased search precision. This occurs when a method that makes use of a regex returns to a function whose argument is a matching string result, which is why regex is probably enough when the business is primitive. There is a tremendously higher cost to using a full parser, considering the lexer and tokenizer plus rules, but it’s so much more intelligent that it’s not even comparable.

replies(2): >>43639533 #>>43641180 #

3. o11c ◴[10 Apr 25 00:17 UTC] No.43639374[source]▶

>>43638422 (OP) #

Note that most implementations of both parser combinators and regexes can fail very badly (exponential time). Never use either on untrusted input, unless you can prove your implementation lets you stay linear.

replies(4): >>43639552 #>>43639845 #>>43640240 #>>43641298 #

4. arn3n ◴[10 Apr 25 00:34 UTC] No.43639465[source]▶

>>43638422 (OP) #

“In other languages, it would be considered overkill to write a full parser when a simple regex can do the same thing. In Haskell, writing a parser is no big deal. We just do it and move on with our lives.”

I see a long code file filled with comments and long paragraph-level explanations. I think I’d rather just learn and use regex.

replies(5): >>43639538 #>>43639912 #>>43639965 #>>43641791 #>>43644069 #

5. giraffe_lady ◴[10 Apr 25 00:46 UTC] No.43639533[source]▶

>>43639341 #

The key thing for me in making this decision is of course predicting the future.

Parsers are more work up front but in my experience much easier to debug, maintain and extend. So if I predict something will be around for a while I'll want to do a parser and if it's just a one-off I usually go with regex.

Of course this predictably leads to me having a collection of parsers that I've never needed to touch again and a handful of monstrously complex regex grammars that I'll rewrite into a parser "the next time" I need to modify them.

I still think it's the right thing to base the decision on I just need to keep improving my prophesying.

6. OneDeuxTriSeiGo ◴[10 Apr 25 00:47 UTC] No.43639538[source]▶

>>43639465 #

I mean that's the nature of article code no? You are writing for a generic audience and want to be very explicit as to what your code is doing to teach the audience.

For your average haskell programmer you could drop all of those comments since the code isn't really doing anything that couldn't be determined by just reading it.

7. yen223 ◴[10 Apr 25 00:49 UTC] No.43639551[source]▶

>>43638422 (OP) #

Parser combinators are one of those ideas that is really powerful in practice, but will never be mainstream because it had the bad fortune of coming from the functional programming world. It's a shame too, because parser combinators are what made parsers click for me.

Take the hard problem of parsing a language, break it down into the much simpler problem of reading tokens, then write simple "combinators" (and, or, one_or_many, etc etc) to combine these simple parsers into larger more complex parsers.

You could almost build an entire parser combinator library from scratch from memory, because none of the individual components are complicated.

replies(13): >>43639775 #>>43639805 #>>43639834 #>>43640597 #>>43641009 #>>43641205 #>>43641459 #>>43641675 #>>43642100 #>>43642148 #>>43643853 #>>43644151 #>>43650405 #

8. giraffe_lady ◴[10 Apr 25 00:49 UTC] No.43639552[source]▶

>>43639374 #

PEGs don't have this problem right? Linear time over length of input? Though I guess most peg libraries probably aren't "pure" pegs to get around its limitations and may reintroduce this issue?

replies(3): >>43639629 #>>43639698 #>>43650782 #

9. yen223 ◴[10 Apr 25 01:02 UTC] No.43639629{3}[source]▶

>>43639552 #

PEGs don't have this problem only because they place restrictions on the grammar.

In practice, this isn't a problem, but it does require you write grammar rules in a specific way (e.g. no left-recursion)

10. o11c ◴[10 Apr 25 01:16 UTC] No.43639698{3}[source]▶

>>43639552 #

No, PEG is prone to exactly this in its default form. Parser combinator usually means something PEG-ish anyway.

If you memoize (packrat), it improves to polynomial time (not sure what, but it's not linear; that's false advertising and a fraudulent claim), but you're still stuck with the vulnerability to bugs in the grammar.

A better idea is to create a parser-combinator-style API on top of an LL or LR backend, for guaranteed linear time and only enough stack space for the deepest part of the AST.

11. thaumasiotes ◴[10 Apr 25 01:30 UTC] No.43639775[source]▶

>>43639551 #

> Take the hard problem of parsing a language, break it down into the much simpler problem of reading tokens, then write simple "combinators" (and, or, one_or_many, etc etc) to combine these simple parsers into larger more complex parsers.

That's a terrible example for "parser combinators beat regexes"; those are the three regular operations. (Except that you use zero_or_many in preference to one_or_many.)

replies(1): >>43641710 #

12. egonschiele ◴[10 Apr 25 01:36 UTC] No.43639805[source]▶

>>43639551 #

Hey! I love parser combinators and wish they were more mainstream. That's why I wrote Tarsec, which is Parsec for TypeScript: https://github.com/egonSchiele/tarsec

I have a short and a long intro, both illustrated: https://github.com/egonSchiele/tarsec/blob/main/tutorials/5-... and https://github.com/egonSchiele/tarsec/blob/main/tutorials/th...

It also has a neat debug mode, and a way to benchmark performance. I've loved using it, hopefully someone reading will use it too! Use this as a chance to learn something cool (parser combinators)!

replies(2): >>43639911 #>>43640797 #

13. skybrian ◴[10 Apr 25 01:42 UTC] No.43639834[source]▶

>>43639551 #

Never is a long time. Other ideas from functional language have become mainstream.

replies(1): >>43639896 #

14. thaumasiotes ◴[10 Apr 25 01:46 UTC] No.43639845[source]▶

>>43639374 #

Only PCREs are exponential time, in service of a feature you basically never need. Regexes are always linear time.

They can take exponential space, though, so I'm not sure why knowing you'll be able to process the data in linear time is supposed to keep you safe.

replies(4): >>43639936 #>>43639940 #>>43640180 #>>43640937 #

15. kazinator ◴[10 Apr 25 01:51 UTC] No.43639873[source]▶

>>43638422 (OP) #

> It uses the pcre-heavy Haskell library which in turn calls out to the system-wide pcre C library for actually compiling and running the regex.

That's not true regular expressions, but a backtracker with eclectic features.

The regex used in the regex solution is just:

  mul\((\d+),(\d+)\)

not requiring PCRE.

If Haskell happens to provide bindings to the POSIX regcomp/regexec stuff, that would be something to try instead. \d has to be replaced with [0-9], needless to say.

16. DadBase ◴[10 Apr 25 01:55 UTC] No.43639894[source]▶

>>43638422 (OP) #

Parser combinators are great until you need to parse something real, like CSV with embedded newlines and Excel quotes. That’s when you reach for the reliable trio: awk, duct tape, and prayer.

replies(2): >>43640949 #>>43640977 #

17. yen223 ◴[10 Apr 25 01:55 UTC] No.43639896{3}[source]▶

>>43639834 #

I would be very happy to be proven wrong here.

But I view parser combinators in the same lens as lenses, another powerful approach to solving a specific problem that is unlikely to receive mainstream recognition because of all the Haskell.

replies(3): >>43639982 #>>43640004 #>>43643966 #

18. yen223 ◴[10 Apr 25 01:58 UTC] No.43639911{3}[source]▶

>>43639805 #

Love the intro articles you wrote!

19. codebje ◴[10 Apr 25 01:58 UTC] No.43639912[source]▶

>>43639465 #

The main advantage of recursive descent parsing is readability. Forget the machinery of the parser, which is (a) trivial enough that AI will generate correctly it for you with next to no prompting, and (b) widely available in libraries anyway. The win is writing a parser that reads like the grammar it implements, which means changes to your grammar are easy to apply to your parser.

Regexes are effectively write-only. If you change the grammar parsed by a regex, you're probably going to discard the regex and make a new one.

20. dullcrisp ◴[10 Apr 25 02:02 UTC] No.43639936{3}[source]▶

>>43639845 #

I don’t believe space > time is possible on a Turing machine?

replies(1): >>43639958 #

21. pjscott ◴[10 Apr 25 02:03 UTC] No.43639940{3}[source]▶

>>43639845 #

This depends on the implementation of the regex engine; most are potentially superlinear in time, since that’s the easiest way of doing it, and quite fast until suddenly it isn’t. I always check the algorithm being used before I use regular expressions in production. I was surprised how many use a recursive descent strategy!

(Also, BTW, you can deal with the exponential space issue by using an NFA instead of a DFA – it’s slower that way, but the memory space required is reliably linearly proportional to the size of the regex.)

22. pjscott ◴[10 Apr 25 02:05 UTC] No.43639958{4}[source]▶

>>43639936 #

You’re right, of course, but there was a minor miscommunication: the exponential space is exponentially proportional to the size of the regular expression, and the linear time is linearly proportional to the length of the string being searched through.

replies(1): >>43639996 #

23. layer8 ◴[10 Apr 25 02:07 UTC] No.43639965[source]▶

>>43639465 #

Whenever I write a regex, I end up with a comments roughly ten times longer than the regex. That being said, regular expressions are often the right tool for the job (i.e. parsing a regular language, as opposed to a context-free language or whatever), just the syntax becomes unreadable rather quickly. I’m sure you could build a nicer regular-expression syntax in Haskell.

replies(3): >>43640776 #>>43640864 #>>43642797 #

24. wk_end ◴[10 Apr 25 02:09 UTC] No.43639982{4}[source]▶

>>43639896 #

FWIW lenses are lovely in Typescript with monocle-ts, and extremely useful when - speaking of functional ideas hitting the mainstream! - writing Redux reducers to provide updates for immutable data.

replies(1): >>43640499 #

25. dullcrisp ◴[10 Apr 25 02:13 UTC] No.43639996{5}[source]▶

>>43639958 #

Thanks! Though I imagine in most cases the regular expression itself would be a fixed part of the codebase and not given as input.

26. BoiledCabbage ◴[10 Apr 25 02:14 UTC] No.43640004{4}[source]▶

>>43639896 #

> But I view parser combinators in the same lens as lenses

Hmm, anything else you have in that list I should be looking into? As I also agree they are two of the most interesting things I got from learning Haskell. (Along with type driven reasoning, monoids and functors).

replies(1): >>43640188 #

27. moregrist ◴[10 Apr 25 02:51 UTC] No.43640180{3}[source]▶

>>43639845 #

It’s not only PCRE. Any dialect with back references will require a backtracking engine that can go exponential with the wrong expression and input.

This includes most flavors of regexp you find in the wild: Python’s re module, JavaScript regular expressions, Ruby’s regular expressions, Perl, PCRE, and even basic and extended REs used in grep.

Russ Cox has written some very accessible posts on the linear (in input) properties of NFAs and the equivalence of Thompson regular expressions [0]. There’s also quite a bit of literature on the Glushkov construction of regular expressions (and its NFA equivalence) [1] that’s worth reading if you find the topic interesting.

Both Go and Rust have non-backtracking regular expression libraries, and you can find solid non-backtracking C and C++ libraries for regular expressions (eg: libfsm and Hyperscan).

0: https://swtch.com/~rsc/regexp/ 1: My favorite is _Flexible Pattern Matching Strings: Practical On-Line Search Algorithms for Texts and Biological Sequences_ by Gonzalo Navarro

28. maxbond ◴[10 Apr 25 02:51 UTC] No.43640182[source]▶

>>43638422 (OP) #

I've been working a lot with parser combinators lately, they are a really nice compromise between using grammar languages (I'd put regexes in this category, as well as parser generators) and hand-rolled imperative parsers. The parser combinator library ends up being a DSL for describing grammars in your preferred language, but the boundary is fluid, and you can move back and forth between writing grammars and writing code in your preferred language.

The bold implementer could integrate regexes into their parser combinator library, and you could move between all three.

29. yen223 ◴[10 Apr 25 02:53 UTC] No.43640188{5}[source]▶

>>43640004 #

The big one that springs to mind, though it's not a Haskell idea, are algebraic effects and effects handler

30. ◴[10 Apr 25 03:07 UTC] No.43640240[source]▶

>>43639374 #

31. ◴[10 Apr 25 03:59 UTC] No.43640499{5}[source]▶

>>43639982 #

32. Tabular-Iceberg ◴[10 Apr 25 04:23 UTC] No.43640597[source]▶

>>43639551 #

I’ve seen it in practice. I decided to use a Java parser combinator library for one problem. As soon as I was off the project they ripped out my code and replaced it with regexes.

That what was to be parsed wasn’t regular apparently did not matter.

33. CBLT ◴[10 Apr 25 05:01 UTC] No.43640776{3}[source]▶

>>43639965 #

I love the verbose flag[0] to regex, so I can write comments inline.

[0] https://docs.python.org/3/library/re.html#re.VERBOSE

34. furyofantares ◴[10 Apr 25 05:06 UTC] No.43640797{3}[source]▶

>>43639805 #

These are great.

I accidentally invented parser combinators for a project once. I was quite fond of the approach but it ended up being pretty nearly unusable for anyone else on the team. This was problematic because the project was such that we were always adding new things to be parsed. The whole reason I took the approach was that I wanted people to be able to freely add parsers as new requirements came in.

It being my first shot at it, it was for sure not as usable as it could be. But looking at your stuff, which is delightful, I am filled with doubt that my goal was even possible. Your stuff looks about as clean as possible, and yet seems likely to be almost as impenetrable to someone not versed in parsing.

That said, I am gonna look for a chance to use this in appropriate side projects.

replies(2): >>43641927 #>>43645093 #

35. f1shy ◴[10 Apr 25 05:16 UTC] No.43640864{3}[source]▶

>>43639965 #

Yes. Regex tend to become rather fast write only. One solution is commenting, but is still complex. What I like to do now (in C) is define parts of it. Just a crude example to get the idea:

   // STRing: matches anything inside quotes (single or double)
   #define STR "[\"'](.*)[\"']"
   // NUMber: matches decimal or hexadecimal numbers
   #define NUM "([[:digit:]]x?[[:xdigit:]]*)"
   
   regcomp(&reg_exp, STR NUM , REG_EXTENDED | REG_ICASE);

So at the end I compose the RE with the various parts, which are documented separately.

36. masklinn ◴[10 Apr 25 05:29 UTC] No.43640937{3}[source]▶

>>43639845 #

> Only PCREs are exponential time, in service of a feature you basically never need. Regexes are always linear time.

Any re dialect which supports backtracking necessarily has a non-linear worst case, and while a select few have very high resilience against exponential backtracking (e.g. never managed to make postgres fall over) most can be made to fail with a pattern a few characters long.

FA-based engines are getting more popular, but they’re far from universal.

37. masklinn ◴[10 Apr 25 05:32 UTC] No.43640949[source]▶

>>43639894 #

Seems like exactly the situation where you’d want parsers, because they can do any form of quoting \ escaping you want and have no reason to care about newlines any more than they do any other character.

replies(1): >>43643957 #

38. iamevn ◴[10 Apr 25 05:38 UTC] No.43640977[source]▶

>>43639894 #

I don't follow why parser combinators would be a bad tool for CSV. It seems like one would specify a CSV parser as (pardon the pseudocode):

  separator = ','
  quote = '"'
  quoted_quote = '""'
  newline = '\n'
  plain_field = sequence(char_except(either(separator, quote, newline)))
  quoted_field = quote + sequence(either(char_except(quote), quoted_quote)) + quote 
  field = either(quoted_field, plain_field)
  row = sequence_with_separator(field, separator)
  csv = sequence_with_separator(row, newline)

Seems fairly natural to me, although I'll readily admit I haven't had to write a CSV parser before so I'm surely glossing over some detail.

replies(2): >>43641113 #>>43643933 #

39. worldsayshi ◴[10 Apr 25 05:43 UTC] No.43641009[source]▶

>>43639551 #

I remember logstash having a quite intuitive parser language, called grok I think, where recursive decent was built on top of regex. We should use something like that!

40. kqr ◴[10 Apr 25 06:04 UTC] No.43641113{3}[source]▶

>>43640977 #

I think GP was sarcastic. We have these great technologies available but people end up using duct tape and hope anyway.

replies(1): >>43643935 #

41. kqr ◴[10 Apr 25 06:08 UTC] No.43641136[source]▶

>>43638920 #

Basic string search and indexing operations (especially when there is language/library support for cheap spans) are definitely underappreciated.

I wouldn't reach for them as a first step though, because they'd take more automated tests to give me confidence of correctness, and most problems I end up solving are not CPU bound but programmer time bound.

42. kleiba ◴[10 Apr 25 06:19 UTC] No.43641180[source]▶

>>43639341 #

Of course you could also pull out the good old Chomsky hierarchy and make an argument against regexes based on whatever the nature of your data is.

But the thing is: the beauty of regexes lies in their compactness (which, in turn, can make them quite difficult to debug). So, of course, if you want to optimize for some other measure, you'd use an alternative approach (e.g. a parser). And there are a number of worthwhile measures, such as e.g. the already mentioned debuggability, appropriateness for the problem at hand in terms of complexity, processing speed, ease of using the match result, the ability to get multiple alternative matches, support of regexes in your language of choice, etc.

But simply stating "X beats regexes" without saying in what respect leaves something to be desired.

43. fooker ◴[10 Apr 25 06:24 UTC] No.43641205[source]▶

>>43639551 #

Parser combinators are already mainstream. All/most real languages are implemented with hand written recursive descent parsers that closely resemble parser combinators, or at the very least borrow a bunch of concepts.

It's the name that's unlikely to become mainstream. That's squarely upon FP folks giving sci-fi names to common sense concepts.

replies(1): >>43642845 #

44. internet_points ◴[10 Apr 25 06:42 UTC] No.43641298[source]▶

>>43639374 #

This is one of the reasons I've been afraid to use parser combinators too heavily. With regular (finite-state) languages I know their time usage, with parser combinators I have no clue, and there are so many different libraries and implementations and some are monadic and some are applicative and few mention worst-case. There are benchmarks https://gitlab.com/FinnBender/haskell-parsing-benchmarks#byt... but what kinds of "dangerous operations" do I have to watch out for when writing with parser combinators?

(I guess part of the problem is just that regular languages have been studied since Kleene had a full head of hair, while parser combinators were more or less unknown until the 80's.)

replies(2): >>43641727 #>>43643059 #

45. torginus ◴[10 Apr 25 07:06 UTC] No.43641440[source]▶

>>43638422 (OP) #

Functional programming languages are uniquely suitable for writing parsers for grammars, their whole feature set (discriminated unions, pattern matching etc.) is very suited to turning text into ASTs. It's not often that one needs to parse a custom DSL grammar.

That said, I'd say it's a pretty rare occasion to have that need, and other languages have great libraries for accomplishing the same, I'm partial to Chevrotain for Javascript (https://chevrotain.io/docs/), which has a small DSL for describing grammars.

Parser combinators are both less efficient and less featureful than bespoke solutions, which is a problem when you're writing a parser for your custom language, and you want to add extra features, like resilience, or rich error reporting.

To add a bit of snark, functional languages seem to best suited for writing parsers/compilers for other functional programming languages, which is pretty telling of the inward-facing functional culture.

replies(1): >>43641602 #

46. meinersbur ◴[10 Apr 25 07:09 UTC] No.43641459[source]▶

>>43639551 #

Flang (LLVM Fortran Compiler) uses parser combinators.

https://github.com/llvm/llvm-project/blob/main/flang/lib/Par...

47. mrkeen ◴[10 Apr 25 07:38 UTC] No.43641602[source]▶

>>43641440 #

> Parser combinators are both less efficient and less featureful than bespoke solutions, which is a problem when you're writing a parser for your custom language, and you want to add extra features, like resilience, or rich error reporting.

This is untrue. PCs are the bespoke solutions. The article shows how to introduce a feature like state. Parser generators are the technique which just dump fixed-functionality source code on you.

replies(1): >>43641778 #

48. asQuirreL ◴[10 Apr 25 07:49 UTC] No.43641664[source]▶

>>43638422 (OP) #

This is a false dichotomy -- regexes and parsers both have their place, even when solving the same problem.

The troubles start when you try and solve the whole thing in one step, using just regular expressions, or parsers.

Regular expressions are good at tokenizing input (converting a stream of bytes into a stream of other things, e.g. picking out numbers, punctuation, keywords).

Parsers are good at identifying structure in a token stream.

Neither are good at evaluation. Leave that as its own step.

Applying this rule to the example in the article (Advent of Code 2024 Day 3), I would still use regular expressions to identify mul(\d+,\d+), do(), and don't(), I don't think I need a parser because there is no extra structure beyond that token stream, and I would leave it up to the evaluator to track the state of whether multiplication is enabled or not.

49. mrkeen ◴[10 Apr 25 07:52 UTC] No.43641675[source]▶

>>43639551 #

> You could almost build an entire parser combinator library from scratch from memory, because none of the individual components are complicated.

I do this every now and then. Usually in the days leading up to advent of code.

"A parser is a function from an input to either failure or a (remaining input, result) pair" is all you need to remember to build the rest.

replies(1): >>43642360 #

50. mrkeen ◴[10 Apr 25 07:57 UTC] No.43641710{3}[source]▶

>>43639775 #

Can I see the (implied) counter-example?

A regex which turns a language's source code into an AST?

replies(1): >>43642623 #

51. mrkeen ◴[10 Apr 25 08:00 UTC] No.43641727{3}[source]▶

>>43641298 #

> I've been afraid to use parser combinators

> With regular (finite-state) languages I know their time usage

Are you talking about parsers or grammars?

replies(1): >>43643009 #

52. torginus ◴[10 Apr 25 08:09 UTC] No.43641778{3}[source]▶

>>43641602 #

Parser combinators define a recursive parsing implementation, as well as declaration syntax. While it's very flexible in terms of what it can parse, it has no fixed guarantees on making progress on the token stream or having a bounded lookahead size before they can parse anything.

A hand written LL(1) parser, (for a suitable grammar) is much more efficient than a parser combinator, afaik most production languages use a hand-written parser, with parser combinators not really being used in high-effort implementations.

I have no extensive experience in surveying parser generators (only using them, and they seemed to be fast enough), I'd guess it's up to the implementation on how efficient parsing code is, with my ballpark assumption being that it's worse than handwritten code.

53. miningape ◴[10 Apr 25 08:31 UTC] No.43641927{4}[source]▶

>>43640797 #

Yeah parsers can be tricky "culturally" - as you've found there are many devs who haven't written a recursive descent parser and therefore they lack any meaningful context to understand what's happening under the hood.

It's a similar scenario to if you made your own RegEx and none of the devs on your team know's what it is. It just takes too long to teach them that on the job since there's very few quality resources you can point them towards - so you're stuck teaching everything.

On the other hand, if you have a team full of experienced PL developers doing this could significantly improve their productivity.

54. foldr ◴[10 Apr 25 09:04 UTC] No.43642100[source]▶

>>43639551 #

As a sibling points out, recursive descent parsers have been mainstream for a long time. Higher order parser combinators are only moderately useful in an imperative language with regular looping constructs. (For example, the code to parse a list of Xs separated by Ys is a pretty straightforward loop.) A lot of parser combinator libraries also make the mistake of encouraging "short circuit" error handling, which is rarely what you want for a production-grade parser.

55. conaclos ◴[10 Apr 25 09:12 UTC] No.43642148[source]▶

>>43639551 #

I once tried to use parser combinators and I was quite disappointed by the result. I find it harder to read and to handle error than regular recursive descendant parsers.

replies(1): >>43646849 #

56. smcl ◴[10 Apr 25 09:39 UTC] No.43642296[source]▶

>>43638422 (OP) #

Obligatory plug for Sprache (https://github.com/sprache/Sprache) - a nice library I've been using whenever I've needed to create a little parser for something in C#. I'm not dogmatically opposed to using regexes, just for me they feel quite clunky for many tasks and tend to grow legs and become harder to understand.

57. idahoduncan ◴[10 Apr 25 09:54 UTC] No.43642360{3}[source]▶

>>43641675 #

I do this from time to time as well, although I tend to get hung up on error handling. I would say that it's a simple enough exercise if you don't care too much about reporting parse errors in a meaningful way, although I'd be happy for someone to point out an implementation of error handling that fits into one's head.

replies(1): >>43647129 #

58. thaumasiotes ◴[10 Apr 25 10:46 UTC] No.43642623{4}[source]▶

>>43641710 #

What implied counterexample? Regular expressions define the simplest language type a CS education generally covers. You would expect anything to be more capable.

But you wouldn't prove it by demonstrating that you can recognize regular languages. That's the thing regular expressions are good at!

59. zokier ◴[10 Apr 25 11:29 UTC] No.43642797{3}[source]▶

>>43639965 #

> just the syntax becomes unreadable rather quickly. I’m sure you could build a nicer regular-expression syntax in Haskell.

Of course regular expressions are really more of a category of expressions, and the traditional kleene star notation is only one of many options; regular expressions do not somehow inherently need to use that specific syntax.

Pomsky and VerbalExpressions are just some examples of alternative syntaxes for regex. Apparently there is even a port of VerbalExpressions for Haskell:

https://github.com/VerbalExpressions/HaskellVerbalExpression...

replies(1): >>43643224 #

60. antonvs ◴[10 Apr 25 11:37 UTC] No.43642845{3}[source]▶

>>43641205 #

Presumably you're not objecting to the word "parser".

What name would you use for "combinator"? It's a pretty simple word, based on the root "combine".

The common sense concept here is functions whose results depends only on their arguments, which makes them well-suited for being used in "combination" with other combinators, as in `f . g` or `f(g(x))`. That's called function composition, which can be defined as the act of combining functions.

It's a term that has existed in logic for over a century, from before computers existed.

It seems to me that no amount of renaming of terms will fix people's lack of curiosity or lack of desire to learn or understand. That's on them.

replies(1): >>43646757 #

61. virgilp ◴[10 Apr 25 12:05 UTC] No.43642990[source]▶

>>43638422 (OP) #

Parser combinators will eventually generate a push-down automaton. Regex would eventually generate a deterministic finite-state automaton. There's no contest, CPU-wise the DFA will parse the string faster (just have to do one transition for each char/ it's one lookup). In some cases it might take more memory for the automaton though. (also this assumes you don't go into library-specific extensions that allow backtracking, but keep within the original regex scope)

It's fine and fair to suggest that parser combinators are better for many usecases (first of all, they're clearly more powerful! they can parse any context-free language). But it's misleading to suggest they "beat" regex (they don't - not on the regex turf).

replies(1): >>43649771 #

62. wyager ◴[10 Apr 25 12:08 UTC] No.43643009{4}[source]▶

>>43641727 #

There's a correspondence between ???/applicative/monadic parsers and regular/context free/context sensitive grammars.

replies(2): >>43643282 #>>43644232 #

63. wyager ◴[10 Apr 25 12:15 UTC] No.43643059{3}[source]▶

>>43641298 #

The evaluation model of most of these parser combinator libraries is simple enough that you can just think directly about the call tree. It's difficult to get "surprised" like you can with PCREs. E.g. (x+x+)+y as a regex may blow up, but "many1 (many1 x >> many1 x) >> y" will just evaluate forwards a single time, greedily, and fail, at least with any haskell parser combinator library I can think of.

64. ◴[10 Apr 25 12:20 UTC] No.43643091[source]▶

>>43638422 (OP) #

65. qrobit ◴[10 Apr 25 12:43 UTC] No.43643224{4}[source]▶

>>43642797 #

I looked at the VerbalExpressionJS[1] example and it looks like combining parsers to me. If you need to make regex more verbose, better use parser combinator library when available. RegEx benefits compared to parser combinators other than compactness aren't obvious to me.

[1]: <https://github.com/VerbalExpressions/JSVerbalExpressions/tre...>

66. internet_points ◴[10 Apr 25 12:52 UTC] No.43643282{5}[source]▶

>>43643009 #

Now I'm really curious what ??? will turn out to be :-D

67. wruza ◴[10 Apr 25 13:59 UTC] No.43643853[source]▶

>>43639551 #

It's never the origin but the ease of install/use/mentor and compatibility across envs.

68. DadBase ◴[10 Apr 25 14:08 UTC] No.43643933{3}[source]▶

>>43640977 #

Ah, you've clearly never had to parse a CSV exported from a municipal parking database in 2004. Quoted fields inside quoted fields, carriage returns mid-name, and a column that just says "ERROR" every 37th row. Your pseudocode would flee the scene.

69. DadBase ◴[10 Apr 25 14:09 UTC] No.43643935{4}[source]▶

>>43641113 #

Exactly. At some point every parser combinator turns into a three-line awk script that runs perfectly as long as the moon is waning and the file isn’t saved from Excel for Mac.

70. DadBase ◴[10 Apr 25 14:11 UTC] No.43643957{3}[source]▶

>>43640949 #

Parsers can handle it, sure, but then you blink and you're ten layers deep trying to explain why a single unmatched quote ate the rest of the file. Sometimes a little awk and a strong coffee gets you further.

replies(1): >>43648879 #

71. wruza ◴[10 Apr 25 14:12 UTC] No.43643966{4}[source]▶

>>43639896 #

It's unlikely because people don't get why jump through so many hoops when you can

  new = copy(old)
  new.foo.bar.baz = new_baz
  return new

It won't come out of Haskell because it's uniquely a solution to a specific Haskell problem. Mutating a fresh copy through a "lens" when you have `=` is a level of geekage that cannot become mainstream.

replies(2): >>43647066 #>>43649028 #

72. bazoom42 ◴[10 Apr 25 14:21 UTC] No.43644069[source]▶

>>43639465 #

Sounds like you think the comments and explantions are the problem? You can write regexes with comments and parsers without. Regexes are not generally known to be self explanatory, except for trivial cases like \d+

73. bazoom42 ◴[10 Apr 25 14:28 UTC] No.43644151[source]▶

>>43639551 #

> will never be mainstream because it had the bad fortune of coming from the functional programming world

Nah, mainstream developer don’t care where an idea comes from as long as it solves the problem and helps them hit the deadline. E.g. linq in C# have been broadly adopted.

Regexes are disliked by many developers, so parser combinators should be an easy sell if you can show it solves realistic problems with simpler code.

74. pklausler ◴[10 Apr 25 14:28 UTC] No.43644154[source]▶

>>43638422 (OP) #

Parser combinators work well for big messy problems as well as regexes; I used a custom library of combinators to write the Fortran parser for LLVM. They’re great for error recovery and for disambiguating tricky situations, of which Fortran has more than its share; I ended up parsing cooked characters without a tokenization phase.

One hard part is getting good performance when the same production is being tried as part of multiple higher level alternatives and backtracking is taking place. You can get into trouble for bad input cases, like expressions with thousands of operands being used in a context that isn’t clear until you get to characters after the expression. But there’s ways to solve that problem, so I recommend combinators to you if you need to write a parser.

replies(1): >>43646027 #

75. BoiledCabbage ◴[10 Apr 25 14:35 UTC] No.43644232{5}[source]▶

>>43643009 #

Interesting, could you give some more detail?

replies(1): >>43644669 #

76. wyager ◴[10 Apr 25 15:13 UTC] No.43644669{6}[source]▶

>>43644232 #

I tried googling to find an article, and I found some stuff explaining it, but this seems to be deeper lore than I thought it was.

Basically, a monadic parser combinator can have code that "inspects" a previously parsed value (context-sensitivity) but an applicative parser cannot.

Imagine an input like "3 alice bob dave", with a number and then that many names.

We want to parse

   data Parsed = Parsed {count :: Int, names :: [Name]}

Example: monadic parser:

  count <- parseInt
  names <- parseN count name
  return (Parsed count names)

You need to know the value of count before you keep parsing. Context-sensitive.

Applicative parsers don't let you "inspect" the parsed values. You can do stuff like

  Parsed <$> parseInt <*> many name

But if it's not clear where in the input the list of name ends without looking at the output of `parseInt`, you're hosed. There's no way to inspect the output of "parseInt" while you are parsing with an applicative parser.

You could do something like:

      Parsed <$> literal "1" <*> replicateM 1 name
  <|> Parsed <$> literal "2" <*> replicateM 2 name
  <|> ...

where you have an alternative case for each possible number, but obviously this does not generalize to parse realistic inputs.

Technically, you can use Haskell's laziness to parse this particular grammar efficiently enough using applicatives+alternatives to construct an infinite parse tree, but that's kind of an advanced edge case that won't work in most languages.

replies(1): >>43645679 #

77. egonschiele ◴[10 Apr 25 15:47 UTC] No.43645093{4}[source]▶

>>43640797 #

The idea of functions returning functions makes a lot of people's heads spin! I've struggled to get people using this library too, but I use it in personal work and it is an absolute joy. One of my favorite things I've made.

replies(1): >>43645573 #

78. kreig ◴[10 Apr 25 16:15 UTC] No.43645397[source]▶

>>43638422 (OP) #

It makes sense, Combinators can parse a superset of formal languages than regex, according to Chomsky's hierarchy [0].

[0]: https://en.wikipedia.org/wiki/Chomsky_hierarchy

79. TuringTest ◴[10 Apr 25 16:32 UTC] No.43645573{5}[source]▶

>>43645093 #

You can typically trick them into thinking that you're returning an object with a single method ;-)

replies(1): >>43660500 #

80. BoiledCabbage ◴[10 Apr 25 16:41 UTC] No.43645679{7}[source]▶

>>43644669 #

Thanks, and yes that makes perfect sense. It's a useful way to think about the problem space.

And then it does lead back to your "????" - which presumably represents the answer to the question of "What's the simplest abstraction that allows one to build a "Parser" (would this still be using combinators??) that is powerful enough to parse regular languages, but, by design, not powerful enough to parse context-free languages?"

replies(1): >>43647041 #

81. pklausler ◴[10 Apr 25 17:13 UTC] No.43646027[source]▶

>>43644154 #

(adding to the above)

Another downside of parser combinators, at least the ones that I wrote as a basis for Fortran parsing, is that if you implement them as C++ template classes, the types get astonishingly long in their names. This leads to long compilation times and makes debuggers very hard to use. Both of these concerns only affect the developer, of course, and seemed to me to be acceptable so long as the parser is fast and robust.

I'd use parser combinators again if I ever need to write a big parser, although I might not use overloaded operators (">>", "/") and confusing function names ("many" vs. "some") inherited from Haskell's Parsec library.

82. fooker ◴[10 Apr 25 18:26 UTC] No.43646757{4}[source]▶

>>43642845 #

Every piece of software in existence calls functions and composes functions, there's no reason to attach 'combinator' to everything.

I do not doubt the correctness of the term, just the point of naming it. The fact that people write parsers all the time using these concepts without using the name proves that the name was not a great one.

> It's a term that has existed in logic for over a century, from before computers existed.

Yes, that's a great reason to not overload mathematical terms for random things. Monads and monoids are a great example of when this goes a bit wrong.

>It seems to me that no amount of renaming of terms will fix people's lack of curiosity or lack of desire to learn or understand.

Yes, I agree. Most people are just writing programs and solving problems, and moving on with their lives. It's weird to insist that they call something by a specific name if it's a simple idea that thousands of people have developed independently.

replies(1): >>43736086 #

83. fooker ◴[10 Apr 25 18:36 UTC] No.43646849{3}[source]▶

>>43642148 #

The trick is to have your grammar or combinator setup to have error productions for all the common errors you can think of.

That way you get nice, targeted error messages.

84. wyager ◴[10 Apr 25 18:58 UTC] No.43647041{8}[source]▶

>>43645679 #

Yeah, exactly. I don't know what it would look like. It would be nice if the answer was "functorial", since that's at the bottom of the functor-applicative-monad hierarchy, but functor doesn't provide any way to actually combine the combinators, so that's out.

85. kqr ◴[10 Apr 25 19:02 UTC] No.43647066{5}[source]▶

>>43643966 #

This is a common misconception. Lenses (or more generally optics) are not limited to targeting single fields. They can target composites, alternatives, etc. They go way beyond a normal accessor dot and can do stuff that cannot be done in mainstream languages.

Some of the other ideas from lenses are becoming mainstream, like the .? conditional accessor in e.g. C# and EcmaScript.

replies(1): >>43648783 #

86. debugnik ◴[10 Apr 25 19:10 UTC] No.43647129{4}[source]▶

>>43642360 #

Here's the most basic error handling to implement, which I learned from studying FParsec:

1. Keep a collection of errors in the stream. It represents errors at the current position.

2. Clear all errors when input is consumed.

3. When a token-like parser fails without consuming input, push an "expected '<literal>'" error.

4. When a labeled parser fails without consuming input, restore the error set as it was before running it, and push an "expected <label>" error instead.

This will naturally accumulate errors from choice and repetition combinators, as long as you don't have needless backtracking before the choice points.

So this already gives you reasonable errors like "foo:17:4: expected one of expression, assignment or 'end'". And it's easy to extend to nested errors, backtracking information or recoverable nodes.

87. wruza ◴[10 Apr 25 23:02 UTC] No.43648783{6}[source]▶

>>43647066 #

They don't surpass Turing Machine level, so I'm not convinced.

The idea of ?. was there since the beginning of times. You don't have to dig deep to realize that (x ? x.y : nil) and (x ? x : y) sucks. Even gnu C had (x ?: y) for quite a while. It's just that the designers of to-become-mainstream languages all start with the same "density" and ignore the damn obvious, and then rather than fixing it, chew on the same dilemmas they all inevitably introduce. I sometimes wonder if they have written regular code at all or started inventing languages right after school.

88. maxbond ◴[10 Apr 25 23:22 UTC] No.43648879{4}[source]▶

>>43643957 #

Use what's working for you but one of the strengths of parser combinators is managing abstraction so that you can reason locally rather than having to manage things "ten layers deep." That's more of a problem in imperative approaches where you are manually managing a complex state machine, and can indeed have very deep `if` hierarchies.

Parser combinator are good at letting you express the grammar as a relationship among functions so that the compiler handles the nitty-gritty and error prone bits. (Regexes also abstract away these details, of course.)

89. yen223 ◴[10 Apr 25 23:49 UTC] No.43649028{5}[source]▶

>>43643966 #

This comment proves my point about how "all the Haskell" hinders mainstream adoption of a thing. It's very easy for folks to miss the point of the thing!

Lenses and more generally optics are a lot more powerful than just dot-accessors. They are closer in spirit to XPath or jq, except that they are general enough to work on most data structures, not just XML or Json.

replies(1): >>43652398 #

90. fooker ◴[11 Apr 25 02:01 UTC] No.43649771[source]▶

>>43642990 #

>Parser combinators will eventually generate a push-down automaton

No, you can have context sensitivity.

91. Rendello ◴[11 Apr 25 02:26 UTC] No.43649923[source]▶

>>43638422 (OP) #

A few days ago I mentioned my first parser [1], which used regex in the tokenization step, based on studying the parser video by Destroy All Software. I'm glad I learned it that way first, since it takes a lot of the pain out of lexing. I've now built many parsers, including accidentally re-inventing parser combinators when working in Erlang. Two days ago I built another parser that uses regex heavily for tokenization/some parsing steps. I never parse programming languages, so my programs' needs are quite different from the parsers talked about online!

https://news.ycombinator.com/item?id=43627698

92. zahlman ◴[11 Apr 25 04:01 UTC] No.43650405[source]▶

>>43639551 #

> break it down into the much simpler problem of reading tokens, then write simple "combinators" (and, or, one_or_many, etc etc) to combine these simple parsers into larger more complex parsers.

The popular `pyparsing` library for Python (https://pypi.org/project/pyparsing/) expresses these concepts, just written more in traditional OO style rather than functional style (i.e. there are classes like OneOrMany that you instantiate, rather than binding a single-token-matcher argument to lambda or whatever).

93. vrighter ◴[11 Apr 25 05:32 UTC] No.43650782{3}[source]▶

>>43639552 #

Only if a packrat parser is implemented, which requires a lot of memory.

94. wruza ◴[11 Apr 25 10:37 UTC] No.43652398{6}[source]▶

>>43649028 #

It doesn't. "All the haskell" is your protective decoy against "it's not that useful when you're not chained to the wall". Show me a lens with a bunch of other fp tricks and I'll write dumb inline code for it that does that same thing, but without twisting anyones brain or trying to not miss its thin point.

95. yen223 ◴[12 Apr 25 01:29 UTC] No.43660500{6}[source]▶

>>43645573 #

Getting Java flashbacks from this

96. matt_kantor ◴[19 Apr 25 12:55 UTC] No.43736086{5}[source]▶

>>43646757 #

Say I want to tell someone about how I parse things in my programming language. It happens to use parser combinators.

If I want to avoid the term "parser combinator" because it sounds too math-y, what should I say instead? I could spend a few sentences describing what parser combinators are, but if they've already heard the term (or have the gumption to look it up) then I've wasted our time and made the discussion more convoluted for no reason. If not, they can ask me and then I can describe what they are.

> It's weird to insist that they call something by a specific name

I don't see anyone in this thread doing so.

↑