Parser Combinators Beat Regexes

(entropicthoughts.com)
120 points by mooreds | 22 comments
1. o11c ◴[] No.43639374[source]
Note that most implementations of both parser combinators and regexes can fail very badly (exponential time). Never use either on untrusted input, unless you can prove your implementation lets you stay linear.
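
To see where the blowup lives, here is a deliberately naive backtracking matcher for a tiny regex subset (an illustration, not any particular engine's implementation): the Plus case retries every split point through its continuation, so (x+x+)+y over a run of x's with no trailing y explores exponentially many splits.

  -- Naive continuation-passing backtracker for a tiny regex subset.
  data Rx = Chr Char | Seq Rx Rx | Alt Rx Rx | Plus Rx

  -- k receives the unconsumed input; Alt and Plus retry on failure,
  -- which is exactly where the exponential behavior comes from.
  match :: Rx -> String -> (String -> Bool) -> Bool
  match (Chr c)   (x:xs) k = c == x && k xs
  match (Chr _)   []     _ = False
  match (Seq a b) s k = match a s (\rest -> match b rest k)
  match (Alt a b) s k = match a s k || match b s k
  match (Plus r)  s k = match r s (\rest -> k rest || match (Plus r) rest k)

  -- (x+x+)+y against n x's and no y: noticeably slow by n = 25 or so.
  blowup :: Int -> Bool
  blowup n = match rx (replicate n 'x') null
    where xs = Plus (Chr 'x')
          rx = Seq (Plus (Seq xs xs)) (Chr 'y')
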
replies(4): >>43639552 #>>43639845 #>>43640240 #>>43641298 #
2. giraffe_lady ◴[] No.43639552[source]
PEGs don't have this problem, right? Linear time over the length of the input? Though I guess most PEG libraries probably aren't "pure" PEGs (to get around the formalism's limitations) and may reintroduce this issue?
replies(3): >>43639629 #>>43639698 #>>43650782 #
3. yen223 ◴[] No.43639629[source]
PEGs don't have this problem only because they place restrictions on the grammar.

In practice this isn't a problem, but it does require you to write grammar rules in a specific way (e.g. no left-recursion).
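
For example (a sketch in parsec-style combinators): the left-recursive rule expr := expr '-' num | num would recurse forever under PEG/combinator semantics, so it gets rewritten as one num followed by any number of "- num" suffixes, with a left fold restoring the associativity.

  import Text.Parsec
  import Text.Parsec.String (Parser)

  num :: Parser Int
  num = read <$> many1 digit

  -- expr := expr '-' num | num  (left-recursive, loops forever)
  -- becomes: one num, then zero or more "- num" suffixes, folded left.
  expr :: Parser Int
  expr = foldl (-) <$> num <*> many (char '-' *> num)
  -- parse expr "" "10-3-2"  ==>  Right 5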

4. o11c ◴[] No.43639698[source]
No, PEG is prone to exactly this in its default form. Parser combinator usually means something PEG-ish anyway.

If you memoize (packrat), it improves to polynomial time (not sure what, but it's not linear; that's false advertising and a fraudulent claim), but you're still stuck with the vulnerability to bugs in the grammar.

A better idea is to create a parser-combinator-style API on top of an LL or LR backend, for guaranteed linear time and only enough stack space for the deepest part of the AST.

5. thaumasiotes ◴[] No.43639845[source]
Only PCREs are exponential time, in service of a feature you basically never need. Regexes are always linear time.

They can take exponential space, though, so I'm not sure why knowing you'll be able to process the data in linear time is supposed to keep you safe.

replies(4): >>43639936 #>>43639940 #>>43640180 #>>43640937 #
6. dullcrisp ◴[] No.43639936[source]
I don’t believe space > time is possible on a Turing machine?
replies(1): >>43639958 #
7. pjscott ◴[] No.43639940[source]
This depends on the implementation of the regex engine; most are potentially superlinear in time, since that’s the easiest way of doing it, and quite fast until suddenly it isn’t. I always check the algorithm being used before I use regular expressions in production. I was surprised how many use a recursive descent strategy!

(Also, BTW, you can deal with the exponential space issue by using an NFA instead of a DFA – it’s slower that way, but the memory space required is reliably linearly proportional to the size of the regex.)
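
(A rough sketch of that NFA simulation, with the transition function assumed and epsilon moves elided: keep a set of live states and advance it one character at a time. The live set never exceeds the number of NFA states, so space stays linear in the regex size instead of the exponentially many DFA subsets.)

  import Data.Set (Set)
  import qualified Data.Set as Set

  type State = Int
  type Step  = State -> Char -> Set State  -- NFA transition function

  -- Advance the live-state set across the input, one character at a
  -- time, paying a set-union per character instead of DFA blowup.
  run :: Step -> Set State -> String -> Set State
  run step start = foldl advance start
    where advance live c = Set.unions [step s c | s <- Set.toList live]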

8. pjscott ◴[] No.43639958{3}[source]
You’re right, of course, but there was a minor miscommunication: the exponential space is exponentially proportional to the size of the regular expression, and the linear time is linearly proportional to the length of the string being searched through.
replies(1): >>43639996 #
9. dullcrisp ◴[] No.43639996{4}[source]
Thanks! Though I imagine in most cases the regular expression itself would be a fixed part of the codebase and not given as input.
10. moregrist ◴[] No.43640180[source]
It’s not only PCRE. Any dialect with back references will require a backtracking engine that can go exponential with the wrong expression and input.

This includes most flavors of regexp you find in the wild: Python’s re module, JavaScript regular expressions, Ruby’s regular expressions, Perl, PCRE, and even basic and extended REs used in grep.

Russ Cox has written some very accessible posts on the linear (in input) properties of NFAs and the equivalence of Thompson regular expressions [0]. There’s also quite a bit of literature on the Glushkov construction of regular expressions (and its NFA equivalence) [1] that’s worth reading if you find the topic interesting.

Both Go and Rust have non-backtracking regular expression libraries, and you can find solid non-backtracking C and C++ libraries for regular expressions (eg: libfsm and Hyperscan).

0: https://swtch.com/~rsc/regexp/

1: My favorite is _Flexible Pattern Matching in Strings: Practical On-Line Search Algorithms for Texts and Biological Sequences_ by Gonzalo Navarro

11. ◴[] No.43640240[source]
12. masklinn ◴[] No.43640937[source]
> Only PCREs are exponential time, in service of a feature you basically never need. Regexes are always linear time.

Any re dialect which supports backtracking necessarily has a non-linear worst case, and while a select few have very high resilience against exponential backtracking (e.g. never managed to make postgres fall over) most can be made to fail with a pattern a few characters long.

FA-based engines are getting more popular, but they’re far from universal.

13. internet_points ◴[] No.43641298[source]
This is one of the reasons I've been afraid to use parser combinators too heavily. With regular (finite-state) languages I know their time usage; with parser combinators I have no clue, since there are so many different libraries and implementations, some monadic and some applicative, and few of them mention worst-case behavior. There are benchmarks https://gitlab.com/FinnBender/haskell-parsing-benchmarks#byt... but what kinds of "dangerous operations" do I have to watch out for when writing with parser combinators?

(I guess part of the problem is just that regular languages have been studied since Kleene had a full head of hair, while parser combinators were more or less unknown until the 80's.)

replies(2): >>43641727 #>>43643059 #
14. mrkeen ◴[] No.43641727[source]
> I've been afraid to use parser combinators

> With regular (finite-state) languages I know their time usage

Are you talking about parsers or grammars?

replies(1): >>43643009 #
15. wyager ◴[] No.43643009{3}[source]
There's a correspondence between ???/applicative/monadic parsers and regular/context free/context sensitive grammars.
replies(2): >>43643282 #>>43644232 #
16. wyager ◴[] No.43643059[source]
The evaluation model of most of these parser combinator libraries is simple enough that you can just think directly about the call tree. It's difficult to get "surprised" like you can with PCREs. E.g. (x+x+)+y as a regex may blow up, but "many1 (many1 x >> many1 x) >> y" will just evaluate forwards a single time, greedily, and fail, at least with any haskell parser combinator library I can think of.
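
Concretely, a runnable version of that combinator (a sketch with parsec; the single-character parsers are filled in): it consumes all the x's in one greedy pass, fails at the missing y, and stops, with none of the regex engine's backtracking.

  import Text.Parsec
  import Text.Parsec.String (Parser)

  -- Combinator analogue of (x+x+)+y: a single greedy forward pass.
  pat :: Parser Char
  pat = many1 (many1 (char 'x') >> many1 (char 'x')) >> char 'y'

  main :: IO ()
  main = print (parse pat "" (replicate 50 'x'))  -- fails fast, no blowup
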
17. internet_points ◴[] No.43643282{4}[source]
Now I'm really curious what ??? will turn out to be :-D
18. BoiledCabbage ◴[] No.43644232{4}[source]
Interesting, could you give some more detail?
replies(1): >>43644669 #
19. wyager ◴[] No.43644669{5}[source]
I tried googling to find an article, and I found some stuff explaining it, but this seems to be deeper lore than I thought it was.

Basically, a monadic parser combinator can have code that "inspects" a previously parsed value (context-sensitivity) but an applicative parser cannot.

Imagine an input like "3 alice bob dave", with a number and then that many names.

We want to parse

   data Parsed = Parsed {count :: Int, names :: [Name]}
Example: monadic parser:

  count <- parseInt
  names <- parseN count name
  return (Parsed count names)
You need to know the value of count before you keep parsing. Context-sensitive.
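
As a runnable sketch (parsec assumed; the parseN above is spelled replicateM here):

  import Control.Monad (replicateM)
  import Text.Parsec
  import Text.Parsec.String (Parser)

  data Parsed = Parsed { count :: Int, names :: [String] } deriving Show

  -- The count parsed first steers how many names are parsed next:
  -- context-sensitivity via monadic bind.
  parsed :: Parser Parsed
  parsed = do
    n  <- read <$> many1 digit
    ns <- replicateM n (space *> many1 letter)
    return (Parsed n ns)
  -- parse parsed "" "3 alice bob dave"
  --   ==> Right (Parsed {count = 3, names = ["alice","bob","dave"]})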

Applicative parsers don't let you "inspect" the parsed values. You can do stuff like

  Parsed <$> parseInt <*> many name

But if it's not clear where in the input the list of name ends without looking at the output of `parseInt`, you're hosed. There's no way to inspect the output of "parseInt" while you are parsing with an applicative parser.

You could do something like:

      Parsed <$> literal "1" <*> replicateM 1 name
  <|> Parsed <$> literal "2" <*> replicateM 2 name
  <|> ...
where you have an alternative case for each possible number, but obviously this does not generalize to parse realistic inputs.

Technically, you can use Haskell's laziness to parse this particular grammar efficiently enough using applicatives+alternatives to construct an infinite parse tree, but that's kind of an advanced edge case that won't work in most languages.

replies(1): >>43645679 #
20. BoiledCabbage ◴[] No.43645679{6}[source]
Thanks, and yes that makes perfect sense. It's a useful way to think about the problem space.

And then it does lead back to your "???", which presumably stands for the answer to the question: what's the simplest abstraction that lets one build a parser (would this still be using combinators?) powerful enough to parse regular languages but, by design, not powerful enough to parse context-free languages?

replies(1): >>43647041 #
21. wyager ◴[] No.43647041{7}[source]
Yeah, exactly. I don't know what it would look like. It would be nice if the answer was "functorial", since that's at the bottom of the functor-applicative-monad hierarchy, but functor doesn't provide any way to actually combine the combinators, so that's out.
22. vrighter ◴[] No.43650782[source]
Only if it's implemented as a packrat parser, which requires a lot of memory.