
Parser Combinators Beat Regexes

(entropicthoughts.com)
120 points mooreds | 7 comments
o11c ◴[] No.43639374[source]
Note that most implementations of both parser combinators and regexes can fail very badly (exponential time). Never use either on untrusted input, unless you can prove your implementation lets you stay linear.
replies(4): >>43639552 #>>43639845 #>>43640240 #>>43641298 #
1. thaumasiotes ◴[] No.43639845[source]
Only PCREs are exponential time, in service of a feature you basically never need. Regexes are always linear time.

They can take exponential space, though, so I'm not sure why knowing you'll be able to process the data in linear time is supposed to keep you safe.

replies(4): >>43639936 #>>43639940 #>>43640180 #>>43640937 #
2. dullcrisp ◴[] No.43639936[source]
I don’t believe space > time is possible on a Turing machine?
replies(1): >>43639958 #
3. pjscott ◴[] No.43639940[source]
This depends on the implementation of the regex engine; most are potentially superlinear in time, since that’s the easiest way of doing it, and quite fast until suddenly it isn’t. I always check the algorithm being used before I use regular expressions in production. I was surprised how many use a recursive descent strategy!

(Also, BTW, you can deal with the exponential space issue by using an NFA instead of a DFA – it’s slower that way, but the memory space required is reliably linearly proportional to the size of the regex.)
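The NFA simulation pjscott describes can be sketched in a few lines. This is a toy hand-built NFA for the classic pattern (a|b)*abb, not a real engine: the matcher advances a *set* of states, touching each input character exactly once, so matching time is linear in the input no matter what, at the cost of tracking up to one set entry per NFA state.

```python
# Toy NFA for (a|b)*abb, simulated Thompson-style with a state set.
# Each input character is processed once, so matching is linear in
# the input length -- there is no backtracking to blow up.
TRANS = {
    (0, 'a'): {0, 1},  # stay in the loop, or guess the final 'abb' starts here
    (0, 'b'): {0},
    (1, 'b'): {2},
    (2, 'b'): {3},
}
ACCEPT = {3}

def nfa_match(s):
    states = {0}
    for ch in s:
        states = set().union(*(TRANS.get((q, ch), set()) for q in states))
        if not states:        # no live states: can never match now
            return False
    return bool(states & ACCEPT)
```

The state set never grows beyond the number of NFA states, which is what makes the memory cost linear in the size of the regex rather than exponential.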

4. pjscott ◴[] No.43639958[source]
You’re right, of course, but there was a minor miscommunication: the exponential space is exponentially proportional to the size of the regular expression, and the linear time is linearly proportional to the length of the string being searched through.
replies(1): >>43639996 #
5. dullcrisp ◴[] No.43639996{3}[source]
Thanks! Though I imagine in most cases the regular expression itself would be a fixed part of the codebase and not given as input.
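The exponential-space point above can be made concrete. The textbook example is the language "the n-th symbol from the end is a": it has an (n+1)-state NFA, but any DFA for it needs at least 2^n states, because the DFA must remember the last n symbols. A quick subset construction (a sketch, not a real engine) counts the reachable DFA states:

```python
def dfa_state_count(n):
    """Determinize the (n+1)-state NFA for (a|b)*a(a|b)^{n-1}
    ("n-th symbol from the end is 'a'") and count DFA states."""
    def step(states, ch):
        nxt = {0}                      # state 0 loops on both symbols
        for s in states:
            if s == 0:
                if ch == 'a':
                    nxt.add(1)         # guess this 'a' is n-th from the end
            elif s < n:
                nxt.add(s + 1)         # count down the remaining symbols
        return frozenset(nxt)

    start = frozenset({0})
    seen = {start}
    frontier = [start]
    while frontier:                    # explore all reachable subsets
        st = frontier.pop()
        for ch in 'ab':
            nx = step(st, ch)
            if nx not in seen:
                seen.add(nx)
                frontier.append(nx)
    return len(seen)
```

For n = 10 this yields 1024 states from an 11-state NFA, which is exactly the blowup pjscott is talking about: exponential in the size of the pattern, not of the input.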
6. moregrist ◴[] No.43640180[source]
It’s not only PCRE. Any dialect with back references will require a backtracking engine that can go exponential with the wrong expression and input.

This includes most flavors of regexp you find in the wild: Python’s re module, JavaScript regular expressions, Ruby’s regular expressions, Perl, PCRE, and even basic and extended REs used in grep.

Russ Cox has written some very accessible posts on the linear (in input) properties of NFAs and the equivalence of Thompson regular expressions [0]. There’s also quite a bit of literature on the Glushkov construction of regular expressions (and its NFA equivalence) [1] that’s worth reading if you find the topic interesting.

Both Go and Rust have non-backtracking regular expression libraries, and you can find solid non-backtracking C and C++ libraries for regular expressions (eg: libfsm and Hyperscan).

0: https://swtch.com/~rsc/regexp/
1: My favorite is _Flexible Pattern Matching in Strings: Practical On-Line Search Algorithms for Texts and Biological Sequences_ by Gonzalo Navarro
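For concreteness, the feature moregrist mentions: a backreference requires the pattern to re-match *the exact text* a group captured earlier, which takes the language beyond what any finite automaton can recognize, and is why engines supporting it must backtrack. A small sketch using Python's re module:

```python
import re

# \1 must match exactly the text captured by group 1, so the language
# matched here ("a word, a space, the same word again") is not regular --
# no finite automaton can recognize it, and a backtracking engine is needed.
doubled_word = re.compile(r'^(\w+) \1$')

assert doubled_word.match('hello hello')
assert not doubled_word.match('hello world')
```

RE2-style engines (Go's regexp package, Rust's regex crate) simply reject backreference patterns at compile time rather than give up their linear-time guarantee.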

7. masklinn ◴[] No.43640937[source]
> Only PCREs are exponential time, in service of a feature you basically never need. Regexes are always linear time.

Any re dialect which supports backtracking necessarily has a non-linear worst case, and while a select few have very high resilience against exponential backtracking (e.g. never managed to make postgres fall over) most can be made to fail with a pattern a few characters long.

FA-based engines are getting more popular, but they’re far from universal.
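A concrete instance of "a pattern a few characters long", assuming CPython's backtracking re engine: the seven-character pattern below takes time roughly doubling with each extra input character on a non-matching string, because the engine tries every way of splitting the run of a's between the two nested quantifiers.

```python
import re
import time

# '(a+)+$' is a classic catastrophic-backtracking pattern: on
# 'a' * n + 'b' there is no match, but the engine explores an
# exponential number of ways to partition the a's before failing.
PAT = re.compile(r'(a+)+$')

for n in (10, 14, 18):
    s = 'a' * n + 'b'
    t0 = time.perf_counter()
    result = PAT.match(s)
    elapsed = time.perf_counter() - t0
    print(f'n={n}: match={result is not None}, {elapsed:.4f}s')
```

The sizes are kept small here so the loop finishes quickly; pushing n into the thirties is enough to hang a process, which is the denial-of-service risk the top comment warns about.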