
Regex Isn't Hard (2023)

(timkellogg.me)
75 points asicsp | 4 comments
michaelt ◴[] No.43750496[source]
> e.g. This pattern ([0-9][0-9]?[0-9]][.])+ matches one, two or three digits followed by a . and also matches repeated patterns of this. This would match an IP address (albeit not strictly).

I love regular expressions but one thing I've learned over the years is the syntax is dense enough that even people who are confident enough to start writing regex tutorials often can't write a regex that matches an IP address.

replies(9): >>43750531 #>>43750628 #>>43750641 #>>43750693 #>>43750726 #>>43751250 #>>43751329 #>>43751632 #>>43754055 #
1. vitus ◴[] No.43750726[source]
It's especially ironic given that the title of the post is "Regex Isn't Hard", and then it proceeds to make several (syntactical and logical) errors in the one real-world example.

Syntax error aside (there's an extra ] floating around), it's not even close to correct -- it'll match "999.999.999.000.999." among other things, will never match just one digit (there's a missing ?), and always insists on the trailing dot.

replies(2): >>43750946 #>>43751818 #
2. michaelt ◴[] No.43750946[source]
Correct - it'll accept "999.999.999.000.999." but it'll reject "127.0.0.1"
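
For anyone who wants to check those two cases, here is a minimal sketch in Python. It assumes the intended pattern is the quoted one with the stray ] removed, and it uses whole-string matching via fullmatch; the names INTENDED and accepts_intended are made up for illustration.

```python
import re

# Assumed reading of the tutorial's intent: the quoted pattern minus the stray ']'.
INTENDED = re.compile(r"([0-9][0-9]?[0-9][.])+")

def accepts_intended(text: str) -> bool:
    """Return True only if the whole string matches the pattern."""
    return INTENDED.fullmatch(text) is not None

print(accepts_intended("999.999.999.000.999."))  # True: five groups, trailing dot
print(accepts_intended("127.0.0.1"))             # False: "0." has only one digit
```

With re.search instead of fullmatch, "127.0.0.1" would still produce a partial match on "127.", so the result depends on how the pattern is anchored.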
3. mannykannot ◴[] No.43751818[source]
In practice, the first unpaired ] is treated as an ordinary character (at least according to https://regex101.com/) - which does nothing to make this regex fit for its intended purpose. I'm not sure whether this is according to spec. (I think it is, though that does not really matter compared to what the implementations actually do.)

Characters which are sometimes special, depending on context, are one more thing making regexes harder than they appear at first sight.
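
The same behaviour can be seen outside regex101. A small sketch with Python's re module, which is only one engine and may differ from others; ORIGINAL and accepts_original are hypothetical names:

```python
import re

# The pattern exactly as quoted from the tutorial, stray ']' included.
# Python's re compiles it and treats the unpaired ']' as a literal character.
ORIGINAL = re.compile(r"([0-9][0-9]?[0-9]][.])+")

def accepts_original(text: str) -> bool:
    """Return True only if the whole string matches the pattern."""
    return ORIGINAL.fullmatch(text) is not None

print(accepts_original("123].45].67].89]."))     # True: each group needs a literal ']'
print(accepts_original("999.999.999.000.999."))  # False: no ']' before the dots
print(accepts_original("127.0.0.1"))             # False
```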

The author's willingness to publish code without even minimal testing does not inspire confidence.

replies(1): >>43756107 #
4. vitus ◴[] No.43756107[source]
Agreed entirely, on all those points.

Calling the extra ] a syntax error was a slight exaggeration on my part, but that was clearly an unintended extra character -- there's no way the author thinks "123].45].67].89]" is a valid IP address. But yes, it does compile and is interpreted as a valid regex, albeit not a useful one in this context.

The out-of-range values are not ideal but can be fixed with post-validation in code (which is cleaner than writing an unnecessarily complicated regex, anyway). The missing ? leads to a bunch of false negatives, and the trailing . causes even more problems.
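
A rough sketch of that split in Python: the regex only checks the shape (four groups of one to three digits separated by dots), and ordinary code checks the range. SHAPE and is_ipv4 are illustrative names, and accepting leading zeros such as "010.0.0.1" is an assumption made here, not something settled in this thread.

```python
import re

# Shape only: four groups of 1-3 ASCII digits, separated by dots, no trailing dot.
SHAPE = re.compile(r"[0-9]{1,3}(?:[.][0-9]{1,3}){3}")

def is_ipv4(text: str) -> bool:
    """Check the shape with a regex, then range-check each octet in code."""
    if SHAPE.fullmatch(text) is None:
        return False
    # Post-validation: every octet must be in 0..255.
    return all(int(part) <= 255 for part in text.split("."))

print(is_ipv4("127.0.0.1"))             # True
print(is_ipv4("999.999.999.000"))       # False: right shape, octets out of range
print(is_ipv4("999.999.999.000.999."))  # False: wrong shape
```

Python's standard-library ipaddress module is another option when no regex is needed at all.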