Regex Isn't Hard (2023)

(timkellogg.me)
75 points asicsp | 21 comments
1. michaelt ◴[] No.43750496[source]
> e.g. This pattern ([0-9][0-9]?[0-9]][.])+ matches one, two or three digits followed by a . and also matches repeated patterns of this. This would match an IP address (albeit not strictly).

I love regular expressions but one thing I've learned over the years is the syntax is dense enough that even people who are confident enough to start writing regex tutorials often can't write a regex that matches an IP address.

replies(9): >>43750531 #>>43750628 #>>43750641 #>>43750693 #>>43750726 #>>43751250 #>>43751329 #>>43751632 #>>43754055 #
2. iugtmkbdfil834 ◴[] No.43750531[source]
Is it because everyone tries to make it look short?

edit: asking partly because in my current work I occasionally have to convince non-technical users to use one type of entry over another. For that reason, an easy-to-read, simple regex wins over a fancy but convoluted one.

replies(1): >>43750783 #
3. TheDong ◴[] No.43750628[source]
"matches an ip address" is a vague enough specification that of course people fail.

Is it what `inet_addr` accepts? In that case, "1", "0x1", "00.01", "00000.01", and more are all ip addresses. `ping` accepts all of em anyway.

Is a valid ipv6 address one with the square brackets around it? Is "::1" a valid ip address? What about "fe80::1%eth2"? ping accepts both of these on my machine (though probably not on yours, since you probably don't have an eth2 interface)
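
How strict "valid" is depends on which parser you ask. As a rough sketch, Python's standard `ipaddress` module rejects several of the forms that the comment says `inet_addr` and `ping` accept. The `parses` helper is a hypothetical name used only for illustration.

```python
import ipaddress


def parses(text: str) -> bool:
    """Return True if ipaddress.ip_address() accepts the string."""
    try:
        ipaddress.ip_address(text)
        return True
    except ValueError:
        return False


# Candidates taken from the comment above.
for candidate in ["1", "0x1", "00.01", "::1", "[::1]", "fe80::1%eth2"]:
    print(f"{candidate!r:16} {parses(candidate)}")
```

The scoped form ("fe80::1%eth2") is only understood by Python 3.9 and later, so its result depends on the interpreter version.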

replies(2): >>43750678 #>>43750809 #
4. stavros ◴[] No.43750641[source]
Well, it depends on how specific you want to be. You could do `.*`, and this will match an IP address, or you can be as specific as trying to specify number ranges digit by digit, which is so complicated that it doesn't merit a "can't even".

Also, `16843009` is an IP address, try pinging it.
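
The integer form is just the four octets packed into one 32-bit number. A small sketch using Python's standard `ipaddress` module shows which dotted address it corresponds to:

```python
import ipaddress

# 16843009 == 0x01010101, i.e. one byte of value 1 per octet.
addr = ipaddress.ip_address(16843009)
print(addr)  # 1.1.1.1

# And back again: the dotted form converts to the same integer.
print(int(ipaddress.IPv4Address("1.1.1.1")))  # 16843009
```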

5. NikkiA ◴[] No.43750678[source]
Square brackets around an IP address predate IPv6. They were (are?) used to bypass DNS lookups, and some (very) old programs required IP addresses inside [...], otherwise they were assumed to be domain names with all the rules that implied.
6. ninkendo ◴[] No.43750693[source]
Writing one correctly is a pretty complicated task if you’re trying to write a simple tutorial… off the top of my head, you’d need:

    (
      (
        25[0-5] # 250-255
        |
        2[0-4][0-9] # 200-249
        |
        1[0-9]{2} # 100-199
        |
        [1-9][0-9] # 10-99
        |
        [0-9] # 0-9
      )
      \.
    ){3}
    (
      25[0-5] # 250-255
      |
      2[0-4][0-9] # 200-249
      |
      1[0-9]{2} # 100-199
      |
      [1-9][0-9] # 10-99
      |
      [0-9] # 0-9
    )
    
… but without all the nice white space and comments, unless you’re willing to discuss regex engines that let you do multi-line/commented literals like that… I think ruby does, not sure what other languages.

The problem is that “an integer from 0-255” is surprisingly complicated to express in a regex. And that’s not even accounting for IP addresses that don’t use dots (which are legal as an argument to most software that connects to an IP address), as other commenters have pointed out.

replies(2): >>43750866 #>>43751438 #
7. vitus ◴[] No.43750726[source]
It's especially ironic given that the title of the post is "Regex Isn't Hard", and then it proceeds to make several (syntactical and logical) errors in the one real-world example.

Syntax error aside (there's an extra ] floating around), it's not even close to correct -- it'll match "999.999.999.000.999." among other things, will never match just one digit (there's a missing ?), and always insists on the trailing dot.
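
These claims are easy to check. The sketch below uses Python's `re` module and the article's pattern with the stray ] removed, which is an assumption about what the author meant to write. `accepts` is a hypothetical helper name.

```python
import re

# The article's pattern minus the stray "]" (assumed to be the intended form).
intended = re.compile(r"([0-9][0-9]?[0-9][.])+")


def accepts(text: str) -> bool:
    """True if the whole string matches the intended pattern."""
    return intended.fullmatch(text) is not None


print(accepts("999.999.999.000.999."))  # True: out-of-range octets, five groups
print(accepts("127.0.0.1"))             # False: single digits, no trailing dot
print(accepts("10.20.30."))             # True: any number of groups is fine
```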

replies(2): >>43750946 #>>43751818 #
8. vitus ◴[] No.43750783[source]
> For that reason, easy to read, simple regex wins over fancy, but convoluted regex.

Sure, I'd take \d+\.\d+\.\d+\.\d+ over... "((2(5[0-5]|[0-4][0-9])|1[0-9]{2}|[1-9]?[0-9])\.){3}(2(5[0-5]|[0-4][0-9])|1[0-9]{2}|[1-9]?[0-9])", assuming that I then validate the results afterwards.

9. ◴[] No.43750809[source]
10. vitus ◴[] No.43750866[source]
> I think ruby does, not sure what other languages.

You're right that Ruby has it. Perl also has /x, of course (since most of Ruby regex was "inspired" directly by Perl's syntax), as well as Python (re.VERBOSE). Otherwise, yeah, it's disappointingly rare.
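
For reference, a sketch of the commented 0-255 pattern from the parent comment using Python's `re.VERBOSE`. The names `OCTET` and `IPV4` are illustrative, and the groups are made non-capturing, which the parent's version did not do.

```python
import re

# One octet, 0-255, with no leading zeros. Whitespace and "#" comments
# inside the pattern are ignored because of re.VERBOSE.
OCTET = r"""
    (?: 25[0-5]        # 250-255
      | 2[0-4][0-9]    # 200-249
      | 1[0-9]{2}      # 100-199
      | [1-9][0-9]     # 10-99
      | [0-9]          # 0-9
    )
"""

# Three "octet + dot" groups followed by a final octet.
IPV4 = re.compile(rf"(?:{OCTET}\.){{3}}{OCTET}", re.VERBOSE)

for text in ["127.0.0.1", "255.255.255.255", "256.1.1.1", "01.2.3.4"]:
    print(text, bool(IPV4.fullmatch(text)))
```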

replies(1): >>43756671 #
11. michaelt ◴[] No.43750946[source]
Correct - it'll accept "999.999.999.000.999." but it'll reject "127.0.0.1"
12. aadhavans ◴[] No.43751250[source]
Shameless plug: My Regex engine (https://pkg.go.dev/gitea.twomorecents.org/Rockingcool/kleing...) has dedicated syntax for this kind of task.

  <0-255>\.<0-255>\.<0-255>\.<0-255>
will only match full IPv4 addresses, and is a lot stricter than the one in the article.

EDIT: formatting

13. russfink ◴[] No.43751329[source]
^^^ this ^^^ I can’t understand my own regexes after a couple weeks - much less the ones I got the AI to write for me because I’m lazy or time constrained.
14. wat10000 ◴[] No.43751438[source]
Regex can be good but you need to be willing to bail out when it’s not appropriate.

For something like locating IP addresses in text, using a regex to identify candidates is a great idea. But as you show, you don’t want to implement the full validation in it. Use regex to find dotted digit groups, but validate the actual numeric values as a separate step afterwards.
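
A minimal sketch of that two-step approach in Python. The candidate pattern (one to three digits per group) and the `find_ipv4` name are assumptions for illustration. It does not try to handle lookalikes such as longer dotted version strings.

```python
import re

# Step 1: a loose pattern that only finds dotted groups of digits.
CANDIDATE = re.compile(r"\b(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})\b")


def find_ipv4(text: str) -> list[str]:
    """Return dotted quads in text whose four parts are all in 0-255."""
    found = []
    for match in CANDIDATE.finditer(text):
        # Step 2: validate the numeric range in code, not in the regex.
        if all(int(part) <= 255 for part in match.groups()):
            found.append(match.group(0))
    return found


print(find_ipv4("up 127.0.0.1, down 999.1.1.1, gw 10.0.0.254"))
# ['127.0.0.1', '10.0.0.254']
```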

15. nialv7 ◴[] No.43751632[source]
When the article starts with an AI generated image that adds nothing to the explanation, it tends to make me suspect that the article itself was written by an AI as well...
16. mannykannot ◴[] No.43751818[source]
In practice, the first unpaired ] is treated as an ordinary character (at least according to https://regex101.com/) - which does nothing to make this regex fit for its intended purpose. I'm not sure whether this is according to spec. (I think it is, though that does not really matter compared to what the implementations actually do.)

Characters which are sometimes special, depending on context, are one more thing making regexes harder than they appear at first sight.

The author's willingness to publish code without even minimal testing does not inspire confidence.
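
Python's `re` module behaves the same way as described for regex101: the pattern compiles as published and the unpaired ] is matched literally. A short sketch:

```python
import re

# The pattern exactly as published, stray "]" included.
published = re.compile(r"([0-9][0-9]?[0-9]][.])+")

# The "]" must appear literally in the input for a match.
print(bool(published.fullmatch("12].345].")))  # True
print(bool(published.fullmatch("12.345.")))    # False
```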

replies(1): >>43756107 #
17. reverendsteveii ◴[] No.43754055[source]
I haven't verified it but quick googling for a regex to validate all legal email addresses pointed me to https://stackoverflow.com/questions/201323/how-can-i-validat..., where one commenter posits that regex to be:

(?:(?:\r\n)?[ \t])(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t] )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))"(?:(?: \r\n)?[ \t]))(?:\.(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:( ?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))"(?:(?:\r\n)?[ \t])))@(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\\".\[\] \000-\0 31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)\ ](?:(?:\r\n)?[ \t]))(?:\.(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\\".\[\] \000-\031]+ (?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)\](?: (?:\r\n)?[ \t])))|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))"(?:(?:\r\n) ?[ \t]))\<(?:(?:\r\n)?[ \t])(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\ r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)\](?:(?:\r\n)?[ \t]))(?:\.(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n) ?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)\](?:(?:\r\n)?[ \t] )))(?:,@(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)\](?:(?:\r\n)?[ \t])* )(?:\.(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t] )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)\](?:(?:\r\n)?[ \t])))) :(?:(?:\r\n)?[ \t]))?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+ |\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))"(?:(?:\r \n)?[ \t]))(?:\.(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?: \r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t ]))"(?:(?:\r\n)?[ \t])))@(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\\".\[\] \000-\031 ]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)\]( ?:(?:\r\n)?[ \t]))(?:\.(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\\".\[\] \000-\031]+(? 
:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)\](?:(? :\r\n)?[ \t])))\>(?:(?:\r\n)?[ \t]))|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(? :(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)? [ \t]))"(?:(?:\r\n)?[ \t])):(?:(?:\r\n)?[ \t])(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]| \\.|(?:(?:\r\n)?[ \t]))"(?:(?:\r\n)?[ \t]))(?:\.(?:(?:\r\n)?[ \t])(?:[^()<>

@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|" (?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))"(?:(?:\r\n)?[ \t])))@(?:(?:\r\n)?[ \t] )(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^\[\]\r\\]|\\.)\](?:(?:\r\n)?[ \t]))(?:\.(?:(?:\r\n)?[ \t])(? :[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[ \]]))|\[([^\[\]\r\\]|\\.)\](?:(?:\r\n)?[ \t])))|(?:[^()<>@,;:\\".\[\] \000- \031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|( ?:(?:\r\n)?[ \t]))"(?:(?:\r\n)?[ \t]))\<(?:(?:\r\n)?[ \t])(?:@(?:[^()<>@,; :\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([ ^\[\]\r\\]|\\.)\](?:(?:\r\n)?[ \t]))(?:\.(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\\" .\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\ ]\r\\]|\\.)\](?:(?:\r\n)?[ \t])))(?:,@(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\\".\ [\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\ r\\]|\\.)\](?:(?:\r\n)?[ \t]))(?:\.(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\] |\\.)\](?:(?:\r\n)?[ \t])))):(?:(?:\r\n)?[ \t]))?(?:[^()<>@,;:\\".\[\] \0 00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\ .|(?:(?:\r\n)?[ \t]))"(?:(?:\r\n)?[ \t]))(?:\.(?:(?:\r\n)?[ \t])(?:[^()<>@, ;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(? :[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))"(?:(?:\r\n)?[ \t])))@(?:(?:\r\n)?[ \t])* (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". 
\[\]]))|\[([^\[\]\r\\]|\\.)\](?:(?:\r\n)?[ \t]))(?:\.(?:(?:\r\n)?[ \t])(?:[ ^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\] ]))|\[([^\[\]\r\\]|\\.)\](?:(?:\r\n)?[ \t])))\>(?:(?:\r\n)?[ \t]))(?:,\s( ?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))"(?:(?:\r\n)?[ \t]))(?:\.(?:( ?:\r\n)?[ \t])(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[ \["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))"(?:(?:\r\n)?[ \t ])))@(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t ])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)\](?:(?:\r\n)?[ \t]))(? :\.(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+| \Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)\](?:(?:\r\n)?[ \t])))|(?: [^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\ ]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))"(?:(?:\r\n)?[ \t]))\<(?:(?:\r\n) ?[ \t])(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[" ()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)\](?:(?:\r\n)?[ \t]))(?:\.(?:(?:\r\n) ?[ \t])(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>

@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)\](?:(?:\r\n)?[ \t])))(?:,@(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@, ;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)\](?:(?:\r\n)?[ \t]))(?:\.(?:(?:\r\n)?[ \t] )(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^\[\]\r\\]|\\.)\](?:(?:\r\n)?[ \t])))):(?:(?:\r\n)?[ \t]))? (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". \[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))"(?:(?:\r\n)?[ \t]))(?:\.(?:(?: \r\n)?[ \t])(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[ "()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))"(?:(?:\r\n)?[ \t]) ))@(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]) +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)\](?:(?:\r\n)?[ \t]))(?:\ .(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)\](?:(?:\r\n)?[ \t])))\>(?:( ?:\r\n)?[ \t]))))?;\s)

Someone else then asks the absolute razor of a question: 'What value does this add over just verifying that the input is of the form {something}@{something}.{something}?'

replies(1): >>43755438 #
18. Supermancho ◴[] No.43755438[source]
> 'What value does this add over just verifying that the input is of the form {something}@{something}.{something}?'

Depends if {something} can contain periods for my email.

name@antispam.mydomain.com

replies(1): >>43762827 #
19. vitus ◴[] No.43756107{3}[source]
Agreed entirely, on all those points.

Calling the extra ] a syntax error was a slight exaggeration on my part, but that was clearly an unintended extra character -- there's no way the author thinks "123].45].67].89]" is a valid IP address. But yes, it does compile and is interpreted as a valid regex, albeit not a useful one in this context.

The out-of-range values are not ideal but can be fixed with post-validation in code (which is cleaner than writing unnecessarily complicated regex, anyways). The missing ? leads to a bunch of false negatives, and the trailing . causes even more problems.

20. bazoom42 ◴[] No.43756671{3}[source]
.net also supports verbose regex.
21. reverendsteveii ◴[] No.43762827{3}[source]
{something} is defined as any non-empty string of valid characters. I'm sure there's some char somewhere that breaks this, but in a realistic setting with normal users you won't encounter that edge case, and if you do, the worst that happens is an email gets returned as non-deliverable.
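
A minimal sketch of the {something}@{something}.{something} check in Python. Treating "valid characters" as anything other than @ and whitespace is an assumption made here, and `looks_like_email` is a hypothetical name.

```python
import re

# {something}@{something}.{something}, where {something} is one or more
# characters that are neither "@" nor whitespace (an assumed definition).
LOOSE_EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def looks_like_email(text: str) -> bool:
    """True if the whole string has the loose something@something.something shape."""
    return LOOSE_EMAIL.fullmatch(text) is not None


print(looks_like_email("name@antispam.mydomain.com"))  # True: extra dots are fine
print(looks_like_email("name@localhost"))              # False: no dot after the @
```

Because each {something} may itself contain periods, subdomain addresses like the one above pass without any special handling.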