Regex Isn't Hard (2023)

(timkellogg.me)

75 points asicsp | 1 comments | 21 Apr 25 10:35 UTC | HN request time: 0.203s | source

Show context

michaelt ◴[21 Apr 25 11:01 UTC] No.43750496[source]▶

> e.g. This pattern ([0-9][0-9]?[0-9]][.])+ matches one, two or three digits followed by a . and also matches repeated patterns of this. This wold match an IP address (albeit not strictly).

I love regular expressions but one thing I've learned over the years is the syntax is dense enough that even people who are confident enough to start writing regex tutorials often can't write a regex that matches an IP address.

replies(9): >>43750531 #>>43750628 #>>43750641 #>>43750693 #>>43750726 #>>43751250 #>>43751329 #>>43751632 #>>43754055 #

ninkendo ◴[21 Apr 25 11:29 UTC] No.43750693[source]▶

>>43750496 #

Writing one correctly is pretty complicated task if you’re trying to write a simple tutorial… off the top of my head, you’d need:

    (
      (
      25[0-5] # 250-255
      |
      2[0-4][0-9] # 200-249
      |
      1[0-9]{2} # 100-199
      |
      [1-9][0-9] # 10-99
      |
      [0-9]
      )
      \.
    ){3}
    (
    25[0-5] # 250-255
    |
    2[0-4][0-9] # 200-249
    |
    1[0-9]{2} # 100-199
    |
    [1-9][0-9] # 10-99
    |
    [0-9]
    )

… but without all the nice white space and comments, unless you’re willing to discuss regex engines that let you do multi-line/commented literals like that… I think ruby does, not sure what other languages.

The problem is that expressing “an integer from 0-255” is surprisingly complicated for regex engines to express. And that’s not even accounting for IP addresses that don’t use dots (which is legal as an argument to most software that connects to an IP address), as other commenters have pointed out.

replies(2): >>43750866 #>>43751438 #

1. wat10000 ◴[21 Apr 25 12:52 UTC] No.43751438[source]▶

>>43750693 #

Regex can be good but you need to be willing to bail out when it’s not appropriate.

For something like locating IP addresses in text, using a regex to identify candidates is a great idea. But as you show, you don’t want to implement the full validation in it. Use regex to find dotted digit groups, but validate the actual numeric values as a separate step afterwards.

↑