Regex Isn't Hard (2023)

(timkellogg.me)

1. comrade1234 ◴[21 Apr 25 10:53 UTC] No.43750433[source]▶

I mean sure, if it was my full-time job to write regexes I’d probably get pretty good at it. But instead a really complex one comes up maybe once a year for me and so I have to go to some online regex checker and start iteratively building one up, spending hours only to find some condition where it doesn’t work and back to the checker...

So I don’t think it’s easy, but I do agree that they are very useful.

replies(1): >>43750532 #

2. nickez ◴[21 Apr 25 10:56 UTC] No.43750462[source]▶

>>43750314 (OP) #

Found an error immediately: "Any lowercase character" doesn't match all Swedish lowercase characters.

replies(2): >>43750511 #>>43750527 #

3. voidUpdate ◴[21 Apr 25 10:58 UTC] No.43750473[source]▶

>>43750314 (OP) #

The text on that AI-generated image at the top is definitely... interesting

replies(1): >>43750487 #

4. poisonborz ◴[21 Apr 25 10:59 UTC] No.43750481[source]▶

>>43750314 (OP) #

This is truly one thing AI solved. Hard to write, easy to test. No one needs to learn this convoluted syntax in the future and we're all better for it.

replies(3): >>43750537 #>>43750588 #>>43750794 #

5. justlikereddit ◴[21 Apr 25 11:00 UTC] No.43750484[source]▶

>>43750314 (OP) #

Nothing is hard once you've learned to do it intuitively.

The hardest part is remembering how you struggled with it when you started.

replies(1): >>43750562 #

6. johnisgood ◴[21 Apr 25 11:00 UTC] No.43750487[source]▶

>>43750473 #

I am slow, why do you say this?

7. pyfon ◴[21 Apr 25 11:01 UTC] No.43750493[source]▶

>>43750314 (OP) #

I strongly agree with [^"] etc. over . and .?

Involves much less thinking!

replies(1): >>43750549 #

8. michaelt ◴[21 Apr 25 11:01 UTC] No.43750496[source]▶

>>43750314 (OP) #

> e.g. This pattern ([0-9][0-9]?[0-9]][.])+ matches one, two or three digits followed by a . and also matches repeated patterns of this. This wold match an IP address (albeit not strictly).

I love regular expressions but one thing I've learned over the years is the syntax is dense enough that even people who are confident enough to start writing regex tutorials often can't write a regex that matches an IP address.

replies(9): >>43750531 #>>43750628 #>>43750641 #>>43750693 #>>43750726 #>>43751250 #>>43751329 #>>43751632 #>>43754055 #

9. TrackerFF ◴[21 Apr 25 11:02 UTC] No.43750504[source]▶

>>43750314 (OP) #

Confession: Regex knowledge is one of those things I've let completely atrophy after integrating LLMs into my workflow. I guess if the day comes that AI/ML models suddenly disappear, or become completely unavailable to me, I'll have to get into the nitty gritty of Regex again...but until that time, it is a "solved problem" for my part.

replies(3): >>43750611 #>>43750637 #>>43750872 #

10. iugtmkbdfil834 ◴[21 Apr 25 11:03 UTC] No.43750511[source]▶

>>43750462 #

Ok. This sounds like an interesting detour. Can you elaborate on that one? I doubt I will ever use that knowledge, but it sounds like it is worth knowing anyway.

replies(2): >>43750518 #>>43750586 #

11. Tryk ◴[21 Apr 25 11:04 UTC] No.43750518{3}[source]▶

>>43750511 #

https://en.wikipedia.org/wiki/Swedish_alphabet

12. comrade1234 ◴[21 Apr 25 11:05 UTC] No.43750527[source]▶

>>43750462 #

lol really? Why not? Is that true for all encodings? Is it a bug or a feature? What about a simple character set like gsm-7 Swedish?

replies(2): >>43750578 #>>43750584 #

13. ◴[21 Apr 25 11:05 UTC] No.43750528[source]▶

>>43750314 (OP) #

14. iugtmkbdfil834 ◴[21 Apr 25 11:05 UTC] No.43750531[source]▶

>>43750496 #

Is it because everyone tries to make it look short?

edit: asking partly because in my current work I occasionally have to convince non-technical users to use one type of entry over another. For that reason, an easy-to-read, simple regex wins over a fancy but convoluted one.

replies(1): >>43750783 #

15. Tryk ◴[21 Apr 25 11:05 UTC] No.43750532[source]▶

>>43750433 #

It's like a programming language inside a programming language.

16. mrkeen ◴[21 Apr 25 11:06 UTC] No.43750537[source]▶

>>43750481 #

Nothing that LLMs produce today is good enough to bypass a developer who can judge whether it's correct or not.

17. rapidaneurism ◴[21 Apr 25 11:07 UTC] No.43750549[source]▶

>>43750493 #

what about "hello \"there\"" ?
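To make that edge case concrete, here is a minimal sketch in Python (an assumption; the thread names no language). It compares the plain negated class with an escape-aware variant. The sample string and variable names are illustrative only.

```python
import re

# Sample text containing a quoted string with escaped inner quotes.
text = r'say "hello \"there\"" now'

# Negated-class version: stops at the first double quote, escaped or not.
naive = re.search(r'"[^"]*"', text).group()

# Escape-aware version: a backslash consumes whatever character follows it.
escape_aware = re.search(r'"(?:[^"\\]|\\.)*"', text).group()

print(naive)         # "hello \"
print(escape_aware)  # "hello \"there\""
```

The negated class is still doing the work in the second pattern; it just has to exclude the backslash as well so that escapes are handled by their own branch.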

replies(1): >>43751084 #

18. criddell ◴[21 Apr 25 11:09 UTC] No.43750562[source]▶

>>43750484 #

It can help to learn what the “regular” part of regular expression refers to.

19. evertedsphere ◴[21 Apr 25 11:10 UTC] No.43750565[source]▶

>>43750314 (OP) #

  This pattern ([0-9][0-9]?[0-9]][.])+ matches one, two or three digits followed by a . and also matches repeated patterns of this. This wold match an IP address (albeit not strictly).

that pattern (once you fixed the typo) would not match a whole ip address unless you allowed it to also swallow the character after the last octet, which wouldn't work at, say, end of line

20. gwd ◴[21 Apr 25 11:10 UTC] No.43750572[source]▶

>>43750314 (OP) #

So my brother doesn't code for a living, but has done a fair amount of personal coding, and also gotten into the habit of watching live-coding sessions on YouTube. Recently he's gotten involved in my project a bit, and so we've done some pair programming sessions, in part to get him up to speed on the codebase, in part to get him up to speed on more industrial-grade coding practices and workflows.

At some point we needed to do some parsing of some strings, and I suggested a simple regex. But apparently a bunch of the streamers he's been watching basically have this attitude that regexes stink, and you should use basically anything else. So we had a conversation, and compared the clarity of coding up the relatively simple regex I'd made, with how you'd have to do it procedurally; I think the regex was a clear winner.

Obviously regexes aren't the right tool for every job, and they can certainly be done poorly; but in the right place at the right time they're the simplest, most robust, easiest to understand solution to the problem.

replies(1): >>43750627 #

21. goku12 ◴[21 Apr 25 11:11 UTC] No.43750576[source]▶

>>43750314 (OP) #

If you take the regex subset that works uniformly across all regex engines (even for just perl-compatible engines), you would probably get nothing done. They all have some minor variations that make it impossible to write a regex for a particular engine without a reference sheet open nearby, even if you have years of experience writing them. And those 'shortcuts' like look-ahead and look-behind are often too useful to be neglected completely.

Crafting regexes is a story of its own. The other commenter has described it. Just to summarize, regexes are fine for simple patterns. But their complexity explodes as soon as you need to handle a lot of corner cases.

22. thomasmg ◴[21 Apr 25 11:11 UTC] No.43750577[source]▶

>>43750314 (OP) #

For me, the main problem of the Regex syntax is the escaping rules: Many characters require escaping: \ { } ( ) [ ] | * + ? ^ $ . And the rules are different inside square brackets. I think it would be better if literal text is enclosed in quotes; that way, much less escaping is needed, but it would still be concise (and sometimes, more concise). I tried to formulate a proposal here: https://github.com/thomasmueller/bau-lang/blob/main/RegexV2....
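As a rough illustration of the escaping burden with today's syntax, here is a small Python sketch (Python is an assumption, and the literal string is made up). It uses the standard `re.escape` helper, which is one existing way to treat a whole string as literal text.

```python
import re

# A literal string that happens to contain several metacharacters.
literal = "1+1=2 (approx.)"

# Used directly as a pattern, '+', '(', ')' and '.' are interpreted,
# so the text does not even match itself.
unescaped = re.fullmatch(literal, literal)

# re.escape adds the backslashes needed to match the text literally.
escaped = re.fullmatch(re.escape(literal), literal)

print(unescaped is None)    # True
print(escaped is not None)  # True
```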

replies(1): >>43750685 #

23. criddell ◴[21 Apr 25 11:12 UTC] No.43750578{3}[source]▶

>>43750527 #

The Swedish alphabet includes characters outside of the a-z range.

24. lalaithion ◴[21 Apr 25 11:12 UTC] No.43750584{3}[source]▶

>>43750527 #

The author says “any lowercase character” but they mean “any character between the character ‘a’ and the character ‘z’”, which happens to correspond to the lower case letters in English but doesn’t include ü, õ, ø, etc.
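A quick way to see the difference, sketched in Python (an assumption; the article is not tied to a language). The helper name `is_ascii_lower` is hypothetical, and `str.islower` is used only as a Unicode-aware point of comparison.

```python
import re

# [a-z] is a code point range from 'a' to 'z', not "any lowercase letter".
ASCII_LOWER = re.compile(r"[a-z]")

def is_ascii_lower(ch: str) -> bool:
    """Return True if the single character falls in the a-z range."""
    return ASCII_LOWER.fullmatch(ch) is not None

# Compare the regex range with Python's Unicode-aware lowercase check.
for ch in "azåäö":
    print(ch, is_ascii_lower(ch), ch.islower())
```

The last three characters are lowercase according to `str.islower` but fall outside the `[a-z]` range.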

replies(2): >>43750991 #>>43751211 #

25. lalaithion ◴[21 Apr 25 11:13 UTC] No.43750586{3}[source]▶

>>43750511 #

26. ykonstant ◴[21 Apr 25 11:13 UTC] No.43750588[source]▶

>>43750481 #

I wonder if the problems people are pointing out with the examples (lowercase not being correct under various locales, IP address regex not being conformant etc) would be absent in code furnished by LLMs.

27. SnowingXIV ◴[21 Apr 25 11:16 UTC] No.43750611[source]▶

>>43750504 #

Yeah, this is my heaviest use case too. Mostly because it generally does save me a bit of time and is easily verifiable with tools like rubular and then can tweak what is needed once 90% there.

28. hyperman1 ◴[21 Apr 25 11:18 UTC] No.43750625[source]▶

>>43750314 (OP) #

This is both a demo for the beauty and power of regexes, and of their dangers:

* The use of backslash separators quickly makes a mess, as they tend to need escaping wherever regexes are useful.

* The uppercase/lowercase is only right if there are no accented characters, so USA. This is bad in western europe in files where they are rare: Your program works for a while, then an accent sneaks in and breaks things.

* The exact meaning of all the specials like \( vs ( .

* Ranges work in most regex dialects but not everywhere.

* A simple regex for an int with a specific range is nasty. If you want a full float, good luck.

Regexes are great as initial filter or quick hack, but you need more in full size programs.

I'd love to see a better regex syntax, too.

29. kelafoja ◴[21 Apr 25 11:18 UTC] No.43750627[source]▶

>>43750572 #

My problem is that regexes are write-only, unreadable once written (to me anyway). And sometimes they do more than you intended. You maybe tested on a few inputs and declared it fit for purpose, but there might be more inputs upon which it has unintended effects. I don't mind simple, straight-forward regexes. But when they become more complex, I tend to prefer to write out the procedural code, even if it is (much) longer in terms of lines. I find that generally I can read code better than regexes, and that code I write is more predictable than regexes I write.

replies(6): >>43750642 #>>43750826 #>>43751127 #>>43751152 #>>43751569 #>>43751927 #

30. TheDong ◴[21 Apr 25 11:18 UTC] No.43750628[source]▶

>>43750496 #

"matches an ip address" is a vague enough specification that of course people fail.

Is it what `inet_addr` accept? In that case, "1", "0x1", "00.01", "00000.01", and more are all ip addresses. `ping` accepts all of em anyway.

Is a valid ipv6 address one with the square brackets around it? Is "::1" a valid ip address? What about "fe80::1%eth2"? ping accepts both of these on my machine (though probably not on yours, since you probably don't have an eth2 interface)

replies(2): >>43750678 #>>43750809 #

31. throw-qqqqq ◴[21 Apr 25 11:20 UTC] No.43750637[source]▶

>>43750504 #

IMO it’s a “language” you need to understand in order to use.

Just like you wouldn’t copy/paste any random snippet into your source code if you don’t understand exactly what it does.

I see a lot of broken regex at work from people who use regular expressions but don’t understand them (for various reasons).

It used to come with a “found this on stackoverflow”-excuse, but mostly now it’s “AI told me to use this” instead.

replies(1): >>43750695 #

32. stavros ◴[21 Apr 25 11:20 UTC] No.43750641[source]▶

>>43750496 #

Well, it depends on how specific you want to be. You could do `.*`, and this will match an IP address, or you can be as specific as trying to specify number ranges digit by digit, which is so complicated that it doesn't merit a "can't even".

Also, `16843009` is an IP address, try pinging it.
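For the curious, the integer form can be checked without pinging anything. A minimal sketch using Python's standard `ipaddress` module (Python itself is an assumption here):

```python
import ipaddress

# An IPv4 address is a 32-bit number; the dotted form is just one notation.
addr = ipaddress.IPv4Address(16843009)
print(addr)  # 1.1.1.1

# And back again: dotted quad to integer.
print(int(ipaddress.IPv4Address("1.1.1.1")))  # 16843009
```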

33. fragmede ◴[21 Apr 25 11:20 UTC] No.43750642{3}[source]▶

>>43750627 #

You know you can write comments in your code where the regexp is, right?

replies(2): >>43750686 #>>43765568 #

34. ◴[21 Apr 25 11:24 UTC] No.43750660[source]▶

>>43750314 (OP) #

35. NikkiA ◴[21 Apr 25 11:27 UTC] No.43750678{3}[source]▶

>>43750628 #

square brackets around an IP address predates IPv6, it was/is? used to bypass DNS lookups and some (very) old programs required IP addresses inside [...] otherwise they were assumed to be a domain name with all the rules that implied.

36. lhamil64 ◴[21 Apr 25 11:27 UTC] No.43750685[source]▶

>>43750577 #

One thing I noticed with the example `['0-9a-f']`

Doesn't this go against the "literals are enclosed in quotes" idea? In this case, you have a special character (`-`) inside a quoted string. IMO this would be more consistent: `['0'-'9''a'-'f'']`, maybe even have comma separation like `['0'-'9','a'-'f'']`. This would also allow you to include the character classes like `[d,'a'-'f'']` although that might be a little confusing if you're used to normal regex.

replies(1): >>43751066 #

37. ◴[21 Apr 25 11:28 UTC] No.43750686{4}[source]▶

>>43750642 #

38. ninkendo ◴[21 Apr 25 11:29 UTC] No.43750693[source]▶

>>43750496 #

Writing one correctly is a pretty complicated task if you’re trying to write a simple tutorial… off the top of my head, you’d need:

    (
      (
      25[0-5] # 250-255
      |
      2[0-4][0-9] # 200-249
      |
      1[0-9]{2} # 100-199
      |
      [1-9][0-9] # 10-99
      |
      [0-9]
      )
      \.
    ){3}
    (
    25[0-5] # 250-255
    |
    2[0-4][0-9] # 200-249
    |
    1[0-9]{2} # 100-199
    |
    [1-9][0-9] # 10-99
    |
    [0-9]
    )

… but without all the nice white space and comments, unless you’re willing to discuss regex engines that let you do multi-line/commented literals like that… I think ruby does, not sure what other languages.

The problem is that “an integer from 0-255” is surprisingly complicated to express in a regex. And that’s not even accounting for IP addresses that don’t use dots (which is legal as an argument to most software that connects to an IP address), as other commenters have pointed out.
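For comparison, the same idea can be written with whitespace and comments in Python via `re.VERBOSE`. This is only a sketch under the assumption that a strict dotted-quad match is wanted; the names `OCTET` and `IPV4` are hypothetical.

```python
import re

# One octet, 0-255, written once and reused. Under re.VERBOSE, whitespace
# and '#' comments inside the pattern are ignored.
OCTET = r"""
    (?: 25[0-5]        # 250-255
      | 2[0-4][0-9]    # 200-249
      | 1[0-9]{2}      # 100-199
      | [1-9][0-9]     # 10-99
      | [0-9]          # 0-9
    )
"""

# Three "octet + dot" groups followed by a final octet.
IPV4 = re.compile(rf"(?:{OCTET}\.){{3}}{OCTET}", re.VERBOSE)

for candidate in ["127.0.0.1", "255.255.255.255", "256.1.1.1", "1.2.3"]:
    print(candidate, IPV4.fullmatch(candidate) is not None)
```

Building the pattern from a named piece avoids repeating the octet alternation by hand, at the cost of some string interpolation.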

replies(2): >>43750866 #>>43751438 #

39. qiine ◴[21 Apr 25 11:29 UTC] No.43750695{3}[source]▶

>>43750637 #

yeah programmers famously understand all the random boilerplate incantations they copy-paste into their code to get things going.

totally definitively

replies(2): >>43750763 #>>43752505 #

40. boricj ◴[21 Apr 25 11:33 UTC] No.43750724[source]▶

>>43750314 (OP) #

In a previous job I've done some stupid tricks with regexes. Inside a MongoDB database I had documents with a version field in string form ("x.y.z") and I needed to exclude documents with a schema too old to process in my queries.

One can construct a regex that matches a number between x and y by enumerating all the digit patterns that fit the criteria. For example, the following pattern matches a number between 1 and 255: ^([1-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])$

This can be extended to match a version less than or equal to x.y.z by enumerating all the patterns across the different fields. The following pattern matches any version less than or equal to 2.14.0: ^([0-1]\.\d+\.\d+)|(2\.[0-9]\.\d+|(2\.1[0-3]\.\d+))$

Basically, I wrote a Java method that would generate a regex with all the patterns to match a version greater than or equal to a lower bound, which was then fed to MongoDB queries to exclude documents too old to process based on the version field. It was a stupid solution to a dumb problem, but it worked flawlessly.
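Enumerated-range patterns like these are easy to check exhaustively when the input space is small. A minimal Python sketch (Python is an assumption; the original generator was a Java method that is not shown) that brute-forces the 1 to 255 pattern quoted above:

```python
import re

# The 1-255 pattern from the comment, copied as written.
RANGE_1_255 = re.compile(r"^([1-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])$")

# Try every integer from 0 to 999 and keep the ones the pattern accepts.
accepted = [n for n in range(0, 1000) if RANGE_1_255.match(str(n))]

print(accepted == list(range(1, 256)))  # True
```

The same kind of loop could be pointed at a generated version pattern before it is sent to the database.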

41. vitus ◴[21 Apr 25 11:33 UTC] No.43750726[source]▶

>>43750496 #

It's especially ironic given that the title of the post is "Regex Isn't Hard", and then it proceeds to make several (syntactical and logical) errors in the one real-world example.

Syntax error aside (there's an extra ] floating around), it's not even close to correct -- it'll match "999.999.999.000.999." among other things, will never match just one digit (there's a missing ?), and always insists on the trailing dot.
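These claims are easy to check. A small Python sketch (Python is an assumption), using the article's pattern with the stray `]` removed and asking whether the whole string matches:

```python
import re

# The article's pattern with the extra ']' removed. Each repetition needs
# two or three digits followed by a dot.
ARTICLE_PATTERN = re.compile(r"([0-9][0-9]?[0-9][.])+")

for candidate in ["999.999.999.000.999.", "127.0.0.1", "10.20.30.40"]:
    print(candidate, ARTICLE_PATTERN.fullmatch(candidate) is not None)
```

Only the first string is accepted: the other two contain a single-digit group or lack the trailing dot.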

replies(2): >>43750946 #>>43751818 #

42. bazoom42 ◴[21 Apr 25 11:35 UTC] No.43750735[source]▶

>>43750314 (OP) #

Honestly regex syntax is a mess. For example parentheses are used both for grouping alternatives and for capturing. I think Perl 6 tried (and failed) to fix this. Larger problem is you have to memorize the meta characters since they are basically random.

Regex is still the best solution I know of for its intended domain.

43. throw-qqqqq ◴[21 Apr 25 11:38 UTC] No.43750763{4}[source]▶

>>43750695 #

We all have our own ideas of Utopia I guess :)

44. vitus ◴[21 Apr 25 11:40 UTC] No.43750783{3}[source]▶

>>43750531 #

> For that reason, easy to read, simple regex wins over fancy, but convoluted regex.

Sure, I'd take \d+\.\d+\.\d+\.\d+ over... "((2(5[0-5]|[0-4][0-9])|1[0-9]{2}|[1-9]?[0-9])\.){3}(2(5[0-5]|[0-4][0-9])|1[0-9]{2}|[1-9]?[0-9])", assuming that I then validate the results afterwards.

45. bazoom42 ◴[21 Apr 25 11:41 UTC] No.43750794[source]▶

>>43750481 #

How would you know if a regex is correct if you don't understand it?

replies(1): >>43752218 #

46. latexr ◴[21 Apr 25 11:43 UTC] No.43750808[source]▶

>>43750314 (OP) #

I’m a fan of regular expressions, though I understand why many people wince at the sight. You should avoid showing them to a non-programmer who is interested in learning to code, because they’ll immediately fear programming is intractable.

Even as much as I like regex, I wouldn’t recommend this post. One reason is the code style is too close to regular text:

> a matches a single character, always lowercase a.

That sentence uses “a” three times, two of them as code and once as an indefinite article, but it’s not immediately obvious to the eye. VoiceOver completely fumbles it, especially considering the sentence immediately after.

A more important reason against recommending the article is that I find a bunch of the arguments to be unhelpful. If you’re trying to convince people to give regular expressions a chance, telling them to ignore `.` and use `[^%]` is going to bite them. That’s not super common (important when trying to learn more from other sources) and even an experienced regexer must do a double take to figure out “is there a reason this specific character must not be matched?” Furthermore, no new learner is going to remember that four character incantation, and neither are they going to understand what’s happening when their code doesn’t work because there was a `%` in their text. People need to learn about `.` (possibly the most common character in regex) if only because they also need to learn to escape it and not ignore it when there is a literal period in the text. Don’t tell people to ignore repetition ranges either, they aren’t difficult to reason about and are certainly simpler to read than the same blob of intractable text multiple times.

replies(1): >>43751104 #

47. ◴[21 Apr 25 11:43 UTC] No.43750809{3}[source]▶

>>43750628 #

48. lairv ◴[21 Apr 25 11:45 UTC] No.43750827[source]▶

>>43750314 (OP) #

My issue with regexes is that the formal definition of regex I learned at university is clear and simple [0] but then using them in programming languages is always a mess

[0] https://en.wikipedia.org/wiki/Regular_expression#Formal_lang...

replies(1): >>43769129 #

49. latexr ◴[21 Apr 25 11:45 UTC] No.43750826{3}[source]▶

>>43750627 #

> unreadable once written (to me anyway). (…) there might be more inputs upon which it has unintended effects.

https://regex101.com can explain your regex back to you, and allows you to test it with more inputs.

Though I’m not trying to convince you to always use regular expressions, I agree with GP:

> Obviously regexes aren't the right tool for every job, and they can certainly be done poorly; but in the right place at the right time they're the simplest, most robust, easiest to understand solution to the problem.

50. vitus ◴[21 Apr 25 11:51 UTC] No.43750866{3}[source]▶

>>43750693 #

> I think ruby does, not sure what other languages.

You're right that Ruby has it. Perl also has /x, of course (since most of Ruby regex was "inspired" directly by Perl's syntax), as well as Python (re.VERBOSE). Otherwise, yeah, it's disappointingly rare.

replies(1): >>43756671 #

51. TheOtherHobbes ◴[21 Apr 25 11:52 UTC] No.43750872[source]▶

>>43750504 #

It's hilarious that the most reliable way to write a complex regex is to fire up billions of dollars of state of the art ML code and ask for what you want in English.

52. michaelt ◴[21 Apr 25 12:02 UTC] No.43750946{3}[source]▶

>>43750726 #

Correct - it'll accept "999.999.999.000.999." but it'll reject "127.0.0.1"

53. comrade1234 ◴[21 Apr 25 12:07 UTC] No.43750991{4}[source]▶

>>43750584 #

I would expect [a-z] to mean any lowercase in any language, not lowercase but only a to z. So I’d get bitten by that one.

replies(1): >>43751137 #

54. thomasmg ◴[21 Apr 25 12:15 UTC] No.43751066{3}[source]▶

>>43750685 #

Thanks for reading and taking the time to respond!

> Doesn't this go against the "literals are enclosed in quotes" idea?

Sure, one could argue that other changes would also be useful, but then it would be less concise. I think the main reasons why people like regex are: (a) powerful, (b) concise.

For my V2 proposal, the new rule is: "literals are enclosed in quotes", the rule isn't "_only_ literals are enclosed in quotes" :-) In this case, I think `-` can be quoted as well. I wanted to keep the v2 syntax as close as possible to the existing syntax.

55. pyfon ◴[21 Apr 25 12:16 UTC] No.43751084{3}[source]▶

>>43750549 #

Not sure what you are asking?

56. LaputanMachine ◴[21 Apr 25 12:18 UTC] No.43751104[source]▶

>>43750808 #

I've also seen people use `[\s\S]` to match all characters when they couldn't use `.`.

replies(2): >>43752536 #>>43752606 #

57. bazoom42 ◴[21 Apr 25 12:20 UTC] No.43751127{3}[source]▶

>>43750627 #

> I tend to prefer to write out the procedural code, even if it is (much) longer in terms of lines.

This might work for you, but in general the number of bugs is proportional to the amount of code. The regex engine is already thoroughly tested by someone else, while a custom implementation in procedural code will probably have bugs and be a lot more work to maintain if the pattern changes.

replies(3): >>43751445 #>>43753974 #>>43765539 #

58. deciduously ◴[21 Apr 25 12:21 UTC] No.43751137{5}[source]▶

>>43750991 #

The letters with diacritics sort lexicographically after 'z', so it does stand to reason they wouldn't appear in that range.

59. jcelerier ◴[21 Apr 25 12:23 UTC] No.43751152{3}[source]▶

>>43750627 #

What makes them unreadable to you ? 99% of the time you can just read them character by character with maybe some groups and back references

replies(1): >>43754116 #

60. Someone ◴[21 Apr 25 12:30 UTC] No.43751211{4}[source]▶

>>43750584 #

> but they mean “any character between the character ‘a’ and the character ‘z’”, which happens to correspond to the lower case letters in English

‘Only’ in the most commonly used character encodings. In EBCDIC (https://en.wikipedia.org/wiki/EBCDIC), the [a-z] range includes more than 26 characters.

That’s one of the reasons POSIX has character classes (https://en.wikipedia.org/wiki/Regular_expression#Character_c...). [:lower:] always gets you the lowercase characters in the encoding that the program uses.

61. aadhavans ◴[21 Apr 25 12:33 UTC] No.43751250[source]▶

>>43750496 #

Shameless plug: My Regex engine (https://pkg.go.dev/gitea.twomorecents.org/Rockingcool/kleing...) has dedicated syntax for this kind of task.

  <0-255>\.<0-255>\.<0-255>\.<0-255>

will only match full IPv4 addresses, but is a lot stricter than the one in the article.

EDIT: formatting

62. ◴[21 Apr 25 12:37 UTC] No.43751289[source]▶

>>43750314 (OP) #

63. BMc2020 ◴[21 Apr 25 12:38 UTC] No.43751301[source]▶

>>43750314 (OP) #

Regex is much easier if you don't do it all at once. It's perfectly acceptable to, say, trim all the leading spaces, store the result in a temp variable, trim all the trailing spaces, store the result in a temp variable, remove all the hyphens. etc. etc.
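A minimal sketch of that step-by-step style in Python (an assumption; the function name `normalize` and the sample input are made up):

```python
import re

def normalize(raw: str) -> str:
    """Clean up a string in small, individually obvious regex steps."""
    no_leading = re.sub(r"^\s+", "", raw)          # trim leading whitespace
    no_trailing = re.sub(r"\s+$", "", no_leading)  # trim trailing whitespace
    no_hyphens = re.sub(r"-", "", no_trailing)     # remove all hyphens
    return no_hyphens

print(normalize("  555-12-34  "))  # 5551234
```

Each intermediate variable can be printed or tested on its own, which is harder to do with a single combined pattern.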

Everyone tries to create the platonic ideal regex that does everything in one line.

64. prmph ◴[21 Apr 25 12:38 UTC] No.43751308[source]▶

>>43750314 (OP) #

I consider myself a reasonably competent senior engineer, and yet with regex this is what I have noticed:

Every time I need to write even the simplest regex, I can't seem to get it right the first time. I always need to struggle with it for a long time. Sometimes even using online tools takes me time to get it right. This happens every.single.time.

It baffles me to no end. I'm a pretty quick learner of pretty much everything I get into. I write the most sophisticated Typescript code you can imagine; I've written a small toy language; I've written biometric authentication drivers; I've written my own functional UI lib. But, I cannot master regex.

You can give me all the arguments about what is good about regex, but in my experience (which you can't argue with), it is a VERY badly designed API, and nothing will convince me otherwise. Regex is probably the worst thing ever in programming.

65. russfink ◴[21 Apr 25 12:40 UTC] No.43751329[source]▶

>>43750496 #

^^^ this ^^^ I can’t understand my own regexes after a couple weeks - much less the ones I got the AI to write for me because I’m lazy or time constrained.

66. alganet ◴[21 Apr 25 12:45 UTC] No.43751378[source]▶

>>43750314 (OP) #

One can think of regex as a very compact notation for writing text operations. It helps a lot.

The popular idea of them being write-only is obviously a joke, but it has some truth to it. On the good side, small code that needs to be rewritten is often better than large code that needs to be maintained.

67. ◴[21 Apr 25 12:45 UTC] No.43751379[source]▶

>>43750314 (OP) #

68. hamdouni ◴[21 Apr 25 12:48 UTC] No.43751410[source]▶

>>43750314 (OP) #

I jumped in here just to say that non-greedy constructions are valuable, and not using them makes expressions harder to write and to understand.
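A small Python sketch of the difference (Python and the sample text are assumptions), with the article's negated-class style alongside for comparison:

```python
import re

text = "<b>bold</b>"

# Greedy: .+ runs to the end and backtracks to the LAST '>'.
greedy = re.search(r"<.+>", text).group()

# Non-greedy: .+? stops at the FIRST '>' that lets the match succeed.
lazy = re.search(r"<.+?>", text).group()

# Negated class: the same result here, spelled without a lazy quantifier.
negated = re.search(r"<[^>]+>", text).group()

print(greedy)   # <b>bold</b>
print(lazy)     # <b>
print(negated)  # <b>
```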

69. thoroughburro ◴[21 Apr 25 12:50 UTC] No.43751428[source]▶

>>43750314 (OP) #

> NOTE: Some languages, like Rust, have parser combinators which can be as good or better than regex in most of the ways I care about.

What Rust feature is this referring to?

70. wat10000 ◴[21 Apr 25 12:52 UTC] No.43751438{3}[source]▶

>>43750693 #

Regex can be good but you need to be willing to bail out when it’s not appropriate.

For something like locating IP addresses in text, using a regex to identify candidates is a great idea. But as you show, you don’t want to implement the full validation in it. Use regex to find dotted digit groups, but validate the actual numeric values as a separate step afterwards.
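That two-step approach might look like the following Python sketch (Python is an assumption; `CANDIDATE` and `find_ipv4` are hypothetical names). The regex is deliberately loose, and the numeric check is plain code. Note that this version tolerates leading zeros such as `01`.

```python
import re

# Step 1: a loose pattern that only finds dotted groups of 1-3 digits.
CANDIDATE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")

def find_ipv4(text: str) -> list:
    """Return candidates whose four groups are all in the range 0-255."""
    hits = []
    for match in CANDIDATE.finditer(text):
        octets = match.group().split(".")
        # Step 2: validate the numeric values outside the regex.
        if all(int(octet) <= 255 for octet in octets):
            hits.append(match.group())
    return hits

print(find_ipv4("ok 192.168.1.10, bad 999.1.1.1"))  # ['192.168.1.10']
```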

71. RHSeeger ◴[21 Apr 25 12:52 UTC] No.43751441[source]▶

>>43750314 (OP) #

I tend to use regular expressions more commonly on the command line (looking for content in files, especially log files) than I do in code. But, that being said, I do use them in both cases. They're a tool and can be used well. But, like any other programming, you need to make sure your code is readable. Which (generally) means avoiding any really complex regular expressions.

72. justin66 ◴[21 Apr 25 12:52 UTC] No.43751445{4}[source]▶

>>43751127 #

> This might work for you, but in general the amount of bugs is proportional to the amount of code.

If you wanted to look for cases which serve as an exception to this rule, code relying on regexes would be an excellent place to start.

73. satisfice ◴[21 Apr 25 12:58 UTC] No.43751508[source]▶

>>43750314 (OP) #

I like the sentiment but I would make some very different choices. For instance, use the . operator, because it is easier to understand than his Rube-Goldberg-logic negation groups alternative.

He’s also strangely worried about portability. If you are really concerned about portability, you are moving between languages and you probably aren’t some novice who should be frightened by complexity.

I don’t think about portability at all, ever. And I do maintain code in Perl, Python, and Javascript.

But yeah, just as in all programming languages, you can get by with knowing about a 20% subset of all it can do.

74. bena ◴[21 Apr 25 13:04 UTC] No.43751569{3}[source]▶

>>43750627 #

Kind of fair.

I don't incorporate a lot of regular expressions into my code. But where I do like them is for search and replace. So I do treat them as mostly disposable.

75. nialv7 ◴[21 Apr 25 13:11 UTC] No.43751632[source]▶

>>43750496 #

When the article starts with an AI generated image that adds nothing to the explanation, it tends to make me suspicious if the article itself was written by an AI as well...

76. noxer ◴[21 Apr 25 13:22 UTC] No.43751734[source]▶

>>43750314 (OP) #

> Instead, use a range negation, like [^%] if you know the % character won’t show up. It doesn’t hurt to be a little more explicit.

This is absolutely horrible. Patterns are fairly readable if they follow the syntax logic. Matching "everything but that random character that will not appear" is absurd. Also, the idea that a . (dot) behaves arbitrarily in different languages shows a severe lack of understanding of regex syntax. Of course you can't write a proper pattern if you don't know which syntax is used. If anything, you would override the behavior of the . (dot) with the appropriate flag to ensure it works the same across compatible regex engines.

replies(1): >>43754673 #

77. mannykannot ◴[21 Apr 25 13:29 UTC] No.43751818{3}[source]▶

>>43750726 #

In practice, the first unpaired ] is treated as an ordinary character (at least according to https://regex101.com/) - which does nothing to make this regex fit for its intended purpose. I'm not sure whether this is according to spec. (I think it is, though that does not really matter compared to what the implementations actually do.)

Characters which are sometimes special, depending on context, are one more thing making regexes harder than they appear at first sight.

The author's willingness to publish code without even minimal testing does not inspire confidence.

replies(1): >>43756107 #

78. rusk ◴[21 Apr 25 13:41 UTC] No.43751927{3}[source]▶

>>43750627 #

These are all valid criticisms of regex

but they’re not an excuse to avoid regex. Similarly git has many warts but there’s no getting around it. Same with CSS

If you want to run with the herd though you need to know these things, even enjoy them.

You can rely on tooling and training wheels like Python VERBOSE but you’re never going to get away from the fact that the “rump” of the population works with them.

Easier to bite the bullet and get practised. I’ve no doubt you have the intellect - you only need be convinced it’s a good use of your time.

79. poisonborz ◴[21 Apr 25 14:11 UTC] No.43752218{3}[source]▶

>>43750794 #

You have test strings covering all cases and they match accordingly? The same way you'd know when writing manually.

replies(1): >>43753334 #

80. tomsmeding ◴[21 Apr 25 14:38 UTC] No.43752505{4}[source]▶

>>43750695 #

I know some people consider this fine. I do not. The fact that the world is not ideal does not mean that we cannot continue to improve things.

81. jmpman ◴[21 Apr 25 14:38 UTC] No.43752508[source]▶

>>43750314 (OP) #

I’ve started using LLMs to identify the proper regex for my use cases. I’d like to see such regex creation as an LLM benchmark.

82. tomsmeding ◴[21 Apr 25 14:39 UTC] No.43752536{3}[source]▶

>>43751104 #

This is a common approach when the regex needs to match any character including newlines; `.` often doesn't.
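In Python, for example (an assumption; other engines spell the flag differently), the three behaviours can be compared directly:

```python
import re

text = "a\nb"

# By default '.' does not match a newline.
default_dot = re.search(r"a.b", text)

# With the DOTALL flag, '.' matches any character including newline.
dotall = re.search(r"a.b", text, re.DOTALL)

# [\s\S] matches "whitespace or non-whitespace", so it needs no flag.
char_class = re.search(r"a[\s\S]b", text)

print(default_dot is None)     # True
print(dotall is not None)      # True
print(char_class is not None)  # True
```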

83. dimava ◴[21 Apr 25 14:47 UTC] No.43752606{3}[source]▶

>>43751104 #

I generally use `[^]`

Also you can use . with the dotAll /s

84. ◴[21 Apr 25 15:42 UTC] No.43753227[source]▶

>>43753195 #

85. bazoom42 ◴[21 Apr 25 15:52 UTC] No.43753334{4}[source]▶

>>43752218 #

Covering all cases? How would that be possible? Even if we only consider ASCII strings, there are about 16,000 possible two-character strings, 2 million possible three-character strings and so on.

86. mannykannot ◴[21 Apr 25 16:02 UTC] No.43753483[source]▶

>>43750314 (OP) #

Here’s a regex crossword:

https://jimbly.github.io/regex-crossword/

See also: Are Regex Crosswords NP-hard?

https://cs.stackexchange.com/questions/30143/are-regex-cross...

87. rerdavies ◴[21 Apr 25 16:52 UTC] No.43753974{4}[source]▶

>>43751127 #

In general, the correctness of the code is proportional to its readability.

I also prefer procedural code instead of regexes.

replies(1): >>43755629 #

88. reverendsteveii ◴[21 Apr 25 17:01 UTC] No.43754055[source]▶

>>43750496 #

I haven't verified it but quick googling for a regex to validate all legal email addresses pointed me to https://stackoverflow.com/questions/201323/how-can-i-validat..., where one commenter posits that regex to be:

(?:(?:\r\n)?[ \t])(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t] )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))"(?:(?: \r\n)?[ \t]))(?:\.(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:( ?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))"(?:(?:\r\n)?[ \t])))@(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\\".\[\] \000-\0 31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)\ ](?:(?:\r\n)?[ \t]))(?:\.(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\\".\[\] \000-\031]+ (?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)\](?: (?:\r\n)?[ \t])))|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))"(?:(?:\r\n) ?[ \t]))\<(?:(?:\r\n)?[ \t])(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\ r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)\](?:(?:\r\n)?[ \t]))(?:\.(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n) ?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)\](?:(?:\r\n)?[ \t] )))(?:,@(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)\](?:(?:\r\n)?[ \t])* )(?:\.(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t] )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)\](?:(?:\r\n)?[ \t])))) :(?:(?:\r\n)?[ \t]))?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+ |\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))"(?:(?:\r \n)?[ \t]))(?:\.(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?: \r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t ]))"(?:(?:\r\n)?[ \t])))@(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\\".\[\] \000-\031 ]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)\]( ?:(?:\r\n)?[ \t]))(?:\.(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\\".\[\] \000-\031]+(? 
:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)\](?:(? :\r\n)?[ \t])))\>(?:(?:\r\n)?[ \t]))|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(? :(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)? [ \t]))"(?:(?:\r\n)?[ \t])):(?:(?:\r\n)?[ \t])(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]| \\.|(?:(?:\r\n)?[ \t]))"(?:(?:\r\n)?[ \t]))(?:\.(?:(?:\r\n)?[ \t])(?:[^()<>

@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|" (?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))"(?:(?:\r\n)?[ \t])))@(?:(?:\r\n)?[ \t] )(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^\[\]\r\\]|\\.)\](?:(?:\r\n)?[ \t]))(?:\.(?:(?:\r\n)?[ \t])(? :[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[ \]]))|\[([^\[\]\r\\]|\\.)\](?:(?:\r\n)?[ \t])))|(?:[^()<>@,;:\\".\[\] \000- \031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|( ?:(?:\r\n)?[ \t]))"(?:(?:\r\n)?[ \t]))\<(?:(?:\r\n)?[ \t])(?:@(?:[^()<>@,; :\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([ ^\[\]\r\\]|\\.)\](?:(?:\r\n)?[ \t]))(?:\.(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\\" .\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\ ]\r\\]|\\.)\](?:(?:\r\n)?[ \t])))(?:,@(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\\".\ [\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\ r\\]|\\.)\](?:(?:\r\n)?[ \t]))(?:\.(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\] |\\.)\](?:(?:\r\n)?[ \t])))):(?:(?:\r\n)?[ \t]))?(?:[^()<>@,;:\\".\[\] \0 00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\ .|(?:(?:\r\n)?[ \t]))"(?:(?:\r\n)?[ \t]))(?:\.(?:(?:\r\n)?[ \t])(?:[^()<>@, ;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(? :[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))"(?:(?:\r\n)?[ \t])))@(?:(?:\r\n)?[ \t])* (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". 
\[\]]))|\[([^\[\]\r\\]|\\.)\](?:(?:\r\n)?[ \t]))(?:\.(?:(?:\r\n)?[ \t])(?:[ ^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\] ]))|\[([^\[\]\r\\]|\\.)\](?:(?:\r\n)?[ \t])))\>(?:(?:\r\n)?[ \t]))(?:,\s( ?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))"(?:(?:\r\n)?[ \t]))(?:\.(?:( ?:\r\n)?[ \t])(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[ \["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))"(?:(?:\r\n)?[ \t ])))@(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t ])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)\](?:(?:\r\n)?[ \t]))(? :\.(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+| \Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)\](?:(?:\r\n)?[ \t])))|(?: [^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\ ]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))"(?:(?:\r\n)?[ \t]))\<(?:(?:\r\n) ?[ \t])(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[" ()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)\](?:(?:\r\n)?[ \t]))(?:\.(?:(?:\r\n) ?[ \t])(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>

@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)\](?:(?:\r\n)?[ \t])))(?:,@(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@, ;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)\](?:(?:\r\n)?[ \t]))(?:\.(?:(?:\r\n)?[ \t] )(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\ ".\[\]]))|\[([^\[\]\r\\]|\\.)\](?:(?:\r\n)?[ \t])))):(?:(?:\r\n)?[ \t]))? (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\". \[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))"(?:(?:\r\n)?[ \t]))(?:\.(?:(?: \r\n)?[ \t])(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[ "()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))"(?:(?:\r\n)?[ \t]) ))@(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]) +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)\](?:(?:\r\n)?[ \t]))(?:\ .(?:(?:\r\n)?[ \t])(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z |(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)\](?:(?:\r\n)?[ \t])))\>(?:( ?:\r\n)?[ \t]))))?;\s)

Someone else then asks the absolute razor of a question: 'What value does this add over just verifying that the input is of the form {something}@{something}.{something}?'

replies(1): >>43755438 #

89. bluecheese452 ◴[21 Apr 25 17:06 UTC] No.43754116{4}[source]▶

>>43751152 #

I don’t think this is a particularly useful question. If they could accurately describe what exactly is confusing they wouldn’t be confused.

90. strunz ◴[21 Apr 25 18:03 UTC] No.43754673[source]▶

>>43751734 #

Agreed, I wanted to write the whole article off after that suggestion. That is such a terrible anti-pattern that would confuse everyone who looked at it, even people with decades of experience.

91. Supermancho ◴[21 Apr 25 19:17 UTC] No.43755438{3}[source]▶

>>43754055 #

> 'What value does this add over just verifying that the input is of the form {something}@{something}.{something}?'

Depends if {something} can contain periods for my email.

name@antispam.mydomain.com
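Whether periods are allowed depends on how {something} is written down. A minimal sketch, assuming it means "one or more characters other than whitespace or @" (that definition and the names below are assumptions for illustration, not anything specified in the thread):

```python
import re

# Hypothetical loose check: something@something.something, where
# "something" is one or more characters other than whitespace or "@".
LOOSE_EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def looks_like_email(value: str) -> bool:
    """True if the whole string has the loose shape x@y.z."""
    return LOOSE_EMAIL.fullmatch(value) is not None


for sample in ["name@antispam.mydomain.com", "user@localhost", "a@@b.c"]:
    print(sample, looks_like_email(sample))
```

With this definition the extra periods in the domain are absorbed by the middle {something}, so the subdomain address passes.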

replies(1): >>43762827 #

92. bazoom42 ◴[21 Apr 25 19:40 UTC] No.43755629{5}[source]▶

>>43753974 #

Surely complexity is a factor? A procedural implementation will necessarily have the same essential complexity as the regex it replaces, but then it will additionally have a bunch of incidental complexity in matching and looping and backtracking.

Regexes can certainly be hard to read - the solution is to use formatting and comments to make them easier to understand - not to drown the logic in reams of boilerplate code.
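As a sketch of what formatting and comments can look like, here is Python's `re.VERBOSE` mode applied to a made-up date pattern (the pattern and group names are illustrative only):

```python
import re

# Illustrative pattern for dates such as 2025-04-21 (not from the thread).
DATE = re.compile(
    r"""
    (?P<year>  [0-9]{4} )   # four-digit year
    -
    (?P<month> [0-9]{2} )   # two-digit month
    -
    (?P<day>   [0-9]{2} )   # two-digit day
    """,
    re.VERBOSE,
)

# The same pattern written the compact way.
COMPACT = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")

match = DATE.fullmatch("2025-04-21")
print(match.groupdict())
print(COMPACT.fullmatch("2025-04-21").groups())
```

In verbose mode, whitespace outside character classes is ignored and `#` starts a comment, so the layout costs nothing at match time.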

replies(1): >>43765553 #

93. vitus ◴[21 Apr 25 20:28 UTC] No.43756107{4}[source]▶

>>43751818 #

Agreed entirely, on all those points.

Calling the extra ] a syntax error was a slight exaggeration on my behalf, but that was clearly an unintended extra character -- there's no way the author thinks "123].45].67].89]" is a valid IP address. But yes, it does compile and is interpreted as a valid regex, albeit not a useful one in this context.

The out-of-range values are not ideal but can be fixed with post-validation in code (which is cleaner than writing unnecessarily complicated regex, anyways). The missing ? leads to a bunch of false negatives, and the trailing . causes even more problems.
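A sketch of both points in Python. `QUOTED` is the pattern as quoted earlier in the thread, stray `]` included; `DOTTED_QUAD` and `is_ipv4` are hypothetical names for one possible fix, with the range check done in code after the match.

```python
import re

# The pattern as quoted earlier in the thread, stray "]" included.
QUOTED = re.compile(r"([0-9][0-9]?[0-9]][.])+")

# Hypothetical fix: four groups of one to three digits, dots only between them.
DOTTED_QUAD = re.compile(
    r"([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})"
)


def is_ipv4(text: str) -> bool:
    """Shape check with the regex, range check in plain code."""
    match = DOTTED_QUAD.fullmatch(text)
    if match is None:
        return False
    return all(int(octet) <= 255 for octet in match.groups())


# The quoted pattern accepts bracket-laden input and rejects a real address.
print(QUOTED.fullmatch("123].45].67].89].") is not None)
print(QUOTED.fullmatch("1.2.3.4") is not None)

print(is_ipv4("192.168.0.1"), is_ipv4("1.2.3.4"), is_ipv4("256.1.1.1"))
```

This sketch still accepts octets with leading zeros such as `010`; whether that is acceptable depends on the use case.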

94. bazoom42 ◴[21 Apr 25 21:26 UTC] No.43756671{4}[source]▶

>>43750866 #

.NET also supports verbose regex.

95. ◴[22 Apr 25 02:29 UTC] No.43758573[source]▶

>>43750314 (OP) #

96. m463 ◴[22 Apr 25 02:30 UTC] No.43758574[source]▶

>>43750314 (OP) #

Regexes are powerful, useful and needlessly hard to use.

But not because of the regex idea itself.

It is quoting.

The reason people don't properly learn how to use a regex is because they are insulated from it by whatever language they are using.

It's literally like those surgeons who do heart surgery starting at a vein in your leg.

I use regexes all the time, in emacs, python, perl, bash, sed, awk, grep and more...

and just about every time the regex syntax is mixed with single quotes, double quotes, backslashes, $variable names and more from the "enclosing language or tool".

If I have a parenthesis or $, I'm always wondering if it is part of the enclosing language, or the matching pattern, or the literal. Also, the kind of regex adds to the confusion (basic or extended regex?)

I think it would be nice to have a syntax highlighter that would help with this, independent of language. green for variable or other language construct, red for regex pattern, white for matching literal.
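A small Python illustration of the layering (a sketch with a made-up input, just to show the two levels of escaping): matching one literal backslash takes four backslashes in a regular string literal, and two in a raw one.

```python
import re

path = r"C:\temp"  # made-up input containing one literal backslash

# Regular string literal: Python consumes one level of escaping and the
# regex engine consumes another, so one backslash needs four.
plain = re.compile("\\\\")

# Raw string literal: only the regex level of escaping remains.
raw = re.compile(r"\\")

print(plain.pattern == raw.pattern)
print(plain.findall(path), raw.findall(path))
```

Both compile to the same pattern; the only difference is how much of the enclosing language sits between you and the regex.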

replies(1): >>43759153 #

97. recursivecaveat ◴[22 Apr 25 04:50 UTC] No.43759153[source]▶

>>43758574 #

Wait until somebody uses string templating to insert something that ends with a backslash, changing the meaning of following characters from what the syntax highlighting thinks; a curse be upon that person.
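A minimal Python sketch of that failure mode, with a made-up fragment and target string. `re.escape` is the standard-library way to neutralise the inserted text.

```python
import re

fragment = "a\\"  # made-up user input: the letter a followed by one backslash

# Naive templating: the trailing backslash escapes the following ".",
# turning the intended wildcard into a literal dot.
naive = re.compile(fragment + ".txt")

# re.escape neutralises the fragment, so "." stays a wildcard.
safe = re.compile(re.escape(fragment) + ".txt")

target = "a\\Xtxt"  # a, backslash, X, t, x, t
print(naive.fullmatch(target) is not None)
print(safe.fullmatch(target) is not None)
print(naive.fullmatch("a.txt") is not None)
```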

Escaping/quoting is such a mud pile everywhere because it's in-band communication, but nobody would tolerate all out-of-band because it's too tedious. At least newer languages are getting better with things like 'raw' strings or Rust's arbitrarily long delimiters, but I'd still like more control.

I'm surprised I never see languages adopt directed delimiters like {my string} or something, since it lets you avoid escaping in the very common case of balanced internal delimiters.

98. reverendsteveii ◴[22 Apr 25 14:44 UTC] No.43762827{4}[source]▶

>>43755438 #

{something} is defined as any string of valid characters with a length greater than 0. I'm sure there's some char somewhere that breaks this, but in a realistic setting with normal users you won't encounter that edge case, and if you do, the worst that happens is an email gets returned as non-deliverable.

99. kelafoja ◴[22 Apr 25 19:39 UTC] No.43765539{4}[source]▶

>>43751127 #

That is quite a generalization. The regex engine is tested, but my specific regular expression isn't. My ability to write correct regular expressions is weak, so there can be many bugs in the one line of regular expression.

replies(2): >>43822087 #>>43822524 #

100. kelafoja ◴[22 Apr 25 19:41 UTC] No.43765553{6}[source]▶

>>43755629 #

> A procedural implementation will necessarily have the same essential complexity as the regex it replaces

I don't think I fully agree with this, and I don't see a basis for why this should be true. If I have a very specific implementation, it could have very little incidental complexity, it could be fully targeted to the use case. Whereas with regular expressions there is incidental complexity of the regex engine itself by definition.

replies(1): >>43771185 #

101. kelafoja ◴[22 Apr 25 19:43 UTC] No.43765568{4}[source]▶

>>43750642 #

You know that there are more friendly sounding ways to give this suggestion, right?

102. krackers ◴[23 Apr 25 06:09 UTC] No.43769129[source]▶

>>43750827 #

The issue is that the formal definition of a regex only deals with whether a string belongs to the language recognized by the regex or not (boolean accept/non-accept), but regex in practice often talks in terms of "find the substring (if any) that matches". That causes issues because a regex is equivalent to an NFA, so a given string can possibly be matched in multiple ways, which forces you to bring in the notion of a "greedy" vs "non-greedy" match in order to disambiguate. Add on top of that the desire to define sub-matches in terms of capturing groups, and it's just a complete mess. And that's not even getting to the not-strictly-regular PCRE extensions like lookaround, backreferences, etc.
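A short Python sketch of the distinction, using a made-up input: the membership question has one answer, while the substring and capture-group questions depend on the engine's greedy or non-greedy rules.

```python
import re

text = "<a><b>"  # made-up input

# Membership question: does the whole string belong to the language?
is_member = re.fullmatch(r"<.+>", text) is not None

# Substring question: the same pattern can match in more than one way.
greedy = re.search(r"<.+>", text).group(0)   # longest candidate from the first "<"
lazy = re.search(r"<.+?>", text).group(0)    # shortest candidate from the first "<"

# Capturing groups add another layer: which part went into which group?
split = re.fullmatch(r"(a*)(a*)", "aaa").groups()

print(is_member, greedy, lazy, split)
```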

103. bazoom42 ◴[23 Apr 25 12:12 UTC] No.43771185{7}[source]▶

>>43765553 #

Complexity in the standard library is not that relevant. If you make your own custom dictionary implementation, you increase complexity of your code base compared to just using the one in the standard library, even if your own implementation is simpler.

The relevant complexity for using a regex is the complexity of the pattern itself and the complexity of invoking the regex. Any custom procedural solution will be more complex unless it is literally something as simple as checking whether a string contains a given literal string.
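To make the comparison concrete, here is a sketch of one rule written both ways. The format (two uppercase ASCII letters followed by four ASCII digits) and the function names are made up for illustration; which version counts as more complex is exactly what is being debated.

```python
import re

# Hypothetical format: two uppercase ASCII letters followed by four ASCII digits.
CODE = re.compile(r"[A-Z]{2}[0-9]{4}")


def matches_regex(value: str) -> bool:
    return CODE.fullmatch(value) is not None


def matches_by_hand(value: str) -> bool:
    """The same rule written out procedurally."""
    if len(value) != 6:
        return False
    letters, digits = value[:2], value[2:]
    return all("A" <= ch <= "Z" for ch in letters) and all(
        "0" <= ch <= "9" for ch in digits
    )


for sample in ["AB1234", "ab1234", "AB123", "AB12345"]:
    print(sample, matches_regex(sample), matches_by_hand(sample))
```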

replies(1): >>43792521 #

104. rerdavies ◴[25 Apr 25 11:36 UTC] No.43792521{8}[source]▶

>>43771185 #

For some arbitrary definition of complex.

105. ◴[28 Apr 25 14:45 UTC] No.43822087{5}[source]▶

>>43765539 #

106. bazoom42 ◴[28 Apr 25 15:25 UTC] No.43822524{5}[source]▶

>>43765539 #

If you have made a bug in the specification of the pattern to match, then you will have the same bug in the hand-rolled implementation of the matching. It will just be more difficult to find the bug since the pattern is not explicitly specified anymore.