Most active commenters
  • RadiozRadioz(3)
  • tough(3)
  • layer8(3)

←back to thread

Getting AI to write good SQL

(cloud.google.com)
476 points richards | 29 comments | | HN request time: 1.836s | source | bottom
Show context
nashashmi ◴[] No.44010349[source]
AI text to regex solutions would be incredibly handy.
replies(4): >>44010706 #>>44011846 #>>44012981 #>>44015104 #
1. RadiozRadioz ◴[] No.44010706[source]
This comment appears frequently and always surprises me. Do people just... not know regex? It seems so foreign to me.

It's not like it's some obscure thing, it's absolutely ubiquitous.

Relatively speaking it's not very complicated, it's widely documented, has vast learning resources, and has some of the best ROI of any DSL. It's funny to joke that it looks like line noise, but really, there is not a lot to learn to understand 90% of the expressions people actually write.

It takes far longer to tell an AI what you want than to write a regex yourself.

replies(12): >>44010769 #>>44010791 #>>44010803 #>>44010854 #>>44010941 #>>44011236 #>>44011532 #>>44011584 #>>44011591 #>>44012097 #>>44012483 #>>44013224 #
2. tough ◴[] No.44010769[source]
Its something you use so sparingly far away usually that never sticks around
replies(1): >>44011136 #
3. jacob019 ◴[] No.44010791[source]
The first languge I used to solve real problems was perl, where regex is a first class citizen. In python less so, most of my python scripts don't use it. I love regex but know several developers who avoid it like plague. You don't know what you don't know, and there's nothing wrong with that. LLM's are super helpful for getting up to speed on stuff.
4. insin ◴[] No.44010803[source]
IME it's not just longer, but also more difficult to tell the LLM precisely what you want than to write it yourself if you need a somewhat novel RegExp, which won't be all over the training data.

I needed one to do something with Markdown which was a very internal BigCo thing to need to do, something I'd never have written without weird requirements in play. It wasn't that tricky, but going back trying to get LLMs to replicate it after the fact from the same description I was working from, they were hopeless. I need to dig that out again and try it on the latest models.

5. eddd-ddde ◴[] No.44010854[source]
I know regex. But I use it so sparingly that every time I need it I forgot again the character for word boundary, or the character for whitespace, or the exact incantation for negative lookahead. Is it >!? who knows.

A shortcut to type in natural language and get something I can validate in seconds is really useful.

replies(1): >>44010931 #
6. layer8 ◴[] No.44010931[source]
How do you validate it if you don’t know the syntax? Or are you saying that looking up syntax –> semantics is significantly quicker than semantics –> syntax? Which I don’t find to be the case. What takes time is grokking the semantics in context, which you have to do in both cases.
replies(2): >>44010966 #>>44014538 #
7. CuriouslyC ◴[] No.44010941[source]
I was using perl in the late 90s for sysadmin stuff, have written web scrapers in python and have a solid history with regex. That being said, AI can still write really complex lookback/lookahead/nested extraction code MUCH faster and with fewer bugs than me, because regex is easy to make small mistakes with even when proficient.
8. tough ◴[] No.44010966{3}[source]
https://regex101.com/
replies(2): >>44011044 #>>44011120 #
9. marcosdumay ◴[] No.44011044{4}[source]
Notice that site has a very usable reference list you can consult for all those details the GP forgets.
10. layer8 ◴[] No.44011120{4}[source]
That doesn’t answer the question. By “validate”, I mean “prove to yourself that the regular expression is correct”. Much like with program code, you can’t do that by only testing it. You need to understand what the expression actually says.
replies(1): >>44013194 #
11. skydhash ◴[] No.44011136[source]
A cheat sheet is just a web search away.
replies(1): >>44012057 #
12. nashashmi ◴[] No.44011236[source]
I use regex as an alternative to wildcards in various apps like notepad++ and vscode. The format is different in each app. And the syntax is somewhat different. I have to research it each time. And complex regex is a nightmare.

Which is why I would ask an AI to build it if it could.

13. crystal_revenge ◴[] No.44011532[source]
I personally didn’t really understand how to write regex until I understood “regular languages” properly, then it was obvious.

I’ve found that the vast majority of programmers today do not have any foundation in formal languages and/or the theory of computation (something that 10 years ago was pretty common to assume).

It used to be pretty safe to assume that everyone from perl hackers to computer science theorists understood regex pretty well, but I’ve found it’s increasingly a rare skill. While it used to be common for all programmers to understand these things, even people with a CS background view that as some annoying course they forgot as soon as the exam was over.

14. emmelaich ◴[] No.44011584[source]
There's often a bunch of edge cases that people overlook. And you also get quadratic behaviour for some fairly 'simple' looking regexes that few people seem aware of.
15. fooker ◴[] No.44011591[source]
Regex, especially non standard (and non regular) extensions can be pretty tricky to grok.

http://alf.nu/RegexGolf?world=regex&level=r00

replies(1): >>44013047 #
16. jimbokun ◴[] No.44012057{3}[source]
So is an LLM.
replies(1): >>44013011 #
17. vivzkestrel ◴[] No.44012097[source]
since you know so much regex, why dont you write a regex html parser /s
18. simonw ◴[] No.44012483[source]
"It takes far longer to tell an AI what you want than to write a regex yourself."

My experience is the exact opposite. Writing anything but the simplest regex by hand still takes me significant time, and I've been using them for decades.

Getting an LLM to spit out a regex is so much less work. Especially since an LLM already knows the details of the different potential dialects of regex.

I use them to write regexes in PostgreSQL, Python, JavaScript, ripgrep... they've turned writing a regex from something I expect to involve a bunch of documentation diving to something I'll do on a whim.

Here's a recent example - my prompt included a copy of a PostgreSQL schema and these instructions:

  Write me a SQL query to extract
  all of my images and their alt
  tags using regular expressions.
  In HTML documents it should look
  for either <img .* src="..." .*
  alt="..." or <img alt="..." .*
  src="..." (images may be self-
  closing XHTML style in some 
  places). In Markdown they will
  always be ![alt text](url)
I ended up with 100 lines of SQL: https://gist.github.com/simonw/5b44a662354e124e33cc1d4704cdb...

The markdown portion of that is a good example of the kind of regex I don't enjoy writing by hand, due to the need to remember exactly which characters to escape and how:

  (REGEXP_MATCHES(commentary,
  '!\[([^\]]*)\]\(([^)]*)\)', 'g'))[2] AS src,
  (REGEXP_MATCHES(commentary,
  '!\[([^\]]*)\]\(([^)]*)\)', 'g'))[1] AS alt_text
Full prompt and notes here: https://simonwillison.net/2025/Apr/28/dashboard-alt-text/
replies(1): >>44013215 #
19. DonHopkins ◴[] No.44013011{4}[source]
So is a real html parser.

https://blog.codinghorror.com/parsing-html-the-cthulhu-way/

https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser)

20. RadiozRadioz ◴[] No.44013047[source]
/foo/

took me 25.75 seconds, including learning how the website worked. I actually solved it in ~15 seconds, but I hadn't realized I got the correct answer becuase it was far too simple.

This website is much better https://regexcrossword.com/challenges/experienced/puzzles/e9...

replies(1): >>44017836 #
21. widdershins ◴[] No.44013194{5}[source]
Testing something is the best way to prove that it behaves correctly in all the cases you can think of. Relying on your own (fallible) understanding is dangerous.

Of course, there may be cases you didn't think of where it behaves incorrectly. But if that's true, you're just as likely to forget those cases when studying the expression to see "what it actually says". If you have tests, fixing a broken case (once you discover it) is easy to do without breaking the existing cases you care about.

So for me, getting an AI to write a regex, and writing some tests for it (possibly with AI help) is a reasonable way to work.

replies(1): >>44014320 #
22. RadiozRadioz ◴[] No.44013215[source]
Perhaps Perl has given me Stockholm Syndrome, but when I look at your escaped regex example, it's extremely natural for me. In fact, I'd say it's a little too simple, because the LLM forgot to exclude unnecessary whitespace:

  (REGEXP_MATCHES(commentary,
  '!\[\s*([^\]]*?)\s*\]\(\s*([^)]*?)\s*\)', 'g'))[2] AS src,
  (REGEXP_MATCHES(commentary,
  '!\[\s*([^\]]*?)\s*\]\(\s*([^)]*?)\s*\)', 'g'))[1] AS alt_text
That is just nitpicking a one-off example though, I understand your wider point.

I appreciate the LLM is useful for problems outside one's usual scope of comfort. I'm mainly saying that I think it's a skill where the "time economics" really are in favor of learning it and expanding your scope. As in, it does not take a lot learning time before you're faster than the LLM for 90% of things, and those things occur frequently enough that your "learning time deficit" gets repaid quickly. Certainly not the case for all skills, but I truly believe regex is one of them due to its small scope and ubiquitous application. The LLM can be used for the remaining 10% of really complicated cases.

As you've been using regex for decades, there is already a large subset of problems where you're faster than the LLM. So that problem space exists, it's all about how to tune learning time to right-size it for the frequency the problems are encountered. Regex, I think, is simple enough & frequent enough where that works very well.

replies(1): >>44013646 #
23. nevf1 ◴[] No.44013224[source]
I respectfully disagree. Thankfully, I don't need to write regex much, so when I do it's always like it's the first time. I don't find the syntax particularly intuitive and I always rely on web-based or third party tools to validate my regex.

Whenever I have worked on code smells (performance issues, fuzzy test fails etc), regex was 3rd only to poorly written SQL queries, and/or network latency.

All-in-all, not a good experience for me. Regex is the one task that I almost entirely rely on GitHub Copilot in the 3-4 times a year I have to.

24. simonw ◴[] No.44013646{3}[source]
> As in, it does not take a lot learning time before you're faster than the LLM for 90% of things, and those things occur frequently enough that your "learning time deficit" gets repaid quickly.

It doesn't matter how fast I get at regex, I still won't be able to type any but the shortest (<5 characters) patterns out quicker than an LLM can. They are typing assistants that can make really good guesses about my vaguely worded intent.

As for learning deficit: I am learning so much more thanks to heavy use of LLMs!

Prior to LLMs the idea of using a 100 line PostgreSQL query with embedded regex to answer a mild curiosity about my use of alt text would have finished at the idea stage: that's not a high value enough problem for me to invest more than a couple of minutes, so I would not have done it at all.

replies(1): >>44013937 #
25. Agraillo ◴[] No.44013937{4}[source]
Good points. Also looking at your original example I noticed that not only humans can explain regularities they expect in many different ways (also correcting along the way), they can basically ask LLM to base the result on a reference. In your example you provided a template with an img tag and brackets having different attributes patterns. But one can also just ask for a html-style tag. As I did with the "Please create a regex for extracting image file names when in a text a html-style tag img is met" (not posting it here, but "src" is clearly visible in the answer). So the "knowledge" from other domains is applied to the regex creation.
26. layer8 ◴[] No.44014320{6}[source]
I don’t believe this is true. That’s why we do mathematical proofs, instead of only testing all the cases one can think of. It’s important to sanity-check one’s understanding with tests, but mere black-box testing is no substitute for the understanding.
replies(1): >>44015072 #
27. eddd-ddde ◴[] No.44014538{3}[source]
In my case most of my regex is for utility scripts for text processing. That means that I just run the script, and if it does what I want it to do I know I'm done.

LLMs have been amazing in my experience putting together awk scripts or shell scripts in general. I've also discovered many more tools and features I wouldn't have otherwise.

28. tough ◴[] No.44015072{7}[source]
Code is not perfect like math imho

libraries some times make weird choices

in theory theory and practice are the same, in practice not really

in the context of regex, you have to know which dialect and programming language version of regex you’re targeting for example. its not really universal how all libs/languages works

thus the need to test

29. fooker ◴[] No.44017836{3}[source]
Great, except you misunderstood the problem and wrote the exact opposite solution here.

Also this is the easiest starter puzzle, once you solve it you can click through to the next ones with increasing difficulty.