
Getting AI to write good SQL

(cloud.google.com)
478 points by richards | 3 comments
pcblues ◴[] No.44014141[source]
Can someone please answer these questions because I still think AI stinks of a false promise of determinable accuracy:

- Do you need an expert to verify whether the answer from the AI is correct?

- How is refining prompts a time saving over refining SQL? Is it just typing time?

- How can you know the results are correct if you aren't able to produce them yourself?

- Why should a junior (a sorcerer's apprentice) be trusted to run the AI? No matter the domain, from art to code to business rules, you still need an expert to verify the results.

- Would they (and their company) be better off designing a solution themselves, knowing their own assumptions, rather than checking off a list of happy-path results without FULL knowledge of the underlying design?

This is not just a change from hand-crafting to line production; it's a change from deterministic problem-solving to near-enough-is-good-enough, sold as the new truth in problem-solving. It smells wrong.

replies(5): >>44014181 #>>44014318 #>>44014541 #>>44014740 #>>44015021 #
1. lmeyerov ◴[] No.44014541[source]
I can bring data here:

We recently did the first speed run where Louie.ai beat teams of professional cybersecurity analysts in an open competition, Splunk's annual Boss of the SOC. Think writing queries, wrangling Python, and scanning through 100+ log sources to answer frustratingly sloppy database questions:

- We got 100% correct on the basic questions in the first half, which take most people 5-15 minutes each, and 50% correct in the second half, where most people take 15-45+ minutes per question and most teams time out.

- ... Louie takes a median 2-3 minutes per question irrespective of the expected difficulty, so about 10X faster than a team of 5 (wall clock) and 30X less work (person-hours). Louie isn't burnt out at the end ;-)

- This doesn't happen out-of-the-box with frontier models, including the fancy reasoning ones. Likewise, letting the typical tool here burn tokens until it finds an answer would cost more than a new hire, which is why we measure a speed run rather than a deceptively uncapped auto-solve count.

- The frontier models DO have good intuition, DO understand many errors, and, for popular languages, DO generate good text2query. We are generally happy with OpenAI, for example, so it's more about how Louie and the operator use it.

- We found we had to add in key context and strategies. You see a bit of this in Claude Code and Cursor, except those are quite generic, so they would have failed here as well. Intuitively, in coding you want to lean on types/lint/tests; the same idea applies to database work, just with different failure modes. But in my experience there is a lot more that is domain-specific, and expecting tools to just work is unrealistic, so having domain-relevant patterns baked in, and extensible by you, is key, and so are learning loops. A minimal sketch of the pattern is below.
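
To make that concrete, here is an illustrative sketch of the generic "schema context + validation loop" shape for text2query. This is not Louie's implementation: llm_complete(), the prompt wording, and the retry budget are placeholder assumptions, and SQLite stands in for whatever database you actually query.

    # Illustrative sketch only (not Louie.ai's implementation) of a
    # context + validation loop for text2query.
    import sqlite3

    def llm_complete(prompt: str) -> str:
        """Placeholder: send the prompt to your LLM of choice, return raw SQL text."""
        raise NotImplementedError

    def schema_context(conn: sqlite3.Connection) -> str:
        """Dump CREATE TABLE statements so the model sees real table/column names."""
        rows = conn.execute("SELECT sql FROM sqlite_master WHERE type = 'table'")
        return "\n".join(r[0] for r in rows if r[0])

    def text_to_sql(conn: sqlite3.Connection, question: str, max_attempts: int = 3) -> str:
        feedback = ""
        for _ in range(max_attempts):
            prompt = (
                "You write SQLite SQL.\nSchema:\n" + schema_context(conn)
                + "\n\nQuestion: " + question
                + "\nReturn only the SQL." + feedback
            )
            sql = llm_complete(prompt).strip().rstrip(";")
            try:
                # Dry-run: EXPLAIN plans the query without executing it, which
                # cheaply catches syntax errors and bad table/column names.
                conn.execute("EXPLAIN QUERY PLAN " + sql)
                return sql
            except sqlite3.Error as err:
                # Feed the error back so the next attempt can self-correct.
                feedback = f"\nYour previous attempt failed: {err}\nIt was: {sql}"
        raise RuntimeError("no valid query within the retry budget")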

A bit more on Louie's speed run here: https://www.linkedin.com/posts/leo-meyerovich-09649219_genai...

This is our first attempt at the speed run. I expect Louie to improve: my answers represent the current floor, not the ceiling of where things are (dizzyingly) going. Happy to answer any other q's where data might help!

replies(1): >>44014771 #
2. blibble ◴[] No.44014771[source]
Is a competition/speed run a realistic example?
replies(1): >>44014807 #
3. lmeyerov ◴[] No.44014807[source]
Splunk Boss of the SOC is the realistic test; it is one of the best cyber ranges. Think effectively 30+ hours of tricky querying across 100+ real log source types (tables) with a variety of recorded cyber incidents - OS logs, AWS logs, alerting systems, etc. As I mentioned, the AI has to seriously look at the data too: it is typically several queries deep to the right answer, with a lot of rabbit holes before then - answers can't just skate by on the schema. I recommend folks look at the questions and decide for themselves what this signifies. I personally gained a lot of respect for the team that created the competition.

The speed run formulation for all those same questions helps measure real-world quality vs cost trade-offs. I don't find uncapped solve rates to be relevant to most scenarios. If we allowed infinite time, yes we would have scored even higher... But if our users also ran it that way, it would bankrupt them.
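
As a toy illustration of the capped-vs-uncapped distinction (made-up numbers and a made-up 3-minute cap, not BOTS results):

    # An uncapped solve rate counts a 40-minute answer the same as a
    # 2-minute one; a speed-run score only credits answers within budget.
    from dataclasses import dataclass

    @dataclass
    class Attempt:
        correct: bool
        minutes: float  # wall-clock time spent on the question

    def uncapped_rate(attempts: list[Attempt]) -> float:
        return sum(a.correct for a in attempts) / len(attempts)

    def speedrun_rate(attempts: list[Attempt], cap_minutes: float) -> float:
        # Slow wins are discarded, which keeps the metric tied to real-world cost.
        return sum(a.correct and a.minutes <= cap_minutes for a in attempts) / len(attempts)

    attempts = [Attempt(True, 2.1), Attempt(True, 40.0), Attempt(False, 12.0)]
    print(uncapped_rate(attempts))                   # 2/3: looks great on paper
    print(speedrun_rate(attempts, cap_minutes=3.0))  # 1/3: what a capped run credits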

For anyone in the industry: there are surprisingly few open tests here, which is another part of why we did BOTS. IMO sunlight brings progress, and I would love to chat with others about doing more open benchmarks!