Getting AI to write good SQL

1. mritchie712 ◴[16 May 25 21:43 UTC] No.44010031[source]▶

the short answer: use a semantic layer.

It's the cleanest way to give the right context and the best place to pull a human in the loop.

A human can validate and create all important metrics (e.g. what does "monthly active users" really mean) then an LLM can use that metric definition whenever asked for MAU.

With a semantic layer, you get the added benefit of writing queries in JSON instead of raw SQL. LLM's are much more consistent at writing a small JSON vs. hundreds of lines of SQL.

We[0] use cube[1] for this. It's the best open source semantic layer, but there's a couple closed source options too.

My last company wrote a post on this in 2021[2]. Looks like the acquirer stopped paying for the blog hosting, but the HN post is still up.

0 - https://www.definite.app/

1 - https://cube.dev/

2 - https://news.ycombinator.com/item?id=25930190

replies(7): >>44010358 #>>44011108 #>>44011775 #>>44011802 #>>44012638 #>>44013043 #>>44013772 #

2. ljm ◴[16 May 25 22:32 UTC] No.44010358[source]▶

>>44010031 (TP) #

> you get the added benefit of writing queries in JSON instead of raw SQL.

I’m sorry, I can’t. The tail is wagging the dog.

dang, can you delete my account and scrub my history? I’m serious.

replies(6): >>44010404 #>>44010459 #>>44011117 #>>44011630 #>>44011681 #>>44013578 #

3. fhkatari ◴[16 May 25 22:45 UTC] No.44010459[source]▶

>>44010358 #

You move all the tools to debug and inspect slow queries, in a completely unsupported JSON environment, with prompts not to make up column names. And this is progress?

replies(2): >>44010596 #>>44011364 #

4. ◴[16 May 25 23:06 UTC] No.44010596{3}[source]▶

>>44010459 #

5. galenmarchetti ◴[17 May 25 00:41 UTC] No.44011108[source]▶

>>44010031 (TP) #

still need someone to build the semantic layer, why not use text2sql or something similar for that

6. indymike ◴[17 May 25 00:43 UTC] No.44011117[source]▶

>>44010358 #

This may be the best comment on Hacker News ever.

replies(1): >>44014681 #

7. mritchie712 ◴[17 May 25 01:33 UTC] No.44011364{3}[source]▶

>>44010459 #

The JSON compiles to SQL. Have you used a semantic layer? You might have a different opinion if you tried one.

replies(2): >>44012209 #>>44021252 #

8. IncreasePosts ◴[17 May 25 02:52 UTC] No.44011681[source]▶

>>44010358 #

You're right, it's a bit ridiculous. This is a perfect time to use xml instead of json.

replies(1): >>44013246 #

9. tclancy ◴[17 May 25 03:14 UTC] No.44011775[source]▶

>>44010031 (TP) #

Mother of God. I can write JSON instead of a language designed for querying. What is the advantage? If I’m going to move up an abstraction layer, why not give me natural language? Lots of things turn a limited natural language grammar into SQL for you. What is JSON going to {do: for: {me}}?

replies(2): >>44012372 #>>44013758 #

10. fkyimeanit ◴[17 May 25 03:24 UTC] No.44011802[source]▶

>>44010031 (TP) #

>you get the added benefit of writing queries in JSON instead of raw SQL

You should have written your comment in JSON instead of raw English.

11. e3bc54b2 ◴[17 May 25 05:18 UTC] No.44012209{4}[source]▶

>>44011364 #

As someone who actually wrote a JSON to (limited) SQL transpiler at $DAYJOB, as much fun as I had designing and implementing that thing and for as many problems it solved immediately, 'tail wagging the dog' is the perfect description.

12. 8n4vidtmkvmk ◴[17 May 25 06:16 UTC] No.44012372[source]▶

>>44011775 #

Sorry, I couldn't parse that. You didn't quote your keys

replies(1): >>44017959 #

13. meindnoch ◴[17 May 25 07:23 UTC] No.44012638[source]▶

>>44010031 (TP) #

>you get the added benefit of writing queries in JSON instead of raw SQL

^ kids, this is what AI-induced brainrot looks like.

14. jinjin2 ◴[17 May 25 09:06 UTC] No.44013043[source]▶

>>44010031 (TP) #

I agree that using a semantic layer is the best way to get better precision. It is almost like a cheatsheet for the AI.

But I would never use one that forced me to express my queries in JSON. The best implementations integrate right into the database so they become an integral part of regular your SQL queries, and as such also available to all your tools.

In my experience, from using the Exasol Semantic Layer, it can be a totally seamless experience.

15. meindnoch ◴[17 May 25 10:02 UTC] No.44013246{3}[source]▶

>>44011681 #

Clearly the right solution is to use XML Object Notation, aka XON™!

JSON:

  {"foo": ["bar", 42]}

XON:

  <Object>
    <Property>
      <Key>foo</Key>
      <Value>
        <Array>
          <String>bar</String>
          <Number>42</Number>
        </Arra>
      </Value>
    </Property>
  </Object>

It gives you all the flexibility of JSON with the mature tooling of XML!

Edit: jesus christ, it actually exists https://sevenval.gitbook.io/flat/reference/templating/oxn

replies(1): >>44013524 #

16. indymike ◴[17 May 25 11:28 UTC] No.44013524{4}[source]▶

>>44013246 #

We had an IT guy who once bought an XML<->JSON server for $12,000. Very proud of his rack of "data appliances". It made XML like XON out of JSON and JSON that was a soup of elements attributes and ___content___, thus giving you the complexity of XML in JSON. I don't think it got used once by our dev team, and I'm pretty sure it never processed a byte of anything of value.

17. mritchie712 ◴[17 May 25 11:41 UTC] No.44013578[source]▶

>>44010358 #

LLMs are far more reliable at producing something like this:

    {
      "dimensions": [
        "users.state",
        "users.city",
        "orders.status"
      ],
      "measures": [
        "orders.count"
      ],
      "filters": [
        {
          "member": "users.state",
          "operator": "notEquals",
          "values": ["us-wa"]
        }
      ],
      "timeDimensions": [
        {
          "dimension": "orders.created_at",
          "dateRange": ["2020-01-01", "2021-01-01"]
        }
      ],
      "limit": 10
    }

than this:

    SELECT
      users.state,
      users.city,
      orders.status,
      sum(orders.count)
    FROM orders
    CROSS JOIN users
    WHERE
      users.state != 'us-wa'
      AND orders.created_at BETWEEN '2020-01-01' AND '2021-01-01'
    GROUP BY 1, 2, 3
    LIMIT 10;

replies(2): >>44013861 #>>44014931 #

18. Spivak ◴[17 May 25 12:21 UTC] No.44013758[source]▶

>>44011775 #

I find it funny people are making fun of this while every ORM builds up an object representing the query and then compiles it to SQL. SQL but as a data structure you can manipulate has thousands of implementations because it solves a real problem. This time it's because LLMs have an easier time outputting complex JSON than SQL itself.

replies(2): >>44014898 #>>44021201 #

19. christophilus ◴[17 May 25 12:24 UTC] No.44013772[source]▶

>>44010031 (TP) #

A semantic layer would be great. It should be a structured layer designed to make relational queries easy to write. We could call it “structured data language” or maybe “structured query language”.

In all seriousness, I have some complaints about SQL (I think LINQ’s reordering of it is a good idea), but there’s no need to invent another layer on order for LLMs to be able to wrangle it.

replies(1): >>44013805 #

20. cmrdporcupine ◴[17 May 25 12:31 UTC] No.44013805[source]▶

>>44013772 #

The semantic layer for database queries is (roughly) the relational algebra.

21. Agraillo ◴[17 May 25 12:39 UTC] No.44013861{3}[source]▶

>>44013578 #

The programming languages are more predictable than human. So the rules are much easier to be "compressed" after they're basically detected when fed with big data. Your two examples imho are easily interchangeable during follow-up conversation with a decent LLM. Tested this with the following prompt and fed a c fragment and an SQL-fragment, got in both cases something like your first one

> Please convert the following fragment of a programming language (auto-detect) into a json-like parsing information when language construct is represented like an object, fixed branches are represented like properties and iterative clauses (statement list for example) as array.

22. sgarland ◴[17 May 25 14:42 UTC] No.44014681{3}[source]▶

>>44011117 #

I think that honor still belongs to "Did you win the Putnam?" [0] but this is definitely still in the top 5.

[0]: https://news.ycombinator.com/item?id=35079

23. ses1984 ◴[17 May 25 15:11 UTC] No.44014898{3}[source]▶

>>44013758 #

Any idea why that is?

24. sgarland ◴[17 May 25 15:16 UTC] No.44014931{3}[source]▶

>>44013578 #

This doesn't make sense.

From a schema standpoint, table `orders` presumably has a row per order, with columns like `user_id`, `status` (as you stated), `created_at` (same), etc. Why would there be a `count` column? What does that represent?

From a query standpoint, I'm not sure what this would accomplish. You want the cartesian product of `users` and `orders`, filtered to all states except Washington, and where the order was created in 2020? The only reason I can think of to use a CROSS JOIN would be if there is no logical link between the tables, but that doesn't make any sense for this, because users:orders should be a 1:M relationship. Orders don't place themselves.

I think what you might have meant would be:

    SELECT
      users.state,
      users.city,
      orders.status,
      COUNT(*)
    FROM users
    JOIN orders ON user.id = orders.user_id
    WHERE
      users.state != 'us-wa' AND
      orders.created_at BETWEEN '2020-01-01' AND '2021-01-01'
    GROUP BY 1, 2, 3
    LIMIT 10;

Though without an ORDER BY, this has no significant meaning, and is a random sampling at best.

Also, if you or anyone else is creating a schema like this, _please_ don't make this denormalized mess. `orders.status` is going to be extremely low cardinality, as is `users.state` (to a lesser extent), and `users.city` (to an even lesser extent, but still). Make separate lookup tables for `city` and/or `state` (you don't even need to worry about pre-populating these, you can use GeoNames[0]). For `status`, you could do the same, or turn them into native ENUM [1] if you'd like to save a lookup.

[0]: https://www.geonames.org

[1]: https://www.postgresql.org/docs/current/datatype-enum.html

25. tclancy ◴[18 May 25 00:10 UTC] No.44017959{3}[source]▶

>>44012372 #

Was on a phone. Also, there’s more invalid than that. Plus I am lazy.

26. ljm ◴[18 May 25 13:16 UTC] No.44021201{3}[source]▶

>>44013758 #

Something still has to convert the JSON to SQL but in this case, who is writing the JSON? An LLM?

Sometimes, it's easier or more efficient to just learn the shit you're working with instead of spending 1000x the compute fobbing it off to OpenAI. Even just putting a blob of SQL in a heredoc and using a prepared statement to parameterise it is good enough.

Beyond that, query building is just part of the functionality an ORM provides. The main chunk of it is the mapping of DB-level data structures to app-level models and vice-versa, particularly in an OOP context.

27. ljm ◴[18 May 25 13:22 UTC] No.44021252{4}[source]▶

>>44011364 #

    SELECT email FROM users WHERE deleted_at IS NOT NULL OR status = 'active'

seems more semantic to me at first glance than piping this into a JSON->SQL library

    {
      "_select": "email",
      "_table": "users",
      "_where": { 
        "deleted_at": { "_is": { "_not": SQL_NULL_VALUE } },
        "_or": [
          { "status": "inactive" },
        ]
      }
    }

which is usually how these things end up looking.