
Getting AI to write good SQL

(cloud.google.com)
476 points richards | 3 comments
mritchie712 (No.44010031)
the short answer: use a semantic layer.

It's the cleanest way to give the right context and the best place to pull a human in the loop.

A human can validate and create all the important metrics (e.g. what does "monthly active users" really mean?), then an LLM can use that metric definition whenever it's asked about MAU.
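For example, here's a hedged sketch of how such a metric might be declared in a Cube-style YAML data model (the field names follow Cube's conventions, but the `last_seen_at` column and this particular MAU definition are illustrative, not from the thread):

```yaml
cubes:
  - name: users
    sql_table: users

    measures:
      # One agreed-upon, human-validated definition of "monthly active users":
      # distinct users seen during the current calendar month.
      - name: monthly_active_users
        type: count_distinct
        sql: id
        filters:
          - sql: "{CUBE}.last_seen_at >= date_trunc('month', CURRENT_DATE)"
```

Once this lives in the semantic layer, the LLM never re-derives what "active" means; it just references `users.monthly_active_users`.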

With a semantic layer, you get the added benefit of writing queries in JSON instead of raw SQL. LLMs are much more consistent at producing a small JSON query than hundreds of lines of SQL.
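A related benefit: a JSON query can be validated against the human-approved member list before it ever touches the warehouse. A minimal sketch (the member names and whitelist are hypothetical, not Cube's API):

```python
import json

# Hypothetical whitelist of members a human has defined in the semantic layer.
ALLOWED_MEASURES = {"orders.count", "users.monthly_active_users"}
ALLOWED_DIMENSIONS = {"users.state", "users.city", "orders.status"}

def validate(query_json: str) -> dict:
    """Reject any LLM-generated query that references an unknown member."""
    q = json.loads(query_json)
    for m in q.get("measures", []):
        if m not in ALLOWED_MEASURES:
            raise ValueError(f"unknown measure: {m}")
    for d in q.get("dimensions", []):
        if d not in ALLOWED_DIMENSIONS:
            raise ValueError(f"unknown dimension: {d}")
    return q

q = validate('{"measures": ["orders.count"], "dimensions": ["users.state"]}')
```

With raw SQL you'd need a full parser to get the same guarantee; with JSON it's a set lookup.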

We[0] use Cube[1] for this. It's the best open source semantic layer, but there are a couple of closed-source options too.

My last company wrote a post on this back in 2021[2]. It looks like the acquirer stopped paying for the blog hosting, but the HN post is still up.

0 - https://www.definite.app/

1 - https://cube.dev/

2 - https://news.ycombinator.com/item?id=25930190

ljm (No.44010358)
> you get the added benefit of writing queries in JSON instead of raw SQL.

I’m sorry, I can’t. The tail is wagging the dog.

dang, can you delete my account and scrub my history? I’m serious.

mritchie712 (No.44013578)
LLMs are far more reliable at producing something like this:

    {
      "dimensions": [
        "users.state",
        "users.city",
        "orders.status"
      ],
      "measures": [
        "orders.count"
      ],
      "filters": [
        {
          "member": "users.state",
          "operator": "notEquals",
          "values": ["us-wa"]
        }
      ],
      "timeDimensions": [
        {
          "dimension": "orders.created_at",
          "dateRange": ["2020-01-01", "2021-01-01"]
        }
      ],
      "limit": 10
    }

than this:

    SELECT
      users.state,
      users.city,
      orders.status,
      sum(orders.count)
    FROM orders
    CROSS JOIN users
    WHERE
      users.state != 'us-wa'
      AND orders.created_at BETWEEN '2020-01-01' AND '2021-01-01'
    GROUP BY 1, 2, 3
    LIMIT 10;
Agraillo (No.44013861)
Programming languages are more predictable than humans, so their rules are much easier to "compress" once they've been detected from big training data. Your two examples are imho easily interchangeable in a follow-up conversation with a decent LLM. I tested this with the following prompt, feeding it a C fragment and an SQL fragment, and in both cases got something like your first example:

> Please convert the following fragment of a programming language (auto-detect) into a json-like parsing information when language construct is represented like an object, fixed branches are represented like properties and iterative clauses (statement list for example) as array.
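The underlying idea (constructs become objects, fixed branches become properties, statement lists become arrays) is easy to reproduce locally without an LLM. A rough, hypothetical equivalent using Python's `ast` module on a Python fragment:

```python
import ast
import json

def to_json(node):
    """Map an AST node to the object/property/array shape described above."""
    if isinstance(node, ast.AST):
        # construct -> object; its fixed branches -> properties
        return {type(node).__name__: {f: to_json(getattr(node, f))
                                      for f in node._fields}}
    if isinstance(node, list):
        # statement lists and similar iterative clauses -> arrays
        return [to_json(item) for item in node]
    return node  # leaves: identifiers, constants, None

tree = ast.parse("total = price * qty")
print(json.dumps(to_json(tree), indent=2))
```

An LLM doing this conversion is essentially approximating what a deterministic parser already does, which is why the two representations interconvert so reliably.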

sgarland (No.44014931)
This doesn't make sense.

From a schema standpoint, table `orders` presumably has a row per order, with columns like `user_id`, `status` (as you stated), `created_at` (same), etc. Why would there be a `count` column? What does that represent?

From a query standpoint, I'm not sure what this would accomplish. You want the cartesian product of `users` and `orders`, filtered to all states except Washington, and where the order was created in 2020? The only reason I can think of to use a CROSS JOIN would be if there is no logical link between the tables, but that doesn't make any sense for this, because users:orders should be a 1:M relationship. Orders don't place themselves.

I think what you might have meant would be:

    SELECT
      users.state,
      users.city,
      orders.status,
      COUNT(*)
    FROM users
    JOIN orders ON users.id = orders.user_id
    WHERE
      users.state != 'us-wa' AND
      orders.created_at BETWEEN '2020-01-01' AND '2021-01-01'
    GROUP BY 1, 2, 3
    LIMIT 10;
Though without an ORDER BY, the result has no stable meaning; the LIMIT 10 returns an arbitrary sample at best.
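To make the CROSS JOIN problem concrete, here's a small sqlite sketch (table contents made up): the cartesian product pairs every user with every order, including other users' orders, while the keyed join pairs each order with its own user only.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE users  (id INTEGER PRIMARY KEY, state TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, status TEXT);
INSERT INTO users  VALUES (1, 'us-ca'), (2, 'us-ny');
INSERT INTO orders VALUES (10, 1, 'shipped'), (11, 1, 'pending'),
                          (12, 2, 'shipped');
""")

# Cartesian product: every user paired with every order (2 x 3 = 6 rows).
cross = con.execute("SELECT COUNT(*) FROM users CROSS JOIN orders").fetchone()[0]

# Keyed join: each order paired with the user who placed it (3 rows).
keyed = con.execute(
    "SELECT COUNT(*) FROM users JOIN orders ON users.id = orders.user_id"
).fetchone()[0]

print(cross, keyed)  # 6 3
```

Any aggregate run over the cartesian product would be inflated by a factor equal to the other table's row count.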

Also, if you or anyone else is creating a schema like this, _please_ don't make this denormalized mess. `orders.status` is going to be extremely low cardinality, as is `users.state` (to a lesser extent), and `users.city` (to an even lesser extent, but still). Make separate lookup tables for `city` and/or `state` (you don't even need to worry about pre-populating them; you can use GeoNames[0]). For `status`, you could do the same, or turn it into a native ENUM[1] if you'd like to save a lookup.
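A minimal sqlite sketch of the lookup-table approach (table and column names hypothetical; in Postgres a native ENUM would replace the join entirely):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Low-cardinality status strings moved into a lookup table.
CREATE TABLE order_status (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE orders (
    id        INTEGER PRIMARY KEY,
    user_id   INTEGER,
    status_id INTEGER REFERENCES order_status(id)
);
INSERT INTO order_status (name) VALUES ('pending'), ('shipped');
INSERT INTO orders (user_id, status_id) VALUES (1, 2);
""")

# Orders store a small integer; the string lives in one place.
row = con.execute("""
    SELECT order_status.name
    FROM orders
    JOIN order_status ON orders.status_id = order_status.id
""").fetchone()
print(row[0])  # shipped
```

The orders table now stores a small integer per row instead of a repeated string, and renaming a status is a one-row update.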

[0]: https://www.geonames.org

[1]: https://www.postgresql.org/docs/current/datatype-enum.html