
Getting AI to write good SQL

(cloud.google.com)
476 points richards | 2 comments
1. navaed01 No.44015170
“Metrics: We combine user metrics and offline eval metrics, and employ both human and automated evaluation, particularly using LLM-as-a-judge techniques”.

I’m curious to know what people are doing to measure whether the customer got what they were looking for. Thumbs up/down seems insufficient to me.

The LLM's ability to perform depends entirely on having good knowledge of what is going to be asked and how, which is more complex than it sounds.

What techniques are people having success with?

replies(1): >>44015419
2. edmundsauto No.44015419
Training a 2nd agent as a qualitative evaluator ("LLM-as-a-judge") works pretty well. You train it on labeled critiques from experts, iterate a few times, then point it at your ground-truth human-labeled data (the "golden dataset"). The quantitative output metric is human2ai alignment on the golden dataset; mix that with some expert judgment of the critiques the AI produces.
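
Concretely, the quantitative side can be as simple as an agreement score between the judge's labels and the human labels on the golden dataset. A minimal sketch in Python, assuming a hypothetical judge callable and a list-of-dicts dataset format (neither comes from the comment above):

    # Minimal sketch: score an LLM judge against a human-labeled
    # "golden dataset". `judge` is any callable mapping an example
    # to a label such as "pass"/"fail" (hypothetical interface).

    def alignment(golden, judge):
        # Fraction of examples where the judge matches the human label.
        agree = sum(1 for ex in golden if judge(ex) == ex["human_label"])
        return agree / len(golden)

    golden = [
        {"question": "monthly revenue by region",
         "sql": "SELECT region, SUM(amount) ...",
         "human_label": "pass"},
        # ... more expert-labeled examples
    ]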

Works pretty well for me; you can typically get within the range of human2human variance.
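
The human2human band can be estimated the same way: pairwise agreement between two expert raters on the same examples, with the judge counted as "in range" when its agreement with a rater is comparable. A sketch under that assumption (the rater names and the epsilon tolerance are mine, not the commenter's):

    # Pairwise agreement between two human raters sets the
    # human2human baseline the judge should land inside.

    def pairwise_agreement(labels_a, labels_b):
        assert len(labels_a) == len(labels_b)
        return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

    # The judge is "within human variance" when, roughly:
    #   pairwise_agreement(judge_labels, rater1_labels)
    #       >= pairwise_agreement(rater1_labels, rater2_labels) - epsilon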