“Metrics: We combine user metrics and offline eval metrics, and employ both human and automated evaluation, particularly using LLM-as-a-judge techniques”.
I’m curious to know what people are doing to measure whether the customer got what they were looking for. Thumbs up/down seems insufficient to me.
How well the LLM performs depends entirely on knowing in advance what is going to get asked and how, which is more complex than it sounds.
What techniques are people having success with?
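For anyone wanting something concrete to react to, here is a minimal sketch of what "combining user metrics and automated evaluation" from the quote could look like. Everything in it is an assumption for illustration: `Interaction`, `summarize`, and `keyword_overlap_judge` are made-up names, and the judge is a trivial word-overlap stand-in where a real setup would call an LLM with a grading prompt. The point is only that sparse thumbs feedback gets reported next to a score that covers every interaction.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Interaction:
    question: str
    answer: str
    thumbs: Optional[bool] = None  # None when the user left no feedback


# Hypothetical judge signature: (question, answer) -> score in [0, 1].
Judge = Callable[[str, str], float]


def keyword_overlap_judge(question: str, answer: str) -> float:
    """Placeholder for an LLM judge: share of question words echoed in the answer."""
    q_words = set(question.lower().split())
    a_words = set(answer.lower().split())
    return len(q_words & a_words) / len(q_words) if q_words else 0.0


def summarize(interactions: list[Interaction], judge: Judge) -> dict:
    """Report explicit user feedback alongside an automated score for every interaction."""
    rated = [i for i in interactions if i.thumbs is not None]
    scores = [judge(i.question, i.answer) for i in interactions]
    total = len(interactions)
    return {
        # How much of the traffic has any thumbs signal at all.
        "feedback_coverage": len(rated) / total if total else 0.0,
        # Thumbs-up rate among the interactions that were rated.
        "thumbs_up_rate": sum(i.thumbs for i in rated) / len(rated) if rated else None,
        # Automated score, available even where users stayed silent.
        "mean_judge_score": sum(scores) / total if total else None,
    }


logs = [
    Interaction("reset password", "To reset your password open settings", True),
    Interaction("refund status", "Please contact support", None),
]
report = summarize(logs, keyword_overlap_judge)
```

Swapping `keyword_overlap_judge` for a function that prompts a model keeps the rest unchanged, and comparing `feedback_coverage` with the judge score shows how much of the picture thumbs alone would miss.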
replies(1):