
343 points | sillysaurusx | 1 comment
swyx
thanks for doing this, honestly your writeup seems more valuable than the model weights lol

> But for what it's worth, my personal opinion is that LLaMA probably isn't OpenAI-grade -- there's a big difference between training a model in an academic setting vs when your entire company depends on it for wide-scale commercial success. I wasn't impressed that 30B didn't seem to know who Captain Picard was.

I'm new to benchmarking shenanigans, but how was Facebook able to claim that LLaMA matched GPT-3 performance on presumably standard LLM benchmarks? Is there a good survey paper or blog post on how to think about known deficiencies in benchmarks?

sillysaurusx
Because loss != quality. This was one of the most counterintuitive discoveries in ML for me. People treat the two as interchangeable, and to a certain extent — a controlled extent — they are.

But if your dataset doesn't include a word about Captain Picard, no amount of training will teach the model about the USS Enterprise. Yet your loss will still reach that magical 2.1 value given enough training. (2.1 is pretty much "excellent" quality; below that usually means you're overfitting and need a bigger dataset.)
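
To make the "loss is just an average over the corpus" point concrete, here's a minimal sketch (PyTorch assumed; the vocab size, shapes, and random tensors are made up for illustration). Cross-entropy is computed only against tokens that actually appear in your data, so nothing forces the model to learn about entities the corpus never mentions:

    import torch
    import torch.nn.functional as F

    vocab_size = 32000                                 # hypothetical tokenizer size
    logits = torch.randn(4, 128, vocab_size)           # stand-in model outputs (pre-softmax)
    targets = torch.randint(0, vocab_size, (4, 128))   # the actual next tokens from the corpus

    # The loss everyone quotes is just this per-token average over the corpus.
    # If "Picard" never appears in the targets, the number can still look great.
    loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
    print(loss.item())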

Thanks for the comment, friendo. I wasn't sure if this would get any attention at all, but that made it worth it. DM me on Twitter if you'd like to chat about anything ML-related: basic questions are one of my favorite things to help with, so feel free.

nl
This isn't really correct.

Loss is a training-time measurement of performance on the training objective.

The training objective is rarely the same as the end-user task being benchmarked.

For example, language models are classically trained on next-token prediction. The closest benchmark for that is perplexity [1], often reported on the WikiText-103 dataset.
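
As a rough sketch of how the two connect (Hugging Face transformers + PyTorch assumed; "gpt2" and the sample sentence are placeholders, not the models or datasets discussed above), perplexity is just the exponential of the average next-token cross-entropy loss:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    ids = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt").input_ids

    with torch.no_grad():
        out = model(ids, labels=ids)   # labels=input_ids gives the shifted next-token loss

    # Perplexity is exp of the average next-token cross-entropy.
    print("loss:", out.loss.item(), "perplexity:", torch.exp(out.loss).item())

A real WikiText-103 number would average this over the whole test set (typically with a sliding window), but the loss-to-perplexity relationship is the same.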

Until around 2019 perplexity was commonly reported, but since then most large language model papers have moved on to more useful benchmarks, such as question-answering or embedding performance.

Unfortunately there aren't great benchmarks (yet?) for generative tasks. Quality is hard to measure systematically here (see, e.g., the issues with BLEU scores on summarization benchmarks).

[1] https://en.wikipedia.org/wiki/Perplexity