
343 points by sillysaurusx | 1 comment
swyx No.35027808
thanks for doing this, honestly your writeup seems more valuable than the model weights lol

> But for what it's worth, my personal opinion is that LLaMA probably isn't OpenAI-grade -- there's a big difference between training a model in an academic setting vs when your entire company depends on it for wide-scale commercial success. I wasn't impressed that 30B didn't seem to know who Captain Picard was.

I'm new to benchmarking shenanigans, but how is it that Facebook was able to proclaim that it matched GPT-3 performance on presumably standard LLM benchmarks? Is there a good survey paper or blog post on how to think about known deficiencies in benchmarks?

rnosov No.35028412
You can read the original LLaMA paper, which is pretty accessible [1]. For example, they claim to outperform GPT-3 on the HellaSwag benchmark (sentence completion); you can find examples of the unfinished sentences in the HellaSwag paper [2] on page 13. Unfortunately for LLaMA, most people will probably just be asking it questions about Captain Picard and so on, and on that kind of benchmark LLaMA significantly underperforms the OpenAI models (that's from their own paper).

[1] https://research.facebook.com/file/1574548786327032/LLaMA--O...

[2] https://arxiv.org/pdf/1905.07830.pdf
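To make the HellaSwag setup concrete: each item gives a context plus four candidate endings, and the model gets credit if the ending it rates most likely is the human-labelled one. Below is a rough sketch of one way to do that multiple-choice scoring with Hugging Face transformers; the model name (gpt2) and the example item are stand-ins for illustration, not the actual LLaMA or HellaSwag evaluation code.

    # Sketch of HellaSwag-style multiple-choice scoring: pick the ending
    # the model assigns the highest average log-likelihood, conditioned
    # on the context. Model name and example item are placeholders.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    context = "A man is sitting on a roof. He"
    endings = [
        " is using wrap to wrap a pair of skis.",
        " is ripping level tiles off.",
        " is holding a rubik's cube.",
        " starts pulling up roofing on a roof.",
    ]

    def avg_logprob(context: str, ending: str) -> float:
        # Score only the ending tokens, conditioned on the context.
        ctx_len = tokenizer(context, return_tensors="pt").input_ids.size(1)
        full_ids = tokenizer(context + ending, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(full_ids).logits
        # logits[t] predicts token t+1, so shift by one and keep ending positions.
        log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
        targets = full_ids[0, 1:]
        token_lp = log_probs[torch.arange(targets.size(0)), targets]
        return token_lp[ctx_len - 1:].mean().item()

    scores = [avg_logprob(context, e) for e in endings]
    best = max(range(len(endings)), key=scores.__getitem__)
    print("predicted ending:", endings[best])

The model never has to answer an open question here; it only has to rank the provided endings, which is one reason a model can look strong on HellaSwag while still fumbling a direct "who is Captain Picard" prompt.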

yunyu No.35030146
HellaSwag is also a deeply flawed benchmark, so I wouldn't read too much into it: https://www.surgehq.ai/blog/hellaswag-or-hellabad-36-of-this...