
343 points | sillysaurusx | 2 comments
swyx No.35027808
thanks for doing this, honestly your writeup seems more valuable than the model weights lol

> But for what it's worth, my personal opinion is that LLaMA probably isn't OpenAI-grade -- there's a big difference between training a model in an academic setting vs when your entire company depends on it for wide-scale commercial success. I wasn't impressed that 30B didn't seem to know who Captain Picard was.

I'm new to benchmarking shenanigans, but how is it that Facebook was able to proclaim that it matched GPT-3 performance on presumably standard LLM benchmarks? Is there a good survey paper or blog post on how to think about known deficiencies in benchmarks?

replies(3): >>35027861, >>35028412, >>35028468
nl No.35028468
Because there are many benchmarks that measure different things.

You need to look at the benchmark that reflects your specific interest.

So in this case ("I wasn't impressed that 30B didn't seem to know who Captain Picard was") the closest relevant benchmark they report is MMLU (Massive Multitask Language Understanding) [1].

In the LLaMA paper they report a figure of 63.4% in the 5-shot average setting without fine-tuning on the 65B model, and 68.9% after fine-tuning. This is significantly better than the original GPT-3 (43.9% under the same conditions), but as they note:

> "[it is] still far from the state-of-the-art, that is 77.4 for GPT code-davinci-002 on MMLU (numbers taken from Iyer et al. (2022))"

InstructGPT [2] (which OpenAI points to as the most relevant ChatGPT publication) doesn't report MMLU performance.

[1] https://github.com/hendrycks/test

[2] https://arxiv.org/abs/2203.02155

replies(1): >>35028544
1. JonathanFly No.35028544
The capability of a language model I care about most is probably its ability to represent or simulate Captain Picard: being good at creative tasks generally, but also at Captain Picard specifically. Is OpenAI deliberately doing something different that makes their models better for this, or is it just that OpenAI has a lot more copyrighted data in their dataset? Skimming the MMLU section of the Facebook paper just now, the latter seems to be what the Facebook folks think:

"A potential explanation is that we have used a limited amount of books and academic papers in our pre-training data, i.e., ArXiv, Gutenberg and Books3, that sums up to only 177GB, while these models were trained on up to 2TB of books. This large quantity of books used by Gopher, Chinchilla and PaLM may also explain why Gopher outperforms GPT-3 on this benchmark, while it is comparable on other benchmarks."

replies(1): >>35028663
2. nl No.35028663
It's unclear exactly why it doesn't work as well for you.

I have two comments that may be useful:

1) It's very unclear how good the generative capabilities of LLaMA are in general. It benchmarks well for code generation, but for English prose there aren't really any good benchmarks around. There's a good chance the larger models perform much better, since generative ability seems to be partially emergent.

2) If you just want to "make it work", I'd suggest downloading all the Star Trek scripts you can find that include Captain Picard and fine-tuning LLaMA on them. It's unclear how well this will work, but it's probably about as good as you can get.
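The rough shape of that fine-tune, as a sketch rather than a recipe (the model path, the scripts folder, and the hyperparameters are all placeholders, and it assumes you already have LLaMA weights converted to Hugging Face format; at 7B+ you'd realistically want LoRA/8-bit tricks or multiple GPUs rather than a plain full fine-tune):

```python
# Hedged sketch of fine-tuning a causal LM on raw script text with
# Hugging Face transformers + datasets. Paths and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

MODEL_PATH = "/path/to/llama-7b-hf"        # placeholder: your converted checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA has no pad token by default
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)

# Treat every script as raw text and tokenize it into fixed-length chunks.
raw = load_dataset("text", data_files={"train": "star_trek_scripts/*.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llama-picard",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,  # keep memory manageable on one GPU
        num_train_epochs=3,
        learning_rate=2e-5,
        fp16=True,
    ),
    train_dataset=train,
    # mlm=False gives standard next-token (causal LM) training labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```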

If you care about this deeply, it's probably also worth trying the same thing with some of the other open GPT-3-style models (GPT-J, GPT-NeoX, etc.).