Grok 3: Another win for the bitter lesson

(www.thealgorithmicbridge.com)
129 points kiyanwang | source
bambax ◴[] No.43112611[source]
This article is weak and just general speculation.

Many people doubt the actual performance of Grok 3 and suspect it has been specifically trained on the benchmarks. And Sabine Hossenfelder says this:

> Asked Grok 3 to explain Bell's theorem. It gets it wrong just like all other LLMs I have asked because it just repeats confused stuff that has been written elsewhere rather than looking at the actual theorem.

https://x.com/skdh/status/1892432032644354192

Which suggests that "massive scaling", even enormous, gigantic scaling, doesn't improve intelligence one bit; it may improve scope, flexibility, or coverage, but not "intelligence".

replies(7): >>43112886 #>>43112908 #>>43113270 #>>43113312 #>>43113843 #>>43114290 #>>43115189 #
ttoinou ◴[] No.43112908[source]

   Many people doubt the actual performance of Grok 3 and suspect it has been specifically trained on the benchmarks
That's something I've always wondered about; Goodhart's law so obviously applies to each new AI release. Even the fact that writers and journalists don't mention that possibility makes me instantly skeptical about the quality of the article I'm reading.
replies(1): >>43113035 #
NitpickLawyer ◴[] No.43113035[source]
> Many people doubt the actual performance of Grok 3 and suspect it has been specifically trained on the benchmarks

2 anecdotes here:

- just before Grok 2 was released, they put it on the LMSYS Arena under a pseudonym. If you read the threads (Reddit, X, etc.) when that hit, everyone was raving about the model. People were saying it's the next 4o, that it's so good, hyped, and so on. Then it launched, they revealed the pseudonym, and everyone started shitting on it. There is a lot of bias in this area, especially with anything touching bad spaceman, so take "many people doubt" with a huge grain of salt. People be salty.

- there are benchmarks that seem to correlate very well with end-to-end results on a variety of tasks. LiveBench is one of them. Models scoring highly there have proven to perform well on general tasks, and don't feel like they cheated. This is supported by the paper that found models like Phi and Qwen losing ~10-20% of their benchmark scores when checked against newly built, unseen but similar tasks. Models scoring strongly on LiveBench didn't see that big of a gap (a rough sketch of that kind of gap check is below).
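
To make that gap concrete, here's a minimal sketch of the comparison: score the same model once on the published benchmark and once on freshly written, similar items, then look at the difference. Everything here is a placeholder (the score function, model handle, and task sets), not any particular eval harness.

    # Rough sketch of a contamination/overfitting check; all names are placeholders.
    # score(model, tasks) is assumed to return accuracy in [0, 1].

    def contamination_gap(score, model, original_tasks, fresh_tasks):
        on_original = score(model, original_tasks)  # published items the model may have seen in training
        on_fresh = score(model, fresh_tasks)        # newly written, stylistically similar items
        return on_original - on_fresh               # a large positive gap hints at benchmark leakage

    # a gap of 0.10-0.20 corresponds to the ~10-20% drop mentioned above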

replies(3): >>43113712 #>>43113936 #>>43116292 #
staticman2 ◴[] No.43116292[source]
I found that Arena had a 2000-token limit on inputs.

I think it even silently discards the input without telling you. Nobody is fitting serious work tasks into 2000 tokens on Arena.
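
If you want to sanity-check that yourself, here's a rough sketch of a client-side length check; tiktoken's cl100k_base is just a stand-in, since whatever tokenizer Arena actually uses isn't documented here.

    # Rough sketch: check a prompt against a ~2000-token budget before pasting it in.
    # cl100k_base is an assumption; Arena's real tokenizer/truncation behaviour may differ.
    import tiktoken

    def fits_in_budget(prompt: str, budget: int = 2000) -> bool:
        enc = tiktoken.get_encoding("cl100k_base")
        return len(enc.encode(prompt)) <= budget

    # print(fits_in_budget(open("real_work_task.txt").read()))  # likely False for real work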

The lesson you should have learned is that Arena is a dumb metric, not that people have unfounded biases against Grok 2. (Which I've used on Perplexity and found to be unimpressive.)

The other thing is that dumb, low-quality statements are all over Reddit and Twitter about any "hype" topic, including mysterious new models on Arena. So it isn't surprising you encountered that for Grok 2, but you could have said the same thing about Gemini models.

If Reddit can be believed, WizardLM 2 was so much better than OpenAI's models that Microsoft had to cancel it so OpenAI wouldn't be driven out of business.

People say all sorts of dumb stuff on social media.