
Grok 3: Another win for the bitter lesson

(www.thealgorithmicbridge.com)
129 points by kiyanwang | 1 comment | source
bambax ◴[] No.43112611[source]
This article is weak and just general speculation.

Many people doubt the actual performance of Grok 3 and suspect it has been specifically trained on the benchmarks. And Sabine Hossenfelder says this:

> Asked Grok 3 to explain Bell's theorem. It gets it wrong just like all other LLMs I have asked because it just repeats confused stuff that has been written elsewhere rather than looking at the actual theorem.

https://x.com/skdh/status/1892432032644354192
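
For reference, a compact statement of what the theorem actually says (its CHSH form): any local hidden-variable theory must satisfy

    |E(a,b) − E(a,b′) + E(a′,b) + E(a′,b′)| ≤ 2

where E(x,y) is the correlation between measurement outcomes at detector settings x and y; quantum mechanics predicts, and experiment confirms, values up to 2√2.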

This suggests that "massive scaling", even enormous, gigantic scaling, doesn't improve intelligence one bit; it may improve scope, or flexibility, or coverage, but not "intelligence".

replies(7): >>43112886 #>>43112908 #>>43113270 #>>43113312 #>>43113843 #>>43114290 #>>43115189 #
melodyogonna ◴[] No.43112886[source]
How can it be specifically trained on benchmarks when it is leading on blind chatbot tests?

The post you quoted doesn't describe a Grok-specific problem if other LLMs are also failing; it seems to me to be a fundamental failure in the current approach to AI model development.

replies(2): >>43113802 #>>43115538 #
1. nycdatasci ◴[] No.43115538[source]
I think a more plausible path to gaming benchmarks would be to watermark your model's text output so you can identify it in blind tests, then unleash bots to consistently rank your model over opponents.
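
A minimal sketch of the detection side of such a scheme, assuming a Kirchenbauer-style "green list" watermark (purely illustrative: the hashing scheme and function names here are invented, and nothing suggests any lab actually does this):

    import hashlib

    # Toy "green-list" watermark detector (hypothetical, illustration only).
    # Generation side (not shown): at each step the previous token seeds a
    # hash that splits the vocabulary in half, and sampling is biased toward
    # the "green" half.
    # Detection side: count how often each token lands in the green half
    # seeded by its predecessor; watermarked text scores well above ~50%.

    def is_green(prev_token: str, token: str) -> bool:
        """Deterministically assign `token` to a green half, seeded by `prev_token`."""
        digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
        return digest[0] % 2 == 0  # by chance, roughly half of all pairs are green

    def green_fraction(tokens: list[str]) -> float:
        """Fraction of tokens falling in their predecessor-seeded green list."""
        hits = sum(is_green(a, b) for a, b in zip(tokens, tokens[1:]))
        return hits / max(len(tokens) - 1, 1)

    # Unwatermarked text hovers near 0.5; a biased sampler pushes its own
    # text far higher, letting a bot recognize its model's output in a
    # blind arena.
    sample = "the quick brown fox jumps over the lazy dog".split()
    print(f"green fraction: {green_fraction(sample):.2f}")

Unwatermarked English should score near 0.5 here, while a generator biased toward the green list would push its own text toward 1.0, giving a voting bot a cheap self-identification signal in a blind comparison.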