
579 points paulpauper | 1 comment
InkCanon ◴[] No.43604503[source]
The biggest story in AI was released a few weeks ago but was given little attention: on the recent USAMO, SOTA models scored on average 5% (IIRC, it was some abysmal number). This is despite them supposedly having gotten 50%, 60%, etc. performance on IMO questions. This strongly suggests AI models simply memorize past results instead of actually solving these questions. I'm incredibly surprised no one mentions this, and it's ridiculous that these companies never tell us what (if any) efforts have been made to remove test data (IMO, ICPC, etc.) from training data.
replies(18): >>43604865 #>>43604962 #>>43605147 #>>43605224 #>>43605451 #>>43606419 #>>43607255 #>>43607532 #>>43607825 #>>43608628 #>>43609068 #>>43609232 #>>43610244 #>>43610557 #>>43610890 #>>43612243 #>>43646840 #>>43658014 #
AstroBen ◴[] No.43605224[source]
This seems fairly obvious at this point. If they were actually reasoning at all, they'd be capable of playing (even if not well) complex games like chess.

Instead they're barely able to eke out wins against a bot that plays completely random moves: https://maxim-saplin.github.io/llm_chess/

replies(4): >>43605990 #>>43606017 #>>43606243 #>>43609237 #
og_kalu ◴[] No.43606017[source]
LLMs are capable of playing chess, and 3.5 turbo instruct does so quite well (for a human) at 1800 Elo. Does this mean they can truly reason now?

https://github.com/adamkarvonen/chess_gpt_eval

replies(2): >>43606282 #>>43606954 #
hatefulmoron ◴[] No.43606282[source]
3.5 turbo instruct is a huge outlier.

https://dynomight.substack.com/p/chess

Discussion here: https://news.ycombinator.com/item?id=42138289

replies(1): >>43606905 #
og_kalu ◴[] No.43606905[source]
That might be overstating it, at least if you mean it to be some unreplicable feat. Small models that play at around 1200 to 1300 have been trained on the EleutherAI Discord. And there's this grandmaster-level transformer: https://arxiv.org/html/2402.04494v1

OpenAI, Anthropic and the like simply don't care much about their LLMs playing chess. That, or post-training is messing things up.

replies(1): >>43607013 #
hatefulmoron ◴[] No.43607013[source]
> That might be overstating it, at least if you mean it to be some unreplicable feat.

I mean, surely there's a reason you decided to mention 3.5 turbo instruct and not.. 3.5 turbo? Or any other model? Even the ones that came after? It's clearly a big outlier, at least when you consider "LLMs" to be a wide selection of recent models.

If you're saying that LLMs/transformer models are capable of being trained to play chess by training on chess data, I agree with you.

I think AstroBen was pointing out that LLMs, despite having the ability to solve some very impressive mathematics and programming tasks, don't seem to generalize their reasoning abilities to a domain like chess. That's surprising, isn't it?

replies(2): >>43607265 #>>43607575 #
og_kalu ◴[] No.43607265[source]
I mentioned it because it's the best example. One example is enough to disprove the "not capable of". There are other examples too.

>I think AstroBen was pointing out that LLMs, despite having the ability to solve some very impressive mathematics and programming tasks, don't seem to generalize their reasoning abilities to a domain like chess. That's surprising, isn't it?

Not really. The LLMs play chess like they have no clue what the rules of the game are, not like poor reasoners. Trying to predict and failing is how they learn anything. If you want them to learn a game like chess, then that's how you get them to learn it: by trying to predict chess moves. Chess books in the training data only teach them how to converse about chess.

replies(2): >>43607365 #>>43610938 #
1. throwaway173738 ◴[] No.43610938[source]
The issue isn’t whether they can be trained to play. The issue is whether, after making a careful reading of the rules, they can infer how to play. The latter is something a human child could do, but it is completely beyond an LLM.