579 points paulpauper | 19 comments

InkCanon ◴[] No.43604503[source]
The biggest story in AI was released a few weeks ago but was given little attention: on the recent USAMO, SOTA models scored 5% on average (IIRC, it was some abysmal number). This is despite them supposedly having gotten 50%, 60%, etc. performance on IMO questions. This strongly suggests AI models simply memorize past results instead of actually solving these questions. I'm incredibly surprised no one mentions this, and it's ridiculous that these companies never tell us what (if any) efforts have been made to remove test data (IMO, ICPC, etc.) from training data.
replies(18): >>43604865 #>>43604962 #>>43605147 #>>43605224 #>>43605451 #>>43606419 #>>43607255 #>>43607532 #>>43607825 #>>43608628 #>>43609068 #>>43609232 #>>43610244 #>>43610557 #>>43610890 #>>43612243 #>>43646840 #>>43658014 #
AstroBen ◴[] No.43605224[source]
This seems fairly obvious at this point. If they were actually reasoning at all, they'd be capable of playing complex games like chess (even if not well).

Instead, they're barely able to eke out wins against a bot that plays completely random moves: https://maxim-saplin.github.io/llm_chess/
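
For anyone curious what that benchmark looks like mechanically, here's a rough sketch of that kind of LLM-vs-random-mover harness using python-chess. ask_llm_for_move is a made-up stand-in, not the actual llm_chess code:

    import random
    import chess  # pip install python-chess

    def ask_llm_for_move(board: chess.Board) -> chess.Move:
        # Stand-in: a real harness prompts the model with the game so far
        # and parses its reply. Here we just return a random legal move.
        return random.choice(list(board.legal_moves))

    def play_one_game() -> str:
        board = chess.Board()
        while not board.is_game_over():
            if board.turn == chess.WHITE:
                move = ask_llm_for_move(board)  # the LLM plays White
            else:
                move = random.choice(list(board.legal_moves))  # random bot plays Black
            if move not in board.legal_moves:
                return "0-1"  # an illegal move from the LLM counts as a loss
            board.push(move)
        return board.result()  # "1-0", "0-1", or "1/2-1/2"

    print(play_one_game())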

replies(4): >>43605990 #>>43606017 #>>43606243 #>>43609237 #
1. og_kalu ◴[] No.43606017[source]
LLMs are capable of playing chess, and 3.5 turbo instruct does so quite well for a human, at around 1800 Elo. Does this mean they can truly reason now?

https://github.com/adamkarvonen/chess_gpt_eval
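
(For context, evals like that one typically hand the model the game so far as PGN movetext and ask it to complete the next move, which is probably part of why a pure completion model does so well. A rough illustration, assuming python-chess; the exact prompt in chess_gpt_eval may differ:)

    import chess  # pip install python-chess

    def build_prompt(board: chess.Board) -> str:
        # Replay the game to rebuild PGN-style movetext like "1. e4 e5 2. Nf3 Nc6".
        replay = chess.Board()
        parts = []
        for i, move in enumerate(board.move_stack):
            if i % 2 == 0:
                parts.append(f"{i // 2 + 1}.")
            parts.append(replay.san(move))
            replay.push(move)
        movetext = " ".join(parts)
        if board.turn == chess.WHITE:
            movetext += f" {board.fullmove_number}."  # cue the model for White's next move
        return movetext  # the completion model is asked to continue this string

    board = chess.Board()
    for san in ["e4", "e5", "Nf3", "Nc6"]:
        board.push_san(san)
    print(build_prompt(board))  # -> "1. e4 e5 2. Nf3 Nc6 3."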

replies(2): >>43606282 #>>43606954 #
2. hatefulmoron ◴[] No.43606282[source]
3.5 turbo instruct is a huge outlier.

https://dynomight.substack.com/p/chess

Discussion here: https://news.ycombinator.com/item?id=42138289

replies(1): >>43606905 #
3. og_kalu ◴[] No.43606905[source]
That might be overstating it, at least if you mean it to be some unreplicable feat. Small models that play at around 1200 to 1300 have been trained on the EleutherAI Discord. And there's this grandmaster-level transformer: https://arxiv.org/html/2402.04494v1

OpenAI, Anthropic, and the like simply don't care much about their LLMs playing chess. That, or post-training is messing things up.

replies(1): >>43607013 #
4. AstroBen ◴[] No.43606954[source]
My point wasn't chess-specific, or that they couldn't have specific training for it. It was a more general "here is something that LLMs clearly aren't being trained for currently, but that would also be solvable through reasoning skills".

Much in the same way that a human who has only just learnt the rules, with zero strategy, would very, very rarely lose here.

These companies are shouting that their products are passing incredibly hard exams, solving PhD-level questions, and are about to displace humans, and yet they still fail to crush a chess bot that plays only random moves? How does this make any sense?

We're on the verge of AGI, but there's not even the tiniest spark of general reasoning ability in something they haven't been trained for.

"Reasoning" or "Thinking" are marketing terms and nothing more. If an LLM is trained for chess then its performance would just come from memorization, not any kind of "reasoning"

replies(1): >>43607307 #
5. hatefulmoron ◴[] No.43607013{3}[source]
> That might be overstating it, at least if you mean it to be some unreplicable feat.

I mean, surely there's a reason you decided to mention 3.5 turbo instruct and not... 3.5 turbo? Or any other model? Even the ones that came after? It's clearly a big outlier, at least when you consider "LLMs" to be a wide selection of recent models.

If you're saying that LLMs/transformer models are capable of being trained to play chess by training on chess data, I agree with you.

I think AstroBen was pointing out that LLMs, despite having the ability to solve some very impressive mathematics and programming tasks, don't seem to generalize their reasoning abilities to a domain like chess. That's surprising, isn't it?

replies(2): >>43607265 #>>43607575 #
6. og_kalu ◴[] No.43607265{4}[source]
I mentioned it because it's the best example. One example is enough to disprove the "not capable of" nonsense. There are other examples too.

>I think AstroBen was pointing out that LLMs, despite having the ability to solve some very impressive mathematics and programming tasks, don't seem to generalize their reasoning abilities to a domain like chess. That's surprising, isn't it?

Not really. The LLMs play chess like they have no clue what the rules of the game are, not like poor reasoners. Trying to predict and failing is how they learn anything. If you want them to learn a game like chess, that's how you get them to learn it: by training them to predict chess moves. Chess books during training only teach them how to converse about chess.
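
Concretely, "training them to predict chess moves" means turning games into next-move prediction examples, something like this rough sketch (the file name and prompt/completion format are just assumptions, not anyone's actual pipeline):

    import chess.pgn  # pip install python-chess

    def games_to_examples(pgn_path: str):
        # Each example: the moves played so far (prompt) and the move that
        # was actually played next (completion).
        examples = []
        with open(pgn_path) as f:
            while (game := chess.pgn.read_game(f)) is not None:
                board = game.board()
                prefix = []
                for move in game.mainline_moves():
                    san = board.san(move)
                    examples.append({"prompt": " ".join(prefix), "completion": san})
                    prefix.append(san)
                    board.push(move)
        return examples

    # examples = games_to_examples("games.pgn")  # hypothetical file; then train on these pairs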

replies(2): >>43607365 #>>43610938 #
7. og_kalu ◴[] No.43607307[source]
>If an LLM is trained for chess then its performance would just come from memorization, not any kind of "reasoning".

If you think you can play chess at that level, over that many games and moves, with memorization, then I don't know what to tell you except that you're wrong. It's not possible, so let's just get that out of the way.

>These companies are shouting that their products are passing incredibly hard exams, solving PHD level questions, and are about to displace humans, and yet they still fail to crush a random-only strategy chess bot? How does this make any sense?

Why doesn't it? Have you actually looked at any of these games? Those LLMs aren't playing like poor reasoners. They're playing like machines that have no clue what the rules of the game are. LLMs learn by predicting and failing and getting a little better at it, repeated ad nauseam. You want them to learn the rules of a complex game? That's how you do it: by training them to predict it. Training on chess books just makes them learn how to converse about chess.

Humans have weird failure modes that are at odds with their 'intelligence'. We just choose to call them funny names and laugh about them sometimes. These machines have theirs. That's all there is to it. The top comment we are both replying to is a case in point: gemini-2.5-pro, released less than 5 days later, hit 25% on that benchmark. Now that was particularly funny.

replies(2): >>43607591 #>>43620364 #
8. hatefulmoron ◴[] No.43607365{5}[source]
> One example is enough to disprove the "not capable of" nonsense. There are other examples too.

Gotcha, fair enough. Throw in enough chess data during training and I'm sure they'd be pretty good at chess.

I don't really understand what you're trying to say in your next paragraph. LLMs surely have plenty of training data to be familiar with the rules of chess. They also purportedly have the reasoning skills to use their familiarity to connect the dots and actually play. It's trivially true that this issue can be plastered over by shoving lots of chess game training data into them, but the success of that route is not a positive reflection on their reasoning abilities.

replies(1): >>43607456 #
9. og_kalu ◴[] No.43607456{6}[source]
Gradient descent is a dumb optimizer. LLM training is not at all like a human reading a book; it's more like evolution tuning adaptations over centuries. You would not expect either process to be aware of anything it is converging towards. So having lots of books that talk about chess in the training data will predictably just produce a model that knows how to talk about chess really well. I'm not surprised they may know how to talk about the rules but play poorly.

And that post had a follow-up. Post-training messing things up could well be the issue, given the impact that even a few more examples and/or regurgitation made: https://dynomight.net/more-chess/

replies(1): >>43608267 #
10. cma ◴[] No.43607575{4}[source]
Reasoning training causes some amount of catastrophic forgetting, so it's unlikely they'd burn that budget on mixing in chess puzzles if they want a commercial product, unless it somehow transfers well to other reasoning problems people broadly care about.
11. AstroBen ◴[] No.43607591{3}[source]
> Why doesn't it?

It was surprising to me because I would have expected that, if there were reasoning ability, it would translate across domains at least somewhat. But yeah, what you say makes sense; I'm thinking of it in human terms.

replies(1): >>43607708 #
12. og_kalu ◴[] No.43607708{4}[source]
Transfer Learning during LLM training tends to be 'broader' than that.

Like how:

- Training LLMs on code makes them solve reasoning problems better.
- Training on language Y alongside language X makes them much better at Y than if they were trained on Y alone.

And so on.

Probably because, well, gradient descent is a dumb optimizer and training is more like evolution than like a human reading a book.

Also, there is something genuinely weird going on with LLM chess. And it's possible base models are better. https://dynomight.net/more-chess/

replies(1): >>43613386 #
13. tsimionescu ◴[] No.43608267{7}[source]
The whole premise on which the immense valuations of these AI companies are based is that they are learning general reasoning skills from their training on language. That is, that simply training on text is eventually going to give the AI the ability to generate language that reasons at a more or less human level in more or less any domain of knowledge.

This whole premise crashes and burns if you need task-specific training, like explicit chess training. That is because there are far too many tasks that humans need to be competent at in order to be useful in society. Even worse, the vast majority of those tasks are very hard to source training data for, unlike chess.

So, if we accept that LLMs can't learn chess unless they explicitly include chess games in the training set, then we have to accept that they can't learn, say, to sell business software unless they include business software pitches in the training set, and there are going to be FAR fewer of those than chess games.

replies(1): >>43610036 #
14. og_kalu ◴[] No.43610036{8}[source]
>The whole premise on which the immense valuations of these AI companies is based on is that they are learning general reasoning skills from their training on language.

And they do, just not always in the ways we expect.

>This whole premise crashes and burns if you need task-specific training, like explicit chess training.

Everyone needs task-specific training. Any human good enough at chess, or at anything else, to make it a profession needs it. So I have no idea why people would expect any less of a machine.

>then we have to accept that they can't learn, say, to sell business software unless they include business software pitches in the training set, and there are going to be FAR fewer of those than chess games.

Yeah, so? How many business pitches they need in the training set has no correlation with chess. I don't see any reason to believe what is already present isn't enough. There's enough chess data on the internet to teach them chess too; it's just a matter of how much OpenAI cares about it.

replies(1): >>43618684 #
15. throwaway173738 ◴[] No.43610938{5}[source]
The issue isn’t whether they can be trained to play. The issue is whether, after a careful reading of the rules, they can infer how to play. The latter is something a human child could do, but it is completely beyond an LLM.
16. AstroBen ◴[] No.43613386{5}[source]
It seems to be fairly nuanced in how abilities transfer: https://arxiv.org/html/2310.16937v2

It's very hard for me to wrap my head around the idea that an LLM being able to discuss, and even perhaps teach, high-level chess strategy wouldn't transfer at all to its playing performance.

17. tsimionescu ◴[] No.43618684{9}[source]
Chess is a very simple game, and having basic general reasoning skills is more than enough to learn how to play it. It's not some advanced mathematics or complicated human interaction - it's a game with 30 or so fixed rules. And chess manuals have numerous examples of actual chess games; it's not like they are pure text talking about the game.

So, the fact that LLMs can't learn this simple game, despite their training sets probably including all of the books ever written on it, tells us something about their general reasoning skills.

replies(1): >>43620393 #
18. pdimitar ◴[] No.43620364{3}[source]
> Humans have weird failure modes that are at odds with their 'intelligence'. We just choose to call them funny names and laugh about them sometimes. These machines have theirs. That's all there is to it.

Yes, that's all there is to it and it's not enough. I ain't paying for another defective organism that makes mistakes in entirely novel ways. At least with humans you know how to guide them back on course.

If that's the peak of "AI" evolution today, I am not impressed.

19. pdimitar ◴[] No.43620393{10}[source]
As in: they do not have general reasoning skills.