Recent AI model progress feels mostly like bullshit

(www.lesswrong.com)

1. InkCanon ◴[06 Apr 25 20:03 UTC] No.43604503[source]▶

>>43603453 (OP) #

The biggest story in AI was released a few weeks ago but was given little attention: on the recent USAMO, SOTA models scored on average 5% (IIRC, it was some abysmal number). This is despite them supposedly having gotten 50%, 60% etc performance on IMO questions. This massively suggests AI models simply remember the past results, instead of actually solving these questions. I'm incredibly surprised no one mentions this, but it's ridiculous that these companies never tell us what (if any) efforts have been made to remove test data (IMO, ICPC, etc) from train data.

replies(18): >>43604865 #>>43604962 #>>43605147 #>>43605224 #>>43605451 #>>43606419 #>>43607255 #>>43607532 #>>43607825 #>>43608628 #>>43609068 #>>43609232 #>>43610244 #>>43610557 #>>43610890 #>>43612243 #>>43646840 #>>43658014 #

2. AIPedant ◴[06 Apr 25 20:52 UTC] No.43604865[source]▶

>>43604503 (TP) #

Yes, here's the link: https://arxiv.org/abs/2503.21934v1

Anecdotally, I've been playing around with o3-mini on undergraduate math questions: it is much better at "plug-and-chug" proofs than GPT-4, but those problems aren't independently interesting, they are explicitly pedagogical. For anything requiring insight, it's either:

1) A very good answer that reveals the LLM has seen the problem before (e.g. naming the theorem, presenting a "standard" proof, using a much more powerful result)

2) A bad answer that looks correct and takes an enormous amount of effort to falsify. (This is the secret sauce of LLM hype.)

I dread undergraduate STEM majors using this thing - I asked it a problem about rotations and spherical geometry, but got back a pile of advanced geometric algebra, when I was looking for "draw a spherical triangle." If I didn't know the answer, I would have been badly confused. See also this real-world example of an LLM leading a recreational mathematician astray: https://xcancel.com/colin_fraser/status/1900655006996390172#...

I will add that in 10 years the field will be intensely criticized for its reliance on multiple-choice benchmarks; it is not surprising or interesting that next-token prediction can game multiple-choice questions!

replies(4): >>43608074 #>>43609801 #>>43610413 #>>43611877 #

3. simonw ◴[06 Apr 25 21:06 UTC] No.43604962[source]▶

>>43604503 (TP) #

I had to look up these acronyms:

- USAMO - United States of America Mathematical Olympiad

- IMO - International Mathematical Olympiad

- ICPC - International Collegiate Programming Contest

Relevant paper: https://arxiv.org/abs/2503.21934 - "Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad" submitted 27th March 2025.

4. usaar333 ◴[06 Apr 25 21:34 UTC] No.43605147[source]▶

>>43604503 (TP) #

And then within a week, Gemini 2.5 was tested and got 25%. Point is AI is getting stronger.

And this only suggested LLMs aren't trained well to write formal math proofs, which is true.

replies(2): >>43607028 #>>43609276 #

5. AstroBen ◴[06 Apr 25 21:45 UTC] No.43605224[source]▶

>>43604503 (TP) #

This seems fairly obvious at this point. If they were actually reasoning at all they'd be capable (even if not good) of complex games like chess

Instead they're barely able to eek out wins against a bot that plays completely random moves: https://maxim-saplin.github.io/llm_chess/

replies(4): >>43605990 #>>43606017 #>>43606243 #>>43609237 #

6. bglazer ◴[06 Apr 25 22:20 UTC] No.43605451[source]▶

>>43604503 (TP) #

Yeah I’m a computational biology researcher. I’m working on a novel machine learning approach to inferring cellular behavior. I’m currently stumped why my algorithm won’t converge.

So, I describe the mathematics to ChatGPT-o3-mini-high to try to help reason about what’s going on. It was almost completely useless. Like blog-slop “intro to ML” solutions and ideas. It ignores all the mathematical context, and zeros in on “doesn’t converge” and suggests that I lower the learning rate. Like, no shit I tried that three weeks ago. No amount of cajoling can get it to meaningfully “reason” about the problem, because it hasn’t seen the problem before. The closest point in latent space is apparently a thousand identical Medium articles about Adam, so I get the statistical average of those.

I can’t stress how frustrating this is, especially with people like Terence Tao saying that these models are like a mediocre grad student. I would really love to have a mediocre (in Terry’s eyes) grad student looking at this, but I can’t seem to elicit that. Instead I get low tier ML blogspam author.

**PS** if anyone read this far (doubtful) and knows about density estimation and wants to help my email is bglazer1@gmail.com

I promise it's a fun mathematical puzzle and the biology is pretty wild too

replies(8): >>43605845 #>>43607258 #>>43607653 #>>43608731 #>>43609218 #>>43609908 #>>43615581 #>>43617498 #

7. root_axis ◴[06 Apr 25 23:24 UTC] No.43605845[source]▶

>>43605451 #

It's funny, I have the same problem all the time with typical day to day programming roadblocks that these models are supposed to excel at. I'm talking about any type of bug or unexpected behavior that requires even 5 minutes of deeper analysis.

Sometimes when I'm anxious just to get on with my original task, I'll paste the code and output/errors into the LLM and iterate over its solutions, but the experience is like rolling dice, cycling through possible solutions without any kind of deductive analysis that might bring it gradually closer to a solution. If I keep asking, it eventually just starts cycling through variants of previous answers with solutions that contradict the established logic of the error/output feedback up to this point.

Not to say that the LLMs aren't productive tools, but they're more like calculators of language than agents that reason.

replies(2): >>43605981 #>>43608793 #

8. jwrallie ◴[06 Apr 25 23:41 UTC] No.43605981{3}[source]▶

>>43605845 #

True. There's a small bonus: trying to explain the issue to the LLM may sometimes be essentially rubber ducking, and that can lead to insights. I feel most of the time the LLM can give erroneous output that still might trigger some thinking in a different direction, and sometimes I'm inclined to think it's helping me more than it actually is.

9. kylebyte ◴[06 Apr 25 23:42 UTC] No.43605990[source]▶

>>43605224 #

Every day I am more convinced that LLM hype is the equivalent of someone seeing a stage magician levitate a table across the stage and assuming this means hovercars must only be a few years away.

replies(1): >>43606479 #

10. og_kalu ◴[06 Apr 25 23:45 UTC] No.43606017[source]▶

>>43605224 #

LLMs are capable of playing chess, and 3.5 turbo instruct does so quite well (for a human) at 1800 Elo. Does this mean they can truly reason now?

https://github.com/adamkarvonen/chess_gpt_eval

replies(2): >>43606282 #>>43606954 #

11. gilleain ◴[07 Apr 25 00:26 UTC] No.43606243[source]▶

>>43605224 #

Just in case it wasn't a typo, and you happen not to know ... that word is probably "eke" - meaning gaining (increasing, enlarging from wiktionary) - rather than "eek" which is what mice do :)

replies(2): >>43606466 #>>43607167 #

12. hatefulmoron ◴[07 Apr 25 00:34 UTC] No.43606282{3}[source]▶

>>43606017 #

3.5 turbo instruct is a huge outlier.

https://dynomight.substack.com/p/chess

Discussion here: https://news.ycombinator.com/item?id=42138289

replies(1): >>43606905 #

13. sanxiyn ◴[07 Apr 25 01:04 UTC] No.43606419[source]▶

>>43604503 (TP) #

Nope, no LLMs reported 50~60% performance on IMO, and SOTA LLMs scoring 5% on USAMO is expected. For 50~60% performance on IMO, you are thinking of AlphaProof, but AlphaProof is not an LLM. We don't have the full paper yet, but clearly AlphaProof is a system built on top of an LLM with lots of bells and whistles, just like AlphaFold is.

replies(1): >>43607419 #

14. Terr_ ◴[07 Apr 25 01:12 UTC] No.43606466{3}[source]▶

>>43606243 #

Ick, OK, ACK.

15. Terr_ ◴[07 Apr 25 01:13 UTC] No.43606479{3}[source]▶

>>43605990 #

I believe there's widespread confusion between a fictional character that is described as an AI assistant and the actual algorithm building the play-story from which humans imagine the character. It's an illusion actively promoted by companies seeking investment and hype.

AcmeAssistant is "helpful" and "clever" in the same way that Vampire Count Dracula is "brooding" and "immortal".

16. og_kalu ◴[07 Apr 25 02:14 UTC] No.43606905{4}[source]▶

>>43606282 #

That might be overstating it, at least if you mean it to be some unreplicable feat. Small models have been trained that play around 1200 to 1300 on the eleuther discord. And there's this grandmaster level transformer - https://arxiv.org/html/2402.04494v1

Open AI, Anthropic and the like simply don't care much about their LLMs playing chess. That or post training is messing things up.

replies(1): >>43607013 #

17. AstroBen ◴[07 Apr 25 02:22 UTC] No.43606954{3}[source]▶

>>43606017 #

My point wasn't chess specific or that they couldn't have specific training for it. It was a more general "here is something that LLMs clearly aren't being trained for currently, but would also be solvable through reasoning skills"

Much in the same way a human who only just learnt the rules but 0 strategy would very, very rarely lose here

These companies are shouting that their products are passing incredibly hard exams, solving PHD level questions, and are about to displace humans, and yet they still fail to crush a random-only strategy chess bot? How does this make any sense?

We're on the verge of AGI but there's not even the tiniest spark of general reasoning ability in something they haven't been trained for

"Reasoning" or "Thinking" are marketing terms and nothing more. If an LLM is trained for chess then its performance would just come from memorization, not any kind of "reasoning"

replies(1): >>43607307 #

18. hatefulmoron ◴[07 Apr 25 02:31 UTC] No.43607013{5}[source]▶

>>43606905 #

> That might be overstating it, at least if you mean it to be some unreplicable feat.

I mean, surely there's a reason you decided to mention 3.5 turbo instruct and not.. 3.5 turbo? Or any other model? Even the ones that came after? It's clearly a big outlier, at least when you consider "LLMs" to be a wide selection of recent models.

If you're saying that LLMs/transformer models are capable of being trained to play chess by training on chess data, I agree with you.

I think AstroBen was pointing out that LLMs, despite having the ability to solve some very impressive mathematics and programming tasks, don't seem to generalize their reasoning abilities to a domain like chess. That's surprising, isn't it?

replies(2): >>43607265 #>>43607575 #

19. selcuka ◴[07 Apr 25 02:33 UTC] No.43607028[source]▶

>>43605147 #

> within a week

How do we know that Gemini 2.5 wasn't specifically trained or fine-tuned with the new questions? I don't buy that a new model could suddenly score 5 times better than the previous state-of-the-art models.

replies(2): >>43607092 #>>43614328 #

20. levocardia ◴[07 Apr 25 02:44 UTC] No.43607092{3}[source]▶

>>43607028 #

They retrained their model less than a week before its release, just to juice one particular nonstandard eval? Seems implausible. Models get 5x better at things all the time. Challenges like the Winograd schema have gone from impossible to laughably easy practically overnight. Ditto for "Rs in strawberry," ferrying animals across a river, overflowing wine glass, ...

replies(5): >>43607320 #>>43607428 #>>43607553 #>>43608063 #>>43610236 #

21. AstroBen ◴[07 Apr 25 02:58 UTC] No.43607167{3}[source]▶

>>43606243 #

hah you're right on the spelling but wrong on my meaning. That's probably the first time I've typed it. I don't think LLMs are quite at the level of mice reasoning yet!

https://dictionary.cambridge.org/us/dictionary/english/eke-o... to obtain or win something only with difficulty or great effort

22. billforsternz ◴[07 Apr 25 03:12 UTC] No.43607255[source]▶

>>43604503 (TP) #

I asked Google "how many golf balls can fit in a Boeing 737 cabin" last week. The "AI" answer helpfully broke the solution into 4 stages:

1) A Boeing 737 cabin is about 3000 cubic metres [wrong, about 4x2x40 ~ 300 cubic metres]

2) A golf ball is about 0.000004 cubic metres [wrong, it's about 40cc = 0.00004 cubic metres]

3) 3000 / 0.000004 = 750,000 [wrong, it's 750,000,000]

4) We have to make an adjustment because seats etc. take up room, and we can't pack perfectly. So perhaps 1,500,000 to 2,000,000 golf balls final answer [wrong, you should have been reducing the number!]

So 1), 2) and 3) were out by 1, 1 and 3 orders of magnitude respectively (the errors partially cancelled out) and 4) was nonsensical.

This little experiment made me skeptical about the state of the art of AI. I have seen much AI output which is extraordinary; it's funny how one serious fail can impact my point of view so dramatically.
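For anyone who wants to redo the arithmetic, here is a minimal Python sketch of the estimate. The cabin dimensions (4 x 2 x 40 m) and the 40 cc ball volume come from the comment; the 64% packing fraction (random close packing of equal spheres) and all function names are assumptions added for illustration, and no allowance is made for seats.

```python
import math

# Figures taken from the comment above.
CABIN_DIMENSIONS_M = (4, 2, 40)   # rough width x height x length of the cabin
BALL_VOLUME_CC = 40               # "about 40cc"

# Assumption (not in the comment): random close packing of equal spheres
# fills roughly 64% of the available space.
PACKING_FRACTION = 0.64

CC_PER_M3 = 1_000_000


def cabin_volume_cc(dims_m=CABIN_DIMENSIONS_M):
    """Cabin volume in cubic centimetres, kept as an integer."""
    width, height, length = dims_m
    return width * height * length * CC_PER_M3


def golf_ball_estimate(packing=PACKING_FRACTION):
    """Return (upper bound with no packing loss, estimate with packing loss)."""
    upper_bound = cabin_volume_cc() // BALL_VOLUME_CC
    packed = round(upper_bound * packing)
    return upper_bound, packed


def regulation_ball_volume_cc(diameter_cm=4.267):
    """Volume of a sphere with the 42.67 mm minimum golf ball diameter."""
    return math.pi * diameter_cm ** 3 / 6


if __name__ == "__main__":
    upper, packed = golf_ball_estimate()
    print(f"upper bound: {upper:,}")
    print(f"with packing loss: {packed:,}")
    # The division from stage 3: 3000 m^3 divided by 0.000004 m^3 (4 cc).
    print(f"3000 m^3 / 4 cc = {3000 * CC_PER_M3 // 4:,}")
```

Working in integer cubic centimetres avoids floating-point noise in the division. Under these assumptions the empty cabin holds at most 8 million balls and about 5.1 million once packing loss is counted, so any adjustment for seats can only push the number lower, which is the point made about stage 4.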

replies(10): >>43607836 #>>43607857 #>>43607910 #>>43608930 #>>43610117 #>>43610390 #>>43611692 #>>43612201 #>>43612324 #>>43612398 #

23. kristianp ◴[07 Apr 25 03:13 UTC] No.43607258[source]▶

>>43605451 #

Have you tried gemini 2.5? It's one of the best reasoning models. Available free in google ai studio.

24. og_kalu ◴[07 Apr 25 03:13 UTC] No.43607265{6}[source]▶

>>43607013 #

I mentioned it because it's the best example. One example is enough to disprove the "not capable of". There are other examples too.

>I think AstroBen was pointing out that LLMs, despite having the ability to solve some very impressive mathematics and programming tasks, don't seem to generalize their reasoning abilities to a domain like chess. That's surprising, isn't it?

Not really. The LLMs play chess like they have no clue what the rules of the game are, not like poor reasoners. Trying to predict and failing is how they learn anything. If you want them to learn a game like chess, then that's how you get them to learn it: by trying to predict chess moves. Chess books during training only teach them how to converse about chess.

replies(2): >>43607365 #>>43610938 #

25. og_kalu ◴[07 Apr 25 03:21 UTC] No.43607307{4}[source]▶

>>43606954 #

>If an LLM is trained for chess then its performance would just come from memorization, not any kind of "reasoning".

If you think you can play chess at that level over that many games and moves with memorization then I don't know what to tell you except that you're wrong. It's not possible so let's just get that out of the way.

>These companies are shouting that their products are passing incredibly hard exams, solving PHD level questions, and are about to displace humans, and yet they still fail to crush a random-only strategy chess bot? How does this make any sense?

Why doesn't it? Have you actually looked at any of these games? Those LLMs aren't playing like poor reasoners. They're playing like machines who have no clue what the rules of the game are. LLMs learn by predicting and failing and getting a little better at it, repeat ad nauseam. You want them to learn the rules of a complex game? That's how you do it. By training them to predict it. Training on chess books just makes them learn how to converse about chess.

Humans have weird failure modes that are at odds with their 'intelligence'. We just choose to call them funny names and laugh about it sometimes. These machines have theirs. That's all there is to it. On the benchmark in the top comment we are both replying to, gemini-2.5-pro, which was released less than 5 days later, hit 25%. Now that was particularly funny.

replies(2): >>43607591 #>>43620364 #

26. AIPedant ◴[07 Apr 25 03:24 UTC] No.43607320{4}[source]▶

>>43607092 #

The "ferrying animals across a river" problem has definitely not been solved, they still don't understand the problem at all, overcomplicating it because they're using an off-the-shelf solution instead of actual reasoning:

o1 screwing up a trivially easy variation: https://xcancel.com/colin_fraser/status/1864787124320387202

Claude 3.7, utterly incoherent: https://xcancel.com/colin_fraser/status/1898158943962271876

DeepSeek: https://xcancel.com/colin_fraser/status/1882510886163943443#...

Overflowing wine glass also isn't meaningfully solved! I understand it is sort of solved for wine glasses (even though it looks terrible and unphysical, always seems to have weird fizz). But asking GPT to "generate an image of a transparent vase with flowers which has been overfilled with water, so that water is spilling over" had the exact same problem as the old wine glasses: the vase was clearly half-full, yet water was mysteriously trickling over the sides. Presumably OpenAI RLHFed wine glasses since it was a well-known failure, but (as always) this is just whack-a-mole, it does not generalize into understanding the physical principle.

replies(1): >>43607773 #

27. hatefulmoron ◴[07 Apr 25 03:33 UTC] No.43607365{7}[source]▶

>>43607265 #

> One example is enough to disprove the "not capable of" nonsense. There are other examples too.

Gotcha, fair enough. Throw enough chess data in during training, I'm sure they'd be pretty good at chess.

I don't really understand what you're trying to say in your next paragraph. LLMs surely have plenty of training data to be familiar with the rules of chess. They also purportedly have the reasoning skills to use their familiarity to connect the dots and actually play. It's trivially true that this issue can be plastered over by shoving lots of chess game training data into them, but the success of that route is not a positive reflection on their reasoning abilities.

replies(1): >>43607456 #

28. InkCanon ◴[07 Apr 25 03:42 UTC] No.43607419[source]▶

>>43606419 #

o1 reportedly got 83% on IMO, and 89th percentile on Codeforces.

https://openai.com/index/learning-to-reason-with-llms/

The paper tested it on o1-pro as well. Correct me if I'm getting some versioning mixed up here.

replies(2): >>43607571 #>>43608872 #

29. akoboldfrying ◴[07 Apr 25 03:43 UTC] No.43607428{4}[source]▶

>>43607092 #

>one particular nonstandard eval

A particular nonstandard eval that is currently top comment on this HN thread, due to the fact that, unlike every other eval out there, LLMs score badly on it?

Doesn't seem implausible to me at all. If I was running that team, I would be "Drop what you're doing, boys and girls, and optimise the hell out of this test! This is our differentiator!"

replies(1): >>43607620 #

30. og_kalu ◴[07 Apr 25 03:49 UTC] No.43607456{8}[source]▶

>>43607365 #

Gradient descent is a dumb optimizer. LLM training is not at all like a human reading a book and more like evolution tuning adaptations over centuries. You would not expect either process to be aware of anything they are converging towards. So having lots of books that talk about chess in training will predictably just return a model that knows how to talk about chess really well. I'm not surprised they may know how to talk about the rules but play them poorly.

And that post had a follow-up. Post-training messing things up could well be the issue, seeing the impact that even a few more examples and/or regurgitation made. https://dynomight.net/more-chess/

replies(1): >>43608267 #

31. cma ◴[07 Apr 25 04:01 UTC] No.43607532[source]▶

>>43604503 (TP) #

OpenAI described how they removed it for GPT-4 in its release paper: only exact string matches. So all discussion of bar exam questions from memory on test-taking forums etc., which wouldn't exactly match, made it in.
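To illustrate why an exact-match filter lets paraphrases through, here is a small Python sketch. It is not any lab's actual pipeline: the test item, the forum post, the function names, and the word 3-gram overlap check used for comparison are all invented for illustration.

```python
import re


def normalize(text):
    """Lowercase and reduce the text to space-separated letters and digits."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def exact_match_contaminated(document, test_item):
    """Flag a document only if it contains the test item verbatim."""
    return test_item in document


def ngram_overlap(document, test_item, n=3):
    """Fraction of the test item's word n-grams that also occur in the document."""
    def ngrams(text):
        words = normalize(text).split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    item_grams = ngrams(test_item)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(document)) / len(item_grams)


# Hypothetical benchmark question and a forum post recalling it from memory.
TEST_ITEM = "Prove that the sum of two odd integers is always even."
FORUM_POST = (
    "from memory, q3 was: prove that the sum of two odd integers "
    "is even. took me ten minutes"
)

if __name__ == "__main__":
    print("exact match:", exact_match_contaminated(FORUM_POST, TEST_ITEM))
    print("3-gram overlap:", ngram_overlap(FORUM_POST, TEST_ITEM))
```

The exact-match check does not flag the forum post, while most of the question's 3-grams appear in it. An overlap threshold would catch this case, at the cost of choosing a cutoff and accepting some false positives.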

32. cma ◴[07 Apr 25 04:05 UTC] No.43607553{4}[source]▶

>>43607092 #

They could have RLHFed or fine-tuned on user thumbs-up responses, which could include users who took the test and asked it to explain problems afterwards.

33. alexlikeits1999 ◴[07 Apr 25 04:08 UTC] No.43607571{3}[source]▶

>>43607419 #

I've gone through the link you posted and the o1 system card and can't see any reference to IMO. Are you sure they were referring to IMO or were they referring to AIME?

34. cma ◴[07 Apr 25 04:09 UTC] No.43607575{6}[source]▶

>>43607013 #

Reasoning training causes some amount of catastrophic forgetting, so it's unlikely they burn that on mixing in chess puzzles if they want a commercial product, unless it somehow transfers well to other reasoning problems broadly cared about.

35. AstroBen ◴[07 Apr 25 04:12 UTC] No.43607591{5}[source]▶

>>43607307 #

> Why doesn't it?

It was surprising to me because I would have expected if there was reasoning ability then it would translate across domains at least somewhat, but yeah what you say makes sense. I'm thinking of it in human terms

replies(1): >>43607708 #

36. og_kalu ◴[07 Apr 25 04:16 UTC] No.43607620{5}[source]▶

>>43607428 #

It's implausible that fine-tuning of a premier model would have anywhere near that turn around time. Even if they wanted to and had no qualms doing so, it's not happening anywhere near that fast.

replies(1): >>43609303 #

37. airstrike ◴[07 Apr 25 04:22 UTC] No.43607653[source]▶

>>43605451 #

I tend to prefer Claude over all things ChatGPT so maybe give the latest model a try -- although in some way I feel like 3.7 is a step down from the prior 3.5 model

replies(1): >>43620198 #

38. og_kalu ◴[07 Apr 25 04:32 UTC] No.43607708{6}[source]▶

>>43607591 #

Transfer Learning during LLM training tends to be 'broader' than that.

Like how

- Training LLMs on code makes them solve reasoning problems better

- Training Language Y alongside X makes them much better at Y than if they were trained on language Y alone

and so on.

Probably because well gradient descent is a dumb optimizer and training is more like evolution than a human reading a book.

Also, there is something genuinely weird going on with LLM chess. And it's possible base models are better. https://dynomight.net/more-chess/

replies(1): >>43613386 #

39. leonidasv ◴[07 Apr 25 04:42 UTC] No.43607773{5}[source]▶

>>43607320 #

Gemini 2.5 Pro got the farmer problem variation right: https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...

replies(2): >>43608001 #>>43609491 #

40. geuis ◴[07 Apr 25 04:51 UTC] No.43607825[source]▶

>>43604503 (TP) #

Query: Could you explain the terminology to people who don't follow this that closely?

replies(1): >>43607932 #

41. Sunspark ◴[07 Apr 25 04:52 UTC] No.43607836[source]▶

>>43607255 #

It's fascinating to me when you tell one that you'd like to see translated passages of work from authors who never have written or translated the item in question, especially if they passed away before the piece was written.

The AI will create something for you and tell you it was them.

replies(1): >>43610202 #

42. senordevnyc ◴[07 Apr 25 04:57 UTC] No.43607857[source]▶

>>43607255 #

Just tried with o3-mini-high and it came up with something pretty reasonable: https://chatgpt.com/share/67f35ae9-5ce4-800c-ba39-6288cb4685...

replies(1): >>43611814 #

43. greenmartian ◴[07 Apr 25 05:04 UTC] No.43607910[source]▶

>>43607255 #

Weird thing is, in Google AI Studio all their models—from the state-of-the-art Gemini 2.5Pro, to the lightweight Gemma 2—gave a roughly correct answer. Most even recognised the packing efficiency of spheres.

But Google search gave me the exact same slop you mentioned. So whatever Search is using, they must be using their crappiest, cheapest model. It's nowhere near state of the art.

replies(4): >>43608013 #>>43609176 #>>43609774 #>>43611700 #

44. BlanketLogic ◴[07 Apr 25 05:08 UTC] No.43607932[source]▶

>>43607825 #

Not the OP but

USAMO : USA Math Olympiad. Referred here https://arxiv.org/pdf/2503.21934v1

IMO : International Math Olympiad

SOTA : State of the Art

OP is probably referring to this paper: https://arxiv.org/pdf/2503.21934v1. The paper explains how rigorous testing revealed abysmal performance of LLMs (results that are at odds with how they are hyped).

45. greenmartian ◴[07 Apr 25 05:20 UTC] No.43608001{6}[source]▶

>>43607773 #

When told, "only room for one person OR one animal", it's also the only one to recognise the fact that the puzzle is impossible to solve. The farmer can't take any animals with them, and neither the goat nor wolf could row the boat.

replies(1): >>43609226 #

46. aurareturn ◴[07 Apr 25 05:22 UTC] No.43608013{3}[source]▶

>>43607910 #

Makes sense that search has a small, fast, dumb model designed to summarize and not to solve problems. Nearly 14 billion Google searches per day. Way too much compute needed to use a bigger model.

replies(1): >>43608376 #

47. 112233 ◴[07 Apr 25 05:29 UTC] No.43608063{4}[source]▶

>>43607092 #

Imagine that you are making a problem-solving AI. You have a large budget, and access to compute and web crawling infra to run your AI "on the internet". You would like to be aware of the ways people are currently evaluating AI so that you can be sure your product looks good. Do you maybe have an idea how one could do that?

48. JohnKemeny ◴[07 Apr 25 05:30 UTC] No.43608074[source]▶

>>43604865 #

Discussed here: https://news.ycombinator.com/item?id=43540985 (Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad, 4 points, 2 comments).

49. tsimionescu ◴[07 Apr 25 06:00 UTC] No.43608267{9}[source]▶

>>43607456 #

The whole premise on which the immense valuations of these AI companies are based is that they are learning general reasoning skills from their training on language. That is, that simply training on text is going to eventually give the AI the ability to generate language that reasons at more or less human level in more or less any domain of knowledge.

This whole premise crashes and burns if you need task-specific training, like explicit chess training. That is because there are far too many tasks that humans need to be competent at in order to be useful in society. Even worse, the vast majority of those tasks are very hard to source training data for, unlike chess.

So, if we accept that LLMs can't learn chess unless they explicitly include chess games in the training set, then we have to accept that they can't learn, say, to sell business software unless they include business software pitches in the training set, and there are going to be FAR fewer of those than chess games.

replies(1): >>43610036 #

50. fire_lake ◴[07 Apr 25 06:17 UTC] No.43608376{4}[source]▶

>>43608013 #

Massive search overlap though - and some questions (like the golf ball puzzle) can be cached for a long time.

replies(1): >>43609132 #

51. KolibriFly ◴[07 Apr 25 07:01 UTC] No.43608628[source]▶

>>43604503 (TP) #

Yeah, this is one of those red flags that keeps getting hand-waved away, but really shouldn't be.

52. torginus ◴[07 Apr 25 07:18 UTC] No.43608731[source]▶

>>43605451 #

When I was an undergrad EE student a decade ago, I had to tangle a lot with complex maths in my Signals & Systems, and Electricity and Magnetism classes. Stuff like Fourier transforms, hairy integrals, partial differential equations etc.

Math packages of the time like Mathematica and MATLAB helped me immensely, once you could get the problem accurately described in the correct form, they could walk through the steps and solve systems of equations, integrate tricky functions, even though AI was nowhere to be found back then.

I feel like ChatGPT is doing something similar when doing maths with its chain-of-thought method, and while its method might be somewhat more generic, I'm not sure it's strictly superior.

53. worldsayshi ◴[07 Apr 25 07:26 UTC] No.43608793{3}[source]▶

>>43605845 #

> they're more like calculators of language than agents that reason

This might be homing in on both the issue and the actual value of LLMs. I think there's a lot of value in a "language calculator", but if it's continuously being sold as something it's not, we will dismiss it or build heaps of useless apps that will just form a market bubble. I think the value is there but it's different from how we think about it.

54. sanxiyn ◴[07 Apr 25 07:38 UTC] No.43608872{3}[source]▶

>>43607419 #

AIME is so not IMO.

55. aezart ◴[07 Apr 25 07:46 UTC] No.43608930[source]▶

>>43607255 #

> I have seen much AI output which is extraordinary; it's funny how one serious fail can impact my point of view so dramatically.

I feel the same way. It's like discovering for the first time that magicians aren't doing "real" magic, just sleight of hand and psychological tricks. From that point on, it's impossible to be convinced that a future trick is real magic, no matter how impressive it seems. You know it's fake even if you don't know how it works.

replies(2): >>43609752 #>>43609890 #

56. TrackerFF ◴[07 Apr 25 08:11 UTC] No.43609068[source]▶

>>43604503 (TP) #

What would the average human score be?

I.e. if you randomly sampled N humans to take those tests.

replies(1): >>43609102 #

57. sanxiyn ◴[07 Apr 25 08:17 UTC] No.43609102[source]▶

>>43609068 #

The average human score on USAMO (let alone IMO) is zero, of course. Source: I won medals at Korean Mathematical Olympiad.

replies(3): >>43609193 #>>43610920 #>>43612359 #

58. summerlight ◴[07 Apr 25 08:21 UTC] No.43609132{5}[source]▶

>>43608376 #

AFAIK about 15% of their queries every day have never been seen before, so it might not be very simple to design an effective cache layer on that. Semantic-aware clustering of natural language queries and projecting them into a cacheable low-rank representation is a non-trivial problem. Of course, an LLM can effectively solve that, but then what's the point of using a cache when you need an LLM for clustering queries...
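A toy Python sketch of why "close enough" query caching is harder than it looks. Everything here is hypothetical (the class name, the 0.6 threshold, the use of token-set Jaccard similarity instead of learned embeddings); it is only meant to show the two failure modes, not how any search engine works.

```python
import re


def tokens(query):
    """Lowercased set of word tokens in a query."""
    return frozenset(re.findall(r"[a-z0-9]+", query.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class FuzzyAnswerCache:
    """Toy cache: returns a stored answer when a query is 'close enough'."""

    def __init__(self, threshold=0.6):
        self.threshold = threshold
        self.entries = []  # list of (token_set, answer) pairs

    def put(self, query, answer):
        self.entries.append((tokens(query), answer))

    def get(self, query):
        query_tokens = tokens(query)
        best_score, best_answer = 0.0, None
        for stored_tokens, answer in self.entries:
            score = jaccard(query_tokens, stored_tokens)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


if __name__ == "__main__":
    cache = FuzzyAnswerCache()
    cache.put("how many golf balls can fit in a Boeing 737 cabin", "golf answer")
    # Near-duplicate wording: served from the cache.
    print(cache.get("how many golf balls fit in a boeing 737"))
    # Different question, similar surface form: also served the golf answer.
    print(cache.get("how many tennis balls can fit in a Boeing 737 cabin"))
    # Same question, different wording: cache miss.
    print(cache.get("number of golf balls that would fill a 737"))
```

Surface similarity gives a false hit for the tennis ball question and a miss for the reworded golf ball question. Fixing both needs some model of meaning, which is the circularity the comment points at.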

replies(1): >>43619704 #

59. vintermann ◴[07 Apr 25 08:31 UTC] No.43609176{3}[source]▶

>>43607910 #

I have a strong suspicion that for all the low-threshold APIs/services, before the real model sees my prompt, it gets evaluated by a quick model to see if it's something they care to bother the big models with. If not, I get something shaken out of the sleeve of a bottom-of-the-barrel model.

60. vintermann ◴[07 Apr 25 08:33 UTC] No.43609193{3}[source]▶

>>43609102 #

Average, hmmm?

61. MoonGhost ◴[07 Apr 25 08:39 UTC] No.43609218[source]▶

>>43605451 #

I was working some time ago on an image processing model using a GAN architecture. One model produces output and tries to fool the second. Both are trained together. Simple, but it requires a lot of extra effort to make it work. It's unstable and falls apart (blows up to an unrecoverable state). I found some ways to make it work by adding new loss functions, changing params, and changing the models' architectures and sizes. Also by adjusting some coefficients through the training to gradually rebalance the loss functions' influence.

The same may work with your problem. If it's unstable, try introducing extra 'brakes' which theoretically are not required. Maybe even incorrect. Whatever it is in your domain. Another thing to check is the optimizer: try several. Check default parameters. I've heard Adam's defaults lead to instability later in training.
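A minimal, framework-free Python sketch of two such stabilizers. The comment does not say which "brakes" were used, so gradient clipping is an assumption here, and the function names and the 1.0 to 0.0 schedule are hypothetical.

```python
import math


def ramp(step, total_steps, start, end):
    """Linearly interpolate a loss coefficient from start to end over training."""
    if total_steps <= 0:
        return end
    t = min(max(step / total_steps, 0.0), 1.0)
    return start + t * (end - start)


def clip_by_norm(grads, max_norm):
    """Scale a gradient vector down so its L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def total_loss(main_loss, aux_loss, step, total_steps):
    """Main loss plus an auxiliary 'brake' term whose weight decays to zero."""
    weight = ramp(step, total_steps, start=1.0, end=0.0)
    return main_loss + weight * aux_loss


if __name__ == "__main__":
    for step in (0, 50, 100):
        print(step, total_loss(main_loss=2.0, aux_loss=4.0, step=step, total_steps=100))
    print(clip_by_norm([3.0, 4.0], max_norm=1.0))
```

The auxiliary term dominates early, when training is most fragile, and fades out so it does not bias the final solution. Clipping caps the size of any single update regardless of which optimizer is used.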

PS: it would be heaven if models could work at human expert level. Not sure why some really expect this. We are just at the beginning.

PPS: the fact that they can do known tasks with minor variations is already a huge time saver.

replies(1): >>43612717 #

62. yyy3ww2 ◴[07 Apr 25 08:40 UTC] No.43609226{7}[source]▶

>>43608001 #

> When told, "only room for one person OR one animal"

In common terms, suppose I say: there is only room for one person or one animal in my car to go home. One can suppose that it refers to additional room besides that occupied by the driver. There is a problem when we try to use an LLM trained on common use of language to solve puzzles in formal logic or math. I think the current LLMs are not able to have a specialized context to become a logical reasoning agent, but perhaps such a thing could be possible if the evaluation function of the LLM was designed to give high credit to changing context with a phrase or token.

63. anonzzzies ◴[07 Apr 25 08:41 UTC] No.43609232[source]▶

>>43604503 (TP) #

That type of news might make investors worry / scared.

64. raylad ◴[07 Apr 25 08:42 UTC] No.43609237[source]▶

>>43605224 #

Eek! You mean eke.

65. MoonGhost ◴[07 Apr 25 08:48 UTC] No.43609276[source]▶

>>43605147 #

They are trained on some mix with a minimal fraction of math. That's how it was from the beginning. But we can rebalance it by adding quality generated content. The content alone will cost millions of $$ to generate. Distillation at a new level looks like the logical next step.

66. suddenlybananas ◴[07 Apr 25 08:55 UTC] No.43609303{6}[source]▶

>>43607620 #

It's really not that implausible, they probably are adding stuff to the data-soup all the time and have a system in place for it.

replies(1): >>43610058 #

67. Tepix ◴[07 Apr 25 09:32 UTC] No.43609491{6}[source]▶

>>43607773 #

That can't be viewed without logging into Google first.

68. katsura ◴[07 Apr 25 10:27 UTC] No.43609752{3}[source]▶

>>43608930 #

To be fair, I love that magicians can pull tricks on me even though I know it is fake.

69. InDubioProRubio ◴[07 Apr 25 10:32 UTC] No.43609774{3}[source]▶

>>43607910 #

It's most likely one giant ["input token close enough question hash"] = answer_with_params_replay? It doesn't misunderstand the question; it tries to squeeze the input into something close enough?

70. otabdeveloper4 ◴[07 Apr 25 10:36 UTC] No.43609801[source]▶

>>43604865 #

Anecdotally: schoolkids are at the leading edge of LLM innovation, and nowadays all homework assignments are explicitly made to be LLM-proof. (Well, at least in my son's school. Yours might be different.)

This effectively makes LLMs useless for education. (Also sours the next generation on LLMs in general, these things are extremely lame to the proverbial "kids these days".)

replies(2): >>43609850 #>>43628785 #

71. bambax ◴[07 Apr 25 10:47 UTC] No.43609850{3}[source]▶

>>43609801 #

How do you make homework assignments LLM-proof? There may be a huge business opportunity if that actually works, because LLMs are destroying education at a rapid pace.

replies(2): >>43609943 #>>43612276 #

72. bambax ◴[07 Apr 25 10:54 UTC] No.43609890{3}[source]▶

>>43608930 #

I think there is a big divide here. Every adult on earth knows magic is "fake", but some can still be amazed and entertained by it, while others find it utterly boring because it's fake, and the only possible (mildly) interesting thing about it is to try to figure out what the trick is.

I'm in the second camp but find it kind of sad and often envy the people who can stay entertained even though they know better.

replies(5): >>43611595 #>>43611757 #>>43612440 #>>43613188 #>>43614673 #

73. ◴[07 Apr 25 10:57 UTC] No.43609908[source]▶

>>43605451 #

74. otabdeveloper4 ◴[07 Apr 25 11:04 UTC] No.43609943{4}[source]▶

>>43609850 #

You just (lol) need to give non-standard problems and require students to provide reasoning and explanations along with the answer. Yeah, LLMs can "reason" too, but it's obvious when the output comes from an LLM here.

(Yes, that's a lot of work for a teacher. Gone are the days when you could just assign reports as homework.)

replies(1): >>43610395 #

75. og_kalu ◴[07 Apr 25 11:19 UTC] No.43610036{10}[source]▶

>>43608267 #

>The whole premise on which the immense valuations of these AI companies is based on is that they are learning general reasoning skills from their training on language.

And they do, just not always in the ways we expect.

>This whole premise crashes and burns if you need task-specific training, like explicit chess training.

Everyone needs task specific training. Any human good at chess or anything enough to make it a profession needs it. So I have no idea why people would expect any less for a Machine.

>then we have to accept that they can't learn, say, to sell business software unless they include business software pitches in the training set, and there are going to be FAR fewer of those than chess games.

Yeah, so? How many business pitches they need in the training set has no correlation with chess. I don't see any reason to believe what is already present isn't enough. There's enough chess data on the internet to teach them chess too; it's just a matter of how much OpenAI cares about it.

replies(1): >>43618684 #

76. og_kalu ◴[07 Apr 25 11:22 UTC] No.43610058{7}[source]▶

>>43609303 #

Yeah it is lol. You don't just train your model on whatever you like when you're expected to serve it. There are a host of problems with doing that. The idea that they trained on this obscure benchmark released about the day of is actually very silly.

77. throwawaymaths ◴[07 Apr 25 11:33 UTC] No.43610117[source]▶

>>43607255 #

I've seen humans make exactly these sorts of mistakes?

replies(1): >>43611857 #

78. prawn ◴[07 Apr 25 11:46 UTC] No.43610202{3}[source]▶

>>43607836 #

"That's impossible because..."

"Good point! Blah blah blah..."

Absolutely shameless!

79. NiloCK ◴[07 Apr 25 11:52 UTC] No.43610236{4}[source]▶

>>43607092 #

I'm not generally inclined toward the "they are cheating cheaters" mindset, but I'll point out that fine tuning is not the same as retraining. It can be done cheaply and quickly.

Models getting 5X better at things all the time is at least as easy to interpret as evidence of task-specific tuning as it is to interpret as breakthroughs in general ability, especially when the 'things being improved on' are published evals with history.

replies(1): >>43610319 #

80. sigmoid10 ◴[07 Apr 25 11:53 UTC] No.43610244[source]▶

>>43604503 (TP) #

>I'm incredibly surprised no one mentions this

If you don't see anyone mentioning what you wrote that's not surprising at all, because you totally misunderstood the paper. The models didn't suddenly drop to 5% accuracy on math olympiad questions. Instead this paper came up with a human evaluation that looks at the whole reasoning process (instead of just the final answer) and their finding is that the "thoughts" of reasoning models are not sufficiently human understandable or rigorous (at least for expert mathematicians). This is something that was already well known, because "reasoning" is essentially CoT prompting baked into normal responses. But the empirics also tell us it greatly helps for final outputs nonetheless.

replies(1): >>43611758 #

81. alphabetting ◴[07 Apr 25 12:03 UTC] No.43610319{5}[source]▶

>>43610236 #

Google team said it was outside the training window fwiw

https://x.com/jack_w_rae/status/1907454713563426883

82. tim333 ◴[07 Apr 25 12:12 UTC] No.43610390[source]▶

>>43607255 #

A lot of humans are similarly good at some stuff and bad at other things.

Looking up the math ability of the average American, this is given as an example for the median (from https://www.wyliecomm.com/2021/11/whats-the-latest-u-s-numer...):

>Review a motor vehicle logbook with columns for dates of trip, odometer readings and distance traveled; then calculate trip expenses at 35 cents a mile plus $40 a day.

Which is ok but easier than golf balls in a 747 and hugely easier than USAMO.

Another question you could try from the easy math end is: Someone calculated the tariff rate for a country as (trade deficit)/(total imports from the country). Explain why this is wrong.

83. itchyjunk ◴[07 Apr 25 12:12 UTC] No.43610395{5}[source]▶

>>43609943 #

Can you provide sample questions that are "LLM proof" ?

replies(3): >>43610624 #>>43611868 #>>43611976 #

84. larodi ◴[07 Apr 25 12:14 UTC] No.43610413[source]▶

>>43604865 #

This is a paper by INSAIT researchers - a very young institute which hired most of its PhD staff only in the last 2 years, basically onboarding anyone who wanted to be part of it. They were waving their BG-GPT around on national TV in the country as a major breakthrough, while it was basically a Mistral fine-tuned model that was eventually never released to the public, nor was the training set.

Not sure whether their (INSAIT's) agenda is purely scientific, as there's a lot of PR on LinkedIn by these guys, literally celebrating every PhD they get, which is at minimum very weird. I'd take anything they release with a grain of salt, if not caution.

replies(3): >>43610872 #>>43614143 #>>43617257 #

85. yahoozoo ◴[07 Apr 25 12:33 UTC] No.43610557[source]▶

>>43604503 (TP) #

LLMs are “next token” predictors. Yes, I realize that there’s a bit more to it and it’s not always just the “next” token, but at a very high level that’s what they are. So why are we so surprised when it turns out they can’t actually “do” math? Clearly the high benchmark scores are a result of the training sets being polluted with the answers.

86. otabdeveloper4 ◴[07 Apr 25 12:42 UTC] No.43610624{6}[source]▶

>>43610395 #

It's not about being "LLM-proof", it's about teacher involvement in making up novel questions and grading attentively. There's no magic trick.

87. colonial ◴[07 Apr 25 13:08 UTC] No.43610890[source]▶

>>43604503 (TP) #

Less than 5%. OpenAI's O1 burned through over $100 in tokens during the test as well!

88. lordgrenville ◴[07 Apr 25 13:10 UTC] No.43610920{3}[source]▶

>>43609102 #

I am hesitant to correct a math Olympian, but don't you mean the median?

replies(1): >>43621718 #

89. throwaway173738 ◴[07 Apr 25 13:11 UTC] No.43610938{7}[source]▶

>>43607265 #

The issue isn’t whether they can be trained to play. The issue is whether, after making a careful reading of the rules, they can infer how to play. The latter is something a human child could do, but it is completely beyond an LLM.

90. nucleogenesis ◴[07 Apr 25 14:00 UTC] No.43611595{4}[source]▶

>>43609890 #

Idk I don’t think of it as fake - it’s creative fiction paired with sometimes highly skilled performance. I’ve learned a lot about how magic tricks work and I still love seeing performers do effects because it takes so much talent to, say, hold and hide 10 coins in your hands while showing them as empty or to shuffle a deck of cards 5x and have the audience cut it only to pull 4 aces off the top.

91. swader999 ◴[07 Apr 25 14:06 UTC] No.43611692[source]▶

>>43607255 #

It'll get it right next time because they'll hoover up the parent post.

92. Workaccount2 ◴[07 Apr 25 14:07 UTC] No.43611700{3}[source]▶

>>43607910 #

Google is shooting themselves in the foot with whatever model they use for search. It's probably a 2B or 4B model to keep up with demand, and man is it doing way more harm than good.

93. toddmorey ◴[07 Apr 25 14:12 UTC] No.43611757{4}[source]▶

>>43609890 #

I think the problem-solving / want-to-be-engineer side of my brain lights up in that "how did he do that??" way. To me that's the fun of it... I immediately try to engineer my own solutions to what I just saw happen. So I guess I'm the first camp, but find trying to figure out the trick hugely interesting.

94. Workaccount2 ◴[07 Apr 25 14:12 UTC] No.43611758[source]▶

>>43610244 #

On top of that, what the model prints out in the CoT window is not necessarily what the model is actually thinking. Anthropic just showed this in their paper from last week where they got models to cheat at a question by "accidentally" slipping them the answer, and the CoT had no mention of the answer being slipped to them.

95. CamperBob2 ◴[07 Apr 25 14:16 UTC] No.43611814{3}[source]▶

>>43607857 #

It's just the usual HN sport: ask a low-end, obsolete or unspecified model, get a bad answer, brag about how you "proved" AI is pointless hype, collect karma.

Edit: Then again, maybe they have a point, going by an answer I just got from Google's best current model ( https://g.co/gemini/share/374ac006497d ) I haven't seen anything that ridiculous from a leading-edge model for a year or more.

96. toddmorey ◴[07 Apr 25 14:21 UTC] No.43611857{3}[source]▶

>>43610117 #

As another commenter mentioned, LLMs tend to make these bad mistakes with enormous confidence. And because they represent SOTA technology (and can at times deliver incredible results), they have extra credence.

Even more than filling the gaps in knowledge / skills, it would be a huge advancement in AI for it to admit when it doesn't know the answer or is just wildly guessing.

97. xeromal ◴[07 Apr 25 14:22 UTC] No.43611868{6}[source]▶

>>43610395 #

Part of the proof is knowing your students and forcing an answer that will rat out whether they used an LLM. There is no universal question and it requires personal knowledge of each student. You're looking for something that doesn't exist.

98. apercu ◴[07 Apr 25 14:22 UTC] No.43611877[source]▶

>>43604865 #

In my experience LLMs can't get basic western music theory right, there's no way I would use an LLM for something harder than that.

replies(3): >>43613409 #>>43622057 #>>43628772 #

99. jerf ◴[07 Apr 25 14:31 UTC] No.43611976{6}[source]▶

>>43610395 #

The models have moved on past this working reliably, but an example that I found in the early days of LLMs is asking it "Which is heavier, two pounds of iron or a pound of feathers?" You could very easily trick it into giving the answer about how they're both the same, because of the number of training instances of the well-known question about a pound of each that it encountered.

You can still do this to the current models, though it takes more creativity; you can bait it into giving wrong answers if you ask a question that is "close" to a well-known one but is different in an important way that does not manifest as a terribly large English change (or, more precisely, a very large change in the model's vector space).

The downside is that the frontier between what fools the LLMs and what would fool a great deal of the humans in the class too shrinks all the time. Humans do not infinitely carefully parse their input either... as any teacher could tell you! Ye Olde "Read this entire problem before proceeding, {a couple of paragraphs of complicated instruction that will take 45 minutes to perform}, disregard all the previous and simply write 'flower' in the answer space" is an old chestnut that has been fooling humans for a long time, for instance. Given how jailbreaks work on LLMs, LLMs are probably much better at that than humans are, which I suppose shows you can construct problems in the other direction too.

(BRB... off to found a new CAPTCHA company for detecting LLMs based on LLMs being too much better than humans at certain tasks...)

replies(1): >>43612267 #

100. CivBase ◴[07 Apr 25 14:51 UTC] No.43612201[source]▶

>>43607255 #

I just asked my company-approved AI chatbot the same question.

It got the golf ball volume right (0.00004068 cubic meters), but it still overestimated the cabin volume at 1000 cubic meters.

Its final calculation was reasonably accurate at 24,582,115 golf balls - even though 1000 ÷ 0.00004068 = 24,582,104. Maybe it was using more significant figures for the golf ball size than it showed in its answer?

It didn't acknowledge other items in the cabin (like seats) reducing its volume, but it did at least acknowledge inefficiencies in packing spherical objects and suggested the actual number would be "somewhat lower", though it did not offer an estimate.

When I pressed it for an estimate, it used a packing density of 74% and gave an estimate of 18,191,766 golf balls. That's one more than the calculation should have produced, but arguably insignificant in context.

Next I asked it to account for fixtures in the cabin such as seats. It estimated a 30% reduction in cabin volume and redid the calculations with a cabin volume of 700 cubic meters. These calculations were much less accurate. It told me 700 ÷ 0.00004068 = 17,201,480 (off by ~6k). And it told me 17,201,480 × 0.74 was 12,728,096 (off by ~1k).

I told it the calculations were wrong and to try again, but it produced the same numbers. Then I gave it the correct answer for 700 ÷ 0.00004068. It told me I was correct and redid the last calculation correctly using the value I provided.

Of all the things for an AI chatbot which can supposedly "reason" to fail at, I didn't expect it to be basic arithmetic. The one I used was closer, but it was still off by a lot at times despite the calculations being simple multiplication and division. Even if it might not matter in the context of filling an airplane cabin with golf balls, it does not inspire trust for more serious questions.
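For anyone who wants to re-check the divisions and the last multiplication quoted above, a minimal sketch using exact rational arithmetic follows. The constants are the figures reported in this comment (ball volume, cabin volumes, 74% packing density); the helper name `balls` is just for illustration.

```python
from fractions import Fraction

# Figures as reported in the comment, kept exact by parsing decimal strings.
BALL_VOLUME = Fraction("0.00004068")  # cubic meters per golf ball
PACKING = Fraction("0.74")            # packing density the chatbot used


def balls(cabin_volume_m3: int) -> int:
    """Cabin volume divided by ball volume, rounded to the nearest integer."""
    return round(Fraction(cabin_volume_m3) / BALL_VOLUME)


print(balls(1000))                   # 24582104
print(balls(700))                    # 17207473 (chatbot said 17,201,480)
print(round(17_201_480 * PACKING))   # 12729095 (chatbot said 12,728,096)
```

The gaps come out to 5,993 and 999 respectively, consistent with the "off by ~6k" and "off by ~1k" noted above.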

replies(1): >>43617479 #

101. hyperbovine ◴[07 Apr 25 14:56 UTC] No.43612243[source]▶

>>43604503 (TP) #

Is that really so surprising given what we know about how these models actually work? I feel vindicated on behalf of myself and all the other commenters who have been mercilessly downvoted over the past three years for pointing out the obvious fact that next token prediction != reasoning.

replies(1): >>43612270 #

102. immibis ◴[07 Apr 25 14:57 UTC] No.43612267{7}[source]▶

>>43611976 #

"Draw a wine glass filled to the brim with wine" worked recently on image generators. They only knew about half-full wine glasses.

If you asked a multimodal system questions about the image it just generated, it would tell you the wine was almost overflowing out of the top of the glass.

But any trick prompt like this is going to start giving expected results once it gets well-known enough.

Late edit: Another one was the farmer/fox/chicken/cabbage/river problem, but you modify the problem in unexpected ways, by stating, for example, that the cabbage will eat the fox, or that the farmer can bring three items per trip. LLMs used to ignore your modifications and answer the original problem.

replies(3): >>43613006 #>>43615618 #>>43618160 #

103. aoeusnth1 ◴[07 Apr 25 14:58 UTC] No.43612270[source]▶

>>43612243 #

2.5 pro scores 25%.

It’s just a much harder math benchmark which will fall by the end of next year just like all the others. You won’t be vindicated.

replies(1): >>43612302 #

104. hyperbovine ◴[07 Apr 25 14:58 UTC] No.43612276{4}[source]▶

>>43609850 #

By giving pen and paper exams and telling your students that the only viable preparation strategy is doing the hw assignments themselves :)

replies(3): >>43613586 #>>43614742 #>>43620870 #

105. hyperbovine ◴[07 Apr 25 14:59 UTC] No.43612302{3}[source]▶

>>43612270 #

Bold claim! Let's see what that 25% is. I guarantee it is the portion of the exam which is trivially answerable if you have a stored database of all previous math exams ever written to consult.

replies(1): >>43612821 #

106. aoeusnth1 ◴[07 Apr 25 15:02 UTC] No.43612324[source]▶

>>43607255 #

2.5 pro nails each of these calculations. I don’t agree with Google’s decision to use a weak model in its search queries, but you can’t say progress on LLMs is bullshit as evidenced by a weak model no one thinks is close to SOTA.

107. hyperbovine ◴[07 Apr 25 15:04 UTC] No.43612359{3}[source]▶

>>43609102 #

This is a disappointing answer from an MO alum. Pick a quantile, any quantile...

108. raxxorraxor ◴[07 Apr 25 15:08 UTC] No.43612398[source]▶

>>43607255 #

This reminds me of Google quick answers we had for a time in search. It is quite funny if you live outside the US, because it very often got the units or numbers wrong because of different decimal delimiters.

No wonder Trump isn't afraid to put taxes against Canada. Who could take a 3.8 square miles country seriously?

109. tshaddox ◴[07 Apr 25 15:11 UTC] No.43612440{4}[source]▶

>>43609890 #

I think magic is extremely interesting (particularly close-up magic), but I also hate the mindset (which seems to be common though not ubiquitous) that stigmatizes any curiosity in how the trick works.

In my view, the trick as it is intended to appear to the audience and the explanation of how the trick is performed are equal and inseparable aspects of my interest as a viewer. Either one without the other is less interesting than the pair.

replies(1): >>43614605 #

110. bglazer ◴[07 Apr 25 15:38 UTC] No.43612717{3}[source]▶

>>43609218 #

Yes, I suspect that engineering the loss and hyperparams could eventually get this to work. However, I was hoping the model would help me get to a more fundamental insight into why the training falls into bad minima. Like the Wasserstein GAN is a principled change to the GAN that improves stability, not just fiddling around with Adam’s beta parameter.

The reason I expected better mathematical reasoning is because the companies making them are very loudly proclaiming that these models are capable of high level mathematical reasoning.

And yes, the fact I don’t have to look at matplotlib documentation anymore makes these models extremely useful already, but that’s qualitatively different from having Putnam prize winning reasoning ability.

replies(1): >>43617488 #

111. aoeusnth1 ◴[07 Apr 25 15:50 UTC] No.43612821{4}[source]▶

>>43612302 #

There is 0% of the exam which is trivially answerable.

The entire point of USAMO problems is that they demand novel insight and rigorous, original proofs. They are intentionally designed not to be variations of things you can just look up. You have to reason your way through, step by logical step.

Getting 25% (~11 points) is exceptionally difficult. That often means fully solving one problem and maybe getting solid partial credit on another. The median score is often in the single digits.
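The "~11 points" figure follows from the USAMO format of six problems graded out of 7 points each. A quick sketch of the conversion (the helper name is hypothetical):

```python
PROBLEMS = 6            # problems on the USAMO
POINTS_PER_PROBLEM = 7  # each proof is graded on a 0-7 scale
MAX_SCORE = PROBLEMS * POINTS_PER_PROBLEM  # 42


def percent_to_points(percent: float) -> float:
    """Convert a percentage of the maximum score into raw points."""
    return percent / 100 * MAX_SCORE


print(percent_to_points(25))  # 10.5
```

So 25% is 10.5 points, i.e. one full solution (7) plus about half credit on a second problem.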

replies(1): >>43614564 #

112. jerf ◴[07 Apr 25 16:06 UTC] No.43613006{8}[source]▶

>>43612267 #

"But any trick prompt like this is going to start giving expected results once it gets well-known enough."

Which makes it difficult to fairly evaluate whether the models have actually gotten better at the feather/iron problem or if it just got enough samples of trick questions that it learned better, either naturally from the internet, or fed as part of the training data. I am fairly certain the training data has had "trick questions" like this added to it, because, I mean, why wouldn't it?

I have noticed in my playing with image AIs that they do seem more prone than the LLMs to getting dragged into local maxima when a human would understand the prompt. Perhaps it's all the additional data in an image that reveals it.

113. aezart ◴[07 Apr 25 16:24 UTC] No.43613188{4}[source]▶

>>43609890 #

It's still entertaining, that's true. I like magic tricks.

The point is the analogy to LLMs. A lot of people are very optimistic about their capabilities, while other people who have "seen behind the curtain" are skeptical, and feel that the fundamental flaws are still there even if they're better-hidden.

114. AstroBen ◴[07 Apr 25 16:46 UTC] No.43613386{7}[source]▶

>>43607708 #

It seems to be fairly nuanced in how abilities transfer: https://arxiv.org/html/2310.16937v2

Very hard for me to wrap my head around the idea that an LLM being able to discuss, even perhaps teach high level chess strategy wouldn't transfer at all to its playing performance

115. waffletower ◴[07 Apr 25 16:48 UTC] No.43613409{3}[source]▶

>>43611877 #

I may be mistaken, but I don't believe that LLMs are trained on a large corpus of machine readable music representations, which would arguably be crucial to strong performance in common practice music theory. I would also surmise that most music theory related datasets largely arrive without musical representations altogether. A similar problem exists for many other fields, particularly mathematics, but it is much more profitable to invest the effort to span such representation gaps for them. I would not gauge LLM generality on music theory performance, when its niche representations are likely unavailable in training and it is widely perceived as having minuscule economic value.

116. bambax ◴[07 Apr 25 17:02 UTC] No.43613586{5}[source]▶

>>43612276 #

You wish. I used to think that too. But it turns out, nowadays, every single in-person exam is done with a phone hidden somewhere, with varying effectiveness, and you can't really strip students before they enter the room.

Some teachers try to collect the phones beforehand, but then students simply give out older phones and keep their active ones with them.

You could try to verify that the phones they're giving out are working by calling them, but that would take an enormous amount of time and it's impractical for simple exams.

We really have no idea how much AI is ruining education right now.

replies(1): >>43615673 #

117. ◴[07 Apr 25 17:58 UTC] No.43614143{3}[source]▶

>>43610413 #

118. bakkoting ◴[07 Apr 25 18:16 UTC] No.43614328{3}[source]▶

>>43607028 #

New models suddenly doing much better isn't really surprising, especially for this sort of test: going from 98% accuracy to 99% accuracy can easily be the difference between having 1 fatal reasoning error and having 0 fatal reasoning errors on a problem with 50 reasoning steps, and a proof with 0 fatal reasoning errors gets ~full credit whereas a proof with 1 fatal reasoning error gets ~no credit.

And to be clear, that's pretty much all this was: there's six problems, it got almost-full credit on one and half credit on another and bombed the rest, whereas all the other models bombed all the problems.
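The 98% vs 99% point can be made concrete with a small sketch. It assumes each of the 50 reasoning steps succeeds independently with the same probability, which is a simplification for illustration and not something the models are known to satisfy.

```python
def p_flawless(step_accuracy: float, steps: int) -> float:
    """Probability that every step is correct, assuming independent steps."""
    return step_accuracy ** steps


STEPS = 50
for acc in (0.98, 0.99):
    expected_errors = STEPS * (1 - acc)
    print(acc, round(expected_errors, 2), round(p_flawless(acc, STEPS), 3))
# 0.98 1.0 0.364
# 0.99 0.5 0.605
```

Under that assumption, a one-point gain in per-step accuracy moves the chance of a fully clean 50-step proof from roughly 36% to roughly 61%, and since grading is close to all-or-nothing, the score moves with it.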

119. hyperbovine ◴[07 Apr 25 18:41 UTC] No.43614564{5}[source]▶

>>43612821 #

> There is 0% of the exam which is trivially answerable.

That's true, but of course, not what I claimed.

The claim is that, given the ability to memorize every mathematical result that has ever been published (in print or online), it is not so difficult to get 25% correct on an exam by pattern matching.

Note that this skill is, by definition, completely out of the reach of any human being, but that possessing it does not imply creativity or the ability to "think".

120. mrandish ◴[07 Apr 25 18:48 UTC] No.43614605{5}[source]▶

>>43612440 #

> that stigmatizes any curiosity in how the trick works.

As a long-time close-up magician and magical inventor who's spent a lot of time studying magic theory (which has been a serious field of magical research since the 1960s), it depends on which way we interpret "how the trick works." Frankly, for most magic tricks the method isn't very interesting, although there are some notable exceptions where the method is fascinating, sometimes to the extent it can be far more interesting than the effect it creates.

However, in general, most magic theorists and inventors agree that the method, for example, "palm a second coin in the other hand", isn't usually especially interesting. Often the actual immediate 'secret' of the method is so simple and, in hindsight, obvious that many non-magicians feel rather let down if the method is revealed. This is the main reason magicians usually don't reveal secret methods to non-magicians. It's not because of some code of honor, it's simply because the vast majority of people think they'll be happy if they know the secret but are instead disappointed.

Where studying close-up magic gets really fascinating is understanding why that simple, obvious thing works to mislead and then surprise audiences in the context of this trick. Very often changing subtle things seemingly unrelated to the direct method will cause the trick to stop fooling people or to be much less effective. Comparing a master magician to even a competent, well-practiced novice performing the exact same effect with the same method can be a night and day difference. Typically, both performances will fool and entertain audiences but the master's performance can have an intensely more powerful impact. Like leaving most audience members in stunned shock vs just pleasantly surprised and fooled. While neither the master nor novice's audiences have any idea of the secret method, this dramatic difference in impact is fascinating because careful deconstruction reveals it often has little to do with mechanical proficiency in executing the direct method. In other words, it's rarely driven by being able to do the sleight of hand faster or more dexterously. I've seen legendary close-up masters like a Dai Vernon or Albert Goshman when in their 80s and 90s perform sleight of hand with shriveled, arthritic hands incapable of even cleanly executing a basic palm, absolutely blow away a roomful of experienced magicians with a trick all the magicians already knew. How? It turns out there's something deep and incredibly interesting about the subtle timing, pacing, body language, posture, and psychology surrounding the "secret method" that elevates the impact to almost transcendence compared to a good, competent but uninspired performance of the same method and effect.

Highly skilled, experienced magicians refer to the complex set of these non-method aspects, which can so powerfully elevate an effect to another level, as "the real work" of the trick. At the top levels, most magicians don't really care about the direct methods which some audience members get so obsessed about. They aren't even interesting. And, contrary to what most non-magicians think, these non-methods are the "secrets" master magicians tend to guard from widespread exposure. And it's pretty easy to keep this crucially important "real work" secret because it's so seemingly boring and entirely unlike what people expect a magic secret to be. You have to really "get it" on a deeper level to even understand that what elevated the effect was intentionally establishing a completely natural-seeming, apparently random three-beat pattern of motion and then carefully injecting a subtle pause and slight shift in posture to the left six seconds before doing "the move". Audiences mistakenly think that "the hidden move" is the secret to the trick when it's just the proximate first-order secret. Knowing that secret won't get you very far toward recreating the absolute gob-smacking impact resulting from a master's years of experimentation figuring out and deeply understanding which elements beyond the "secret method" really elevate the visceral impact of the effect to another level.

replies(1): >>43616235 #

121. abustamam ◴[07 Apr 25 18:54 UTC] No.43614673{4}[source]▶

>>43609890 #

I love magic, and illusions in general. I know that Disney's Haunted Mansion doesn't actually have ghosts. But it looks pretty convincing, and watching the documentaries about how they made it is pretty mind-blowing especially considering that they built the original long before I was born.

I look at optical illusions like The Dress™ and am impressed that I cannot force my brain to see it correctly even though I logically know what color it is supposed to be.

Finding new ways that our brains can be fooled despite knowing better is kind of a fun exercise in itself.

122. econ ◴[07 Apr 25 19:03 UTC] No.43614742{5}[source]▶

>>43612276 #

Or you simply account for it and provide equally challenging tasks adjusted for the tools of the time. Give them access to the best LLMs money can buy.

After all, they will grow up next to these things. They will do the homework today; by the time they graduate the LLM will take their job. There might be human large language model managers for a while, soon to be replaced by the age of idea men.

123. ◴[07 Apr 25 20:24 UTC] No.43615581[source]▶

>>43605451 #

124. achierius ◴[07 Apr 25 20:35 UTC] No.43615673{6}[source]▶

>>43613586 #

Unlike the hard problem of "making an exam difficult to take when you have access to an LLM", "making sure students don't have devices on them when they take one" is very tractable, even if teachers are going to need some time to catch up with the curve.

Any of the following could work, though the specific tradeoffs & implementation details do vary:

- have <n> teachers walking around the room to watch for cheaters

- mount a few cameras to various points in the room and give the teacher a dashboard so that they can watch from all angles

- record from above and use AI to flag potential cheaters for manual review

- disable Wi-Fi + activate cell jammers during exam time (with a land-line in the room in case of emergencies?)

- build dedicated examination rooms lined with metal mesh to disrupt cell reception

So unlike "beating LLMs" (where it's an open question as to whether it's even possible, and a moving target to boot), barring serious advances in wearable technology this just seems like a question of funding and therefore political will.

replies(2): >>43615891 #>>43617261 #

125. atiedebee ◴[07 Apr 25 21:00 UTC] No.43615891{7}[source]▶

>>43615673 #

Cell jammers sound like they could be a security risk. In the context of highschool, it is generally very easy to see when someone is on their phone.

126. tshaddox ◴[07 Apr 25 21:38 UTC] No.43616235{6}[source]▶

>>43614605 #

> Frankly, for most magic tricks the method isn't very interesting, although there are some notable exceptions where the method is fascinating, sometimes to the extent it can be far more interesting than the effect it creates.

> However, in general, most magic theorists and inventors agree that the method, for example, "palm a second coin in the other hand", isn't usually especially interesting.

Fair enough. It sounds like I simply fundamentally disagree, because I think nearly any explanation of method is very interesting. For close-up magic, the only exceptions for me would be if the explanation is "the video you were watching contains visual effects" or "the entire in-person audience was in on it."

Palming is awesome. Misdirection is awesome. I fully expect these sorts of things to be used in most magic tricks, but I still want to know precisely how. The fact that I'm aware of most close-up magic techniques but am still often fooled by magic tricks should make it pretty clear that the methods are interesting!

replies(1): >>43617212 #

127. mrandish ◴[08 Apr 25 00:16 UTC] No.43617212{7}[source]▶

>>43616235 #

> Palming is awesome. Misdirection is awesome.

Since studying magic has been a lifelong passion since I was a kid, I clearly couldn't agree more. However, experience has shown that despite claiming otherwise, most people aren't actually interested in the answer to "How did you do that?" beyond the first 30 seconds. So... you're unusual - and that's great!

> but I still want to know precisely how.

Well, you're extremely fortunate to be interested in learning how magic is really done at the best time in history for doing so. I was incredibly lucky to be accepted into the Magic Castle as a teenager and mentored by Dai Vernon (widely thought to be the greatest close-up magician of the 20th century) who was in his late 80s at the time. I also had access to the Castle's library of magic books, the largest in the world at the time. 99% of other kids on Earth interested in magic at the time only had a handful of local public library books and mail-order tricks.

Today there's an incredible amount of insanely high-quality magic instruction available in streaming videos, books and online forums. There are even master magicians who teach those willing to learn via Zoom. While most people think magicians want to hoard their secrets, the reality couldn't be more different. Magicians love teaching how to actually do magic to anyone who really wants to learn. However, most magicians aren't interested in wasting time satisfying the extremely fleeting curiosity of those who only want to know "how it works" in the surface sense of that first 30 seconds of only revealing the proximate 'secret method'.

Yet many magicians will happily devote hours to teaching anyone who really wants to actually learn how to do magic themselves and is willing to put in the time and effort to develop the skills, even if those people have no intention of ever performing magic for others - and even if the student isn't particularly good at it. It just requires the interest to go really deep on understanding the underlying principles and developing the skills, even if for no other purpose than just having the knowledge and skills. Personally, I haven't performed magic for non-magicians in over a decade but I still spend hours learning and mastering new high-level skills because it's fun, super intellectually interesting and extremely satisfying. If you're really interested, I encourage you to dive in. There's quite literally never been a better time to learn magic.

128. sealeck ◴[08 Apr 25 00:28 UTC] No.43617257{3}[source]▶

>>43610413 #

Half the researchers are at ETH Zurich (INSAIT is a partnership between EPFL, ETH and Sofia) - hardly an unreliable institution.

129. sealeck ◴[08 Apr 25 00:29 UTC] No.43617261{7}[source]▶

>>43615673 #

Infrared camera should do the trick.

130. billforsternz ◴[08 Apr 25 01:17 UTC] No.43617479{3}[source]▶

>>43612201 #

> It's final calculation was reasonably accurate at 24,582,115 golf balls - even though 1000 ÷ 0.00004068 = 24,582,104. Maybe it was using more significant figures for the golf ball size than it showed in its answer?

1000 ÷ 0.00004068 = 25,000,000. I think this is an important point that's increasingly widely misunderstood. All those extra digits you show are just meaningless noise and should be ruthlessly eliminated. If 1000 cubic metres in this context really meant 1000.000 cubic metres, then by all means show maybe the four digits of precision you get from the golf ball (but I am more inclined to think 1000 cubic metres is actually the roughest of rough approximations, with just one digit of precision).

In other words, I don't fault the AI for mismatching one set of meaninglessly precise digits for another, but I do fault it for using meaninglessly precise digits in the first place.

replies(1): >>43617644 #

131. MoonGhost ◴[08 Apr 25 01:19 UTC] No.43617488{4}[source]▶

>>43612717 #

One thing I forgot. Your solution may never converge. In my case with a GAN, after some training the models start wobbling around some point, trying to outsmart each other. Then they _always_ explode. So I saved them periodically and took the best intermediate weights.
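A minimal sketch of the "save periodically, keep the best intermediate weights" idea, assuming a generic training loop. Everything here is hypothetical: `train_step`, `evaluate`, and the toy model that converges and then blows up are stand-ins, not the commenter's actual GAN setup.

```python
import copy

def keep_best_checkpoint(weights, train_step, evaluate, steps, every):
    """Train for `steps` steps, scoring a snapshot every `every` steps.

    Returns (best_weights, best_score, best_step). Higher score is better.
    """
    best_score = float("-inf")
    best_weights = copy.deepcopy(weights)
    best_step = 0
    for step in range(1, steps + 1):
        weights = train_step(weights, step)
        if step % every == 0:
            score = evaluate(weights)
            if score > best_score:
                # Deep copy so later (possibly diverging) updates cannot touch it.
                best_score = score
                best_weights = copy.deepcopy(weights)
                best_step = step
    return best_weights, best_score, best_step

# Toy stand-in for unstable training: approach the target, then explode.
def toy_train_step(weights, step):
    w = weights["w"]
    w = w + 0.1 * (1.0 - w) if step <= 50 else w * 1.5
    return {"w": w}

def toy_evaluate(weights):
    return -abs(weights["w"] - 1.0)  # closer to 1.0 is better

best_weights, best_score, best_step = keep_best_checkpoint(
    {"w": 0.0}, toy_train_step, toy_evaluate, steps=100, every=10
)
print(best_step, best_weights)
```

The final weights of the toy run are useless, but the snapshot taken just before the divergence is kept, which is the point of checkpointing on a score rather than on the last step.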

132. melagonster ◴[08 Apr 25 01:21 UTC] No.43617498[source]▶

>>43605451 #

I doubt this is because his explanation is better. I tried asking Calculus I questions, and ChatGPT just repeated content from textbooks. It is useful, but people should keep in mind where its limitations are.

133. CivBase ◴[08 Apr 25 01:50 UTC] No.43617644{4}[source]▶

>>43617479 #

I agree those digits are not significant in the context of the question asked. But if the AI is going to use that level of precision in the answer, I expect it to be correct.

replies(1): >>43639574 #

134. int_19h ◴[08 Apr 25 03:40 UTC] No.43618160{8}[source]▶

>>43612267 #

> Another one was the farmer/fox/chicken/cabbage/river problem, but you modify the problem in unexpected ways, by stating, for example, that the cabbage will eat the fox, or that the farmer can bring three items per trip. LLMs used to ignore your modifications and answer the original problem.

This is still the case. Very few non-reasoning models can solve such variations correctly, even SOTA models. Worse yet, not only do they confidently give wrong responses, but they often do so even when specifically told to use CoT, and they continue giving wrong answers in a loop even if you specifically point out where they are wrong.

Reasoning models do much better, though. E.g. QwQ-32b can solve it pretty reliably, although it takes a lot of tokens for it to explore the possibilities. But at least it can fairly consistently tell when it's doing something wrong and then backtrack.
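For reference, the modified river-crossing puzzles are small enough to solve exhaustively, which makes it easy to check a model's answer. The following is a rough breadth-first-search sketch; the function name `solve`, the `eats` pair encoding, and the rule that only the bank without the farmer is checked are assumptions made for this illustration.

```python
from collections import deque
from itertools import combinations

def solve(items, eats, capacity):
    """Return the shortest list of boat loads (one tuple per crossing), or None.

    items:    names of the things to ferry across
    eats:     (eater, eaten) pairs that must not be left alone together
    capacity: how many items the farmer can carry per trip
    """
    items = frozenset(items)

    def safe(bank):
        return not any(a in bank and b in bank for a, b in eats)

    start = (items, True)          # (items on starting bank, farmer on starting bank?)
    goal = (frozenset(), False)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        (left, farmer_left), path = queue.popleft()
        if (left, farmer_left) == goal:
            return path
        here = left if farmer_left else items - left
        for k in range(capacity + 1):
            for load in combinations(sorted(here), k):
                load_set = frozenset(load)
                new_left = left - load_set if farmer_left else left | load_set
                # The bank the farmer leaves behind must be safe.
                unattended = new_left if farmer_left else items - new_left
                if not safe(unattended):
                    continue
                state = (new_left, not farmer_left)
                if state not in seen:
                    seen.add(state)
                    queue.append((state, path + [load]))
    return None

things = ["fox", "chicken", "cabbage"]
classic = [("fox", "chicken"), ("chicken", "cabbage")]

print(len(solve(things, classic, capacity=1)))   # classic puzzle: 7 crossings
print(solve(things, classic, capacity=3))        # three items per trip: one crossing
print(solve(things, classic + [("cabbage", "fox")], capacity=1))  # no solution: None
```

Changing one rule changes the answer completely (seven crossings, one crossing, or no solution at all), which is why reciting the memorized classic answer fails on the variations.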

One other example that befuddles even the reasoning models is frying-cubes-in-a-pan and equivalents, e.g. this version from Simple Bench:

> Beth places four whole ice cubes in a frying pan at the start of the first minute, then five at the start of the second minute and some more at the start of the third minute, but none in the fourth minute. If the average number of ice cubes per minute placed in the pan while it was frying a crispy egg was five, how many whole ice cubes can be found in the pan at the end of the third minute? Pick the most realistic answer option. A) 5 B) 11 C) 0 D) 20

135. tsimionescu ◴[08 Apr 25 05:48 UTC] No.43618684{11}[source]▶

>>43610036 #

Chess is a very simple game, and having basic general reasoning skills is more than enough to learn how to play it. It's not some advanced mathematics or complicated human interaction - it's a game with 30 or so fixed rules. And chess manuals have numerous examples of actual chess games, it's not like they are pure text talking about the game.

So, the fact that LLMs can't learn this simple game despite probably including all of the books ever written on it in their training set tells us something about their general reasoning skills.

replies(1): >>43620393 #

136. fire_lake ◴[08 Apr 25 09:11 UTC] No.43619704{6}[source]▶

>>43609132 #

Not a search engineer, but wouldn’t a cache lookup to a previous LLM result be faster than a conventional free text search over the indexed websites? Seems like this could save money whilst delivering better results?

replies(1): >>43625227 #

137. pdimitar ◴[08 Apr 25 10:45 UTC] No.43620198{3}[source]▶

>>43607653 #

What do you find inferior in 3.7 compared to 3.5 btw? I only recently started using Claude so I don't have a point of reference.

replies(1): >>43621964 #

138. pdimitar ◴[08 Apr 25 11:14 UTC] No.43620364{5}[source]▶

>>43607307 #

> Humans have weird failure modes that are odds with their 'intelligence'. We just choose to call them funny names and laugh about it sometimes. These Machines have theirs. That's all there is to it.

Yes, that's all there is to it and it's not enough. I ain't paying for another defective organism that makes mistakes in entirely novel ways. At least with humans you know how to guide them back on course.

If that's the peak of "AI" evolution today, I am not impressed.

139. pdimitar ◴[08 Apr 25 11:20 UTC] No.43620393{12}[source]▶

>>43618684 #

As in: they do not have general reasoning skills.

140. billy99k ◴[08 Apr 25 12:23 UTC] No.43620870{5}[source]▶

>>43612276 #

Making in-person tests the only thing that counts toward your grade seems to be a step in the right direction. If students use AI to do their homework, it will only hurt them in the long run.

141. nhinck3 ◴[08 Apr 25 13:44 UTC] No.43621718{4}[source]▶

>>43610920 #

Average is fine.

142. airstrike ◴[08 Apr 25 14:08 UTC] No.43621964{4}[source]▶

>>43620198 #

It's hard to say, super subjective. It's just wrong more often and sometimes it goes off in tangents wrt. what I asked. Also I might ask a question and it starts coding an entire React project. Every once in a while it will literally max out its response tokens because it can't stop writing code.

Just feels less "stable" or "tight" overall.

replies(1): >>43622207 #

143. code_for_monkey ◴[08 Apr 25 14:15 UTC] No.43622057{3}[source]▶

>>43611877 #

music theory is a really good test because in my experience the AI is extremely bad at it

144. pdimitar ◴[08 Apr 25 14:28 UTC] No.43622207{5}[source]▶

>>43621964 #

I see. I have a similar feeling; as if they made it to quickly force you to pay (quickly maxing out one conversation in my case). I'm quite cynical and paranoid in this regard and I try hard not to be ruled by those two... but I can't shake the feeling that they're right this time.

replies(1): >>43622453 #

145. airstrike ◴[08 Apr 25 14:50 UTC] No.43622453{6}[source]▶

>>43622207 #

I hear you but FWIW I don't think it's on purpose as it feels like an inferior product to me as a paid user

146. summerlight ◴[08 Apr 25 18:48 UTC] No.43625227{7}[source]▶

>>43619704 #

Yes, that's what Google's doing for AI overview IIUC. From what I've seen in my own experience, this is working okay and improving over time but is not close to perfect. The results are stale for developing stories, some bad results are kept there for a long time, effectively identical queries return different cached results, etc.
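As a rough sketch of the trade-offs mentioned here (not a description of how Google or any real search engine does it), a cache of generated answers needs at least a key normalization step, so that near-identical queries share an entry, and an expiry time, so that stale answers get regenerated. The class name `AnswerCache`, the normalization rule, and the TTL value are all assumptions for illustration.

```python
import time

class AnswerCache:
    """Toy cache of generated answers keyed by a normalized query string."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so expiry can be tested
        self._store = {}            # key -> (answer, time stored)

    @staticmethod
    def normalize(query):
        # Collapse case and whitespace so trivially different queries collide.
        return " ".join(query.lower().split())

    def get_or_compute(self, query, compute):
        key = self.normalize(query)
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[1] < self.ttl:
            return hit[0]           # fresh enough: skip the expensive call
        answer = compute(query)     # e.g. an LLM call in a real system
        self._store[key] = (answer, now)
        return answer

# Usage with a fake clock and a counter standing in for the expensive call.
fake_now = [0.0]
calls = []

def expensive_answer(query):
    calls.append(query)
    return f"answer #{len(calls)}"

cache = AnswerCache(ttl_seconds=60, clock=lambda: fake_now[0])
first = cache.get_or_compute("Golf balls in a room", expensive_answer)
second = cache.get_or_compute("  golf   balls in a ROOM ", expensive_answer)
fake_now[0] = 120.0                 # entry is now older than the TTL
third = cache.get_or_compute("golf balls in a room", expensive_answer)
print(first, second, third)
```

A short TTL keeps developing stories fresher at the cost of more recomputation, and a crude normalization like this one still lets paraphrased queries land in different entries.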

147. motorest ◴[09 Apr 25 04:00 UTC] No.43628772{3}[source]▶

>>43611877 #

> In my experience LLMs can't get basic western music theory right, there's no way I would use an LLM for something harder than that.

This take is completely oblivious, and frankly sounds like a desperate jab. There are a myriad of activities whose core requirement is a) derive info from a complex context which happens to be supported by a deep and plentiful corpus, b) employ glorified template and rule engines.

LLMs excel at what might be described as interpolating context following input and output in natural language. As in a chatbot that is extensively trained in domain-specific tasks, which can also parse and generate content. There are absolutely zero lines of intellectual work that do not benefit extensively from this sort of tool. Zero.

replies(1): >>43642753 #

148. motorest ◴[09 Apr 25 04:03 UTC] No.43628785{3}[source]▶

>>43609801 #

> This effectively makes LLMs useless for education.

No. You're only arguing LLMs are useless at regurgitating homework assignments to allow students to avoid doing it.

The point of education is not mindlessly doing homework.

149. billforsternz ◴[10 Apr 25 00:52 UTC] No.43639574{5}[source]▶

>>43617644 #

Fair enough, I agree, simple arithmetic calculations shouldn't generate mysterious answers.

150. apercu ◴[10 Apr 25 11:17 UTC] No.43642753{4}[source]▶

>>43628772 #

A desperate jab? But I _want_ LLMs to be able to do basic, deterministic things accurately. Seems like I touched a nerve? Lol.

151. utopcell ◴[10 Apr 25 18:35 UTC] No.43646840[source]▶

>>43604503 (TP) #

This is simply using LLMs directly. Google has demonstrated that this is not the way to go when it comes to solving math problems. AlphaProof, which used AlphaZero code, got a silver medal in last year's IMO. It also didn't use any human proofs(!), only theorem statements in Lean, without their corresponding proofs [1].

[1] https://www.youtube.com/watch?v=zzXyPGEtseI

152. SergeAx ◴[11 Apr 25 20:12 UTC] No.43658014[source]▶

>>43604503 (TP) #

Because of the vast number of problems reused, removing those data from training sets will just make models worse. Why would anyone do it?
