Anecdotally, I've been playing around with o3-mini on undergraduate math questions: it is much better at "plug-and-chug" proofs than GPT-4, but those problems aren't independently interesting, they are explicitly pedagogical. For anything requiring insight, it's either:
1) A very good answer that reveals the LLM has seen the problem before (e.g. naming the theorem, presenting a "standard" proof, using a much more powerful result)
2) A bad answer that looks correct and takes an enormous amount of effort to falsify. (This is the secret sauce of LLM hype.)
I dread undergraduate STEM majors using this thing - I asked it a problem about rotations and spherical geometry, but got back a pile of advanced geometric algebra, when I was looking for "draw a spherical triangle." If I didn't know the answer, I would have been badly confused. See also this real-world example of an LLM leading a recreational mathematician astray: https://xcancel.com/colin_fraser/status/1900655006996390172#...
I will add that in 10 years the field will be intensely criticized for its reliance on multiple-choice benchmarks; it is not surprising or interesting that next-token prediction can game multiple-choice questions!
- USAMO - United States of America Mathematical Olympiad
- IMO - International Mathematical Olympiad
- ICPC - International Collegiate Programming Contest
Relevant paper: https://arxiv.org/abs/2503.21934 - "Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad" submitted 27th March 2025.
Instead they're barely able to eek out wins against a bot that plays completely random moves: https://maxim-saplin.github.io/llm_chess/
So, I describe the mathematics to ChatGPT-o3-mini-high to try to help reason about what’s going on. It was almost completely useless. Like blog-slop “intro to ML” solutions and ideas. It ignores all the mathematical context, and zeros in on “doesn’t converge” and suggests that I lower the learning rate. Like, no shit I tried that three weeks ago. No amount of cajoling can get it to meaningfully “reason” about the problem, because it hasn’t seen the problem before. The closest point in latent space is apparently a thousand identical Medium articles about Adam, so I get the statistical average of those.
I can’t stress how frustrating this is, especially with people like Terence Tao saying that these models are like a mediocre grad student. I would really love to have a mediocre (in Terry’s eyes) grad student looking at this, but I can’t seem to elicit that. Instead I get low tier ML blogspam author.
**PS** if anyone read this far (doubtful) and knows about density estimation and wants to help my email is bglazer1@gmail.com
I promise it's a fun mathematical puzzle and the biology is pretty wild too
Sometimes when I'm anxious just to get on with my original task, I'll paste the code and output/errors into the LLM and iterate over its solutions, but the experience is like rolling dice, cycling through possible solutions without any kind of deductive analysis that might bring it gradually closer to a solution. If I keep asking, it eventually just starts cycling through variants of previous answers with solutions that contradict the established logic of the error/output feedback up to this point.
Not to say that the LLMs aren't productive tools, but they're more like calculators of language than agents that reason.
https://dynomight.substack.com/p/chess
Discussion here: https://news.ycombinator.com/item?id=42138289
AcmeAssistant is "helpful" and "clever" in the same way that Vampire Count Dracula is "brooding" and "immortal".
OpenAI, Anthropic and the like simply don't care much about their LLMs playing chess. That or post-training is messing things up.
Much in the same way a human who only just learnt the rules but 0 strategy would very, very rarely lose here
These companies are shouting that their products are passing incredibly hard exams, solving PHD level questions, and are about to displace humans, and yet they still fail to crush a random-only strategy chess bot? How does this make any sense?
We're on the verge of AGI but there's not even the tiniest spark of general reasoning ability in something they haven't been trained for
"Reasoning" or "Thinking" are marketing terms and nothing more. If an LLM is trained for chess then its performance would just come from memorization, not any kind of "reasoning"
I mean, surely there's a reason you decided to mention 3.5 turbo instruct and not.. 3.5 turbo? Or any other model? Even the ones that came after? It's clearly a big outlier, at least when you consider "LLMs" to be a wide selection of recent models.
If you're saying that LLMs/transformer models are capable of being trained to play chess by training on chess data, I agree with you.
I think AstroBen was pointing out that LLMs, despite having the ability to solve some very impressive mathematics and programming tasks, don't seem to generalize their reasoning abilities to a domain like chess. That's surprising, isn't it?
https://dictionary.cambridge.org/us/dictionary/english/eke-o... to obtain or win something only with difficulty or great effort
So 1) 2) and 3) were out by 1,1 and 3 orders of magnitude respectively (the errors partially cancelled out) and 4) was nonsensical.
This little experiment made me skeptical about the state of the art of AI. I have seen much AI output which is extraordinary; it's funny how one serious fail can impact my point of view so dramatically.
>I think AstroBen was pointing out that LLMs, despite having the ability to solve some very impressive mathematics and programming tasks, don't seem to generalize their reasoning abilities to a domain like chess. That's surprising, isn't it?
Not really. The LLMs play chess like they have no clue what the rules of the game are, not like poor reasoners. Trying to predict and failing is how they learn anything. If you want them to learn a game like chess, then that's how you get them to learn it: by trying to predict chess moves. Chess books during training only teach them how to converse about chess.
If you think you can play chess at that level over that many games and moves with memorization then I don't know what to tell you except that you're wrong. It's not possible so let's just get that out of the way.
>These companies are shouting that their products are passing incredibly hard exams, solving PHD level questions, and are about to displace humans, and yet they still fail to crush a random-only strategy chess bot? How does this make any sense?
Why doesn't it? Have you actually looked at any of these games? Those LLMs aren't playing like poor reasoners. They're playing like machines who have no clue what the rules of the game are. LLMs learn by predicting and failing and getting a little better at it, repeat ad nauseam. You want them to learn the rules of a complex game? That's how you do it. By training them to predict it. Training on chess books just makes them learn how to converse about chess.
Humans have weird failure modes that are at odds with their 'intelligence'. We just choose to call them funny names and laugh about it sometimes. These Machines have theirs. That's all there is to it. On the benchmark in the top comment we are both replying to, gemini-2.5-pro, which was released less than 5 days later, hit 25%. Now that was particularly funny.
o1 screwing up a trivially easy variation: https://xcancel.com/colin_fraser/status/1864787124320387202
Claude 3.7, utterly incoherent: https://xcancel.com/colin_fraser/status/1898158943962271876
DeepSeek: https://xcancel.com/colin_fraser/status/1882510886163943443#...
Overflowing wine glass also isn't meaningfully solved! I understand it is sort of solved for wine glasses (even though it looks terrible and unphysical, always seems to have weird fizz). But asking GPT to "generate an image of a transparent vase with flowers which has been overfilled with water, so that water is spilling over" had the exact same problem as the old wine glasses: the vase was clearly half-full, yet water was mysteriously trickling over the sides. Presumably OpenAI RLHFed wine glasses since it was a well-known failure, but (as always) this is just whack-a-mole, it does not generalize into understanding the physical principle.
Gotcha, fair enough. Throw enough chess data in during training, I'm sure they'd be pretty good at chess.
I don't really understand what you're trying to say in your next paragraph. LLMs surely have plenty of training data to be familiar with the rules of chess. They also purportedly have the reasoning skills to use their familiarity to connect the dots and actually play. It's trivially true that this issue can be plastered over by shoving lots of chess game training data into them, but the success of that route is not a positive reflection on their reasoning abilities.
https://openai.com/index/learning-to-reason-with-llms/
The paper tested it on o1-pro as well. Correct me if I'm getting some versioning mixed up here.
A particular nonstandard eval that is currently top comment on this HN thread, due to the fact that, unlike every other eval out there, LLMs score badly on it?
Doesn't seem implausible to me at all. If I was running that team, I would be "Drop what you're doing, boys and girls, and optimise the hell out of this test! This is our differentiator!"
And that post had a follow-up. Post-training messing things up could well be the issue, given the impact that even a few more examples and/or regurgitation made. https://dynomight.net/more-chess/
It was surprising to me because I would have expected if there was reasoning ability then it would translate across domains at least somewhat, but yeah what you say makes sense. I'm thinking of it in human terms
Like how
- Training LLMs on code makes them solve reasoning problems better
- Training Language Y alongside X makes them much better at Y than if they were trained on language Y alone
and so on.
Probably because, well, gradient descent is a dumb optimizer and training is more like evolution than a human reading a book.
Also, there is something genuinely weird going on with LLM chess. And it's possible base models are better. https://dynomight.net/more-chess/
The AI will create something for you and tell you it was them.
But Google search gave me the exact same slop you mentioned. So whatever Search is using, they must be using their crappiest, cheapest model. It's nowhere near state of the art.
USAMO : USA Math Olympiad. Referred here https://arxiv.org/pdf/2503.21934v1
IMO : International Math Olympiad
SOTA : State of the Art
OP is probably referring to this paper: https://arxiv.org/pdf/2503.21934v1. The paper explains how rigorous testing revealed abysmal performance of LLMs (results that are at odds with how they are hyped).
This whole premise crashes and burns if you need task-specific training, like explicit chess training. That is because there are far too many tasks that humans need to be competent at in order to be useful in society. Even worse, the vast majority of those tasks are very hard to source training data for, unlike chess.
So, if we accept that LLMs can't learn chess unless they explicitly include chess games in the training set, then we have to accept that they can't learn, say, to sell business software unless they include business software pitches in the training set, and there are going to be FAR fewer of those than chess games.
Math packages of the time like Mathematica and MATLAB helped me immensely, once you could get the problem accurately described in the correct form, they could walk through the steps and solve systems of equations, integrate tricky functions, even though AI was nowhere to be found back then.
I feel like ChatGPT is doing something similar when doing maths with its chain-of-thought method, and while its method might be somewhat more generic, I'm not sure it's strictly superior.
This might be homing in on both the issue and the actual value of LLMs. I think there's a lot of value in a "language calculator" but if it's continuously being sold as something it's not we will dismiss it or build heaps of useless apps that will just form a market bubble. I think the value is there but it's different from how we think about it.
I feel the same way. It's like discovering for the first time that magicians aren't doing "real" magic, just sleight of hand and psychological tricks. From that point on, it's impossible to be convinced that a future trick is real magic, no matter how impressive it seems. You know it's fake even if you don't know how it works.
The same may work with your problem. If it's unstable, try introducing extra 'brakes' which theoretically are not required. Maybe even incorrect ones. Whatever it is in your domain. Another thing to check is the optimizer; try several. Check default parameters. I've heard Adam's defaults lead to instability later in training.
PS: it would be heaven if models could work at human expert level. Not sure why some really expect this. We are just at the beginning.
PPS: the fact that they can do known tasks with minor variations is already a huge time saver.
In common terms, suppose I say: there is only room for one person or one animal in my car to go home. One can suppose that it is referring to additional room besides that occupied by the driver. There is a problem when we try to use an LLM trained on the common use of language to solve puzzles in formal logic or math. I think the current LLMs are not able to have a specialized context to become a logical reasoning agent, but perhaps such a thing could be possible if the evaluation function of the LLM was designed to give high credit to changing context with a phrase or token.
This effectively makes LLMs useless for education. (Also sours the next generation on LLMs in general, these things are extremely lame to the proverbial "kids these days".)
I'm in the second camp but find it kind of sad and often envy the people who can stay entertained even though they know better.
(Yes, that's a lot of work for a teacher. Gone are the days when you could just assign reports as homework.)
And they do, just not always in the ways we expect.
>This whole premise crashes and burns if you need task-specific training, like explicit chess training.
Everyone needs task specific training. Any human good at chess or anything enough to make it a profession needs it. So I have no idea why people would expect any less for a Machine.
>then we have to accept that they can't learn, say, to sell business software unless they include business software pitches in the training set, and there are going to be FAR fewer of those than chess games.
Yeah so? How many business pitches they need in the training set has no correlation with chess. I don't see any reason to believe what is already present isn't enough. There's enough chess data on the internet to teach them chess too, it's just a matter of how much OpenAI cares about it.
Models getting 5X better at things all the time is at least as easy to interpret as evidence of task-specific tuning as of breakthroughs in general ability, especially when the 'things being improved on' are published evals with history.
If you don't see anyone mentioning what you wrote that's not surprising at all, because you totally misunderstood the paper. The models didn't suddenly drop to 5% accuracy on math olympiad questions. Instead this paper came up with a human evaluation that looks at the whole reasoning process (instead of just the final answer) and their finding is that the "thoughts" of reasoning models are not sufficiently human understandable or rigorous (at least for expert mathematicians). This is something that was already well known, because "reasoning" is essentially CoT prompting baked into normal responses. But the empirics also tell us it greatly helps for final outputs nonetheless.
Looking up the math ability of the average American this is given as an example for the median (from https://www.wyliecomm.com/2021/11/whats-the-latest-u-s-numer...):
>Review a motor vehicle logbook with columns for dates of trip, odometer readings and distance traveled; then calculate trip expenses at 35 cents a mile plus $40 a day.
Which is ok but easier than golf balls in a 747 and hugely easier than USAMO.
Another question you could try from the easy math end is: Someone calculated the tariff rate for a country as (trade deficit)/(total imports from the country). Explain why this is wrong.
Not sure whether their (INSAIT's) agenda is purely scientific, as there's a lot of PR on LinkedIn by these guys, literally celebrating every PhD they get, which is at minimum very weird. I'd take anything they release with a grain of salt, if not caution.
Edit: Then again, maybe they have a point, going by an answer I just got from Google's best current model ( https://g.co/gemini/share/374ac006497d ) I haven't seen anything that ridiculous from a leading-edge model for a year or more.
Even more than filling the gaps in knowledge / skills, it would be a huge advancement in AI for it to admit when it doesn't know the answer or is just wildly guessing.
You can still do this to the current models, though it takes more creativity; you can bait it into giving wrong answers if you ask a question that is "close" to a well-known one but is different in an important way that does not manifest as a terribly large English change (or, more precisely, a very large change in the model's vector space).
The downside is that the frontier between what fools the LLMs and what would fool a great deal of the humans in the class too shrinks all the time. Humans do not infinitely carefully parse their input either... as any teacher could tell you! Ye Olde "Read this entire problem before proceeding, {a couple of paragraphs of complicated instruction that will take 45 minutes to perform}, disregard all the previous and simply write 'flower' in the answer space" is an old chestnut that has been fooling humans for a long time, for instance. Given how jailbreaks work on LLMs, LLMs are probably much better at that than humans are, which I suppose shows you can construct problems in the other direction too.
(BRB... off to found a new CAPTCHA company for detecting LLMs based on LLMs being too much better than humans at certain tasks...)
It got the golf ball volume right (0.00004068 cubic meters), but it still overestimated the cabin volume at 1000 cubic meters.
Its final calculation was reasonably accurate at 24,582,115 golf balls - even though 1000 ÷ 0.00004068 = 24,582,104. Maybe it was using more significant figures for the golf ball size than it showed in its answer?
It didn't acknowledge other items in the cabin (like seats) reducing its volume, but it did at least acknowledge inefficiencies in packing spherical objects and suggested the actual number would be "somewhat lower", though it did not offer an estimate.
When I pressed it for an estimate, it used a packing density of 74% and gave an estimate of 18,191,766 golf balls. That's one more than the calculation should have produced, but arguably insignificant in context.
Next I asked it to account for fixtures in the cabin such as seats. It estimated a 30% reduction in cabin volume and redid the calculations with a cabin volume of 700 cubic meters. These calculations were much less accurate. It told me 700 ÷ 0.00004068 = 17,201,480 (off by ~6k). And it told me 17,201,480 × 0.74 was 12,728,096 (off by ~1k).
I told it the calculations were wrong and to try again, but it produced the same numbers. Then I gave it the correct answer for 700 ÷ 0.00004068. It told me I was correct and redid the last calculation correctly using the value I provided.
Of all the things for an AI chatbot which can supposedly "reason" to fail at, I didn't expect it to be basic arithmetic. The one I used was closer, but it was still off by a lot at times despite the calculations being simple multiplication and division. Even if it might not matter in the context of filling an airplane cabin with golf balls, it does not inspire trust for more serious questions.
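For anyone who wants to re-check the figures quoted above, here is a minimal sketch using exact rational arithmetic. It only uses numbers from the comment (the 0.00004068 cubic meter ball volume, the 1000 and 700 cubic meter cabin volumes, the 74% packing density); the helper name `balls` and the choice to round to the nearest whole ball are my own assumptions for illustration.

```python
from fractions import Fraction

# Golf ball volume in cubic meters, exactly as quoted by the chatbot.
BALL_VOLUME = Fraction("0.00004068")

def balls(cabin_volume_m3, packing=Fraction(1)):
    """Golf balls fitting in the volume, rounded to the nearest whole ball.

    `packing` is the fraction of the volume actually filled by spheres
    (1 means no packing loss). Rounding to nearest is an assumption.
    """
    return round(Fraction(cabin_volume_m3) * packing / BALL_VOLUME)

full = balls(1000)                     # 24582104
reduced = balls(700)                   # 17207473
packed = balls(700, Fraction("0.74"))  # 700 cubic meters at 74% density

print(full, reduced, packed)
```

The first two values line up with the hand-checked figures in the comment: 24,582,104 for the full cabin, and a 700 cubic meter result about 6k above the chatbot's 17,201,480.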
If you asked a multimodal system questions about the image it just generated, it would tell you the wine was almost overflowing out of the top of the glass.
But any trick prompt like this is going to start giving expected results once it gets well-known enough.
Late edit: Another one was the farmer/fox/chicken/cabbage/river problem, but you modify the problem in unexpected ways, by stating, for example, that the cabbage will eat the fox, or that the farmer can bring three items per trip. LLMs used to ignore your modifications and answer the original problem.
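To make the "modified puzzle" point concrete, here is a rough sketch of a brute-force solver where the rules and the boat capacity are parameters, so a changed premise actually changes the answer. This is just a plain breadth-first search over bank states; the names `min_crossings`, `CLASSIC_RULES` and `capacity` are hypothetical choices for the sketch, not anything from the thread.

```python
from collections import deque
from itertools import combinations

ITEMS = frozenset({"fox", "chicken", "cabbage"})
# (eater, eaten) pairs that must not be left together without the farmer.
CLASSIC_RULES = frozenset({("fox", "chicken"), ("chicken", "cabbage")})

def is_safe(bank, rules):
    """A bank is safe if no forbidden pair is on it together."""
    return not any(a in bank and b in bank for a, b in rules)

def min_crossings(capacity=1, rules=CLASSIC_RULES):
    """Fewest river crossings to move everything from left to right."""
    start = (ITEMS, "left")          # (items on left bank, farmer's side)
    goal = (frozenset(), "right")
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (left, farmer), steps = queue.popleft()
        if (left, farmer) == goal:
            return steps
        here = left if farmer == "left" else ITEMS - left
        # The farmer may carry anywhere from 0 to `capacity` items.
        for k in range(capacity + 1):
            for cargo in combinations(sorted(here), k):
                cargo = frozenset(cargo)
                if farmer == "left":
                    new_left, new_farmer = left - cargo, "right"
                    unattended = new_left
                else:
                    new_left, new_farmer = left | cargo, "left"
                    unattended = ITEMS - new_left
                if not is_safe(unattended, rules):
                    continue
                state = (new_left, new_farmer)
                if state not in seen:
                    seen.add(state)
                    queue.append((state, steps + 1))
    return None  # no solution under these rules

print(min_crossings())                             # 7
print(min_crossings(capacity=3))                   # 1
print(min_crossings(rules={("cabbage", "fox")}))   # 5
```

The classic version needs 7 crossings, while "the farmer can bring three items per trip" collapses to a single trip, which is exactly the kind of change a model reciting the memorized solution would miss.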
No wonder Trump isn't afraid to put taxes against Canada. Who could take a 3.8 square mile country seriously?
In my view, the trick as it is intended to appear to the audience and the explanation of how the trick is performed are equal and inseparable aspects of my interest as a viewer. Either one without the other is less interesting than the pair.
The reason I expected better mathematical reasoning is because the companies making them are very loudly proclaiming that these models are capable of high level mathematical reasoning.
And yes the fact I don’t have to look at matplotlib documentation anymore makes these models extremely useful already, but that’s qualitatively different from having Putnam prize winning reasoning ability
The entire point of USAMO problems is that they demand novel insight and rigorous, original proofs. They are intentionally designed not to be variations of things you can just look up. You have to reason your way through, step by logical step.
Getting 25% (~11 points) is exceptionally difficult. That often means fully solving one problem and maybe getting solid partial credit on another. The median score is often in the single digits.
Which makes it difficult to fairly evaluate whether the models have actually gotten better at the feather/iron problem or if it just got enough samples of trick questions that it learned better, either naturally from the internet, or fed as part of the training data. I am fairly certain the training data has had "trick questions" like this added to it, because, I mean, why wouldn't it?
I have noticed in my playing with image AIs that they do seem more prone than the LLMs to getting dragged into local maxima where a human would understand the prompt. Perhaps it's all the additional data in an image that reveals it.
The point is the analogy to LLMs. A lot of people are very optimistic about their capabilities, while other people who have "seen behind the curtain" are skeptical, and feel that the fundamental flaws are still there even if they're better-hidden.
Very hard for me to wrap my head around the idea that an LLM being able to discuss, even perhaps teach high level chess strategy wouldn't transfer at all to its playing performance
Some teachers try to collect the phones beforehand, but then students simply give out older phones and keep their active ones with them.
You could try to verify that the phones they're giving out are working by calling them, but that would take an enormous amount of time and it's impractical for simple exams.
We really have no idea how much AI is ruining education right now.
And to be clear, that's pretty much all this was: there's six problems, it got almost-full credit on one and half credit on another and bombed the rest, whereas all the other models bombed all the problems.
That's true, but of course, not what I claimed.
The claim is that, given the ability to memorize every mathematical result that has ever been published (in print or online), it is not so difficult to get 25% correct on an exam by pattern matching.
Note that this skill is, by definition, completely out of the reach of any human being, but that possessing it does not imply creativity or the ability to "think".
As a long-time close-up magician and magical inventor who's spent a lot of time studying magic theory (which has been a serious field of magical research since the 1960s), it depends on which way we interpret "how the trick works." Frankly, for most magic tricks the method isn't very interesting, although there are some notable exceptions where the method is fascinating, sometimes to the extent it can be far more interesting than the effect it creates.
However, in general, most magic theorists and inventors agree that the method, for example, "palm a second coin in the other hand", isn't usually especially interesting. Often the actual immediate 'secret' of the method is so simple and, in hindsight, obvious that many non-magicians feel rather let down if the method is revealed. This is the main reason magicians usually don't reveal secret methods to non-magicians. It's not because of some code of honor, it's simply because the vast majority of people think they'll be happy if they know the secret but are instead disappointed.
Where studying close-up magic gets really fascinating is understanding why that simple, obvious thing works to mislead and then surprise audiences in the context of this trick. Very often changing subtle things seemingly unrelated to the direct method will cause the trick to stop fooling people or to be much less effective. Comparing a master magician to even a competent, well-practiced novice performing the exact same effect with the same method can be a night and day difference. Typically, both performances will fool and entertain audiences but the master's performance can have an intensely more powerful impact. Like leaving most audience members in stunned shock vs just pleasantly surprised and fooled. While neither the master nor novice's audiences have any idea of the secret method, this dramatic difference in impact is fascinating because careful deconstruction reveals it often has little to do with mechanical proficiency in executing the direct method. In other words, it's rarely driven by being able to do the sleight of hand faster or more dexterously. I've seen legendary close-up masters like a Dai Vernon or Albert Goshman when in their 80s and 90s perform sleight of hand with shriveled, arthritic hands incapable of even cleanly executing a basic palm, absolutely blow away a roomful of experienced magicians with a trick all the magicians already knew. How? It turns out there's something deep and incredibly interesting about the subtle timing, pacing, body language, posture, and psychology surrounding the "secret method" that elevates the impact to almost transcendence compared to a good, competent but uninspired performance of the same method and effect.
Highly skilled, experienced magicians refer to the complex set of these non-method aspects, which can so powerfully elevate an effect to another level, as "the real work" of the trick. At the top levels, most magicians don't really care about the direct methods which some audience members get so obsessed about. They aren't even interesting. And, contrary to what most non-magicians think, these non-methods are the "secrets" master magicians tend to guard from widespread exposure. And it's pretty easy to keep this crucially important "real work" secret because it's so seemingly boring and entirely unlike what people expect a magic secret to be. You have to really "get it" on a deeper level to even understand that what elevated the effect was intentionally establishing a completely natural-seeming, apparently random three-beat pattern of motion and then carefully injecting a subtle pause and slight shift in posture to the left six seconds before doing "the move". Audiences mistakenly think that "the hidden move" is the secret to the trick when it's just the proximate first-order secret. Knowing that secret won't get you very far toward recreating the absolute gob-smacking impact resulting from a master's years of experimentation figuring out and deeply understanding which elements beyond the "secret method" really elevate the visceral impact of the effect to another level.
I look at optical illusions like The Dress™ and am impressed that I cannot force my brain to see it correctly even though I logically know what color it is supposed to be.
Finding new ways that our brains can be fooled despite knowing better is kind of a fun exercise in itself.
After all, they will grow up next to these things. They will do the homework today, by the time they graduate the LLM will take their job. There might be human large language model managers for a while, soon to be replaced by the age of idea men.
Any of the following could work, though the specific tradeoffs & implementation details do vary:
- have <n> teachers walking around the room to watch for cheaters
- mount a few cameras to various points in the room and give the teacher a dashboard so that they can watch from all angles
- record from above and use AI to flag potential cheaters for manual review
- disable Wi-Fi + activate cell jammers during exam time (with a land-line in the room in case of emergencies?)
- build dedicated examination rooms lined with metal mesh to disrupt cell reception
So unlike "beating LLMs" (where it's an open question as to whether it's even possible, and a moving target to boot), barring serious advances in wearable technology this just seems like a question of funding and therefore political will.
> However, in general, most magic theorists and inventors agree that the method, for example, "palm a second coin in the other hand", isn't usually especially interesting.
Fair enough. It sounds like I simply fundamentally disagree, because I think nearly any explanation of method is very interesting. For close-up magic, the only exceptions for me would be if the explanation is "the video you were watching contains visual effects" or "the entire in-person audience was in on it."
Palming is awesome. Misdirection is awesome. I fully expect these sorts of things to be used in most magic tricks, but I still want to know precisely how. The fact that I'm aware of most close-up magic techniques but am still often fooled by magic tricks should make it pretty clear that the methods are interesting!
Since studying magic has been a lifelong passion since I was a kid, I clearly couldn't agree more. However, experience has shown that despite claiming otherwise, most people aren't actually interested in the answer to "How did you do that?" beyond the first 30 seconds. So... you're unusual - and that's great!
> but I still want to know precisely how.
Well, you're extremely fortunate to be interested in learning how magic is really done at the best time in history for doing so. I was incredibly lucky to be accepted into the Magic Castle as a teenager and mentored by Dai Vernon (widely thought to be the greatest close-up magician of the 20th century) who was in his late 80s at the time. I also had access to the Castle's library of magic books, the largest in the world at the time. 99% of other kids on Earth interested in magic at the time only had a handful of local public library books and mail-order tricks.
Today there's an incredible amount of insanely high-quality magic instruction available in streaming videos, books and online forums. There are even master magicians who teach those willing to learn via Zoom. While most people think magicians want to hoard their secrets, the reality couldn't be more different. Magicians love teaching how to actually do magic to anyone who really wants to learn. However, most magicians aren't interested in wasting time satisfying the extremely fleeting curiosity of those who only want to know "how it works" in the surface sense of that first 30 seconds of only revealing the proximate 'secret method'.
Yet many magicians will happily devote hours to teaching anyone who really wants to actually learn how to do magic themselves and is willing to put in the time and effort to develop the skills, even if those people have no intention of ever performing magic for others - and even if the student isn't particularly good at it. It just requires the interest to go really deep on understanding the underlying principles and developing the skills, even if for no other purpose than just having the knowledge and skills. Personally, I haven't performed magic for non-magicians in over a decade but I still spend hours learning and mastering new high-level skills because it's fun, super intellectually interesting and extremely satisfying. If you're really interested, I encourage you to dive in. There's quite literally never been a better time to learn magic.
1000 ÷ 0.00004068 = 25,000,000. I think this is an important point that's increasingly widely misunderstood. All those extra digits you show are just meaningless noise and should be ruthlessly eliminated. If 1000 cubic metres in this context really meant 1000.000 cubic metres, then by all means show maybe the four digits of precision you get from the golf ball (but I am more inclined to think 1000 cubic metres is actually the roughest of rough approximations, with just one digit of precision).
In other words, I don't fault the AI for mismatching one set of meaninglessly precise digits for another, but I do fault it for using meaninglessly precise digits in the first place.
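A small sketch of what rounding to significant figures looks like in practice, in case it helps. The helper `round_sig` is a hypothetical name (Python has no built-in for this), and it leans on the standard `round` with a negative digit count.

```python
import math

def round_sig(x, digits):
    """Round x to the given number of significant figures."""
    if x == 0:
        return 0.0
    # Position of the leading digit, e.g. 7 for 24,582,104.
    exponent = math.floor(math.log10(abs(x)))
    return round(x, digits - 1 - exponent)

estimate = 1000 / 0.00004068

print(round_sig(estimate, 2))  # 25000000.0
print(round_sig(estimate, 1))  # 20000000.0
```

Note that the 25,000,000 figure corresponds to keeping two significant figures; with a single digit of precision the same estimate rounds to 20,000,000.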
This is still the case. Very few non-reasoning models can solve such variations correctly, even SOTA models. Worse yet, not only do they confidently give wrong responses, but they often do so even when specifically told to use CoT, and they continue giving wrong answers in a loop even if you specifically point out where they are wrong.
Reasoning models do much better, though. E.g. QwQ-32b can solve it pretty reliably, although it takes a lot of tokens for it to explore the possibilities. But at least it can fairly consistently tell when it's doing something wrong and then backtrack.
One other example that befuddles even the reasoning models is frying-cubes-in-a-pan and equivalents, e.g. this version from Simple Bench:
> Beth places four whole ice cubes in a frying pan at the start of the first minute, then five at the start of the second minute and some more at the start of the third minute, but none in the fourth minute. If the average number of ice cubes per minute placed in the pan while it was frying a crispy egg was five, how many whole ice cubes can be found in the pan at the end of the third minute? Pick the most realistic answer option. A) 5 B) 11 C) 0 D) 20
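For reference, the purely arithmetic reading of that question can be sketched in a few lines. This assumes the average is taken over the four minutes mentioned, which is my reading of the puzzle rather than anything stated in the thread.

```python
minutes = 4
average_per_minute = 5

# Cubes placed in the minutes the puzzle states explicitly.
placed = {1: 4, 2: 5, 4: 0}

# Cubes that must have been placed in minute 3 to hit the average.
third_minute = minutes * average_per_minute - sum(placed.values())

# Cubes in the pan after minute 3 if nothing ever melted.
naive_total = placed[1] + placed[2] + third_minute

print(third_minute, naive_total)  # 11 20
```

Both numbers show up as answer options, which is presumably the trap: the "most realistic" wording points at ice cubes not surviving in a hot frying pan, and no amount of averaging captures that.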
So, the fact that LLMs can't learn this sample game despite probably including all of the books ever written on it in their training set tells us something about their general reasoning skills.
Yes, that's all there is to it and it's not enough. I ain't paying for another defective organism that makes mistakes in entirely novel ways. At least with humans you know how to guide them back on course.
If that's the peak of "AI" evolution today, I am not impressed.
Just feels less "stable" or "tight" overall.
This take is completely oblivious, and frankly sounds like a desperate jab. There are a myriad of activities whose core requirement is a) derive info from a complex context which happens to be supported by a deep and plentiful corpus, b) employ glorified template and rule engines.
LLMs excel at what might be described as interpolating context following input and output in natural language. As in a chatbot that is extensively trained in domain-specific tasks, which can also parse and generate content. There are absolutely zero lines of intellectual work that do not benefit extensively from this sort of tool. Zero.