Reasoning models reason well, until they don't

(arxiv.org)

Show context

iLoveOncall ◴[31 Oct 25 09:48 UTC] No.45770127[source]▶

> [...] recent studies show that transformers and LLMs fail catastrophically once reasoning problems exceed modest complexity. We revisit these findings through the lens of large reasoning models (LRMs) -- LLMs fine-tuned with incentives for step-by-step argumentation and self-verification

This was the obvious outcome of the study (don't get me wrong, obvious outcomes are still worth having research on).

"LRMs" *are* just LLMs. There's no such thing as a reasoning model, it's just having an LLM write a better prompt than the human would and then sending it to the LLM again.

Despite what Amodei and Altman want Wall Street to believe, they did not suddenly unlock reasoning capabilities in LLMs by essentially just running two different prompts in sequence to answer the user's question.

The truly amazing thing is that reasoning models show ANY improvement at all compared to non-reasoning models, when they're the same exact thing.

replies(5): >>45770198 #>>45770203 #>>45770220 #>>45770276 #>>45770473 #

sothatsit ◴[31 Oct 25 09:58 UTC] No.45770198[source]▶

>>45770127 #

What do you mean by reasoning?

If you mean solving logic problems, then reasoning LLMs seem to pass that bar as they do very well programming and maths competitions. Reasoning LLMs can also complete problems like multiplying large numbers, which requires applying some sort of algorithm where the results cannot just be memorised. They also do this much better than standard pre-trained LLMs with no RL.

So, that makes me come back to this question of what definition of reasoning do people use that reasoning models do not meet? They're not perfect, obviously, but that is not a requirement of reasoning if you agree that humans can reason. We make mistakes as well, and we also suffer under higher complexity. Perhaps they are less reliable in knowing when they have made mistakes or not than trained humans, but I wouldn't personally include reliability in my definition for reasoning (just look at how often humans make mistakes in tests).

I am yet to see any serious, reasoned, arguments that suggest why the amazing achievements of reasoning LLMs in maths and programming competitions, on novel problems, does not count as "real reasoning". It seems much more that people just don't like the idea of LLMs reasoning, and so reject the idea without giving an actual reason themselves, which seems somewhat ironic to me.

replies(3): >>45770258 #>>45770588 #>>45775592 #

riku_iki ◴[31 Oct 25 19:11 UTC] No.45775592[source]▶

>>45770198 #

> then reasoning LLMs seem to pass that bar as they do very well programming and maths competitions.

it could be this is just result of good stochastic parroting and not reasoning. Both of those niches are narrow with high amount of training data (e.g. corps buying solutions from leetcode and training LLMs on them).

From another hand we see that LLMs fail in more complex environment: e.g. ask to build some new feature in postgres database.

replies(1): >>45777285 #

sothatsit ◴[31 Oct 25 22:15 UTC] No.45777285[source]▶

>>45775592 #

This is clearly false. LLMs being able to multiply large numbers is the clear example to me that there is more than just memorisation going on. They cannot just memorise the answers to multipling huge numbers like they do.

That's not to mention that these programming competition problems are designed to be novel. They are as novel as the competition designers can get while sticking to the bounds of the competition. This is clearly not stochastic parrot behaviour.

Additionally, them falling over in large codebases is not evidence that they cannot reason over smaller well-defined problems. It is just evidence that their reasoning has limits, which should not be surprising to anyone. Humans also have limits in our reasoning. That does not mean we do not reason.

replies(1): >>45777382 #

riku_iki ◴[31 Oct 25 22:27 UTC] No.45777382[source]▶

>>45777285 #

I think you just made lots of handwaving statements. Here is result which says LLMs can't do multi-digit multiplications well: https://arxiv.org/pdf/2510.00184

replies(1): >>45777872 #

sothatsit ◴[31 Oct 25 23:33 UTC] No.45777872[source]▶

>>45777382 #

We are talking about reasoning models here, not old non-reasoning models like Llama-90B and GPT-4. Obviously, they cannot multiply numbers. That was never in question.

Maybe at least give a cursory glance at a paper before trying to cite it to support your point?

I find it fun that this paper also points out that using another training method, IcoT, they can produce models that can multiply numbers perfectly. The frontier reasoning models can still make mistakes, they just get very close a lot of the time, even with 10-20 digit numbers. But the IcoT models can do it perfectly, they just can only multiply numbers.

replies(1): >>45777885 #

1. riku_iki ◴[31 Oct 25 23:35 UTC] No.45777885[source]▶

>>45777872 #

so, give ref on results which prove that they reliably can multiply arbitrary numbers

> Maybe at least give a cursory glance at a paper before trying to cite it to support your invalid point?

they use CoT aka reasoning steps

replies(1): >>45777968 #

2. sothatsit ◴[31 Oct 25 23:49 UTC] No.45777968[source]▶

>>45777885 (TP) #

They do not apply reinforcement learning, which is what most people mean when they talk about reasoning LLMs. This means it is not comparable to the frontier reasoning models.

Here is the post I remember seeing: https://www.reddit.com/r/singularity/comments/1ip3vpa/multid...

This shows that o3-mini has >90% accuracy at multiplying numbers up to 8-digits, and it is capable of multiplying numbers much larger than that. Whereas, gpt-4o could only multiply 2-digit numbers reliably.

Here's another person I saw experimenting with this: https://sanand0.github.io/llmmath/

That's not to mention that if you give these models a Python interpreter, they can also do this task perfectly and tackle much more complicated tasks as well. Although, that is rather separate to the models themselves being able to apply the reasoning steps to multiply numbers.

replies(1): >>45778258 #

3. riku_iki ◴[01 Nov 25 00:42 UTC] No.45778258[source]▶

>>45777968 #

> reinforcement learning, which is what most people mean when they talk about reasoning LLMs

popularity contest has no place in tech discussion, and even then not clear on which evidence you make such statement.

imo, reasoning model is model trained on lots of reasoning steps, so it is strong in producing those.

rl is used in niches where there is no much training data, so data is synthetically generated, which produces lots of garbage and model need feedback to adjust. And multiplication is not such niche.

> This shows that o3-mini has >90% accuracy at multiplying numbers up to 8-digits, and it is capable of multiplying numbers much larger than that. Whereas, gpt-4o could only multiply 2-digit numbers reliably.

it could be just a matter that one model has training data for this and another doesn't, you can't come to any conclusion without inspecting oai data.

Also, your examples actually demonstrate that frontier LLMs can't learn and reproduce trivial algorithm reliably, and results actually in a quality range of stochastic parrot.

replies(1): >>45778354 #

4. sothatsit ◴[01 Nov 25 00:58 UTC] No.45778354{3}[source]▶

>>45778258 #

1. This is obviously not about popularity... It is about capability. You cannot use crappy 2-year-old models with chain-of-thought to make inferences about frontier reasoning models that were released less than a year ago.

2. It is literally impossible for the models to have memorised all the results from multiplying 8-digit numbers. There are at least 10^14 8-digit multiplications that are possible (lower-bound), which from an information theory perspective would require a minimum of 48 PiB of data to hold. They have to be applying algorithms internally to perform this task, even if that algorithm is just uncompressing some unbelievably-well-compressed form of the results.

3. If you expect 100% reliability, obviously all humans would also fail. Therefore, do humans not reason? The answer is obviously not. We are trying to demonstrate that LLMs can exhibit reasoning here, not whether or not their reasoning has flaws or limitations (which it obviously does).

replies(1): >>45778407 #

5. riku_iki ◴[01 Nov 25 01:08 UTC] No.45778407{4}[source]▶

>>45778354 #

> You cannot use crappy 2-year-old models with chain-of-thought to make inferences about frontier reasoning models that were released less than a year ago.

the idea is to train new specialized model, which could specifically demonstrate if LLM can learn multiplication.

> It is literally impossible for the models to have memorised how to multiply 8-digit numbers. There are at least 10^14 8-digit multiplications that are possible

sure, they could memorize fragments: if that fragment contains that seq of digits, then that fragment must contains that seq of digits, which is much smaller space

> If you expect 100% reliability, obviously all humans would also fail. Therefore, do humans not reason?

Human fail because they are weak in this case because they can't reliably do arithmetic, and sometimes make mistake, also I speculate if you give enough time, and ask human to triple check calculations, result will be very good.

replies(1): >>45778740 #

6. sothatsit ◴[01 Nov 25 02:18 UTC] No.45778740{5}[source]▶

>>45778407 #

We also cap how long we let reasoning LLMs think for. OpenAI researchers have already discussed models they let reason for hours that could solve much harder problems.

But regardless, I feel like this conversation is useless. You are clearly motivated to not think LLMs are reasoning by 1) only looking at crappy old models as some sort of evidence about new models, which is nonsense, and 2) coming up with nonsensical arguments about how they could still be memorising answers that make no sense. Even if they memorised sequences, they still have to put that together to get the exact right answers to 8-digit multiplication in >90% of cases. That requires the application of algorithms, aka reasoning.

replies(1): >>45778763 #

7. riku_iki ◴[01 Nov 25 02:23 UTC] No.45778763{6}[source]▶

>>45778740 #

> only looking at crappy old model

let me repeat this: it was newly trained specialized model

other rants are ignored.

replies(1): >>45778768 #

8. sothatsit ◴[01 Nov 25 02:24 UTC] No.45778768{7}[source]▶

>>45778763 #

They did not use modern techniques. Therefore it is meaningless.

That’s not to mention that modern frontier LLMs can also be demonstrated to do this task, which is an existence proof in and of itself.

replies(1): >>45778776 #