
214 points by optimalsolver | 2 comments
iLoveOncall No.45770127
> [...] recent studies show that transformers and LLMs fail catastrophically once reasoning problems exceed modest complexity. We revisit these findings through the lens of large reasoning models (LRMs) -- LLMs fine-tuned with incentives for step-by-step argumentation and self-verification

This was the obvious outcome of the study (don't get me wrong, obvious outcomes are still worth having research on).

"LRMs" *are* just LLMs. There's no such thing as a reasoning model, it's just having an LLM write a better prompt than the human would and then sending it to the LLM again.

Despite what Amodei and Altman want Wall Street to believe, they did not suddenly unlock reasoning capabilities in LLMs by essentially just running two different prompts in sequence to answer the user's question.

The truly amazing thing is that reasoning models show ANY improvement at all compared to non-reasoning models, when they're the same exact thing.
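
To spell out the shape I mean, here is a minimal sketch of the "two prompts in sequence" picture; call_llm is a hypothetical placeholder for whatever chat API you use, not a real library call:

    # Toy sketch of prompt chaining: pass 1 has the model write its own
    # working notes, pass 2 feeds those notes back in for a final answer.
    # `call_llm` is a hypothetical placeholder, not a real API.

    def call_llm(prompt: str) -> str:
        """Placeholder: send `prompt` to some LLM endpoint and return its text."""
        raise NotImplementedError("wire this up to your provider of choice")

    def answer_with_reasoning(question: str) -> str:
        # Pass 1: have the model expand the question into its own working notes.
        scratchpad = call_llm(
            "Think step by step about the following problem and write out your "
            f"intermediate reasoning before answering:\n\n{question}"
        )
        # Pass 2: feed those notes back in and ask for the final answer.
        return call_llm(
            f"Problem:\n{question}\n\nDraft reasoning:\n{scratchpad}\n\n"
            "Using the draft reasoning above, state the final answer."
        )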

replies(5): >>45770198 #>>45770203 #>>45770220 #>>45770276 #>>45770473 #
sothatsit No.45770198
What do you mean by reasoning?

If you mean solving logic problems, then reasoning LLMs seem to pass that bar, as they do very well in programming and maths competitions. Reasoning LLMs can also complete problems like multiplying large numbers, which requires applying some sort of algorithm, since the results cannot just be memorised. They also do this much better than standard pre-trained LLMs with no RL.
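
To make that concrete, this is the kind of check I have in mind (a sketch only; query_model is a hypothetical placeholder for whatever LLM API you use):

    # Random large multiplications can't plausibly be memorised, and the
    # model's answers are trivially checkable against exact arithmetic.
    # `query_model` is a hypothetical placeholder, not a real API.
    import random

    def query_model(prompt: str) -> str:
        raise NotImplementedError("replace with a real LLM call")

    def multiplication_accuracy(trials: int = 100, digits: int = 12) -> float:
        correct = 0
        for _ in range(trials):
            a = random.randrange(10 ** (digits - 1), 10 ** digits)
            b = random.randrange(10 ** (digits - 1), 10 ** digits)
            reply = query_model(f"Compute {a} * {b}. Reply with only the number.")
            try:
                correct += int(reply.strip().replace(",", "")) == a * b
            except ValueError:
                pass  # a non-numeric reply just counts as wrong
        return correct / trials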

So that brings me back to the question: what definition of reasoning do people use that reasoning models do not meet? They're not perfect, obviously, but perfection is not a requirement of reasoning if you agree that humans can reason. We make mistakes as well, and we also suffer under higher complexity. Perhaps they are less reliable than trained humans at recognising when they have made mistakes, but I wouldn't personally include reliability in my definition of reasoning (just look at how often humans make mistakes in tests).

I have yet to see any serious, reasoned argument for why the amazing achievements of reasoning LLMs in maths and programming competitions, on novel problems, do not count as "real reasoning". It seems much more that people just don't like the idea of LLMs reasoning, and so reject it without giving an actual reason themselves, which seems somewhat ironic to me.

replies(3): >>45770258 #>>45770588 #>>45775592 #
js8 No.45770588
> So, that makes me come back to this question of what definition of reasoning do people use that reasoning models do not meet?

The models can learn reasoning rules, but they are not able to apply them consistently or to recognize when the rules they have learned are inconsistent. (See also my other comment, which references comments I made earlier.)

And I think they can't without a tradeoff, as I commented in https://news.ycombinator.com/item?id=45717855 ; consistency requires a certain level of close-mindedness.

replies(1): >>45770690 #
sothatsit No.45770690
Yes, so I think in this case we use different definitions of reasoning. You include reliability as a part of reasoning, whereas I do not.

I would argue that humans are not 100% reliable in their reasoning, and yet we still claim that they can reason. So, even though I would agree that the reasoning of LLMs is much less reliable, careful, and thoughtful than that of smart humans, that does not mean that they are not reasoning. Rather, it means that their reasoning is more unreliable and less well applied than people's. But they are still performing reasoning tasks (even if their application of reasoning can be flawed).

Maybe the problem is that I am setting a minimum bar for LLMs to clear in order to count as reasoning (demonstrated application of logical algorithms to solve novel problems in any domain), whereas other people set the bar higher (consistent and logical application of rules in all/most domains).

replies(1): >>45771135 #
js8 No.45771135
The problem is that if you're not able to apply the reasoning rules consistently, then you will always fail on a large enough problem. And if a reasoner has an inconsistent set of reasoning rules, then you can set up a problem as a trap so that its reasoning fails.
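
As a toy illustration of the trap (not a claim about how any particular model represents rules), a naive forward-chainer over an inconsistent rule set derives both a fact and its negation, so any query touching them can be made to fail:

    # Toy illustration only: with inconsistent rules, forward chaining derives
    # both "can_fly" and "~can_fly", so whichever one a reasoner happens to
    # use first decides the answer -- that's the trap.
    RULES = [
        ({"penguin"}, "bird"),
        ({"bird"}, "can_fly"),      # general rule
        ({"penguin"}, "~can_fly"),  # exception; jointly inconsistent with the above
    ]

    def forward_chain(facts):
        derived = set(facts)
        changed = True
        while changed:
            changed = False
            for premises, conclusion in RULES:
                if premises <= derived and conclusion not in derived:
                    derived.add(conclusion)
                    changed = True
        return derived

    facts = forward_chain({"penguin"})
    print("can_fly" in facts, "~can_fly" in facts)  # True True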

You can argue that a damaged toaster is still a toaster, conceptually. But if it doesn't work, then it's useless. As it stands, models lack the ability to reason, because they can fail to reason and you can't do anything about it. In the case of humans, it's valid to say they can reason, because humans can at least fix themselves; models can't.

replies(1): >>45771498 #
sothatsit No.45771498
The reasoning does not need to be 100% accurate to be useful. Humans are rarely 100% accurate at anything, and yet over time we can build up large models of problems using verification and review. We can do the exact same thing with LLMs.

The best example of this is Sean Heelan, who used o3 to find a real security vulnerability in the Linux kernel: https://sean.heelan.io/2025/05/22/how-i-used-o3-to-find-cve-...

Sean Heelan ran o3 100 times, and it found a known vulnerability in 8% of runs. For a security audit, that is immensely useful, since an expert can spend the time to look at the results from a dozen runs and quickly decide whether there is anything real. Even more remarkably, this same testing exposed a zero-day that he was not even looking for. That is pretty incredible for a system that makes mistakes.
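
The workflow is basically sample-many-then-triage. Roughly something like this (a sketch; run_audit is a hypothetical placeholder, and the 8% figure comes from Heelan's write-up, not from this code):

    # Sketch: run the same audit prompt many times, then aggregate the
    # reported findings so a human expert only reviews candidates that recur
    # across independent samples. `run_audit` is a hypothetical placeholder.
    from collections import Counter

    def run_audit(code: str) -> str:
        """Placeholder: ask an LLM for its top suspected vulnerability (one line)."""
        raise NotImplementedError("replace with a real LLM call")

    def triage(code: str, runs: int = 100, min_hits: int = 3):
        findings = Counter()
        for _ in range(runs):
            report = run_audit(code).strip().lower()
            if report and report != "no issues found":
                findings[report] += 1
        # Only surface findings that recur; these go to the human reviewer.
        return [(f, n) for f, n in findings.most_common() if n >= min_hits]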

This is why LLM reasoning absolutely does not need to be perfect to be useful. Human reasoning is inherently flawed as well, and yet through systems like peer review and reproducing results, we can still make tremendous progress over time. It is just about figuring out systems of verification and review so that we don't need to trust any LLM output blindly. That said, greater reliability would make it much easier to get good results from LLMs. But it's not required.