iLoveOncall No.45770127
> [...] recent studies show that transformers and LLMs fail catastrophically once reasoning problems exceed modest complexity. We revisit these findings through the lens of large reasoning models (LRMs) -- LLMs fine-tuned with incentives for step-by-step argumentation and self-verification

This was the obvious outcome of the study (don't get me wrong, obvious outcomes are still worth researching).

"LRMs" *are* just LLMs. There's no such thing as a reasoning model, it's just having an LLM write a better prompt than the human would and then sending it to the LLM again.

Despite what Amodei and Altman want Wall Street to believe, they did not suddenly unlock reasoning capabilities in LLMs by essentially just running two different prompts in sequence to answer the user's question.

The truly amazing thing is that reasoning models show ANY improvement at all compared to non-reasoning models, when they're the exact same thing.
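
To make that concrete, here is a minimal sketch of the two-prompts-in-sequence setup described above; `call_llm` and the prompt wording are hypothetical stand-ins, not any vendor's actual API:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a single LLM completion call."""
    raise NotImplementedError

def answer_with_reasoning(question: str) -> str:
    # Pass 1: have the model write out step-by-step working (in effect,
    # a better prompt than the user's original question).
    trace = call_llm(
        "Think step by step about the following problem, "
        f"checking each step before moving on:\n{question}"
    )
    # Pass 2: feed the generated working back to the same model and ask
    # for a final answer.
    return call_llm(
        f"Problem:\n{question}\n\nWorking:\n{trace}\n\n"
        "Given the working above, state the final answer."
    )
```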

qsort No.45770276
Don't they have a significant RL component? The "we'll just make it bigger" idea that was peddled a lot after GPT-3.5 was nonsense, but that's not the only thing they're doing right now.
ACCount37 No.45770883
"We'll just make it bigger" works. RLVR just gives better performance gains and spends less inference compute - as long as you have a solid way of verifying the tasks.

A simplified way of thinking about it: pretraining gives LLMs useful features, SFT (supervised fine-tuning) arranges them into useful configurations, and RLVR glues them together and makes them work well together, especially in long reasoning traces. It makes sense to combine all of it in practice.
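
A hedged sketch of that three-stage recipe; every function here is a stand-in for an entire training stage, not a real library call:

```python
def pretrain(corpus): ...                 # next-token prediction at scale: raw features
def sft(model, demonstrations): ...       # supervised fine-tuning on curated examples
def rlvr(model, tasks, verifier): ...     # RL against programmatic verifiers

def train_reasoning_llm(corpus, demonstrations, tasks, verifier):
    model = pretrain(corpus)              # useful features from raw text
    model = sft(model, demonstrations)    # arrange them into useful configurations
    model = rlvr(model, tasks, verifier)  # glue them together over long traces
    return model
```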

How much pretraining gives an LLM depends, among other things, on the scale of that LLM. But raw scale is bounded by hardware capabilities and by the economics of training and, especially, of inference.

Scale is still quite desirable: GPT-4.5-scale models are going to become the norm for high-end LLMs quite soon.

qsort No.45770917
I'm not against "we'll make it bigger" (although it's as yet unknown whether it hits diminishing returns; 4.5 isn't exactly remembered as a great release); I'm against "we'll just (i.e. 'only') make it bigger".

I'm doubtful we'd have useful LLMs today if labs hadn't scaled up post-training.