217 points by optimalsolver | 14 comments
iLoveOncall ◴[] No.45770127[source]
> [...] recent studies show that transformers and LLMs fail catastrophically once reasoning problems exceed modest complexity. We revisit these findings through the lens of large reasoning models (LRMs) -- LLMs fine-tuned with incentives for step-by-step argumentation and self-verification

This was the obvious outcome of the study (don't get me wrong, obvious outcomes are still worth having research on).

"LRMs" *are* just LLMs. There's no such thing as a reasoning model, it's just having an LLM write a better prompt than the human would and then sending it to the LLM again.

Despite what Amodei and Altman want Wall Street to believe, they did not suddenly unlock reasoning capabilities in LLMs by essentially just running two different prompts in sequence to answer the user's question.

The truly amazing thing is that reasoning models show ANY improvement at all compared to non-reasoning models, when they're the same exact thing.

replies(5): >>45770198 #>>45770203 #>>45770220 #>>45770276 #>>45770473 #
sothatsit ◴[] No.45770198[source]
What do you mean by reasoning?

If you mean solving logic problems, then reasoning LLMs seem to pass that bar, as they do very well in programming and maths competitions. Reasoning LLMs can also complete tasks like multiplying large numbers, which requires applying some sort of algorithm, since the results cannot just be memorised. They also do this much better than standard pre-trained LLMs with no RL.
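
(A minimal sketch of that kind of multiplication probe, for the curious; `query_llm` is a hypothetical stand-in for whichever model API you'd test, not a real client:)

    # Rough sketch of the large-number multiplication probe described above.
    # `query_llm` is a hypothetical stand-in for whatever model API you test.
    import random
    import re

    def query_llm(prompt: str) -> str:
        raise NotImplementedError("wire up your model of choice here")

    def multiplication_accuracy(digits: int = 8, trials: int = 20) -> float:
        correct = 0
        for _ in range(trials):
            a = random.randrange(10 ** (digits - 1), 10 ** digits)
            b = random.randrange(10 ** (digits - 1), 10 ** digits)
            reply = query_llm(f"Compute {a} * {b}. Reply with the number only.")
            nums = re.findall(r"\d[\d,]*", reply)
            answer = int(nums[-1].replace(",", "")) if nums else None
            # Random 8-digit pairs are vanishingly unlikely to be memorised,
            # so exact matches require actually carrying out an algorithm.
            correct += answer == a * b
        return correct / trials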

So that brings me back to the question: what definition of reasoning do people use that reasoning models do not meet? They're not perfect, obviously, but perfection is not a requirement for reasoning if you agree that humans can reason. We make mistakes as well, and we also suffer as complexity rises. Perhaps they are less reliable than trained humans at knowing whether they have made a mistake, but I wouldn't personally include reliability in my definition of reasoning (just look at how often humans make mistakes in tests).

I have yet to see any serious, reasoned argument for why the amazing achievements of reasoning LLMs in maths and programming competitions, on novel problems, do not count as "real reasoning". It seems much more that people just don't like the idea of LLMs reasoning, and so reject it without giving an actual reason themselves, which seems somewhat ironic to me.

replies(3): >>45770258 #>>45770588 #>>45775592 #
fsloth ◴[] No.45770258[source]
I guess we mean here "useful reasoning" rather than the idiot-savant kind. I mean, it's a fair ask, since these are marketed as _tools_ you can use to implement _industrial processes_ and even replace your human workers.

In that sense the model does not need to be the most reasonable interpreter of vague and poorly formulated user inputs, but I think it has to improve at least a bit to become a useful general appliance and not just a test-scoring automaton.

The key differentiator here is that tests generally _are made to be unambiguously scorable_. Real-world problems are often vaguer about what the optimal outcome even is.

replies(2): >>45770349 #>>45770457 #
sothatsit ◴[] No.45770349[source]
Thanks. So, people are extending "reasoning" to include making good decisions, rather than just solving logic problems. It makes sense to me that, with that definition, LLMs are pretty bad at "reasoning".

Although, I would argue that this is not reasoning at all, but rather "common sense", or the ability to take a broader perspective or think ahead. These are skills that come with experience, which is why they do not seem like reasoning tasks to me, but rather soft skills that LLMs lack. In my mind these are pretty separate concerns from whether LLMs can logically step through problems or apply algorithms, which is what I would call reasoning.

replies(1): >>45770470 #
1. hansmayer ◴[] No.45770470[source]
Ah yes then, let me unchain my LLM on those nasty unsolved math and logic problems I've absolutely not been struggling with over the course of my career.
replies(3): >>45770540 #>>45770590 #>>45770997 #
2. sothatsit ◴[] No.45770540[source]
A lot of maths students would also struggle to contribute to frontier math problems, but we would still say they are reasoning. Their skill at reasoning might not be as good as professional mathematicians, but that does not stop us from recognising that they can solve logic problems without memorisation, which is a form of reasoning.

I am just saying that LLMs have demonstrated they can reason, at least a little bit. Whereas it seems other people are saying that LLM reasoning is flawed, which does not negate the fact that they can reason, at least some of the time.

Maybe generalisation is one area where LLMs' reasoning is weakest, though. They can reach near-elite performance on nicely boxed-up competition maths problems, but their performance drops dramatically on real-world problems where things aren't so neat. We see similar problems in programming as well. I'd argue the progress on this has been promising, but other people would probably vehemently disagree with that. Time will tell.

replies(1): >>45771034 #
3. cryptonym ◴[] No.45770590[source]
That's the real deal.

They say LLMs are PhD-level. Despite billions of dollars, these PhD-level LLMs sure are not contributing much to solving known problems, except of course for a few limited marketing stunts.

replies(2): >>45770671 #>>45770720 #
4. fsloth ◴[] No.45770671[source]
IMHO that's the key differentiator.

You can give a human PhD an _unsolved problem_ in a field adjacent to their expertise and expect some reasonable resolution. LLM "PhDs" solve only known problems.

That said humans can also be really bad problem solvers.

If you don't care about solving the problem and only want to create paperwork for the bureaucracy, I guess you don't care either way ("My team's on it!"), but companies that don't go out of business generally recognize a lack of outcomes where it matters pretty soon.

replies(1): >>45771146 #
5. hansmayer ◴[] No.45770720[source]
I wish our press were not effectively muted or bought by money; none of the journos has the cojones to call out the specific people who were blabbing about PhD levels, AGI, etc. They should be god damn calling them out every single day, essentially doing their job, but they are too timid for that now.
6. vidarh ◴[] No.45770997[source]
I've "unchained" my LLM on a lot of problems that I probably could solve, but that would take me time I don't have, and that it has solved in many case faster than I could. It may not be good enough to solve problems that are beyond us for most of us, but it certainly can solve a lot of problems for a lot of us that have gone unsolved for lack of resources.
replies(2): >>45772820 #>>45780209 #
7. vidarh ◴[] No.45771034[source]
Thank you for picking at this.

A lot of people appear to be - often not consciously or intentionally - setting the bar for "reasoning" at a level many or most people would not meet.

Sometimes that is just a reaction to wanting an LLM that produces results that are good relative to their own level. Sometimes it reveals a view of fellow humans that would be quite elitist if stated outright. Sometimes it's a kneejerk attempt to set the bar at a point that would justify the claim that LLMs aren't reasoning.

Whatever the reason, it's a massive pet peeve of mine that it is rarely made explicit in these conversations, and it makes a lot of these conversations pointless because people keep talking past each other.

For my part, a lot of these models clearly do reason by my standard, even if poorly. People also often reason poorly, even when they demonstrably attempt to reason step by step, either because they have motivations to skip over uncomfortable steps, or because they don't know how to do it right. But we would still rarely claim they are not capable of reasoning.

I wish more evaluations of LLMs would establish a human baseline to test them against, for much this reason. It would be illuminating in terms of actually telling us how LLMs match up to humans in different areas.

replies(2): >>45772974 #>>45780248 #
8. nl ◴[] No.45771146{3}[source]
> LLM PhD:s solve only known problems.

Terry Tao would disagree: https://mathstodon.xyz/@tao/114508029896631083

https://deepmind.google/discover/blog/alphaevolve-a-gemini-p...

9. cryptonym ◴[] No.45772820[source]
It can solve problems you already know how to solve, if you micro-manage it, and it'll BS a lot along the way.

If this is the maximum an AGI-PhD-LRM can do, that'll be disappointing compared to the investment. Curious to see what all this will become in a few years.

replies(2): >>45773274 #>>45780970 #
10. cryptonym ◴[] No.45772974{3}[source]
Computers have forever been doing stuff people can't do.

The real question is how useful this tool is and if this is as transformative as investors expect. Understanding its limits is crucial.

11. vidarh ◴[] No.45773274{3}[source]
I'm not usually micro-managing it, that's the point.

I sometimes do on problems where I have particular insight, but I mostly find it is far more effective to give it test cases and give it instructions on how to approach a task, and then let it iterate with little to no oversight.

I'm letting Claude Code run for longer and longer with --dangerously-skip-permissions, to the point I'm pondering rigging up something to just keep feeding it "continue" and run it in parallel on multiple problems.

Because at least when you have a good way of measuring success, it works.
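
(For what it's worth, a harness like the one described above could be sketched roughly as follows; the `--dangerously-skip-permissions` flag is from the comment, while the one-shot `-p` prompt and the per-project `run_tests.sh` success check are assumptions about the setup, not something stated in the thread:)

    # Rough sketch of the "keep feeding it 'continue'" harness described above.
    # Assumptions: the `claude` CLI accepts a one-shot prompt via `-p` and exits
    # after responding, and each project ships a hypothetical `run_tests.sh`
    # that exits 0 once the task is done.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    PROJECTS = ["./task-a", "./task-b", "./task-c"]  # hypothetical problem dirs
    MAX_ROUNDS = 20

    def tests_pass(project: str) -> bool:
        # The "good way of measuring success" mentioned in the comment.
        return subprocess.run(["./run_tests.sh"], cwd=project).returncode == 0

    def drive(project: str) -> bool:
        for _ in range(MAX_ROUNDS):
            # Keep nudging Claude Code along, skipping permission prompts.
            subprocess.run(
                ["claude", "-p", "continue", "--dangerously-skip-permissions"],
                cwd=project,
            )
            if tests_pass(project):
                return True
        return False

    if __name__ == "__main__":
        with ThreadPoolExecutor(max_workers=len(PROJECTS)) as pool:
            results = list(pool.map(drive, PROJECTS))
        for project, done in zip(PROJECTS, results):
            print(f"{project}: {'done' if done else 'gave up'}")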

12. hansmayer ◴[] No.45780209[source]
Unless you can show us concrete metrics and problems solved, I am inclined not to believe your statement (source: my own intensive experience with LLMs).
14. hansmayer ◴[] No.45780970{3}[source]
Exactly my experience too. Whoever says they're able to solve "very complex" problems with LLMs is clearly not working on objectively complex problems.