
214 points by optimalsolver | 1 comment
alyxya No.45770449
The key point the paper seems to make is that existing benchmarks have relatively low reasoning complexity, so they built a new dataset, DeepRD, with arbitrarily large reasoning complexity and demonstrated that existing models fail once a problem is complex enough. Complexity is defined by modeling the problem as a graph and measuring the traversal needed to get from some source node to a target node.
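As far as I can tell, the complexity knob is essentially the length of that source-to-target traversal. A rough sketch of the idea (names here are mine, not the paper's):

    from collections import deque

    def reasoning_complexity(edges, source, target):
        """Illustrative only: treat a problem as a directed graph and use the
        number of hops needed to reach the target as a proxy for reasoning depth."""
        graph = {}
        for u, v in edges:
            graph.setdefault(u, []).append(v)

        queue = deque([(source, 0)])
        seen = {source}
        while queue:
            node, depth = queue.popleft()
            if node == target:
                return depth
            for nxt in graph.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, depth + 1))
        return None  # target unreachable

    # A chain like A -> B -> C -> D needs 3 traversal steps.
    print(reasoning_complexity([("A", "B"), ("B", "C"), ("C", "D")], "A", "D"))  # 3

The longer that chain gets, the more intermediate facts the model has to hold and compose, which is the knob the paper turns up.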

My main critique is that I don't think there's evidence this issue would persist after scaling models further and doing more RL. With a harness like today's coding agents use, and with sufficient tool use, I bet models could go much further on this reasoning benchmark. Conversely, if the reasoning has to happen entirely within a single context window, it's expected that a sufficiently complex problem becomes too difficult for the model to solve.

jeremyjh No.45771061
The burden of evidence here is on you. They don’t need to prove LRMs can’t scale to meet these problems; their only claim is current models can’t handle these problems. Others will take this up as a challenge - and chances may be good they will overcome it. This is how science works.
alyxya No.45773412
They can’t claim current models aren’t able to handle these problems if they didn’t use a setup similar to coding agents like Claude Code and OpenAI Codex. Using a suboptimal setup is akin to verbally telling a person the whole reasoning problem without letting them write down notes and expecting them to memorize and solve it after only hearing it once.
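To be concrete about what I mean by a setup similar to coding agents: some outer loop that lets the model persist intermediate deductions outside the context window instead of holding everything in its head. A hypothetical sketch — call_model here is just a stand-in, not any real API:

    def solve_with_notes(problem, call_model, max_steps=50):
        """Hypothetical harness: the model works one step at a time and keeps
        its intermediate conclusions in an external notes list rather than
        re-deriving everything inside a single context window."""
        notes = []
        for _ in range(max_steps):
            prompt = (
                "Problem:\n" + problem + "\n\n"
                "Notes so far:\n" + "\n".join(notes) + "\n\n"
                "Either add ONE new deduction prefixed with NOTE:, "
                "or give the final answer prefixed with ANSWER:."
            )
            reply = call_model(prompt)  # stand-in for whatever model/tool API is used
            if reply.startswith("ANSWER:"):
                return reply[len("ANSWER:"):].strip()
            if reply.startswith("NOTE:"):
                notes.append(reply[len("NOTE:"):].strip())
        return None  # gave up within the step budget

That's the "write down notes" part: the per-step context stays small even when the overall chain is long.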
rdedev No.45775224
The thing they are testing for is reasoning performance. It makes sense to not give tool access.

This is the same as the critiques of the LLM paper by Apple, where they showed that LLMs fail to solve the Tower of Hanoi problem past a certain number of disks. The test was to see how well these models can reason through a long task. People online said the models could solve that problem if they had access to a coding environment. Again, the test was to check reasoning capability, not whether the model knew how to code an algorithm to solve the problem.
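To make that distinction concrete: writing code that solves Tower of Hanoi is trivial, which is exactly why handing the model a coding environment says nothing about whether it can carry out the 2^n - 1 moves as a chain of reasoning. The standard recursive solution, for reference:

    def hanoi(n, src="A", aux="B", dst="C", moves=None):
        """Standard recursive Tower of Hanoi: returns the full move list.
        For n disks this is 2**n - 1 moves, which is the chain the model was
        being asked to produce step by step."""
        if moves is None:
            moves = []
        if n == 0:
            return moves
        hanoi(n - 1, src, dst, aux, moves)   # move n-1 disks out of the way
        moves.append((src, dst))             # move the largest disk
        hanoi(n - 1, aux, src, dst, moves)   # move the n-1 disks back on top
        return moves

    print(len(hanoi(7)))  # 127 moves == 2**7 - 1

The algorithm fits in a dozen lines; executing the move sequence faithfully for large n is where the reasoning test actually lives.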

If model performance degrades sharply after a certain number of reasoning steps, it's good to know where the limits are. Whether the model had access to tools or not is orthogonal to this problem.