
215 points by optimalsolver | 1 comment
alyxya No.45770449
The key point the paper seems to make is that existing benchmarks have relatively low reasoning complexity, so they built a new dataset, DeepRD, with arbitrarily large reasoning complexity and showed that existing models fail once a problem is complex enough. Complexity is defined by modeling the problem as a graph and measuring the traversal needed to get from a source node to a target node.
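
Roughly, the complexity measure works like this (my own sketch, not the paper's code; the function name and graph encoding are made up for illustration):

```python
from collections import deque

def reasoning_complexity(edges, source, target):
    """Sketch: complexity as the number of hops needed to reach `target`
    from `source` in the problem's relation graph (shortest path via BFS)."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, depth = queue.popleft()
        if node == target:
            return depth
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return None  # target not reachable

# A chain of six facts A -> B -> ... -> F needs a 5-hop traversal.
chain = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F")]
print(reasoning_complexity(chain, "A", "F"))  # 5
```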

My main critique is that I don't think there's evidence that this issue would persist after continuing to scale models to be larger and doing more RL. With a harness like the ones coding agents use these days, and with sufficient tool use, I bet models could go much further on that reasoning benchmark. Without that, if the reasoning problem has to be done entirely within a single context window, it's expected that a sufficiently complex problem will be too difficult for the model to solve.

usrbinbash No.45772667
> I don't think there's evidence that this issue would persist after continuing to scale models to be larger and doing more RL

And how much larger do we need to make the models? 2x? 3x? 10x? 100x? How large do they need to get before scaling up somehow solves everything?

Because 2x larger means roughly 2x more memory and compute required: double the cost, or half the capacity. Would people still pay for this tech if it doubled in price? Bear in mind, much of it is already running at a loss even now.
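
Back-of-envelope (my own rough numbers, assuming fp16/bf16 weights at 2 bytes per parameter and a hypothetical 70B baseline; KV cache, activations and serving overhead come on top):

```python
def weight_memory_gb(n_params_billion, bytes_per_param=2):
    """Memory for the weights alone at 2 bytes per parameter (fp16/bf16)."""
    return n_params_billion * bytes_per_param  # 1e9 params * 2 B = 2 GB per billion

for scale in (1, 2, 10, 100):
    n = 70 * scale  # hypothetical 70B-parameter baseline, scaled up
    print(f"{n:>5}B params -> ~{weight_memory_gb(n):,.0f} GB just for weights")
```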

And what if 2x isn't good enough? Would anyone pay for a 10x larger model? Can we even realistically run such models as anything other than a very expensive PoC, and only for a very short time? And who's to say that even 10x will finally solve things? What if we need 40x? Or 100x?

Oh, and of course: Larger models also require more data to train them on. And while the Internet is huge, it's still finite. And when things grow geometrically, even `sizeof(internet)` eventually runs out ... and, in fact, may have done so already [1] [2]
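
Rough numbers, using the Chinchilla-style rule of thumb of ~20 training tokens per parameter and a hand-wavy ~30T tokens of usable web text (both are assumptions; the shape of the problem is the point):

```python
def chinchilla_tokens(n_params_billion, tokens_per_param=20):
    """Chinchilla-style rule of thumb: ~20 training tokens per parameter."""
    return n_params_billion * 1e9 * tokens_per_param

usable_web_tokens = 30e12  # rough assumption: a few tens of trillions of usable tokens

for n_b in (70, 700, 7000):  # hypothetical model sizes, in billions of parameters
    need = chinchilla_tokens(n_b)
    print(f"{n_b:>5}B params -> ~{need / 1e12:>6.1f}T tokens "
          f"({need / usable_web_tokens:.2f}x the assumed web corpus)")
```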

What if we actually discover that scaling up doesn't even work at all, because of diminishing returns? Oh wait, looks like we did that already: [3]

[1]: https://observer.com/2024/12/openai-cofounder-ilya-sutskever...

[2]: https://biztechweekly.com/ai-training-data-crisis-how-synthe...

[3]: https://garymarcus.substack.com/p/confirmed-llms-have-indeed...

alyxya No.45773612
Scaling applies to multiple dimensions simultaneously over time. A frontier model today could be replicated a year later by a model half the size, using a quarter of the FLOPS, and so on. I don't know the real numbers for how fast this kind of optimization compounds, but check out the NanoGPT speedrun [1] as an example.

The best solution in the meantime is giving the LLM a harness that allows tool use, like what coding agents have. I suspect current models are fully capable of solving artificial reasoning problems of arbitrary complexity, provided they're used in the context of a coding agent with tools.
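
Something like this toy harness is what I have in mind (entirely hypothetical, not any real agent framework's API): the model only has to translate the problem into a tool call, and the arbitrarily deep traversal happens in ordinary code.

```python
from collections import deque

def traverse_tool(edges, source, target):
    """Deterministic tool the agent can call: returns the full hop sequence."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# The model's only job is to turn the word problem into this call;
# the 1000-hop traversal never has to fit in its context window.
chain = [(f"n{i}", f"n{i+1}") for i in range(1000)]
print(traverse_tool(chain, "n0", "n1000")[-3:])  # ['n998', 'n999', 'n1000']
```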

[1] https://github.com/KellerJordan/modded-nanogpt

Infinity315 No.45775910
What? Fundamentally, information can only be so dense. Current models may be inefficient w.r.t. information density, but there is still a lower bound on the compute and parameters required. As a pathological example, we shouldn't expect a megabyte's worth of parameters to be able to encode the entirety of Wikipedia.
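
Back-of-envelope, with rough assumed numbers (English text compressing to around 1 bit per character, and English Wikipedia's raw text somewhere on the order of 100 GB; order of magnitude is all that matters here):

```python
budget_bits = 1e6 * 8     # one megabyte of parameters stores at most ~8e6 bits
wiki_chars = 100e9        # rough assumption: ~100 GB of English Wikipedia text
bits_per_char = 1.0       # optimistic entropy estimate for English prose

required_bits = wiki_chars * bits_per_char
print(f"capacity shortfall: ~{required_bits / budget_bits:,.0f}x")  # ~12,500x
```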