214 points optimalsolver | 19 comments
1. alyxya ◴[] No.45770449[source]
The key point the paper makes is that existing benchmarks have relatively low reasoning complexity, so the authors built a new dataset, DeepRD, with arbitrarily large reasoning complexity and showed that existing models fail once a problem is complex enough. Complexity is defined by modeling the problem as a graph and measuring the traversal needed to get from a source node to a target node.
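
To make that concrete, here's a rough sketch of the kind of setup I'm picturing (mine, not the paper's actual generator): build a random graph, pick a source and a target, and call the number of hops between them the reasoning complexity.

    import random
    import networkx as nx  # any graph library (or hand-rolled BFS) would do

    def make_reasoning_problem(n_nodes, extra_edges, seed=0):
        """Toy generator: a random digraph plus a source/target pair.
        Complexity = number of hops on the shortest source-to-target path."""
        rng = random.Random(seed)
        g = nx.DiGraph()
        nodes = list(range(n_nodes))
        for a, b in zip(nodes, nodes[1:]):   # chain guarantees a path exists
            g.add_edge(a, b)
        for _ in range(extra_edges):         # backward distractor edges; they
            a = rng.randrange(1, n_nodes)    # clutter the graph but never
            b = rng.randrange(0, a)          # shorten the forward path
            g.add_edge(a, b)
        source, target = nodes[0], nodes[-1]
        complexity = nx.shortest_path_length(g, source, target)
        return g, source, target, complexity

    _, s, t, c = make_reasoning_problem(n_nodes=50, extra_edges=200)
    print(f"source={s} target={t} complexity={c} hops")

Crank n_nodes up and the required traversal grows without bound, which seems to be the knob existing benchmarks don't turn very far.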

My main critique is that I don't see evidence this issue would persist after continuing to scale models up and doing more RL. With a harness like today's coding agents use, and with sufficient tool use, I bet models could go much further on this reasoning benchmark. And if the reasoning problem has to be solved entirely within a single context window, it's expected that a complex enough problem will be too difficult for the model.

replies(5): >>45771061 #>>45771156 #>>45772667 #>>45775565 #>>45775741 #
2. jeremyjh ◴[] No.45771061[source]
The burden of evidence here is on you. They don’t need to prove LRMs can’t scale to meet these problems; their only claim is current models can’t handle these problems. Others will take this up as a challenge - and chances may be good they will overcome it. This is how science works.
replies(1): >>45773412 #
3. tomlockwood ◴[] No.45771156[source]
So the answer is a few more trillion?
replies(1): >>45771324 #
4. code_martial ◴[] No.45771324[source]
It’s a worthwhile answer if it can be proven correct because it means that we’ve found a way to create intelligence, even if that way is not very efficient. It’s still one step better than not knowing how to do so.
replies(2): >>45771753 #>>45772739 #
5. tomlockwood ◴[] No.45771753{3}[source]
So we're spending a trillion on faith?
replies(1): >>45771805 #
6. code_martial ◴[] No.45771805{4}[source]
No, that’s not what I said.
replies(1): >>45772216 #
7. tomlockwood ◴[] No.45772216{5}[source]
Why are we spending the trillion?
replies(1): >>45775705 #
8. usrbinbash ◴[] No.45772667[source]
> I don't think there's evidence that this issue would persist after continuing to scale models to be larger and doing more RL

And how much larger do we need to make the models? 2x? 3x? 10x? 100x? How large do they need to get before scaling up somehow solves everything?

Because 2x larger means 2x the memory and compute required. Double the cost, or half the capacity. Would people still pay for this tech if it doubled in price? Bear in mind, much of it is already running at a loss.

And what if 2x isn't good enough? Would anyone pay for a 10x larger model? Can we even realistically run such models as anything other than a very expensive PoC, and for a very short time? And who's to say that even 10x will finally solve things? What if we need 40x? Or 100x?

Oh, and of course: Larger models also require more data to train them on. And while the Internet is huge, it's still finite. And when things grow geometrically, even `sizeof(internet)` eventually runs out ... and, in fact, may have done so already [1] [2]
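
Quick back-of-envelope (my numbers, all hand-waved: the rough Chinchilla-style heuristic of ~20 training tokens per parameter, and a generous ~100T tokens of usable public text):

    # Rough sketch, not a measurement: assumes ~20 compute-optimal training
    # tokens per parameter (Chinchilla-style heuristic) and a generously
    # estimated ~100 trillion tokens of usable public text.
    TOKENS_PER_PARAM = 20
    USABLE_PUBLIC_TOKENS = 100e12

    for params in (1e12, 2e12, 10e12, 40e12, 100e12):  # 1T params, then 2x/10x/40x/100x
        needed = params * TOKENS_PER_PARAM
        print(f"{params / 1e12:5.0f}T params -> {needed / 1e12:6.0f}T tokens "
              f"({needed / USABLE_PUBLIC_TOKENS:.1f}x the assumed public text)")

Even with those generous assumptions, the larger multipliers run out of public text long before 100x.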

What if we actually discover that scaling up doesn't even work at all, because of diminishing returns? Oh wait, looks like we did that already: [3]

[1]: https://observer.com/2024/12/openai-cofounder-ilya-sutskever...

[2]: https://biztechweekly.com/ai-training-data-crisis-how-synthe...

[3]: https://garymarcus.substack.com/p/confirmed-llms-have-indeed...

replies(1): >>45773612 #
9. usrbinbash ◴[] No.45772739{3}[source]
> if it can be proven correct

Then the first step would be to prove that this works WITHOUT needing to burn through the trillions to do so.

10. alyxya ◴[] No.45773412[source]
They can’t claim current models aren’t able to handle these problems if they didn’t use a setup similar to coding agents like Claude Code and OpenAI Codex. Using a suboptimal setup is akin to verbally telling a person an entire reasoning problem without letting them take notes, then expecting them to memorize and solve it after hearing it only once.
replies(2): >>45775175 #>>45775224 #
11. alyxya ◴[] No.45773612[source]
Scaling applies to multiple dimensions simultaneously over time. A frontier model today could be replicated a year later with a model half the size, with a quarter of the FLOPS, etc. I don’t know the real numbers for optimization scaling, but you could check out NanoGPT speedrun [1] as an example.

The best solution in the meantime is giving the LLM a harness that allows tool use, like what coding agents have. I suspect current models are fully capable of solving arbitrarily complex artificial reasoning problems here, provided they’re used in the context of a coding agent tool.
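
Concretely, the thing I'd expect a coding agent to do with one of these problems is not hold the whole traversal in its context window, but write and run something like the following (a sketch of mine, not what any particular agent actually emits):

    from collections import deque

    def shortest_chain(edges, source, target):
        """Plain BFS over the problem's relation graph: the deterministic
        bookkeeping a coding agent can offload to code instead of tracking
        hundreds of hops in its context window."""
        adj = {}
        for a, b in edges:
            adj.setdefault(a, []).append(b)
        queue = deque([(source, [source])])
        seen = {source}
        while queue:
            node, path = queue.popleft()
            if node == target:
                return path
            for nxt in adj.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [nxt]))
        return None

    # Toy input; a DeepRD-style problem would have far more nodes and edges.
    print(shortest_chain([("a", "b"), ("b", "c"), ("c", "d")], "a", "d"))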

[1] https://github.com/KellerJordan/modded-nanogpt

replies(2): >>45775779 #>>45775910 #
12. jeremyjh ◴[] No.45775175{3}[source]
If the models can’t do it, they can make that claim. If you want to make claims about agents, then design that experiment, collect the data, and write a paper. That is how science works.
13. rdedev ◴[] No.45775224{3}[source]
The thing they are testing for is reasoning performance, so it makes sense not to give tool access.

This is the same as the critique of the Apple LLM paper, where they showed that LLMs fail to solve the Tower of Hanoi problem past a certain number of disks. The test was to see how well these models can reason through a long task. People online argued the models could solve the problem if they had access to a coding environment, but again, the test was to check reasoning capability, not whether the model knew how to code an algorithm to solve it.

If model performance degrades sharply after a certain number of reasoning steps, it's good to know where the limits are. Whether the model has access to tools is orthogonal to that question.
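
For a sense of scale, the move count for Tower of Hanoi grows as 2^n - 1 with the number of disks, so the step-by-step trace the model was asked to produce blows up exponentially (quick sketch of mine, just to show the growth):

    def hanoi_moves(n, src="A", dst="C", via="B", out=None):
        """Classic recursion; listing every move is the kind of long
        reasoning trace the Apple paper asked models to produce."""
        if out is None:
            out = []
        if n == 0:
            return out
        hanoi_moves(n - 1, src, via, dst, out)
        out.append((src, dst))
        hanoi_moves(n - 1, via, dst, src, out)
        return out

    for n in (3, 10, 20):
        print(n, "disks ->", len(hanoi_moves(n)), "moves")  # 2**n - 1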

14. BriggyDwiggs42 ◴[] No.45775565[source]
The issue is that no matter how much you train them, they don’t generalize to arbitrarily sized problems. Sure, you can push out the horizon, but you won’t get something that can always solve the problem (assuming resources permit, which isn’t the issue here).
15. measurablefunc ◴[] No.45775705{6}[source]
It must be deposited into OpenAI's bank account so that they can then deposit it into NVIDIA's account who can then in turn make a deal w/ OpenAI to deposit it back into OpenAI's account for some stock options. I think you can see how it works from here but if not then maybe one of the scaled up "reasoning" AIs will figure it out for you.
replies(1): >>45778896 #
16. galaxyLogic ◴[] No.45775741[source]
> complexity of a graph created by modeling the problem as a graph and determining the traversals needed to go from some source node to a target node

Sounds interesting: formalizing a problem once you know the solution. It seems LLMs can't do that, or if they could, wouldn't they be able to evaluate where their own problem solving is inadequate?

17. galaxyLogic ◴[] No.45775779{3}[source]
Some problems are just too complex, and the effort to solve them increases exponentially. No LLM can keep up with exponentially increasing effort unless you run it for an adequate number of years.
18. Infinity315 ◴[] No.45775910{3}[source]
What? Fundamentally, information can only be so dense. Current models may be inefficient w.r.t. information density; however, there is a lower bound on the compute required. As a pathological example, we shouldn't expect a megabyte worth of parameters to be able to encode the entirety of Wikipedia.
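
Back-of-envelope for that example (my numbers, heavily hand-waved: ~25 billion characters of English Wikipedia article text and ~1 bit per character as a near-best-case compression rate for English):

    # Counting argument only; both constants below are hand-waved assumptions.
    wiki_chars = 25e9            # assumed size of Wikipedia's article text, in characters
    bits_per_char = 1.0          # assumed near-best-case compression for English text
    param_budget_bits = 1e6 * 8  # one megabyte of parameters, from the example above

    needed_bits = wiki_chars * bits_per_char
    print(f"needed ~{needed_bits / 8 / 1e9:.1f} GB, "
          f"available {param_budget_bits / 8 / 1e6:.0f} MB, "
          f"shortfall ~{needed_bits / param_budget_bits:,.0f}x")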
19. tomlockwood ◴[] No.45778896{7}[source]
I understand perfectly, thank you!!!