
214 points | optimalsolver | 1 comment
equinox_nl ◴[] No.45770131[source]
But I also fail catastrophically once a reasoning problem exceeds modest complexity.
replies(4): >>45770215 #>>45770281 #>>45770402 #>>45770506 #
davidhs ◴[] No.45770281[source]
Do you? Don't you just halt and say this is too complex?
replies(3): >>45770311 #>>45770398 #>>45770868 #
dspillett ◴[] No.45770398[source]
Some would consider that to be failing catastrophically. The task is certainly failed.
replies(3): >>45770566 #>>45770851 #>>45770961 #
carlmr ◴[] No.45770566[source]
Halting is sometimes preferable to thrashing around and running in circles.

I feel like if LLMs "knew" when they're out of their depth, they could be much more useful. The question is whether knowing when to stop can be meaningfully learned from examples with RL. From all we've seen, the hallucination problem and this stopping problem boil down to the same issue: you can teach the model to say "I don't know", but if that answer is just part of the training data it may spit out "I don't know" to random questions, because it's a likely response in the realm of possible responses, rather than saying "I don't know" specifically when it doesn't know.
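
Rough sketch of what I mean by learning when to stop (made-up reward numbers, just to illustrate the reward-shaping idea, not anyone's actual training setup): if a wrong answer is penalized more than an abstention, then under an expected-reward objective "I don't know" only pays off below some confidence threshold, instead of being a generically likely response.

    # Illustrative reward scheme for abstention (hypothetical values).
    R_CORRECT, R_ABSTAIN, R_WRONG = 1.0, 0.0, -2.0

    def reward(answer: str, truth: str) -> float:
        """Score one rollout: abstaining beats a wrong guess, but loses to a correct one."""
        if answer == "I don't know":
            return R_ABSTAIN
        return R_CORRECT if answer == truth else R_WRONG

    def should_abstain(p_correct: float) -> bool:
        """Abstain only when the expected reward of answering drops below the abstain reward."""
        expected_answer = p_correct * R_CORRECT + (1.0 - p_correct) * R_WRONG
        return expected_answer < R_ABSTAIN

    # With these numbers the break-even confidence is 2/3: below that, "I don't know" wins.
    print(should_abstain(0.5))  # True
    print(should_abstain(0.9))  # False

The hard part, of course, is that the model's internal confidence isn't an observable p_correct you can plug in; the hope is that RL against this kind of asymmetric reward pushes it to behave as if it were.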

SocratesAI is still unsolved, and LLMs are probably not the path to knowing that you know nothing.

replies(1): >>45770706 #
ukuina ◴[] No.45770706[source]
> if LLMs "knew" when they're out of their depth, they could be much more useful.

I used to think this, but I'm no longer sure.

Large-scale tasks just grind to a halt with more modern LLMs because of this perception of impassable complexity.

And it's not that these tasks need extensive planning: the LLM knows what needs to be done (it'll even tell you!). It's just more work than will fit within a "session" (an arbitrary boundary), so it would rather refuse than get started.

So you're now looking at TODO lists, hierarchical plans, and all this unnecessary pre-work, even when the task would scale horizontally just fine if the model simply jumped into it.
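
To be concrete about "scales horizontally" (ask_llm below is a made-up stand-in, not any particular client API): instead of one giant session that has to hold the whole task, you split the work into independent chunks and fire off one small, self-contained call per chunk.

    # Hypothetical sketch: fan out independent chunks instead of one long session.
    from concurrent.futures import ThreadPoolExecutor

    def ask_llm(prompt: str) -> str:
        """Stand-in for whatever LLM client you use; one small, self-contained request."""
        raise NotImplementedError("wire up your client here")

    def process_all(paths: list[str]) -> list[str]:
        # Each file gets its own call, so no single call sees the "impassable" whole.
        prompts = [
            f"Add docstrings to this file and return it unchanged otherwise:\n{open(p).read()}"
            for p in paths
        ]
        with ThreadPoolExecutor(max_workers=8) as pool:
            return list(pool.map(ask_llm, prompts))

No hierarchical plan needed; the orchestration is a dozen lines outside the model.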