Reasoning models reason well, until they don't

1. equinox_nl ◴[31 Oct 25 09:48 UTC] No.45770131[source]▶

But I also fail catastrophically once a reasoning problem exceeds modest complexity.

replies(4): >>45770215 #>>45770281 #>>45770402 #>>45770506 #

2. monkeydust ◴[31 Oct 25 10:00 UTC] No.45770215[source]▶

But you recognise you are likely to fail and thus don't respond, or you redirect the problem to someone who has a greater likelihood of not failing.

replies(2): >>45770433 #>>45770440 #

3. davidhs ◴[31 Oct 25 10:09 UTC] No.45770281[source]▶

>>45770131 (TP) #

Do you? Don't you just halt and say this is too complex?

replies(3): >>45770311 #>>45770398 #>>45770868 #

4. p_v_doom ◴[31 Oct 25 10:15 UTC] No.45770311[source]▶

>>45770281 #

Nope, audacity and Dunning-Kruger all the way, baby

5. dspillett ◴[31 Oct 25 10:25 UTC] No.45770398[source]▶

>>45770281 #

Some would consider that to be failing catastrophically. The task is certainly failed.

replies(3): >>45770566 #>>45770851 #>>45770961 #

6. AlecSchueler ◴[31 Oct 25 10:26 UTC] No.45770402[source]▶

>>45770131 (TP) #

I also fail catastrophically when trying to push nails through walls, but I expect my hammer to do better.

replies(2): >>45770452 #>>45770581 #

7. antonvs ◴[31 Oct 25 10:30 UTC] No.45770433[source]▶

>>45770215 #

I’ve had models “redirect the problem to someone who has a greater likelihood of not failing”. Gemini in particular will do this when it runs into trouble.

I don’t find all these claims that models are somehow worse than humans in such areas convincing. Yes, they’re worse in some respects. But when you’re talking about things related to failures and accuracy, they’re mostly superhuman.

For example, how many humans can write hundreds of lines of code (in seconds mind you) and regularly not have any syntax errors or bugs?

replies(2): >>45770627 #>>45772112 #

8. exe34 ◴[31 Oct 25 10:31 UTC] No.45770440[source]▶

>>45770215 #

If that were true, we would live in a utopia. People vote/legislate/govern/live/raise/teach/preach without ever learning to reason correctly.

9. moffkalast ◴[31 Oct 25 10:33 UTC] No.45770452[source]▶

>>45770402 #

I have one hammer and I expect it to work on every nail and screw. If it's not a general hammer, what good is it now?

replies(1): >>45770707 #

10. raddan ◴[31 Oct 25 10:40 UTC] No.45770506[source]▶

>>45770131 (TP) #

Yes, but you are not a computer. There is no point building another human. We have plenty of them.

replies(1): >>45775941 #

11. carlmr ◴[31 Oct 25 10:50 UTC] No.45770566{3}[source]▶

>>45770398 #

Halting is sometimes preferable to thrashing around and running in circles.

I feel like if LLMs "knew" when they're out of their depth, they could be much more useful. The question is whether knowing when to stop can be meaningfully learned from examples with RL. From all we've seen, the hallucination problem and this stopping problem both boil down to the same issue: you could teach the model to say "I don't know", but if that's part of the training dataset it might just spit out "I don't know" to random questions because it's a likely response in the realm of possible responses, rather than because it actually doesn't know.

SocratesAI is still unsolved, and LLMs are probably not the path to knowing that you know nothing.
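
To make the "I don't know" incentive concrete, here is a minimal sketch of one possible scoring scheme (an assumption for illustration, not something proposed in this thread): a correct answer earns +1, a wrong answer costs a penalty, and abstaining earns 0. The names `p_correct` and `wrong_penalty` are hypothetical, and the sketch assumes a calibrated `p_correct` is available, which is exactly the part that remains unsolved.

```python
def expected_answer_reward(p_correct: float, wrong_penalty: float) -> float:
    """Expected reward for answering: +1 if right, -wrong_penalty if wrong."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def should_abstain(p_correct: float, wrong_penalty: float = 1.0) -> bool:
    """Abstaining scores 0, so it only wins when answering has negative expected reward."""
    return expected_answer_reward(p_correct, wrong_penalty) < 0.0


def abstain_threshold(wrong_penalty: float) -> float:
    """Confidence below which abstaining beats answering under this scheme."""
    return wrong_penalty / (1.0 + wrong_penalty)


if __name__ == "__main__":
    for p in (0.3, 0.8):
        print(p, "abstain" if should_abstain(p) else "answer")
```

Under this scheme, saying "I don't know" to random questions is never rewarded by itself; it only pays off when the estimated chance of being right falls below the threshold.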

replies(1): >>45770706 #

12. hshdhdhehd ◴[31 Oct 25 10:53 UTC] No.45770581[source]▶

>>45770402 #

Gold and shovels might be a more fitting analogy for AI

13. ffsm8 ◴[31 Oct 25 10:58 UTC] No.45770627{3}[source]▶

>>45770433 #

> For example, how many humans can write hundreds of lines of code (in seconds mind you) and regularly not have any syntax errors or bugs?

Ez, just use codegen.

Also the second part (not having bugs) is unlikely to be true for the LLM generated code, whereas traditional codegen will actually generate code with pretty much no bugs.

replies(2): >>45770818 #>>45778731 #

14. ukuina ◴[31 Oct 25 11:09 UTC] No.45770706{4}[source]▶

>>45770566 #

> if LLMs "knew" when they're out of their depth, they could be much more useful.

I used to think this, but I'm no longer sure.

Large-scale tasks just grind to a halt with more modern LLMs because of this perception of impassable complexity.

And it's not that they need extensive planning, the LLM knows what needs to be done (it'll even tell you!), it's just more work than will fit within a "session" (arbitrary) and so it would rather refuse than get started.

So you're now looking at TODOs, and hierarchical plans, and all this unnecessary pre-work even when the task scales horizontally very well (if it just jumped into it).

15. arethuza ◴[31 Oct 25 11:09 UTC] No.45770707{3}[source]▶

>>45770452 #

You don't need a "general hammer" - they are old fashioned - you need a "general-purpose tool-building factory factory factory":

https://www.danstroot.com/posts/2018-10-03-hammer-factories

replies(1): >>45771103 #

16. vidarh ◴[31 Oct 25 11:24 UTC] No.45770818{4}[source]▶

>>45770627 #

I have Claude reducing the number of bugs in my traditional codegen right now.

17. benterix ◴[31 Oct 25 11:30 UTC] No.45770851{3}[source]▶

>>45770398 #

This seems to be the stance of creators of agentic coders. They are so bent on creating something, even if this something makes no sense whatsoever.

18. moritzwarhier ◴[31 Oct 25 11:32 UTC] No.45770868[source]▶

>>45770281 #

Ah yes, the function that halts if the input problem would take too long to halt.

But yes, I assume you mean they abort their loop after a while, which they do.
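
A minimal sketch of what "abort their loop after a while" can look like in practice, assuming a hypothetical `step` function and `is_done` check (neither comes from any real model API). It does not decide whether the problem would ever halt; it simply gives up once a fixed step budget is spent.

```python
from typing import Callable, Optional, TypeVar

S = TypeVar("S")


def run_with_budget(
    step: Callable[[S], S],
    is_done: Callable[[S], bool],
    state: S,
    max_steps: int,
) -> Optional[S]:
    """Apply `step` until `is_done` holds or the budget runs out.

    Returns the final state on success, or None if the budget was exhausted.
    """
    for _ in range(max_steps):
        if is_done(state):
            return state
        state = step(state)
    # One last check after the final step is taken.
    return state if is_done(state) else None


def collatz_step(n: int) -> int:
    """Illustrative workload: one step of the Collatz iteration."""
    return n // 2 if n % 2 == 0 else 3 * n + 1


if __name__ == "__main__":
    print(run_with_budget(collatz_step, lambda n: n == 1, 6, max_steps=100))
    print(run_with_budget(collatz_step, lambda n: n == 1, 6, max_steps=3))
```

A `None` result here only says "not finished within the budget", which is a weaker claim than "this is too complex".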

This whole idea of a "reasoning benchmark" doesn't sit well with me. It still doesn't seem well-defined to me.

Maybe it's just bias I have or my own lack of intelligence, but it seems to me that using language models for "reasoning" is still more or less a gimmick and convenience feature (to automate re-prompts, clarifications etc, as far as possible).

But reading this pop-sci article from summer 2022, it seems this definition problem hasn't changed very much since then.

Although it's about AI progress before ChatGPT and it doesn't even mention the GPT base models. Sure, some of the tasks mentioned in the article seem dated today.

But IMO, there is still no AI model that can be trusted to, for example, accurately summarize a Wikipedia article.

Not all humans can do that either, sure. But humans are better at knowing what they don't know, and deciding what other humans can be trusted. And of course, none of this is an arithmetic or calculation task.

https://www.science.org/content/article/computers-ace-iq-tes...

19. LunaSea ◴[31 Oct 25 11:46 UTC] No.45770961{3}[source]▶

>>45770398 #

I would consider that detecting your own limits when trying to solve a problem is preferable to having the illusion of thinking that your solution is working and correct.

20. code_martial ◴[31 Oct 25 12:05 UTC] No.45771103{4}[source]▶

>>45770707 #

Reminds me of a 10 letter Greek word that starts with a k.

21. pessimizer ◴[31 Oct 25 14:05 UTC] No.45772112{3}[source]▶

>>45770433 #

> I’ve had models “redirect the problem to someone who has a greater likelihood of not failing”. Gemini in particular will do this when it runs into trouble.

I have too, and I sense that this is something that has been engineered in rather than coming up naturally. I like it very much and they should do it a lot more often. They're allergic to "I can't figure this out" but hearing "I can't figure this out" gives me the alert to help it over the hump.

> But when you’re talking about things related to failures and accuracy, they’re mostly superhuman.

Only if you consider speed to failure and inaccuracy. They're very much subhuman in output, but you can make them retry a lot in a short time, and refine what you're asking them each time to avoid the mistakes they're repeatedly making. But that's you doing the work.

22. aoeusnth1 ◴[31 Oct 25 19:45 UTC] No.45775941[source]▶

>>45770506 #

Others would beg to disagree that we should build a machine which can act as a human.

23. antonvs ◴[01 Nov 25 02:14 UTC] No.45778731{4}[source]▶

>>45770627 #

What's your point? Traditional codegen tools are inflexible in the extreme compared to what LLMs can do.

The realistic comparison is between humans and LLMs, not LLMs and codegen tools.

replies(1): >>45779464 #

24. ffsm8 ◴[01 Nov 25 05:30 UTC] No.45779464{5}[source]▶

>>45778731 #

The point was that the listed argument of producing tons of boilerplate code within a short period of time is a... pointless metric to cite