
214 points by optimalsolver | 6 comments
equinox_nl No.45770131
But I also fail catastrophically once a reasoning problem exceeds modest complexity.
replies(4): >>45770215 #>>45770281 #>>45770402 #>>45770506 #
monkeydust No.45770215
But you recognise you are likely to fail, and thus either don't respond or redirect the problem to someone with a greater likelihood of not failing.
replies(2): >>45770433 #>>45770440 #
1. antonvs No.45770433
I’ve had models “redirect the problem to someone who has a greater likelihood of not failing”. Gemini in particular will do this when it runs into trouble.

I don’t find all these claims that models are somehow worse than humans in such areas convincing. Yes, they’re worse in some respects. But when you’re talking about things related to failures and accuracy, they’re mostly superhuman.

For example, how many humans can write hundreds of lines of code (in seconds, mind you) and regularly not have any syntax errors or bugs?
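For what it's worth, the syntax half of that claim is cheap to check mechanically. A minimal sketch, assuming the generated code is Python (ast.parse validates syntax only; logic bugs sail through):

    import ast

    def syntax_ok(source: str) -> bool:
        """Return True if source parses as valid Python syntax."""
        try:
            ast.parse(source)
            return True
        except SyntaxError:
            return False

    # Hypothetical usage: pretend `generated` holds model output.
    generated = "def add(a, b):\n    return a + b\n"
    print(syntax_ok(generated))  # True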

replies(2): >>45770627 #>>45772112 #
2. ffsm8 No.45770627
> For example, how many humans can write hundreds of lines of code (in seconds, mind you) and regularly not have any syntax errors or bugs?

Ez, just use codegen.

Also, the second part (not having bugs) is unlikely to be true for LLM-generated code, whereas traditional codegen will actually generate code with pretty much no bugs.
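To be concrete about the contrast: "traditional codegen" here means deterministic template expansion, where the same spec always yields the same, syntactically valid code. A minimal sketch (the spec format and names are invented for illustration, not taken from any particular tool):

    # Deterministic codegen: a fixed template filled from a spec.
    # Same spec in -> byte-identical code out, every run.
    SPEC = {"name": "Point", "fields": ["x", "y"]}

    TEMPLATE = """class {name}:
        def __init__(self, {args}):
    {assigns}
    """

    def generate(spec: dict) -> str:
        args = ", ".join(spec["fields"])
        assigns = "\n".join(f"        self.{f} = {f}" for f in spec["fields"])
        return TEMPLATE.format(name=spec["name"], args=args, assigns=assigns)

    print(generate(SPEC))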

replies(2): >>45770818 #>>45778731 #
3. vidarh No.45770818
I have Claude reducing the number of bugs in my traditional codegen right now.
4. pessimizer No.45772112
> I’ve had models “redirect the problem to someone who has a greater likelihood of not failing”. Gemini in particular will do this when it runs into trouble.

I have too, and I sense this is something that was engineered in rather than emerging naturally. I like it very much and they should do it a lot more often. They're allergic to saying "I can't figure this out", but hearing "I can't figure this out" is exactly what alerts me to help the model over the hump.
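And once the model does say it, it's trivially actionable. A rough sketch of that escalation pattern, where ask_model and the give-up phrases are hypothetical stand-ins, not anything a real API guarantees:

    GIVE_UP_MARKERS = (
        "i can't figure this out",
        "i'm unable to solve",
        "beyond my capabilities",
    )

    def solve_or_escalate(problem: str, ask_model) -> str:
        """Try the model first; hand off to a human if it admits defeat."""
        answer = ask_model(problem)
        if any(marker in answer.lower() for marker in GIVE_UP_MARKERS):
            return "ESCALATED TO HUMAN: " + problem
        return answer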

> But when you’re talking about things related to failures and accuracy, they’re mostly superhuman.

Only if you count speed of failure and inaccuracy. They're very much subhuman in output quality, but you can make them retry many times in a short span, refining what you ask each time to avoid the mistakes they keep repeating. But that's you doing the work.
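That workflow is essentially the loop below: ask_model and looks_wrong are hypothetical stand-ins for the LLM call and for whatever check the human applies (tests, a linter, eyeballing). The accumulated notes are the human's refinements, which is the point:

    def retry_with_refinement(task: str, ask_model, looks_wrong, max_tries: int = 5) -> str:
        """Re-ask the model, appending a note about each observed mistake."""
        notes: list[str] = []
        answer = ""
        for _ in range(max_tries):
            prompt = task
            if notes:
                prompt += "\nAvoid these mistakes:\n" + "\n".join(notes)
            answer = ask_model(prompt)
            mistake = looks_wrong(answer)  # returns a description, or None if OK
            if mistake is None:
                return answer
            notes.append(mistake)
        return answer  # best effort after max_tries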

5. antonvs No.45778731
What's your point? Traditional codegen tools are inflexible in the extreme compared to what LLMs can do.

The realistic comparison is between humans and LLMs, not LLMs and codegen tools.

replies(1): >>45779464 #
6. ffsm8 No.45779464
The point was that the argument cited, producing tons of boilerplate code within a short period of time, is a... pointless metric to cite.