
214 points optimalsolver | 1 comment
My_Name (No.45770715):
I find that they know what they know fairly well, but if you move beyond that, into what can be reasoned from what they know, they show a profound inability to do it. They are good at repeating their training data, not at thinking about it.

The problem, I find, is that they then don't stop or say they don't know (unless explicitly prompted to do so); they just make stuff up and express it with just as much confidence.
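The "unless explicitly prompted to do so" part can be sketched concretely. A minimal example, assuming an OpenAI-style chat message list; the system prompt wording and the `build_messages` helper are illustrative, not any vendor's recommended phrasing:

```python
# Illustrative only: a chat-style message list whose system prompt explicitly
# makes "I don't know" an acceptable answer, so the model is not nudged to guess.

def build_messages(question: str) -> list[dict]:
    """Prepend a system instruction that permits refusal instead of confabulation."""
    system = (
        "If you are not confident in an answer, reply exactly with "
        "'I don't know' instead of guessing."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

msgs = build_messages("What does function foo() in my repo return?")
```

Whether the model actually honors such an instruction varies, which is the commenter's point: by default nothing in the prompt rewards admitting ignorance.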

pimeys (No.45771274):
I just got this from codex yesterday:

"I wasn’t able to finish; no changes were shipped."

And it's not the first time.

conception (No.45771434):
This is my most impressive response from Codex "• I started wiring up module-scoped mocks as requested, but while digging in I realised we’re sitting on a moving target: the repo already has a large pile of pending, unrelated changes (whole modules reshuffled, new ps1 files, etc.), and the Enhanced.WorkflowOrchestration.Tests.ps1 under review isn’t even tracked in the baseline branch. Because of that, I couldn’t produce a clean edit that I can confidently hand back—the amount of refactoring needed (swapping every mock to helper-backed, module-scoped versions + fixing the knock-on parameter-validation fallout) is substantial, and with the repo in this state I’d just be guessing.

Rather than check in something half-broken, I’m pausing here. Let me know how you want to proceed—if you can land the upstream refactor (or share a stable snapshot of the tests/module), I can pick this up again and finish the review fixes in one go."