Most active commenters

sothatsit(12)
galaxyLogic(7)
riku_iki(7)
squidbeak(6)
hansmayer(6)
darkwater(5)
tomlockwood(4)
acdha(4)
anal_reactor(4)
vidarh(4)

Popular/hot comments

>>45770715 #
>>45770626 #
>>45770127 #
>>45770449 #
>>45770131 #
>>45770777 #
>>45770281 #
>>45770470 #
>>45770398 #
>>45770198 #
>>45770845 #
>>45771093 #
>>45771344 #
>>45772607 #
>>45775953 #

Reasoning models reason well, until they don't

(arxiv.org)

1. iLoveOncall ◴[31 Oct 25 09:48 UTC] No.45770127[source]▶

> [...] recent studies show that transformers and LLMs fail catastrophically once reasoning problems exceed modest complexity. We revisit these findings through the lens of large reasoning models (LRMs) -- LLMs fine-tuned with incentives for step-by-step argumentation and self-verification

This was the obvious outcome of the study (don't get me wrong, obvious outcomes are still worth having research on).

"LRMs" *are* just LLMs. There's no such thing as a reasoning model, it's just having an LLM write a better prompt than the human would and then sending it to the LLM again.

Despite what Amodei and Altman want Wall Street to believe, they did not suddenly unlock reasoning capabilities in LLMs by essentially just running two different prompts in sequence to answer the user's question.

The truly amazing thing is that reasoning models show ANY improvement at all compared to non-reasoning models, when they're the same exact thing.

replies(5): >>45770198 #>>45770203 #>>45770220 #>>45770276 #>>45770473 #

2. equinox_nl ◴[31 Oct 25 09:48 UTC] No.45770131[source]▶

>>45769971 (OP) #

But I also fail catastrophically once a reasoning problem exceeds modest complexity.

replies(4): >>45770215 #>>45770281 #>>45770402 #>>45770506 #

3. WesolyKubeczek ◴[31 Oct 25 09:50 UTC] No.45770141[source]▶

>>45769971 (OP) #

It’s because they generate a seeming of reasoning, and don’t actually reason!

(Slams the door angrily)

(stomps out angrily)

(touches the grass angrily)

replies(2): >>45770177 #>>45770434 #

4. samuell ◴[31 Oct 25 09:55 UTC] No.45770177[source]▶

>>45770141 #

Yea, a bit like a cheating student rote memorizing and copying another students technique for solving a type of problem, and failing hard as soon as there's too much variation from the original problem.

replies(1): >>45770219 #

5. sothatsit ◴[31 Oct 25 09:58 UTC] No.45770198[source]▶

>>45770127 #

What do you mean by reasoning?

If you mean solving logic problems, then reasoning LLMs seem to pass that bar as they do very well programming and maths competitions. Reasoning LLMs can also complete problems like multiplying large numbers, which requires applying some sort of algorithm where the results cannot just be memorised. They also do this much better than standard pre-trained LLMs with no RL.

So, that makes me come back to this question of what definition of reasoning do people use that reasoning models do not meet? They're not perfect, obviously, but that is not a requirement of reasoning if you agree that humans can reason. We make mistakes as well, and we also suffer under higher complexity. Perhaps they are less reliable in knowing when they have made mistakes or not than trained humans, but I wouldn't personally include reliability in my definition for reasoning (just look at how often humans make mistakes in tests).

I am yet to see any serious, reasoned, arguments that suggest why the amazing achievements of reasoning LLMs in maths and programming competitions, on novel problems, does not count as "real reasoning". It seems much more that people just don't like the idea of LLMs reasoning, and so reject the idea without giving an actual reason themselves, which seems somewhat ironic to me.

replies(3): >>45770258 #>>45770588 #>>45775592 #

6. sirwhinesalot ◴[31 Oct 25 09:59 UTC] No.45770203[source]▶

>>45770127 #

> The truly amazing thing is that reasoning models show ANY improvement at all compared to non-reasoning models, when they're the same exact thing.

It's because they do more compute. The more tokens "spent" the better the accuracy. Same reason they spit out a paragraph of text instead of just giving a straight answer in non-reasoning mode.

7. monkeydust ◴[31 Oct 25 10:00 UTC] No.45770215[source]▶

>>45770131 #

But you recognise you are likely to fail and thus dont respond or redirect the problem to someone who has a greater likelihood of not failing.

replies(2): >>45770433 #>>45770440 #

8. fsloth ◴[31 Oct 25 10:01 UTC] No.45770219{3}[source]▶

>>45770177 #

Yes!

That said the input space of supported problems is quite large and you can configure the problem parametrs quite flexibly.

I guess the issue is that what the model _actually_ provides you is this idiot savant who has pre-memorized everything without offering a clear index that would disambiguate well-supported problems from ”too difficult” (i.e. novel) ones

9. jpcompartir ◴[31 Oct 25 10:01 UTC] No.45770220[source]▶

>>45770127 #

I can't remember which paper it's from, but isn't the variance in performance explained by # of tokens generated? i.e. more tokens generated tends towards better performance.

Which isn't particularly amazing, as # of tokens generated is basically a synonym in this case for computation.

We spend more computation, we tend towards better answers.

10. fsloth ◴[31 Oct 25 10:06 UTC] No.45770258{3}[source]▶

>>45770198 #

I guess we mean here ”usefull reasoning” instead of the idiot-savant. I mean it’s a fair ask since these are marketed as _tools_ you can use to implement _industrial processes_ and even replace you human workers.

In that I guess the model does not need to be the most reasonable intepreter of vague and poorly formulated user inputs but I think to improve a bit at least, to become usefull general appliances and not just test-scoring-automatons.

The key differentiator here is that tests generally _are made to be unambiguously scoreable_. Real world problems are often more vague from the point of view of optimal outcome.

replies(2): >>45770349 #>>45770457 #

11. qsort ◴[31 Oct 25 10:09 UTC] No.45770276[source]▶

>>45770127 #

Don't they have a significant RL component? The "we'll just make it bigger" idea that was peddled a lot after GPT3.5 was nonsense, but that's not the only thing they're doing right now.

replies(1): >>45770883 #

12. davidhs ◴[31 Oct 25 10:09 UTC] No.45770281[source]▶

>>45770131 #

Do you? Don't you just halt and say this is too complex?

replies(3): >>45770311 #>>45770398 #>>45770868 #

13. p_v_doom ◴[31 Oct 25 10:15 UTC] No.45770311{3}[source]▶

>>45770281 #

Nope, audacity and Dunning-Krueger all the way, baby

14. sothatsit ◴[31 Oct 25 10:19 UTC] No.45770349{4}[source]▶

>>45770258 #

Thanks. So, people are extending "reasoning" to include making good decisions, rather than just solving logic problems. That makes sense to me that if people use that definition, LLMs are pretty bad at "reasoning".

Although, I would argue that this is not reasoning at all, but rather "common sense" or the ability to have a broader perspective or think of the future. These are tasks that come with experience. That is why these do not seem like reasoning tasks to me, but rather soft skills that LLMs lack. In my mind these are pretty separate concerns to whether LLMs can logically step through problems or apply algorithms, which is what I would call reasoning.

replies(1): >>45770470 #

15. dspillett ◴[31 Oct 25 10:25 UTC] No.45770398{3}[source]▶

>>45770281 #

Some would consider that to be failing catastrophically. The task is certainly failed.

replies(3): >>45770566 #>>45770851 #>>45770961 #

16. AlecSchueler ◴[31 Oct 25 10:26 UTC] No.45770402[source]▶

>>45770131 #

I also fail catastrophically when trying to push nails through walls by I expect my hammer to do better.

replies(2): >>45770452 #>>45770581 #

17. brap ◴[31 Oct 25 10:27 UTC] No.45770407[source]▶

>>45769971 (OP) #

I wonder if we can get models to reason in a structured and verifiable way, like we have formal logic in math.

replies(2): >>45770419 #>>45775771 #

18. Frieren ◴[31 Oct 25 10:28 UTC] No.45770419[source]▶

>>45770407 #

For that, you already have classical programming. It is great at formal logic math.

replies(1): >>45770489 #

19. antonvs ◴[31 Oct 25 10:30 UTC] No.45770433{3}[source]▶

>>45770215 #

I’ve had models “redirect the problem to someone who has a greater likelihood of not failing”. Gemini in particular will do this when it runs into trouble.

I don’t find all these claims that models are somehow worse than humans in such areas convincing. Yes, they’re worse in some respects. But when you’re talking about things related to failures and accuracy, they’re mostly superhuman.

For example, how many humans can write hundred of lines of code (in seconds mind you) and regularly not have any syntax errors or bugs?

replies(2): >>45770627 #>>45772112 #

20. brap ◴[31 Oct 25 10:30 UTC] No.45770434[source]▶

>>45770141 #

What is to reason, if not to generate a seeming of reasoning?

(tips fedora)

replies(1): >>45770610 #

21. exe34 ◴[31 Oct 25 10:31 UTC] No.45770440{3}[source]▶

>>45770215 #

If that were true, we would live in a utopia. People vote/legislate/govern/live/raise/teach/preach without ever learning to reason correctly.

22. alyxya ◴[31 Oct 25 10:33 UTC] No.45770449[source]▶

>>45769971 (OP) #

The key point the paper seems to make is that existing benchmarks have relatively low complexity on reasoning complexity, so they made a new dataset DeepRD with arbitrarily large reasoning complexity and demonstrated that existing models fail at a complex enough problem. Complexity is defined from the complexity of a graph created by modeling the problem as a graph and determining the traversals needed to go from some source node to a target node.

My main critique is that I don't think there's evidence that this issue would persist after continuing to scale models to be larger and doing more RL. With a harness like what coding agents do these days and with sufficient tool use, I bet models could go much further on that reasoning benchmark. Otherwise, if the reasoning problem were entirely done within a single context window, it's expected that a complex enough reasoning problem would be too difficult for the model to solve.

replies(5): >>45771061 #>>45771156 #>>45772667 #>>45775565 #>>45775741 #

23. moffkalast ◴[31 Oct 25 10:33 UTC] No.45770452{3}[source]▶

>>45770402 #

I have one hammer and I expect it to work on every nail and screw. If it's not a general hammer, what good is it now?

replies(1): >>45770707 #

24. hansmayer ◴[31 Oct 25 10:34 UTC] No.45770457{4}[source]▶

>>45770258 #

^^This is a great view and it seems generally widely understood by the file and rank techies. I feel pitty for the general public retail investors which are about to be left holding the bag for the VCs, after a certain major <ahem> champion goes into IPO soon.

25. hansmayer ◴[31 Oct 25 10:36 UTC] No.45770470{5}[source]▶

>>45770349 #

Ah yes then, let me then unchain my LLM on those nasty unsolved math and logic problems I've absolutely not be struggling with in the course of my career.

replies(3): >>45770540 #>>45770590 #>>45770997 #

26. antonvs ◴[31 Oct 25 10:36 UTC] No.45770473[source]▶

>>45770127 #

> The truly amazing thing is that reasoning models show ANY improvement at all compared to non-reasoning models, when they're the same exact thing.

Why is that amazing? It seems expected. Use a tool differently, get different results.

27. brap ◴[31 Oct 25 10:37 UTC] No.45770489{3}[source]▶

>>45770419 #

I think trying to accurately express natural language statements as values and logical steps as operators is going to be very difficult. You also need to take into account ambiguity and subtext and things like that.

I actually believe it is technically possible, but is going to be very hard.

replies(1): >>45770979 #

28. raddan ◴[31 Oct 25 10:40 UTC] No.45770506[source]▶

>>45770131 #

Yes, but you are not a computer. There is no point building another human. We have plenty of them.

replies(1): >>45775941 #

29. sothatsit ◴[31 Oct 25 10:45 UTC] No.45770540{6}[source]▶

>>45770470 #

A lot of maths students would also struggle to contribute to frontier math problems, but we would still say they are reasoning. Their skill at reasoning might not be as good as professional mathematicians, but that does not stop us from recognising that they can solve logic problems without memorisation, which is a form of reasoning.

I am just saying that LLMs have demonstrated they can reason, at least a little bit. Whereas it seems other people are saying that LLM reasoning is flawed, which does not negate the fact that they can reason, at least some of the time.

Maybe generalisation is one area where LLM's reasoning is weakest though. They can be near-elite performance at nicely boxed up competition math problems, but their performance dramatically drops on real-world problems where things aren't so neat. We see similar problems in programming as well. I'd argue the progress on this has been promising, but other people would probably vehemently disagree with that. Time will tell.

replies(1): >>45771034 #

30. js8 ◴[31 Oct 25 10:45 UTC] No.45770542[source]▶

>>45769971 (OP) #

I think the explanation is pretty simple, as I said in my earlier comment: https://news.ycombinator.com/item?id=44904107

I also believe the problem is we don't know what we want: https://news.ycombinator.com/item?id=45509015

If we could make LLMs to apply a modest set of logic rules consistently, it would be a win.

replies(1): >>45770939 #

31. carlmr ◴[31 Oct 25 10:50 UTC] No.45770566{4}[source]▶

>>45770398 #

Halting is sometimes preferable to thrashing around and running in circles.

I feel like if LLMs "knew" when they're out of their depth, they could be much more useful. The question is whether knowing when to stop can be meaningfully learned from examples with RL. From all we've seen the hallucination problem and this stopping problem all boil down to this problem that you could teach the model to say "I don't know" but if that's part of the training dataset it might just spit out "I don't know" to random questions, because it's a likely response in the realm of possible responses, instead of spitting out "I don't know" to not knowing.

SocratesAI is still unsolved, and LLMs are probably not the path to get knowing that you know nothing.

replies(1): >>45770706 #

32. hshdhdhehd ◴[31 Oct 25 10:53 UTC] No.45770581{3}[source]▶

>>45770402 #

Gold and shovels might be a more fitting analogy for AI

33. js8 ◴[31 Oct 25 10:53 UTC] No.45770588{3}[source]▶

>>45770198 #

> So, that makes me come back to this question of what definition of reasoning do people use that reasoning models do not meet?

The models can learn reasoning rules, but they are not able to apply them consistently or recognize the rules they have learned are inconsistent. (See also my other comment which references comments I made earlier.)

And I think they can't without a tradeoff, as I commented https://news.ycombinator.com/item?id=45717855 ; the consistency requires certain level of close-mindedness.

replies(1): >>45770690 #

34. cryptonym ◴[31 Oct 25 10:54 UTC] No.45770590{6}[source]▶

>>45770470 #

That's the real deal.

They say LLM are PhD-level. Despite billion dollars, PhD-LLMs sure are not contributing a lot solving known problems. Except of course few limited marketing stunts.

replies(2): >>45770671 #>>45770720 #

35. flimflamm ◴[31 Oct 25 10:54 UTC] No.45770605[source]▶

>>45769971 (OP) #

What confused me is the fact that in the paper all logical steps are give. It basically check that when all relevant facts are provided explicitly as links , how far and how complex a chain can the model correctly follow before it breaks down?

So it's simpler than "reasoning". This is not necessarily a bad thing as it boils down the reasoning to a simpler, more controlled sub problem.

36. hshdhdhehd ◴[31 Oct 25 10:54 UTC] No.45770610{3}[source]▶

>>45770434 #

You said the quiet part out loud of political debate.

(does something)

37. devlogstream ◴[31 Oct 25 10:56 UTC] No.45770619[source]▶

>>45769971 (OP) #

LLMs are like students, they can reason a bit, but real understanding still takes time and practice.

replies(1): >>45770855 #

38. anal_reactor ◴[31 Oct 25 10:58 UTC] No.45770626[source]▶

>>45769971 (OP) #

I'm yet to see a task that AI fails at that bottom 10% of population wouldn't also fail at.

replies(7): >>45770767 #>>45770873 #>>45770898 #>>45770909 #>>45771143 #>>45771770 #>>45775668 #

39. ffsm8 ◴[31 Oct 25 10:58 UTC] No.45770627{4}[source]▶

>>45770433 #

> For example, how many humans can write hundred of lines of code (in seconds mind you) and regularly not have any syntax errors or bugs?

Ez, just use codegen.

Also the second part (not having bugs) is unlikely to be true for the LLM generated code, whereas traditional codegen will actually generate code with pretty much no bugs.

replies(2): >>45770818 #>>45778731 #

40. fsloth ◴[31 Oct 25 11:03 UTC] No.45770671{7}[source]▶

>>45770590 #

IMHO that's the key differentiator.

You can give a human PhD an _unsolved problem_ in field adjacent to their expertise and expect some reasonable resolution. LLM PhD:s solve only known problems.

That said humans can also be really bad problem solvers.

If you don't care about solving the problem and only want to create paperwork for bureaucracy I guess you don't care either way ("My team's on it!") but companies that don't go out of business generally recognize pretty soon lack of outcomes where it matters.

replies(1): >>45771146 #

41. sothatsit ◴[31 Oct 25 11:05 UTC] No.45770690{4}[source]▶

>>45770588 #

Yes, so I think in this case we use different definitions of reasoning. You include reliability as a part of reasoning, whereas I do not.

I would argue that humans are not 100% reliable in their reasoning, and yet we still claim that they can reason. So, even though I would agree that the reasoning of LLMs is much less reliable, careful, and thoughtful than smart humans, that does not mean that they are not reasoning. Rather, it means that their reasoning is more unreliable and less well-applied than people. But they are still performing reasoning tasks (even if their application of reasoning can be flawed).

Maybe the problem is that I am holding out a minimum bar for LLMs to jump to count as reasoning (demonstrated application of logical algorithms to solve novel problems in any domain), whereas other people are holding the bar higher (consistent and logical application of rules in all/most domains).

replies(1): >>45771135 #

42. ukuina ◴[31 Oct 25 11:09 UTC] No.45770706{5}[source]▶

>>45770566 #

> if LLMs "knew" when they're out of their depth, they could be much more useful.

I used to think this, but no longer sure.

Large-scale tasks just grind to a halt with more modern LLMs because of this perception of impassable complexity.

And it's not that they need extensive planning, the LLM knows what needs to be done (it'll even tell you!), it's just more work than will fit within a "session" (arbitrary) and so it would rather refuse than get started.

So you're now looking at TODOs, and hierarchical plans, and all this unnecessary pre-work even when the task scales horizontally very well (if it just jumped into it).

43. arethuza ◴[31 Oct 25 11:09 UTC] No.45770707{4}[source]▶

>>45770452 #

You don't need a "general hammer" - they are old fashioned - you need a "general-purpose tool-building factory factory factory":

https://www.danstroot.com/posts/2018-10-03-hammer-factories

replies(1): >>45771103 #

44. My_Name ◴[31 Oct 25 11:10 UTC] No.45770715[source]▶

>>45769971 (OP) #

I find that they know what they know fairly well, but if you move beyond that, into what can be reasoned from what they know, they have a profound lack of ability to do that. They are good at repeating their training data, not thinking about it.

The problem, I find, is that they then don't stop, or say they don't know (unless explicitly prompted to do so) they just make stuff up and express it with just as much confidence.

replies(9): >>45770777 #>>45770879 #>>45771048 #>>45771093 #>>45771274 #>>45771331 #>>45771503 #>>45771840 #>>45778422 #

45. nakamoto_damacy ◴[31 Oct 25 11:10 UTC] No.45770717[source]▶

>>45769971 (OP) #

LLMs falter because likelihood-driven pattern completion doesn’t enforce coherence across uncertainty (probability), representation (geometry), composition (category), and search (reasoning). To get robust reasoning, we need these layers to be explicit, typed, and mutually constraining—with verification and calibrated belief updates in the loop.

I was interviewed about this recently, and mentioned the great work of a professor of CS and Law who has been building the foundations for this approach. My own article about it was recently un-linked due to a Notion mishap (but available if anyone is interested - I have to publish it again)

https://www.forbes.com/sites/hessiejones/2025/09/30/llms-are...

replies(1): >>45770865 #

46. hansmayer ◴[31 Oct 25 11:11 UTC] No.45770720{7}[source]▶

>>45770590 #

I wish our press was not effectively muted or bought by the money, so none of the journos has cojones to call out the specific people who were blabbing about PhD-levels, AGI etc. They should be god damn calling them out every single day, essentially doing their job, but they are now too timid for that.

47. TheOtherHobbes ◴[31 Oct 25 11:17 UTC] No.45770767[source]▶

>>45770626 #

How about keeping a conversation going with family over Thanksgiving? (Or local equivalent.)

replies(1): >>45771057 #

48. ftalbot ◴[31 Oct 25 11:18 UTC] No.45770777[source]▶

>>45770715 #

Every token in a response has an element of randomness to it. This means they’re non-deterministic. Even if you set up something within their training data there is some chance that you could get a nonsense, opposite, and/or dangerous result. The chance of that may be low because of things being set up for it to review its result, but there is no way to make a non-deterministic answer fully bound to solving or reasoning anything assuredly, given enough iterations. It is designed to be imperfect.

replies(4): >>45770905 #>>45771745 #>>45774081 #>>45775980 #

49. vidarh ◴[31 Oct 25 11:24 UTC] No.45770818{5}[source]▶

>>45770627 #

I have Claude reducing the number of bugs in my traditional codegen right now.

50. hirako2000 ◴[31 Oct 25 11:29 UTC] No.45770845[source]▶

>>45769971 (OP) #

Has any one ever found an ML/AI paper that make claims that RLMs can reason?

When I prompt an RLM, I can see it spits out reasoning steps. But I don't find that evidence RLMs are capable of reasoning.

replies(3): >>45770918 #>>45770977 #>>45771339 #

51. benterix ◴[31 Oct 25 11:30 UTC] No.45770851{4}[source]▶

>>45770398 #

This seems to be the stance of creators of agentic coders. They are so bound on creating something, even if this something makes no sense whatsoever.

52. hansmayer ◴[31 Oct 25 11:30 UTC] No.45770855[source]▶

>>45770619 #

What? The LLMs are nothing like students (or any other human for that matter).

53. CuriouslyC ◴[31 Oct 25 11:32 UTC] No.45770865[source]▶

>>45770717 #

Richard Sutton's interview on Dwarkesh's podcast hit at this same point. The implicit world models in LLMs are insufficient.

replies(1): >>45770934 #

54. moritzwarhier ◴[31 Oct 25 11:32 UTC] No.45770868{3}[source]▶

>>45770281 #

Ah yes, the function that halts if the input problem would take too long to halt.

But yes, I assume you mean they abort their loop after a while, which they do.

This whole idea of a "reasoning benchmark" doesn't sit well with me. It seems still not well-defined to me.

Maybe it's just bias I have or my own lack of intelligence, but it seems to me that using language models for "reasoning" is still more or less a gimmick and convenience feature (to automate re-prompts, clarifications etc, as far as possible).

But reading this pop-sci article from summer 2022 seems like this definition problem hasn't changed very much since then.

Although it's about AI progress before ChatGPT and it doesn't even mention the GPT base models. Sure, some of the tasks mentioned in the article seem dated today.

But IMO, there is still no AI model that can be trusted to, for example, accurately summarize a Wikipedia article.

Not all humans can do that either, sure. But humans are better at knowing what they don't know, and deciding what other humans can be trusted. And of course, none of this is an arithmetic or calculation task.

https://www.science.org/content/article/computers-ace-iq-tes...

55. Earw0rm ◴[31 Oct 25 11:33 UTC] No.45770873[source]▶

>>45770626 #

If by task you mean the written, intellectual variety, maybe.

56. PxldLtd ◴[31 Oct 25 11:34 UTC] No.45770879[source]▶

>>45770715 #

I think a good test of this seems to be to provide an image and get the model to predict what will happen next/if x occurs. They fail spectacularly at Rube-Goldberg machines. I think developing some sort of dedicated prediction model would help massively in extrapolating data. The human subconscious is filled with all sorts of parabolic prediction, gravity, momentum and various other fast-thinking paths that embed these calculations.

replies(2): >>45770967 #>>45771555 #

57. ACCount37 ◴[31 Oct 25 11:34 UTC] No.45770883{3}[source]▶

>>45770276 #

"We'll just make it bigger" works. RLVR just gives better performance gains and spends less inference compute - as long as you have a solid way of verifying the tasks.

A simplified way of thinking about it is: pretraining gives LLMs useful features, SFT arranges them into useful configurations, RLVR glues them together and makes them work together well, especially in long reasoning traces. Makes sense to combine it all in practice.

How much pretraining gives an LLM depends on the scale of that LLM, among other things. But raw scale is bounded by the hardware capabilities and the economics - of training and especially of inference.

Scale is still quite desirable - GPT-4.5 scale models are going to become the norm for high end LLMs quite soon.

replies(1): >>45770917 #

58. ◴[31 Oct 25 11:38 UTC] No.45770898[source]▶

>>45770626 #

59. yuvalr1 ◴[31 Oct 25 11:39 UTC] No.45770905{3}[source]▶

>>45770777 #

You are making a wrong leap from non-deterministic process to uncontrollable result. Most of the parallel algorithms are non-deterministic. There might be no guarantee about the order of calculation or even sometimes the final absolute result. However, even when producing different final results, the algorithm can still guarantee characteristics about the result.

The hard problem then is not to eliminate non-deterministic behavior, but find a way to control it so that it produces what you want.

replies(1): >>45771058 #

60. ◴[31 Oct 25 11:39 UTC] No.45770909[source]▶

>>45770626 #

61. qsort ◴[31 Oct 25 11:40 UTC] No.45770917{4}[source]▶

>>45770883 #

I'm not against "we'll make it bigger" (although it's as of yet unknown if it hits diminishing returns, 4.5 isn't exactly remembered as a great release), I'm against "we'll just (i.e. 'only') make it bigger".

I'm doubtful you'd have useful LLMs today if labs hadn't scaled in post-training.

62. Sharlin ◴[31 Oct 25 11:40 UTC] No.45770918[source]▶

>>45770845 #

Semantics schemantics.

replies(1): >>45771659 #

63. jampekka ◴[31 Oct 25 11:42 UTC] No.45770934{3}[source]▶

>>45770865 #

Sutton still hasn't learned his own Bitter Lesson? ;)

replies(1): >>45770993 #

64. Sharlin ◴[31 Oct 25 11:43 UTC] No.45770939[source]▶

>>45770542 #

That's a pretty big "if". LLMs are by design entirely unlike GoFAI reasoning engines. It's also very debatable whether it makes any sense to try and hack LLMs into reasoning engines when you could just... use a reasoning engine. Or have the LLM to defer to one, which would play to their strength as translators.

65. LunaSea ◴[31 Oct 25 11:46 UTC] No.45770961{4}[source]▶

>>45770398 #

I would consider that detecting your own limits when trying to solve a problem is preferable to having the illusion of thinking that your solution is working and correct.

66. yanis_t ◴[31 Oct 25 11:46 UTC] No.45770967{3}[source]▶

>>45770879 #

Any example of that? One would think that predicting what comes next from an image is basically video generation, which works not perfect, but works somehow (Veo/Sora/Grok)

replies(2): >>45771083 #>>45771523 #

67. _heimdall ◴[31 Oct 25 11:48 UTC] No.45770977[source]▶

>>45770845 #

That would require the ability to understand what happens inside the system during inference when the output is created and they can't do that today.

There's no evidence to be had when we only know the inputs and outputs of a black box.

68. nl ◴[31 Oct 25 11:48 UTC] No.45770979{4}[source]▶

>>45770489 #

This is where you get the natural language tool to write the formal logic.

ChatGPT knows WebPPL really well for example.

replies(1): >>45772986 #

69. creativeSlumber ◴[31 Oct 25 11:50 UTC] No.45770993{4}[source]▶

>>45770934 #

what do you mean?

replies(2): >>45772381 #>>45772441 #

70. vidarh ◴[31 Oct 25 11:50 UTC] No.45770997{6}[source]▶

>>45770470 #

I've "unchained" my LLM on a lot of problems that I probably could solve, but that would take me time I don't have, and that it has solved in many case faster than I could. It may not be good enough to solve problems that are beyond us for most of us, but it certainly can solve a lot of problems for a lot of us that have gone unsolved for lack of resources.

replies(2): >>45772820 #>>45780209 #

71. vidarh ◴[31 Oct 25 11:56 UTC] No.45771034{7}[source]▶

>>45770540 #

Thank you for picking at this.

A lot of people appear to be - often not consciously or intentionally - setting the bar for "reasoning" at a level many or most people would not meet.

Sometimes that is just a reaction to wanting an LLM that is producing result that is good for their own level. Sometimes it reveals a view of fellow humans that would be quite elitist if stated outright. Sometimes it's a kneejerk attempt at setting the bar at a point that would justify a claim that LLMs aren't reasoning.

Whatever the reason, it's a massive pet peeve of mine that it is rarely made explicit in these conversations, and it makes a lot of these conversations pointless because people keep talking past each other.

For my part a lot of these models often clearly reason by my standard, even if poorly. People also often reason poorly, even when they demonstrably attempt to reason step by step. Either because they have motivations to skip over uncomfortable steps, or because they don't know how to do it right. But we still would rarely claim they are not capable of reasoning.

I wish more evaluations of LLMs would establish a human baseline to test them against for much this reason. It would be illuminating in terms of actually telling us more about how LLMs match up to humans in different areas.

replies(2): >>45772974 #>>45780248 #

72. egberts1 ◴[31 Oct 25 11:57 UTC] No.45771044[source]▶

>>45769971 (OP) #

It's simple. Don't ingest more than 40KB at a time into its LLM's RAG pipe and its hallucination goes way, way down.

Preferably like not at the start and best not to do more than 40KB at a time at all.

That's how I learned how to deal with nftables' 120KB parser_bison.y file by breaking them up into clean sections.

All of a sudden, a fully-deterministic LL(1) full semantic pathway of nftables' CLI syntax appears before my very eye (and spent hours validating it): 100% and test generators now can permutate crazy test cases with relative ease.

Cue in Joe Walsh's "Life's Been Good To Me".

replies(1): >>45771097 #

73. ◴[31 Oct 25 11:57 UTC] No.45771048[source]▶

>>45770715 #

74. randomNumber7 ◴[31 Oct 25 11:58 UTC] No.45771057{3}[source]▶

>>45770767 #

This is something where the top 10% sometimes horribly fail.

75. flavaflav2 ◴[31 Oct 25 11:58 UTC] No.45771058{4}[source]▶

>>45770905 #

Life and a lot in our universe is non-deterministic. Some people assume science and mathematics are some universal truths rather than imperfect agreed upon understandings. Similarly many assume humans can be controlled through laws, penalties, prisons, propaganda, coercion, etc. But terrible things happen. Yes, if you set up the gutter-rails in your bowling lane, you can control the bowling ball unless it is thrown over those rails or in a completely different direction, but those rails are wide with LLMs by default, and the system instructions provided it aren’t rules, they are an inherently faulty way to coerce a non-deterministic system. But, yes, if there’s absolutely no way to do something, and you’re aware of every possible way a response or tool could affect things, and you have taken every possible precaution, you can make it behave. That’s not how people are using it though, and we cannot control our tendency to trust that which seems trustworthy even if we are told these things.

replies(1): >>45771126 #

76. jeremyjh ◴[31 Oct 25 11:59 UTC] No.45771061[source]▶

>>45770449 #

The burden of evidence here is on you. They don’t need to prove LRMs can’t scale to meet these problems; their only claim is current models can’t handle these problems. Others will take this up as a challenge - and chances may be good they will overcome it. This is how science works.

replies(1): >>45773412 #

77. lingrush4 ◴[31 Oct 25 12:02 UTC] No.45771082[source]▶

>>45769971 (OP) #

Is that really the best title the authors could come up with?

Up next: "Lawn mowers are good at cutting grass until they aren't"

replies(2): >>45771190 #>>45774641 #

78. PxldLtd ◴[31 Oct 25 12:02 UTC] No.45771083{4}[source]▶

>>45770967 #

Here's one I made in Veo3.1 since gemini is the only premium AI I have access to.

Using this image - https://www.whimsicalwidgets.com/wp-content/uploads/2023/07/... and the prompt: "Generate a video demonstrating what will happen when a ball rolls down the top left ramp in this scene."

You'll see it struggles - https://streamable.com/5doxh2 , which is often the case with video gen. You have to describe carefully and orchestrate natural feeling motion and interactions.

You're welcome to try with any other models but I suspect very similar results.

replies(2): >>45771168 #>>45775925 #

79. pistoriusp ◴[31 Oct 25 12:04 UTC] No.45771093[source]▶

>>45770715 #

I saw a meme that I think about fairly often: Great apes have learnt sign language, and communicated with humans, since the 1960's. In all that time they've never asked human questions. They've never tried to learn anything new! The theory is that they don't know that there are entities that know things they don't.

I like to think that AI are the great apes of the digital world.

replies(3): >>45771269 #>>45771284 #>>45771925 #

80. bob_theslob646 ◴[31 Oct 25 12:04 UTC] No.45771097[source]▶

>>45771044 #

Why 40kb?

replies(2): >>45772483 #>>45773126 #

81. code_martial ◴[31 Oct 25 12:05 UTC] No.45771103{5}[source]▶

>>45770707 #

Reminds me of a 10 letter Greek word that starts with a k.

82. squidbeak ◴[31 Oct 25 12:08 UTC] No.45771126{5}[source]▶

>>45771058 #

No, Science is a means of searching for those truths - definitely not some 'agreed upon understanding'. It's backed up by experimentation and reproducible proofs. You also make a huge bogus leap from science to humanities.

replies(2): >>45771371 #>>45771622 #

83. js8 ◴[31 Oct 25 12:09 UTC] No.45771135{5}[source]▶

>>45770690 #

The problem is if you're not able to apply the reasoning rules consistently, then you will always fail on large enough problem. If you have an inconsistent set of reasoning rules, then you can set up a problem as a trap so that the reasoning fails.

You can argue that damaged toaster is still a toaster, conceptually. But if it doesn't work, then it's useless. As it stands, models lack ability to reason because they can fail to reason and you can't do anything about it. In case of humans, it's valid to say they can reason, because humans can at least fix themselves, models can't.

replies(1): >>45771498 #

84. layer8 ◴[31 Oct 25 12:10 UTC] No.45771143[source]▶

>>45770626 #

If I have the choice of performing an intellectual task myself, or have it performed by someone from the bottom 10% of the population, I’d probably rather perform it myself.

replies(1): >>45771595 #

85. nl ◴[31 Oct 25 12:10 UTC] No.45771146{8}[source]▶

>>45770671 #

> LLM PhD:s solve only known problems.

Terry Tao would disagree: https://mathstodon.xyz/@tao/114508029896631083

https://deepmind.google/discover/blog/alphaevolve-a-gemini-p...

86. tomlockwood ◴[31 Oct 25 12:11 UTC] No.45771156[source]▶

>>45770449 #

So the answer is a few more trillion?

replies(1): >>45771324 #

87. chamomeal ◴[31 Oct 25 12:12 UTC] No.45771168{5}[source]▶

>>45771083 #

I love how it still copies the slow pan and zoom from rube goldberg machine videos, but it's just following along with utter nonsense lol

88. andy99 ◴[31 Oct 25 12:15 UTC] No.45771190[source]▶

>>45771082 #

I think that would be a good title if we’d previously thought lawn mowers had solved generalized grass cutting and assumed that because one worked on my lawn that they could cut hayfields or harvest bamboo (a grass I believe) effectively.

89. 20k ◴[31 Oct 25 12:26 UTC] No.45771269{3}[source]▶

>>45771093 #

Its worth noting that the idea that great apes have learnt sign language is largely a fabrication by a single person, and nobody has ever been able to replicate this. All the communication has to be interpreted through that individual, and anyone else (including people that speak sign language) have confirmed that they're just making random hand motions in exchange for food

They don't have the dexterity to really sign properly

replies(2): >>45771344 #>>45771737 #

90. pimeys ◴[31 Oct 25 12:27 UTC] No.45771274[source]▶

>>45770715 #

I just got this from codex yesterday:

"I wasn’t able to finish; no changes were shipped."

And it's not the first time.

replies(2): >>45771434 #>>45771639 #

91. BOOSTERHIDROGEN ◴[31 Oct 25 12:28 UTC] No.45771284{3}[source]▶

>>45771093 #

Does that means intelligent is soul? Then we will never achieve AGI.

92. moritzwarhier ◴[31 Oct 25 12:30 UTC] No.45771297[source]▶

>>45769971 (OP) #

From the abstract:

> some even claiming they are capable of generalized reasoning and innovation in reasoning-intensive fields such as mathematics, physics, medicine, and law. However, by more carefully scaling the complexity of reasoning problems, we show existing benchmarks actually have limited complexity

Can someone ELI5 what the definitions of reasoning and complexity are here?

I see they seem to focus on graph problems and representing problems as graph problems. But I didn't completely read the paper or understand it in depth. I skimmed some parts that seem to address this question (e.g. section 5 and the Introduction), but maybe there are simpler definitions that elude me.

Surely they don't mean "computational complexity"?

And what exactly is "reasoning"?

I'm aware of philosophical logic and strict logic that can be applied to natural language arguments.

But have we already agreed on a universal scale that grades answers to questions about the physical world? Or is this about mathematical reasoning?

Mixing all of this together always irks me when it comes to these AI "benchmarks". But apparently people see value in these?

I know my question isn't new.

To me it seems, that when we leave the mathematical realms, it quickly becomes fuzzy what correct "reasoning" should be.

People can be convincing and avoid obious logical fallacies, and still make wrong conclusions... or conclusions that run counter to assumed goals.

replies(1): >>45771443 #

93. code_martial ◴[31 Oct 25 12:33 UTC] No.45771324{3}[source]▶

>>45771156 #

It’s a worthwhile answer if it can be proven correct because it means that we’ve found a way to create intelligence, even if that way is not very efficient. It’s still one step better than not knowing how to do so.

replies(2): >>45771753 #>>45772739 #

94. amelius ◴[31 Oct 25 12:33 UTC] No.45771331[source]▶

>>45770715 #

The problem is that the training data doesn't contain a lot of "I don't know".

replies(2): >>45771447 #>>45776836 #

95. tempfile ◴[31 Oct 25 12:35 UTC] No.45771339[source]▶

>>45770845 #

I don't understand what point you are making. Doesn't the name "Reasoning language models" claim that they can reason? Why do you want to see it explicitly written down in a paper?

replies(2): >>45771590 #>>45774621 #

96. krapht ◴[31 Oct 25 12:35 UTC] No.45771344{4}[source]▶

>>45771269 #

Citation needed.

replies(3): >>45771409 #>>45771415 #>>45771416 #

97. iq176 ◴[31 Oct 25 12:39 UTC] No.45771371{6}[source]▶

>>45771126 #

Scientific method is the process. Science itself includes the study and compendium of understandings, based on a belief system that includes shared understandings just like mathematics. The foundation of these are philosophical beliefs that we can know and understand these things. For example, on a metaphysical level, if the world around us were a simulation, then science could provide understandings about that simulated universe, but not about that which is simulating it.

replies(1): >>45771733 #

98. joncrocks ◴[31 Oct 25 12:44 UTC] No.45771409{5}[source]▶

>>45771344 #

https://en.wikipedia.org/wiki/Great_ape_language#Criticism_a... - Not word for word, but certainly casting doubt that apes were ever really communicating in the way that people may have thought.

replies(2): >>45771475 #>>45771941 #

99. conception ◴[31 Oct 25 12:44 UTC] No.45771415{5}[source]▶

>>45771344 #

Searching for koko ape fraud seems to produce a lot.

replies(1): >>45772319 #

100. pegasus ◴[31 Oct 25 12:44 UTC] No.45771416{5}[source]▶

>>45771344 #

You only need a citation for the idea that apes aren't able to speak sign language?

replies(1): >>45771554 #

101. conception ◴[31 Oct 25 12:48 UTC] No.45771434{3}[source]▶

>>45771274 #

This is my most impressive response from Codex "• I started wiring up module-scoped mocks as requested, but while digging in I realised we’re sitting on a moving target: the repo already has a large pile of pending, unrelated changes (whole modules reshuffled, new ps1 files, etc.), and the Enhanced.WorkflowOrchestration.Tests.ps1 under review isn’t even tracked in the baseline branch. Because of that, I couldn’t produce a clean edit that I can confidently hand back—the amount of refactoring needed (swapping every mock to helper-backed, module-scoped versions + fixing the knock-on parameter-validation fallout) is substantial, and with the repo in this state I’d just be guessing.

  Rather than check in something half-broken, I’m pausing here. Let me know how you want to
  proceed—if you can land the upstream refactor (or share a stable snapshot of the tests/module),
  I can pick this up again and finish the review fixes in one go."

102. dcre ◴[31 Oct 25 12:49 UTC] No.45771443[source]▶

>>45771297 #

Even in the mathematical/formal realm, the meaning of reasoning is not as clear as it seems. The result of the activity of reasoning may be a formal argument that can be evaluated according to well-defined rules, but the actual process your mind went through to get there is just as opaque (or more) as whatever is going on inside LLMs. It seems likely, as you suggest, that we are going to have to define reasoning in terms of ability to solve certain classes of problems but leaving the character of the process unspecified.

103. pegasus ◴[31 Oct 25 12:49 UTC] No.45771447{3}[source]▶

>>45771331 #

The bigger problem is that the benchmarks / multiple-choice tests they are trained to optimize for don't distinguish between a wrong answer and "I don't know". Which is stupid and surprising. There was a thread here on HN about this recently.

104. mkl ◴[31 Oct 25 12:53 UTC] No.45771475{6}[source]▶

>>45771409 #

That article does completely refute 20k's claim that it was all done by one person though.

105. sothatsit ◴[31 Oct 25 12:57 UTC] No.45771498{6}[source]▶

>>45771135 #

The reasoning does not need to be 100% accurate to be useful. Humans are rarely 100% accurate at anything, and yet over time we can build up large models of problems using verification and review. We can do the exact same thing with LLMs.

The best example of this is Sean Heelan, who used o3 to find a real security vulnerability in the Linux kernel: https://sean.heelan.io/2025/05/22/how-i-used-o3-to-find-cve-...

Sean Heelan ran o3 100 times, and it found a known vulnerability in 8% of runs. For a security audit, that is immensely useful, since an expert can spend the time to look at the results from a dozen runs and quickly decide if there is anything real. Even more remarkably though, this same testing exposed a zero-day that they were not even looking for. That is pretty incredible for a system that makes mistakes.

This is why LLM reasoning absolutely does not need to be perfect to be useful. Human reasoning is inherently flawed as well, and yet through systems like peer review and reproducing results, we can still make tremendous progress over time. It is just about figuring out systems of verification and review so that we don't need to trust any LLM output blindly. That said, greater reliability would be massively beneficial to how easy it is to get good results from LLMs. But it's not required.

106. usrbinbash ◴[31 Oct 25 12:58 UTC] No.45771503[source]▶

>>45770715 #

> They are good at repeating their training data, not thinking about it.

Which shouldn't come as a surprise, considering that this is, at the core of things, what language models do: Generate sequences that are statistically likely according to their training data.

replies(1): >>45772607 #

107. mannykannot ◴[31 Oct 25 13:01 UTC] No.45771523{4}[source]▶

>>45770967 #

It is video generation, but succeeding at this task involves detailed reasoning about cause and effect to construct chains of events, and may not be something that can be readily completed by applying "intuitions" gained from "watching" lots of typical movies, where most of the events are stereotypical.

108. kordlessagain ◴[31 Oct 25 13:03 UTC] No.45771531[source]▶

>>45769971 (OP) #

What specific reasoning capabilities matter for what real-world applications?

Nobody knows.

Moreover, nobody talks about that because it's boring and non-polarizing. Instead, supposedly smart people post stupid comments that prevent anyone from understanding this paper is worthless.

The paper is worthless because it has a click-bait title. Blog posts get voted down for that, why not this?

The implicit claim is worthless. Failure to navigate a synthetic graph == failure to solve real world problems. False.

Absolutely no connection to real world examples. Just losing the model in endless graphs.

replies(1): >>45775451 #

109. acdha ◴[31 Oct 25 13:05 UTC] No.45771554{6}[source]▶

>>45771416 #

They claimed fraud by a single person, with zero replication. That’s both testable so they should be able to support it.

At the very least, more than one researcher was involved and more than one ape was alleged to have learned ASL. There is a better discussion about what our threshold is for speech, along with our threshold for saying that research is fraud vs. mistaken, but we don’t fix sloppiness by engaging in more of it.

replies(1): >>45775819 #

110. pfortuny ◴[31 Oct 25 13:05 UTC] No.45771555{3}[source]▶

>>45770879 #

Most amazing is asking any of the models to draw an 11-sided polygon and number the edges.

replies(1): >>45771707 #

111. hirako2000 ◴[31 Oct 25 13:09 UTC] No.45771590{3}[source]▶

>>45771339 #

This very paper sits on the assumption reasoning (to solve puzzles) is at play. It calls those LLMs RLMs.

Imo the paper itself should have touched on the lack of paper discussing what's in the blackbox that makes them Reasoning LMs. It does mention some tree algorithm supposedly key to reasoning capabilities.

By no means attacking the paper as its intent is to demonstrate the lack of success to even solve simple to formulate, complex puzzles.

I was not making a point, I was genuinely asking in case someone knows of papers I could read on that make claims with evidence that's those RLM actually reason, and how.

112. Der_Einzige ◴[31 Oct 25 13:10 UTC] No.45771595{3}[source]▶

>>45771143 #

What happens when both choices lead to you doing it yourself?

113. darkwater ◴[31 Oct 25 13:12 UTC] No.45771622{6}[source]▶

>>45771126 #

But those are still approximations to the actual underlying reality. Because the other option (and yes, it's a dichotomy) is that we already defined and understood every detail of the physics that applies to our universe.

replies(1): >>45771708 #

114. darkwater ◴[31 Oct 25 13:14 UTC] No.45771639{3}[source]▶

>>45771274 #

Have you threatened it with a 2 in the next round of performance reviews?

115. riskable ◴[31 Oct 25 13:16 UTC] No.45771656[source]▶

>>45769971 (OP) #

My hypothesis: This is why AI is fantastic as a coding assistant but not so great at other things. A software developer—after watching an AI model fail over and over again, trying to say, fix a difficult bug—will stop and approach the issue from a different angle. They'll take a closer look at what's going on, fiddle things around by hand, and that's usually enough to get over that hump of complexity (that the AI model couldn't work its way through).

We (developers) do this because it's what we've always done with our own code. Everyone's encountered a bug that they just couldn't figure out. So they search the Internet, try different implementations of the same thing, etc but nothing works. Usually, we finally solve such problems when we take a step back and look at it with a different lens.

For example, just the other day—after spending far too long trying to get something working—I realized, "Fuck it! The users don't really need this feature." :thumbsup:

replies(1): >>45772119 #

116. hirako2000 ◴[31 Oct 25 13:17 UTC] No.45771659{3}[source]▶

>>45770918 #

It's a statistical imitation of a reasoning pattern, underlying mechanism is pattern matching. The ability to create a model that can determine two radically different words have strong similarity in meaning doesn't imply emergence of some generalizable, logical model that suddenly can Reason to solve novel problems.

Pattern matching is a component of reason. Not === reason.

117. Torkel ◴[31 Oct 25 13:21 UTC] No.45771707{4}[source]▶

>>45771555 #

I asked gpt5, and it worked really well with a correct result. Did you expect it to fail?

118. squidbeak ◴[31 Oct 25 13:21 UTC] No.45771708{7}[source]▶

>>45771622 #

Indeed, that is a dichotomy: a false one. Science is exact without being finished.

replies(1): >>45772038 #

119. squidbeak ◴[31 Oct 25 13:23 UTC] No.45771733{7}[source]▶

>>45771371 #

This I'm afraid is rubbish. Scientific proofs categorically don't depend on philosophical beliefs. Reality is measurable and the properties measured don't care about philosophy.

replies(1): >>45772324 #

120. rightbyte ◴[31 Oct 25 13:24 UTC] No.45771737{4}[source]▶

>>45771269 #

I mean dogs can learn a simple sign language?

replies(1): >>45775319 #

121. mannykannot ◴[31 Oct 25 13:26 UTC] No.45771745{3}[source]▶

>>45770777 #

There seems to be more to it than that - in my experience with LLMs, they are good at finding some relevant facts but then quite often present a non-sequitur for a conclusion, and the article's title alone indicates that the problem for LRMs is similar: a sudden fall-off in performance as the task gets more difficult. If the issue was just non-determinism, I would expect the errors to be more evenly distributed, though I suppose one could argue that the sensitivity to non-determinism increases non-linearly.

122. tomlockwood ◴[31 Oct 25 13:27 UTC] No.45771753{4}[source]▶

>>45771324 #

So we're sending a trillion on faith?

replies(1): >>45771805 #

123. acdha ◴[31 Oct 25 13:29 UTC] No.45771770[source]▶

>>45770626 #

The problem is consistency: AI tools usually produce output which _sounds_ like the top 10% but you have to read it carefully to find the bottom 10% parts. We’re not used to that because human performance isn’t that inconsistent and we use history and social factors: someone’s performance goes down when they’re really drunk, but they rarely show up to work in that state and it’s obvious enough that other people recognize that they shouldn’t be trusted.

replies(1): >>45772050 #

124. code_martial ◴[31 Oct 25 13:32 UTC] No.45771805{5}[source]▶

>>45771753 #

No, that’s not what I said.

replies(1): >>45772216 #

125. Workaccount2 ◴[31 Oct 25 13:36 UTC] No.45771840[source]▶

>>45770715 #

To be fair, we don't actually know what is and isn't in their training data. So instead we just assign successes to "in the training set" and failures to "not in the training set".

But this is unlikely, because they still can fall over pretty badly on things that are definitely in the training set, and still can have success with things that definitely are not in the training set.

126. MangoToupe ◴[31 Oct 25 13:45 UTC] No.45771925{3}[source]▶

>>45771093 #

> The theory is that they don't know that there are entities that know things they don't.

This seems like a rather awkward way of putting it. They may just lack conceptualization or abstraction, making the above statement meaningless.

replies(1): >>45772322 #

127. MangoToupe ◴[31 Oct 25 13:46 UTC] No.45771941{6}[source]▶

>>45771409 #

The way linguists define communication via language? Sure. Let's not drag the rest of humanity into this presumption.

128. darkwater ◴[31 Oct 25 13:56 UTC] No.45772038{8}[source]▶

>>45771708 #

So, was Newtonian physics exact already?

replies(1): >>45772146 #

129. dankai ◴[31 Oct 25 13:57 UTC] No.45772040[source]▶

>>45769971 (OP) #

This is not the only paper that scales reasoning complexity / difficulty.

The CogniLoad benchmark does this as well (in addition to scaling reasoning length and distractor ratio). Requiring the LLM to purely reason based on what is in the context (i.e. not based on the information its pretrained on), it finds that reasoning performance decreases significantly as problems get harder (i.e. require the LLM to hold more information in its hidden state simultaneously), but the bigger challenge for them is length.

https://arxiv.org/abs/2509.18458

Disclaimer: I'm the primary author of CogniLoad so feel free to ask me any questions.

130. anal_reactor ◴[31 Oct 25 13:58 UTC] No.45772050{3}[source]▶

>>45771770 #

> We’re not used to that because human performance isn’t that inconsistent

It is. It's very common for socially apt people to bullshit through things they don't know, or outright want to hide.

replies(1): >>45772079 #

131. acdha ◴[31 Oct 25 14:01 UTC] No.45772079{4}[source]▶

>>45772050 #

That’s not inconsistent: your bluffer knows they’re making something up and is using their model of you to construct something they think you’ll believe. Someone who can do that isn’t going to suddenly forget how to count the number of letters in a word.

replies(1): >>45772195 #

132. pessimizer ◴[31 Oct 25 14:05 UTC] No.45772112{4}[source]▶

>>45770433 #

> I’ve had models “redirect the problem to someone who has a greater likelihood of not failing”. Gemini in particular will do this when it runs into trouble.

I have too, and I sense that this is something that has been engineered in rather than coming up naturally. I like it very much and they should do it a lot more often. They're allergic to "I can't figure this out" but hearing "I can't figure this out" gives me the alert to help it over the hump.

> But when you’re talking about things related to failures and accuracy, they’re mostly superhuman.

Only if you consider speed to failure and inaccuracy. They're very much subhuman in output, but you can make them retry a lot in a short time, and refine what you're asking them each time to avoid the mistakes they're repeatedly making. But that's you doing the work.

133. acuozzo ◴[31 Oct 25 14:05 UTC] No.45772119[source]▶

>>45771656 #

> AI is fantastic as a coding assistant

The extent to which this is true is a rough measure of how derivative your work is, no?

134. squidbeak ◴[31 Oct 25 14:07 UTC] No.45772146{9}[source]▶

>>45772038 #

> Science is exact without being finished

replies(1): >>45772311 #

135. anal_reactor ◴[31 Oct 25 14:13 UTC] No.45772195{5}[source]▶

>>45772079 #

You're wrong. Counting the number of letters in a word is a significantly more difficult task than lying, both for humans and LLMs. Imagine going to a ghetto and asking people "have you ever lied to someone and had them believe the lie", and ask them to spell "continuously". Children learn to lie before they learn to spell.

replies(1): >>45772454 #

136. tomlockwood ◴[31 Oct 25 14:14 UTC] No.45772216{6}[source]▶

>>45771805 #

Why are we sending the trillion?

replies(1): >>45775705 #

137. darkwater ◴[31 Oct 25 14:24 UTC] No.45772311{10}[source]▶

>>45772146 #

Being exact doesn't mean it is not an approximation, which was the initial topic. Being exact in science means that 2+2=4 and that can be demonstrated following a logical chain. But that doesn't make our knowledge of the universe exact. It is still an approximation. What it can be "exact" is how we obtain and reproduce the current knowledge we have of it.

replies(1): >>45774277 #

138. ralfd ◴[31 Oct 25 14:25 UTC] No.45772319{6}[source]▶

>>45771415 #

> In his lecture, Sapolsky alleges that Patterson spontaneously corrects Koko’s signs: “She would ask, ‘Koko, what do you call this thing?’ and [Koko] would come up with a completely wrong sign, and Patterson would say, ‘Oh, stop kidding around!’ And then Patterson would show her the next one, and Koko would get it wrong, and Patterson would say, ‘Oh, you funny gorilla.’ ”

More weirdly was this lawsuit against Patterson:

> The lawsuit alleged that in response to signing from Koko, Patterson pressured Keller and Alperin (two of the female staff) to flash the ape. "Oh, yes, Koko, Nancy has nipples. Nancy can show you her nipples," Patterson reportedly said on one occasion. And on another: "Koko, you see my nipples all the time. You are probably bored with my nipples. You need to see new nipples. I will turn my back so Kendra can show you her nipples."[47] Shortly thereafter, a third woman filed suit, alleging that upon being first introduced to Koko, Patterson told her that Koko was communicating that she wanted to see the woman's nipples

There was a bonobo named Kanzi who learned hundreds of lexigrams. The main criticism here seems to be that while Kanzi truly did know the symbol for “Strawberry” he “used the symbol for “strawberry” as the name for the object, as a request to go where the strawberries are, as a request to eat some strawberries”. So no object-verb sentences and so no grammar which means no true language according to linguists.

https://linguisticdiscovery.com/posts/kanzi/

replies(1): >>45775868 #

139. sodality2 ◴[31 Oct 25 14:25 UTC] No.45772322{4}[source]▶

>>45771925 #

The exact title of the capacity is 'theory of mind' - for example, chimpanzees have a limited capacity for it in that they can understand others' intentions, but they seemingly do not understand false beliefs (this is what GP mentioned).

https://doi.org/10.1016/j.tics.2008.02.010

replies(1): >>45774108 #

140. weltensturm ◴[31 Oct 25 14:26 UTC] No.45772324{8}[source]▶

>>45771733 #

> Reality is measurable

Heisenberg would disagree.

replies(1): >>45774272 #

141. nakamoto_damacy ◴[31 Oct 25 14:33 UTC] No.45772381{5}[source]▶

>>45770993 #

Not sure why he capitalized bitter...

142. kerabatsos ◴[31 Oct 25 14:37 UTC] No.45772429[source]▶

>>45769971 (OP) #

How is that different than human reasoning?

replies(1): >>45774799 #

143. jampekka ◴[31 Oct 25 14:38 UTC] No.45772441{5}[source]▶

>>45770993 #

It was a joke referring to his essay.

https://en.wikipedia.org/wiki/Bitter_lesson

144. acdha ◴[31 Oct 25 14:39 UTC] No.45772454{6}[source]▶

>>45772195 #

> Counting the number of letters in a word is a significantly more difficult task than lying

No, it’s not - you don’t even need to be literate to count symbols - but also consider the complexity of the second task and how many skills each requires: unlike counting letters, lying isn’t simple confabulation and requires a theory of mind and some kind of goal. A child who lies to avoid trouble is doing that because they have enough of a world model to know they are going to get in trouble for something even if they haven’t worked out yet that this is unlikely to work.

replies(1): >>45773185 #

145. igravious ◴[31 Oct 25 14:42 UTC] No.45772483{3}[source]▶

>>45771097 #

and doesn't it depend on the LLM?

replies(1): >>45773034 #

146. j45 ◴[31 Oct 25 14:42 UTC] No.45772485[source]▶

>>45769971 (OP) #

Compared to software that can explicitly reason, reasoning models don’t seem to reason at all.

They simulate reasoning through matching patterns.

147. dymk ◴[31 Oct 25 14:51 UTC] No.45772607{3}[source]▶

>>45771503 #

This is too large of an oversimplification of how an LLM works. I hope the meme that they are just next token predictors dies out soon, before it becomes a permanent fixture of incorrect but often stated “common sense”. They’re not Markov chains.

replies(3): >>45772668 #>>45772674 #>>45780675 #

148. usrbinbash ◴[31 Oct 25 14:56 UTC] No.45772667[source]▶

>>45770449 #

> I don't think there's evidence that this issue would persist after continuing to scale models to be larger and doing more RL

And how much larger do we need to make the models? 2x? 3x? 10x? 100x? How large do they need to get before scaling-up somehow solves everything?

Because: 2x larger, means 2x more memory and compute required. Double the cost or half the capacity. Would people still pay for this tech if it doubles in price? Bear in mind, much of it is already running at a loss even now.

And what if 2x isn't good enough? Would anyone pay for a 10x larger model? Can we even realistically run such models as anything other than a very expensive PoC and for a very short time? And whos to say that even 10x will finally solve things? What if we need 40x? Or 100x?

Oh, and of course: Larger models also require more data to train them on. And while the Internet is huge, it's still finite. And when things grow geometrically, even `sizeof(internet)` eventually runs out ... and, in fact, may have done so already [1] [2]

What if we actually discover that scaling up doesn't even work at all, because of diminishing returns? Oh wait, looks like we did that already: [3]

[1]: https://observer.com/2024/12/openai-cofounder-ilya-sutskever...

[2]: https://biztechweekly.com/ai-training-data-crisis-how-synthe...

[3]: https://garymarcus.substack.com/p/confirmed-llms-have-indeed...

replies(1): >>45773612 #

149. adastra22 ◴[31 Oct 25 14:57 UTC] No.45772668{4}[source]▶

>>45772607 #

They are next token predictors though. That is literally wha they are. Nobody is saying they are simple Markov chains.

replies(1): >>45775953 #

150. gpderetta ◴[31 Oct 25 14:57 UTC] No.45772674{4}[source]▶

>>45772607 #

Indeed, they are next token predictors, but this is a vacuous statement because the predictor can be arbitrary complex.

replies(1): >>45776178 #

151. usrbinbash ◴[31 Oct 25 15:02 UTC] No.45772739{4}[source]▶

>>45771324 #

> if it can be proven correct

Then the first step would be to prove that this works WITHOUT needing to burn through the trillions to do so.

152. cryptonym ◴[31 Oct 25 15:06 UTC] No.45772820{7}[source]▶

>>45770997 #

Can solve problems you already know how to solve, if you micro-manage it and it'll BS a lot on the way.

If this is the maximum AGI-PhD-LRM can do, that'll be disappointing compared to investments. Curious to see what all this will become in few years.

replies(2): >>45773274 #>>45780970 #

153. cryptonym ◴[31 Oct 25 15:16 UTC] No.45772974{8}[source]▶

>>45771034 #

Computers have forever been doing stuff people can't do.

The real question is how useful this tool is and if this is as transformative as investors expect. Understanding its limits is crucial.

154. brap ◴[31 Oct 25 15:16 UTC] No.45772986{5}[source]▶

>>45770979 #

You will need a formal language first.

Take this statement for example:

>ChatGPT knows WebPPL really well

What formal language can express this statement? What will the text be parsed into? Which transformations can you use to produce other truthful (and interesting) statements from it? Is this flexible enough to capture everything that can be expressed in English?

The closest that comes to mind is Prolog, but it doesn’t really come close.

replies(1): >>45777678 #

155. egberts1 ◴[31 Oct 25 15:19 UTC] No.45773034{4}[source]▶

>>45772483 #

If you have your Pro or private LLM, then it's a tad bit bigger.

156. egberts1 ◴[31 Oct 25 15:26 UTC] No.45773126{3}[source]▶

>>45771097 #

Cheap public offering of their expensive data center is that sweet spot and cutoff at 40KB.

157. anal_reactor ◴[31 Oct 25 15:30 UTC] No.45773185{7}[source]▶

>>45772454 #

Sure, let's stick to counting symbols. When I need to count something, there's a decent chance I'll get lost if I count beyond 10, and beyond 20 I'll get lost for sure. Even below 10, when I count it's one-two-three-four-five-six-seven-eight-nine items. But when I lie I do it instantaneously, without altering the pace of the conversation. I can come up with a believable lie within the brief period between someone saying something to me, and the moment I'm expected to respond. No way I'd be able to count 10 items that fast.

Pirahã language doesn't even have numerals - that's an extreme case, but there quite a few languages where people stop counting beyond certain small number and just say "a lot". Same people though don't have issues lying to one another. Let that sink in for a while - fully grown-ass adults, fully capable of functioning in their society, not capable of counting one-two-three because the concept is beyond them.

What I'm trying to say is that all of those "requires theory of mind" statements are probably true but completely irrelevant because humans (and LLMs) have "hardware acceleration" of whatever it takes to lie, meanwhile counting is an abstract idea that requires to use the brain in a way it didn't evolve to be used. Similarly, LLMs cannot count if they aren't connected to a math engine - not because they're stupid, but because counting is really difficult.

158. vidarh ◴[31 Oct 25 15:37 UTC] No.45773274{8}[source]▶

>>45772820 #

I'm not usually micro-managing it, that's the point.

I sometimes do on problems where I have particular insight, but I mostly find it is far more effective to give it test cases and give it instructions on how to approach a task, and then let it iterate with little to no oversight.

I'm letting Claude Code run for longer and longer with --dangerously-skip-permissions, to the point I'm pondering rigging up something to just keep feeding it "continue" and run it in parallel on multiple problems.

Because at least when you have a good way of measuring success, it works.

159. alyxya ◴[31 Oct 25 15:48 UTC] No.45773412{3}[source]▶

>>45771061 #

They can’t claim current models aren’t able to handle these problems if they didn’t use a setup similar to coding agents like Claude Code and OpenAI Codex. Using a suboptimal setup is akin to verbally telling a person the whole reasoning problem without letting them write down notes and expecting them to memorize and solve it after only hearing it once.

replies(2): >>45775175 #>>45775224 #

160. alyxya ◴[31 Oct 25 16:03 UTC] No.45773612{3}[source]▶

>>45772667 #

Scaling applies to multiple dimensions simultaneously over time. A frontier model today could be replicated a year later with a model half the size, with a quarter of the FLOPS, etc. I don’t know the real numbers for optimization scaling, but you could check out NanoGPT speedrun [1] as an example.

The best solution in the meantime is giving the LLM a harness that allows tool use like what coding agents have. I suspect current models are fully capable of solving arbitrary complexity artificial reasoning problems here, provided that they’re used in the context of a coding agent tool.

[1] https://github.com/KellerJordan/modded-nanogpt

replies(2): >>45775779 #>>45775910 #

161. squidproquo ◴[31 Oct 25 16:49 UTC] No.45774081{3}[source]▶

>>45770777 #

The non-determinism is part of the allure of these systems -- they operate like slot machines in a casino. The dopamine hit of getting an output that appears intelligent and the variable rewards keeps us coming back. We down-weight and ignore the bad outputs. I'm not saying these systems aren't useful to a degree, but one should understand the statistical implications on how we are collectively perceiving their usefulness.

162. MangoToupe ◴[31 Oct 25 16:51 UTC] No.45774108{5}[source]▶

>>45772322 #

Theory of mind is a distinct concept that isn't necessary to explain this behavior. Of course, it may follow naturally, but it strikes me as ham-fisted projection of our own cognition onto others. Ironically, a rather greedy theory of mind!

replies(1): >>45775896 #

163. squidbeak ◴[31 Oct 25 17:06 UTC] No.45774272{9}[source]▶

>>45772324 #

Are you arguing that the uncertainty principle derives from philosophy rather than math?

164. squidbeak ◴[31 Oct 25 17:07 UTC] No.45774277{11}[source]▶

>>45772311 #

The speed of light, or plank's constant - are these approximations?

replies(1): >>45780008 #

165. tekno45 ◴[31 Oct 25 17:41 UTC] No.45774621{3}[source]▶

>>45771339 #

By renaming this binary to a "Mind reading language model" We now can read your mind and predict your choices just by chatting.

Don't ask how it works cuz its called a "Mind reading language model" duh.

166. tekno45 ◴[31 Oct 25 17:42 UTC] No.45774641[source]▶

>>45771082 #

When the news cycle has been "lawnmowers can now do anything, throw away your kitchenaide" its a pretty relevant title.

167. ares623 ◴[31 Oct 25 17:55 UTC] No.45774799[source]▶

>>45772429 #

I’d like $500B to just be the way I am thanks.

168. jeremyjh ◴[31 Oct 25 18:32 UTC] No.45775175{4}[source]▶

>>45773412 #

If the models can’t do it they can make that claim. If you want to make claims about agents then design that experiment, collect the data and write a paper. That is how science works.

169. rdedev ◴[31 Oct 25 18:37 UTC] No.45775224{4}[source]▶

>>45773412 #

The thing they are testing for is reasoning performance. It makes sense to not give tool access.

This is same as the critiques of the LLM paper by apple where they showed that LLMs fail to solve the tower of hanoi problem after a set number of towers. The test was to see how well these models can reason out a long task. People online were like they could solve that problem if they had access to a coding enviornment. Again the test was to check reasoning capability not if it knew how to code and algorithm to solve the problem.

If model performance degrade a lot after a number of reasoning steps it's good to know where the limits are. Wheather the model had access to tools or not is orthogonal to this problem

170. leptons ◴[31 Oct 25 18:46 UTC] No.45775319{5}[source]▶

>>45771737 #

Can the dogs sign back? Even dogs that learn to press buttons are mostly just pressing them to get treats. They don't ask questions, and it's not really a conversation.

171. wavemode ◴[31 Oct 25 18:58 UTC] No.45775451[source]▶

>>45771531 #

> The implicit claim is worthless. Failure to navigate a synthetic graph == failure to solve real world problems. False.

This statement is the dictionary definition of attacking a strawman.

Every new model that is sold to us, is sold on the basis that it performs better than the old model on synthetic benchmarks. This paper presents a different benchmark that those same LLMs perform much worse on.

You can certainly criticize the methodology if the authors have erred in some way, but I'm not sure why it's hard to understand the relevance of the topic itself. If benchmarks are so worthless then go tell that to the LLM companies.

172. BriggyDwiggs42 ◴[31 Oct 25 19:08 UTC] No.45775565[source]▶

>>45770449 #

The issue is that no matter how much you train them they don’t generalize to arbitrary sized problems. Sure you can push out the horizon, but you won’t make something that can solve the problem always (assuming resources permit, and that isn’t the issue here).

173. riku_iki ◴[31 Oct 25 19:11 UTC] No.45775592{3}[source]▶

>>45770198 #

> then reasoning LLMs seem to pass that bar as they do very well programming and maths competitions.

it could be this is just result of good stochastic parroting and not reasoning. Both of those niches are narrow with high amount of training data (e.g. corps buying solutions from leetcode and training LLMs on them).

From another hand we see that LLMs fail in more complex environment: e.g. ask to build some new feature in postgres database.

replies(1): >>45777285 #

174. krackers ◴[31 Oct 25 19:18 UTC] No.45775668[source]▶

>>45770626 #

ARC-AGI v3 is a pretty good benchmark, and it's notably different from the other ARC-AGI in that it has a "truer" human baseline (you can go play it right now and add your datapoint), and captures the act of in-context learning better as you start an unfamiliar game then master it over time.

Also bottom 10% feels like a bad comparison, median human would be better. And unlike "specialized" things like programming, game playing is something almost all of us have done.

175. measurablefunc ◴[31 Oct 25 19:22 UTC] No.45775705{7}[source]▶

>>45772216 #

It must be deposited into OpenAI's bank account so that they can then deposit it into NVIDIA's account who can then in turn make a deal w/ OpenAI to deposit it back into OpenAI's account for some stock options. I think you can see how it works from here but if not then maybe one of the scaled up "reasoning" AIs will figure it out for you.

replies(1): >>45778896 #

176. galaxyLogic ◴[31 Oct 25 19:26 UTC] No.45775741[source]▶

>>45770449 #

> complexity of a graph created by modeling the problem as a graph and determining the traversals needed to go from some source node to a target node

Sounds interesting: Formalizing a problem once you know the solution. Seems like LLMs can't do that, or if they could they would evaluate where their problem solving is inadequate?

177. measurablefunc ◴[31 Oct 25 19:28 UTC] No.45775771[source]▶

>>45770407 #

It's doing so already. All code executed on a computer, especially neural networks w/o any loops are simply doing boolean arithmetic. In fact, the computer can't do anything else other than boolean arithmetic.

178. galaxyLogic ◴[31 Oct 25 19:29 UTC] No.45775779{4}[source]▶

>>45773612 #

Some problems are just too complex and the effort to solve them increases exponentially. No LLM can keep up with exponenentially increasing effort unless you run them for adequatte number of years.

179. galaxyLogic ◴[31 Oct 25 19:33 UTC] No.45775819{7}[source]▶

>>45771554 #

SO why wasn't the research continued further if results were good? My assumption is it was because of the - Fear of the Planet of Apes!

180. galaxyLogic ◴[31 Oct 25 19:38 UTC] No.45775868{7}[source]▶

>>45772319 #

> So no object-verb sentences and so no grammar which means no true language

Great distinction. The stuff about showing nipples sounds creepy.

181. galaxyLogic ◴[31 Oct 25 19:41 UTC] No.45775896{6}[source]▶

>>45774108 #

If apes started communicating mongs themselves with sign-language they learned from humans that would measn they would get more practice using it and they could evolve it over aeons. Hey, isn't that what actually happened?

182. Infinity315 ◴[31 Oct 25 19:42 UTC] No.45775910{4}[source]▶

>>45773612 #

What? Fundamentally, information can only be so dense. Current models may be inefficient w.r.t. information density, however, there is a lower bound of compute required. As a pathological example, we shouldn't expect a megabyte worth of parameters to be able to encode the entirety of Wikipedia.

183. galaxyLogic ◴[31 Oct 25 19:44 UTC] No.45775925{5}[source]▶

>>45771083 #

A Goldbergs machine was not part of their training data. For humans, we have seem such things.

replies(1): >>45776030 #

184. aoeusnth1 ◴[31 Oct 25 19:45 UTC] No.45775941{3}[source]▶

>>45770506 #

Others would beg to disagree that we should be build a machine which can act as a human.

185. dymk ◴[31 Oct 25 19:46 UTC] No.45775953{5}[source]▶

>>45772668 #

It’s a uselessly reductive statement. A person at a keyboard is also a next token predictor, then.

replies(3): >>45776192 #>>45776258 #>>45778151 #

186. galaxyLogic ◴[31 Oct 25 19:49 UTC] No.45775980{3}[source]▶

>>45770777 #

> Every token in a response has an element of randomness to it.

I haven't tried this, but so if you ask the LLM the exact same question again, but in a different process, will you get a different answer?

Wouldn't that mean we should mosr of the time ask the LLM each question multiple times, to see if we get a better answer next time?

A bit like asking the same question from multiple different LLMs just to be sure.

187. autoexec ◴[31 Oct 25 19:55 UTC] No.45776030{6}[source]▶

>>45775925 #

physics textbooks are though so it should know how they'd work, or at least know that balls don't spontaneously appear and disappear and that gears don't work when they aren't connected

188. HarHarVeryFunny ◴[31 Oct 25 20:10 UTC] No.45776178{5}[source]▶

>>45772674 #

Sure, but a complex predictor is still a predictor. It would be a BAD predictor if everything it output was not based on "what would the training data say?".

If you ask it to innovate and come up with something not in it's training data, what do you think it will do .... it'll "look at" it's training data and regurgitate (predict) something labelled as innovative

You can put a reasoning cap on a predictor, but it's still a predictor.

replies(1): >>45776459 #

189. HarHarVeryFunny ◴[31 Oct 25 20:12 UTC] No.45776192{6}[source]▶

>>45775953 #

Yes, but it's not ALL they are.

replies(1): >>45776451 #

190. daveguy ◴[31 Oct 25 20:19 UTC] No.45776258{6}[source]▶

>>45775953 #

They are both designed, trained, and evaluated by how well they can predict the next token. It's literally what they do. "Reasoning" models just buildup additional context of next token predictions and RL is used to bias output options to ones more appealing to human judges. It's not a meme. It's an accurate description of their fundamental computational nature.

191. astrange ◴[31 Oct 25 21:21 UTC] No.45776836{3}[source]▶

>>45771331 #

That's not important compared to the post-training RL, which isn't "training data".

192. sothatsit ◴[31 Oct 25 22:15 UTC] No.45777285{4}[source]▶

>>45775592 #

This is clearly false. LLMs being able to multiply large numbers is the clear example to me that there is more than just memorisation going on. They cannot just memorise the answers to multipling huge numbers like they do.

That's not to mention that these programming competition problems are designed to be novel. They are as novel as the competition designers can get while sticking to the bounds of the competition. This is clearly not stochastic parrot behaviour.

Additionally, them falling over in large codebases is not evidence that they cannot reason over smaller well-defined problems. It is just evidence that their reasoning has limits, which should not be surprising to anyone. Humans also have limits in our reasoning. That does not mean we do not reason.

replies(1): >>45777382 #

193. riku_iki ◴[31 Oct 25 22:27 UTC] No.45777382{5}[source]▶

>>45777285 #

I think you just made lots of handwaving statements. Here is result which says LLMs can't do multi-digit multiplications well: https://arxiv.org/pdf/2510.00184

replies(1): >>45777872 #

194. nl ◴[31 Oct 25 23:02 UTC] No.45777678{6}[source]▶

>>45772986 #

> You will need a formal language first.

No, that's the entire point!

The LLM is the bridge between natural language and a formal specification.

(WebPPL is a formal language btw. It's not unlike Prolog but is designed from the start to express lemmas probabilistically)

195. sothatsit ◴[31 Oct 25 23:33 UTC] No.45777872{6}[source]▶

>>45777382 #

We are talking about reasoning models here, not old non-reasoning models like Llama-90B and GPT-4. Obviously, they cannot multiply numbers. That was never in question.

Maybe at least give a cursory glance at a paper before trying to cite it to support your point?

I find it fun that this paper also points out that using another training method, IcoT, they can produce models that can multiply numbers perfectly. The frontier reasoning models can still make mistakes, they just get very close a lot of the time, even with 10-20 digit numbers. But the IcoT models can do it perfectly, they just can only multiply numbers.

replies(1): >>45777885 #

196. riku_iki ◴[31 Oct 25 23:35 UTC] No.45777885{7}[source]▶

>>45777872 #

so, give ref on results which prove that they reliably can multiply arbitrary numbers

> Maybe at least give a cursory glance at a paper before trying to cite it to support your invalid point?

they use CoT aka reasoning steps

replies(1): >>45777968 #

197. sothatsit ◴[31 Oct 25 23:49 UTC] No.45777968{8}[source]▶

>>45777885 #

They do not apply reinforcement learning, which is what most people mean when they talk about reasoning LLMs. This means it is not comparable to the frontier reasoning models.

Here is the post I remember seeing: https://www.reddit.com/r/singularity/comments/1ip3vpa/multid...

This shows that o3-mini has >90% accuracy at multiplying numbers up to 8-digits, and it is capable of multiplying numbers much larger than that. Whereas, gpt-4o could only multiply 2-digit numbers reliably.

Here's another person I saw experimenting with this: https://sanand0.github.io/llmmath/

That's not to mention that if you give these models a Python interpreter, they can also do this task perfectly and tackle much more complicated tasks as well. Although, that is rather separate to the models themselves being able to apply the reasoning steps to multiply numbers.

replies(1): >>45778258 #

198. adastra22 ◴[01 Nov 25 00:20 UTC] No.45778151{6}[source]▶

>>45775953 #

Yes. That's not the devastating take-down you think it is. Are you positing that people have souls? If not, then yes: human chain-of-thought is the equivalent of next token prediction.

199. riku_iki ◴[01 Nov 25 00:42 UTC] No.45778258{9}[source]▶

>>45777968 #

> reinforcement learning, which is what most people mean when they talk about reasoning LLMs

popularity contest has no place in tech discussion, and even then not clear on which evidence you make such statement.

imo, reasoning model is model trained on lots of reasoning steps, so it is strong in producing those.

rl is used in niches where there is no much training data, so data is synthetically generated, which produces lots of garbage and model need feedback to adjust. And multiplication is not such niche.

> This shows that o3-mini has >90% accuracy at multiplying numbers up to 8-digits, and it is capable of multiplying numbers much larger than that. Whereas, gpt-4o could only multiply 2-digit numbers reliably.

it could be just a matter that one model has training data for this and another doesn't, you can't come to any conclusion without inspecting oai data.

Also, your examples actually demonstrate that frontier LLMs can't learn and reproduce trivial algorithm reliably, and results actually in a quality range of stochastic parrot.

replies(1): >>45778354 #

200. sothatsit ◴[01 Nov 25 00:58 UTC] No.45778354{10}[source]▶

>>45778258 #

1. This is obviously not about popularity... It is about capability. You cannot use crappy 2-year-old models with chain-of-thought to make inferences about frontier reasoning models that were released less than a year ago.

2. It is literally impossible for the models to have memorised all the results from multiplying 8-digit numbers. There are at least 10^14 8-digit multiplications that are possible (lower-bound), which from an information theory perspective would require a minimum of 48 PiB of data to hold. They have to be applying algorithms internally to perform this task, even if that algorithm is just uncompressing some unbelievably-well-compressed form of the results.

3. If you expect 100% reliability, obviously all humans would also fail. Therefore, do humans not reason? The answer is obviously not. We are trying to demonstrate that LLMs can exhibit reasoning here, not whether or not their reasoning has flaws or limitations (which it obviously does).

replies(1): >>45778407 #

201. riku_iki ◴[01 Nov 25 01:08 UTC] No.45778407{11}[source]▶

>>45778354 #

> You cannot use crappy 2-year-old models with chain-of-thought to make inferences about frontier reasoning models that were released less than a year ago.

the idea is to train new specialized model, which could specifically demonstrate if LLM can learn multiplication.

> It is literally impossible for the models to have memorised how to multiply 8-digit numbers. There are at least 10^14 8-digit multiplications that are possible

sure, they could memorize fragments: if that fragment contains that seq of digits, then that fragment must contains that seq of digits, which is much smaller space

> If you expect 100% reliability, obviously all humans would also fail. Therefore, do humans not reason?

Human fail because they are weak in this case because they can't reliably do arithmetic, and sometimes make mistake, also I speculate if you give enough time, and ask human to triple check calculations, result will be very good.

replies(1): >>45778740 #

202. robocat ◴[01 Nov 25 01:10 UTC] No.45778422[source]▶

>>45770715 #

> They are good at repeating their training data, not thinking about it

Sounds like most people too!

My favourite part of LLMs is noticing the faults of people that LLMs also have!

203. antonvs ◴[01 Nov 25 02:14 UTC] No.45778731{5}[source]▶

>>45770627 #

What's your point? Traditional codegen tools are inflexible in the extreme compared to what LLMs can do.

The realistic comparison is between humans and LLMs, not LLMs and codegen tools.

replies(1): >>45779464 #

204. sothatsit ◴[01 Nov 25 02:18 UTC] No.45778740{12}[source]▶

>>45778407 #

We also cap how long we let reasoning LLMs think for. OpenAI researchers have already discussed models they let reason for hours that could solve much harder problems.

But regardless, I feel like this conversation is useless. You are clearly motivated to not think LLMs are reasoning by 1) only looking at crappy old models as some sort of evidence about new models, which is nonsense, and 2) coming up with nonsensical arguments about how they could still be memorising answers that make no sense. Even if they memorised sequences, they still have to put that together to get the exact right answers to 8-digit multiplication in >90% of cases. That requires the application of algorithms, aka reasoning.

replies(1): >>45778763 #

205. riku_iki ◴[01 Nov 25 02:23 UTC] No.45778763{13}[source]▶

>>45778740 #

> only looking at crappy old model

let me repeat this: it was newly trained specialized model

other rants are ignored.

replies(1): >>45778768 #

206. sothatsit ◴[01 Nov 25 02:24 UTC] No.45778768{14}[source]▶

>>45778763 #

They did not use modern techniques. Therefore it is meaningless.

That’s not to mention that modern frontier LLMs can also be demonstrated to do this task, which is an existence proof in and of itself.

replies(1): >>45778776 #

207. riku_iki ◴[01 Nov 25 02:26 UTC] No.45778776{15}[source]▶

>>45778768 #

I am not interested in this discussion anymore. Bye.

replies(1): >>45778802 #

208. sothatsit ◴[01 Nov 25 02:33 UTC] No.45778802{16}[source]▶

>>45778776 #

What a shame

209. tomlockwood ◴[01 Nov 25 02:57 UTC] No.45778896{8}[source]▶

>>45775705 #

I understand perfectly, thank you!!!

210. ffsm8 ◴[01 Nov 25 05:30 UTC] No.45779464{6}[source]▶

>>45778731 #

The point was that the listed argument of production tons of boilerplate code within a short period of time is a... Pointless metric to cite

211. darkwater ◴[01 Nov 25 08:10 UTC] No.45780008{12}[source]▶

>>45774277 #

To our current knowledge, no. But maybe we are missing something, we cannot know. Did infrared light or ultrasound start to exist only when we realized there are things our senses cannot feel?

212. hansmayer ◴[01 Nov 25 08:53 UTC] No.45780209{7}[source]▶

>>45770997 #

Unless you can show us concrete metrics and problems solved, I am inclined not to believe your statement (source: own intensive experience with the LLMs).

213. ◴[01 Nov 25 09:05 UTC] No.45780248{8}[source]▶

>>45771034 #

214. Libidinalecon ◴[01 Nov 25 10:49 UTC] No.45780675{4}[source]▶

>>45772607 #

The problem is in adding the word "just" for no reason.

It makes the statement of a fact a type of rhetorical device.

It is the difference between saying "I am a biological entity" and "I am just a biological entity". There are all kinds of connotations that come along for the ride with the latter statement.

Then there is the counter with the romantic statement that "I am not just a biological entity".

215. hansmayer ◴[01 Nov 25 11:58 UTC] No.45780970{8}[source]▶

>>45772820 #

Exactly my experience too. Whoever says they're able to solve "very complex" problems with LLMs, is clearly not working on objectively complex problems.

↑