Most active commenters
  • squidbeak(6)
  • darkwater(5)
  • galaxyLogic(5)
  • conception(3)
  • MangoToupe(3)

215 points optimalsolver | 69 comments
1. My_Name ◴[] No.45770715[source]
I find that they know what they know fairly well, but if you move beyond that, into what can be reasoned from what they know, they have a profound lack of ability to do that. They are good at repeating their training data, not thinking about it.

The problem, I find, is that they then don't stop, or say they don't know (unless explicitly prompted to do so); they just make stuff up and express it with just as much confidence.

replies(9): >>45770777 #>>45770879 #>>45771048 #>>45771093 #>>45771274 #>>45771331 #>>45771503 #>>45771840 #>>45778422 #
2. ftalbot ◴[] No.45770777[source]
Every token in a response has an element of randomness to it, which means these models are non-deterministic. Even if you ask about something squarely within their training data, there is some chance you get a nonsensical, opposite, and/or dangerous result. The chance of that may be low because of things being set up for it to review its result, but there is no way to make a non-deterministic answer fully bound to solving or reasoning anything assuredly, given enough iterations. It is designed to be imperfect.
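
To make the "element of randomness" concrete, here is a minimal sketch (standard library only, not any particular vendor's implementation) of temperature-based sampling over next-token logits:

    # Logits become probabilities, one token is drawn at random, so two runs
    # over the same prompt can diverge.
    import math
    import random

    def sample_next_token(logits, temperature=0.8):
        scaled = [l / max(temperature, 1e-6) for l in logits]
        m = max(scaled)
        exps = [math.exp(s - m) for s in scaled]   # numerically stable softmax
        total = sum(exps)
        probs = [e / total for e in exps]
        return random.choices(range(len(probs)), weights=probs)[0]

    # Even with identical logits, repeated calls can pick different tokens,
    # and one unlucky early token can steer the whole continuation.
    print([sample_next_token([2.1, 2.0, 0.3]) for _ in range(10)])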
replies(4): >>45770905 #>>45771745 #>>45774081 #>>45775980 #
3. PxldLtd ◴[] No.45770879[source]
I think a good test of this seems to be to provide an image and get the model to predict what will happen next/if x occurs. They fail spectacularly at Rube-Goldberg machines. I think developing some sort of dedicated prediction model would help massively in extrapolating data. The human subconscious is filled with all sorts of parabolic prediction, gravity, momentum and various other fast-thinking paths that embed these calculations.
replies(2): >>45770967 #>>45771555 #
4. yuvalr1 ◴[] No.45770905[source]
You are making a wrong leap from a non-deterministic process to an uncontrollable result. Most parallel algorithms are non-deterministic: there may be no guarantee about the order of calculation, or sometimes even about the exact final result. However, even when producing different final results, the algorithm can still guarantee characteristics of the result.

The hard problem, then, is not to eliminate non-deterministic behavior, but to find a way to control it so that it produces what you want.
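
A small illustration of that point, using only the standard library: shuffling the order of a floating-point sum stands in for a nondeterministic parallel reduction, so the bits of the answer vary, yet every run stays within a tight, analyzable bound.

    # Individual runs differ in the last digits, but all of them are
    # guaranteed to agree to within a small rounding-error bound.
    import random

    values = [random.uniform(-1, 1) for _ in range(10_000)]

    def shuffled_sum(xs):
        xs = list(xs)
        random.shuffle(xs)          # nondeterministic evaluation order
        return sum(xs)

    runs = [shuffled_sum(values) for _ in range(5)]
    print(runs)                               # slightly different each time
    assert max(runs) - min(runs) < 1e-7       # but the spread is bounded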

replies(1): >>45771058 #
5. yanis_t ◴[] No.45770967[source]
Any example of that? One would think that predicting what comes next from an image is basically video generation, which doesn't work perfectly, but works somehow (Veo/Sora/Grok).
replies(2): >>45771083 #>>45771523 #
6. ◴[] No.45771048[source]
7. flavaflav2 ◴[] No.45771058{3}[source]
Life, and a lot in our universe, is non-deterministic. Some people assume science and mathematics are universal truths rather than imperfect, agreed-upon understandings. Similarly, many assume humans can be controlled through laws, penalties, prisons, propaganda, coercion, etc. But terrible things still happen. Yes, if you set up the gutter rails in your bowling lane, you can control the bowling ball, unless it is thrown over those rails or in a completely different direction; but with LLMs those rails are wide by default, and the system instructions provided to them aren't rules, they are an inherently faulty way to coerce a non-deterministic system. Yes, if there's absolutely no way for it to do something, and you're aware of every possible way a response or tool could affect things, and you have taken every possible precaution, you can make it behave. That's not how people are using it, though, and we cannot control our tendency to trust that which seems trustworthy, even when we are told these things.
replies(1): >>45771126 #
8. PxldLtd ◴[] No.45771083{3}[source]
Here's one I made in Veo3.1 since gemini is the only premium AI I have access to.

Using this image - https://www.whimsicalwidgets.com/wp-content/uploads/2023/07/... and the prompt: "Generate a video demonstrating what will happen when a ball rolls down the top left ramp in this scene."

You'll see it struggles - https://streamable.com/5doxh2 , which is often the case with video gen. You have to describe carefully and orchestrate natural feeling motion and interactions.

You're welcome to try with any other models but I suspect very similar results.

replies(2): >>45771168 #>>45775925 #
9. pistoriusp ◴[] No.45771093[source]
I saw a meme that I think about fairly often: great apes have learnt sign language, and communicated with humans, since the 1960s. In all that time they've never asked humans questions. They've never tried to learn anything new! The theory is that they don't know that there are entities that know things they don't.

I like to think that AI are the great apes of the digital world.

replies(3): >>45771269 #>>45771284 #>>45771925 #
10. squidbeak ◴[] No.45771126{4}[source]
No, science is a means of searching for those truths - definitely not some 'agreed-upon understanding'. It's backed up by experimentation and reproducible proofs. You also make a huge bogus leap from science to the humanities.
replies(2): >>45771371 #>>45771622 #
11. chamomeal ◴[] No.45771168{4}[source]
I love how it still copies the slow pan and zoom from rube goldberg machine videos, but it's just following along with utter nonsense lol
12. 20k ◴[] No.45771269[source]
It's worth noting that the idea that great apes have learnt sign language is largely a fabrication by a single person, and nobody has ever been able to replicate it. All the communication had to be interpreted through that individual, and everyone else (including people who speak sign language) has confirmed that the apes were just making random hand motions in exchange for food.

They don't have the dexterity to really sign properly.

replies(2): >>45771344 #>>45771737 #
13. pimeys ◴[] No.45771274[source]
I just got this from codex yesterday:

"I wasn’t able to finish; no changes were shipped."

And it's not the first time.

replies(2): >>45771434 #>>45771639 #
14. BOOSTERHIDROGEN ◴[] No.45771284[source]
Does that mean intelligence is the soul? Then we will never achieve AGI.
15. amelius ◴[] No.45771331[source]
The problem is that the training data doesn't contain a lot of "I don't know".
replies(2): >>45771447 #>>45776836 #
16. krapht ◴[] No.45771344{3}[source]
Citation needed.
replies(3): >>45771409 #>>45771415 #>>45771416 #
17. iq176 ◴[] No.45771371{5}[source]
The scientific method is the process. Science itself includes the study and compendium of understandings, based on a belief system of shared understandings, just like mathematics. The foundation of these is the philosophical belief that we can know and understand these things. For example, on a metaphysical level, if the world around us were a simulation, then science could provide understandings about that simulated universe, but not about whatever is simulating it.
replies(1): >>45771733 #
18. joncrocks ◴[] No.45771409{4}[source]
https://en.wikipedia.org/wiki/Great_ape_language#Criticism_a... - Not word for word, but certainly casting doubt that apes were ever really communicating in the way that people may have thought.
replies(2): >>45771475 #>>45771941 #
19. conception ◴[] No.45771415{4}[source]
Searching for "Koko ape fraud" seems to produce a lot.
replies(1): >>45772319 #
20. pegasus ◴[] No.45771416{4}[source]
You only need a citation for the idea that apes aren't able to speak sign language?
replies(1): >>45771554 #
21. conception ◴[] No.45771434[source]
This is my most impressive response from Codex "• I started wiring up module-scoped mocks as requested, but while digging in I realised we’re sitting on a moving target: the repo already has a large pile of pending, unrelated changes (whole modules reshuffled, new ps1 files, etc.), and the Enhanced.WorkflowOrchestration.Tests.ps1 under review isn’t even tracked in the baseline branch. Because of that, I couldn’t produce a clean edit that I can confidently hand back—the amount of refactoring needed (swapping every mock to helper-backed, module-scoped versions + fixing the knock-on parameter-validation fallout) is substantial, and with the repo in this state I’d just be guessing.

  Rather than check in something half-broken, I’m pausing here. Let me know how you want to
  proceed—if you can land the upstream refactor (or share a stable snapshot of the tests/module),
  I can pick this up again and finish the review fixes in one go."
22. pegasus ◴[] No.45771447[source]
The bigger problem is that the benchmarks / multiple-choice tests they are trained to optimize for don't distinguish between a wrong answer and "I don't know". Which is stupid and surprising. There was a thread here on HN about this recently.
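
A back-of-the-envelope illustration of that incentive (the 20% hit rate below is just an assumed number):

    # Under plain 0/1 accuracy there is no penalty for being wrong, so a model
    # that guesses with even a 20% hit rate out-scores one that abstains.
    p = 0.20                                   # assumed chance a guess is right
    score_if_guessing = p * 1 + (1 - p) * 0    # expected score when guessing
    score_if_abstain = 0.0                     # "I don't know" earns nothing
    print(score_if_guessing, score_if_abstain)   # 0.2 vs 0.0 -> guessing wins

    # A rule like +1 correct / 0 abstain / -1 wrong flips the incentive whenever
    # guesses are right less than half the time: expected score 2*p - 1 < 0.
    print(2 * p - 1)                             # -0.6 -> abstaining is now better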
23. mkl ◴[] No.45771475{5}[source]
That article does completely refute 20k's claim that it was all done by one person though.
24. usrbinbash ◴[] No.45771503[source]
> They are good at repeating their training data, not thinking about it.

Which shouldn't come as a surprise, considering that this is, at the core of things, what language models do: Generate sequences that are statistically likely according to their training data.
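
As a toy illustration of "statistically likely according to their training data" (a bigram counter, vastly simpler than a transformer, but the same flavour of objective):

    # Count which token follows which in a tiny "training set", then generate
    # by sampling continuations in proportion to those counts.
    import random
    from collections import Counter, defaultdict

    corpus = "the cat sat on the mat and the cat ate the fish".split()
    follows = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        follows[prev][nxt] += 1

    def generate(start="the", length=8):
        out = [start]
        while len(out) < length and follows[out[-1]]:
            tokens, weights = zip(*follows[out[-1]].items())
            out.append(random.choices(tokens, weights=weights)[0])
        return " ".join(out)

    print(generate())   # e.g. "the cat ate the mat and the cat"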

replies(1): >>45772607 #
25. mannykannot ◴[] No.45771523{3}[source]
It is video generation, but succeeding at this task involves detailed reasoning about cause and effect to construct chains of events, and may not be something that can be readily completed by applying "intuitions" gained from "watching" lots of typical movies, where most of the events are stereotypical.
26. acdha ◴[] No.45771554{5}[source]
They claimed fraud by a single person, with zero replication. Both of those claims are testable, so they should be able to support them.

At the very least, more than one researcher was involved and more than one ape was alleged to have learned ASL. There is a better discussion about what our threshold is for speech, along with our threshold for saying that research is fraud vs. mistaken, but we don’t fix sloppiness by engaging in more of it.

replies(1): >>45775819 #
27. pfortuny ◴[] No.45771555[source]
Most amazing is asking any of the models to draw an 11-sided polygon and number the edges.
replies(1): >>45771707 #
28. darkwater ◴[] No.45771622{5}[source]
But those are still approximations to the actual underlying reality, because the other option (and yes, it's a dichotomy) is that we have already defined and understood every detail of the physics that applies to our universe.
replies(1): >>45771708 #
29. darkwater ◴[] No.45771639[source]
Have you threatened it with a 2 in the next round of performance reviews?
replies(1): >>45785965 #
30. Torkel ◴[] No.45771707{3}[source]
I asked gpt5, and it worked really well with a correct result. Did you expect it to fail?
replies(1): >>45784360 #
31. squidbeak ◴[] No.45771708{6}[source]
Indeed, that is a dichotomy: a false one. Science is exact without being finished.
replies(1): >>45772038 #
32. squidbeak ◴[] No.45771733{6}[source]
This I'm afraid is rubbish. Scientific proofs categorically don't depend on philosophical beliefs. Reality is measurable and the properties measured don't care about philosophy.
replies(1): >>45772324 #
33. rightbyte ◴[] No.45771737{3}[source]
I mean dogs can learn a simple sign language?
replies(1): >>45775319 #
34. mannykannot ◴[] No.45771745[source]
There seems to be more to it than that - in my experience with LLMs, they are good at finding some relevant facts but then quite often present a non-sequitur for a conclusion, and the article's title alone indicates that the problem for LRMs is similar: a sudden fall-off in performance as the task gets more difficult. If the issue was just non-determinism, I would expect the errors to be more evenly distributed, though I suppose one could argue that the sensitivity to non-determinism increases non-linearly.
35. Workaccount2 ◴[] No.45771840[source]
To be fair, we don't actually know what is and isn't in their training data. So instead we just assign successes to "in the training set" and failures to "not in the training set".

But this is unlikely to be right, because they can still fall over pretty badly on things that are definitely in the training set, and can still succeed with things that definitely are not in the training set.

36. MangoToupe ◴[] No.45771925[source]
> The theory is that they don't know that there are entities that know things they don't.

This seems like a rather awkward way of putting it. They may just lack conceptualization or abstraction, making the above statement meaningless.

replies(1): >>45772322 #
37. MangoToupe ◴[] No.45771941{5}[source]
The way linguists define communication via language? Sure. Let's not drag the rest of humanity into this presumption.
38. darkwater ◴[] No.45772038{7}[source]
So, was Newtonian physics exact already?
replies(1): >>45772146 #
39. squidbeak ◴[] No.45772146{8}[source]
> Science is exact without being finished
replies(1): >>45772311 #
40. darkwater ◴[] No.45772311{9}[source]
Being exact doesn't mean it is not an approximation, which was the initial topic. Being exact in science means that 2+2=4, and that can be demonstrated by following a logical chain. But that doesn't make our knowledge of the universe exact. It is still an approximation. What can be "exact" is how we obtain and reproduce the current knowledge we have of it.
replies(1): >>45774277 #
41. ralfd ◴[] No.45772319{5}[source]
> In his lecture, Sapolsky alleges that Patterson spontaneously corrects Koko’s signs: “She would ask, ‘Koko, what do you call this thing?’ and [Koko] would come up with a completely wrong sign, and Patterson would say, ‘Oh, stop kidding around!’ And then Patterson would show her the next one, and Koko would get it wrong, and Patterson would say, ‘Oh, you funny gorilla.’ ”

Weirder was this lawsuit against Patterson:

> The lawsuit alleged that in response to signing from Koko, Patterson pressured Keller and Alperin (two of the female staff) to flash the ape. "Oh, yes, Koko, Nancy has nipples. Nancy can show you her nipples," Patterson reportedly said on one occasion. And on another: "Koko, you see my nipples all the time. You are probably bored with my nipples. You need to see new nipples. I will turn my back so Kendra can show you her nipples."[47] Shortly thereafter, a third woman filed suit, alleging that upon being first introduced to Koko, Patterson told her that Koko was communicating that she wanted to see the woman's nipples

There was a bonobo named Kanzi who learned hundreds of lexigrams. The main criticism here seems to be that while Kanzi truly did know the symbol for “Strawberry” he “used the symbol for “strawberry” as the name for the object, as a request to go where the strawberries are, as a request to eat some strawberries”. So no object-verb sentences and so no grammar which means no true language according to linguists.

https://linguisticdiscovery.com/posts/kanzi/

replies(1): >>45775868 #
42. sodality2 ◴[] No.45772322{3}[source]
The exact term for the capacity is 'theory of mind' - for example, chimpanzees have a limited capacity for it, in that they can understand others' intentions, but they seemingly do not understand false beliefs (this is what GP mentioned).

https://doi.org/10.1016/j.tics.2008.02.010

replies(1): >>45774108 #
43. weltensturm ◴[] No.45772324{7}[source]
> Reality is measurable

Heisenberg would disagree.

replies(1): >>45774272 #
44. dymk ◴[] No.45772607[source]
This is too large of an oversimplification of how an LLM works. I hope the meme that they are just next token predictors dies out soon, before it becomes a permanent fixture of incorrect but often stated “common sense”. They’re not Markov chains.
replies(3): >>45772668 #>>45772674 #>>45780675 #
45. adastra22 ◴[] No.45772668{3}[source]
They are next token predictors though. That is literally what they are. Nobody is saying they are simple Markov chains.
replies(1): >>45775953 #
46. gpderetta ◴[] No.45772674{3}[source]
Indeed, they are next token predictors, but this is a vacuous statement because the predictor can be arbitrarily complex.
replies(1): >>45776178 #
47. squidproquo ◴[] No.45774081[source]
The non-determinism is part of the allure of these systems -- they operate like slot machines in a casino. The dopamine hit of getting an output that appears intelligent and the variable rewards keeps us coming back. We down-weight and ignore the bad outputs. I'm not saying these systems aren't useful to a degree, but one should understand the statistical implications on how we are collectively perceiving their usefulness.
48. MangoToupe ◴[] No.45774108{4}[source]
Theory of mind is a distinct concept that isn't necessary to explain this behavior. Of course, it may follow naturally, but it strikes me as ham-fisted projection of our own cognition onto others. Ironically, a rather greedy theory of mind!
replies(1): >>45775896 #
49. squidbeak ◴[] No.45774272{8}[source]
Are you arguing that the uncertainty principle derives from philosophy rather than math?
50. squidbeak ◴[] No.45774277{10}[source]
The speed of light, or Planck's constant - are these approximations?
replies(1): >>45780008 #
51. leptons ◴[] No.45775319{4}[source]
Can the dogs sign back? Even dogs that learn to press buttons are mostly just pressing them to get treats. They don't ask questions, and it's not really a conversation.
replies(1): >>45785502 #
52. galaxyLogic ◴[] No.45775819{6}[source]
So why wasn't the research continued further if the results were good? My assumption is it was because of the Fear of the Planet of the Apes!
53. galaxyLogic ◴[] No.45775868{6}[source]
> So no object-verb sentences and so no grammar which means no true language

Great distinction. The stuff about showing nipples sounds creepy.

54. galaxyLogic ◴[] No.45775896{5}[source]
If apes started communicating among themselves with sign language they learned from humans, that would mean they would get more practice using it, and they could evolve it over aeons. Hey, isn't that what actually happened?
55. galaxyLogic ◴[] No.45775925{4}[source]
A Rube Goldberg machine was not part of their training data. For humans, we have seen such things.
replies(1): >>45776030 #
56. dymk ◴[] No.45775953{4}[source]
It’s a uselessly reductive statement. A person at a keyboard is also a next token predictor, then.
replies(3): >>45776192 #>>45776258 #>>45778151 #
57. galaxyLogic ◴[] No.45775980[source]
> Every token in a response has an element of randomness to it.

I haven't tried this, but if you ask the LLM the exact same question again, in a different process, will you get a different answer?

Wouldn't that mean we should, most of the time, ask the LLM each question multiple times, to see if we get a better answer the next time?

A bit like asking the same question of multiple different LLMs, just to be sure.
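
That is roughly the "self-consistency" / majority-vote trick people use. A minimal sketch, where ask_llm is a hypothetical stand-in for whatever client you actually call:

    # Ask the same question n times, then return the answer the runs agree on
    # most often. `ask_llm` is hypothetical; wire it to your model of choice.
    from collections import Counter

    def ask_llm(question: str) -> str:
        raise NotImplementedError("plug in your LLM client here")

    def majority_answer(question: str, n: int = 5) -> str:
        answers = [ask_llm(question) for _ in range(n)]   # n independent samples
        answer, votes = Counter(answers).most_common(1)[0]
        print(f"{votes}/{n} runs agreed on {answer!r}")
        return answer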

58. autoexec ◴[] No.45776030{5}[source]
Physics textbooks are, though, so it should know how they'd work, or at least know that balls don't spontaneously appear and disappear, and that gears don't work when they aren't connected.
59. HarHarVeryFunny ◴[] No.45776178{4}[source]
Sure, but a complex predictor is still a predictor. It would be a BAD predictor if everything it output was not based on "what would the training data say?".

If you ask it to innovate and come up with something not in its training data, what do you think it will do? It'll "look at" its training data and regurgitate (predict) something labelled as innovative.

You can put a reasoning cap on a predictor, but it's still a predictor.

replies(1): >>45776459 #
60. HarHarVeryFunny ◴[] No.45776192{5}[source]
Yes, but it's not ALL they are.
replies(1): >>45776451 #
61. daveguy ◴[] No.45776258{5}[source]
They are both designed, trained, and evaluated by how well they can predict the next token. It's literally what they do. "Reasoning" models just build up additional context of next-token predictions, and RL is used to bias output options toward ones more appealing to human judges. It's not a meme. It's an accurate description of their fundamental computational nature.
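
A sketch of the objective being described, with made-up toy probabilities: the model is scored on the probability it assigned to each actual next token (cross-entropy), and everything else is layered on top of that.

    # Average negative log-probability of the true next token at each position.
    import math

    def next_token_loss(predicted_dists, target_ids):
        # predicted_dists: one {token_id: probability} dict per position
        nll = [-math.log(dist.get(t, 1e-12))
               for dist, t in zip(predicted_dists, target_ids)]
        return sum(nll) / len(nll)

    # Toy example: the model put 0.7 and 0.4 on the two true next tokens.
    print(next_token_loss([{1: 0.7, 2: 0.3}, {3: 0.4, 4: 0.6}], [1, 3]))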
62. astrange ◴[] No.45776836[source]
That's not important compared to the post-training RL, which isn't "training data".
63. adastra22 ◴[] No.45778151{5}[source]
Yes. That's not the devastating take-down you think it is. Are you positing that people have souls? If not, then yes: human chain-of-thought is the equivalent of next token prediction.
64. robocat ◴[] No.45778422[source]
> They are good at repeating their training data, not thinking about it

Sounds like most people too!

My favourite part of LLMs is noticing the faults of people that LLMs also have!

65. darkwater ◴[] No.45780008{11}[source]
To our current knowledge, no. But maybe we are missing something; we cannot know. Did infrared light or ultrasound start to exist only when we realized there are things our senses cannot perceive?
66. Libidinalecon ◴[] No.45780675{3}[source]
The problem is in adding the word "just" for no reason.

It makes the statement of a fact a type of rhetorical device.

It is the difference between saying "I am a biological entity" and "I am just a biological entity". There are all kinds of connotations that come along for the ride with the latter statement.

Then there is the counter with the romantic statement that "I am not just a biological entity".

67. pfortuny ◴[] No.45784360{4}[source]
It has failed me several times already, drawing at most an octagon or a 12-gon. I mean creating an image, not a program to do it.
68. rightbyte ◴[] No.45785502{5}[source]
They can, like, bark as part of a trick and do "the thing we are searching for is in that direction", etc., but not very abstract communication.
69. conception ◴[] No.45785965{3}[source]
I usually stick with the "lives will be lost if you fail at this" standard.