Most active commenters

measurablefunc(25)
godelski(13)
(9)
JohnKemeny(6)
bubblyworld(5)
Ukv(5)
Certhas(5)
jibal(5)
vidarh(4)
PaulHoule(4)

Popular/hot comments

>>45120049 #
>>45120259 #
>>45120536 #
>>45124182 #
>>45121215 #
>>45120655 #
>>45120369 #
>>45120344 #
>>45118686 #
>>45120411 #
>>45120336 #
>>45120313 #
>>45121110 #
>>45119709 #
>>45123454 #
>>45119694 #

The wall confronting large language models

(arxiv.org)

1. Scene_Cast2 ◴[03 Sep 25 17:58 UTC] No.45118686[source]▶

The paper is hard to read. There is no concrete worked-through example, the prose is over the top, and the equations don't really help. I can't make head or tail of this paper.

replies(3): >>45118775 #>>45119154 #>>45120083 #

2. lumost ◴[03 Sep 25 18:06 UTC] No.45118775[source]▶

>>45118686 #

This appears to be a position paper written by authors outside of their core field. The presentation of "the wall" is only through analogy to derivatives on the discrete values computer's operate in.

replies(2): >>45119119 #>>45119709 #

3. joe_the_user ◴[03 Sep 25 18:43 UTC] No.45119119{3}[source]▶

>>45118775 #

Paper seems to involve a series of analogies and equations. However, I think if the equations accepted, the "wall" is actually derived.

The authors are computer scientists and people who work with large scale dynamic system. They aren't people who've actually produced an industry-scale LLM. However, I have to note that despite lots of practical progress in deep learning/transformers/etc systems, all the theory involved just analogies and equations of a similar sort, it's all alchemy and so people really good at producing these models seem to be using a bunch of effective rules of thumb and not any full or established models (despite books claiming to offer a mathematical foundation for enterprise, etc).

Which is to say, "outside of core competence" doesn't mean as much as it would for medicine or something.

replies(2): >>45119694 #>>45127357 #

4. ◴[03 Sep 25 18:47 UTC] No.45119154[source]▶

>>45118686 #

5. 18cmdick ◴[03 Sep 25 19:17 UTC] No.45119477[source]▶

>>45114579 (OP) #

Grifters in shambles.

6. ACCount37 ◴[03 Sep 25 19:43 UTC] No.45119694{4}[source]▶

>>45119119 #

No, that's all the more reason to distrust major, unverified claims made by someone "outside of core competence".

Applied demon summoning is ruled by empiricism and experimentation. The best summoners in the field are the ones who have a lot of practical experience and a sharp, honed intuition for the bizarre dynamics of the summoning process. And even those very summoners, specialists worth their weight in gold, are slaves to the experiment! Their novel ideas and methods and refinements still fail more often than they succeed!

One of the first lessons you have to learn in the field is that of humility. That your "novel ideas" and "brilliant insights" are neither novel nor brilliant - and the only path to success lies through things small and testable, most of which do not survive the test.

With that, can you trust the demon summoning knowledge of someone who has never drawn a summoning diagram?

replies(3): >>45119735 #>>45120082 #>>45120250 #

7. jibal ◴[03 Sep 25 19:45 UTC] No.45119709{3}[source]▶

>>45118775 #

If you look at their other papers, you will see that this is very much within their core field.

replies(3): >>45119914 #>>45120336 #>>45124453 #

8. jibal ◴[03 Sep 25 19:48 UTC] No.45119735{5}[source]▶

>>45119694 #

Somehow the game of telephone took us from "outside of their core field" (which wasn't true) to "outside of core competence" (which is grossly untrue).

> One of the first lessons you have to learn in the field is that of humility.

I suggest then that you make your statements less confidently.

9. dcre ◴[03 Sep 25 20:10 UTC] No.45119904[source]▶

>>45114579 (OP) #

Always fun to see a theoretical argument that something clearly already happening is impossible.

replies(2): >>45120040 #>>45120369 #

10. lumost ◴[03 Sep 25 20:12 UTC] No.45119914{4}[source]▶

>>45119709 #

Their other papers are on simulation and applied chemistry. Where does their expertise in Machine Learning, or Large Language Models derive from?

While it's not a requirement to have published in a field before publishing in a field. Having a coauthor who is from the target field or a peer review venue in that field as an entry point certainly raises credibility.

From my limited claim to be in either Machine Learning or Large Language Models the paper does not appear to demonstrate what it claims. The author's language addresses the field of Machine Learning and LLM development as you would a young student - which does not help make their point.

replies(1): >>45132135 #

11. ahartmetz ◴[03 Sep 25 20:28 UTC] No.45120040[source]▶

>>45119904 #

So where are the recent improvements in LLMs proportional to the billions invested?

replies(1): >>45120367 #

12. measurablefunc ◴[03 Sep 25 20:29 UTC] No.45120049[source]▶

>>45114579 (OP) #

There is a formal extensional equivalence between Markov chains & LLMs but the only person who seems to be saying anything about this is Gary Marcus. He is constantly making the point that symbolic understanding can not be reduced to a probabilistic computation regardless of how large the graph gets it will still be missing basic stuff like backtracking (which is available in programming languages like Prolog). I think that Gary is right on basically all counts. Probabilistic generative models are fun but no amount of probabilistic sequence generation can be a substitute for logical reasoning.

replies(16): >>45120249 #>>45120259 #>>45120415 #>>45120573 #>>45120628 #>>45121159 #>>45121215 #>>45122702 #>>45122805 #>>45123808 #>>45123989 #>>45125478 #>>45125935 #>>45129038 #>>45130942 #>>45131644 #

13. cwmoore ◴[03 Sep 25 20:32 UTC] No.45120082{5}[source]▶

>>45119694 #

Your passions may have run away with you.

https://news.ycombinator.com/item?id=45114753

14. boznz ◴[03 Sep 25 20:48 UTC] No.45120249[source]▶

>>45120049 #

logical reasoning is also based on probability weights, most of the time that probability is so close to 100% that it can be assumed to be true without consequence.

replies(1): >>45120935 #

15. ForHackernews ◴[03 Sep 25 20:48 UTC] No.45120250{5}[source]▶

>>45119694 #

The freshly-summoned Gaap-5 was rumored to be the most accursed spirit ever witnessed by mankind, but so far it seems not dramatically more evil than previous demons, despite having been fed vastly more humans souls.

replies(1): >>45120510 #

16. Certhas ◴[03 Sep 25 20:48 UTC] No.45120259[source]▶

>>45120049 #

I don't understand what point you're hinting at.

Either way, I can get arbitrarily good approximations of arbitrary nonlinear differential/difference equations using only linear probabilistic evolution at the cost of a (much) larger state space. So if you can implement it in a brain or a computer, there is a sufficiently large probabilistic dynamic that can model it. More really is different.

So I view all deductive ab-initio arguments about what LLMs can/can't do due to their architecture as fairly baseless.

(Note that the "large" here is doing a lot of heavy lifting. You need _really_ large. See https://en.m.wikipedia.org/wiki/Transfer_operator)

replies(5): >>45120313 #>>45120341 #>>45120344 #>>45123837 #>>45124441 #

17. arduanika ◴[03 Sep 25 20:54 UTC] No.45120313{3}[source]▶

>>45120259 #

What hinting? The comment was very clear. Arbitrarily good approximation is different from symbolic understanding.

"if you can implement it in a brain"

But we didn't. You have no idea how a brain works. Neither does anyone.

replies(3): >>45120357 #>>45120411 #>>45121006 #

18. JohnKemeny ◴[03 Sep 25 20:57 UTC] No.45120336{4}[source]▶

>>45119709 #

He's a chemist. Lots of chemists and physicists like to talk about computation without having any background in it.

I'm not saying anything about the content, merely making a remark.

replies(3): >>45120611 #>>45122263 #>>45122690 #

19. awesome_dude ◴[03 Sep 25 20:57 UTC] No.45120341{3}[source]▶

>>45120259 #

I think that the difference can be best explained thus:

I guess that you are most likely going to have cereal for breakfast tomorrow, I also guess that it's because it's your favourite.

I understand that you don't like cereal for breakfast, and I understand that you only have it every day because a Dr told you that it was the only way for you to start the day in a way that aligns with your health and dietary needs.

Meaning, I can guess based on past behaviour and be right, but understanding the reasoning for those choices, that's a whole other ballgame. Further, if we do end up with an AI that actually understands, well, that would really open up creativity, and problem solving.

replies(1): >>45120879 #

20. measurablefunc ◴[03 Sep 25 20:57 UTC] No.45120344{3}[source]▶

>>45120259 #

What part about backtracking is baseless? Typical Prolog interpreters can be implemented in a few MBs of binary code (the high level specification is even simpler & can be in a few hundred KB)¹ but none of the LLMs (open source or not) are capable of backtracking even though there is plenty of room for a basic Prolog interpreter. This seems like a very obvious shortcoming to me that no amount of smooth approximation can overcome.

If you think there is a threshold at which point some large enough feedforward network develops the capability to backtrack then I'd like to see your argument for it.

¹https://en.wikipedia.org/wiki/Warren_Abstract_Machine

replies(3): >>45120516 #>>45121626 #>>45124764 #

21. Certhas ◴[03 Sep 25 20:59 UTC] No.45120357{4}[source]▶

>>45120313 #

We didn't but somebody did so it's possible so probabilistic dynamics in high enough dimensions can do it.

We don't understand what LLMs are doing. You can't go from understanding what a transformer is to understanding what an LLM does any more than you can go from understanding what a Neuron is to what a brain does.

22. dcre ◴[03 Sep 25 21:00 UTC] No.45120367{3}[source]▶

>>45120040 #

Value for the money is not at issue in the paper!

replies(2): >>45120399 #>>45120659 #

23. crowbahr ◴[03 Sep 25 21:00 UTC] No.45120369[source]▶

>>45119904 #

Really? It sure seems like we're at the top of the S curve with LLMs. Wiring them up to talk the themselves as reasoning isn't scaling the core models, which have only made incremental gains for all the billions invested.

There's plenty more room to grow with agents and tooling, but the core models are only slightly bumping YoY rather than the rocketship changes of 2022/23.

replies(3): >>45121228 #>>45123377 #>>45125522 #

24. ahartmetz ◴[03 Sep 25 21:03 UTC] No.45120399{4}[source]▶

>>45120367 #

I believe it is. They are saying that LLMs don't improve all that much from giving them more resources - and computing power (and input corpus size) is pretty proportional to money.

25. mallowdram ◴[03 Sep 25 21:04 UTC] No.45120411{4}[source]▶

>>45120313 #

We know the healthy brain is unpredictable. We suspect error minimization and prediction are not central tenets. We know the brain creates memory via differences in sharp wave ripples. That it's oscillatory. That it neither uses symbols nor represents. That words are wholly external to what we call thought. The authors deal with molecules which are neither arbitrary nor specific. Yet tumors ARE specific, while words are wholly arbitrary. Knowing these things should offer a deep suspicion of ML/LLMs. They have so little to do with how brains work and the units brains actually use (all oscillation is specific, all stats emerge from arbitrary symbols and worse: metaphors) that mistaking LLMs for reasoning/inference is less lexemic hallucination and more eugenic.

replies(3): >>45120774 #>>45120824 #>>45124688 #

26. logicchains ◴[03 Sep 25 21:06 UTC] No.45120415[source]▶

>>45120049 #

LLMs are not formally equivalent to Markov chains, they're more powerful; transformers with sufficient chain of thought can solve any problem in P: https://arxiv.org/abs/2310.07923.

replies(1): >>45120484 #

27. klawed ◴[03 Sep 25 21:09 UTC] No.45120432[source]▶

>>45114579 (OP) #

> avoidance, which we also discuss in this paper, necessitates putting a much higher premium on insight and understanding of the structural characteristics of the problems being investigated.

I wonder if the authors are aware of The Bitter Lesson

28. measurablefunc ◴[03 Sep 25 21:15 UTC] No.45120484{3}[source]▶

>>45120415 #

If you think there is a mistake in this argument then I'd like to know where it is: https://markov.dk.workers.dev/.

replies(2): >>45122838 #>>45124408 #

29. lazide ◴[03 Sep 25 21:17 UTC] No.45120510{6}[source]▶

>>45120250 #

Perhaps we’re reaching peak demon?

30. bondarchuk ◴[03 Sep 25 21:18 UTC] No.45120516{4}[source]▶

>>45120344 #

Backtracking makes sense in a search context which is basically what prolog is. Why would you expect a next-token-predictor to do backtracking and what should that even look like?

replies(2): >>45120536 #>>45120766 #

31. measurablefunc ◴[03 Sep 25 21:20 UTC] No.45120536{5}[source]▶

>>45120516 #

I don't expect a Markov chain to be capable of backtracking. That's the point I am making. Logical reasoning as it is implemented in Prolog interpreters is not something that can be done w/ LLMs regardless of the size of their weights, biases, & activation functions between the nodes in the graph.

replies(4): >>45120598 #>>45121266 #>>45122145 #>>45124657 #

32. jules ◴[03 Sep 25 21:24 UTC] No.45120573[source]▶

>>45120049 #

What does this predict about LLMs ability to win gold at the International Mathematical Olympiad?

replies(2): >>45120671 #>>45122931 #

33. bondarchuk ◴[03 Sep 25 21:27 UTC] No.45120598{6}[source]▶

>>45120536 #

Imagine the context window contains A-B-C, C turns out a dead end and we want to backtrack to B and try another branch. Then the LLM could produce outputs such that the context window would become A-B-C-[backtrack-back-to-B-and-don't-do-C] which after some more tokens could become A-B-C-[backtrack-back-to-B-and-don't-do-C]-D. This would essentially be backtracking and I don't see why it would be inherently impossible for LLMs as long as the different branches fit in context.

replies(1): >>45120655 #

34. chermi ◴[03 Sep 25 21:29 UTC] No.45120611{5}[source]▶

>>45120336 #

You're really not saying anything? Just a random remark with no bearing?

Seth Lloyd, Wolpert, Landauer, Bennet, Fredkin, Feynman, Sejnowski, Hopfield, Zechinna, parisi,mezard, and zdebvora, Crutchfeld, Preskill, Deutsch, Manin, Szilard, MacKay....

I wish someone told them to shut up about computing. And I wouldn't dare claim von Neumann as merely a physicist, but that's where he was coming from. Oh and as much as I dislike him, Wolfram.

replies(1): >>45124354 #

35. Anon84 ◴[03 Sep 25 21:31 UTC] No.45120628[source]▶

>>45120049 #

There definitely is, but Marcus is not the only one talking about it. For example, we covered this paper in one of our internal journal clubs a few weeks ago: https://arxiv.org/abs/2410.02724

replies(1): >>45122852 #

36. measurablefunc ◴[03 Sep 25 21:35 UTC] No.45120655{7}[source]▶

>>45120598 #

If you think it is possible then I'd like to see an implementation of a sudoku puzzle solver as Markov chain. This is a simple enough problem that can be implemented in a few dozen lines of Prolog but I've never seen a solver implemented as a Markov chain.

replies(4): >>45120792 #>>45120895 #>>45120900 #>>45121022 #

37. 42lux ◴[03 Sep 25 21:35 UTC] No.45120659{4}[source]▶

>>45120367 #

It's not about value it's about the stagnation while throwing compute at the problem.

replies(1): >>45120725 #

38. measurablefunc ◴[03 Sep 25 21:37 UTC] No.45120671{3}[source]▶

>>45120573 #

Same thing it does about their ability to drive cars.

replies(1): >>45123195 #

39. dcre ◴[03 Sep 25 21:45 UTC] No.45120725{5}[source]▶

>>45120659 #

Exactly.

40. PaulHoule ◴[03 Sep 25 21:51 UTC] No.45120766{5}[source]▶

>>45120516 #

If you want general-purpose generation than it has to be able to respect constraints (e.g. figure art of a person has 0..1 belly buttons, 0..2 legs is unspoken) as it is generative models usually get those things right but don't always if they can stick together the tiles they use internally in some combination that makes sense locally but not globally.

General intelligence may not be SAT/SMT solving but it has to be able to do it, hence, backtracking.

Today I had another of those experiences of the weaknesses of LLM reasoning, one that happens a lot when doing LLM-assisted coding. I was trying to figure out how to rebuild some CSS after the HTML changed for accessibility purposes and got a good idea for how to do it from talking to the LLM but at that point the context was poisoned, probably because there was a lot of content about the context describing what we were thinking about at different stages of the conversation which evolved considerably. It lost its ability to follow instructions and I'd tell it specifically to do this or do that and it just wouldn't do it properly and this happens a lot if a session goes on too long.

My guess is that the attention mechanism is locking on to parts of the conversation which are no longer relevant to where I think we're at and in general the logic that considers the variation of either a practice (instances) or a theory over time is a very tricky problem and 'backtracking' is a specific answer for maintaining your knowledge base across a search process.

replies(2): >>45121102 #>>45123910 #

41. Zigurd ◴[03 Sep 25 21:52 UTC] No.45120774{5}[source]▶

>>45120411 #

"That words are wholly external to what we call thought." may be what we should learn, or at least hypothesize, based on what we see LLMs doing. I'm disappointed that AI isn't more of a laboratory for understanding brain architecture, and precisely what is this thing called thought.

replies(1): >>45121279 #

42. sudosysgen ◴[03 Sep 25 21:54 UTC] No.45120792{8}[source]▶

>>45120655 #

You can do that pretty trivially for any fixed size problem (as in solvable with a fixed-sized tape Turing machine), you'll just have a titanically huge state space. The claim of the LLM folks is that the models have a huge state space (they do have a titanically huge state space) and can navigate it efficiently.

Simply have a deterministic Markov chain where each state is a possible value of the tape+state of the TM and which transitions accordingly.

replies(1): >>45120915 #

43. quantummagic ◴[03 Sep 25 21:58 UTC] No.45120824{5}[source]▶

>>45120411 #

What do you think about the idea that LLMs are not reasoning/inferring, but are rather an approximation of the result? Just like you yourself might have to spend some effort reasoning, on how a plant grows, in order to answer questions about that subject. When asked, you wouldn't replicate that reasoning, instead you would recall the crystallized representation of the knowledge you accumulated while previously reasoning/learning. The "thinking" in the process isn't modelled by the LLM data, but rather by the code/strategies used to iterate over this crystallized knowledge, and present it to the user.

replies(1): >>45121309 #

44. CuriouslyC ◴[03 Sep 25 22:04 UTC] No.45120876[source]▶

>>45114579 (OP) #

This article is accurate. That's why I'm investigating a bayesian symbolic lisp reasoner. It's incapable of hallucinating, it provides auditable traces which are actual programs and it kicks the crap out of LLMs at stuff like Arc-Agi, symbolic reasoning, logic programs, game playing, etc. I'm working on a paper where I show that the same model can break 80 on arc-agi, run the house by counting cards at blackjack, and solve complex mathematical word problems.

replies(1): >>45121147 #

45. quantummagic ◴[03 Sep 25 22:04 UTC] No.45120879{4}[source]▶

>>45120341 #

How are the two cases you present fundamentally different? Aren't they both the same _type_ of knowledge? Why do you attribute "true understanding" to the case of knowing what the Dr said? Why stop there? Isn't true understanding knowing why we trust what the doctor said (all those years of schooling, and a presumption of competence, etc)? And why stop there? Why do we value years of schooling? Understanding, can always be taken to a deeper level, but does that mean we didn't "truly" understand earlier? And aren't the data structures needed to encode the knowledge, exactly the same for both cases you presented?

replies(1): >>45121027 #

46. bboygravity ◴[03 Sep 25 22:06 UTC] No.45120895{8}[source]▶

>>45120655 #

The LLM can just write the Prolog and solve the sudoku that way. I don't get your point. LLMs like Grok 4 can probably one-shot this today with the current state of art. You can likely just ask it to solve any sudoku and it will do it (by writing code in the background and running it and returning the result). And this is still very early stage compared to what will be out a year from now.

Why does it matter how it does it or whether this is strictly LLM or LLM with tools for any practical purpose?

replies(1): >>45123059 #

47. Ukv ◴[03 Sep 25 22:06 UTC] No.45120900{8}[source]▶

>>45120655 #

> If you think it is possible then I'd like to see an implementation of a sudoku puzzle solver as Markov chain

Have each of the Markov chain's states be one of 10^81 possible sudoku grids (a 9x9 grid of digits 1-9 and blank), then calculate the 10^81-by-10^81 transition matrix that takes each incomplete grid to the valid complete grid containing the same numbers. If you want you could even have it fill one square at a time rather than jump right to the solution, though there's no need to.

Up to you what you do for ambiguous inputs (select one solution at random to give 1.0 probability in the transition matrix? equally weight valid solutions? have the states be sets of boards and map to set of all valid solutions?) and impossible inputs (map to itself? have the states be sets of boards and map to empty set?).

Could say that's "cheating" by pre-computing the answers and hard-coding them in a massive input-output lookup table, but to my understanding that's also the only sense in which there's equivalence between Markov chains and LLMs.

replies(1): >>45120933 #

48. measurablefunc ◴[03 Sep 25 22:08 UTC] No.45120915{9}[source]▶

>>45120792 #

How are you encoding the state spaces for the sudoku solver specifically?

49. measurablefunc ◴[03 Sep 25 22:11 UTC] No.45120933{9}[source]▶

>>45120900 #

There are multiple solutions for each incomplete grid so how are you calculating the transitions for a grid w/ a non-unique solution?

Edit: I see you added questions for the ambiguities but modulo those choices your solution will almost work b/c it is not extensionally equivalent entirely. The transition graph and solver are almost extensionally equivalent but whereas the Prolog solver will backtrack there is no backtracking in the Markov chain and you have to re-run the chain multiple times to find all the solutions.

replies(2): >>45121032 #>>45121057 #

50. AaronAPU ◴[03 Sep 25 22:11 UTC] No.45120935{3}[source]▶

>>45120249 #

Stunningly, though I have been saying this for 20 years I’ve never come across someone else mention it until now.

replies(1): >>45125963 #

51. jjgreen ◴[03 Sep 25 22:21 UTC] No.45121006{4}[source]▶

>>45120313 #

You can look at it, from the inside.

52. lelanthran ◴[03 Sep 25 22:22 UTC] No.45121022{8}[source]▶

>>45120655 #

> If you think it is possible then I'd like to see an implementation of a sudoku puzzle solver as Markov chain. This is a simple enough problem that can be implemented in a few dozen lines of Prolog but I've never seen a solver implemented as a Markov chain.

I think it can be done. I started a chatbot that works like this some time back (2024) but paused work on it since January.

In brief, you shorten the context by discarding the context that didn't work out.

53. awesome_dude ◴[03 Sep 25 22:23 UTC] No.45121027{5}[source]▶

>>45120879 #

When you ask that question, why don't you just use a corpus of the previous answers to get some result?

Why do you need to ask me, isn't a guess based on past answers good enough?

Or, do you understand that you need to know more, you need to understand the reasoning based on what's missing from that post?

replies(1): >>45121696 #

54. ◴[03 Sep 25 22:25 UTC] No.45121032{10}[source]▶

>>45120933 #

55. Ukv ◴[03 Sep 25 22:26 UTC] No.45121057{10}[source]▶

>>45120933 #

> but whereas the Prolog solver will backtrack there is no backtracking in the Markov chain and you have to re-run the chain multiple times to find all the solutions

If you want it to give all possible solutions at once, you can just expand the state space to the power-set of sudoku boards, such that the input board transitions to the state representing the set of valid solved boards.

replies(2): >>45121077 #>>45126405 #

56. measurablefunc ◴[03 Sep 25 22:29 UTC] No.45121077{11}[source]▶

>>45121057 #

That still won't work b/c there is no backtracking. The point is that there is no way to encode backtracking/choice points like in Prolog w/ a Markov chain. The argument you have presented is not extensionally equivalent to the Prolog solver. It is almost equivalent but it's missing choice points for starting at a valid solution & backtracking to an incomplete board to generate a new one. The typical argument for absorbing states doesn't work b/c sudoku is not a typical deterministic puzzle.

replies(1): >>45121229 #

57. XenophileJKO ◴[03 Sep 25 22:34 UTC] No.45121102{6}[source]▶

>>45120766 #

What if you gave the model a tool to "willfully forget" a section of context. That would be easy to make. Hmm I might be onto something.

replies(1): >>45121201 #

58. Animats ◴[03 Sep 25 22:35 UTC] No.45121110[source]▶

>>45114579 (OP) #

That article is weird. They seem obsessed with nuclear reactors. Also, they misunderstand how floating point works.

As one learns at high school, the continuous derivative is the limit of the discrete version as the displacement h is sent to zero. If our computers could afford infinite precision, this statement would be equally good in practice as it is in continuum mathematics. But no computer can afford infinite precision, in fact, the standard double-precision IEEE representation of floating numbers offers an accuracy around the 16th digit, meaning that numbers below 10−16 are basically treated as pure noise. This means that upon sending the displacement h below machine precision, the discrete derivatives start to diverge from the continuum value as roundoff errors then dominate the discretization errors.

Yes, differentiating data has a noise problem. This is where gradient followers sometimes get stuck. A low pass filter can help by smoothing the data so the derivatives are less noisy. But is that relevant to LLMs? A big insight in machine learning optimization was that, in a high dimensional space, there's usually some dimension with a significant signal, which gets you out of local minima. Most machine learning is in high dimensional spaces but with low resolution data points.

replies(3): >>45121287 #>>45122475 #>>45124430 #

59. leptons ◴[03 Sep 25 22:40 UTC] No.45121147[source]▶

>>45120876 #

LLMs are also incapable of "hallucinating", so maybe that isn't the buzzword you should be using.

60. tim333 ◴[03 Sep 25 22:41 UTC] No.45121159[source]▶

>>45120049 #

Humans can do symbolic understanding that seems to rest on a rather flakey probabilistic neural network in our brains, or at least mine does. I can do maths and the like but there's quite a lot of trial and error and double checking things involved.

GPT5 said it thinks it's fixable when I asked it:

>Marcus is right that LLMs alone are not the full story of reasoning. But the evidence so far suggests the gap can be bridged—either by scaling, better architectures, or hybrid neuro-symbolic approaches.

replies(2): >>45122981 #>>45124687 #

61. PaulHoule ◴[03 Sep 25 22:48 UTC] No.45121201{7}[source]▶

>>45121102 #

I guess you could have some kind of mask that would let you suppress some of the context from matching, but my guess is that kind of thing might cause problems as often as it solves them.

Back when I was thinking about commonsense reasoning with logic it was obviously a much more difficult problem to add things like "P was true before time t", "there will be some time t in the future such at P is true", "John believes Mary believes that P is true", "It is possible that P is true", "there is some person q who believes that P is true", particularly when you combine these qualifiers. For one thing you don't even have a sound and complete strategy for reasoning over first-order logic + arithmetic but you also have a combinatorical explosion over the qualifiers.

Back in the day I thought it was important to have sound reasoning procedures but one of the reasons none of my foundation models ever became ChatGPT was that I cared about that and I really needed to ask "does change C cause an unsound procedure to get the right answer more often?" and not care if the reasoning procedure was sound or not.

62. vidarh ◴[03 Sep 25 22:49 UTC] No.45121215[source]▶

>>45120049 #

> Probabilistic generative models are fun but no amount of probabilistic sequence generation can be a substitute for logical reasoning.

Unless you either claim that humans can't do logical reasoning, or claim humans exceed the Turing computable, then given you can trivially wire an LLM into a Turing complete system, this reasoning is illogical due to Turing equivalence.

And either of those two claims lack evidence.

replies(4): >>45121263 #>>45122313 #>>45123029 #>>45125727 #

63. EMM_386 ◴[03 Sep 25 22:51 UTC] No.45121228{3}[source]▶

>>45120369 #

> the core models are only slightly bumping YoY rather than the rocketship changes of 2022/23

From Anthropic's press release yesterday after raising another $13 billion:

"Anthropic has seen rapid growth since the launch of Claude in March 2023. At the beginning of 2025, less than two years after launch, Anthropic’s run-rate revenue had grown to approximately $1 billion. By August 2025, just eight months later, our run-rate revenue reached over $5 billion—making Anthropic one of the fastest-growing technology companies in history."

$4 billion increase in 8 months. $1 billion every two months.

replies(1): >>45121531 #

64. Ukv ◴[03 Sep 25 22:52 UTC] No.45121229{12}[source]▶

>>45121077 #

> That still won't work b/c there is no backtracking.

It's essentially just a lookup table mapping from input board to the set of valid output boards - there's no real way for it not to work (obviously not practical though). If board A has valid solutions B, C, D, then the transition matrix cell mapping {A} to {B, C, D} is 1.0, and all other entries in that row are 0.0.

> The point is that there is no way to encode backtracking/choice points

You can if you want, keeping the same variables as a regular sudoku solver as part of the Markov chain's state and transitioning instruction-by-instruction, rather than mapping directly to the solution - just that there's no particular need to when you've precomputed the solution.

replies(1): >>45121264 #

65. measurablefunc ◴[03 Sep 25 22:57 UTC] No.45121264{13}[source]▶

>>45121229 #

My point is that your initial argument was missing several key pieces & if you specify the entire state space you will see that it's not as simple as you thought initially. I'm not saying it can't be done but that it's actually much more complicated than simply saying just take an incomplete board state s & uniform transitions between s, s' for valid solutions s' that are compatible with s. In fact, now that I spelled out the issues I still don't think this is a formal extensional equivalence. Prolog has interactive transitions between the states & it tracks choice points so compiling a sudoku solver to a Markov chain requires more than just tracking the board state in the context.

replies(1): >>45121671 #

66. vidarh ◴[03 Sep 25 22:57 UTC] No.45121266{6}[source]▶

>>45120536 #

A (2,3) Turing machine can be trivially implemented with a loop around an LLM that treats the context as an IO channel, and a Prolog interpreter runs on a Turing complete computer, and so per Truing equivalence you can run a Prolog interpreter on an LLM.

Of course this would be pointless, but it demonstrates that a system where an LLM provides the logic can backtrack, as there's nothing computationally special about backtracking.

That current UIs to LLMs are set up for conversation-style use that makes this harder isn't an inherent limitation of what we can do with LLMs.

replies(1): >>45121294 #

67. mallowdram ◴[03 Sep 25 22:59 UTC] No.45121279{6}[source]▶

>>45120774 #

The question is how to model the irreducible. And then to concatenate between spatiotemporal neuroscience (the oscillators) and neural syntax (what's oscillating) and add or subtract what the fields are doing to bind that to the surroundings.

68. hatmanstack ◴[03 Sep 25 23:00 UTC] No.45121287[source]▶

>>45121110 #

Have no empirical feedback but subjectively it reads as though the authors are trying to proof their own intelligence through convolution and confusion. Pure AI slop IMHO.

69. measurablefunc ◴[03 Sep 25 23:01 UTC] No.45121294{7}[source]▶

>>45121266 #

Loop around an LLM is not an LLM.

replies(1): >>45121337 #

70. mallowdram ◴[03 Sep 25 23:03 UTC] No.45121309{6}[source]▶

>>45120824 #

This is toughest part. We need some kind of analog external that concatenates. It's software, but not necessarily binary, it uses topology to express that analog. It somehow is visual, ie you can see it, but at the same time, it can be expanded specifically into syntax, which the details of are invisible. Scale invariance is probably key.

71. vidarh ◴[03 Sep 25 23:07 UTC] No.45121337{8}[source]▶

>>45121294 #

Then no current systems you are using are LLMs

replies(1): >>45121409 #

72. measurablefunc ◴[03 Sep 25 23:16 UTC] No.45121409{9}[source]▶

>>45121337 #

Choice-free feedforward graphs are LLMs. The inputs/outputs are extensionally equivalent to context and transition probabilities of a Markov chain. What exactly is your argument b/c what it looks like to me is you're simply making a Turing tarpit argument which does not address any of my points.

replies(1): >>45121690 #

73. dcre ◴[03 Sep 25 23:31 UTC] No.45121531{4}[source]▶

>>45121228 #

They’re talking about model quality. I still think they’re wrong, but the revenue is only indirectly relevant.

74. skissane ◴[03 Sep 25 23:42 UTC] No.45121626{4}[source]▶

>>45120344 #

> but none of the LLMs (open source or not) are capable of backtracking even though there is plenty of room for a basic Prolog interpreter. This seems like a very obvious shortcoming to me that no amount of smooth approximation can overcome.

The fundamental autoregressive architecture is absolutely capable of backtracking… we generate next token probabilities, select a next token, then calculate probabilities for the token thereafter.

There is absolutely nothing stopping you from “rewinding” to an earlier token, making a different selection and replaying from that point. The basic architecture absolutely supports it.

Why then has nobody implemented it? Maybe, this kind of backtracking isn’t really that useful.

replies(2): >>45121703 #>>45124591 #

75. Ukv ◴[03 Sep 25 23:47 UTC] No.45121671{14}[source]▶

>>45121264 #

> My point is that your initial argument was missing several key pieces

My initial example was a response to "If you think it is possible then I'd like to see an implementation of a sudoku puzzle solver as Markov chain", describing how a Sudoku solver could be implemented as a Markov chain. I don't think there's anything missing from it - it solves all proper Sudokus, and I only left open the choice of how to handle improper Sudokus because that was unspecified (but trivial regardless of what's wanted).

> I'm not saying it can't be done but that it's actually much more complicated

If that's the case, then I did misinterpret your comments as saying it can't be done. But, I don't think it's really complicated regardless of whatever "ok but now it must encode choice points in its state" are thrown at it - it's just a state-to-state transition look-up table.

> so compiling a sudoku solver to a Markov chain requires more than just tracking the board state in the context.

As noted, you can keep all the same variables as a regular Sudoku solver as part of the Markov chain's state and transition instruction-by-instruction, if that's what you want.

If you mean inputs from a user, the same is true of LLMs which are typically ran interactively. Either model the whole universe including the user as part of state transition table (maybe impossible, depending on your beliefs about the universe), or have user interaction take the current state, modify it, and use it as initial state for a new run of the Markov chain.

replies(1): >>45121836 #

76. vidarh ◴[03 Sep 25 23:50 UTC] No.45121690{10}[source]▶

>>45121409 #

My argument is that artificially limiting what you argue about to a subset of the systems people are actually using and arguing about the limitations of that makes your argument irrelevant to what people are actually using.

replies(1): >>45146244 #

77. quantummagic ◴[03 Sep 25 23:51 UTC] No.45121696{6}[source]▶

>>45121027 #

I asked that question in an attempt to not sound too argumentative. It was rhetorical. I'm asking you to consider the fact that there isn't actually any difference between the two examples you provided. They're fundamentally the same type of knowledge. They can be represented by the same data structures.

There's _always_ something missing, left unsaid in every example, it's the nature of language.

As for your example, the LLM can be trained to know the underlying reasons (doctor's recommendation, etc.). That knowledge is not fundamentally different from the knowledge that someone tends to eat cereal for breakfast. My question to you, was an attempt to highlight that the dichotomy you were drawing, in your example, doesn't actually exist.

replies(1): >>45122364 #

78. measurablefunc ◴[03 Sep 25 23:51 UTC] No.45121703{5}[source]▶

>>45121626 #

Where is this spelled out formally and proven logically?

replies(1): >>45121936 #

79. measurablefunc ◴[04 Sep 25 00:07 UTC] No.45121836{15}[source]▶

>>45121671 #

> As noted, you can keep all the same variables as a regular Sudoku solver

What are those variables exactly?

replies(1): >>45122154 #

80. skissane ◴[04 Sep 25 00:22 UTC] No.45121936{6}[source]▶

>>45121703 #

LLM backtracking is an active area of research, see e.g.

https://arxiv.org/html/2502.04404v1

https://arxiv.org/abs/2306.05426

And I was wrong that nobody has implemented it, as these papers prove people have… it is just the results haven’t been sufficiently impressive to support the transition from the research lab to industrial use - or at least, not yet

replies(2): >>45122006 #>>45122999 #

81. measurablefunc ◴[04 Sep 25 00:31 UTC] No.45122006{7}[source]▶

>>45121936 #

> Empirical evaluations demonstrate that our proposal significantly enhances the reasoning capabilities of LLMs, achieving a performance gain of over 40% compared to the optimal-path supervised fine-tuning method.

82. ◴[04 Sep 25 00:48 UTC] No.45122145{6}[source]▶

>>45120536 #

83. Ukv ◴[04 Sep 25 00:49 UTC] No.45122154{16}[source]▶

>>45121836 #

For a depth-first solution (backtracking), I'd assume mostly just the partial solutions and a few small counters/indices/masks - like for tracking the cell we're up to and which cells were prefilled. Specifics will depend on the solver, but can be made part of Markov chain's state regardless.

84. 11101010001100 ◴[04 Sep 25 01:05 UTC] No.45122263{5}[source]▶

>>45120336 #

Succi is no slouch; hardcore multiscale physics guy, among other things.

85. 11101010001100 ◴[04 Sep 25 01:12 UTC] No.45122313{3}[source]▶

>>45121215 #

So we just need a lot of monkeys at computers?

86. awesome_dude ◴[04 Sep 25 01:21 UTC] No.45122364{7}[source]▶

>>45121696 #

> They're fundamentally the same type of knowledge. They can be represented by the same data structures.

Maybe, maybe one is based on correlation, the other causation.

replies(1): >>45122758 #

87. godelski ◴[04 Sep 25 01:36 UTC] No.45122475[source]▶

>>45121110 #

  > A big insight in machine learning optimization was that

I think the big insight was how useful this low order method still is. I think many people don't appreciate how new the study of high dimensional mathematics (let alone high dimensional statistics) actually is. I mean metric theory didn't really start till around the early 1900's. The big reason these systems are still mostly black boxes is because we still have a long way to go when it comes to understanding these spaces.

But I think it is worth mentioning that low order approximations can still lock you out of different optima. While I agree the (Latent) Manifold Hypothesis pretty likely applies to many problems, this doesn't change the fact that even at relatively low dimensions (like 10D) are quite complex and have lots of properties that are unintuitive. With topics like language and images, I think it is safe to say that these still require operating in high dimensions. You're still going to have to contend with the complexities of the concentration of measure (an idea from the 70's).

Still, I don't think anyone expected things to have worked out as well as they have. If anything I think it is more surprising we haven't run into issues earlier! I think there are still some pretty grand problems for AI/ML left. Personally this is why I push back against much of the hype. The hype machine is good if the end is in sight. But a hype machine creates a bubble. The gamble is if you call fill the bubble before it pops. But the risk is that if it pops before then, then it all comes crashing down. It's been a very hot summer but I'm worried that the hype will lead to a winter. I'd rather have had a longer summer than a hotter summer and a winter.

88. godelski ◴[04 Sep 25 02:05 UTC] No.45122690{5}[source]▶

>>45120336 #

  > Lots of chemists and physicists like to talk about computation without having any background in it.

I'm confused. Physicists deal with computation all the time. Are you confusing computation with programming? There's a big difference. Physicists and chemists are frequently at odds with the limits of computability. Remember, Turing, Church, and even Knuth obtained degrees in mathematics. The divide isn't so clear cut and there's lots of overlaps. I think if you go look at someone doing their PhD in Programming Languages you could easily be mistake them for a mathematician.

Looking at the authors I don't see why this is out of their domain. Succi[0] looks like he deals a lot with fluid dynamics and has a big focus on Lattice Boltzmann. Modern fluid dynamics is all about computability and its limits. There's a lot of this that goes into the Navier–Stokes problem (even Terry Tao talks about this[1]), which is a lot about computational reproducibility.

Coveney[2] is a harder read for me, but doesn't seem suspect. Lots of work in molecular dynamics, so shares a lot of tools with Succi (seems like they like to work together too). There's a lot of papers there, but sorting by year there's quite a few that scream "limits of computability" to me.

I can't make strong comments without more intimate knowledge of their work, but nothing here is a clear red flag. I think you're misinterpreting because this is a position paper, written in the style you'd expect from a more formal field, but also is kinda scatterd. I've only done a quick read, -- don't get me wrong, I have critiques -- but there's no red flags that warrant quick dismissal. (My background: physicist -> computational physics -> ML) There's things they are pointing to that are more discussed within the more mathematically inclined sides of ML (it's a big field... even if only a small subset are most visible). I'll at least look at some of their other works on the topic as it seems they've written a few papers.

[0] https://scholar.google.com/citations?user=XrI0ffIAAAAJ

[1] I suspect this well above the average HN reader, but pay attention to what they mean by "blowup" and "singularity" https://terrytao.wordpress.com/tag/navier-stokes-equations/

[2] https://scholar.google.com/citations?user=_G6FZ6YAAAAJ

replies(2): >>45124095 #>>45124370 #

89. Straw ◴[04 Sep 25 02:07 UTC] No.45122702[source]▶

>>45120049 #

This is utter nonsense.

There's a formal equivalence between Markov chains and literally any system. The entire world can be viewed as a Markov chain. This doesn't tell you anything of interest, just that if you expand state without bound you eventually get the Markov property.

Why can't an LLM do backtracking? Not only within its multiple layers but across token models as reasoning models already do.

You are a probabilistic generative model (If you object, all of quantum mechanics is). I guess that means you can't do any reasoning!

90. quantummagic ◴[04 Sep 25 02:16 UTC] No.45122758{8}[source]▶

>>45122364 #

What if the causation had simply been that he enjoyed cereal for breakfast?

In either case, the results are the same, he's eating cereal for breakfast. We can know this fact without knowing the underlying cause. Many times, we don't even know the cause of things we choose to do for ourselves, let alone what others do.

On top of which, even if you think the "cause" is that the doctor told him to eat a healthy diet, do you really know the actual cause? Maybe the real cause, is that the girl he fancies, told him he's not in good enough shape. The doctor telling him how to get in shape is only a correlation, the real cause is his desire to win the girl.

These connections are vast and deep, but they're all essentially the same type of knowledge, representable by the same data structures.

replies(1): >>45123193 #

91. ◴[04 Sep 25 02:24 UTC] No.45122805[source]▶

>>45120049 #

92. CamperBob2 ◴[04 Sep 25 02:29 UTC] No.45122838{4}[source]▶

>>45120484 #

A Markov chain is memoryless by definition. A language model has a context, not to mention state in the form of the transformer's KV store.

The whole analogy is just pointless. You might as well call an elephant an Escalade because they weigh the same.

replies(1): >>45123454 #

93. godelski ◴[04 Sep 25 02:31 UTC] No.45122852{3}[source]▶

>>45120628 #

I just want to highlight this comment and stress how big of a field ML actually is. I think even much bigger than most people in ML research even know. It's really unfortunate that the hype has grown so much that even in the research community these areas are being overshadowed and even dismissed[0]. It's been interesting watching this evolution and how we're reapproaching symbolic reasoning while avoiding that phrase.

There's lots of people doing theory in ML and a lot of these people are making strides which others stand on (ViT and DDPM are great examples of this). But I never expect these works to get into the public eye as the barrier to entry tends to be much higher[1]. But they certainly should be something more ML researchers are looking at.

That is to say: Marcus is far from alone. He's just loud

[0] I'll never let go how Yi Tay said "fuck theorists" and just spent his time on Twitter calling the KAN paper garbage instead of making any actual critique. There seems to be too many who are happy to let the black box remain a black box because low level research has yet to accumulate to the point it can fully explain an LLM.

[1] You get tons of comments like this (the math being referenced is pretty basic, comparatively. Even if more advanced than what most people are familiar with) https://news.ycombinator.com/item?id=45052227

replies(2): >>45124050 #>>45125796 #

94. godelski ◴[04 Sep 25 02:44 UTC] No.45122931{3}[source]▶

>>45120573 #

Depends which question you're asking.

Ability to win a gold medal as if they were scored similarly to how humans are scored?

Ability to win a gold medal as determined by getting the "correct answer" to all the questions?

These are subtly two very different questions. In these kinds of math exams how you get to the answer matters more than the answer itself. i.e. You could not get high marks through divination. To add some clarity, the latter would be like testing someone's ability to code by only looking at their results to some test functions (oh wait... that's how we evaluate LLMs...). It's a good signal but it is far from a complete answer. It very much matters how the code generates the answer. Certainly you wouldn't accept code if it does a bunch of random computations before divining an answer.

The paper's answer to your question (assuming scored similarly to humans) is "Don’t count on it". Not a definitive "no" but they strongly suspect not.

replies(1): >>45123108 #

95. phoenixhaber ◴[04 Sep 25 02:46 UTC] No.45122941[source]▶

>>45114579 (OP) #

I don't get it. Explain in layman's terms please? Without getting involved with the math which looks quite complicated it appears that they simply are assuming scaling with what is currently known without model improvements.

96. afiori ◴[04 Sep 25 02:51 UTC] No.45122981{3}[source]▶

>>45121159 #

I sorta agree with you, but replying to "LLM can't reason" with "an LLM says they do" is wild

replies(2): >>45124419 #>>45125906 #

97. afiori ◴[04 Sep 25 02:54 UTC] No.45122999{7}[source]▶

>>45121936 #

I would expect to see something like this soonish as around now we are seeing the end of training scaling and the beginning of inference scaling

replies(1): >>45123196 #

98. godelski ◴[04 Sep 25 03:00 UTC] No.45123029{3}[source]▶

>>45121215 #

  > you can trivially wire an LLM into a Turing complete system

Please don't do the "the proof is trivial and left to the reader"[0].

If it is so trivial, show it. Don't hand wave, "put up or shut up". I think if you work this out you'll find it isn't so trivial...

I'm aware of some works but at least every one I know of has limitations that would not apply to LLMs. Plus, none of those are so trivial...

[0] https://en.wikipedia.org/wiki/Proof_by_intimidation

99. PhunkyPhil ◴[04 Sep 25 03:05 UTC] No.45123059{9}[source]▶

>>45120895 #

The point isn't if the output is correct or not, it's if the actual net is doing "logical computation" ala Prolog.

What you're suggesting is akin to me saying you can't build a house, then you go and hire someone to build a house. _You_ didn't build the house.

replies(1): >>45124543 #

100. jules ◴[04 Sep 25 03:10 UTC] No.45123108{4}[source]▶

>>45122931 #

The type of reasoning by the OP and the linked paper obviously does not work. The observable reality is that LLMs can do mathematical reasoning. A cursory interaction with state of the art LLMs makes this evident, as does their IMO gold medal scored like humans are. You cannot counter observable reality with generic theoretical considerations about Markov chains or pretraining scaling laws or floating point precision. The irony is that LLMs can explain why that type of reasoning is faulty:

> Any discrete-time computation (including backtracking search) becomes Markov if you define the state as the full machine configuration. Thus “Markov ⇒ no reasoning/backtracking” is a non sequitur. Moreover, LLMs can simulate backtracking in their reasoning chains. -- GPT-5

replies(2): >>45125259 #>>45125354 #

101. awesome_dude ◴[04 Sep 25 03:24 UTC] No.45123193{9}[source]▶

>>45122758 #

> In either case, the results are the same, he's eating cereal for breakfast. We can know this fact without knowing the underlying cause. Many times, we don't even know the cause of things we choose to do for ourselves, let alone what others do.

Yeah, no.

Understanding the causation allows the system to provide a better answer.

If they "enjoy" cereal, what about it do they enjoy, and what other possible things can be had for breakfast that also satisfy that enjoyment.

You'll never find that by looking only at the fact that they have eaten cereal for breakfast.

And the fact that that's not obvious to you is why I cannot be bothered going into any more depth on the topic any more. It's clear that you don't have any understanding on the topic beyond a superficial glance.

Bye :)

102. jules ◴[04 Sep 25 03:24 UTC] No.45123195{4}[source]▶

>>45120671 #

So, nothing.

replies(1): >>45123351 #

103. foota ◴[04 Sep 25 03:24 UTC] No.45123196{8}[source]▶

>>45122999 #

This is a neat observation, training has been optimized to hell and inference is just beginning.

104. measurablefunc ◴[04 Sep 25 03:48 UTC] No.45123351{5}[source]▶

>>45123195 #

It's definitely something but it might not be apparent to those who do not understand the distinctions between intensionallity & extensionallity.

105. dangus ◴[04 Sep 25 03:54 UTC] No.45123377{3}[source]▶

>>45120369 #

And relevant to the summary of this paper, LLM incremental improvement doesn't really seem to include the described wall.

If work produced by LLMs forever has to be checked for accuracy, the applicability will be limited.

This is perhaps analogous to all the "self-driving cars" that still have to be monitored by humans, and in that case the self-driving system might as well not exist at all.

106. measurablefunc ◴[04 Sep 25 04:07 UTC] No.45123454{5}[source]▶

>>45122838 #

Where is the logical mistake in the linked argument? If there is a mistake then I'd like to know what it is & the counter-example that invalidates the logical argument.

replies(3): >>45124048 #>>45124132 #>>45124761 #

107. lowbloodsugar ◴[04 Sep 25 05:14 UTC] No.45123808[source]▶

>>45120049 #

1. You are a neural net and you can backtrack. But unlike an algorithm space search, you’lol go “hmm. That doesn’t look right. Let me try it another way. “

2. Agentic AI already does this in the way that you do it.

108. patrick451 ◴[04 Sep 25 05:19 UTC] No.45123837{3}[source]▶

>>45120259 #

> Either way, I can get arbitrarily good approximations of arbitrary nonlinear differential/difference equations using only linear probabilistic evolution at the cost of a (much) larger state space.

This is impossible. When driven by a sinusoid, a linear system will only ever output a sinusoid with exactly the same frequency but a different amplitude and phase regardless of how many states you give it. A non-linear system can change the frequency or output multiple frequencies.

replies(1): >>45124025 #

109. photonthug ◴[04 Sep 25 05:35 UTC] No.45123910{6}[source]▶

>>45120766 #

> General intelligence may not be SAT/SMT solving but it has to be able to do it, hence, backtracking.

Just to add some more color to this. For problems that completely reduce to formal methods or have significant subcomponents that involve it, combinatorial explosion in state-space is a notorious problem and N variables is going to stick you with 2^N at least. It really doesn't matter whether you think you're directly looking at solving SAT/search, because it's too basic to really be avoided in general.

When people talk optimistically about hallucinations not being a problem, they generally mean something like "not a problem in the final step" because they hope they can evaluate/validate something there, but what about errors somewhere in the large middle? So even with a very tiny chance of hallucinations in general, we're talking about an exponential number of opportunities in implicit state-transitions to trigger those low-probability errors.

The answer to stuff like this is supposed to be "get LLMs to call out to SAT solvers". Fine, definitely moving from state-space to program-space is helpful, but it also kinda just pushes the problem around as long as the unconstrained code generation is still prone to hallucination.. what happens when it validates, runs, and answers.. but the spec was wrong?

Personally I'm most excited about projects like AlphaEvolve that seem fearless about hybrid symbolics / LLMs and embracing the good parts of GOFAI that LLMs can make tractable for the first time. Instead of the "reasoning is dead, long live messy incomprehensible vibes", those guys are talking about how to leverage earlier work, including things like genetic algorithms and things like knowledge-bases.[0] Especially with genuinely new knowledge-discovery from systems like this, I really don't get all the people who are still staunchly in either an old-school / new-school camp on this kind of thing.

[0]: MLST on the subject: https://www.youtube.com/watch?v=vC9nAosXrJw

replies(1): >>45128160 #

110. bubblyworld ◴[04 Sep 25 05:53 UTC] No.45123989[source]▶

>>45120049 #

If you want to understand SOTA systems then I don't think you should study their formal properties in isolation, i.e. it's not useful to separate them from their environment. Every LLM-based tool has access to code interpreters these days which makes this kind of a moot point.

replies(2): >>45124020 #>>45128356 #

111. measurablefunc ◴[04 Sep 25 05:58 UTC] No.45124020{3}[source]▶

>>45123989 #

I prefer logic to hype. If you have a reason to think the hype nullifies basic logical analysis then you're welcome to your opinion but I'm going to stick w/ logic b/c so far no one has presented an actual counter-argument w/ enough rigor to justify their stance.

replies(1): >>45124089 #

112. diffeomorphism ◴[04 Sep 25 05:58 UTC] No.45124025{4}[source]▶

>>45123837 #

As far as I understand, the terminology says "linear" but means compositions of affine (with cutoffs etc). That gives you arbitrary polynomials and piecewise affine, which are dense in most classes of interest.

Of course, in practice you don't actually get arbitrary degree polynomials but some finite degree, so the approximation might still be quite bad or inefficient.

113. ◴[04 Sep 25 06:02 UTC] No.45124048{6}[source]▶

>>45123454 #

114. calf ◴[04 Sep 25 06:02 UTC] No.45124050{4}[source]▶

>>45122852 #

I hazard to imagine that LLMs are a special subset of Markov chains, and this subset has interesting properties; it seems a bit reductive to dismiss LLMs as "merely' Markov chains. It's what we can do with this unusual subset (e.g. maybe incorporate in a larger AI system) that is the interesting question.

replies(1): >>45124319 #

115. bubblyworld ◴[04 Sep 25 06:08 UTC] No.45124089{4}[source]▶

>>45124020 #

I think you are applying logic and demand for rigour selectively, to be honest. Not all arguments require formalisation. I have presented mine - your linked logical analyses just aren't relevant to modern systems. I said nothing about the logical steps being wrong, necessarily.

replies(2): >>45124296 #>>45124662 #

116. calf ◴[04 Sep 25 06:09 UTC] No.45124095{6}[source]▶

>>45122690 #

There are some good example posts on Scott Aaronson's blog where he eviscerated shoddy physicists' take on quantum complexity theory. Physicists today aren't like Turing et al, most never picked up a theory of computer science book and actually worked through the homework exercises, and with AI pivot and paper spawning, this is kind of a general problem (arguably more interdisciplinary expertise is needed but people need to actually take the time to learn material and internalize it without making sophomore mistakes etc.).

replies(1): >>45125416 #

117. ◴[04 Sep 25 06:15 UTC] No.45124132{6}[source]▶

>>45123454 #

118. pama ◴[04 Sep 25 06:25 UTC] No.45124182[source]▶

>>45114579 (OP) #

Sauro, if you read this, please refrain from such low-content speculative statements:

“On a loose but telling note, this is still three decades short of the number of neural connections in the human brain, 1015, and yet they consume some one hundred million times more power (GWatts as compared to the very modest 20 Watts required by our brains).”

No human brain could have time to read all the materials of a modern LLM training run even if they lived and read eight hours a day since humans first appeared over 300,000 years ago. More to the point, inference of an LLM is way more energy efficient than human inference (see the energy costs of a B200 decoding a 671B parameter model and estimate the energy needed to write the equivalent of a human book worth of information as part of a larger batch). The main reason for the large energy costs of inference is that we are serving hundreds of millions of people with the same model. No humans have this type of scaling capability.

replies(4): >>45124272 #>>45124735 #>>45125473 #>>45127090 #

119. vrighter ◴[04 Sep 25 06:39 UTC] No.45124272[source]▶

>>45124182 #

And yet, the human brain is still (way way wayyyyyyyyy) more capable than the LLMs at the actual thinking. They're as wide as an ocean and as shallow as a puddle in a pothole. And we didn't need to read all of the internet to do it.

As for the "write a book" part, the LLM will write a book quickly sure, but a significant chunk of it will be bullshit. It will all be hallucinated, but the stopped clock will be right some of the time.

No humans have this scaling capability? What do you call the reproductive cycle then? Lots of smaller brains, each one possible specialized in a few fields, together containing all of human knowledge. And you might say that's not the same thing!, to which I reply with "let's not kid ourselves, Mixture-of-Experts describes exactly this".

replies(1): >>45124515 #

120. measurablefunc ◴[04 Sep 25 06:44 UTC] No.45124296{5}[source]▶

>>45124089 #

If there are no logical errors then you're just waving your hands which, again, you're welcome to do but it doesn't address any of the points I've made in this thread.

replies(1): >>45124369 #

121. measurablefunc ◴[04 Sep 25 06:47 UTC] No.45124319{5}[source]▶

>>45124050 #

You don't have to imagine, there is a logically rigorous argument¹ that establishes the equivalence. There is also nothing unusual about neural networks or Markov chains. You've just been mystified by the marketing around them so you think there is something special about them when they're just another algorithm for approximating different kinds of compressible signals & observations about the real world.

¹https://markov.dk.workers.dev/

replies(1): >>45133873 #

122. JohnKemeny ◴[04 Sep 25 06:56 UTC] No.45124354{6}[source]▶

>>45120611 #

As you note, some physicists do have computing backgrounds. I'm not suggesting they can't do computer science.

But today, most people hold opinions about LLMs, both as to their limits and their potential, without any real knowledge of computational linguistics nor of deep learning.

replies(1): >>45140626 #

123. bubblyworld ◴[04 Sep 25 06:58 UTC] No.45124369{6}[source]▶

>>45124296 #

Lol, okay. Serves me right for feeding the trolls.

124. JohnKemeny ◴[04 Sep 25 06:59 UTC] No.45124370{6}[source]▶

>>45122690 #

Turing, Church, and even Knuth got their degrees before CS was an academic discipline. At least I don't think Turing studied Turing machines in his undergrads.

I'm saying that lots of people like to post their opinions of LLMs regardless of whether or not they actually have any competence in either computational linguistics or deep learning.

replies(2): >>45125375 #>>45128377 #

125. logicchains ◴[04 Sep 25 07:03 UTC] No.45124408{4}[source]▶

>>45120484 #

It assumes the LLM only runs once, i.e. it doesn't account for chain of thought, which makes the program not memoryless.

replies(1): >>45124467 #

126. JohnKemeny ◴[04 Sep 25 07:04 UTC] No.45124419{4}[source]▶

>>45122981 #

I asked ChatGPT and it agrees with the statement that it is indeed wild

127. JohnKemeny ◴[04 Sep 25 07:06 UTC] No.45124430[source]▶

>>45121110 #

They are physicists and will therefore talk about things that make sense to them. What they aren't: computational linguists and deep learning experts.

128. baselessness ◴[04 Sep 25 07:08 UTC] No.45124441{3}[source]▶

>>45120259 #

That's what this debate has been reduced to. People point out the logical and empirical, by now very obvious limitation of LLMs. And boosters are the equivalent of Chopra's "quantum physics means anything is possible" saying "if you add enough information to a system anything is possible".

replies(1): >>45125289 #

129. JohnKemeny ◴[04 Sep 25 07:10 UTC] No.45124453{4}[source]▶

>>45119709 #

Look at their actual papers before making a comment of what is or isn't their core field: https://dblp.org/pid/35/3081.html

replies(1): >>45128392 #

130. measurablefunc ◴[04 Sep 25 07:12 UTC] No.45124467{5}[source]▶

>>45124408 #

There is no such assumption. You can run/sample the LLM & the equivalent Markov chain as many times as you want & the logical analysis remains the same b/c the extensional equivalence between the LLM & Markov chain has nothing to do w/ how many times the trajectories are sampled from each one.

131. throwaway314155 ◴[04 Sep 25 07:19 UTC] No.45124515{3}[source]▶

>>45124272 #

Agency may be better understood as Michael Levin's approach where e.g. a lifeform is something that can achieve the same goal using various methods (robust).

Having said that, you can now simply move the goal posts to say that while one human cannot read that much in that amount of time - the collective of all humans certainly can - or at least they can approximate it in a similar fashion to LLM's.

Since each of us can reap the benefits of the collective then the benefits are distributed back to the individuals as needed.

replies(1): >>45126609 #

132. kaibee ◴[04 Sep 25 07:22 UTC] No.45124543{10}[source]▶

>>45123059 #

I feel like you're kinda proving too much. By the same reasoning, humans/programmers aren't generally intelligent either, because we can only mentally simulate relatively small state spaces of programs, and when my boss tells me to go build a tool, I'm not exactly writing raw x86 assembly. I didn't _build_ the tool, I just wrote text that instructed a compiler how to build the tool. Like the whole reason we invented SAT solvers is because we're not smart in that way. But I feel like you're trying to argue that LLMs at any scale gonna be less capable than an average person?

133. versteegen ◴[04 Sep 25 07:30 UTC] No.45124591{5}[source]▶

>>45121626 #

Yes, but anyway, LLMs themselves are perfectly capable of backtracking reasoning while sampling is run forwards only, in the same way humans do: by deciding something doesn't work and trying something else. Humans DON'T time travel backwards in time and never have the erroneous thought in the first place.

134. awanderingmind ◴[04 Sep 25 07:38 UTC] No.45124647[source]▶

>>45114579 (OP) #

There is a lot of focus in the comments on the authors' credentials and, apparently, their writing style. It is a pity, because I think their discussion of scaling is interesting, even if comparing LLMs to grid-based differential equation solvers might be unconventional (I haven't convinced myself whether the analogy is entirely apt/valid yet, but it could conceivably be).

135. Certhas ◴[04 Sep 25 07:41 UTC] No.45124657{6}[source]▶

>>45120536 #

Take a finite tape Turing machine with N states and tape length T and N^T total possible tape states.

Now consider that you have a probability for each state instead of a definite state. The transitions of the Turing machine induce transitions of the probabilities. These transitions define a Markov chain on a N^T dimensional probability space.

Is this useful? Absolutely not. It's just a trivial rewriting. But it shows that high dimensional spaces are extremely powerful. You can trade off sophisticated transition rules for high dimensionality.

136. wolvesechoes ◴[04 Sep 25 07:42 UTC] No.45124662{5}[source]▶

>>45124089 #

> I have presented mine - your linked logical analyses just aren't relevant to modern systems

Assertion is not an argument

replies(1): >>45124918 #

137. wolvesechoes ◴[04 Sep 25 07:44 UTC] No.45124687{3}[source]▶

>>45121159 #

And I though that the gap is bridged by giving another billions to Sam Altman

138. suddenlybananas ◴[04 Sep 25 07:44 UTC] No.45124688{5}[source]▶

>>45120411 #

We don't know those things about the brain. I don't know why you keep going around HN making wildly false claims about the state of contemporary neuroscience. We know very very little about how higher order cognition works in the brain.

replies(1): >>45126938 #

139. wolvesechoes ◴[04 Sep 25 07:50 UTC] No.45124735[source]▶

>>45124182 #

I didn't have to read all textbooks, web articles or blog posts about numerical methods, yet I am capable of implementing production-ready ODE solver, and LLMs are not (I use this example as this is what I experienced). Clearly human supremacy.

140. versteegen ◴[04 Sep 25 07:54 UTC] No.45124761{6}[source]▶

>>45123454 #

A Transformer with a length n context window implements an order 2n-1 Markov chain¹. That is correct. That is also irrelevant in the real world, because LLMs aren't run for that many tokens (as results are bad). Before it hits that limit, there is nothing requiring it to have any of the properties of a Markov chain. In fact, because the state space is k^n (alphabet size k), you might not revisit a state until generating k^n tokens.

¹ Depending on context window implementation details, but that is the maximum, because the states n tokens back were computed from the n tokens before that. The minimum of course is an order n-1 Markov chain.

replies(1): >>45126789 #

141. Certhas ◴[04 Sep 25 07:54 UTC] No.45124764{4}[source]▶

>>45120344 #

I know that if you go large enough you can do any finite computation using only fixed transition probabilities. This is a trivial observation. To repeat what I posted elsewhere in this thread:

Take a finite tape Turing machine with N states and tape length T and N^T total possible tape states.

You _can_ continue this line of thought though in more productive directions. E.g. what if the input of your machine is genuinely uncertain? What if the transitions are not precise but slightly noisy? You'd expect that the fundamental capabilities of a noisy machine wouldn't be that much worse than those of a noiseless ones (over finite time horizons). What if the machine was built to be noise resistant in some way?

All of this should regularize the Markov chain above. If it's more regular you can start thinking about approximating it using a lower rank transition matrix.

The point of this is not to say that this is really useful. It's to say that there is no reason in my mind to dismiss the purely mathematical rewriting as entirely meaningless in practice.

142. bubblyworld ◴[04 Sep 25 08:17 UTC] No.45124918{6}[source]▶

>>45124662 #

That assertion is not what I was referring to. Anyway, I'm not really interested in nitpicking this stuff. Engage with my initial comment if you actually care to discuss it.

143. ◴[04 Sep 25 09:19 UTC] No.45125259{5}[source]▶

>>45123108 #

144. yorwba ◴[04 Sep 25 09:24 UTC] No.45125289{4}[source]▶

>>45124441 #

The argument isn't that anything is possible for LLMs, but that representing LLMs as Markov chains doesn't demonstrate a limitation, because the resulting Markov chain would be huge, much larger than the LLM, and anything that is possible is possible with a large enough Markov chain.

If you limit yourself to Markov chains where the full transition matrix can be stored in a reasonable amount of space (which is the kind of Markov chain that people usually have in mind when they think that Markov chains are very limited), LLMs cannot be represented as such a Markov chain.

If you want to show limitations of LLMs by reducing them to another system of computation, you need to pick one that is more limited than LLMs appear to be, not less.

replies(1): >>45127523 #

145. godelski ◴[04 Sep 25 09:35 UTC] No.45125354{5}[source]▶

>>45123108 #

  > The observable reality is that LLMs can do mathematical reasoning

I still can't get these machines to reliably perform basic subtraction[0]. The result is stochastic, so I can get the right answer, but have yet to reproduce one where the actual logic is correct[1,2]. Both [1,2] perform the same mistake and in [2] you see it just say "fuck it, skip to the answer"

  > You cannot counter observable reality

I'd call [0,1,2] "observable". These types of errors are quite common, so maybe I'm not the one with lying eyes.

[0] https://chatgpt.com/share/68b95bf5-562c-8013-8535-b61a80bada...

[1] https://chatgpt.com/share/68b95c95-808c-8013-b4ae-87a3a5a42b...

[2] https://chatgpt.com/share/68b95cae-0414-8013-aaf0-11acd0edeb...

replies(1): >>45125387 #

146. godelski ◴[04 Sep 25 09:38 UTC] No.45125375{7}[source]▶

>>45124370 #

Sure, but how long ago was that? Do you really think the fields fully decoupled in such a small time? That's the entire point of that comment

replies(1): >>45128423 #

147. FergusArgyll ◴[04 Sep 25 09:40 UTC] No.45125387{6}[source]▶

>>45125354 #

Why don't you use a state of the art model? Are you scared it will get it right? Or are you just not aware of reasoning models in which case you should get to know the field

replies(2): >>45125428 #>>45127681 #

148. ◴[04 Sep 25 09:46 UTC] No.45125416{7}[source]▶

>>45124095 #

149. godelski ◴[04 Sep 25 09:47 UTC] No.45125428{7}[source]▶

>>45125387 #

Careful there, without a /s people might think you're being serious.

replies(1): >>45125580 #

150. skeezyboy ◴[04 Sep 25 09:56 UTC] No.45125473[source]▶

>>45124182 #

> The main reason for the large energy costs of inference is that we are serving hundreds of millions of people with the same model.

its because thats how LLMs work, not because theyre so popular

151. PartiallyTyped ◴[04 Sep 25 09:56 UTC] No.45125478[source]▶

>>45120049 #

What’s the difference between backtracking and flattening the computation?

152. skeezyboy ◴[04 Sep 25 10:05 UTC] No.45125522{3}[source]▶

>>45120369 #

> There's plenty more room to grow with agents and tooling, but the core models are only slightly bumping YoY rather than the rocketship changes of 2022/23.

understandable. the real innovation was the process/technique underlying LLMs. the rest is just programmers automating it. similar happened with blockchain, everything after was just tinkering the initial idea

153. FergusArgyll ◴[04 Sep 25 10:15 UTC] No.45125580{8}[source]▶

>>45125428 #

I am being serious, why don't you use a SOTA model?

replies(1): >>45129312 #

154. voidhorse ◴[04 Sep 25 10:42 UTC] No.45125727{3}[source]▶

>>45121215 #

Such a system redefines logical reasoning to the point that hardly any typical person's definition would agree.

It's Searle's Chinese Room scenario all over again, which everyone seems to have forgotten amidst the bs marketing storm around LLMs. A person with no knowledge of Chinese following a set of instructions and reading from a dictionary translating texts is a substitute for hiring a translator who understands chinese, however we would not claim that this person understands Chinese.

An LLM hooked up to a Turing Machine would be similar wrt to logical reasoning. When we claim someone reasons logically we usually don't imagine they randomly throw ideas at the wall and then consult outputs to determine if they reasoned logically. Instead, the process of deduction makes the line of reasoning decidedly not stochastic. I can't believe we've gotten to such a mad place that basic notions like that of logical deduction are being confused for stochastic processes. Ultimately, I would agree that it all comes back to the problem of other minds and you either take a fully reductionist stance and claim the brain and intellection is nothing more than probabilistic neural firing or you take a non-reductionist stance and assume there may be more to it. In either case, I think that claiming that LLMs+tools are equivalent to whatever process humans perform is kind of silly and severely underrated what humans are capable of^1.

1: Then again, this has been going on since the dawn of computing, which has always put forth its brain=computer metaphors more on grounds of reducing what we mean by "thought" than by any real substantively justified connection.

replies(2): >>45126037 #>>45129185 #

155. voidhorse ◴[04 Sep 25 10:52 UTC] No.45125796{4}[source]▶

>>45122852 #

That linked comment was so eye-opening. It suddenly made sense to me why people who are presumably technical and thus shouldn't even be entertaining the notion that LLMs reason (and who should further realize that the use and choice of this term was pure marketing strategy) are giving it the time of day. When so many of the enthusiasts can't even get enough math under their belt to understand basic claims it's no wonder the industry is a complete circus right now.

replies(1): >>45129443 #

156. tim333 ◴[04 Sep 25 11:11 UTC] No.45125906{4}[source]▶

>>45122981 #

I don't have a strong opinion if LLMs can reason or not. I think they can a bit but not very well. I think that also applies to many humans though. I was stuck that to my eyes GPT5's take on the question seemed better thought out than Garry Marcus's who is pretty biased to the LLMs are rubbish school.

157. fluidcruft ◴[04 Sep 25 11:14 UTC] No.45125935[source]▶

>>45120049 #

I'm genuinely curious whether neuroscoence has shown that humans perform backtracking?

I am sure we can be taught to backtrack and many things may seem similar, but I just haven't heard of anything at the neuroscience level that backtracking is a fundamental capacity of biological neural networks. So to the extent that humans perform symbolic computations I'm not sure it's established that backtracking is necessary vs other strategies.

158. nprateem ◴[04 Sep 25 11:17 UTC] No.45125963{4}[source]▶

>>45120935 #

Glue sticks.

Pepperoni falls off pizza.

Therefore to keep it in place, stick it with glue...

Not stunned by this reductionist take.

replies(1): >>45153196 #

159. bopjesvla ◴[04 Sep 25 11:26 UTC] No.45126037{4}[source]▶

>>45125727 #

The Chinese Room experiment has always been a hack thought experiment that was discussed in other forms before it was posited by Searle, most famously in Turing's "Can machines think?". Searle only superficially engaged with existing literature in the original Chinese Room paper. When he was forced to do so later on, Searle claimed that if you'd precisely simulate a Chinese human brain in a human-like robot, that brain still wouldn't be able to think or understand Chinese. Not a useful definition of thinking if you ask me.

From Wikipedia:

Suppose that the program simulated in fine detail the action of every neuron in the brain of a Chinese speaker.[83][w] This strengthens the intuition that there would be no significant difference between the operation of the program and the operation of a live human brain. Not a useful definition of thinking if you ask me.

Searle replies that such a simulation does not reproduce the important features of the brain—its causal and intentional states. He is adamant that "human mental phenomena [are] dependent on actual physical–chemical properties of actual human brains."[26]

160. Certhas ◴[04 Sep 25 12:17 UTC] No.45126405{11}[source]▶

>>45121057 #

People really don't appreciate what is possible in infinite (or more precisely: arbitrarily high) dimensional spaces.

161. pama ◴[04 Sep 25 12:43 UTC] No.45126609{4}[source]▶

>>45124515 #

Certainly all of humanity could approximate the reading of an LLM for now. The energy consumption of humanity is about 20TW at this point, so over 20,000 times higher than the next generation LLM training runs. Just like the LLM training does not only spend energy on their “brain” neither does humanity.

replies(1): >>45145646 #

162. versteegen ◴[04 Sep 25 13:02 UTC] No.45126789{7}[source]▶

>>45124761 #

Specifically, an order n Markov chain such as a transformer, if not otherwise restricted, can have any joint distribution you wish for the first n-1 steps: any extensional property. In which case you have to look at intensional properties to actually draw non-vacuous conclusions.

I would like to comment that there are a lot of papers out there on what transformers can or can't do that are misleading, often misunderstood, or abstract so far from transformers as implemented and used that they are pure theory.

163. mallowdram ◴[04 Sep 25 13:17 UTC] No.45126938{6}[source]▶

>>45124688 #

Of course we know these things about the brain, and who said anything about higher order cognition? I'd stay current, you seem to be a legacy thinker. I'll needle drop ONE of the references re: unpredictability and brain health, there are about 30, just to keep you in your corner. The rest you'll have to hunt down, but please stop pretending you know what you're talking about.

Your line of attack which is to dismiss from a pretend point of certainty, rather than inquiry and curiosity, seems indicative of the cog-sci/engineering problem in general. There's an imposition based in intuition/folk psychology that suffuses the industry. The field doesn't remain curious to new discoveries in neurobiology, which supplants psychology (psychology is being based, neuro is neural based). What this does is remove the intent of rhetoric/being and suggest brains built our external communication. The question is how and by what regularities. Cog-sci has no grasp of that in the slightest.

https://pubmed.ncbi.nlm.nih.gov/38579270/

replies(1): >>45137064 #

164. mikewarot ◴[04 Sep 25 13:32 UTC] No.45127090[source]▶

>>45124182 #

> The main reason for the large energy costs of inference is that we are serving hundreds of millions of people with the same model. No humans have this type of scaling capability.

Using CPUs or GPUs or even tensor units involve waiting for data to be moved from RAM to/from compute. It's my understanding that most of the power used in LLM compute is taken at that stage, and I further believe that 95% savings are possible by merging memory and compute to build a universal computing fabric.

Alternatively, I'm deep in old man with goofy idea territory. Only time will tell.

replies(1): >>45148750 #

165. lumost ◴[04 Sep 25 13:58 UTC] No.45127357{4}[source]▶

>>45119119 #

I will venture my 2 cents, the equations kinda sorta look like something - but in no way approach a derivation of the wall. Specifically, I would have looked for a derivation which proved for one of/all of

1. Sequence Models relying on a markov chain, with and without summarization to extend beyond fixed length horizons. 2. All forms of attention mechanisms/dense layers. 3. A specific Transformer architecture.

That there exists a limit on the representation or prediction powers of the model for tasks of all input/output token lengths or fixed size N input tokens/M output tokens. *Based On* a derived cost growth schedule for model size, data size, compute budgets.

Separately, I would have expected a clear literature review of existing mathematical studies on LLM capabilities and limitations - for which there are *many*. Including studies that purport that Transformers can represent any program of finite pre-determined execution length.

166. ariadness ◴[04 Sep 25 14:15 UTC] No.45127523{5}[source]▶

>>45125289 #

> anything that is possible is possible with a large enough Markov chain

This is not true. Do you mean anything that is possible to compute? If yes than you missed the point entirely.

replies(1): >>45134479 #

167. pfortuny ◴[04 Sep 25 14:29 UTC] No.45127681{7}[source]▶

>>45125387 #

Have you tried to get google ai studio (nano-banana) to draw a 9-sided polygon? Just that.

https://ibb.co/Qj8hv76h

168. PaulHoule ◴[04 Sep 25 15:08 UTC] No.45128160{7}[source]▶

>>45123910 #

When I was interested in information extraction I saw the problem of resolving language to a semantic model [1] as containing an SMT problem. That is, words are ambiguous, sentences can parse different ways, you have to resolve pronouns and explicit subjects, objects and stuff like that.

Seen that way the text is a set of constraints with a set of variables for all the various choices you make determining it. And of course there is a theory of the world such that "causes must precede their effects" and all the world knowledge about instances such as "Chicago is in Illinois".

The problem is really worse than that because you'll have to parse sentences that weren't generated by sound reasoners or that live in a different microtheory, deal with situations that are ambiguous anyway, etc. Which is why that program never succeeded.

[1] in short: database rows

169. wavemode ◴[04 Sep 25 15:26 UTC] No.45128356{3}[source]▶

>>45123989 #

If my cat has access to my computer keyboard, that doesn't make it a software engineer.

replies(1): >>45136293 #

170. jibal ◴[04 Sep 25 15:29 UTC] No.45128377{7}[source]▶

>>45124370 #

Your whole take is extraordinarily ad hominem. The paper in question is not just people posting opinions.

171. jibal ◴[04 Sep 25 15:30 UTC] No.45128392{5}[source]▶

>>45124453 #

I did. And don't tell me what I can or can't comment on.

172. jibal ◴[04 Sep 25 15:32 UTC] No.45128423{8}[source]▶

>>45125375 #

The fellow is engaged in some pretty intense gatekeeping.

173. hyghjiyhu ◴[04 Sep 25 16:28 UTC] No.45129038[source]▶

>>45120049 #

The head of a Turing machine is a Markov chain, even simpler even as it transitions completely predictably. But give it a tape and it can compute anything.

174. SpicyLemonZest ◴[04 Sep 25 16:40 UTC] No.45129185{4}[source]▶

>>45125727 #

> When we claim someone reasons logically we usually don't imagine they randomly throw ideas at the wall and then consult outputs to determine if they reasoned logically.

I definitely imagine that and I'm surprised to hear you don't. To me it seems obvious that this is how humans reason logically. When you're developing a complex argument, don't you write a sloppy first draft then review to check and clean up the logic?

replies(1): >>45133135 #

175. godelski ◴[04 Sep 25 16:50 UTC] No.45129312{9}[source]▶

>>45125580 #

Sorry, I've just been hearing this response for years now... GPT-5 not SOTA enough for you all now? I remember when people told me to just use 3.5

  - Gemini 2.5 Pro[0], the top model on LLM Arena. This SOTA enough for you? It even hallucinated Python code!

  - Claude Opus 4.1, sharing that chat shares my name, so here's a screenshot[1]. I'll leave that one for you to check. 

  - Grok4 getting the right answer but using bad logic[2]

  - Kimi K2[3]

  - Mistral[4]

I'm sorry, but you can fuck off with your goal post moving. They all do it. Check yourself.

  > I am being serious

Don't lie to yourself, you never were

People like you have been using that copy-paste piss-poor logic since the GPT-3 days. The same exact error existed since those days on all those models just as it does today. You all were highly disingenuous then, and still are now. I know this comment isn't going to change your mind because you never cared about the evidence. You could have checked yourself! So you and your paperclip cult can just fuck off

[0] https://g.co/gemini/share/259b33fb64cc

[1] https://0x0.st/KXWf.png

[2] https://grok.com/s/c2hhcmQtNA%3D%3D_e15bb008-d252-4b4d-8233-...

[3] http://0x0.st/KXWv.png

[4] https://chat.mistral.ai/chat/8e94be15-61f4-4f74-be26-3a4289d...

replies(1): >>45130498 #

176. godelski ◴[04 Sep 25 17:01 UTC] No.45129443{5}[source]▶

>>45125796 #

Let me introduce to you one of X's former staff members arguing that there is no such thing as deep knowledge or expertise[0]

I would love to tell you that I don't meet many people working in AI that share this sentiment, but I'd be lying.

And just for fun, here's a downvoted comment of mine, despite my follow-up comments that evidence my point being upvoted[1] (I got a bit pissed in that last one). The point here is that most people don't want to hear the truth. They are just glossing over things. But I think the two biggest things I've learned from the modern AI movement is: 1) gradient descent and scale are far more powerful than I though, 2) I now understand how used car salesmen are so effective on even people I once thought smart. People love their sycophants...

I swear, we're going to make AGI not by making the AI smarter but by making the people dumber...

[0] https://x.com/yacineMTB/status/1836415592162554121

[1] https://news.ycombinator.com/item?id=45122931

177. FergusArgyll ◴[04 Sep 25 18:23 UTC] No.45130498{10}[source]▶

>>45129312 #

That's very weird, before I wrote my comment I asked gpt5-thinking (yes, once) and it nailed it. I just assumed the rest would get it as well, gemini-2.5 is shocking (the code!) I hereby give you leave to be a curmudgeon for another year...

replies(1): >>45131235 #

178. moonmagick ◴[04 Sep 25 18:58 UTC] No.45130942[source]▶

>>45120049 #

Gary Marcus is a broken clock

He's been saying LLMs wouldn't scale since GPT-3 came out

And yet we all use them every day

Who cares about tracing? Prolog can't be multi-threaded on a GPU, why is that even in the conversation lol

179. godelski ◴[04 Sep 25 19:25 UTC] No.45131235{11}[source]▶

>>45130498 #

Try a few times and it'll happen. I don't think it took me more than 3 tries on any platform.

To convince me it is "reasoning", it needs to get the answer right consistently. Most attempts were actually about getting it to show its results. But pay close attention. GPT got the answer right several times but through incorrect calculations. Go check the "thinking" and see if it does a 11-9=2 calculation somewhere, I saw this >50% of the attempts. You should be able to reproduce my results in <5 minutes.

Forgive my annoyance, but we've been hearing the same argument you've made for years[0,1,2,3,4]. We're talking about models that have been reported as operating at "PhD Level" since the previous generation. People have constantly been saying "But I get the right answer" or "if you use X model it'll get it right" while missing the entire point. It never mattered if it got the answer right once, it matters that it can do it consistently. It matters how it gets the answer if you want to claim reasoning. There is still no evidence that LLMs can perform even simple math consistently, despite years of such claims[5]

[0] https://news.ycombinator.com/item?id=34113657

[1] https://news.ycombinator.com/item?id=36288834

[2] https://news.ycombinator.com/item?id=36089362

[3] https://news.ycombinator.com/item?id=37825219

[4] https://news.ycombinator.com/item?id=37825059

[5] Don't let your eyes trick you, not all those green squares are 100%... You'll also see many "look X model got it right!" in response to something tested multiple times... https://x.com/yuntiandeng/status/1889704768135905332

180. ◴[04 Sep 25 20:04 UTC] No.45131644[source]▶

>>45120049 #

181. stonogo ◴[04 Sep 25 21:01 UTC] No.45132135{5}[source]▶

>>45119914 #

If you can't look at that publication list and see their expertise in macine learning, then it may be that they know more about your field than you know about theirs. Nothing wrong with that! Computational chemists use different terminology than computer scientists but there is significant overlap in the fields.

182. voidhorse ◴[04 Sep 25 22:50 UTC] No.45133135{5}[source]▶

>>45129185 #

I think you're mistaking my claim for something else. When I say logical reasoning here, I mean the dead simple reasoning that tells you that 1 + 1 - 1 = 1 or that, by definition, x <= y and y <= x imply x = y. You can reach these conclusions because you understand arithmetic or aspects of order theory and can use the basic definitions of those theories to deduce others. You don't need to throw random guesses at the wall to reach these conclusions or operationally execute an algorithm every time, because you use your understanding and logical reasoning to reach an immediate conclusion, but LLMs precisely don't do this. Maybe you memorize these facts instead of using logic, or maybe you consult Google each time but then I wouldn't claim that you understand arithmetic or order theory either.

183. calf ◴[05 Sep 25 00:45 UTC] No.45133873{6}[source]▶

>>45124319 #

I'll have you realize you are replying (quite arrogantly by the way) to someone who wrote part of their PhD dissertation on models of computation. Try again :)

Besides, it is patently false. Not every Markov chain is an LLM, an actual LLM outputs human-readable English, while the vast majority of Markov chains do not map onto that set of models.

replies(1): >>45134747 #

184. yorwba ◴[05 Sep 25 02:33 UTC] No.45134479{6}[source]▶

>>45127523 #

It's mostly a consequence of the laws of physics having the Markov property. So the time evolution of any physical system can be modeled as a Markov process. Of course the corresponding state space may in general be infinite.

185. measurablefunc ◴[05 Sep 25 03:29 UTC] No.45134747{7}[source]▶

>>45133873 #

Appeals to authority do not change the logical content of an argument. You are welcome to point to the part of the linked argument that is incorrect & present a counter-example to demonstrate the error.

replies(1): >>45136257 #

186. godelski ◴[05 Sep 25 08:23 UTC] No.45136257{8}[source]▶

>>45134747 #

Calf isn't making an appeal to authority. They are saying "I'm not the idiot you think I am." Two very different things. Likely also a request to talk more mathy to them.

I read your link btw and I just don't know how someone can do all that work and not establish the Markov Property. That's like the first step. Speaking of which, I'm not sure I even understand the first definition of your link. I've never heard the phrase "computably countable" before, but I have head "computable number," which these numbers are countable. This does seem to be what it is referring to? So I'll assume that? (My dissertation wasn't on models of computation, it was on neural architectures) In 1.2.2 is there a reason for strictly uniform noise? It also seems to run counter to the deterministic setting.

Regardless, I agree with Calf, it's very clear MCs are not equivalent to LLMs. That is trivially a false statement. But the question of if an LLM can be represented via a MC is a different question. I did find this paper on the topic[0], but I need to give it a better read. Does look like it was rejected from ICLR[1], though ML review is very noisy. Including the link as comments are more informative than the accept/reject signal.

(@Calf, sorry, I didn't respond to your comment because I wasn't trying to make a comment about the relationship of LLMs and MCs. Only that there was more fundamental research being overshadowed)

[0] https://arxiv.org/abs/2410.02724

[1] https://openreview.net/forum?id=RDFkGZ9Dkh

replies(1): >>45142605 #

187. bubblyworld ◴[05 Sep 25 08:30 UTC] No.45136293{4}[source]▶

>>45128356 #

LLMs can clearly make use of tools, unlike your cat. The claim was that they cannot do backtracking natively, which may or may not be true but it's irrelevant because they can do it through code.

Who said anything about software engineers?

188. suddenlybananas ◴[05 Sep 25 10:35 UTC] No.45137064{7}[source]▶

>>45126938 #

Your writing reminds me of a schizophrenic.

189. chermi ◴[05 Sep 25 16:43 UTC] No.45140626{7}[source]▶

>>45124354 #

Huh? Have you heard of learning something new? Physicists and scientists at large are pretty good at it. Do you want some certification program to determine who's allowed to opine? If someone is wrong, tell them and show them they're wrong. Don't preemptively dismiss ideas based on some authority mechanism.

Here's another example in case you still don't get the point - Schrodinger had no business talking about biology because he wasn't trained in it, right? Nevermind him being ahead of the entire field on understanding the role of "DNA"(yet undiscovered, but he correctly posited the crystal-ish structure) and information in evolution and inspiring Watson's quest to figure out DNA.

Judge ideas on the merit of the idea itself. It's not about whether they have computing backgrounds, its about the ideas.

Hell, look at the history of deep learning with Minsky's book. Sure glad everyone listened to the linguistics expert there...

190. measurablefunc ◴[05 Sep 25 19:26 UTC] No.45142605{9}[source]▶

>>45136257 #

If it's trivially false then you should be able to present a counter-example but so far no one has done that but there has been a lot of hand-waving about "trivialities" of one sort or another.

Neural networks are stateless, the output only depends on the current input so the Markov property is trivially/vacuously true. The reason for the uniform random number for sampling from the CDF¹ is b/c if you have the cumulative distribution function of a probability density then you can sample from the distribution by using a uniformly distributed RNG.

¹https://stackoverflow.com/questions/60559616/how-to-sample-f...

replies(1): >>45147349 #

191. throwaway314155 ◴[06 Sep 25 01:14 UTC] No.45145646{5}[source]▶

>>45126609 #

Well said

192. measurablefunc ◴[06 Sep 25 02:58 UTC] No.45146244{11}[source]▶

>>45121690 #

So where is the error exactly? Loop around is simply a repetition of the argument for the equivalence between an LLM & a Markov chain. It doesn't matter how many times you sample the trajectories from either one, they're still extensionally equivalent.

193. godelski ◴[06 Sep 25 07:32 UTC] No.45147349{10}[source]▶

>>45142605 #

You want me to show that it is trivially false that all Neural Networks are not Markov Chains? I mean we could point to a RNN which doesn't have the Markov Property. I mean another trivial case is when the rows do not sum to 1. I mean the internal states of neural networks are not required to be probability distributions. In fact, this isn't a requirement anywhere in a neural network. So whatever you want to call the transition matrix you're going to have issues.

Or the inverse of this? That all Markov Chains are Neural Networks? Sure. Well sure, here's my transition matrix [1].

I'm quite positive an LLM would be able to give you more examples.

  > the output only depends on the current input so the Markov property is trivially/vacuously true.

It's pretty clear you did not get your PhD in ML.

  > The reason for the uniform random number

I think you're misunderstanding. Maybe I'm misunderstanding. But I'm failing to understand why you're jumping to the CDF. I also don't understand why this answers my question since there are other ways to sample from a distribution knowing only its CDF and without using the uniform distribution. I mean you can always convert to the uniform distribution and there's lots of tricks to do that. Or I mean the distribution in that SO post is the Rayleigh Distribution so we don't even need to do that. My question was not about that uniform is clean, but that it is a requirement. But this just doesn't seem relevant at all.

replies(1): >>45151801 #

194. pama ◴[06 Sep 25 12:36 UTC] No.45148750{3}[source]▶

>>45127090 #

There is room for improvement in inference, hence the presence of various startups in this space and the increased innovation in software. Large nvidia clusters are still cost optimal for scaling inference (as they move most of the memory transfer of smaller setups out of the critical path), and their energy cost is trivial compared to the cost of the hardware, but these conditions may change.

Training is nearly fully compute bound and NVidia/CUDA provide decent abstractions for it. At least for now. We still need new ideas if training is to scale another 10 orders of magnitude in compute, but these ideas may not be practical for another decade.

195. measurablefunc ◴[06 Sep 25 18:39 UTC] No.45151801{11}[source]▶

>>45147349 #

Either find the exact error in the proof or stop running around in circles. The proof is very simple so if there is an error in any of it you should be able to find one very easily but you haven't done that. You have only asked for unrelated clarifications & gone on unrelated tangents.

196. AaronAPU ◴[06 Sep 25 21:50 UTC] No.45153196{5}[source]▶

>>45125963 #

Those words contain far more relations than just “sticks” — the reduction is in your framing.

↑