The wall confronting large language models

(arxiv.org)

measurablefunc ◴[03 Sep 25 20:29 UTC] No.45120049[source]▶

>>45114579 (OP) #

There is a formal extensional equivalence between Markov chains & LLMs but the only person who seems to be saying anything about this is Gary Marcus. He is constantly making the point that symbolic understanding cannot be reduced to a probabilistic computation: regardless of how large the graph gets, it will still be missing basic stuff like backtracking (which is available in programming languages like Prolog). I think that Gary is right on basically all counts. Probabilistic generative models are fun but no amount of probabilistic sequence generation can be a substitute for logical reasoning.

replies(16): >>45120249 #>>45120259 #>>45120415 #>>45120573 #>>45120628 #>>45121159 #>>45121215 #>>45122702 #>>45122805 #>>45123808 #>>45123989 #>>45125478 #>>45125935 #>>45129038 #>>45130942 #>>45131644 #

1. Certhas ◴[03 Sep 25 20:48 UTC] No.45120259[source]▶

>>45120049 #

I don't understand what point you're hinting at.

Either way, I can get arbitrarily good approximations of arbitrary nonlinear differential/difference equations using only linear probabilistic evolution at the cost of a (much) larger state space. So if you can implement it in a brain or a computer, there is a sufficiently large probabilistic dynamic that can model it. More really is different.

So I view all deductive ab-initio arguments about what LLMs can/can't do due to their architecture as fairly baseless.

(Note that the "large" here is doing a lot of heavy lifting. You need _really_ large. See https://en.m.wikipedia.org/wiki/Transfer_operator)
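A minimal sketch of the transfer-operator idea referenced above, assuming the logistic map x -> 4x(1-x) as the nonlinear system and a 50-bin partition of [0, 1] (both are illustrative choices, not taken from the comment). The map itself is nonlinear in x, but the update of the probability vector is a plain matrix product; the cost is that one number becomes 50.

```python
# Ulam-style discretization: approximate a nonlinear map on [0, 1]
# by a row-stochastic transition matrix over bins.

def logistic(x):
    return 4.0 * x * (1.0 - x)

def ulam_matrix(n_bins, samples_per_bin=200):
    """P[i][j] = fraction of sample points in bin i that the map sends to bin j."""
    P = [[0.0] * n_bins for _ in range(n_bins)]
    for i in range(n_bins):
        for k in range(samples_per_bin):
            x = (i + (k + 0.5) / samples_per_bin) / n_bins   # a point inside bin i
            j = min(int(logistic(x) * n_bins), n_bins - 1)   # bin that point lands in
            P[i][j] += 1.0 / samples_per_bin
    return P

def step(dist, P):
    """One linear update of a probability vector: dist' = dist @ P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

n_bins = 50
P = ulam_matrix(n_bins)
dist = [0.0] * n_bins
dist[10] = 1.0                  # all probability mass starts in one bin
for _ in range(20):
    dist = step(dist, P)        # linear evolution of the distribution
```

Finer partitions give a better approximation of the underlying dynamics, which is the sense in which "large" is doing the heavy lifting.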

replies(5): >>45120313 #>>45120341 #>>45120344 #>>45123837 #>>45124441 #

2. arduanika ◴[03 Sep 25 20:54 UTC] No.45120313[source]▶

>>45120259 (TP) #

What hinting? The comment was very clear. Arbitrarily good approximation is different from symbolic understanding.

"if you can implement it in a brain"

But we didn't. You have no idea how a brain works. Neither does anyone.

replies(3): >>45120357 #>>45120411 #>>45121006 #

3. awesome_dude ◴[03 Sep 25 20:57 UTC] No.45120341[source]▶

>>45120259 (TP) #

I think that the difference can be best explained thus:

I guess that you are most likely going to have cereal for breakfast tomorrow, I also guess that it's because it's your favourite.

I understand that you don't like cereal for breakfast, and I understand that you only have it every day because a Dr told you that it was the only way for you to start the day in a way that aligns with your health and dietary needs.

Meaning, I can guess based on past behaviour and be right, but understanding the reasoning for those choices, that's a whole other ballgame. Further, if we do end up with an AI that actually understands, well, that would really open up creativity, and problem solving.

replies(1): >>45120879 #

4. measurablefunc ◴[03 Sep 25 20:57 UTC] No.45120344[source]▶

>>45120259 (TP) #

What part about backtracking is baseless? Typical Prolog interpreters can be implemented in a few MBs of binary code (the high level specification is even simpler & can be in a few hundred KB)¹ but none of the LLMs (open source or not) are capable of backtracking even though there is plenty of room for a basic Prolog interpreter. This seems like a very obvious shortcoming to me that no amount of smooth approximation can overcome.

If you think there is a threshold at which point some large enough feedforward network develops the capability to backtrack then I'd like to see your argument for it.

¹https://en.wikipedia.org/wiki/Warren_Abstract_Machine

replies(3): >>45120516 #>>45121626 #>>45124764 #

5. Certhas ◴[03 Sep 25 20:59 UTC] No.45120357[source]▶

>>45120313 #

We didn't but somebody did so it's possible so probabilistic dynamics in high enough dimensions can do it.

We don't understand what LLMs are doing. You can't go from understanding what a transformer is to understanding what an LLM does any more than you can go from understanding what a Neuron is to what a brain does.

6. mallowdram ◴[03 Sep 25 21:04 UTC] No.45120411[source]▶

>>45120313 #

We know the healthy brain is unpredictable. We suspect error minimization and prediction are not central tenets. We know the brain creates memory via differences in sharp wave ripples. That it's oscillatory. That it neither uses symbols nor represents. That words are wholly external to what we call thought. The authors deal with molecules which are neither arbitrary nor specific. Yet tumors ARE specific, while words are wholly arbitrary. Knowing these things should offer a deep suspicion of ML/LLMs. They have so little to do with how brains work and the units brains actually use (all oscillation is specific, all stats emerge from arbitrary symbols and worse: metaphors) that mistaking LLMs for reasoning/inference is less lexemic hallucination and more eugenic.

replies(3): >>45120774 #>>45120824 #>>45124688 #

7. bondarchuk ◴[03 Sep 25 21:18 UTC] No.45120516[source]▶

>>45120344 #

Backtracking makes sense in a search context which is basically what prolog is. Why would you expect a next-token-predictor to do backtracking and what should that even look like?

replies(2): >>45120536 #>>45120766 #

8. measurablefunc ◴[03 Sep 25 21:20 UTC] No.45120536{3}[source]▶

>>45120516 #

I don't expect a Markov chain to be capable of backtracking. That's the point I am making. Logical reasoning as it is implemented in Prolog interpreters is not something that can be done w/ LLMs regardless of the size of their weights, biases, & activation functions between the nodes in the graph.

replies(4): >>45120598 #>>45121266 #>>45122145 #>>45124657 #

9. bondarchuk ◴[03 Sep 25 21:27 UTC] No.45120598{4}[source]▶

>>45120536 #

Imagine the context window contains A-B-C, C turns out a dead end and we want to backtrack to B and try another branch. Then the LLM could produce outputs such that the context window would become A-B-C-[backtrack-back-to-B-and-don't-do-C] which after some more tokens could become A-B-C-[backtrack-back-to-B-and-don't-do-C]-D. This would essentially be backtracking and I don't see why it would be inherently impossible for LLMs as long as the different branches fit in context.
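The scheme described above can be sketched roughly as follows, assuming a tiny hand-written search tree and a hypothetical `backtrack:` marker token (neither comes from a real model or tokenizer). The context is append-only, and the next token is a pure function of the context, yet the trace abandons C and continues with D.

```python
# Hypothetical search tree: A -> B -> {C (dead end), D -> GOAL}
CHILDREN = {"A": ["B"], "B": ["C", "D"], "C": [], "D": ["GOAL"], "GOAL": []}

def replay(context):
    """Recover the current path and the rejected nodes from the append-only context."""
    path, dead = [], set()
    for tok in context:
        if tok.startswith("backtrack:"):
            dead.add(path.pop())        # marker: abandon the last node on the path
        else:
            path.append(tok)
    return path, dead

def next_token(context):
    """Depends only on the context: extend the path, or emit a backtrack marker."""
    path, dead = replay(context)
    for child in CHILDREN[path[-1]]:
        if child not in dead:
            return child
    return "backtrack:" + path[-1]

context = ["A"]
while context[-1] != "GOAL":
    context.append(next_token(context))

print(context)
```

Nothing is ever deleted from the context; the dead branch stays visible and the marker records that it was rejected, which only works while all explored branches fit in the window.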

replies(1): >>45120655 #

10. measurablefunc ◴[03 Sep 25 21:35 UTC] No.45120655{5}[source]▶

>>45120598 #

If you think it is possible then I'd like to see an implementation of a sudoku puzzle solver as Markov chain. This is a simple enough problem that can be implemented in a few dozen lines of Prolog but I've never seen a solver implemented as a Markov chain.

replies(4): >>45120792 #>>45120895 #>>45120900 #>>45121022 #

11. PaulHoule ◴[03 Sep 25 21:51 UTC] No.45120766{3}[source]▶

>>45120516 #

If you want general-purpose generation then it has to be able to respect constraints (e.g. in figure art, that a person has 0..1 belly buttons and 0..2 legs is unspoken). As it is, generative models usually get those things right but don't always, if they can stick together the tiles they use internally in some combination that makes sense locally but not globally.

General intelligence may not be SAT/SMT solving but it has to be able to do it, hence, backtracking.

Today I had another of those experiences of the weaknesses of LLM reasoning, one that happens a lot when doing LLM-assisted coding. I was trying to figure out how to rebuild some CSS after the HTML changed for accessibility purposes and got a good idea for how to do it from talking to the LLM but at that point the context was poisoned, probably because there was a lot of content about the context describing what we were thinking about at different stages of the conversation which evolved considerably. It lost its ability to follow instructions and I'd tell it specifically to do this or do that and it just wouldn't do it properly and this happens a lot if a session goes on too long.

My guess is that the attention mechanism is locking on to parts of the conversation which are no longer relevant to where I think we're at and in general the logic that considers the variation of either a practice (instances) or a theory over time is a very tricky problem and 'backtracking' is a specific answer for maintaining your knowledge base across a search process.

replies(2): >>45121102 #>>45123910 #

12. Zigurd ◴[03 Sep 25 21:52 UTC] No.45120774{3}[source]▶

>>45120411 #

"That words are wholly external to what we call thought." may be what we should learn, or at least hypothesize, based on what we see LLMs doing. I'm disappointed that AI isn't more of a laboratory for understanding brain architecture, and precisely what is this thing called thought.

replies(1): >>45121279 #

13. sudosysgen ◴[03 Sep 25 21:54 UTC] No.45120792{6}[source]▶

>>45120655 #

You can do that pretty trivially for any fixed size problem (as in solvable with a fixed-sized tape Turing machine), you'll just have a titanically huge state space. The claim of the LLM folks is that the models have a huge state space (they do have a titanically huge state space) and can navigate it efficiently.

Simply have a deterministic Markov chain where each state is a possible value of the tape+state of the TM and which transitions accordingly.
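A small sketch of that construction, assuming a toy machine invented for illustration (it inverts each bit of a 3-cell tape while moving right, then halts). Every (state, head position, tape contents) triple becomes one state of the chain, and each row of the transition matrix contains a single 1.0.

```python
from itertools import product

TAPE_LEN = 3
STATES = ("run", "halt")

def tm_step(config):
    """One step of the toy machine: invert the bit under the head, move right, halt at the end."""
    state, head, tape = config
    if state == "halt":
        return config                                   # halting configurations are absorbing
    flipped = tape[:head] + (1 - tape[head],) + tape[head + 1:]
    if head + 1 == TAPE_LEN:
        return ("halt", head, flipped)
    return ("run", head + 1, flipped)

# Enumerate every configuration; each one is a state of the Markov chain.
configs = [(s, h, t) for s in STATES
           for h in range(TAPE_LEN)
           for t in product((0, 1), repeat=TAPE_LEN)]
index = {c: i for i, c in enumerate(configs)}

# Deterministic chain: each row has a single 1.0, in the successor's column.
n = len(configs)
P = [[0.0] * n for _ in range(n)]
for c in configs:
    P[index[c]][index[tm_step(c)]] = 1.0

def run_chain(start):
    """Follow the chain from `start` until it reaches an absorbing state."""
    i = index[start]
    while True:
        j = P[i].index(1.0)
        if j == i:
            return configs[i]
        i = j

result = run_chain(("run", 0, (0, 1, 1)))
```

Even this 3-cell machine needs 48 chain states; the count grows exponentially with tape length, which is the "titanically huge state space" caveat.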

replies(1): >>45120915 #

14. quantummagic ◴[03 Sep 25 21:58 UTC] No.45120824{3}[source]▶

>>45120411 #

What do you think about the idea that LLMs are not reasoning/inferring, but are rather an approximation of the result? Just like you yourself might have to spend some effort reasoning, on how a plant grows, in order to answer questions about that subject. When asked, you wouldn't replicate that reasoning, instead you would recall the crystallized representation of the knowledge you accumulated while previously reasoning/learning. The "thinking" in the process isn't modelled by the LLM data, but rather by the code/strategies used to iterate over this crystallized knowledge, and present it to the user.

replies(1): >>45121309 #

15. quantummagic ◴[03 Sep 25 22:04 UTC] No.45120879[source]▶

>>45120341 #

How are the two cases you present fundamentally different? Aren't they both the same _type_ of knowledge? Why do you attribute "true understanding" to the case of knowing what the Dr said? Why stop there? Isn't true understanding knowing why we trust what the doctor said (all those years of schooling, and a presumption of competence, etc)? And why stop there? Why do we value years of schooling? Understanding, can always be taken to a deeper level, but does that mean we didn't "truly" understand earlier? And aren't the data structures needed to encode the knowledge, exactly the same for both cases you presented?

replies(1): >>45121027 #

16. bboygravity ◴[03 Sep 25 22:06 UTC] No.45120895{6}[source]▶

>>45120655 #

The LLM can just write the Prolog and solve the sudoku that way. I don't get your point. LLMs like Grok 4 can probably one-shot this today with the current state of art. You can likely just ask it to solve any sudoku and it will do it (by writing code in the background and running it and returning the result). And this is still very early stage compared to what will be out a year from now.

Why does it matter how it does it or whether this is strictly LLM or LLM with tools for any practical purpose?

replies(1): >>45123059 #

17. Ukv ◴[03 Sep 25 22:06 UTC] No.45120900{6}[source]▶

>>45120655 #

> If you think it is possible then I'd like to see an implementation of a sudoku puzzle solver as Markov chain

Have each of the Markov chain's states be one of 10^81 possible sudoku grids (a 9x9 grid of digits 1-9 and blank), then calculate the 10^81-by-10^81 transition matrix that takes each incomplete grid to the valid complete grid containing the same numbers. If you want you could even have it fill one square at a time rather than jump right to the solution, though there's no need to.

Up to you what you do for ambiguous inputs (select one solution at random to give 1.0 probability in the transition matrix? equally weight valid solutions? have the states be sets of boards and map to set of all valid solutions?) and impossible inputs (map to itself? have the states be sets of boards and map to empty set?).

Could say that's "cheating" by pre-computing the answers and hard-coding them in a massive input-output lookup table, but to my understanding that's also the only sense in which there's equivalence between Markov chains and LLMs.
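To make the lookup-table construction concrete, here is a sketch on a 2x2 Latin square standing in for the 9x9 puzzle (an assumption made purely to keep the table small: 3^4 = 81 grids instead of 10^81). It takes the "map to the set of all valid solutions" option for ambiguous inputs, so impossible inputs map to the empty set.

```python
from itertools import product

N = 2          # 2x2 Latin square as a stand-in for sudoku
BLANK = 0

def is_solution(grid):
    """True if every row and column of the flat N*N tuple contains 1..N exactly once."""
    rows = [grid[r * N:(r + 1) * N] for r in range(N)]
    cols = [grid[c::N] for c in range(N)]
    full = set(range(1, N + 1))
    return all(set(line) == full for line in rows + cols)

solutions = [g for g in product(range(1, N + 1), repeat=N * N) if is_solution(g)]

def completions(grid):
    """All solved grids that agree with every prefilled cell of `grid`."""
    return frozenset(s for s in solutions
                     if all(g == BLANK or g == v for g, v in zip(grid, s)))

# The precomputed "transition": one entry per possible grid, blanks included.
table = {g: completions(g) for g in product(range(N + 1), repeat=N * N)}

print(table[(1, 0, 0, 0)])     # a grid with one clue
print(len(table[(0, 0, 0, 0)]))  # the empty grid is ambiguous
```

All of the solving happens while the table is built; at "run time" the chain only performs a lookup, which is the sense in which this is a precomputed input-output table.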

replies(1): >>45120933 #

18. measurablefunc ◴[03 Sep 25 22:08 UTC] No.45120915{7}[source]▶

>>45120792 #

How are you encoding the state spaces for the sudoku solver specifically?

19. measurablefunc ◴[03 Sep 25 22:11 UTC] No.45120933{7}[source]▶

>>45120900 #

There are multiple solutions for each incomplete grid so how are you calculating the transitions for a grid w/ a non-unique solution?

Edit: I see you added questions for the ambiguities but modulo those choices your solution will almost work b/c it is not extensionally equivalent entirely. The transition graph and solver are almost extensionally equivalent but whereas the Prolog solver will backtrack there is no backtracking in the Markov chain and you have to re-run the chain multiple times to find all the solutions.

replies(2): >>45121032 #>>45121057 #

20. jjgreen ◴[03 Sep 25 22:21 UTC] No.45121006[source]▶

>>45120313 #

You can look at it, from the inside.

21. lelanthran ◴[03 Sep 25 22:22 UTC] No.45121022{6}[source]▶

>>45120655 #

> If you think it is possible then I'd like to see an implementation of a sudoku puzzle solver as Markov chain. This is a simple enough problem that can be implemented in a few dozen lines of Prolog but I've never seen a solver implemented as a Markov chain.

I think it can be done. I started a chatbot that works like this some time back (2024) but paused work on it since January.

In brief, you shorten the context by discarding the context that didn't work out.

22. awesome_dude ◴[03 Sep 25 22:23 UTC] No.45121027{3}[source]▶

>>45120879 #

When you ask that question, why don't you just use a corpus of the previous answers to get some result?

Why do you need to ask me, isn't a guess based on past answers good enough?

Or, do you understand that you need to know more, you need to understand the reasoning based on what's missing from that post?

replies(1): >>45121696 #

23. ◴[03 Sep 25 22:25 UTC] No.45121032{8}[source]▶

>>45120933 #

24. Ukv ◴[03 Sep 25 22:26 UTC] No.45121057{8}[source]▶

>>45120933 #

> but whereas the Prolog solver will backtrack there is no backtracking in the Markov chain and you have to re-run the chain multiple times to find all the solutions

If you want it to give all possible solutions at once, you can just expand the state space to the power-set of sudoku boards, such that the input board transitions to the state representing the set of valid solved boards.

replies(2): >>45121077 #>>45126405 #

25. measurablefunc ◴[03 Sep 25 22:29 UTC] No.45121077{9}[source]▶

>>45121057 #

That still won't work b/c there is no backtracking. The point is that there is no way to encode backtracking/choice points like in Prolog w/ a Markov chain. The argument you have presented is not extensionally equivalent to the Prolog solver. It is almost equivalent but it's missing choice points for starting at a valid solution & backtracking to an incomplete board to generate a new one. The typical argument for absorbing states doesn't work b/c sudoku is not a typical deterministic puzzle.

replies(1): >>45121229 #

26. XenophileJKO ◴[03 Sep 25 22:34 UTC] No.45121102{4}[source]▶

>>45120766 #

What if you gave the model a tool to "willfully forget" a section of context. That would be easy to make. Hmm I might be onto something.

replies(1): >>45121201 #

27. PaulHoule ◴[03 Sep 25 22:48 UTC] No.45121201{5}[source]▶

>>45121102 #

I guess you could have some kind of mask that would let you suppress some of the context from matching, but my guess is that kind of thing might cause problems as often as it solves them.

Back when I was thinking about commonsense reasoning with logic it was obviously a much more difficult problem to add things like "P was true before time t", "there will be some time t in the future such that P is true", "John believes Mary believes that P is true", "It is possible that P is true", "there is some person q who believes that P is true", particularly when you combine these qualifiers. For one thing you don't even have a sound and complete strategy for reasoning over first-order logic + arithmetic but you also have a combinatorial explosion over the qualifiers.

Back in the day I thought it was important to have sound reasoning procedures but one of the reasons none of my foundation models ever became ChatGPT was that I cared about that and I really needed to ask "does change C cause an unsound procedure to get the right answer more often?" and not care if the reasoning procedure was sound or not.

28. Ukv ◴[03 Sep 25 22:52 UTC] No.45121229{10}[source]▶

>>45121077 #

> That still won't work b/c there is no backtracking.

It's essentially just a lookup table mapping from input board to the set of valid output boards - there's no real way for it not to work (obviously not practical though). If board A has valid solutions B, C, D, then the transition matrix cell mapping {A} to {B, C, D} is 1.0, and all other entries in that row are 0.0.

> The point is that there is no way to encode backtracking/choice points

You can if you want, keeping the same variables as a regular sudoku solver as part of the Markov chain's state and transitioning instruction-by-instruction, rather than mapping directly to the solution - just that there's no particular need to when you've precomputed the solution.

replies(1): >>45121264 #

29. measurablefunc ◴[03 Sep 25 22:57 UTC] No.45121264{11}[source]▶

>>45121229 #

My point is that your initial argument was missing several key pieces & if you specify the entire state space you will see that it's not as simple as you thought initially. I'm not saying it can't be done but that it's actually much more complicated than simply saying just take an incomplete board state s & uniform transitions between s, s' for valid solutions s' that are compatible with s. In fact, now that I spelled out the issues I still don't think this is a formal extensional equivalence. Prolog has interactive transitions between the states & it tracks choice points so compiling a sudoku solver to a Markov chain requires more than just tracking the board state in the context.

replies(1): >>45121671 #

30. vidarh ◴[03 Sep 25 22:57 UTC] No.45121266{4}[source]▶

>>45120536 #

A (2,3) Turing machine can be trivially implemented with a loop around an LLM that treats the context as an IO channel, and a Prolog interpreter runs on a Turing complete computer, and so per Turing equivalence you can run a Prolog interpreter on an LLM.

Of course this would be pointless, but it demonstrates that a system where an LLM provides the logic can backtrack, as there's nothing computationally special about backtracking.

That current UIs to LLMs are set up for conversation-style use that makes this harder isn't an inherent limitation of what we can do with LLMs.

replies(1): >>45121294 #

31. mallowdram ◴[03 Sep 25 22:59 UTC] No.45121279{4}[source]▶

>>45120774 #

The question is how to model the irreducible. And then to concatenate between spatiotemporal neuroscience (the oscillators) and neural syntax (what's oscillating) and add or subtract what the fields are doing to bind that to the surroundings.

32. measurablefunc ◴[03 Sep 25 23:01 UTC] No.45121294{5}[source]▶

>>45121266 #

Loop around an LLM is not an LLM.

replies(1): >>45121337 #

33. mallowdram ◴[03 Sep 25 23:03 UTC] No.45121309{4}[source]▶

>>45120824 #

This is the toughest part. We need some kind of analog external that concatenates. It's software, but not necessarily binary, it uses topology to express that analog. It somehow is visual, i.e. you can see it, but at the same time, it can be expanded specifically into syntax, the details of which are invisible. Scale invariance is probably key.

34. vidarh ◴[03 Sep 25 23:07 UTC] No.45121337{6}[source]▶

>>45121294 #

Then no current systems you are using are LLMs

replies(1): >>45121409 #

35. measurablefunc ◴[03 Sep 25 23:16 UTC] No.45121409{7}[source]▶

>>45121337 #

Choice-free feedforward graphs are LLMs. The inputs/outputs are extensionally equivalent to context and transition probabilities of a Markov chain. What exactly is your argument b/c what it looks like to me is you're simply making a Turing tarpit argument which does not address any of my points.

replies(1): >>45121690 #

36. skissane ◴[03 Sep 25 23:42 UTC] No.45121626[source]▶

>>45120344 #

> but none of the LLMs (open source or not) are capable of backtracking even though there is plenty of room for a basic Prolog interpreter. This seems like a very obvious shortcoming to me that no amount of smooth approximation can overcome.

The fundamental autoregressive architecture is absolutely capable of backtracking… we generate next token probabilities, select a next token, then calculate probabilities for the token thereafter.

There is absolutely nothing stopping you from “rewinding” to an earlier token, making a different selection and replaying from that point. The basic architecture absolutely supports it.

Why then has nobody implemented it? Maybe, this kind of backtracking isn’t really that useful.
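A rough sketch of the "rewind and reselect" loop described above, assuming a stand-in `toy_next_token_probs` function and a hypothetical `acceptable` check (neither is a real model API, and greedy selection replaces sampling to keep the run deterministic). The decoder keeps, per position, the tokens already tried, and pops back one position when a position is exhausted.

```python
def toy_next_token_probs(prefix):
    """Hypothetical stand-in for a model: a fixed preference order over three tokens."""
    return {"a": 0.6, "b": 0.3, "c": 0.1}

def acceptable(seq, length):
    """Hypothetical external check, applied only to complete sequences: reject any 'aa'."""
    return len(seq) < length or "aa" not in "".join(seq)

def decode_with_rewind(length):
    seq = []
    tried = [set()]                 # tokens already tried at each open position
    rewinds = 0
    while len(seq) < length:
        probs = toy_next_token_probs(seq)
        options = [t for t in sorted(probs, key=probs.get, reverse=True)
                   if t not in tried[-1]]
        if not options:             # position exhausted: rewind to the previous token
            if not seq:
                return None, rewinds
            tried.pop()
            seq.pop()
            rewinds += 1
            continue
        tok = options[0]            # greedy choice among untried tokens
        tried[-1].add(tok)
        if acceptable(seq + [tok], length):
            seq.append(tok)
            tried.append(set())
    return seq, rewinds

sequence, rewinds = decode_with_rewind(3)
print(sequence, rewinds)
```

The bookkeeping (`tried`, the rewind itself) lives in the loop around the model rather than in the model's weights, which is where much of the disagreement in this subthread sits.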

replies(2): >>45121703 #>>45124591 #

37. Ukv ◴[03 Sep 25 23:47 UTC] No.45121671{12}[source]▶

>>45121264 #

> My point is that your initial argument was missing several key pieces

My initial example was a response to "If you think it is possible then I'd like to see an implementation of a sudoku puzzle solver as Markov chain", describing how a Sudoku solver could be implemented as a Markov chain. I don't think there's anything missing from it - it solves all proper Sudokus, and I only left open the choice of how to handle improper Sudokus because that was unspecified (but trivial regardless of what's wanted).

> I'm not saying it can't be done but that it's actually much more complicated

If that's the case, then I did misinterpret your comments as saying it can't be done. But, I don't think it's really complicated regardless of whatever "ok but now it must encode choice points in its state" are thrown at it - it's just a state-to-state transition look-up table.

> so compiling a sudoku solver to a Markov chain requires more than just tracking the board state in the context.

As noted, you can keep all the same variables as a regular Sudoku solver as part of the Markov chain's state and transition instruction-by-instruction, if that's what you want.

If you mean inputs from a user, the same is true of LLMs which are typically run interactively. Either model the whole universe including the user as part of state transition table (maybe impossible, depending on your beliefs about the universe), or have user interaction take the current state, modify it, and use it as initial state for a new run of the Markov chain.

replies(1): >>45121836 #

38. vidarh ◴[03 Sep 25 23:50 UTC] No.45121690{8}[source]▶

>>45121409 #

My argument is that artificially limiting what you argue about to a subset of the systems people are actually using and arguing about the limitations of that makes your argument irrelevant to what people are actually using.

replies(1): >>45146244 #

39. quantummagic ◴[03 Sep 25 23:51 UTC] No.45121696{4}[source]▶

>>45121027 #

I asked that question in an attempt to not sound too argumentative. It was rhetorical. I'm asking you to consider the fact that there isn't actually any difference between the two examples you provided. They're fundamentally the same type of knowledge. They can be represented by the same data structures.

There's _always_ something missing, left unsaid in every example, it's the nature of language.

As for your example, the LLM can be trained to know the underlying reasons (doctor's recommendation, etc.). That knowledge is not fundamentally different from the knowledge that someone tends to eat cereal for breakfast. My question to you, was an attempt to highlight that the dichotomy you were drawing, in your example, doesn't actually exist.

replies(1): >>45122364 #

40. measurablefunc ◴[03 Sep 25 23:51 UTC] No.45121703{3}[source]▶

>>45121626 #

Where is this spelled out formally and proven logically?

replies(1): >>45121936 #

41. measurablefunc ◴[04 Sep 25 00:07 UTC] No.45121836{13}[source]▶

>>45121671 #

> As noted, you can keep all the same variables as a regular Sudoku solver

What are those variables exactly?

replies(1): >>45122154 #

42. skissane ◴[04 Sep 25 00:22 UTC] No.45121936{4}[source]▶

>>45121703 #

LLM backtracking is an active area of research, see e.g.

https://arxiv.org/html/2502.04404v1

https://arxiv.org/abs/2306.05426

And I was wrong that nobody has implemented it, as these papers prove people have… it is just the results haven’t been sufficiently impressive to support the transition from the research lab to industrial use - or at least, not yet

replies(2): >>45122006 #>>45122999 #

43. measurablefunc ◴[04 Sep 25 00:31 UTC] No.45122006{5}[source]▶

>>45121936 #

> Empirical evaluations demonstrate that our proposal significantly enhances the reasoning capabilities of LLMs, achieving a performance gain of over 40% compared to the optimal-path supervised fine-tuning method.

44. ◴[04 Sep 25 00:48 UTC] No.45122145{4}[source]▶

>>45120536 #

45. Ukv ◴[04 Sep 25 00:49 UTC] No.45122154{14}[source]▶

>>45121836 #

For a depth-first solution (backtracking), I'd assume mostly just the partial solutions and a few small counters/indices/masks - like for tracking the cell we're up to and which cells were prefilled. Specifics will depend on the solver, but can be made part of Markov chain's state regardless.
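As a sketch of what such a state could look like, assume a 3x3 Latin square in place of sudoku and a state made of (grid, current cell index, direction flag), with the prefilled mask held fixed (all of these are illustrative choices). `step` is a function of the current state only, so each call is one deterministic transition, and backtracking is just a transition that clears a cell and moves the index back.

```python
N = 3   # 3x3 Latin square as a small stand-in for sudoku

def conflicts(grid, pos, v):
    """True if value v at cell pos clashes with another cell in its row or column."""
    r, c = divmod(pos, N)
    return any(grid[q] == v for q in range(N * N)
               if q != pos and (q // N == r or q % N == c))

def step(state, prefilled):
    """One transition of the solver; the next state depends only on the current one."""
    grid, pos, mode = state
    if pos < 0 or pos == N * N:
        return state                                    # absorbing: unsolvable or solved
    if prefilled[pos]:                                  # skip clues in the current direction
        return (grid, pos + (1 if mode == "forward" else -1), mode)
    for v in range(grid[pos] + 1, N + 1):               # next untried value for this cell
        if not conflicts(grid, pos, v):
            return (grid[:pos] + (v,) + grid[pos + 1:], pos + 1, "forward")
    # No value left: clear the cell and back up to the previous choice point.
    return (grid[:pos] + (0,) + grid[pos + 1:], pos - 1, "back")

def solve(puzzle):
    prefilled = tuple(v != 0 for v in puzzle)
    state = (tuple(puzzle), 0, "forward")
    while step(state, prefilled) != state:
        state = step(state, prefilled)
    return state

final_grid, final_pos, _ = solve((1, 0, 0,
                                  0, 0, 0,
                                  0, 0, 2))
```

A final index of `N * N` means solved and `-1` means no solution; the value already sitting in a cell doubles as the record of which choices were tried there, so no separate stack is needed.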

46. awesome_dude ◴[04 Sep 25 01:21 UTC] No.45122364{5}[source]▶

>>45121696 #

> They're fundamentally the same type of knowledge. They can be represented by the same data structures.

Maybe, maybe one is based on correlation, the other causation.

replies(1): >>45122758 #

47. quantummagic ◴[04 Sep 25 02:16 UTC] No.45122758{6}[source]▶

>>45122364 #

What if the causation had simply been that he enjoyed cereal for breakfast?

In either case, the results are the same, he's eating cereal for breakfast. We can know this fact without knowing the underlying cause. Many times, we don't even know the cause of things we choose to do for ourselves, let alone what others do.

On top of which, even if you think the "cause" is that the doctor told him to eat a healthy diet, do you really know the actual cause? Maybe the real cause, is that the girl he fancies, told him he's not in good enough shape. The doctor telling him how to get in shape is only a correlation, the real cause is his desire to win the girl.

These connections are vast and deep, but they're all essentially the same type of knowledge, representable by the same data structures.

replies(1): >>45123193 #

48. afiori ◴[04 Sep 25 02:54 UTC] No.45122999{5}[source]▶

>>45121936 #

I would expect to see something like this soonish as around now we are seeing the end of training scaling and the beginning of inference scaling

replies(1): >>45123196 #

49. PhunkyPhil ◴[04 Sep 25 03:05 UTC] No.45123059{7}[source]▶

>>45120895 #

The point isn't whether the output is correct or not, it's whether the actual net is doing "logical computation" à la Prolog.

What you're suggesting is akin to me saying you can't build a house, then you go and hire someone to build a house. _You_ didn't build the house.

replies(1): >>45124543 #

50. awesome_dude ◴[04 Sep 25 03:24 UTC] No.45123193{7}[source]▶

>>45122758 #

> In either case, the results are the same, he's eating cereal for breakfast. We can know this fact without knowing the underlying cause. Many times, we don't even know the cause of things we choose to do for ourselves, let alone what others do.

Yeah, no.

Understanding the causation allows the system to provide a better answer.

If they "enjoy" cereal, what about it do they enjoy, and what other possible things can be had for breakfast that also satisfy that enjoyment.

You'll never find that by looking only at the fact that they have eaten cereal for breakfast.

And the fact that that's not obvious to you is why I cannot be bothered going into any more depth on the topic any more. It's clear that you don't have any understanding on the topic beyond a superficial glance.

Bye :)

51. foota ◴[04 Sep 25 03:24 UTC] No.45123196{6}[source]▶

>>45122999 #

This is a neat observation, training has been optimized to hell and inference is just beginning.

52. patrick451 ◴[04 Sep 25 05:19 UTC] No.45123837[source]▶

>>45120259 (TP) #

> Either way, I can get arbitrarily good approximations of arbitrary nonlinear differential/difference equations using only linear probabilistic evolution at the cost of a (much) larger state space.

This is impossible. When driven by a sinusoid, a linear system will only ever output a sinusoid with exactly the same frequency but a different amplitude and phase regardless of how many states you give it. A non-linear system can change the frequency or output multiple frequencies.
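The input-output claim above can be checked numerically with a small sketch, assuming a circular two-tap averaging filter as the linear system and squaring as the nonlinearity (both picked only for illustration). It lists which DFT bins carry energy in each output.

```python
import cmath
import math

N, K0 = 64, 5      # number of samples and input frequency bin (arbitrary choices)
x = [math.sin(2 * math.pi * K0 * n / N) for n in range(N)]

# Linear, time-invariant: circular two-tap average (x[-1] wraps to the last sample).
linear = [0.5 * x[n] + 0.5 * x[n - 1] for n in range(N)]
# Nonlinear: pointwise square.
squared = [v * v for v in x]

def active_bins(signal, tol=1e-6):
    """Frequency bins 0..N/2 whose normalized DFT magnitude exceeds tol."""
    bins = []
    for k in range(N // 2 + 1):
        coeff = sum(s * cmath.exp(-2j * math.pi * k * n / N)
                    for n, s in enumerate(signal))
        if abs(coeff) / N > tol:
            bins.append(k)
    return bins

print(active_bins(x), active_bins(linear), active_bins(squared))
```

This only illustrates the behaviour of a linear filter acting on the signal itself; whether that is the same notion of "linear" as an operator acting on probability distributions over states is what the reply below takes up.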

replies(1): >>45124025 #

53. photonthug ◴[04 Sep 25 05:35 UTC] No.45123910{4}[source]▶

>>45120766 #

> General intelligence may not be SAT/SMT solving but it has to be able to do it, hence, backtracking.

Just to add some more color to this. For problems that completely reduce to formal methods or have significant subcomponents that involve it, combinatorial explosion in state-space is a notorious problem and N variables is going to stick you with 2^N at least. It really doesn't matter whether you think you're directly looking at solving SAT/search, because it's too basic to really be avoided in general.

When people talk optimistically about hallucinations not being a problem, they generally mean something like "not a problem in the final step" because they hope they can evaluate/validate something there, but what about errors somewhere in the large middle? So even with a very tiny chance of hallucinations in general, we're talking about an exponential number of opportunities in implicit state-transitions to trigger those low-probability errors.
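A back-of-the-envelope version of that compounding, under the simplifying assumption that each implicit step fails independently with the same small probability (real errors are unlikely to be independent, so treat this as a rough illustration only):

```python
def p_any_error(p_step, n_steps):
    """Chance of at least one error in n independent steps, each failing with probability p_step."""
    return 1.0 - (1.0 - p_step) ** n_steps

# Hypothetical per-step error rate of 1 in 10,000, over growing step counts.
for n in (10, 1_000, 100_000):
    print(n, p_any_error(1e-4, n))
```

If the number of steps itself grows like 2^N in the number of variables, even a very small per-step rate is overwhelmed quickly.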

The answer to stuff like this is supposed to be "get LLMs to call out to SAT solvers". Fine, definitely moving from state-space to program-space is helpful, but it also kinda just pushes the problem around as long as the unconstrained code generation is still prone to hallucination.. what happens when it validates, runs, and answers.. but the spec was wrong?

Personally I'm most excited about projects like AlphaEvolve that seem fearless about hybrid symbolics / LLMs and embracing the good parts of GOFAI that LLMs can make tractable for the first time. Instead of the "reasoning is dead, long live messy incomprehensible vibes", those guys are talking about how to leverage earlier work, including things like genetic algorithms and things like knowledge-bases.[0] Especially with genuinely new knowledge-discovery from systems like this, I really don't get all the people who are still staunchly in either an old-school / new-school camp on this kind of thing.

[0]: MLST on the subject: https://www.youtube.com/watch?v=vC9nAosXrJw

replies(1): >>45128160 #

54. diffeomorphism ◴[04 Sep 25 05:58 UTC] No.45124025[source]▶

>>45123837 #

As far as I understand, the terminology says "linear" but means compositions of affine (with cutoffs etc). That gives you arbitrary polynomials and piecewise affine, which are dense in most classes of interest.

Of course, in practice you don't actually get arbitrary degree polynomials but some finite degree, so the approximation might still be quite bad or inefficient.

55. baselessness ◴[04 Sep 25 07:08 UTC] No.45124441[source]▶

>>45120259 (TP) #

That's what this debate has been reduced to. People point out the logical and empirical, by now very obvious limitations of LLMs. And boosters are the equivalent of Chopra's "quantum physics means anything is possible", saying "if you add enough information to a system anything is possible".

replies(1): >>45125289 #

56. kaibee ◴[04 Sep 25 07:22 UTC] No.45124543{8}[source]▶

>>45123059 #

I feel like you're kinda proving too much. By the same reasoning, humans/programmers aren't generally intelligent either, because we can only mentally simulate relatively small state spaces of programs, and when my boss tells me to go build a tool, I'm not exactly writing raw x86 assembly. I didn't _build_ the tool, I just wrote text that instructed a compiler how to build the tool. Like the whole reason we invented SAT solvers is because we're not smart in that way. But I feel like you're trying to argue that LLMs at any scale are gonna be less capable than an average person?

57. versteegen ◴[04 Sep 25 07:30 UTC] No.45124591{3}[source]▶

>>45121626 #

Yes, but anyway, LLMs themselves are perfectly capable of backtracking reasoning while sampling is run forwards only, in the same way humans do: by deciding something doesn't work and trying something else. Humans DON'T travel backwards in time so as to never have had the erroneous thought in the first place.

58. Certhas ◴[04 Sep 25 07:41 UTC] No.45124657{4}[source]▶

>>45120536 #

Take a finite tape Turing machine with N states and tape length T and N^T total possible tape states.

Now consider that you have a probability for each state instead of a definite state. The transitions of the Turing machine induce transitions of the probabilities. These transitions define a Markov chain on a N^T dimensional probability space.

Is this useful? Absolutely not. It's just a trivial rewriting. But it shows that high dimensional spaces are extremely powerful. You can trade off sophisticated transition rules for high dimensionality.
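
A minimal sketch of that rewriting, assuming a toy stand-in for the machine: a 3-bit tape that a deterministic rule increments as a binary counter. The names `build_transition_matrix`, `evolve`, and `increment` are hypothetical, chosen only for illustration. Each deterministic step becomes a row with a single 1 in the transition matrix, and a probability distribution over tapes is pushed forward by that matrix.

```python
from itertools import product

def build_transition_matrix(configs, step):
    """Row i has a single 1.0 in the column of step(configs[i])."""
    index = {c: i for i, c in enumerate(configs)}
    matrix = [[0.0] * len(configs) for _ in configs]
    for c in configs:
        matrix[index[c]][index[step(c)]] = 1.0
    return matrix

def evolve(dist, matrix):
    """One Markov step: new_dist[j] = sum_i dist[i] * matrix[i][j]."""
    n = len(dist)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]

def increment(tape):
    """Toy deterministic rule: treat the tape as a binary counter, wrap around."""
    value = (int("".join(map(str, tape)), 2) + 1) % (2 ** len(tape))
    return tuple(int(b) for b in format(value, f"0{len(tape)}b"))

configs = list(product((0, 1), repeat=3))  # 2**3 = 8 possible tapes
P = build_transition_matrix(configs, increment)

# Start uncertain: 50/50 between tapes 000 and 011.
dist = [0.0] * len(configs)
dist[configs.index((0, 0, 0))] = 0.5
dist[configs.index((0, 1, 1))] = 0.5

# After one step the probability mass sits on 001 and 100.
dist = evolve(dist, P)
```

The transition rule here is as simple as it gets (a lookup), while the matrix already has 8 x 8 entries for a 3-bit tape; that is the trade of rule sophistication for dimensionality in miniature.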

59. suddenlybananas ◴[04 Sep 25 07:44 UTC] No.45124688{3}[source]▶

>>45120411 #

We don't know those things about the brain. I don't know why you keep going around HN making wildly false claims about the state of contemporary neuroscience. We know very very little about how higher order cognition works in the brain.

replies(1): >>45126938 #

60. Certhas ◴[04 Sep 25 07:54 UTC] No.45124764[source]▶

>>45120344 #

I know that if you go large enough you can do any finite computation using only fixed transition probabilities. This is a trivial observation. To repeat what I posted elsewhere in this thread:

Take a finite tape Turing machine with N states and tape length T and N^T total possible tape states.

You _can_ continue this line of thought though in more productive directions. E.g. what if the input of your machine is genuinely uncertain? What if the transitions are not precise but slightly noisy? You'd expect that the fundamental capabilities of a noisy machine wouldn't be that much worse than those of a noiseless one (over finite time horizons). What if the machine was built to be noise resistant in some way?

All of this should regularize the Markov chain above. If it's more regular you can start thinking about approximating it using a lower rank transition matrix.

The point of this is not to say that this is really useful. It's to say that there is no reason in my mind to dismiss the purely mathematical rewriting as entirely meaningless in practice.

61. yorwba ◴[04 Sep 25 09:24 UTC] No.45125289[source]▶

>>45124441 #

The argument isn't that anything is possible for LLMs, but that representing LLMs as Markov chains doesn't demonstrate a limitation, because the resulting Markov chain would be huge, much larger than the LLM, and anything that is possible is possible with a large enough Markov chain.

If you limit yourself to Markov chains where the full transition matrix can be stored in a reasonable amount of space (which is the kind of Markov chain that people usually have in mind when they think that Markov chains are very limited), LLMs cannot be represented as such a Markov chain.
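
A back-of-the-envelope sketch of the size involved, assuming each possible full context window counts as one Markov state, so a vocabulary of V tokens and a window of C tokens gives V^C states. The values V = 50,000 and C = 1,000 are hypothetical round numbers, not the parameters of any particular model.

```python
import math

def markov_state_count_digits(vocab_size: int, context_len: int) -> int:
    """Number of decimal digits in vocab_size ** context_len.

    Computed via logarithms so the huge integer never has to be printed.
    """
    return math.floor(context_len * math.log10(vocab_size)) + 1

if __name__ == "__main__":
    # Hypothetical sizes: the state count is a number with 4699 digits.
    print(markov_state_count_digits(50_000, 1_000))
```

Even before asking how many entries the transition matrix would need, the count of states alone rules out storing it explicitly.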

If you want to show limitations of LLMs by reducing them to another system of computation, you need to pick one that is more limited than LLMs appear to be, not less.

replies(1): >>45127523 #

62. Certhas ◴[04 Sep 25 12:17 UTC] No.45126405{9}[source]▶

>>45121057 #

People really don't appreciate what is possible in infinite (or more precisely: arbitrarily high) dimensional spaces.

63. mallowdram ◴[04 Sep 25 13:17 UTC] No.45126938{4}[source]▶

>>45124688 #

Of course we know these things about the brain, and who said anything about higher order cognition? I'd stay current, you seem to be a legacy thinker. I'll needle drop ONE of the references re: unpredictability and brain health, there are about 30, just to keep you in your corner. The rest you'll have to hunt down, but please stop pretending you know what you're talking about.

Your line of attack which is to dismiss from a pretend point of certainty, rather than inquiry and curiosity, seems indicative of the cog-sci/engineering problem in general. There's an imposition based in intuition/folk psychology that suffuses the industry. The field doesn't remain curious to new discoveries in neurobiology, which supplants psychology (psychology is being based, neuro is neural based). What this does is remove the intent of rhetoric/being and suggest brains built our external communication. The question is how and by what regularities. Cog-sci has no grasp of that in the slightest.

https://pubmed.ncbi.nlm.nih.gov/38579270/

replies(1): >>45137064 #

64. ariadness ◴[04 Sep 25 14:15 UTC] No.45127523{3}[source]▶

>>45125289 #

> anything that is possible is possible with a large enough Markov chain

This is not true. Do you mean anything that is possible to compute? If yes, then you missed the point entirely.

replies(1): >>45134479 #

65. PaulHoule ◴[04 Sep 25 15:08 UTC] No.45128160{5}[source]▶

>>45123910 #

When I was interested in information extraction I saw the problem of resolving language to a semantic model [1] as containing an SMT problem. That is, words are ambiguous, sentences can parse different ways, you have to resolve pronouns and explicit subjects, objects and stuff like that.

Seen that way the text is a set of constraints with a set of variables for all the various choices you make determining it. And of course there is a theory of the world such that "causes must precede their effects" and all the world knowledge about instances such as "Chicago is in Illinois".

The problem is really worse than that because you'll have to parse sentences that weren't generated by sound reasoners or that live in a different microtheory, deal with situations that are ambiguous anyway, etc. Which is why that program never succeeded.

[1] in short: database rows

66. yorwba ◴[05 Sep 25 02:33 UTC] No.45134479{4}[source]▶

>>45127523 #

It's mostly a consequence of the laws of physics having the Markov property. So the time evolution of any physical system can be modeled as a Markov process. Of course the corresponding state space may in general be infinite.

67. suddenlybananas ◴[05 Sep 25 10:35 UTC] No.45137064{5}[source]▶

>>45126938 #

Your writing reminds me of a schizophrenic.

68. measurablefunc ◴[06 Sep 25 02:58 UTC] No.45146244{9}[source]▶

>>45121690 #

So where is the error exactly? Loop around is simply a repetition of the argument for the equivalence between an LLM & a Markov chain. It doesn't matter how many times you sample the trajectories from either one, they're still extensionally equivalent.

replies(1): >>45156548 #

69. vidarh ◴[07 Sep 25 08:54 UTC] No.45156548{10}[source]▶

>>45146244 #

Since an LLM with a loop is trivially and demonstrably Turing complete if you allow it to use the context as an IO channel (and thereby memory), by extension arguing there's some limitation that prevents an LLM from doing what Prolog can is logically invalid.

In other words, this claim is categorically false:

> Logical reasoning as it is implemented in Prolog interpreters is not something that can be done w/ LLMs regardless of the size of their weights, biases, & activation functions between the nodes in the graph.

What is limiting "just" an LLM is not the ability of the model to encode reasoning, but the lack of a minimal and trivial runtime scaffolding to let it use its capabilities.

replies(1): >>45157743 #

70. measurablefunc ◴[07 Sep 25 12:51 UTC] No.45157743{11}[source]▶

>>45156548 #

> Since an LLM with a loop is trivially and demonstrably Turing complete

Where is the demonstration?
