
170 points PaulHoule | 4 comments
measurablefunc
There is a formal extensional equivalence between Markov chains & LLMs, but the only person who seems to be saying anything about this is Gary Marcus. He is constantly making the point that symbolic understanding cannot be reduced to a probabilistic computation: regardless of how large the graph gets, it will still be missing basic capabilities like backtracking (which is available in programming languages like Prolog). I think that Gary is right on basically all counts. Probabilistic generative models are fun, but no amount of probabilistic sequence generation can be a substitute for logical reasoning.
Certhas
I don't understand what point you're hinting at.

Either way, I can get arbitrarily good approximations of arbitrary nonlinear differential/difference equations using only linear probabilistic evolution at the cost of a (much) larger state space. So if you can implement it in a brain or a computer, there is a sufficiently large probabilistic dynamic that can model it. More really is different.

So I view all deductive ab-initio arguments about what LLMs can/can't do due to their architecture as fairly baseless.

(Note that the "large" here is doing a lot of heavy lifting. You need _really_ large. See https://en.m.wikipedia.org/wiki/Transfer_operator)
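The linked transfer-operator idea can be made concrete with Ulam's method. The sketch below (bin count, sample count, and the choice of the logistic map are all my own, purely for illustration) approximates a nonlinear map with a purely linear Markov transition matrix over a discretized state space:

```python
import numpy as np

# Ulam's method: approximate the nonlinear logistic map x -> 4x(1-x)
# by a linear Markov transition matrix over a discretized state space.
N = 200                  # bins partitioning [0, 1] (illustrative choice)
samples_per_bin = 1000   # Monte Carlo samples per bin to estimate transitions

rng = np.random.default_rng(0)
P = np.zeros((N, N))
for i in range(N):
    # sample points in bin i and push them through the nonlinear map
    x = rng.uniform(i / N, (i + 1) / N, samples_per_bin)
    y = 4 * x * (1 - x)
    j = np.minimum((y * N).astype(int), N - 1)
    # count which bins the images land in -> row i of the transition matrix
    P[i] = np.bincount(j, minlength=N) / samples_per_bin

# Evolving a density with the *linear* operator P tracks the nonlinear dynamics:
rho = np.full(N, 1 / N)   # start from the uniform density
for _ in range(50):
    rho = rho @ P         # purely linear probabilistic evolution
# As N grows, the fixed point of P approaches the logistic map's
# invariant density 1/(pi * sqrt(x(1-x))), which piles up at the endpoints.
```

The "large" caveat shows up immediately: resolving the nonlinear dynamics well requires blowing up the state space far beyond the one-dimensional original.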

measurablefunc
What part of the backtracking argument is baseless? Typical Prolog interpreters can be implemented in a few MB of binary code (the high-level specification is even simpler & fits in a few hundred KB)¹, but none of the LLMs (open source or not) are capable of backtracking, even though there is plenty of room for a basic Prolog interpreter. This seems like a very obvious shortcoming to me that no amount of smooth approximation can overcome.

If you think there is a threshold at which point some large enough feedforward network develops the capability to backtrack then I'd like to see your argument for it.

¹https://en.wikipedia.org/wiki/Warren_Abstract_Machine

bondarchuk
Backtracking makes sense in a search context, which is basically what Prolog is. Why would you expect a next-token predictor to do backtracking, and what should that even look like?
measurablefunc
I don't expect a Markov chain to be capable of backtracking. That's the point I am making. Logical reasoning as it is implemented in Prolog interpreters is not something that can be done w/ LLMs regardless of the size of their weights, biases, & activation functions between the nodes in the graph.
bondarchuk
Imagine the context window contains A-B-C, C turns out a dead end and we want to backtrack to B and try another branch. Then the LLM could produce outputs such that the context window would become A-B-C-[backtrack-back-to-B-and-don't-do-C] which after some more tokens could become A-B-C-[backtrack-back-to-B-and-don't-do-C]-D. This would essentially be backtracking and I don't see why it would be inherently impossible for LLMs as long as the different branches fit in context.
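This flattening of search into an append-only stream is easy to demonstrate. The toy below (tree, goal, and token names all invented for illustration) runs an ordinary depth-first search but records dead ends as explicit "backtrack" tokens instead of erasing anything, the way a context window only ever grows:

```python
# Toy illustration: depth-first search over a small tree, emitted as an
# append-only token stream. The BACKTRACK tokens play the role described
# above: the stream only ever grows, yet it records a full search.
tree = {"A": ["B", "E"], "B": ["C"], "C": [], "E": []}
goal = "E"

def solve(node, stream):
    stream.append(node)
    if node == goal:
        stream.append("DONE")
        return True
    for child in tree[node]:
        if solve(child, stream):
            return True
        # dead end below this node: emit a token instead of erasing anything
        stream.append(f"BACKTRACK-to-{node}")
    return False

stream = []
solve("A", stream)
print(stream)
# -> ['A', 'B', 'C', 'BACKTRACK-to-B', 'BACKTRACK-to-A', 'E', 'DONE']
```

Whether a trained LLM reliably emits such tokens is a separate empirical question; the point here is only that nothing about a linear, append-only output format forbids recording a backtracking search.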
measurablefunc
If you think it is possible then I'd like to see an implementation of a sudoku puzzle solver as a Markov chain. This is a simple enough problem that it can be implemented in a few dozen lines of Prolog, but I've never seen a solver implemented as a Markov chain.
Ukv
> If you think it is possible then I'd like to see an implementation of a sudoku puzzle solver as Markov chain

Have each of the Markov chain's states be one of 10^81 possible sudoku grids (a 9x9 grid of digits 1-9 and blank), then calculate the 10^81-by-10^81 transition matrix that takes each incomplete grid to the valid complete grid containing the same numbers. If you want you could even have it fill one square at a time rather than jump right to the solution, though there's no need to.

Up to you what you do for ambiguous inputs (select one solution at random to give 1.0 probability in the transition matrix? equally weight valid solutions? have the states be sets of boards and map to set of all valid solutions?) and impossible inputs (map to itself? have the states be sets of boards and map to empty set?).

Could say that's "cheating" by pre-computing the answers and hard-coding them in a massive input-output lookup table, but to my understanding that's also the only sense in which there's equivalence between Markov chains and LLMs.
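The construction can be run for real if the game is shrunk. The sketch below (my own toy, not from the thread) uses 4x4 sudoku ("shidoku") so the precomputation is feasible: states are grids, and each row of the conceptual transition matrix maps a puzzle to the set of solved grids compatible with it:

```python
from itertools import permutations, product

# States are grids: tuples of 16 digits, 0 = blank. The "transition matrix"
# row for a puzzle is the set of solved grids compatible with its givens.

def valid(g):
    rows = [g[4*i:4*i+4] for i in range(4)]
    cols = [g[i::4] for i in range(4)]
    boxes = [tuple(g[4*(br+i) + bc+j] for i in (0, 1) for j in (0, 1))
             for br in (0, 2) for bc in (0, 2)]
    return all(sorted(u) == [1, 2, 3, 4] for u in rows + cols + boxes)

# precompute every solved grid once (each row must be a permutation of 1-4)
perms = list(permutations((1, 2, 3, 4)))
solutions = [r0 + r1 + r2 + r3
             for r0, r1, r2, r3 in product(perms, repeat=4)
             if valid(r0 + r1 + r2 + r3)]   # 288 solved shidoku grids

def transition(puzzle):
    # one row of the conceptual transition matrix: puzzle -> compatible solutions
    return {s for s in solutions
            if all(p in (0, q) for p, q in zip(puzzle, s))}
```

For the full 9x9 game the same table has 10^81 rows, which is exactly the "cheating" sense described above: all the search is paid for in precomputation, and what remains is a lookup.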

measurablefunc
Some incomplete grids have multiple solutions, so how are you calculating the transitions for a grid w/ a non-unique solution?

Edit: I see you added questions for the ambiguities, but modulo those choices your solution will only almost work, b/c it is not entirely extensionally equivalent. The transition graph and the solver are almost extensionally equivalent, but whereas the Prolog solver will backtrack, there is no backtracking in the Markov chain, and you have to re-run the chain multiple times to find all the solutions.

Ukv
> but whereas the Prolog solver will backtrack there is no backtracking in the Markov chain and you have to re-run the chain multiple times to find all the solutions

If you want it to give all possible solutions at once, you can just expand the state space to the power-set of sudoku boards, such that the input board transitions to the state representing the set of valid solved boards.

measurablefunc
That still won't work b/c there is no backtracking. The point is that there is no way to encode backtracking/choice points like in Prolog w/ a Markov chain. The argument you have presented is not extensionally equivalent to the Prolog solver: it is almost equivalent, but it's missing choice points for starting at a valid solution & backtracking to an incomplete board to generate a new one. The typical absorbing-state argument doesn't work b/c sudoku is not a typical deterministic puzzle.
Ukv
> That still won't work b/c there is no backtracking.

It's essentially just a lookup table mapping from input board to the set of valid output boards - there's no real way for it not to work (obviously not practical though). If board A has valid solutions B, C, D, then the transition matrix cell mapping {A} to {B, C, D} is 1.0, and all other entries in that row are 0.0.

> The point is that there is no way to encode backtracking/choice points

You can if you want, keeping the same variables as a regular sudoku solver as part of the Markov chain's state and transitioning instruction-by-instruction, rather than mapping directly to the solution - just that there's no particular need to when you've precomputed the solution.

measurablefunc
My point is that your initial argument was missing several key pieces, & if you specify the entire state space you will see that it's not as simple as you initially thought. I'm not saying it can't be done, but that it's actually much more complicated than saying: just take an incomplete board state s & put uniform transitions between s & s' for valid solutions s' compatible with s. In fact, now that I've spelled out the issues, I still don't think this is a formal extensional equivalence. Prolog has interactive transitions between states & it tracks choice points, so compiling a sudoku solver to a Markov chain requires more than just tracking the board state in the context.
Ukv
> My point is that your initial argument was missing several key pieces

My initial example was a response to "If you think it is possible then I'd like to see an implementation of a sudoku puzzle solver as Markov chain", describing how a Sudoku solver could be implemented as a Markov chain. I don't think there's anything missing from it - it solves all proper Sudokus, and I only left open the choice of how to handle improper Sudokus because that was unspecified (but trivial regardless of what's wanted).

> I'm not saying it can't be done but that it's actually much more complicated

If that's the case, then I did misinterpret your comments as saying it can't be done. But I don't think it's really complicated, regardless of whatever "ok, but now it must encode choice points in its state" objections are thrown at it - it's just a state-to-state transition look-up table.

> so compiling a sudoku solver to a Markov chain requires more than just tracking the board state in the context.

As noted, you can keep all the same variables as a regular Sudoku solver as part of the Markov chain's state and transition instruction-by-instruction, if that's what you want.

If you mean inputs from a user, the same is true of LLMs, which are typically run interactively. Either model the whole universe, including the user, as part of the state transition table (maybe impossible, depending on your beliefs about the universe), or have user interaction take the current state, modify it, and use it as the initial state for a new run of the Markov chain.

measurablefunc
> As noted, you can keep all the same variables as a regular Sudoku solver

What are those variables exactly?

Ukv
For a depth-first solution (backtracking), I'd assume mostly just the partial solution and a few small counters/indices/masks - e.g. for tracking the cell we're up to and which cells were prefilled. Specifics will depend on the solver, but they can be made part of the Markov chain's state regardless.
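One way such a state could be laid out (a hypothetical sketch of my own, shrunk to 4x4 for brevity): the full chain state is `(board, current cell, prefilled cells)`, and one solver "instruction" is one deterministic transition. Backtracking is then just another transition, not something outside the chain:

```python
# Chain state: (board, cell, prefilled) - the partial solution plus the
# small counters/masks described above. step() is one deterministic
# transition of a depth-first 4x4 solver; assumes a solvable puzzle.

def ok(board, cell, v):
    r, c = divmod(cell, 4)
    if v in board[4*r:4*r+4] or v in board[c::4]:
        return False
    br, bc = r - r % 2, c - c % 2
    return all(board[4*(br+i) + bc+j] != v for i in range(2) for j in range(2))

def step(state):
    board, cell, prefilled = state
    if cell == 16:
        return state                           # absorbing state: solved
    if cell in prefilled:
        return (board, cell + 1, prefilled)    # skip over a given
    b = list(board)
    v = b[cell] + 1                            # next candidate for this cell
    b[cell] = 0
    while v <= 4 and not ok(tuple(b), cell, v):
        v += 1
    if v <= 4:
        b[cell] = v
        return (tuple(b), cell + 1, prefilled)  # place value, advance
    prev = cell - 1                             # candidates exhausted:
    while prev in prefilled:                    # backtracking transition to
        prev -= 1                               # the previous free cell
    return (tuple(b), prev, prefilled)

def solve(puzzle):
    prefilled = frozenset(i for i, x in enumerate(puzzle) if x)
    state = (tuple(puzzle), 0, prefilled)
    while state[1] != 16:
        state = step(state)                     # iterate the transition map
    return state[0]
```

Since `step` is a function from state to state, it defines a (degenerate, deterministic) Markov chain whose trajectory is exactly the solver's backtracking search, instruction by instruction.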