←back to thread

170 points PaulHoule | 1 comments | | HN request time: 0s | source
Show context
measurablefunc ◴[] No.45120049[source]
There is a formal extensional equivalence between Markov chains & LLMs but the only person who seems to be saying anything about this is Gary Marcus. He is constantly making the point that symbolic understanding can not be reduced to a probabilistic computation regardless of how large the graph gets it will still be missing basic stuff like backtracking (which is available in programming languages like Prolog). I think that Gary is right on basically all counts. Probabilistic generative models are fun but no amount of probabilistic sequence generation can be a substitute for logical reasoning.
replies(16): >>45120249 #>>45120259 #>>45120415 #>>45120573 #>>45120628 #>>45121159 #>>45121215 #>>45122702 #>>45122805 #>>45123808 #>>45123989 #>>45125478 #>>45125935 #>>45129038 #>>45130942 #>>45131644 #
Certhas ◴[] No.45120259[source]
I don't understand what point you're hinting at.

Either way, I can get arbitrarily good approximations of arbitrary nonlinear differential/difference equations using only linear probabilistic evolution at the cost of a (much) larger state space. So if you can implement it in a brain or a computer, there is a sufficiently large probabilistic dynamic that can model it. More really is different.

So I view all deductive ab-initio arguments about what LLMs can/can't do due to their architecture as fairly baseless.

(Note that the "large" here is doing a lot of heavy lifting. You need _really_ large. See https://en.m.wikipedia.org/wiki/Transfer_operator)

replies(5): >>45120313 #>>45120341 #>>45120344 #>>45123837 #>>45124441 #
measurablefunc ◴[] No.45120344[source]
What part about backtracking is baseless? Typical Prolog interpreters can be implemented in a few MBs of binary code (the high level specification is even simpler & can be in a few hundred KB)¹ but none of the LLMs (open source or not) are capable of backtracking even though there is plenty of room for a basic Prolog interpreter. This seems like a very obvious shortcoming to me that no amount of smooth approximation can overcome.

If you think there is a threshold at which point some large enough feedforward network develops the capability to backtrack then I'd like to see your argument for it.

¹https://en.wikipedia.org/wiki/Warren_Abstract_Machine

replies(3): >>45120516 #>>45121626 #>>45124764 #
1. Certhas ◴[] No.45124764{3}[source]
I know that if you go large enough you can do any finite computation using only fixed transition probabilities. This is a trivial observation. To repeat what I posted elsewhere in this thread:

Take a finite tape Turing machine with N states and tape length T and N^T total possible tape states.

Now consider that you have a probability for each state instead of a definite state. The transitions of the Turing machine induce transitions of the probabilities. These transitions define a Markov chain on a N^T dimensional probability space.

Is this useful? Absolutely not. It's just a trivial rewriting. But it shows that high dimensional spaces are extremely powerful. You can trade off sophisticated transition rules for high dimensionality.

You _can_ continue this line of thought though in more productive directions. E.g. what if the input of your machine is genuinely uncertain? What if the transitions are not precise but slightly noisy? You'd expect that the fundamental capabilities of a noisy machine wouldn't be that much worse than those of a noiseless ones (over finite time horizons). What if the machine was built to be noise resistant in some way?

All of this should regularize the Markov chain above. If it's more regular you can start thinking about approximating it using a lower rank transition matrix.

The point of this is not to say that this is really useful. It's to say that there is no reason in my mind to dismiss the purely mathematical rewriting as entirely meaningless in practice.