
170 points PaulHoule | 5 comments
measurablefunc No.45120049
There is a formal extensional equivalence between Markov chains & LLMs, but the only person who seems to be saying anything about this is Gary Marcus. He is constantly making the point that symbolic understanding cannot be reduced to a probabilistic computation: regardless of how large the graph gets, it will still be missing basic capabilities like backtracking (which is available in programming languages like Prolog). I think Gary is right on basically all counts. Probabilistic generative models are fun, but no amount of probabilistic sequence generation can be a substitute for logical reasoning.
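
To make the backtracking point concrete, here is a toy sketch in Python (illustrative names, not from any real framework): a depth-first search that can undo a bad choice when it hits a dead end, something a single left-to-right sampling pass has no mechanism for.

    def solve(assignment, variables, domains, consistent):
        # Backtracking search: extend `assignment` one variable at a time,
        # undoing any choice that leads to a dead end.
        if len(assignment) == len(variables):
            return assignment                   # every variable assigned
        var = variables[len(assignment)]
        for value in domains[var]:
            assignment[var] = value
            if consistent(assignment):
                result = solve(assignment, variables, domains, consistent)
                if result is not None:
                    return result
            del assignment[var]                 # backtrack: undo the choice
        return None                             # dead end; caller backtracks

    # Example: color a triangle graph with 3 colors so neighbors differ.
    edges = [("a", "b"), ("b", "c"), ("a", "c")]
    ok = lambda asg: all(asg[u] != asg[v] for u, v in edges
                         if u in asg and v in asg)
    print(solve({}, ["a", "b", "c"], {v: ["r", "g", "b"] for v in "abc"}, ok))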
replies(16): >>45120249 #>>45120259 #>>45120415 #>>45120573 #>>45120628 #>>45121159 #>>45121215 #>>45122702 #>>45122805 #>>45123808 #>>45123989 #>>45125478 #>>45125935 #>>45129038 #>>45130942 #>>45131644 #
logicchains No.45120415
LLMs are not formally equivalent to Markov chains; they're more powerful. Transformers with sufficient chain of thought can solve any problem in P: https://arxiv.org/abs/2310.07923.
replies(1): >>45120484 #
measurablefunc No.45120484
If you think there is a mistake in this argument then I'd like to know where it is: https://markov.dk.workers.dev/.
replies(2): >>45122838 #>>45124408 #
CamperBob2 No.45122838
A Markov chain is memoryless by definition. A language model has a context, not to mention state in the form of the transformer's KV cache.
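
Concretely, the difference in framing (a toy sketch with stand-in functions, not any real model's API): a first-order chain's next step is a function of one token, while a transformer's next step is a function of the whole window.

    import random

    # Toy first-order Markov chain: the next token depends on the current
    # token alone.
    transitions = {"the": {"cat": 0.5, "dog": 0.5}, "cat": {"sat": 1.0},
                   "dog": {"ran": 1.0}, "sat": {"the": 1.0}, "ran": {"the": 1.0}}

    def markov_step(current):
        dist = transitions[current]
        return random.choices(list(dist), weights=list(dist.values()))[0]

    # Transformer-style step: a stand-in for a real forward pass. The point
    # is the signature: the distribution is a function of the entire
    # context, so a change anywhere in the window can change the output.
    def transformer_step(context):
        rng = random.Random(hash(tuple(context)))
        return rng.choice(["the", "cat", "dog", "sat", "ran"])

    print(markov_step("the"))                       # conditions on one token
    print(transformer_step(["the", "cat", "sat"]))  # conditions on all three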

The whole analogy is just pointless. You might as well call an elephant an Escalade because they weigh the same.

replies(1): >>45123454 #
measurablefunc No.45123454
Where is the logical mistake in the linked argument? If there is a mistake, I'd like to know what it is, along with a counterexample that invalidates it.
replies(3): >>45124048 #>>45124132 #>>45124761 #
versteegen No.45124761
A transformer with a length n context window implements an order 2n-1 Markov chain¹. That is correct. It is also irrelevant in the real world, because LLMs aren't run for that many tokens (results degrade long before then). Before it hits that limit, there is nothing requiring it to have any of the properties of a Markov chain. In fact, because the state space has size k^n (for alphabet size k), you might not revisit a state until you have generated on the order of k^n tokens (see the sketch below).

¹ Depending on context window implementation details, but that is the maximum, because the states n tokens back were computed from the n tokens before that. The minimum of course is an order n-1 Markov chain.
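
To spell out what "implements an order-m Markov chain" means: any finite-order chain reduces to a memoryless (order-1) chain whose states are tuples of tokens, at the cost of a state space of size k^m. A toy sketch, with illustrative names:

    from itertools import product

    def to_first_order(order_m_next, alphabet, m):
        # order_m_next(history) -> {token: prob} for a length-m history.
        # Returns order-1 transitions {state: {next_state: prob}}, where
        # each state is an m-tuple of tokens.
        table = {}
        for state in product(alphabet, repeat=m):
            dist = order_m_next(state)
            table[state] = {state[1:] + (tok,): p for tok, p in dist.items()}
        return table

    k, m = 2, 3                       # tiny alphabet and order, to print it
    uniform = lambda history: {tok: 1 / k for tok in "01"}
    chain = to_first_order(uniform, "01", m)
    print(len(chain))                 # k**m = 8 tuple states
    # For a real LLM (k on the order of 5e4 tokens, window in the
    # thousands), k**m is astronomical, hence "you might not revisit
    # a state".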

replies(1): >>45126789 #
versteegen No.45126789
Specifically, an order-n Markov chain such as a transformer, if not otherwise restricted, can have any joint distribution you wish for the first n-1 steps: any extensional property. So to draw non-vacuous conclusions you have to look at intensional properties.
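
To make the extensional point concrete, a small sketch (illustrative names): given any joint distribution over sequences shorter than the chain's order, the matching conditionals fall straight out of the chain rule, so the chain reproduces that joint exactly, long-range correlations included.

    def conditionals_from_joint(joint):
        # joint: {sequence: prob}. Returns a function giving P(next | prefix)
        # by conditioning the joint on the prefix (the chain rule).
        def next_dist(prefix):
            mass = {s: p for s, p in joint.items() if s.startswith(prefix)}
            total = sum(mass.values())
            dist = {}
            for s, p in mass.items():
                tok = s[len(prefix)]
                dist[tok] = dist.get(tok, 0) + p / total
            return dist
        return next_dist

    # Any joint over length-3 strings, e.g. one where the first and last
    # tokens always agree (a "long-range" dependence):
    joint = {"aba": 0.5, "bab": 0.5}
    p = conditionals_from_joint(joint)
    print(p(""))    # {'a': 0.5, 'b': 0.5}
    print(p("a"))   # {'b': 1.0}
    print(p("ab"))  # {'a': 1.0}: the third token is pinned by the first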

I would like to add that there are a lot of papers out there on what transformers can or can't do that are misleading, often misunderstood, or abstracted so far from transformers as implemented and used that they are pure theory.