Ask HN: How are Markov chains so different from tiny LLMs?

Show context

Sohcahtoa82 ◴[20 Nov 25 18:26 UTC] No.45995897[source]▶

A Markov Chain trained by only a single article of text will very likely just regurgitate entire sentences straight from the source material. There just isn't enough variation in sentences.

But then, Markov Chains fall apart when the source material is very large. Try training a chain based on Wikipedia. You'll find that the resulting output becomes incoherent garbage. Increasing the context length may increase coherence, but at the cost of turning into just simple regurgitation.

In addition to the "attention" mechanism that another commenter mentioned, it's important to note that Markov Chains are discrete in their next token prediction while an LLM is more fuzzy. LLMs have latent space where the meaning of a word basically exists as a vector. LLMs will generate token sequences that didn't exist in the source material, whereas Markov Chains will ONLY generate sequences that existed in the source.

This is why it's impossible to create a digital assistant, or really anything useful, via Markov Chain. The fact that they only generate sequences that existed in the source mean that it will never come up with anything creative.

replies(12): >>45995946 #>>45996109 #>>45996662 #>>45996887 #>>45996937 #>>45998252 #>>45999650 #>>46000705 #>>46002052 #>>46002754 #>>46004144 #>>46021459 #

johnisgood ◴[20 Nov 25 18:30 UTC] No.45995946[source]▶

>>45995897 #

> The fact that they only generate sequences that existed in the source mean that it will never come up with anything creative.

I have seen the argument that LLMs can only give you what its been trained on, i.e. it will not be "creative" or "revolutionary", that it will not output anything "new", but "only what is in its corpus".

I am quite confused right now. Could you please help me with this?

Somewhat related: I like the work of David Hume, and he explains it quite well how we can imagine various creatures, say, a pig with a dragon head, even if we have not seen one ANYWHERE. It is because we can take multiple ideas and combine them together. We know how dragons typically look like, and we know how a pig looks like, and so, we can imagine (through our creativity and combination of these two ideas) how a pig with a dragon head would look like. I wonder how this applies to LLMs, if they even apply.

Edit: to clarify further as to what I want to know: people have been telling me that LLMs cannot solve problems that is not in their training data already. Is this really true or not?

replies(16): >>45996256 #>>45996266 #>>45996274 #>>45996313 #>>45996484 #>>45996757 #>>45997088 #>>45997100 #>>45997291 #>>45997366 #>>45999327 #>>45999540 #>>46001856 #>>46001954 #>>46007347 #>>46017836 #

thaumasiotes ◴[20 Nov 25 18:56 UTC] No.45996266[source]▶

>>45995946 #

>> The fact that they only generate sequences that existed in the source

> I am quite confused right now. Could you please help me with this?

This is pretty straightforward. Sohcahtoa82 doesn't know what he's saying.

replies(1): >>45996332 #

Sohcahtoa82 ◴[20 Nov 25 19:03 UTC] No.45996332[source]▶

>>45996266 #

I'm fully open to being corrected. Just telling me I'm wrong without elaborating does absolutely nothing to foster understanding and learning.

replies(1): >>45996354 #

thaumasiotes ◴[20 Nov 25 19:04 UTC] No.45996354[source]▶

>>45996332 #

If you still think there's something left to explain, I recommend you read your other responses. Being restricted to the training data is not a property of Markov output. You'd have to be very, very badly confused to think that it was. (And it should be noted that a Markov chain itself doesn't contain any training data, as is also true of an LLM.)

More generally, since an LLM is a Markov chain, it doesn't make sense to try to answer the question "what's the difference between an LLM and a Markov chain?" Here, the question is "what's the difference between a tiny LLM and a Markov chain?", and assuming "tiny" refers to window size, and the Markov chain has a similarly tiny window size, they are the same thing.

replies(3): >>45996464 #>>45996469 #>>45999574 #

purple_turtle ◴[20 Nov 25 19:16 UTC] No.45996469[source]▶

>>45996354 #

1) being restricted to exact matches in input is definition of Markov Chains

2) LLMs are not Markov Chains

replies(2): >>45997503 #>>45999590 #

thaumasiotes ◴[20 Nov 25 20:48 UTC] No.45997503[source]▶

>>45996469 #

> 1) being restricted to exact matches in input is definition of Markov Chains

Here's wikipedia:

> a Markov chain or Markov process is a stochastic process describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event.

A Markov chain is a finite state machine in which transitions between states may have probabilities other than 0 or 1. In this model, there is no input; the transitions occur according to their probability as time passes.

> 2) LLMs are not Markov Chains

As far as the concept of "Markov chains" has been used in the development of linguistics, they are seen as a tool of text generation. A Markov chain for this purpose is a hash table. The key is a sequence of tokens (in the state-based definition, this sequence is the current state), and the value is a probability distribution over a set of tokens.

To rephrase this slightly, a Markov chain is a lookup table which tells you "if the last N tokens were s_1, s_2, ..., s_N, then for the following token you should choose t_1 with probability p_1, t_2 with probability p_2, etc...".

Then, to tie this back into the state-based definition, we say that when we choose token t_k, we emit that token into the output, and we also dequeue the first token from our representation of the state and enqueue t_k at the back. This brings us into a new state where we can generate another token.

A large language model is seen slightly differently. It is a function. The independent variable is a sequence of tokens, and the dependent variable is a probability distribution over a set of tokens. Here we say that the LLM answers the question "if the last N tokens of a fixed text were s_1, s_2, ..., s_N, what is the following token likely to be?".

Or, rephrased, the LLM is a lookup table which tells you "if the last N tokens were s_1, s_2, ..., s_N, then the following token will be t_1 with probability p_1, t_2 with probability p_2, etc...".

You might notice that these two tables contain the same information organized in the same way. The transformation from an LLM to a Markov chain is the identity transformation. The only difference is in what you say you're going to do with it.

replies(2): >>45999079 #>>45999081 #

purple_turtle ◴[20 Nov 25 23:02 UTC] No.45999079[source]▶

>>45997503 #

Maybe technically LLM can be converted to an equivalent Markov Chain.

The problem is that even for modest context sizes the size of Markov Chain would by hilariously and monstrously large.

You may as well tell that LLM and a hash table is the same thing.

replies(2): >>46002443 #>>46002567 #