
129 points by jxmorris12 | 1 comment
1024core ◴[] No.43131731[source]
I think Yann is right if all you do is output a token that depends only on the previous token. If it's a simple Markov chain, sure, errors will eventually compound. But with the attention mechanism, the output token depends not only on the previous one but on all 1 million previous ones (assuming a 1M context window). This gives the model plenty of opportunity to fix its errors (hence the "aha moment"s).
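
To make the contrast concrete, here is a minimal sketch (the `transition` table and the `model` callable are hypothetical, not any real library's API): a first-order Markov sampler sees only the last token at each step, while an autoregressive sampler re-reads the whole prefix before emitting the next token.

    import random

    def sample_markov(transition, first_token, length):
        # First-order Markov chain: the next token is drawn from a distribution
        # that depends only on the single most recent token, so an early error
        # keeps propagating with no mechanism to correct it.
        tokens = [first_token]
        for _ in range(length - 1):
            probs = transition[tokens[-1]]   # conditions on the last token only
            choices, weights = zip(*probs.items())
            tokens.append(random.choices(choices, weights=weights)[0])
        return tokens

    def sample_autoregressive(model, prompt, length):
        # Full autoregressive sampling: at every step the (hypothetical) model
        # is handed the entire prefix, so later tokens can condition on, and
        # compensate for, everything generated so far.
        tokens = list(prompt)
        for _ in range(length):
            probs = model(tokens)            # conditions on the whole prefix
            choices, weights = zip(*probs.items())
            tokens.append(random.choices(choices, weights=weights)[0])
        return tokens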
replies(2): >>43131919 #>>43131932 #
jxmorris12 ◴[] No.43131919[source]
No, this isn't right. The probabilistic formulation for autoregressive language models looks like this:

     p(x_n | x_1 ... x_{n-1})
which means that each token depends on all the previous tokens. Attention is one way to parameterize this. Yann's not talking about Markov chains; he's talking about all autoregressive models.
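
A minimal sketch of that factorization (the `cond_prob` function is hypothetical; any parameterization of p(x_n | x_1 ... x_{n-1}), attention-based or otherwise, fits the same interface):

    import math

    def sequence_log_prob(cond_prob, tokens):
        # log p(x_1 ... x_N) = sum over n of log p(x_n | x_1 ... x_{n-1}).
        # Each term is conditioned on the entire prefix, not just the last token.
        total = 0.0
        for n, token in enumerate(tokens):
            prefix = tokens[:n]              # all previous tokens
            total += math.log(cond_prob(token, prefix))
        return total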
replies(1): >>43132253 #
1024core ◴[] No.43132253[source]
The current token depends on _all_ previous tokens, but only indirectly on the ones before the most recent token, no?
replies(2): >>43132529 #>>43133654 #
dimatura ◴[] No.43133654[source]
No, attention lets the current token depend directly on all previous tokens in the context window, and a large context window is a critical ingredient in the success of modern LLMs, which is why you will often see the window size mentioned in discussions of newly released LLMs.
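
As a rough sketch of why the window size matters (the number and the `model` callable below are purely illustrative): the model conditions directly only on the tokens that fit in the window, so anything older falls out of the conditional entirely.

    CONTEXT_WINDOW = 8192  # illustrative size; real models vary widely

    def next_token_distribution(model, tokens):
        # Only the most recent CONTEXT_WINDOW tokens are visible to the model;
        # older tokens are dropped and can no longer be attended to directly.
        visible = tokens[-CONTEXT_WINDOW:]
        return model(visible)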