I think Yann is right if all you do is output a token that depends only on the previous token. If it's a simple Markov chain, sure, errors will eventually compound.
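(Rough sketch of the compounding argument, with ε as an assumed independent per-token error rate, not a figure from Yann's talk:)

$$P(\text{all } n \text{ tokens correct}) = (1 - \varepsilon)^n \approx e^{-\varepsilon n},$$

which decays exponentially in sequence length if errors are never corrected downstream.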
But with the attention mechanism, the output token depends not only on the previous one but on all 1 million previous ones (assuming a 1M context window). This gives the model plenty of opportunity to fix its errors (hence the "aha moments").
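For concreteness, here's a minimal NumPy sketch of causal self-attention (toy code, no learned projections, not any particular model's implementation) where the output at position t is a weighted mix of every earlier token, not just the immediately preceding one:

```python
import numpy as np

def causal_attention(x):
    """x: (seq_len, d) token representations; returns (seq_len, d) outputs."""
    seq_len, d = x.shape
    # Use x itself as queries/keys/values for simplicity (no learned projections).
    scores = x @ x.T / np.sqrt(d)                        # (seq_len, seq_len) similarity
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores[future] = -np.inf                             # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over all past positions
    return weights @ x                                   # each output mixes the full prefix

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))                             # 8 tokens, 16-dim embeddings
out = causal_attention(x)
print(out.shape)  # (8, 16): row t depends on rows 0..t, not only row t-1
```

A Markov-style model would compute row t from row t-1 alone, which is where the error-compounding picture comes from; with the full prefix in view, a later step can down-weight an earlier mistake.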