
129 points | jxmorris12 | 1 comment
1024core (No.43131731)
I think Yann is right if all you do is output a token that depends only on the previous token. If it's a simple Markov chain, sure, errors will eventually compound. But with the attention mechanism, the output token depends not only on the previous one but on all 1 million previous ones (assuming a 1M context window). This gives the model plenty of opportunity to fix its earlier errors (hence the "aha moment"s).
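
For concreteness, a minimal NumPy sketch of the two dependency structures being contrasted: a bigram/Markov predictor whose next-token logits are a function of the last token alone, versus a single attention head whose output mixes information from every token in the context. All names, shapes, and weights here are made up for illustration; this is not any particular model.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab, d = 50, 16                     # toy vocabulary size and embedding width
    context = rng.integers(0, vocab, 10)  # a short context of 10 token ids

    # Markov / bigram view: the next-token logits use only the last token.
    bigram_logits_table = rng.normal(size=(vocab, vocab))
    markov_logits = bigram_logits_table[context[-1]]   # ignores context[:-1] entirely

    # Attention view: the last position's query attends over the whole context.
    E = rng.normal(size=(vocab, d))                     # token embeddings
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

    X = E[context]                                      # (10, d): every context token
    q = X[-1] @ Wq                                      # query from the last position
    K, V = X @ Wk, X @ Wv                               # keys/values from all positions

    scores = K @ q / np.sqrt(d)                         # one score per context token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                            # softmax over the full context
    attended = weights @ V                              # blends all positions
    attn_logits = attended @ E.T                        # back to vocabulary space

    # markov_logits is unchanged if any earlier token changes; attn_logits is not,
    # which is the "can look back at everything and correct course" point above.
    print(markov_logits.shape, attn_logits.shape)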
1. lagrange77 (No.43131932)
> But with the attention mechanism

I would think LeCun was aware of that. Also, prior sequence-to-sequence models like RNNs already incorporated information from the further past.
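
To illustrate that RNN point with the same caveat (made-up NumPy shapes, a plain tanh cell, no claim about what LeCun had in mind): the hidden state h_t is a function of every earlier input, so even without attention the distant past reaches the current prediction, just compressed through a fixed-size vector whose influence typically fades over long sequences.

    import numpy as np

    rng = np.random.default_rng(1)
    d_in, d_h = 8, 32
    Wxh = rng.normal(size=(d_in, d_h)) / np.sqrt(d_in)
    Whh = rng.normal(size=(d_h, d_h)) / np.sqrt(d_h)

    def run_rnn(inputs):
        # Plain tanh recurrence: h_t = tanh(x_t Wxh + h_{t-1} Whh).
        h = np.zeros(d_h)
        for x in inputs:
            h = np.tanh(x @ Wxh + h @ Whh)
        return h

    seq = rng.normal(size=(30, d_in))
    h_a = run_rnn(seq)

    # Perturb only the very first input: the final state still depends on it,
    # though the influence is routed through a fixed-size state and typically
    # shrinks with distance (the weakness that attention's direct lookups avoid).
    seq_perturbed = seq.copy()
    seq_perturbed[0] += 1.0
    h_b = run_rnn(seq_perturbed)
    print(np.abs(h_a - h_b).max())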