
129 points | jxmorris12 | 1 comment
1024core (No.43131731)
I think Yann is right if all you do is output a token that depends only on the previous token. If it's a simple Markov chain, sure, errors will eventually compound. But with the attention mechanism, the output token depends not only on the previous one but on all 1 million previous ones (assuming a 1M context window). This gives the model plenty of opportunity to fix its earlier errors (hence the "aha moment"s).
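
For concreteness, a minimal NumPy sketch of the two dependency structures being contrasted: a bigram/Markov predictor whose next-token logits are a function of the last token alone, versus a single attention head whose output mixes information from every token in the context. All names, shapes, and weights here are made up for illustration; this is not any particular model.

    import numpy as np

    rng = np.random.default_rng(0)
    vocab, d = 50, 16                     # toy vocabulary size and embedding width
    context = rng.integers(0, vocab, 10)  # a short context of 10 token ids

    # Markov / bigram view: the next-token logits use only the last token.
    bigram_logits_table = rng.normal(size=(vocab, vocab))
    markov_logits = bigram_logits_table[context[-1]]   # ignores context[:-1] entirely

    # Attention view: the last position's query attends over the whole context.
    E = rng.normal(size=(vocab, d))                     # token embeddings
    Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

    X = E[context]                                      # (10, d): every context token
    q = X[-1] @ Wq                                      # query from the last position
    K, V = X @ Wk, X @ Wv                               # keys/values from all positions

    scores = K @ q / np.sqrt(d)                         # one score per context token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                            # softmax over the full context
    attended = weights @ V                              # blends all positions
    attn_logits = attended @ E.T                        # back to vocabulary space

    # markov_logits is unchanged if any earlier token changes; attn_logits is not,
    # which is the "can look back at everything and correct course" point above.
    print(markov_logits.shape, attn_logits.shape)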
1. lagrange77 (No.43131932)
> But with the attention mechanism

I would think LeCun was aware of that. Also, prior sequence-to-sequence models like RNNs already incorporated information from the further past.
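
To illustrate that RNN point with the same caveat (made-up NumPy shapes, a plain tanh cell, no claim about what LeCun had in mind): the hidden state h_t is a function of every earlier input, so even without attention the distant past reaches the current prediction, just compressed through a fixed-size vector whose influence typically fades over long sequences.

    import numpy as np

    rng = np.random.default_rng(1)
    d_in, d_h = 8, 32
    Wxh = rng.normal(size=(d_in, d_h)) / np.sqrt(d_in)
    Whh = rng.normal(size=(d_h, d_h)) / np.sqrt(d_h)

    def run_rnn(inputs):
        # Plain tanh recurrence: h_t = tanh(x_t Wxh + h_{t-1} Whh).
        h = np.zeros(d_h)
        for x in inputs:
            h = np.tanh(x @ Wxh + h @ Whh)
        return h

    seq = rng.normal(size=(30, d_in))
    h_a = run_rnn(seq)

    # Perturb only the very first input: the final state still depends on it,
    # though the influence is routed through a fixed-size state and typically
    # shrinks with distance (the weakness that attention's direct lookups avoid).
    seq_perturbed = seq.copy()
    seq_perturbed[0] += 1.0
    h_b = run_rnn(seq_perturbed)
    print(np.abs(h_a - h_b).max())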