attogram:
"Attention Is All You Need" - I've always wondered if the authors of that paper used such a casual and catchy title because they knew it would be groundbreaking and massively cited in the future....
sivm:
Attention is all you need for what we have today. But attention is a local heuristic; the result is brittle coherence and no global state. I believe we need a paradigm shift in architecture to move forward.
HarHarVeryFunny:
The Transformer was only ever designed to be a better seq-2-seq architecture, so "all you need" implicitly means "all you need for seq-2-seq" (not all you need for AGI), and the title was anyway more backward-looking than forward-looking.

The preceding seq-2-seq architectures had been RNN (LSTM) based, then RNN + attention (Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate"), with the Transformer "attention is all you need" paper then showing that you can drop the RNN altogether and just use attention.
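To make that lineage concrete, here is a rough sketch (my own illustration, not code from either paper) of Bahdanau-style additive attention bolted onto an RNN decoder: the previous decoder state scores every encoder hidden state, and the context vector is a weighted sum of them.

  import numpy as np

  def additive_attention(s_prev, H, W_s, W_h, v):
      """Bahdanau-style additive attention over encoder states H,
      conditioned on the previous RNN decoder state s_prev."""
      # e_j = v^T tanh(W_s s_prev + W_h h_j), computed for all source positions j at once
      scores = np.tanh(s_prev @ W_s.T + H @ W_h.T) @ v
      weights = np.exp(scores - scores.max())
      weights /= weights.sum()          # softmax over source positions
      context = weights @ H             # weighted sum of encoder states
      return context, weights

The RNN is still doing the heavy lifting here; attention only decides which encoder states the decoder looks at on each step.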

Of course, NOT using RNNs was the key motivator behind the new Transformer architecture: not only did you not NEED an RNN, the authors explicitly wanted to avoid recurrence, since the goal was parallel rather than sequential processing for better performance on the available highly parallel hardware.
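A toy comparison (hypothetical NumPy code, just to illustrate the parallelism argument): the RNN has an unavoidable loop over time steps, while self-attention handles the whole sequence with a few matrix multiplies that map straight onto parallel hardware.

  import numpy as np

  def rnn_encode(X, W_h, W_x):
      """Sequential: each hidden state depends on the previous one."""
      h = np.zeros(W_h.shape[0])
      states = []
      for x_t in X:                            # serial loop -- cannot be parallelized over time
          h = np.tanh(W_h @ h + W_x @ x_t)
          states.append(h)
      return np.stack(states)

  def self_attention(X, W_q, W_k, W_v):
      """Parallel: every position attends to every other position at once."""
      Q, K, V = X @ W_q, X @ W_k, X @ W_v
      scores = Q @ K.T / np.sqrt(K.shape[-1])  # all (T, T) scores in one matmul
      scores -= scores.max(axis=-1, keepdims=True)
      weights = np.exp(scores)
      weights /= weights.sum(axis=-1, keepdims=True)
      return weights @ V

  T, d = 6, 16
  X = np.random.randn(T, d)
  rng = lambda *s: np.random.randn(*s) / np.sqrt(d)
  print(rnn_encode(X, rng(d, d), rng(d, d)).shape)                 # (6, 16), built step by step
  print(self_attention(X, rng(d, d), rng(d, d), rng(d, d)).shape)  # (6, 16), all at once

Both return a representation per position, but only the second has no dependency chain across time, which is what made training on GPUs/TPUs so much faster.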