After the success of transfer learning for computer vision in the mid-2010s, it was obvious that NLP needed its own transfer learning approach and AlexNet moment.
Much of the research focus at the time was on recurrent models, because that was the conventional wisdom about how you model sequences. Markov chains had led to vanilla RNNs, LSTMs, GRUs, etc., which all seemed tantalizingly promising. (MAMBA fans take note.) Attention mechanisms were even used in recurrent models…but so was everything else.
Then came transformers, mixing all the then-best-practice bits with the heretical idea of just not giving a shit about O(n^2) complexity. The vanilla transformer used an encoder-decoder structure like the best translation models had been doing; it used a stack of identical blocks to nudge the output along through the pipe, like ResNet; it was trained end to end on large parallel translation corpora. But then it jettisoned all the other complexity and let everything rest on the attention mechanism to capture long-range dependencies.
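(To make the O(n^2) bit concrete, here's a minimal NumPy sketch of scaled dot-product self-attention. The names and shapes are mine, not code from the paper; the point is that the score matrix has one entry per pair of tokens, which is where the quadratic cost lives.)

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (n, d) token representations; Wq/Wk/Wv: (d, d) projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                # (n, d) each
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # (n, n) -- the quadratic part
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over source positions
    return weights @ V                              # (n, d) context-mixed outputs

# toy usage: every token gets to look at every other token in one shot
rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)                 # shape (5, 8)
```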
It was immediately thrilling, but it was also completely impractical. (I think the largest model had a context limit of around 500 tokens and needed bigger-than-hobbyist GPUs.) So it mostly sat on a shelf while people used other “good enough” models for a few years, until the hardware got better and a couple of folks proved that these things could actually work at massive scale.
And now here we are.
I think they knew what they were saying at the time, but I don’t think they knew that it would remain true for years.
I feel like there is a step missing here...
People were using RNN encoder/decoder models for machine translation: the encoder produced a fixed-size vector representation of the source-language sentence, and the decoder generated the target-language sentence from that representation.
The issue people were bumping into was that the fixed-size vector bottlenecked the encoder/decoder architecture: representing a variable-length source sentence as a fixed-size vector loses information, and the loss grows with the length of the source sentence.
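Very roughly, and with illustrative names rather than anything from an actual paper, the pre-attention setup looked like this: the encoder squeezes the whole source sentence into one vector of fixed size, so a 5-token sentence and a 500-token sentence get exactly the same amount of room.

```python
import numpy as np

def rnn_encoder(source_embeddings, Wx, Wh):
    """source_embeddings: (n, d) source tokens; returns one (d,) summary vector."""
    h = np.zeros(Wh.shape[0])
    for x in source_embeddings:          # one recurrent step per source token
        h = np.tanh(Wx @ x + Wh @ h)     # h stays the same size at every step
    return h                             # the decoder only ever sees this vector

rng = np.random.default_rng(0)
d = 16
Wx, Wh = rng.normal(size=(d, d)), rng.normal(size=(d, d))
short = rnn_encoder(rng.normal(size=(5, d)), Wx, Wh)    # shape (16,)
long_ = rnn_encoder(rng.normal(size=(500, d)), Wx, Wh)  # still (16,) -- the bottleneck
```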
People started adding attention to the decoder as a way to work around this issue. Each decoder step could attend to every token (well, RNN hidden representation) of the source sentence. So, this led to the RNN + attention architecture.
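Here's a sketch of that workaround, again with made-up names. I'm using simple dot-product scoring for brevity (the original Bahdanau-style attention used a small additive scoring network): each decoder step builds its own context vector as a weighted mix of all the encoder hidden states, so nothing has to squeeze through one fixed vector.

```python
import numpy as np

def attend(decoder_state, encoder_states):
    """decoder_state: (d,); encoder_states: (n, d), one per source token."""
    scores = encoder_states @ decoder_state   # (n,) similarity per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over the source sentence
    return weights @ encoder_states           # (d,) context vector for this step
```

The decoder then feeds that context vector, along with its own hidden state, into the next prediction, so long sentences no longer have to fit through one fixed-size summary.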
The title 'Attention is all you need' comes from the realization that in this architecture the RNN is not needed, neither in the encoder nor the decoder. It's a message to a field that was using RNNs + attention (to avoid the bottleneck). Of course, the rest was born from that: encoder-only transformer models like BERT and decoder-only models like current LLMs.