"Attention Is All You Need" - I've always wondered if the authors of that paper used such a casual and catchy title because they knew it would be groundbreaking and massively cited in the future....
I was there. They had no idea. The purpose and expectation of the transformer architecture were to enable better translation at scale and without recurrence. That alone would have been a big deal (and they probably expected some citations), but the actual impact of the architecture was a few orders of magnitude greater.