From multi-head to latent attention: The evolution of attention mechanisms

(vinithavn.medium.com)

1. mrtesthah ◴[30 Aug 25 07:05 UTC] No.45072533[source]▶

>>45072160 (OP) #

Do we know if any of these techniques are actually used in the so-called "frontier" models?

replies(3): >>45072588 #>>45073417 #>>45076391 #

2. vinithavn01 ◴[30 Aug 25 07:17 UTC] No.45072588[source]▶

>>45072533 #

The model names are mentioned under each type of attention mechanism

3. JSR_FDED ◴[30 Aug 25 07:30 UTC] No.45072654[source]▶

>>45072160 (OP) #

Any way to read this without making an account?

replies(4): >>45072762 #>>45072767 #>>45072855 #>>45073919 #

4. attogram ◴[30 Aug 25 07:32 UTC] No.45072664[source]▶

>>45072160 (OP) #

"Attention Is All You Need" - I've always wondered if the authors of that paper used such a casual and catchy title because they knew it would be groundbreaking and massively cited in the future....

replies(9): >>45073018 #>>45073470 #>>45073494 #>>45073527 #>>45073545 #>>45074544 #>>45074862 #>>45075147 #>>45079506 #

5. rmonvfer ◴[30 Aug 25 07:52 UTC] No.45072762[source]▶

>>45072654 #

https://freedium.cfd/https://vinithavn.medium.com/from-multi...

6. kuidaumpf ◴[30 Aug 25 07:53 UTC] No.45072767[source]▶

>>45072654 #

https://freedium.cfd

7. qcnguy ◴[30 Aug 25 08:10 UTC] No.45072855[source]▶

>>45072654 #

Just click the x at the top right of the interstitial?

replies(1): >>45073532 #

8. adastra22 ◴[30 Aug 25 08:42 UTC] No.45073018[source]▶

>>45072664 #

Definitely. I always assumed that, having been involved in writing similarly groundbreaking papers… or so we thought at the time. All my coauthors spent significant time thinking about what the best title would be, and strategies like that were common. (It ended up not mattering for us.)

9. gchadwick ◴[30 Aug 25 10:10 UTC] No.45073417[source]▶

>>45072533 #

Who knows what the closed-source models use, but going by what's happening in open models, all the big changes and corresponding gains in capability are in training techniques, not model architecture. Things like GQA and MLA, as discussed in this article, are important techniques for getting better scaling, but they are relatively minor tweaks compared with the evolution in training techniques.

I suspect closed models aren't doing anything too radically different from what's presented here.
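To make the "better scaling" point about GQA concrete, here is a rough back-of-the-envelope sketch of KV-cache size. The model shape (32 layers, 32 query heads, head dimension 128, 8192-token context, 2 bytes per value) is a hypothetical one picked for round numbers, not taken from the article or any specific model. MLA is not modeled here; it compresses keys and values into a smaller latent vector rather than reducing the head count.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one cached tensor for keys, one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape, chosen only for illustration.
N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 8192

# MHA: every query head has its own key/value head.
mha = kv_cache_bytes(N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN)
# GQA: 8 key/value heads, each shared by a group of 4 query heads.
gqa = kv_cache_bytes(N_LAYERS, 8, HEAD_DIM, SEQ_LEN)
# MQA: a single key/value head shared by all query heads.
mqa = kv_cache_bytes(N_LAYERS, 1, HEAD_DIM, SEQ_LEN)

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**30:.3f} GiB per sequence")
```

Under these assumed numbers the cache shrinks in direct proportion to the number of key/value heads, which is why this kind of change helps with long contexts and batch size without altering the overall architecture.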

10. slickytail ◴[30 Aug 25 10:21 UTC] No.45073470[source]▶

>>45072664 #

The transformer was a major breakthrough in NLP, and it was clear at the time of publishing that it would have a major impact. But I will add that it is common in the Deep Learning field to give papers catchy titles (see, off the top of my head: all the YOLO papers, ViT, DiT, textual inversion). The transformer paper is one in a long line of seminal papers with funny names.

11. sivm ◴[30 Aug 25 10:25 UTC] No.45073494[source]▶

>>45072664 #

Attention is all you need for what we have. But attention is a local heuristic. We have brittle coherence and no global state. I believe we need a paradigm shift in architecture to move forward.

replies(5): >>45073726 #>>45074245 #>>45074860 #>>45076552 #>>45078243 #

12. iLoveOncall ◴[30 Aug 25 10:33 UTC] No.45073527[source]▶

>>45072664 #

I recommend reading this article which explains how you can get your papers accepted, and explains that a catchy title is the #1 most important thing: https://maxwellforbes.com/posts/how-to-get-a-paper-accepted/ (not a plug, I just saved it because it was interesting)

13. iLoveOncall ◴[30 Aug 25 10:34 UTC] No.45073532{3}[source]▶

>>45072855 #

That only works for a few articles per month. But usually opening in incognito does the trick.

14. hyperbovine ◴[30 Aug 25 10:37 UTC] No.45073545[source]▶

>>45072664 #

It sounds like a typical NeurIPS paper to me. And no, they didn't know what a big deal it would be, or else Google never would have given the idea away.

15. treyd ◴[30 Aug 25 11:21 UTC] No.45073726{3}[source]▶

>>45073494 #

Has there been research into some hierarchical attention model that has local attention at the scale of sentences and paragraphs that feeds embeddings up to longer range attention across documents?

replies(1): >>45074035 #

16. djoldman ◴[30 Aug 25 12:05 UTC] No.45073919[source]▶

>>45072654 #

just turn off JS.

17. mxkopy ◴[30 Aug 25 12:27 UTC] No.45074035{4}[source]▶

>>45073726 #

There’s the hierarchical reasoning model https://arxiv.org/abs/2506.21734 but it’s very new and largely untested

Though honestly I don’t think new neural network architectures are going to get us over this local maximum, I think the next steps forward involve something that’s

1. Non lossy

2. Readily interpretable

replies(2): >>45074274 #>>45074473 #

18. ACCount37 ◴[30 Aug 25 12:58 UTC] No.45074245{3}[source]▶

>>45073494 #

Plenty of "we need a paradigm shift in architecture" going around - and no actual architecture that would beat transformers at their strengths as far as the eye can see.

I remain highly skeptical. I doubt that transformers are the best architecture possible, but they set a high bar. And it sure seems like people who keep making the suggestion that "transformers aren't the future" aren't good enough to actually clear that bar.

replies(2): >>45074490 #>>45076257 #

19. miven ◴[30 Aug 25 13:02 UTC] No.45074274{5}[source]▶

>>45074035 #

The ARC Prize Foundation ran extensive ablations on HRM for their slew of reasoning tasks and noted that the "hierarchical" part of their architecture is not much more impactful than a vanilla transformer of the same size with no extra hyperparameter tuning:

https://arcprize.org/blog/hrm-analysis#analyzing-hrms-contri...

20. ACCount37 ◴[30 Aug 25 13:26 UTC] No.45074473{5}[source]▶

>>45074035 #

By now, I seriously doubt any "readily interpretable" claims.

Nothing about the human brain is "readily interpretable", and artificial neural networks - which, unlike brains, can be instrumented and experimented on easily - tend to resist interpretation nonetheless.

If there were an easy way to reduce ML to "readily interpretable" representations, someone would have done so already. If there were architectures that perform similarly but are orders of magnitude more interpretable, they would be used, because interpretability is desirable. Instead, we get what we get.

replies(1): >>45080279 #

21. airstrike ◴[30 Aug 25 13:27 UTC] No.45074490{4}[source]▶

>>45074245 #

That logic does not hold.

Being able to provide an immediate replacement is not a requirement to point out limitations in current technology.

replies(1): >>45075281 #

22. lucidrains ◴[30 Aug 25 13:33 UTC] No.45074544[source]▶

>>45072664 #

It is a reference to the Beatles song, mainly because Noam Shazeer is a music lover

23. radarsat1 ◴[30 Aug 25 14:13 UTC] No.45074860{3}[source]▶

>>45073494 #

To be fair it would be a lot easier to iterate on ideas if a single experiment didn't cost thousands of dollars and require such massive data. Things have really gotten to the point that it's just not easy for outsiders to contribute if you're not part of a big company or university, and even then you have to justify the expenditure (risk). Paradigm shifts are hard to come by when there is so much momentum in one direction and trying something different carries significant barriers.

replies(1): >>45077719 #

24. ruuda ◴[30 Aug 25 14:13 UTC] No.45074862[source]▶

>>45072664 #

What about the converse: the paper became so massively influential because of the catchy title? Of course the contents are groundbreaking, but that alone is not enough. A groundbreaking paper that nobody knows about cannot have any impact. Even for research, there is a marketing part to it.

replies(2): >>45075150 #>>45075157 #

25. cgearhart ◴[30 Aug 25 14:51 UTC] No.45075147[source]▶

>>45072664 #

The title is a succinct snippet that spoke directly to researchers at the time. The transformer architecture is somewhat obvious (especially in retrospect), but it was still very surprising because no one was really going this direction. They were going many other directions… That’s the point of the title: you don’t need all kinds of complicated systems for NLP to work—“attention is all you need”.

After the success of transfer learning for computer vision in the mid-2010s, it was obvious that NLP needed its own transfer learning approach and AlexNet moment.

Lots of research focus around that time was on recurrent models—because that was the conventional wisdom about how you model sequences. Markov chains had led to vanilla RNNs, LSTMs, GRU, etc., which all seemed tantalizingly promising. (MAMBA fans take note.) Attention mechanisms were even used in recurrent models…but so was everything else.

Then came transformers—mixing all the then-best-practice bits with the heretical idea of just not giving a shit about O(n^2) complexity. The vanilla transformer used an encoder-decoder structure like the best translation models had been doing; it used a stack of identical blocks to nudge the output along through the pipe like ResNet; it was pretrained on a multi-task objective using a large document corpus. But then it jettisoned all the other complexity and just let it all rest on the attention mechanism to capture long range dependencies.

It was immediately thrilling, but it was also completely impractical. (I think the largest model had a 500ish-token context limit and was bigger than hobbyist GPUs could handle.) So it mostly sat on a shelf while people used other “good enough” models for a few years until the hardware got better and a couple folks proved that it could actually work to run these things at massive scale.

And now here we are.

I think they knew what they were saying at the time, but I don’t think they knew that it would remain true for years.
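For anyone wondering where the O(n^2) mentioned above comes from, a minimal sketch of single-head scaled dot-product attention in plain Python follows. It is an illustration under simplifying assumptions (no learned projections, no masking, toy 2-dimensional tokens that I made up), not the paper's implementation. The `scores` matrix has one entry per pair of tokens, so it grows quadratically with sequence length.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))
    # One score per (query, key) pair: an n x n matrix for n tokens.
    scores = [
        [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        for q in queries
    ]
    weights = [softmax(row) for row in scores]
    width = len(values[0])
    # Each output is a weighted average of all value vectors.
    outputs = [
        [sum(w * v[c] for w, v in zip(row, values)) for c in range(width)]
        for row in weights
    ]
    return outputs, weights


# Toy input: three made-up token vectors used as queries, keys, and values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
print(len(weights), "x", len(weights[0]), "score matrix for", len(tokens), "tokens")
```

Doubling the number of tokens quadruples the number of score entries, which is the cost the recurrent models of the time avoided.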

replies(1): >>45075372 #

26. eldenring ◴[30 Aug 25 14:51 UTC] No.45075150{3}[source]▶

>>45074862 #

Huh? Of course it's enough. Transformers immediately started destroying every single baseline out there. The authors definitely knew it was a very significant discovery beforehand.

replies(1): >>45076298 #

27. soulofmischief ◴[30 Aug 25 14:52 UTC] No.45075157{3}[source]▶

>>45074862 #

The paper became massively influential because of its contents, not its catchy title. Scientists do not generally read a paper because of its title; they check the abstract and go from there.

28. ACCount37 ◴[30 Aug 25 15:05 UTC] No.45075281{5}[source]▶

>>45074490 #

What's the value of "pointing out limitations" if this completely fails to drive any improvements?

If any midwit can say "X is deeply flawed" but no one can put together a Y that would beat X, then clearly, pointing out the flaws was never the bottleneck at all.

replies(1): >>45076649 #

29. danieldk ◴[30 Aug 25 15:16 UTC] No.45075372{3}[source]▶

>>45075147 #

I feel like there is a step missing here...

People were using RNN encoders/decoders for machine translation - the encoder was used to make a representation (fixed-size vector) of the source language sentence, the decoder generated the target language sentence from the source representation.

The issue that people were bumping into is that the fixed-sized vector bottlenecked the encoder/decoder architecture. Representing a variable-length source sentence as a fixed-size vector leads to a loss of information that increases with the source sentence length.

People started adding attention to the decoder as a way to work around this issue. Each decoder step could attend to every token (well, RNN hidden representation) of the source sentence. So, this led to the RNN + attention architecture.

The title 'Attention is all you need' comes from the realization that in this architecture the RNN is not needed, neither for the encoder nor for the decoder. It's a message to the field, which was using RNNs + attention (to avoid the bottleneck). Of course, the rest was born from that: encoder-only transformer models like BERT and decoder-only models like current LLMs.
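The bottleneck and the attention workaround described above can be sketched in a few lines. This is a toy illustration with made-up hidden states; it uses dot-product scoring for brevity, whereas the original RNN + attention work scored with a small learned network, so treat the scoring function as an assumption.

```python
import math


def attend(decoder_state, encoder_states):
    """Build a context vector as a softmax-weighted sum of encoder states."""
    # Score each source position against the current decoder state.
    scores = [
        sum(d * h for d, h in zip(decoder_state, state))
        for state in encoder_states
    ]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(encoder_states[0])
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return context, weights


# Made-up RNN hidden states, one per source token.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# Without attention: the decoder only ever sees this single fixed-size vector.
bottleneck = encoder_states[-1]

# With attention: each decoder step recomputes its own view of the source.
decoder_state = [2.0, 0.0]
context, weights = attend(decoder_state, encoder_states)
print("weights over source positions:", [round(w, 3) for w in weights])
```

The point is that `context` is rebuilt at every decoder step from all source positions, so the amount of source information available no longer depends on squeezing the whole sentence into `bottleneck`.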

replies(1): >>45075776 #

30. cgearhart ◴[30 Aug 25 16:10 UTC] No.45075776{4}[source]▶

>>45075372 #

This is a fair point and clarification. :-)

31. scragz ◴[30 Aug 25 17:10 UTC] No.45076257{4}[source]▶

>>45074245 #

what ever happened to Google's Titans?

32. zackangelo ◴[30 Aug 25 17:23 UTC] No.45076391[source]▶

>>45072533 #

Not quite a frontier model but definitely built by a frontier lab: Grok 2 was recently open sourced and I believe it uses a fairly standard MHA architecture with MoE.

33. airstrike ◴[30 Aug 25 17:53 UTC] No.45076649{6}[source]▶

>>45075281 #

I think you don't understand how primary research works. Pointing out flaws helps others think about those flaws.

It's not a linear process so I'm not sure the "bottleneck" analogy holds here.

We're not limited to only talking about "the bottleneck". I think the argument is more that we're very close to optimal results for the current approach/architecture, so getting superior outcomes from AI will actually require meaningfully different approaches.

replies(1): >>45077664 #

34. ACCount37 ◴[30 Aug 25 20:16 UTC] No.45077664{7}[source]▶

>>45076649 #

Where's that "primary research" you're talking about? I certainly don't see it happening here right now.

My point is: saying "transformers are flawed" is dirt cheap. Coming up with anything less flawed isn't.

35. yorwba ◴[30 Aug 25 20:23 UTC] No.45077719{4}[source]▶

>>45074860 #

Plenty of research involves small models trained on small amounts of data. You don't necessarily need to do an internet-scale training run to test a new model architecture, you can just compare it to other models of the same size trained on the same data. For example, small-model speedruns are a thing: https://github.com/KellerJordan/modded-nanogpt

36. HarHarVeryFunny ◴[30 Aug 25 21:43 UTC] No.45078243{3}[source]▶

>>45073494 #

The Transformer was only ever designed to be a better seq-2-seq architecture, so "all you need" implicitly means "all you need for seq-2-seq" (not all you need for AGI), and was anyways more backwards looking than forwards looking.

The preceding seq-2-seq architectures had been RNN (LSTM) based, then RNN + attention (Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate"), with the Transformer "attention is all you need" paper then meaning you can drop use of RNNs altogether and just use attention.

Of course NOT using RNNs was the key motivator behind the new Transformer architecture - not only did you not NEED an RNN, but they explicitly wanted to avoid it since the goal was to support parallel vs sequential processing for better performance on the available highly parallel hardware.
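A small sketch of the parallel vs sequential distinction, using scalar toy models with arbitrary weights that I made up (nothing here is from the paper). The recurrent state at step t needs the state from step t-1, so the loop has to run in order. An unmasked self-attention output for one position reads only the inputs, so positions can be computed in any order, or all at once on parallel hardware.

```python
import math


def rnn_states(inputs, w_in=0.5, w_rec=0.9):
    """Scalar RNN: each state depends on the previous one."""
    h = 0.0
    states = []
    for x in inputs:  # cannot be reordered: h feeds the next step
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


def attention_row(i, xs):
    """Scalar self-attention output for position i, computed from inputs only."""
    scores = [xs[i] * x for x in xs]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return sum(e / total * x for e, x in zip(exps, xs))


xs = [0.1, -0.4, 0.7, 0.2]
states = rnn_states(xs)

# Same results whether positions are processed first-to-last or last-to-first.
in_order = [attention_row(i, xs) for i in range(len(xs))]
reverse_order = [attention_row(i, xs) for i in reversed(range(len(xs)))][::-1]
print(in_order == reverse_order)
```

Because no attention row waits on another, the whole sequence maps onto one batched matrix multiplication during training, which is the hardware-friendly property described above.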

37. jjaksic ◴[31 Aug 25 01:26 UTC] No.45079506[source]▶

>>45072664 #

I was there. They had no idea. The purpose and expectation of the transformer architecture was to enable better translation at scale and without recurrence. This in itself would have been a big deal (and they probably expected some citations), but the actual impact of the architecture was a few orders of magnitude greater.

38. mxkopy ◴[31 Aug 25 04:03 UTC] No.45080279{6}[source]▶

>>45074473 #

From what I’ve seen neurology is very readily interpretable but it’s hard to get data to interpret. For example the visual cortex V1-V5 areas are very well mapped out but other “deeper” structures are hard to get to and meaningfully measure.

replies(1): >>45080966 #

39. ACCount37 ◴[31 Aug 25 06:54 UTC] No.45080966{7}[source]▶

>>45080279 #

They're interpretable in a similar way to how interpretable CNNs are. Not by coincidence.

For CNNs, we know very well how the early layers work - edge detectors, curve detectors, etc. This understanding decays further into the model. In the brain, V1/V2 are similarly well studied, but it breaks down deeper into the visual cortex - and the sheer architectural complexity there sure doesn't help.
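As a concrete picture of what an "edge detector" in an early layer computes, here is a sketch using a hand-written Sobel kernel. That is an assumption for illustration: a trained CNN learns its filters rather than being given them, and the claim above is only that the learned early filters tend to resemble this kind of operator. The image is a made-up 4x5 grid with a dark left side and a bright right side.

```python
# Hand-written horizontal-gradient (Sobel) kernel, standing in for a learned filter.
SOBEL_X = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]


def correlate2d(image, kernel):
    """Slide the kernel over the image with no padding (cross-correlation)."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    return [
        [
            sum(
                kernel[a][b] * image[r + a][c + b]
                for a in range(k)
                for b in range(k)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Made-up image: dark (0) on the left, bright (1) on the right.
image = [[0, 0, 0, 1, 1] for _ in range(4)]

response = correlate2d(image, SOBEL_X)
for row in response:
    print(row)
```

The response is zero over the flat dark region and nonzero wherever the window overlaps the dark-to-bright boundary, which is the sort of behavior that makes early layers comparatively easy to read.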