Though honestly I don’t think new neural network architectures are going to get us over this local maximum. I think the next steps forward involve something that’s:
1. Non-lossy
2. Readily interpretable
I remain highly skeptical. I doubt that transformers are the best architecture possible, but they set a high bar. And it sure seems like people who keep making the suggestion that "transformers aren't the future" aren't good enough to actually clear that bar.
https://arcprize.org/blog/hrm-analysis#analyzing-hrms-contri...
Nothing about the human brain is "readily interpretable", and artificial neural networks, which, unlike brains, can be instrumented and experimented on easily, tend to resist interpretation nonetheless.
If there were an easy way to reduce ML to "readily interpretable" representations, someone would have done it already. If there were architectures that performed similarly but were orders of magnitude more interpretable, they would be used, because interpretability is desirable. Instead, we get what we get.
If any midwit can say "X is deeply flawed" but no one can put together a Y that beats X, then clearly, pointing out the flaws was never the bottleneck at all.
It's not a linear process, so I'm not sure the "bottleneck" analogy holds here.
We're not limited to only talking about "the bottleneck". I think the argument is more that we're very close to optimal results for the current approach/architecture, so getting superior outcomes from AI will actually require meaningfully different approaches.
The preceding seq2seq architectures had been RNN (LSTM) based, then RNN + attention (Bahdanau et al., "Jointly Learning to Align and Translate"); the Transformer "Attention Is All You Need" paper then showed you could drop the RNN altogether and just use attention.
Of course, NOT using RNNs was the key motivator behind the new Transformer architecture: not only did you not NEED an RNN, the authors explicitly wanted to avoid one, since the goal was parallel rather than sequential processing for better performance on the available highly parallel hardware.
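For anyone who hasn't looked at the details, here's a rough NumPy sketch of that difference (illustrative only, random untrained weights, shapes made up): the RNN step is a loop whose iterations depend on each other, while attention is just a few big matrix multiplies over the whole sequence at once.

```python
# Illustrative sketch (NumPy): why dropping the RNN enables parallelism.
# An RNN processes tokens one at a time, each step depending on the previous
# hidden state; scaled dot-product attention computes all pairwise
# interactions in one shot of matrix math, which maps well onto parallel hardware.
import numpy as np

def rnn_forward(x, W_h, W_x):
    """Sequential: h_t depends on h_{t-1}, so the loop cannot be parallelized."""
    h = np.zeros(W_h.shape[0])
    hs = []
    for x_t in x:                                # one step per token, in order
        h = np.tanh(W_h @ h + W_x @ x_t)
        hs.append(h)
    return np.stack(hs)

def attention_forward(x, W_q, W_k, W_v):
    """Parallel: every position attends to every other via batched matmuls."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # (seq, seq) computed all at once
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)    # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))                      # 6 tokens, model dim 8
print(rnn_forward(x, rng.normal(size=(8, 8)), rng.normal(size=(8, 8))).shape)
print(attention_forward(x, *(rng.normal(size=(8, 8)) for _ in range(3))).shape)
```

Both produce a (6, 8) output here, but only the attention path is a fixed number of dense ops regardless of sequence order, which is exactly what the GPUs of the time were good at.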
For CNNs, we know very well how the early layers work - edge detectors, curve detectors, etc. This understanding decays further into the model. In the brain, V1/V2 are similarly well studied, but it breaks down deeper into the visual cortex - and the sheer architectural complexity there sure doesn't help.
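The edge-detector point is easy to check yourself by plotting a pretrained model's first-layer filters. A rough sketch, assuming torchvision and matplotlib are installed (the choice of resnet18 and its ImageNet weights here is just for illustration, and the weights download on first use):

```python
# Hedged sketch: visualize the first conv layer of a pretrained CNN.
# The 7x7 RGB filters come out looking mostly like oriented edges and
# color blobs, which is where the "early layers are edge detectors" claim comes from.
import matplotlib.pyplot as plt
import torchvision

model = torchvision.models.resnet18(weights="DEFAULT")
filters = model.conv1.weight.detach()            # shape (64, 3, 7, 7)

# Normalize all filters to [0, 1] so each can be displayed as an RGB patch.
f_min, f_max = filters.min(), filters.max()
filters = (filters - f_min) / (f_max - f_min)

fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for ax, f in zip(axes.flat, filters):
    ax.imshow(f.permute(1, 2, 0).numpy())        # (7, 7, 3) for imshow
    ax.axis("off")
plt.show()
```

Doing the same for layer three or four gives you nothing you can eyeball, which is the "understanding decays further into the model" part.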
Ironically, the same could be said about Attention Is All You Need in 2017. It didn’t drive any improvements immediately; actually decent Transformer models took a few years to arrive after that.