Honestly, though, I don’t think new neural network architectures are going to get us over this local maximum. I think the next steps forward involve something that’s:
1. Non-lossy
2. Readily interpretable
https://arcprize.org/blog/hrm-analysis#analyzing-hrms-contri...
Nothing about the human brain is "readily interpretable", and artificial neural networks - which, unlike brains, can be instrumented and experimented on easily - tend to resist interpretation nonetheless.
If there were an easy way to reduce ML models to "readily interpretable" representations, someone would have done it already. If there were architectures that perform similarly but are orders of magnitude more interpretable, they would be used, because interpretability is desirable. Instead, we get what we get.
For CNNs, we know very well how the early layers work - edge detectors, curve detectors, etc. That understanding decays the deeper you go into the model. In the brain, V1/V2 are similarly well studied, but the picture breaks down deeper into the visual cortex - and the sheer architectural complexity there sure doesn't help.
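For what it's worth, the early-layer part is easy to see for yourself. Here's a rough sketch (assuming a recent torchvision plus matplotlib; the weights string and layer name are torchvision's, not anything special I picked) that pulls the first conv layer of a pretrained ResNet-18 and plots its 64 filters - most of them come out looking like oriented edge and color-blob detectors:

    # Sketch: visualize the first-layer filters of a pretrained ResNet-18.
    # Assumes torchvision >= 0.13 (string weights arg) and matplotlib installed.
    import torchvision.models as models
    import matplotlib.pyplot as plt

    model = models.resnet18(weights="IMAGENET1K_V1")
    filters = model.conv1.weight.detach().clone()   # shape: (64, 3, 7, 7)

    # Normalize each filter to [0, 1] so it renders as an RGB patch.
    f_min = filters.amin(dim=(1, 2, 3), keepdim=True)
    f_max = filters.amax(dim=(1, 2, 3), keepdim=True)
    filters = (filters - f_min) / (f_max - f_min)

    fig, axes = plt.subplots(8, 8, figsize=(8, 8))
    for ax, filt in zip(axes.flat, filters):
        ax.imshow(filt.permute(1, 2, 0).numpy())    # CHW -> HWC for imshow
        ax.axis("off")
    plt.suptitle("First-layer filters of ResNet-18")
    plt.show()

Try the same trick on layer3 or layer4 weights and you get noise-looking patches - which is exactly the "understanding decays deeper into the model" point.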