
169 points | mgninad | 1 comment
attogram No.45072664
"Attention Is All You Need" - I've always wondered if the authors of that paper used such a casual and catchy title because they knew it would be groundbreaking and massively cited in the future....
sivm No.45073494
Attention is all you need for what we have today. But attention is a local heuristic: we get brittle coherence and no global state. I believe we need a paradigm shift in architecture to move forward.
radarsat1 No.45074860
To be fair, it would be a lot easier to iterate on ideas if a single experiment didn't cost thousands of dollars and require such massive amounts of data. Things have gotten to the point where it's hard to contribute unless you're part of a big company or university, and even then you have to justify the expenditure (risk). Paradigm shifts are hard to come by when there is so much momentum in one direction and trying something different carries significant barriers.
yorwba No.45077719
Plenty of research involves small models trained on small amounts of data. You don't necessarily need an internet-scale training run to test a new model architecture; you can compare it against other models of the same size trained on the same data. For example, small-model speedruns are a thing: https://github.com/KellerJordan/modded-nanogpt
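
To make that concrete, here is a minimal sketch of the kind of same-size, same-data comparison described above: two tiny models, one attention-based and one recurrent, trained on an identical synthetic copy task and scored on held-out loss. The architectures, hyperparameters, and toy dataset are illustrative assumptions (plain PyTorch, not the modded-nanogpt setup), so treat it as the shape of the experiment rather than a benchmark.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    VOCAB, SEQ_LEN, D_MODEL = 64, 32, 64

    def make_batch(n=256):
        # Toy task: the second half of each sequence repeats the first half,
        # so a next-token predictor only beats chance by copying information
        # from SEQ_LEN // 2 positions back.
        half = torch.randint(0, VOCAB, (n, SEQ_LEN // 2))
        x = torch.cat([half, half], dim=1)
        y = torch.roll(x, shifts=-1, dims=1)  # next-token targets
        return x, y

    class TinyTransformer(nn.Module):
        def __init__(self):
            super().__init__()
            self.emb = nn.Embedding(VOCAB, D_MODEL)
            self.pos = nn.Parameter(torch.randn(1, SEQ_LEN, D_MODEL) * 0.02)
            layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4,
                                               dim_feedforward=128,
                                               batch_first=True)
            self.enc = nn.TransformerEncoder(layer, num_layers=2)
            self.head = nn.Linear(D_MODEL, VOCAB)
            # Causal mask so the model cannot peek at future tokens.
            self.register_buffer(
                "causal", nn.Transformer.generate_square_subsequent_mask(SEQ_LEN))

        def forward(self, x):
            h = self.enc(self.emb(x) + self.pos, mask=self.causal)
            return self.head(h)

    class TinyLSTM(nn.Module):
        def __init__(self):
            super().__init__()
            self.emb = nn.Embedding(VOCAB, D_MODEL)
            self.rnn = nn.LSTM(D_MODEL, D_MODEL, num_layers=2, batch_first=True)
            self.head = nn.Linear(D_MODEL, VOCAB)

        def forward(self, x):
            out, _ = self.rnn(self.emb(x))
            return self.head(out)

    def train_and_eval(model, steps=300):
        opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(steps):
            x, y = make_batch()
            loss = loss_fn(model(x).reshape(-1, VOCAB), y.reshape(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()
        # Held-out loss on fresh data; the unlearnable first half contributes
        # the same floor to both models, so differences reflect the copy task.
        with torch.no_grad():
            x, y = make_batch()
            return loss_fn(model(x).reshape(-1, VOCAB), y.reshape(-1)).item()

    for name, model in [("tiny transformer", TinyTransformer()),
                        ("tiny LSTM", TinyLSTM())]:
        print(f"{name}: held-out loss {train_and_eval(model):.3f}")

The point is the methodology rather than the specific numbers: both models see exactly the same data and have roughly comparable parameter counts, so the held-out losses reflect architecture rather than scale, and the whole run fits on a laptop CPU. Swapping in a real small corpus would make the comparison more meaningful without changing the cost picture much.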