
169 points | mgninad | 1 comment
attogram No.45072664
"Attention Is All You Need" - I've always wondered if the authors of that paper used such a casual and catchy title because they knew it would be groundbreaking and massively cited in the future....
sivm No.45073494
Attention is all you need for what we have today. But attention is a local heuristic: we get brittle coherence and no global state. I believe we need a paradigm shift in architecture to move forward.
radarsat1 No.45074860
To be fair, it would be a lot easier to iterate on ideas if a single experiment didn't cost thousands of dollars and require such massive amounts of data. Things have gotten to the point where it's hard to contribute unless you're part of a big company or university, and even then you have to justify the expenditure (risk). Paradigm shifts are hard to come by when there is so much momentum in one direction and trying something different carries significant barriers.
yorwba No.45077719
Plenty of research involves small models trained on small amounts of data. You don't necessarily need an internet-scale training run to test a new model architecture; you can compare it against other models of the same size trained on the same data. For example, small-model speedruns are a thing: https://github.com/KellerJordan/modded-nanogpt
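
To make that concrete, here is a minimal sketch of the kind of same-size, same-data comparison described above: two tiny models, one attention-based and one recurrent, trained on an identical synthetic copy task and scored on held-out loss. The architectures, hyperparameters, and toy dataset are illustrative assumptions (plain PyTorch, not the modded-nanogpt setup), so treat it as the shape of the experiment rather than a benchmark.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    VOCAB, SEQ_LEN, D_MODEL = 64, 32, 64

    def make_batch(n=256):
        # Toy task: the second half of each sequence repeats the first half,
        # so a next-token predictor only beats chance by copying information
        # from SEQ_LEN // 2 positions back.
        half = torch.randint(0, VOCAB, (n, SEQ_LEN // 2))
        x = torch.cat([half, half], dim=1)
        y = torch.roll(x, shifts=-1, dims=1)  # next-token targets
        return x, y

    class TinyTransformer(nn.Module):
        def __init__(self):
            super().__init__()
            self.emb = nn.Embedding(VOCAB, D_MODEL)
            self.pos = nn.Parameter(torch.randn(1, SEQ_LEN, D_MODEL) * 0.02)
            layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4,
                                               dim_feedforward=128,
                                               batch_first=True)
            self.enc = nn.TransformerEncoder(layer, num_layers=2)
            self.head = nn.Linear(D_MODEL, VOCAB)
            # Causal mask so the model cannot peek at future tokens.
            self.register_buffer(
                "causal", nn.Transformer.generate_square_subsequent_mask(SEQ_LEN))

        def forward(self, x):
            h = self.enc(self.emb(x) + self.pos, mask=self.causal)
            return self.head(h)

    class TinyLSTM(nn.Module):
        def __init__(self):
            super().__init__()
            self.emb = nn.Embedding(VOCAB, D_MODEL)
            self.rnn = nn.LSTM(D_MODEL, D_MODEL, num_layers=2, batch_first=True)
            self.head = nn.Linear(D_MODEL, VOCAB)

        def forward(self, x):
            out, _ = self.rnn(self.emb(x))
            return self.head(out)

    def train_and_eval(model, steps=300):
        opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(steps):
            x, y = make_batch()
            loss = loss_fn(model(x).reshape(-1, VOCAB), y.reshape(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()
        # Held-out loss on fresh data; the unlearnable first half contributes
        # the same floor to both models, so differences reflect the copy task.
        with torch.no_grad():
            x, y = make_batch()
            return loss_fn(model(x).reshape(-1, VOCAB), y.reshape(-1)).item()

    for name, model in [("tiny transformer", TinyTransformer()),
                        ("tiny LSTM", TinyLSTM())]:
        print(f"{name}: held-out loss {train_and_eval(model):.3f}")

The point is the methodology rather than the specific numbers: both models see exactly the same data and have roughly comparable parameter counts, so the held-out losses reflect architecture rather than scale, and the whole run fits on a laptop CPU. Swapping in a real small corpus would make the comparison more meaningful without changing the cost picture much.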