169 points mgninad | 4 comments
1. mrtesthah No.45072533
Do we know if any of these techniques are actually used in the so-called "frontier" models?
replies(3): >>45072588 >>45073417 >>45076391
2. vinithavn01 No.45072588
The model names are mentioned under each type of attention mechanism.
3. gchadwick No.45073417
Who knows what the closed-source models use, but judging by what's happening in open models, all the big changes and corresponding gains in capability are coming from training techniques, not model architecture. Things like GQA and MLA, as discussed in this article, are important techniques for getting better scaling, but they are relatively minor tweaks compared with the evolution in training techniques.

I suspect closed models aren't doing anything too radically different from what's presented here.
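
To make the "minor tweak" point concrete, here is a minimal grouped-query attention (GQA) sketch in PyTorch; the head counts and dimensions are illustrative, not taken from any particular model. The only change from standard MHA is that several query heads share one K/V head, which shrinks the KV cache.

    import torch
    import torch.nn.functional as F

    # Grouped-query attention: n_kv_heads K/V heads are shared across
    # n_heads query heads (n_kv_heads == n_heads recovers plain MHA,
    # n_kv_heads == 1 is MQA). Shapes: x is (batch, seq, d_model).
    def gqa(x, wq, wk, wv, n_heads, n_kv_heads):
        B, T, D = x.shape
        hd = D // n_heads                    # per-head dimension
        q = (x @ wq).view(B, T, n_heads, hd).transpose(1, 2)
        k = (x @ wk).view(B, T, n_kv_heads, hd).transpose(1, 2)
        v = (x @ wv).view(B, T, n_kv_heads, hd).transpose(1, 2)
        # Each K/V head serves a group of n_heads // n_kv_heads query heads.
        g = n_heads // n_kv_heads
        k = k.repeat_interleave(g, dim=1)
        v = v.repeat_interleave(g, dim=1)
        att = F.softmax(q @ k.transpose(-2, -1) / hd**0.5, dim=-1)
        return (att @ v).transpose(1, 2).reshape(B, T, D)

    B, T, D, H, KVH = 2, 16, 64, 8, 2        # 8 query heads share 2 K/V heads
    hd = D // H
    x = torch.randn(B, T, D)
    wq = torch.randn(D, D)
    wk, wv = torch.randn(D, KVH * hd), torch.randn(D, KVH * hd)
    print(gqa(x, wq, wk, wv, H, KVH).shape)  # torch.Size([2, 16, 64])

With n_kv_heads == n_heads this reduces to ordinary MHA, and with n_kv_heads == 1 it becomes multi-query attention, which is why GQA reads as a scaling tweak rather than a new architecture.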

4. zackangelo No.45076391
Not quite a frontier model, but definitely built by a frontier lab: Grok 2 was recently open-sourced, and I believe it uses a fairly standard MHA (multi-head attention) architecture with MoE (mixture of experts).
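
For readers unfamiliar with the term, below is a minimal sketch of a top-k mixture-of-experts (MoE) feed-forward layer in PyTorch; the expert count, top-k, and layer sizes are made up for illustration and are not Grok 2's actual configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Top-k MoE feed-forward layer: a learned router picks k experts per
    # token and mixes their outputs by the (renormalized) router weights.
    class MoeFfn(nn.Module):
        def __init__(self, d_model, d_ff, n_experts=8, k=2):
            super().__init__()
            self.k = k
            self.router = nn.Linear(d_model, n_experts, bias=False)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                              nn.Linear(d_ff, d_model))
                for _ in range(n_experts))

        def forward(self, x):                     # x: (batch, seq, d_model)
            w, idx = self.router(x).topk(self.k, dim=-1)
            w = F.softmax(w, dim=-1)              # weights over chosen experts
            out = torch.zeros_like(x)
            for slot in range(self.k):
                for e, expert in enumerate(self.experts):
                    sel = idx[..., slot] == e     # tokens routed to expert e
                    if sel.any():
                        out[sel] += w[..., slot][sel].unsqueeze(-1) * expert(x[sel])
            return out

    moe = MoeFfn(d_model=64, d_ff=256)
    print(moe(torch.randn(2, 16, 64)).shape)      # torch.Size([2, 16, 64])

The point of the routing is that each token only runs through k of the n experts, so total parameter count grows without a proportional increase in per-token compute.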