169 points mgninad | 4 comments
1. mrtesthah No.45072533
Do we know if any of these techniques are actually used in the so-called "frontier" models?
replies(3): >>45072588 >>45073417 >>45076391
2. vinithavn01 No.45072588
The model names are mentioned under each type of attention mechanism.
3. gchadwick No.45073417
Who knows what the closed-source models use, but judging by what's happening in open models, all the big changes and corresponding gains in capability are coming from training techniques, not model architecture. Things like GQA and MLA, as discussed in this article, are important techniques for getting better scaling, but they are relatively minor tweaks compared with the evolution in training techniques.

I suspect closed models aren't doing anything too radically different from what's presented here.
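
To make the "minor tweak" point concrete, here is a minimal grouped-query attention (GQA) sketch in PyTorch; the head counts and dimensions are illustrative, not taken from any particular model. The only change from standard MHA is that several query heads share one K/V head, which shrinks the KV cache.

    import torch
    import torch.nn.functional as F

    # Grouped-query attention: n_kv_heads K/V heads are shared across
    # n_heads query heads (n_kv_heads == n_heads recovers plain MHA,
    # n_kv_heads == 1 is MQA). Shapes: x is (batch, seq, d_model).
    def gqa(x, wq, wk, wv, n_heads, n_kv_heads):
        B, T, D = x.shape
        hd = D // n_heads                    # per-head dimension
        q = (x @ wq).view(B, T, n_heads, hd).transpose(1, 2)
        k = (x @ wk).view(B, T, n_kv_heads, hd).transpose(1, 2)
        v = (x @ wv).view(B, T, n_kv_heads, hd).transpose(1, 2)
        # Each K/V head serves a group of n_heads // n_kv_heads query heads.
        g = n_heads // n_kv_heads
        k = k.repeat_interleave(g, dim=1)
        v = v.repeat_interleave(g, dim=1)
        att = F.softmax(q @ k.transpose(-2, -1) / hd**0.5, dim=-1)
        return (att @ v).transpose(1, 2).reshape(B, T, D)

    B, T, D, H, KVH = 2, 16, 64, 8, 2        # 8 query heads share 2 K/V heads
    hd = D // H
    x = torch.randn(B, T, D)
    wq = torch.randn(D, D)
    wk, wv = torch.randn(D, KVH * hd), torch.randn(D, KVH * hd)
    print(gqa(x, wq, wk, wv, H, KVH).shape)  # torch.Size([2, 16, 64])

With n_kv_heads == n_heads this reduces to ordinary MHA, and with n_kv_heads == 1 it becomes multi-query attention, which is why GQA reads as a scaling tweak rather than a new architecture.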

4. zackangelo No.45076391
Not quite a frontier model, but definitely built by a frontier lab: Grok 2 was recently open-sourced, and I believe it uses a fairly standard MHA (multi-head attention) architecture with MoE (mixture of experts).
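
For readers unfamiliar with the term, below is a minimal sketch of a top-k mixture-of-experts (MoE) feed-forward layer in PyTorch; the expert count, top-k, and layer sizes are made up for illustration and are not Grok 2's actual configuration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Top-k MoE feed-forward layer: a learned router picks k experts per
    # token and mixes their outputs by the (renormalized) router weights.
    class MoeFfn(nn.Module):
        def __init__(self, d_model, d_ff, n_experts=8, k=2):
            super().__init__()
            self.k = k
            self.router = nn.Linear(d_model, n_experts, bias=False)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                              nn.Linear(d_ff, d_model))
                for _ in range(n_experts))

        def forward(self, x):                     # x: (batch, seq, d_model)
            w, idx = self.router(x).topk(self.k, dim=-1)
            w = F.softmax(w, dim=-1)              # weights over chosen experts
            out = torch.zeros_like(x)
            for slot in range(self.k):
                for e, expert in enumerate(self.experts):
                    sel = idx[..., slot] == e     # tokens routed to expert e
                    if sel.any():
                        out[sel] += w[..., slot][sel].unsqueeze(-1) * expert(x[sel])
            return out

    moe = MoeFfn(d_model=64, d_ff=256)
    print(moe(torch.randn(2, 16, 64)).shape)      # torch.Size([2, 16, 64])

The point of the routing is that each token only runs through k of the n experts, so total parameter count grows without a proportional increase in per-token compute.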