
602 points emrah | 1 comment
api ◴[] No.43744419[source]
When I see 32B or 70B models performing similarly to 200+B models, I don't know what to make of it. Either the larger models contain more breadth of information but we have managed to distill similar latent capabilities into smaller ones, or the larger models are just less efficient, or the tests are not very good.
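A minimal sketch of the "distill latent capabilities" idea mentioned above: train a small student model to match a large teacher's softened output distribution rather than the hard labels alone. This assumes a PyTorch-style setup; the function name, temperature value, and shapes are illustrative assumptions, not anything stated in the thread.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Soft-label distillation (Hinton et al., 2015): the student is
    pushed toward the teacher's softened probability distribution."""
    # Soften both distributions with the temperature so the teacher's
    # "dark knowledge" (relative probabilities of wrong answers) is visible.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between teacher and student, scaled by T^2 so
    # gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_probs, soft_targets,
                    reduction="batchmean") * temperature ** 2
```

Under this reading, a 32B student can end up close to a 200+B teacher on benchmarks because it only has to reproduce the teacher's behavior, not rediscover it from raw data.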
replies(2): >>43744582 #>>43744783 #
1. simonw ◴[] No.43744582[source]
It makes intuitive sense to me that this would be possible, because LLMs are still mostly opaque black boxes. I expect you could drop a whole bunch of the weights without having a huge impact on quality - maybe you end up mostly ditching the parts that are derived from shitposts on Reddit but keeping the bits from arXiv, for example.

(That's a massive simplification of how any of this works, but it's how I think about it at a high level.)
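One concrete version of "dropping a whole bunch of the weights" is magnitude pruning: zero out the lowest-magnitude parameters and keep the rest, which at moderate sparsity often costs surprisingly little quality. A minimal NumPy sketch; the function name and the 50% sparsity level are illustrative assumptions, not the method anyone in the thread is describing.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float = 0.5) -> np.ndarray:
    """Zero out the fraction `sparsity` of weights with the smallest
    absolute values; larger-magnitude weights pass through unchanged."""
    # Threshold chosen so that `sparsity` of the entries fall below it.
    threshold = np.quantile(np.abs(weights).ravel(), sparsity)
    return np.where(np.abs(weights) >= threshold, weights, 0.0)

# Example: prune half the entries of a random weight matrix.
w = np.random.randn(4, 4)
w_pruned = magnitude_prune(w, sparsity=0.5)
```

Real pruning pipelines usually fine-tune afterwards to recover quality, but the core intuition is the same as the comment's: a lot of the parameters turn out not to matter much.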