602 points | emrah | 4 comments
1. diggan | No.43743644
The first graph compares "Elo Score" across models at "native" BF16 precision, and the second compares VRAM usage between native BF16 and their QAT models. But since this method is about quantizing while maintaining quality, isn't the obvious graph, comparing quality between BF16 and QAT, missing? The text doesn't seem to talk about it either, yet it's basically the topic of the blog post.
2. croemer | No.43743893
Indeed, the one thing I was looking for was Elo/performance of the quantized models, not how good the base model is. Showing how much memory is saved by quantization in a figure is a bit of an insult to the intelligence of the reader.
3. nithril | No.43743928
In addition, the "Massive VRAM Savings" graph states what looks like a tautology: reducing from 16 bits to 4 bits unsurprisingly leads to a 4x reduction in memory usage.
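A back-of-the-envelope sketch of the arithmetic nithril is pointing at (the parameter count is a made-up example, and llama.cpp's Q4_0 format stores a per-block fp16 scale, so the effective width is roughly 4.5 bits per weight rather than exactly 4):

    # Back-of-the-envelope VRAM estimate for the weights alone (no KV cache,
    # no activations). The parameter count below is a hypothetical example.

    def weight_vram_gib(n_params: float, bits_per_weight: float) -> float:
        return n_params * bits_per_weight / 8 / (1024 ** 3)

    n_params = 27e9                           # hypothetical 27B-parameter model

    bf16 = weight_vram_gib(n_params, 16.0)    # native BF16: 16 bits per weight
    q4_0 = weight_vram_gib(n_params, 4.5)     # Q4_0: 4-bit weights plus one fp16
                                              # scale per 32-weight block,
                                              # ~4.5 effective bits per weight

    print(f"BF16: {bf16:.1f} GiB, Q4_0: {q4_0:.1f} GiB, ratio: {bf16 / q4_0:.2f}x")

So the roughly 4x figure is baked into the bit widths themselves, with the block-scale overhead actually pulling it slightly under 4x.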
4. claiir | No.43745363
Yea, they mention a “perplexity drop” relative to naive quantization, but that’s meaningless to me (rough arithmetic sketched after this comment).

> We reduce the perplexity drop by 54% (using llama.cpp perplexity evaluation) when quantizing down to Q4_0.

Wish they showed benchmarks / added quantized versions to the arena! :>
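For what it's worth, the usual reading of that sentence is that QAT shrinks the gap between BF16 and Q4_0 perplexity by 54%, not that perplexity itself drops by that much. A minimal sketch with made-up numbers (nothing here is from the post):

    # Hypothetical illustration of "reduce the perplexity drop by 54%".
    # None of these numbers come from the post; they only show the usual
    # reading: the "drop" is the perplexity gap between the BF16 model and
    # its Q4_0 quantization, and QAT shrinks that gap by 54%.

    ppl_bf16     = 8.00                    # made-up BF16 perplexity (lower is better)
    ppl_naive_q4 = 8.60                    # made-up naive post-training Q4_0 perplexity

    naive_drop = ppl_naive_q4 - ppl_bf16   # 0.60 degradation from naive quantization
    qat_drop   = naive_drop * (1 - 0.54)   # gap cut by 54% -> ~0.28
    ppl_qat_q4 = ppl_bf16 + qat_drop       # ~8.28

    print(f"naive Q4_0: +{naive_drop:.2f} ppl, QAT Q4_0: +{qat_drop:.2f} ppl ({ppl_qat_q4:.2f})")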