602 points | emrah | 4 comments
1. diggan | No.43743644
The first graph compares "Elo Score" across models at "native" BF16 precision, and the second compares VRAM usage between native BF16 and their QAT models. But since this method is about quantizing while maintaining quality, isn't the obvious graph, comparing quality between BF16 and QAT, missing? The text doesn't seem to talk about it either, yet it's basically the topic of the blog post.
2. croemer | No.43743893
Indeed, the one thing I was looking for was Elo/performance of the quantized models, not how good the base model is. Showing how much memory is saved by quantization in a figure is a bit of an insult to the intelligence of the reader.
3. nithril | No.43743928
In addition, the "Massive VRAM Savings" graph states what looks like a tautology: reducing from 16 bits to 4 bits unsurprisingly leads to a 4x reduction in memory usage.
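A back-of-the-envelope sketch of the arithmetic nithril is pointing at (the parameter count is a made-up example, and llama.cpp's Q4_0 format stores a per-block fp16 scale, so the effective width is roughly 4.5 bits per weight rather than exactly 4):

    # Back-of-the-envelope VRAM estimate for the weights alone (no KV cache,
    # no activations). The parameter count below is a hypothetical example.

    def weight_vram_gib(n_params: float, bits_per_weight: float) -> float:
        return n_params * bits_per_weight / 8 / (1024 ** 3)

    n_params = 27e9                           # hypothetical 27B-parameter model

    bf16 = weight_vram_gib(n_params, 16.0)    # native BF16: 16 bits per weight
    q4_0 = weight_vram_gib(n_params, 4.5)     # Q4_0: 4-bit weights plus one fp16
                                              # scale per 32-weight block,
                                              # ~4.5 effective bits per weight

    print(f"BF16: {bf16:.1f} GiB, Q4_0: {q4_0:.1f} GiB, ratio: {bf16 / q4_0:.2f}x")

So the roughly 4x figure is baked into the bit widths themselves, with the block-scale overhead actually pulling it slightly under 4x.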
4. claiir | No.43745363
Yea, they mention a “perplexity drop” relative to naive quantization, but that’s meaningless to me (rough arithmetic sketched after this comment).

> We reduce the perplexity drop by 54% (using llama.cpp perplexity evaluation) when quantizing down to Q4_0.

Wish they showed benchmarks / added quantized versions to the arena! :>
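For what it's worth, the usual reading of that sentence is that QAT shrinks the gap between BF16 and Q4_0 perplexity by 54%, not that perplexity itself drops by that much. A minimal sketch with made-up numbers (nothing here is from the post):

    # Hypothetical illustration of "reduce the perplexity drop by 54%".
    # None of these numbers come from the post; they only show the usual
    # reading: the "drop" is the perplexity gap between the BF16 model and
    # its Q4_0 quantization, and QAT shrinks that gap by 54%.

    ppl_bf16     = 8.00                    # made-up BF16 perplexity (lower is better)
    ppl_naive_q4 = 8.60                    # made-up naive post-training Q4_0 perplexity

    naive_drop = ppl_naive_q4 - ppl_bf16   # 0.60 degradation from naive quantization
    qat_drop   = naive_drop * (1 - 0.54)   # gap cut by 54% -> ~0.28
    ppl_qat_q4 = ppl_bf16 + qat_drop       # ~8.28

    print(f"naive Q4_0: +{naive_drop:.2f} ppl, QAT Q4_0: +{qat_drop:.2f} ppl ({ppl_qat_q4:.2f})")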