
1311 points by msoad | 1 comment
lukev | No.35393652
Has anyone done any comprehensive analysis on exactly how much quantization affects the quality of model output? I haven't seen any more than people running it and being impressed (or not) by a few sample outputs.

I would be very curious about some contrastive benchmarks between a quantized and non-quantized version of the same model.
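
One way to get such a contrastive benchmark is to run the quantized and full-precision checkpoints over the same held-out text and compare perplexities (exp of the mean per-token cross-entropy). Below is a minimal sketch using Hugging Face transformers; the model names are placeholders and it assumes both checkpoints load as ordinary causal LMs:

    # Compare perplexity of two checkpoints on the same held-out text.
    # Model names below are placeholders, not real repos.
    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def perplexity(model_name: str, text: str, stride: int = 512) -> float:
        tok = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name)
        model.eval()

        ids = tok(text, return_tensors="pt").input_ids
        total_nll, n_tokens = 0.0, 0
        with torch.no_grad():
            for start in range(0, ids.size(1) - 1, stride):
                chunk = ids[:, start : start + stride + 1]
                # with labels=input_ids the model shifts internally and returns
                # the mean cross-entropy over the predicted tokens in the chunk
                loss = model(chunk, labels=chunk).loss
                total_nll += loss.item() * (chunk.size(1) - 1)
                n_tokens += chunk.size(1) - 1
        return math.exp(total_nll / n_tokens)

    text = open("wiki.test.raw").read()                      # any held-out text file
    print("f16 :", perplexity("my-org/llama-7b-f16", text))  # placeholder name
    print("int4:", perplexity("my-org/llama-7b-int4", text)) # placeholder name

Lower perplexity is better; the gap between the two numbers is the quality cost of quantization on that text.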

replies(4): >>35393753 #>>35393773 #>>35393898 #>>35394006 #
1. terafo | No.35393773
For this specific implementation, here's the relevant info from the llama.cpp repo:

Perplexity - model, options
5.5985 - 13B, q4_0
5.9565 - 7B, f16
6.3001 - 7B, q4_1
6.5949 - 7B, q4_0
6.5995 - 7B, q4_0, --memory_f16

According to this repo[1], the difference is about 3% in their implementation with the right group size. If you'd like to know more, I think you should read the GPTQ paper[2].
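
For intuition about what q4_0-style quantization and the "group size" above refer to: weights are split into fixed-size groups, and each group is mapped to 4-bit integers with its own scale. Here is a rough sketch of that idea only; the real ggml and GPTQ formats differ in details such as zero-points and GPTQ's error-compensating updates. Smaller groups spend more bits per weight on scales but track the original values more closely:

    # Rough sketch of group-wise symmetric 4-bit quantization (illustration only).
    import numpy as np

    def quantize_groups(w: np.ndarray, group_size: int = 32):
        """Quantize a 1-D float array to 4-bit ints, one scale per group."""
        w = w.reshape(-1, group_size)
        scales = np.abs(w).max(axis=1, keepdims=True) / 7.0  # map max |w| onto +/-7
        scales[scales == 0] = 1.0                             # guard against all-zero groups
        q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
        return q, scales

    def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
        return (q.astype(np.float32) * scales).reshape(-1)

    rng = np.random.default_rng(0)
    w = rng.standard_normal(4096).astype(np.float32)
    for g in (128, 32):
        q, s = quantize_groups(w, group_size=g)
        err = np.abs(dequantize(q, s) - w).mean()
        print(f"group size {g:4d}: mean abs reconstruction error {err:.4f}")
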

[1] https://github.com/qwopqwop200/GPTQ-for-LLaMa

[2] https://arxiv.org/abs/2210.17323