
1311 points by msoad | 1 comment
lukev | No.35393652
Has anyone done any comprehensive analysis on exactly how much quantization affects the quality of model output? I haven't seen any more than people running it and being impressed (or not) by a few sample outputs.

I would be very curious about some contrastive benchmarks between a quantized and non-quantized version of the same model.
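
One way to get such a contrastive benchmark is to run the quantized and full-precision checkpoints over the same held-out text and compare perplexities (exp of the mean per-token cross-entropy). Below is a minimal sketch using Hugging Face transformers; the model names are placeholders and it assumes both checkpoints load as ordinary causal LMs:

    # Compare perplexity of two checkpoints on the same held-out text.
    # Model names below are placeholders, not real repos.
    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def perplexity(model_name: str, text: str, stride: int = 512) -> float:
        tok = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name)
        model.eval()

        ids = tok(text, return_tensors="pt").input_ids
        total_nll, n_tokens = 0.0, 0
        with torch.no_grad():
            for start in range(0, ids.size(1) - 1, stride):
                chunk = ids[:, start : start + stride + 1]
                # with labels=input_ids the model shifts internally and returns
                # the mean cross-entropy over the predicted tokens in the chunk
                loss = model(chunk, labels=chunk).loss
                total_nll += loss.item() * (chunk.size(1) - 1)
                n_tokens += chunk.size(1) - 1
        return math.exp(total_nll / n_tokens)

    text = open("wiki.test.raw").read()                      # any held-out text file
    print("f16 :", perplexity("my-org/llama-7b-f16", text))  # placeholder name
    print("int4:", perplexity("my-org/llama-7b-int4", text)) # placeholder name

Lower perplexity is better; the gap between the two numbers is the quality cost of quantization on that text.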

replies(4): >>35393753 #>>35393773 #>>35393898 #>>35394006 #
1. terafo | No.35393773
For this specific implementation, here's the relevant info from the llama.cpp repo:

Perplexity - model, options
5.5985 - 13B, q4_0
5.9565 - 7B, f16
6.3001 - 7B, q4_1
6.5949 - 7B, q4_0
6.5995 - 7B, q4_0, --memory_f16

According to this repo[1], the difference is about 3% in their implementation with the right group size. If you'd like to know more, I think you should read the GPTQ paper[2].
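
For intuition about what q4_0-style quantization and the "group size" above refer to: weights are split into fixed-size groups, and each group is mapped to 4-bit integers with its own scale. Here is a rough sketch of that idea only; the real ggml and GPTQ formats differ in details such as zero-points and GPTQ's error-compensating updates. Smaller groups spend more bits per weight on scales but track the original values more closely:

    # Rough sketch of group-wise symmetric 4-bit quantization (illustration only).
    import numpy as np

    def quantize_groups(w: np.ndarray, group_size: int = 32):
        """Quantize a 1-D float array to 4-bit ints, one scale per group."""
        w = w.reshape(-1, group_size)
        scales = np.abs(w).max(axis=1, keepdims=True) / 7.0  # map max |w| onto +/-7
        scales[scales == 0] = 1.0                             # guard against all-zero groups
        q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
        return q, scales

    def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
        return (q.astype(np.float32) * scales).reshape(-1)

    rng = np.random.default_rng(0)
    w = rng.standard_normal(4096).astype(np.float32)
    for g in (128, 32):
        q, s = quantize_groups(w, group_size=g)
        err = np.abs(dequantize(q, s) - w).mean()
        print(f"group size {g:4d}: mean abs reconstruction error {err:.4f}")
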

[1] https://github.com/qwopqwop200/GPTQ-for-LLaMa

[2] https://arxiv.org/abs/2210.17323