(github.com)

1311 points msoad | 1 comments | 31 Mar 23 20:37 UTC

lukev ◴[31 Mar 23 21:05 UTC] No.35393652[source]▶

Has anyone done any comprehensive analysis on exactly how much quantization affects the quality of model output? I haven't seen any more than people running it and being impressed (or not) by a few sample outputs.

I would be very curious about some contrastive benchmarks between a quantized and non-quantized version of the same model.
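For what it's worth, the usual number for this kind of comparison is perplexity: the exponential of the mean negative log-likelihood per token over the same evaluation text. A rough sketch of a quantized vs. non-quantized comparison could look like the following. The log-probabilities are made-up placeholders, not measurements from any model, and `perplexity` is just an illustrative helper name.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token).

    token_logprobs: natural-log probabilities the model assigned to each
    actual next token in the evaluation text.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probs for the SAME text under two variants
# of one model. These values are invented for illustration only.
f16_logprobs = [-1.2, -0.7, -2.1, -0.4]
q4_logprobs = [-1.3, -0.8, -2.2, -0.5]

ppl_f16 = perplexity(f16_logprobs)
ppl_q4 = perplexity(q4_logprobs)

# Lower perplexity is better; the gap is the cost of quantization on this text.
print(f"f16: {ppl_f16:.3f}  q4: {ppl_q4:.3f}  delta: {ppl_q4 - ppl_f16:.3f}")
```

A real comparison would run both variants over the same held-out corpus and the same tokenization, since perplexity numbers are not comparable across different evaluation setups.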

replies(4): >>35393753 #>>35393773 #>>35393898 #>>35394006 #

bakkoting ◴[31 Mar 23 21:14 UTC] No.35393753[source]▶

>>35393652 #

Some results here: https://github.com/ggerganov/llama.cpp/discussions/406

tl;dr quantizing the 13B model gives up about 30% of the improvement you get from moving from 7B to 13B - so quantized 13B is still much better than unquantized 7B. Similar results for the larger models.
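To make the "gives up about 30% of the improvement" arithmetic concrete, here is a small sketch. It assumes "improvement" means the drop in perplexity going from unquantized 7B to unquantized 13B, and the three perplexity values are hypothetical round numbers chosen to illustrate the ratio, not figures taken from the linked discussion.

```python
def fraction_of_gain_lost(ppl_small_f16, ppl_large_f16, ppl_large_quant):
    """Share of the small->large perplexity gain that quantization gives back.

    0.0 means quantization cost nothing; 1.0 means the quantized large model
    is no better than the unquantized small one.
    """
    gain = ppl_small_f16 - ppl_large_f16
    if gain <= 0:
        raise ValueError("larger model must have lower perplexity than smaller one")
    return (ppl_large_quant - ppl_large_f16) / gain

# Hypothetical perplexities (lower is better), for illustration only.
ppl_7b_f16 = 6.0
ppl_13b_f16 = 5.0
ppl_13b_q4 = 5.3

lost = fraction_of_gain_lost(ppl_7b_f16, ppl_13b_f16, ppl_13b_q4)
print(f"fraction of 7B->13B gain lost to quantization: {lost:.0%}")

# The quantized 13B still beats unquantized 7B as long as the fraction is < 1.
print("quantized 13B better than f16 7B:", ppl_13b_q4 < ppl_7b_f16)
```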

replies(1): >>35393937 #

terafo ◴[31 Mar 23 21:30 UTC] No.35393937[source]▶

>>35393753 #

I wonder where the difference between llama.cpp and the repo in [1] comes from. The F16 difference in perplexity is 0.3 on the 7B model, which is not insignificant. The ggml quirks definitely need to be fixed.

[1] https://github.com/qwopqwop200/GPTQ-for-LLaMa

replies(2): >>35394332 #>>35394357 #

gliptic ◴[31 Mar 23 22:09 UTC] No.35394357[source]▶

>>35393937 #

GPTQ-for-LLaMa recently implemented some quantization tricks suggested by the GPTQ authors that improved 7B especially. Maybe llama.cpp hasn't been evaluated with those in place?

Llama.cpp 30B runs with only 6GB of RAM now