1311 points msoad | 3 comments
lukev ◴[] No.35393652[source]
Has anyone done a comprehensive analysis of exactly how much quantization affects the quality of model output? I haven't seen anything more than people running it and being impressed (or not) by a few sample outputs.

I would be very curious about some contrastive benchmarks between a quantized and non-quantized version of the same model.

replies(4): >>35393753 #>>35393773 #>>35393898 #>>35394006 #
bakkoting ◴[] No.35393753[source]
Some results here: https://github.com/ggerganov/llama.cpp/discussions/406

tl;dr quantizing the 13B model gives up about 30% of the improvement you get from moving from 7B to 13B - so quantized 13B is still much better than unquantized 7B. Similar results for the larger models.
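
To make that arithmetic concrete, here is a toy calculation with made-up perplexity numbers (the real figures are in the linked discussion); it only shows how "gives up about 30% of the improvement" is computed from three measurements:

  # Toy numbers only -- see the linked discussion for the measured perplexities.
  ppl_7b_f16  = 5.9   # hypothetical full-precision 7B perplexity
  ppl_13b_f16 = 5.2   # hypothetical full-precision 13B perplexity
  ppl_13b_q4  = 5.4   # hypothetical 4-bit quantized 13B perplexity

  gain_from_13b = ppl_7b_f16 - ppl_13b_f16   # 0.7 improvement from the bigger model
  lost_to_quant = ppl_13b_q4 - ppl_13b_f16   # 0.2 given back by quantizing
  print(f"fraction of the gain given up: {lost_to_quant / gain_from_13b:.0%}")  # ~29%
  print(ppl_13b_q4 < ppl_7b_f16)             # True: quantized 13B still beats f16 7B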

replies(1): >>35393937 #
1. terafo ◴[] No.35393937[source]
I wonder where such a difference between llama.cpp and the repo at [1] comes from. The F16 perplexity difference is 0.3 on the 7B model, which is not insignificant. ggml quirks definitely need to be fixed.

[1] https://github.com/qwopqwop200/GPTQ-for-LLaMa

replies(2): >>35394332 #>>35394357 #
2. bakkoting ◴[] No.35394332[source]
I'd guess the GPTQ-for-LLaMa repo is using a larger context size. Poking around, it looks like GPTQ-for-LLaMa specifies 2048 [1] vs. the default 512 for llama.cpp [2]. You can just specify a longer context size on the CLI for llama.cpp if you are OK with the extra memory (see the sketch after the links below).

[1] https://github.com/qwopqwop200/GPTQ-for-LLaMa/blob/934034c8e...

[2] https://github.com/ggerganov/llama.cpp/tree/3525899277d2e2bd...
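
To illustrate why the evaluation context length shifts measured perplexity, here is a rough Python sketch of a chunked perplexity computation using Hugging Face transformers. This is not the evaluation code from either repo, and the checkpoint path and eval file are placeholders; the point is just that with 512-token chunks each chunk starts with no history, so prediction is harder on average than with 2048-token chunks.

  import math
  import torch
  from transformers import AutoModelForCausalLM, AutoTokenizer

  # Placeholder path -- substitute whatever converted LLaMA checkpoint you have locally.
  model_path = "path/to/llama-7b-hf"
  tok = AutoTokenizer.from_pretrained(model_path)
  model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)
  model.eval()

  def perplexity(text, ctx_len):
      # Score the text in non-overlapping windows of ctx_len tokens.
      # Every window starts "cold", so with ctx_len=512 far more tokens are
      # predicted from very little history than with ctx_len=2048, which is
      # enough to move the reported perplexity by a few tenths.
      ids = tok(text, return_tensors="pt").input_ids
      total_nll, total_tokens = 0.0, 0
      for start in range(0, ids.size(1), ctx_len):
          window = ids[:, start:start + ctx_len]
          if window.size(1) < 2:
              break
          with torch.no_grad():
              out = model(window, labels=window)   # loss = mean NLL over the window
          n = window.size(1) - 1                   # first token in a window has no label
          total_nll += out.loss.item() * n
          total_tokens += n
      return math.exp(total_nll / total_tokens)

  text = open("wiki.test.raw").read()              # same eval text for both settings
  print(perplexity(text, 512), perplexity(text, 2048))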

3. gliptic ◴[] No.35394357[source]
GPTQ-for-LLaMa recently implemented some quantization tricks suggested by the GPTQ authors that improved 7B especially. Maybe llama.cpp hasn't been evaluated with those in place?
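
For readers unfamiliar with what is being measured above, here is a minimal, generic sketch of block-wise 4-bit round-to-nearest quantization in NumPy. Neither repo's exact scheme is reproduced here (GPTQ adds error-compensating weight updates on top of ideas like this, and ggml's q4 formats have their own block layout); it only shows the basic trade-off of storing one scale per block plus a 4-bit integer per weight.

  import numpy as np

  def quantize_block(block, nbits=4):
      # Symmetric round-to-nearest: one float scale per block + a small int per weight.
      qmax = 2 ** (nbits - 1) - 1                  # 7 for signed 4-bit
      amax = np.abs(block).max()
      scale = amax / qmax if amax > 0 else 1.0
      q = np.clip(np.round(block / scale), -qmax - 1, qmax).astype(np.int8)
      return q, scale

  def dequantize_block(q, scale):
      return q.astype(np.float32) * scale

  # Toy usage: quantize a weight row in blocks of 32 and check the reconstruction error.
  rng = np.random.default_rng(0)
  w = rng.normal(size=4096).astype(np.float32)
  blocks = w.reshape(-1, 32)
  recon = np.concatenate([dequantize_block(*quantize_block(b)) for b in blocks])
  print("mean abs error:", np.abs(w - recon).mean())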