
1311 points by msoad | 1 comment
lukev No.35393652
Has anyone done a comprehensive analysis of exactly how much quantization affects the quality of model output? I haven't seen anything more rigorous than people running it and being impressed (or not) by a few sample outputs.

I would be very curious to see contrastive benchmarks between a quantized and non-quantized version of the same model, even something as crude as a perplexity comparison along the lines of the sketch below.
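
(For concreteness, here is a minimal sketch of the kind of comparison I mean. It uses naive per-tensor round-to-nearest quantization on GPT-2 as a stand-in; real 4-bit schemes like GPTQ or llama.cpp's block-wise quantization are more sophisticated, so treat the numbers as illustrative only.)

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def perplexity(model, tokenizer, text):
        # Token-level perplexity of `text` under `model`.
        enc = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])
        return torch.exp(out.loss).item()

    def quantize_rtn(model, bits=4):
        # Naive symmetric per-tensor round-to-nearest weight quantization.
        # (Not GPTQ: no calibration data, no group-wise scales.)
        qmax = 2 ** (bits - 1) - 1
        for p in model.parameters():
            if p.dim() < 2:
                continue  # skip biases and layernorm vectors
            scale = p.abs().max() / qmax
            if scale == 0:
                continue
            p.data = (p.data / scale).round().clamp(-qmax - 1, qmax) * scale
        return model

    tok = AutoTokenizer.from_pretrained("gpt2")
    text = "Quantization trades precision for memory; the question is how much quality it costs."

    fp = AutoModelForCausalLM.from_pretrained("gpt2").eval()
    print("fp32 ppl:", perplexity(fp, tok, text))

    q4 = quantize_rtn(AutoModelForCausalLM.from_pretrained("gpt2").eval(), bits=4)
    print("int4 ppl:", perplexity(q4, tok, text))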

corvec No.35393898
Define "comprehensive"?

There are some benchmarks here: https://www.reddit.com/r/LocalLLaMA/comments/1248183/i_am_cu... and here: https://nolanoorg.substack.com/p/int-4-llama-is-not-enough-i...

Check out the GPTQ paper, which has some benchmarks: https://arxiv.org/pdf/2210.17323.pdf and the k-bit inference scaling laws paper, which also has benchmarks and explains how the authors determined that 4-bit quantization is generally optimal compared to 3-bit for a fixed memory budget: https://arxiv.org/pdf/2212.09720.pdf
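
(As a toy illustration of the bit-width tradeoff that paper measures properly with perplexity and end-task accuracy: here is a rough sketch of how round-to-nearest weight error grows as the bit width drops. The tensor is random rather than a real weight matrix, so it only shows the shape of the curve, not the papers' actual results.)

    import torch

    def rtn_mse(w, bits):
        # Mean-squared error of symmetric per-tensor round-to-nearest quantization.
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max() / qmax
        wq = (w / scale).round().clamp(-qmax - 1, qmax) * scale
        return ((w - wq) ** 2).mean().item()

    torch.manual_seed(0)
    w = torch.randn(4096, 4096)  # stand-in for one transformer weight matrix
    for bits in (8, 4, 3, 2):
        print(f"{bits}-bit RTN MSE: {rtn_mse(w, bits):.6f}")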

I also think the discussion of that second paper here is interesting, though it doesn't have its own benchmarks: https://github.com/oobabooga/text-generation-webui/issues/17...