Llama.cpp 30B runs with only 6GB of RAM now

1. lukev ◴[31 Mar 23 21:05 UTC] No.35393652[source]▶

Has anyone done any comprehensive analysis on exactly how much quantization affects the quality of model output? I haven't seen any more than people running it and being impressed (or not) by a few sample outputs.

I would be very curious about some contrastive benchmarks between a quantized and non-quantized version of the same model.

replies(4): >>35393753 #>>35393773 #>>35393898 #>>35394006 #

2. bakkoting ◴[31 Mar 23 21:14 UTC] No.35393753[source]▶

>>35393652 (TP) #

Some results here: https://github.com/ggerganov/llama.cpp/discussions/406

tl;dr quantizing the 13B model gives up about 30% of the improvement you get from moving from 7B to 13B - so quantized 13B is still much better than unquantized 7B. Similar results for the larger models.

replies(1): >>35393937 #

3. terafo ◴[31 Mar 23 21:16 UTC] No.35393773[source]▶

>>35393652 (TP) #

For this specific implementation, here's the info from the llama.cpp repo:

Perplexity - model options

5.5985 - 13B, q4_0

5.9565 - 7B, f16

6.3001 - 7B, q4_1

6.5949 - 7B, q4_0

6.5995 - 7B, q4_0, --memory_f16

According to this repo[1], the difference is about 3% in their implementation with the right group size. If you'd like to know more, I think you should read the GPTQ paper[2].

[1] https://github.com/qwopqwop200/GPTQ-for-LLaMa

[2] https://arxiv.org/abs/2210.17323
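
To put the llama.cpp perplexity figures quoted above on the same footing as the "about 3%" figure, here is a minimal sketch that expresses each one as a relative change against 7B f16. The choice of 7B f16 as the baseline and the helper name `relative_increase` are my own assumptions for illustration; the numbers are copied from the list above.

```python
# Perplexity figures copied from the list above (lower is better).
perplexity = {
    "13B q4_0": 5.5985,
    "7B f16": 5.9565,
    "7B q4_1": 6.3001,
    "7B q4_0": 6.5949,
}


def relative_increase(baseline: float, value: float) -> float:
    """Return how much larger `value` is than `baseline`, as a percentage."""
    return (value - baseline) / baseline * 100.0


# Baseline choice (7B f16) is an assumption made for this comparison.
baseline = perplexity["7B f16"]
for name, ppl in perplexity.items():
    change = relative_increase(baseline, ppl)
    print(f"{name:>9}: {ppl:.4f} ({change:+.1f}% vs 7B f16)")
```

On these numbers 7B q4_0 comes out roughly 11% worse than 7B f16, while 13B q4_0 is about 6% better, so the gap to the ~3% reported for [1] is noticeable.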

4. corvec ◴[31 Mar 23 21:26 UTC] No.35393898[source]▶

>>35393652 (TP) #

Define "comprehensive?"

There are some benchmarks here: https://www.reddit.com/r/LocalLLaMA/comments/1248183/i_am_cu... and here: https://nolanoorg.substack.com/p/int-4-llama-is-not-enough-i...

Check out the GPTQ paper, which has some benchmarks: https://arxiv.org/pdf/2210.17323.pdf and this paper, which also has benchmarks and explains how they determined that 4-bit quantization is optimal compared to 3-bit: https://arxiv.org/pdf/2212.09720.pdf

I also think the discussion of that second paper here is interesting, though it doesn't have its own benchmarks: https://github.com/oobabooga/text-generation-webui/issues/17...

5. terafo ◴[31 Mar 23 21:30 UTC] No.35393937[source]▶

>>35393753 #

I wonder where the difference between llama.cpp and the [1] repo comes from. The F16 difference in perplexity is .3 on the 7B model, which is not insignificant. ggml quirks definitely need to be fixed.

[1] https://github.com/qwopqwop200/GPTQ-for-LLaMa

replies(2): >>35394332 #>>35394357 #

6. mlgoatherder ◴[31 Mar 23 21:37 UTC] No.35394006[source]▶

>>35393652 (TP) #

I've done some experiments here with Llama 13B. In my subjective experience the original fp16 model is significantly better (particularly on coding tasks). There are a bunch of synthetic benchmarks such as wikitext2 PPL, and all the whiz-bang quantization schemes seem to score well, but subjectively something is missing.

I've been able to compare 4 bit GPTQ, naive int8, LLM.int8, fp16, and fp32. LLM.int8 does impressively well but inference is 4-5x slower than native fp16.

Oddly, I recently ran a fork of the model on the ONNX runtime, and I'm convinced that the model performed better than with pytorch/transformers. Perhaps subtle differences in floating point behavior between kernels on different hardware significantly influence performance.

The most promising next step in the quantization space IMO has to be fp8. A lot of hardware vendors are adding support, and there are a lot of reasons to believe fp8 will outperform most current quantization schemes [1][2], particularly when combined with quantization aware training / fine tuning (I think OpenAI did something similar for GPT3.5 "turbo").

I'm currently working on an open source fp8 emulation library for pytorch, hoping to build something equivalent to bitsandbytes. If you are interested in collaborating, my email is in my profile.

[1] https://arxiv.org/abs/2208.09225

[2] https://arxiv.org/abs/2209.05433

replies(2): >>35395400 #>>35404063 #

7. bakkoting ◴[31 Mar 23 22:06 UTC] No.35394332{3}[source]▶

>>35393937 #

I'd guess the GPTQ-for-LLaMa repo is using a larger context size. Poking around, it looks like GPTQ-for-LLaMa is specifying 2048 [1] vs the default 512 for llama.cpp [2]. You can just specify a longer size on the CLI for llama.cpp if you are OK with the extra memory.

[1] https://github.com/qwopqwop200/GPTQ-for-LLaMa/blob/934034c8e...

[2] https://github.com/ggerganov/llama.cpp/tree/3525899277d2e2bd...
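
For intuition on why the context size would matter here, a rough sketch follows. It assumes a simplified evaluation in which the text is cut into non-overlapping windows of `n_ctx` tokens and every token in a window is scored; I have not checked that either repo does exactly this, and both function names are hypothetical. Under that assumption, a larger window means the average token is predicted with more preceding text available.

```python
import math


def perplexity(token_logprobs):
    """exp of the negative mean natural-log probability of the scored tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def mean_visible_context(n_ctx: int) -> float:
    """Average number of preceding tokens a token can see.

    Simplifying assumption: non-overlapping windows of n_ctx tokens, with
    every position scored, so the token at position i sees i earlier tokens.
    """
    return sum(range(n_ctx)) / n_ctx


for n_ctx in (512, 2048):
    avg = mean_visible_context(n_ctx)
    print(f"n_ctx={n_ctx}: mean visible context = {avg:.1f} tokens")

# Hypothetical log-probabilities, only to show the formula in use:
# a model that gives every token probability 0.25.
print(perplexity([math.log(0.25)] * 4))
```

Tokens near the start of each window are the hardest to predict, and a 512-token window has four times as many window starts per document as a 2048-token one, so some perplexity gap from this setting alone would not be surprising.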

8. gliptic ◴[31 Mar 23 22:09 UTC] No.35394357{3}[source]▶

>>35393937 #

GPTQ-for-LLaMa recently implemented some quantization tricks suggested by the GPTQ authors that improved 7B especially. Maybe llama.cpp hasn't been evaluated with those in place?

9. fallous ◴[31 Mar 23 23:54 UTC] No.35395400[source]▶

>>35394006 #

Isn't that related to architecture? The most recent GPUs and tensor procs have native support for 4-bit (partially) and 8-bit int, whereas older GPUs take noticeable performance hits for 8-bit vs fp16/32.

replies(1): >>35398397 #

10. mlgoatherder ◴[01 Apr 23 08:25 UTC] No.35398397{3}[source]▶

>>35395400 #

Ah, but LLM.int8 (e.g. as in huggingface transformers) isn't actually int8; it's a mixed precision encoding scheme that is nominally eight bits per parameter. This means custom CUDA kernels etc. These kernels could be improved, but without hardware support it's always going to be slow.

Straight int8 quantization generally does not work for post-training quantization of transformers. The distribution of weights includes a significant number of outlier values that seem to be important to model performance. Apparently quantization aware training can improve things significantly, but I haven't seen any developments for llama yet.
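
A toy illustration of the outlier problem, as a minimal sketch: it assumes "straight int8" means symmetric per-tensor absmax quantization (one scale for the whole tensor), and the weights are made-up values rather than anything taken from a real model. A single large value stretches the scale, so the many small weights collapse onto a handful of integer levels.

```python
def quantize_absmax_int8(weights):
    """Symmetric per-tensor int8: scale by the largest magnitude, round, clamp."""
    scale = max(abs(w) for w in weights) / 127.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 codes back to approximate float weights."""
    return [q * scale for q in quantized]


def mean_abs_error(original, restored):
    """Average absolute reconstruction error over paired values."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Hypothetical toy weights: 21 small values from -0.10 to 0.10.
small = [0.01 * i for i in range(-10, 11)]
# The same weights plus one large outlier.
with_outlier = small + [8.0]

for label, weights in (("no outlier", small), ("one outlier", with_outlier)):
    codes, scale = quantize_absmax_int8(weights)
    restored = dequantize(codes, scale)
    # Measure error on the small weights only, so both runs are comparable.
    err = mean_abs_error(weights[:len(small)], restored[:len(small)])
    print(f"{label:>11}: scale={scale:.5f}  error on small weights={err:.5f}")
```

Schemes like LLM.int8 get around this by handling the outliers separately at higher precision, which is part of why they need custom kernels.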

Interestingly on the 4 bit front, NVIDIA has chosen to remove int4 support from the next gen Hopper series. I'm not sure folks realize the industry has already moved on. FP8 feels like a bit of a hack, but I like it!

11. MuffinFlavored ◴[01 Apr 23 21:06 UTC] No.35404063[source]▶

>>35394006 #

> Llama 13B

> Llama.cpp 30B

> LLaMA-65B

the "number B" stands for "number of billions" of parameters... trained on?

like you take 65 billion words (from paragraphs / sentences from like, Wikipedia pages or whatever) and "train" the LLM. is that the metric?

why aren't "more parameters" (higher B) always better? aka return better results

how many "B" parameters is ChatGPT on GPT3.5 vs GPT4?

GPT3: 175b

GPT3.5: ?

GPT4: ?

https://blog.accubits.com/gpt-3-vs-gpt-3-5-whats-new-in-open...

how is Llama with 13B parameters able to compete with GPT3 with 175B parameters? That's 10x+ fewer. How much RAM does it take to run "a single node" of GPT3 / GPT3.5 / GPT4?

replies(1): >>35410751 #

12. turmeric_root ◴[02 Apr 23 14:20 UTC] No.35410751{3}[source]▶

>>35404063 #

> the "number B" stands for "number of billions" of parameters... trained on?

No, it's just the size of the network (i.e. number of learnable parameters). The 13/30/65B models were each trained on ~1.4 trillion tokens of training data (each token is around half a word).
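
On the RAM question, a back-of-the-envelope sketch: the weights alone take roughly (number of parameters) x (bits per parameter) / 8 bytes. This is an approximation under my own assumptions. It treats the "B" labels as exact parameter counts, ignores the per-block scale factors that real 4-bit formats store alongside the weights, and ignores activations and the KV cache. The function name is hypothetical.

```python
def weight_memory_gb(n_params_billion: float, bits_per_param: float) -> float:
    """Approximate size of the weights alone, in GB (10**9 bytes)."""
    n_params = n_params_billion * 1e9
    return n_params * bits_per_param / 8 / 1e9


# Nominal model sizes mentioned in the thread (13/30/65B LLaMA, 175B GPT3).
for size in (13, 30, 65, 175):
    f16 = weight_memory_gb(size, 16)
    q4 = weight_memory_gb(size, 4)
    print(f"{size:>4}B params: f16 ~ {f16:6.1f} GB, 4-bit ~ {q4:6.1f} GB")
```

So a 13B model is about 26 GB of weights at fp16 and about 6.5 GB at 4 bits, while 175B parameters at fp16 would be around 350 GB before counting anything else.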