I would be very curious about some contrastive benchmarks between a quantized and non-quantized version of the same model.
I've been able to compare 4-bit GPTQ, naive int8, LLM.int8, fp16, and fp32. LLM.int8 does impressively well, but inference is 4-5x slower than native fp16.
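Roughly the kind of comparison I mean, sketched with transformers + bitsandbytes: load the same checkpoint in fp16 and in LLM.int8 and measure perplexity on some held-out text. The checkpoint name and text file here are placeholders, and the strided loop is a deliberately crude perplexity estimate, not a proper eval harness.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "decapoda-research/llama-13b-hf"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)

def perplexity(model, text, stride=512):
    # Crude strided perplexity: feed fixed-size chunks and average the NLL.
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    nlls, n_tokens = [], 0
    for i in range(0, input_ids.size(1) - 1, stride):
        chunk = input_ids[:, i : i + stride + 1]
        with torch.no_grad():
            out = model(chunk, labels=chunk)
        nlls.append(out.loss * (chunk.size(1) - 1))
        n_tokens += chunk.size(1) - 1
    return math.exp(torch.stack(nlls).sum() / n_tokens)

text = open("wikitext_sample.txt").read()  # any held-out text

fp16_model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
print("fp16 perplexity:", perplexity(fp16_model, text))
del fp16_model
torch.cuda.empty_cache()

int8_model = AutoModelForCausalLM.from_pretrained(
    model_name, load_in_8bit=True, device_map="auto"  # LLM.int8 via bitsandbytes
)
print("LLM.int8 perplexity:", perplexity(int8_model, text))
```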
Oddly, I recently ran a fork of the model on the ONNX runtime and I'm convinced it performed better than the PyTorch/transformers version; perhaps subtle differences in floating point behavior between kernels on different hardware significantly influence output quality.
The most promising next step in the quantization space IMO has to be fp8: a lot of hardware vendors are adding support, and there are good reasons to believe fp8 will outperform most current quantization schemes [1][2], particularly when combined with quantization-aware training / fine-tuning (I think OpenAI did something similar for GPT-3.5 "turbo").
If anybody is interested, I'm currently working on an open-source fp8 emulation library for PyTorch, hoping to build something equivalent to bitsandbytes (a rough sketch of the core idea is below the references). If you'd like to collaborate, my email is in my profile.
[1] https://arxiv.org/abs/2208.09225
[2] https://arxiv.org/abs/2209.05433
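For a flavor of what "fp8 emulation" means, here is a minimal per-tensor E4M3 fake-quantization sketch in stock PyTorch: scale the tensor into the E4M3 range, round the mantissa to 3 bits, and dequantize. The function name, the per-tensor scaling choice, and the crude flush-to-zero of tiny values are all my own simplifications; a real library needs proper subnormal handling, E5M2 for gradients, and fused kernels.

```python
import torch

E4M3_MAX = 448.0  # largest finite value in the E4M3 (fn) format

def fake_quant_e4m3(x: torch.Tensor) -> torch.Tensor:
    # Per-tensor scale so the largest magnitude maps near the E4M3 max.
    scale = x.abs().amax().clamp(min=1e-12) / E4M3_MAX
    y = (x / scale).clamp(-E4M3_MAX, E4M3_MAX)
    mant, exp = torch.frexp(y)            # y = mant * 2**exp, |mant| in [0.5, 1)
    mant = torch.round(mant * 16) / 16    # keep 3 stored mantissa bits
    y = torch.ldexp(mant, exp)
    # Rough flush-to-zero below the smallest E4M3 subnormal (2**-9).
    y = torch.where(y.abs() < 2.0 ** -9, torch.zeros_like(y), y)
    return y * scale                      # dequantize back to the input dtype

w = torch.randn(4096, 4096)
w_q = fake_quant_e4m3(w)
print("max abs error:", (w - w_q).abs().max().item())
```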
> Llama.cpp 30B
> LLaMA-65B
the "number B" stands for "number of billions" of parameters... trained on?
like you take 65 billion words (from paragraphs / sentences from like, Wikipedia pages or whatever) and "train" the LLM. is that the metric?
why aren't "more parameters" (higher B) always better, i.e. why don't they always return better results?
how many "B" parameters is ChatGPT on GPT3.5 vs GPT4?
GPT3: 175b
GPT3.5: ?
GPT4: ?
https://blog.accubits.com/gpt-3-vs-gpt-3-5-whats-new-in-open...
how is Llama with 13B parameters able to compete with GPT3 with 175B parameters? That's over 10x fewer. How much RAM does it take to run "a single node" of GPT3 / GPT3.5 / GPT4?
No, it's just the size of the network (i.e. the number of learnable parameters). The 13B model was trained on ~1 trillion tokens of training data, and the 30B/65B models on ~1.4 trillion (a token is roughly three-quarters of a word).
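To make the parameters-vs-training-tokens distinction concrete: the "B" counts the learnable weights in the model object itself, and the RAM needed just to hold those weights is roughly parameter_count × bytes_per_parameter. A toy sketch (the tiny transformer below is purely illustrative):

```python
import torch.nn as nn

# Count learnable parameters of a small toy transformer.
toy = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=6,
)
n_params = sum(p.numel() for p in toy.parameters())
print(f"toy model: {n_params / 1e6:.1f}M parameters")

# Same arithmetic for LLaMA-13B (13e9 parameters), weights only,
# ignoring activations and KV cache:
for name, bytes_per_param in [("fp16", 2), ("int8", 1), ("4-bit", 0.5)]:
    print(f"13B in {name}: ~{13e9 * bytes_per_param / 2**30:.0f} GiB")
```

That back-of-the-envelope math is why the 13B model fits on a single consumer GPU in 4-bit, while a 175B-parameter model needs on the order of hundreds of GB even at reduced precision.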