Congrats to Apple and Meta; it makes sense they did the research, since this will go toward efficient serving of LLMs on phones. And it's very easy to implement.
"zero-shot accuracy retention at 4- and 3-bit compression to be on par with or better than state-of-the-art methods, while maintaining performance comparable to FP16 baselines."
My reading of that is FP16-level accuracy at Q3/Q4 size and memory bandwidth, which would be a huge advantage.
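For a rough sense of what that buys on the memory-bandwidth side, here's a back-of-the-envelope sketch of the weight footprint at different bit widths (my own arithmetic, not from the paper; it assumes weights dominate memory traffic and ignores per-group scales, seeds/codebooks, and the KV cache):

```python
# Back-of-the-envelope weight footprint at different bit widths.
# Assumes weights dominate memory traffic; ignores per-group scales,
# seed/codebook metadata, and the KV cache.

def weight_gib(num_params: float, bits_per_weight: float) -> float:
    """Total weight storage in GiB for a given bit width."""
    return num_params * bits_per_weight / 8 / 2**30

for name, params in [("LLaMA 3 8B", 8e9), ("LLaMA 3 70B", 70e9)]:
    for bits in (16, 4, 3):
        print(f"{name} @ {bits}-bit: {weight_gib(params, bits):.1f} GiB")
```

Since decoding streams the full weight set once per token, a roughly 4-5x smaller footprint translates fairly directly into tokens/sec on a bandwidth-bound device like a phone.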
* LLaMA 3 8B: baseline 72.26, 4-bit 71.31, 3-bit 62.79
* LLaMA 3 70B: baseline 79.51, 4-bit 78.06, 3-bit 74.68
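Reading those numbers as retention relative to the FP16 baseline (simple arithmetic on the figures quoted above, not taken directly from the paper):

```python
# Accuracy retention relative to the FP16 baseline, computed from the
# numbers quoted above (not taken directly from the paper).
results = {
    "LLaMA 3 8B":  {"baseline": 72.26, "4-bit": 71.31, "3-bit": 62.79},
    "LLaMA 3 70B": {"baseline": 79.51, "4-bit": 78.06, "3-bit": 74.68},
}

for model, scores in results.items():
    base = scores["baseline"]
    for quant in ("4-bit", "3-bit"):
        print(f"{model} {quant}: {100 * scores[quant] / base:.1f}% of baseline")
```

So 4-bit retains roughly 98-99% of baseline accuracy, while 3-bit drops more on the 8B model (~87%) than on the 70B (~94%).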
These results seem comparable to modern quantization methods—for example, the ~4-bit results for smaller LLaMA models listed here: https://ai.meta.com/blog/meta-llama-quantized-lightweight-mo...
It also seems like the two techniques could potentially be combined.