It's comparable in speed to Apache TVM's Vulkan backend on CUDA hardware; see https://github.com/mlc-ai/mlc-llm
But honestly, the biggest advantage of llama.cpp for me is being able to split a model so efficiently. My puny 16GB laptop can just barely, but very practically, run LLaMA 30B at almost 3 tokens/s, and do it right now. That is crazy!
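For anyone who wants to try something similar, here is a minimal sketch using the llama-cpp-python bindings; the model path, thread count, and context size are placeholder guesses, not the poster's actual setup:

```python
# Minimal sketch using llama-cpp-python (pip install llama-cpp-python).
# The path and parameters below are illustrative, not the setup from the comment above.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-30b.q4_0.gguf",  # hypothetical path to a Q4_0 quantized 30B model
    n_ctx=512,      # small context window keeps the KV cache footprint down
    n_threads=8,    # roughly match your physical core count
    use_mmap=True,  # the default: the OS pages weights in from disk as needed
)

out = llm("Q: Name the planets in the solar system. A:", max_tokens=64, stop=["Q:"])
print(out["choices"][0]["text"])
```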
Please tell me your config! I have an i9-10900 with 32GB of RAM that only gets 0.7 tokens/s on a 30B model.
I'm running on Windows using koboldcpp; maybe it's faster on Linux?
I think it was originally designed to be easily embeddable—and most importantly, native code (i.e. not Python)—rather than competitive with GPUs.
I think it's just starting to get into GPU support now, but carefully.
That's correct, yeah. Q4_0 should give you the smallest and fastest quantized model.
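Rough numbers for why a Q4_0 30B sits right at the edge of 16 to 32 GB of RAM (my own back-of-envelope, treating Q4_0 as roughly 4.5 bits per weight):

```python
# Back-of-envelope size of a Q4_0 quantized LLaMA 30B (my arithmetic, not from the thread).
# Q4_0 packs 32 weights into 16 bytes of 4-bit values plus a 2-byte fp16 scale,
# i.e. 18 bytes per 32 weights, or about 4.5 bits per weight.
params = 32.5e9                 # LLaMA "30B" actually has roughly 32.5B parameters
bits_per_weight = 4.5
size_gb = params * bits_per_weight / 8 / 1e9
print(f"~{size_gb:.0f} GB")     # ~18 GB, hence the paging juggling on a 16 GB machine
```

The real files come out a bit larger once you add non-quantized tensors and overhead, but the order of magnitude explains why 32 GB is comfortable and 16 GB is only just enough.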
> I'm running on Windows using koboldcpp; maybe it's faster on Linux?
Possibly. You could try using WSL to test—I think both WSL1 and WSL2 are faster than Windows (but WSL1 should be faster than WSL2).
Like I said, very modest.