
602 points | emrah | 1 comment
emrah (No.43743338)
Available on Ollama: https://ollama.com/library/gemma3
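
For anyone who just wants to hit the model once it's pulled, here is a minimal sketch of calling Ollama's local HTTP API from Python. It assumes `ollama serve` is running on the default port 11434 and that the `gemma3:27b` tag is the one you pulled; adjust the tag and prompt as needed.

  # Minimal sketch: query a locally running Ollama server over its HTTP API.
  # Assumes the model was already fetched, e.g. with `ollama pull gemma3:27b`.
  import requests

  resp = requests.post(
      "http://localhost:11434/api/generate",
      json={
          "model": "gemma3:27b",   # adjust to whichever tag you pulled
          "prompt": "Explain KV caching in one paragraph.",
          "stream": False,         # return a single JSON object, not a stream
      },
      timeout=300,
  )
  resp.raise_for_status()
  print(resp.json()["response"])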
replies(2): >>43743657, >>43743658
Der_Einzige (No.43743658)
How many times do I have to say this? Ollama, llama.cpp, and many other projects are slower than vLLM/SGLang. vLLM is a much superior inference engine and is fully supported by the only LLM frontends that matter (e.g. SillyTavern).

The community's obsession with Ollama has done huge damage to the field, as it's inefficient compared to vLLM. Many people could get far more tok/s than they realize if they only used the right tools.
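
For comparison, vLLM's offline Python API looks roughly like this. This is a sketch, not a benchmark setup: the model id below is a placeholder, so swap in whatever Hugging Face checkpoint vLLM supports and your GPU can hold.

  # Rough sketch of vLLM's offline generation API.
  from vllm import LLM, SamplingParams

  # Placeholder model id; use any supported checkpoint that fits your hardware.
  llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
  params = SamplingParams(temperature=0.7, max_tokens=256)

  outputs = llm.generate(["Explain KV caching in one paragraph."], params)
  print(outputs[0].outputs[0].text)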

replies(9): >>43743672, >>43743695, >>43743760, >>43743819, >>43743824, >>43743859, >>43743860, >>43749101, >>43753155
oezi (No.43743859)
Somebody in this thread mentioned 20.x tok/s on Ollama. What are you seeing in vLLM?
replies(1): >>43743889
Zambyte (No.43743889)
FWIW I'm getting 29 TPS on Ollama on my 7900 XTX with the 27B QAT model. You can't really compare one inference engine to another without keeping the hardware and model fixed.
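
If anyone wants to report numbers on the same model and hardware, a rough way to get tok/s is from the metrics Ollama itself returns. A sketch, assuming the eval_count/eval_duration fields documented for /api/generate (they may differ across Ollama versions), with the same placeholder model tag as above:

  # Rough tok/s estimate from Ollama's own response metrics (a sketch).
  import requests

  r = requests.post(
      "http://localhost:11434/api/generate",
      json={"model": "gemma3:27b", "prompt": "Write a haiku about GPUs.", "stream": False},
      timeout=300,
  ).json()

  tokens = r["eval_count"]            # tokens generated
  seconds = r["eval_duration"] / 1e9  # eval_duration is reported in nanoseconds
  print(f"{tokens / seconds:.1f} tok/s")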

Unfortunately, Ollama and vLLM therefore can't be compared at the moment, because vLLM does not support these models yet.

https://github.com/vllm-project/vllm/issues/16856