
602 points by emrah | 5 comments
emrah No.43743338
Available on ollama: https://ollama.com/library/gemma3
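
A quick way to try it (the 12b tag is just one of the sizes listed on that page):

    ollama run gemma3:12b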
replies(2): >>43743657, >>43743658
Der_Einzige No.43743658
How many times do I have to say this? Ollama, llama.cpp, and many other projects are slower than vLLM/SGLang. vLLM is a far superior inference engine and is fully supported by the only LLM frontend that matters (SillyTavern).

The community's obsession with Ollama has done huge damage to the field, as it's inefficient compared to vLLM. Many people could get far more tok/s than they realize if only they knew the right tools.

replies(9): >>43743672, >>43743695, >>43743760, >>43743819, >>43743824, >>43743859, >>43743860, >>43749101, >>43753155
Zambyte No.43743760
For my single-machine homelab GPU server, the convenience benefits outweigh the higher TPS that vLLM offers. If I were hosting it for something more critical than just myself and a few friends chatting with it, sure. Being able to just paste a model name into Open WebUI and run it is important to me, though.

It is important to know about both so you can decide between them for your use case, though.

replies(1): >>43744902
1. Der_Einzige No.43744902
Running any HF model on vLLM is as simple as pasting the model name into one command in your terminal.
replies(2): >>43745811, >>43752405
2. Zambyte No.43745811
What command is it? Because that was not at all my experience.
replies(1): >>43747190
3. Der_Einzige No.43747190
vllm serve… — Hugging Face gives run instructions with vLLM for every model on their website.
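
For example (the model ID is just a placeholder; any supported HF model works the same way):

    vllm serve google/gemma-3-27b-it

That starts an OpenAI-compatible server on localhost:8000 by default, which frontends like Open WebUI can point at.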
replies(1): >>43747243
4. Zambyte No.43747243
How do I serve multiple models? I can pick from dozens of models that I have downloaded through Open WebUI.
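
As far as I can tell, vLLM serves one model per process, so the closest equivalent seems to be running one instance per model on separate ports (model IDs and ports below are just examples):

    vllm serve google/gemma-3-12b-it --port 8001
    vllm serve mistralai/Mistral-7B-Instruct-v0.3 --port 8002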
5. iAMkenough No.43752405
Had to build it from source to run on my Mac, and the experimental support doesn't seem to include these latest Gemma 3 QAT models on Apple Silicon.
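
Roughly the steps, for anyone curious (per vLLM's CPU build docs at the time; exact file names vary between versions):

    git clone https://github.com/vllm-project/vllm.git
    cd vllm
    pip install -r requirements-cpu.txt  # CPU-only deps; path differs in newer versions
    pip install -e .  # builds the experimental CPU/Apple Silicon backend from source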