
602 points by emrah | 20 comments
emrah ◴[] No.43743338[source]
Available on ollama: https://ollama.com/library/gemma3
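
For the curious, a minimal sketch of calling it through the official ollama Python client; the QAT tag used here is an assumption, so check the library page for the tags actually published:

    # pip install ollama -- assumes a local Ollama server is already running
    import ollama

    # "gemma3:27b-it-qat" is an assumed tag; see https://ollama.com/library/gemma3 for the real list
    response = ollama.chat(
        model="gemma3:27b-it-qat",
        messages=[{"role": "user", "content": "Explain quantization-aware training in one sentence."}],
    )
    print(response["message"]["content"])
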
replies(2): >>43743657 #>>43743658 #
1. Der_Einzige ◴[] No.43743658[source]
How many times do I have to say this? Ollama, llama.cpp, and many other projects are slower than vLLM/SGLang. vLLM is a much superior inference engine and is fully supported by the only LLM frontend that matters (SillyTavern).

The community getting obsessed with Ollama has done huge damage to the field, as it's inefficient compared to vLLM. Many people could get far more tok/s than they think, if only they knew the right tools.

replies(9): >>43743672 #>>43743695 #>>43743760 #>>43743819 #>>43743824 #>>43743859 #>>43743860 #>>43749101 #>>43753155 #
2. m00dy ◴[] No.43743672[source]
Ollama is definitely not for production loads, but vLLM is.
3. janderson215 ◴[] No.43743695[source]
I did not know this, so thank you. I read a blog post a while back that encouraged using Ollama and never mentioned vLLM. Do you recommend reading any particular resource?
4. Zambyte ◴[] No.43743760[source]
The significant convenience benefits outweigh the higher TPS that vLLM offers in the context of my single-machine homelab GPU server. If I were hosting it for something more critical than just myself and a few friends chatting with it, sure. Being able to just paste a model name into Open WebUI and run it is important to me, though.

It is important to know about both to decide between the two for your use case though.

replies(1): >>43744902 #
5. oezi ◴[] No.43743819[source]
Why is sillytavern the only LLM frontend which matters?
replies(2): >>43744110 #>>43744916 #
6. ach9l ◴[] No.43743824[source]
Instead of ranting, maybe explain how to make a QAT Q4 model work with images in vLLM; AFAIK it is not yet possible.
7. simonw ◴[] No.43743860[source]
Last I looked vLLM didn't work on a Mac.
replies(1): >>43744759 #
8. oezi ◴[] No.43743859[source]
Somebody in this thread mentioned 20.x tok/s on Ollama. What are you seeing with vLLM?
replies(1): >>43743889 #
9. Zambyte ◴[] No.43743889[source]
FWIW I'm getting 29 TPS on Ollama on my 7900 XTX with the 27b QAT model. You can't really compare inference engine to inference engine without keeping the hardware and model fixed.

Unfortunately Ollama and vLLM are therefore incomparable at the moment, because vLLM does not support these models yet.

https://github.com/vllm-project/vllm/issues/16856
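
For reference, the TPS figure above can be reproduced straight from the eval stats Ollama returns in its generate response; a rough sketch (model tag assumed, as above):

    # pip install ollama -- tok/s computed from the eval stats in the response
    import ollama

    # Tag assumed; use whichever 27b QAT tag you actually pulled.
    r = ollama.generate(model="gemma3:27b-it-qat", prompt="Write a haiku about GPUs.")

    # eval_count = generated tokens, eval_duration = generation time in nanoseconds
    tps = r["eval_count"] / (r["eval_duration"] / 1e9)
    print(f"{tps:.1f} tok/s")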

10. GordonS ◴[] No.43744110[source]
I tried SillyTavern a few weeks ago... wow, that is an "interesting" UI! I blundered around for a while, couldn't figure out how to do anything useful... and then installed LM Studio instead.
replies(1): >>43744434 #
11. imtringued ◴[] No.43744434{3}[source]
I personally thought the lorebook feature was quite neat and then quickly gave up on it because I couldn't get it to trigger, ever.

Whatever those keyword things are, they certainly don't seem to be doing any form of RAG.

12. mitjam ◴[] No.43744759[source]
AFAIK vLLM is for concurrent serving with batched inference for higher throughput, not single-user inference. I doubt its throughput on a single prompt at a time is higher than Ollama's. Update: this is a good intro to continuous batching in LLM inference: https://www.anyscale.com/blog/continuous-batching-llm-infere...
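
A rough sketch of the batched workload vLLM is built for, using its offline Python API; the model name is chosen for illustration and any HF model vLLM supports works the same way:

    # pip install vllm -- assumes a supported GPU; model name is illustrative
    from vllm import LLM, SamplingParams

    llm = LLM(model="google/gemma-3-4b-it")
    params = SamplingParams(temperature=0.7, max_tokens=128)

    # All prompts are scheduled together via continuous batching rather than one at a time.
    prompts = [f"Question {i}: why does batching help GPU throughput?" for i in range(64)]
    outputs = llm.generate(prompts, params)
    for out in outputs[:3]:
        print(out.outputs[0].text[:80])
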
replies(1): >>43744907 #
13. Der_Einzige ◴[] No.43744902[source]
Running any HF model on vLLM is as simple as pasting the model name into one command in your terminal.
replies(2): >>43745811 #>>43752405 #
14. Der_Einzige ◴[] No.43744907{3}[source]
It is much faster on single prompts than Ollama; 3x is not unheard of.
15. Der_Einzige ◴[] No.43744916[source]
It supports more samplers and other settings than any other frontend.
16. Zambyte ◴[] No.43745811{3}[source]
What command is it? Because that was not at all my experience.
replies(1): >>43747190 #
17. Der_Einzige ◴[] No.43747190{4}[source]
vllm serve… Hugging Face gives vLLM run instructions for every model on their website.
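
For example, once something like "vllm serve google/gemma-3-4b-it" is running (model name and default port 8000 assumed), any OpenAI-compatible frontend or script can point at it:

    # pip install openai -- assumes a vLLM OpenAI-compatible server on the default port 8000
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
    resp = client.chat.completions.create(
        model="google/gemma-3-4b-it",  # must match the model the server was started with
        messages=[{"role": "user", "content": "Hello from an OpenAI-compatible client."}],
    )
    print(resp.choices[0].message.content)
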
replies(1): >>43747243 #
18. Zambyte ◴[] No.43747243{5}[source]
How do I serve multiple models? I can pick from dozens of models that I have downloaded through Open WebUI.
19. prometheon1 ◴[] No.43749101[source]
From the HN guidelines: https://news.ycombinator.com/newsguidelines.html

> Be kind. Don't be snarky.

> Please don't post shallow dismissals, especially of other people's work.

In my opinion, your comment is not in line with the guidelines. Especially the part about sillytavern being the only LLM frontend that matters. Telling the devs of any LLM frontend except sillytavern that their app doesn't matter seems exactly like a shallow dismissal of other people's work to me.

20. iAMkenough ◴[] No.43752405{3}[source]
I had to build it from source to run on my Mac, and the experimental support doesn't seem to include these latest Gemma 3 QAT models on Apple Silicon.