(developers.googleblog.com)

602 points emrah | 1 comments | 20 Apr 25 12:22 UTC

emrah ◴[20 Apr 25 12:22 UTC] No.43743338[source]▶

Available on ollama: https://ollama.com/library/gemma3

Der_Einzige ◴[20 Apr 25 13:32 UTC] No.43743658[source]▶

How many times do I have to say this? Ollama, llama.cpp, and many other projects are slower than vLLM/SGLang. vLLM is a much superior inference engine and is fully supported by the only LLM frontend that matters (SillyTavern).

The community's obsession with Ollama has done huge damage to the field, as it's inefficient compared to vLLM. Many people could get far more tok/s than they think if only they knew the right tools.

replies(9): >>43743672 #>>43743695 #>>43743760 #>>43743819 #>>43743824 #>>43743859 #>>43743860 #>>43749101 #>>43753155 #

simonw ◴[20 Apr 25 14:07 UTC] No.43743860[source]▶

>>43743658 #

Last I looked vLLM didn't work on a Mac.

replies(1): >>43744759 #

mitjam ◴[20 Apr 25 16:26 UTC] No.43744759[source]▶

>>43743860 #

Afaik vLLM is for concurrent serving with batched inference for higher throughput, not single-user inference. I doubt its throughput with a single prompt at a time is higher than Ollama's. Update: this is a good intro to continuous batching in LLM inference: https://www.anyscale.com/blog/continuous-batching-llm-infere...
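
To make the batching point concrete, here is a toy sketch, not vLLM's actual scheduler. It assumes every running request emits one token per decode step, ignores prefill and memory limits, and uses hypothetical function names. It only counts decode steps under static batching (a batch runs until its longest request finishes) versus continuous batching (a freed slot is refilled right away).

```python
def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request is done."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # Short requests sit idle until the longest one in the batch finishes.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished request's slot is refilled immediately."""
    waiting = list(lengths)  # tokens still to generate, per queued request
    running = []             # tokens still to generate, per active request
    steps = 0
    while waiting or running:
        # Fill any free slots from the queue before the next decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1
        # One token per active request; drop the ones that just finished.
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


if __name__ == "__main__":
    mixed = [2, 8, 2, 8]  # hypothetical output lengths in tokens
    print(static_batching_steps(mixed, batch_size=2))
    print(continuous_batching_steps(mixed, batch_size=2))
    # A single request: both schedulers take the same number of steps.
    print(static_batching_steps([5], batch_size=4))
    print(continuous_batching_steps([5], batch_size=4))
```

In this toy model the scheduling gain only appears with several concurrent requests of different lengths; with one prompt at a time the step counts are identical. It says nothing about per-step kernel speed, which is a separate question.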

replies(1): >>43744907 #

Der_Einzige ◴[20 Apr 25 16:49 UTC] No.43744907[source]▶

>>43744759 #

It is much faster on single prompts than Ollama. 3x is not unheard of.

Gemma 3 QAT Models: Bringing AI to Consumer GPUs