
602 points emrah | 2 comments
1. miki123211 No.43744691
What would be the best way to deploy this if you're maximizing GPU utilization in a multi-user (API) scenario? Structured output support would be a big plus.

We're working with a GPU-poor organization with very strict data residency requirements, and these models might be exactly what we need.

I would normally say vLLM, but the blog post notably does not mention vLLM support.

replies(1): >>43747210 #
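
Assuming vLLM does support these models (the reply below says Gemma 3 is listed), a minimal sketch of the structured-output side using vLLM's offline Python API with guided decoding; the model ID and schema here are illustrative assumptions, not from the thread:

```python
# Sketch: schema-constrained generation with vLLM's offline API.
# "google/gemma-3-27b-it" is an assumed Hugging Face model ID.
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

# Illustrative JSON schema the output must conform to.
schema = {
    "type": "object",
    "properties": {"sentiment": {"type": "string", "enum": ["pos", "neg"]}},
    "required": ["sentiment"],
}

llm = LLM(model="google/gemma-3-27b-it")
params = SamplingParams(
    max_tokens=64,
    # Constrain decoding so the model can only emit schema-valid JSON.
    guided_decoding=GuidedDecodingParams(json=schema),
)
outputs = llm.generate(["Classify: 'Great model, runs fast.'"], params)
print(outputs[0].outputs[0].text)
```
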
2. PhilippGille No.43747210
vLLM lists Gemma 3 as supported, if I'm not mistaken: https://docs.vllm.ai/en/latest/models/supported_models.html#...
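
If that support holds, a hedged sketch of the multi-user serving setup the parent asked about: vLLM's OpenAI-compatible server batches concurrent requests (continuous batching) to keep the GPU busy, and its endpoint accepts a `guided_json` extra for structured output. The model ID, port, and schema are illustrative assumptions:

```python
# Launch the server in a separate process first, e.g.:
#   vllm serve google/gemma-3-27b-it --port 8000
# Any OpenAI-compatible client can then hit it; vLLM's server accepts
# a "guided_json" extra for schema-constrained output.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="google/gemma-3-27b-it",  # assumed model ID, must match the server
    messages=[{"role": "user", "content": "Classify: 'Great model.'"}],
    extra_body={"guided_json": {
        "type": "object",
        "properties": {"sentiment": {"type": "string"}},
        "required": ["sentiment"],
    }},
)
print(resp.choices[0].message.content)
```

The server process owns the GPU and interleaves all clients' requests, which is what drives utilization up in the multi-user case; the per-request schema constraint means each API consumer can demand its own structured format.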