
602 points emrah | 2 comments
1. miki123211 No.43744691
What would be the best way to deploy this if you're maximizing GPU utilization in a multi-user (API) scenario? Structured output support would be a big plus.

We're working with a GPU-poor organization with very strict data residency requirements, and these models might be exactly what we need.

I would normally say vLLM, but the blog post notably does not mention vLLM support.

replies(1): >>43747210 #
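
Assuming vLLM does support these models (the reply below says Gemma 3 is listed), a minimal sketch of the structured-output side using vLLM's offline Python API with guided decoding; the model ID and schema here are illustrative assumptions, not from the thread:

```python
# Sketch: schema-constrained generation with vLLM's offline API.
# "google/gemma-3-27b-it" is an assumed Hugging Face model ID.
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

# Illustrative JSON schema the output must conform to.
schema = {
    "type": "object",
    "properties": {"sentiment": {"type": "string", "enum": ["pos", "neg"]}},
    "required": ["sentiment"],
}

llm = LLM(model="google/gemma-3-27b-it")
params = SamplingParams(
    max_tokens=64,
    # Constrain decoding so the model can only emit schema-valid JSON.
    guided_decoding=GuidedDecodingParams(json=schema),
)
outputs = llm.generate(["Classify: 'Great model, runs fast.'"], params)
print(outputs[0].outputs[0].text)
```
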
2. PhilippGille No.43747210
vLLM lists Gemma 3 as supported, if I'm not mistaken: https://docs.vllm.ai/en/latest/models/supported_models.html#...
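
If that support holds, a hedged sketch of the multi-user serving setup the parent asked about: vLLM's OpenAI-compatible server batches concurrent requests (continuous batching) to keep the GPU busy, and its endpoint accepts a `guided_json` extra for structured output. The model ID, port, and schema are illustrative assumptions:

```python
# Launch the server in a separate process first, e.g.:
#   vllm serve google/gemma-3-27b-it --port 8000
# Any OpenAI-compatible client can then hit it; vLLM's server accepts
# a "guided_json" extra for schema-constrained output.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="google/gemma-3-27b-it",  # assumed model ID, must match the server
    messages=[{"role": "user", "content": "Classify: 'Great model.'"}],
    extra_body={"guided_json": {
        "type": "object",
        "properties": {"sentiment": {"type": "string"}},
        "required": ["sentiment"],
    }},
)
print(resp.choices[0].message.content)
```

The server process owns the GPU and interleaves all clients' requests, which is what drives utilization up in the multi-user case; the per-request schema constraint means each API consumer can demand its own structured format.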