
602 points emrah | 4 comments
trebligdivad ◴[] No.43744014[source]
It seems pretty impressive - I'm running it on my CPU (16-core AMD 3950x) and it's very, very impressive at translation, and the image description is impressive as well. I'm getting about 2.3 tokens/s on it (compared to under 1/s on the Calme-3.2 I was previously using). It does tend to be a bit chatty unless you tell it not to be; it'll give you a 'breakdown' of pretty much everything unless told otherwise - so for translation my prompt is 'Translate the input to English, only output the translation' to stop it giving a breakdown of the input language.
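
If you want to script that prompt rather than type it each time, here's a minimal sketch against an OpenAI-compatible local endpoint; the port, model name, and example input are assumptions - point it at whatever server (e.g. llama.cpp's llama-server or Ollama) is actually hosting the model:

    from openai import OpenAI

    # Assumed local server; adjust base_url and model name to your setup.
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

    resp = client.chat.completions.create(
        model="gemma-3-27b-it",  # placeholder model id
        messages=[
            # The system message suppresses the chatty 'breakdown' behaviour.
            {"role": "system",
             "content": "Translate the input to English, only output the translation"},
            {"role": "user", "content": "Bonjour, comment ça va ?"},  # example input
        ],
    )
    print(resp.choices[0].message.content)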
replies(2): >>43744070 #>>43747653 #
Havoc ◴[] No.43747653[source]
The upcoming Qwen3 series is supposed to be MoE... likely to give better tok/s on CPU
replies(1): >>43748355 #
1. slekker ◴[] No.43748355[source]
What's MoE?
replies(2): >>43748381 #>>43749736 #
2. zamalek ◴[] No.43748381[source]
Mixture of Experts. Very broadly speaking, there are a bunch of mini networks (experts) which can be independently activated.
3. Havoc ◴[] No.43749736[source]
Mixture of experts, like the other guy said - everything gets loaded into memory, but not every byte is needed to generate a token (unlike classic dense LLMs like Gemma).

So for devices that have lots of memory but weaker processing power, it can get you similar output quality but faster. It tends to do better on CPU- and APU-style setups.
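
Roughly what that looks like in code - a toy sketch (shapes, sizes, and routing details are made up for illustration) where a small gating network picks the top-k experts per token, so only a fraction of the weight bytes are actually read on each forward pass:

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, d_ff, n_experts, top_k = 64, 256, 8, 2

    # One weight matrix per expert; all of them sit in memory...
    experts = [rng.standard_normal((d_model, d_ff)) * 0.02 for _ in range(n_experts)]
    gate_w = rng.standard_normal((d_model, n_experts)) * 0.02

    def moe_layer(x):
        # x: (d_model,) hidden state for a single token
        logits = x @ gate_w                     # router score per expert
        chosen = np.argsort(logits)[-top_k:]    # ...but only the top_k experts run
        weights = np.exp(logits[chosen])
        weights /= weights.sum()                # softmax over the chosen experts
        out = np.zeros(d_ff)
        for w, idx in zip(weights, chosen):
            out += w * (x @ experts[idx])       # only these matmuls touch weight bytes
        return out

    print(moe_layer(rng.standard_normal(d_model)).shape)  # (256,)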

replies(1): >>43754539 #
4. trebligdivad ◴[] No.43754539[source]
I'm not even sure they're loading everything into memory for MoE; maybe they can get away with only the relevant experts being paged in.
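
That's essentially what memory-mapping the weights would buy you - a toy sketch (file name and shapes invented) where the expert tensors live in an mmap'd file and only the slices the router actually touches get paged in by the OS; whether a particular runtime's loader behaves this nicely per-expert is a separate question:

    import numpy as np

    # Write all "expert" weights to disk once (toy sizes)...
    all_experts = np.random.default_rng(0).standard_normal((8, 64, 256)).astype(np.float32)
    np.save("experts.npy", all_experts)

    # ...then memory-map instead of loading; nothing is read into RAM yet.
    experts = np.load("experts.npy", mmap_mode="r")

    # Reading only the experts the router picked means the OS only pages in
    # those slices of the file, not the whole parameter set.
    chosen = [1, 5]
    x = np.random.default_rng(1).standard_normal(64).astype(np.float32)
    outputs = [x @ np.asarray(experts[i]) for i in chosen]
    print(len(outputs), outputs[0].shape)  # 2 (256,)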