602 points emrah | 1 comment
manjunaths ◴[] No.43749093
I am running this on a 16 GB AMD Radeon 7900 GRE in a 64 GB machine, with ROCm and llama.cpp on Windows 11. I can use Open WebUI or the native GUI for the interface. It is made available via an internal IP to all members of my home.
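
For anyone curious what such a setup looks like: a minimal sketch, assuming a ROCm build of llama.cpp's llama-server and a hypothetical model path (Open WebUI can then point at the OpenAI-compatible endpoint):

    # Minimal sketch; hypothetical model path, ROCm build of llama.cpp assumed.
    # -ngl 99 offloads all layers to the GPU; --host 0.0.0.0 exposes it on the LAN.
    .\llama-server.exe -m .\models\gemma-3.gguf --host 0.0.0.0 --port 8080 -ngl 99

    # Open WebUI (or any OpenAI-compatible client) then connects to
    # http://<internal-ip>:8080/v1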

It runs at around 26 tokens/sec at FP16; FP8 is not supported by the Radeon 7900 GRE.
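
To reproduce a tokens/sec figure like that, llama.cpp's bundled llama-bench tool works; a sketch, with the same hypothetical model file:

    # Benchmark prompt processing and generation with all layers on the GPU
    .\llama-bench.exe -m .\models\gemma-3.gguf -ngl 99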

I just love it.

For coding, QwQ 32B is still king. But on a card with 16 GB of VRAM it gives me ~3 tokens/sec, which is unusable.
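
That is presumably because a 32B model does not fit entirely in 16 GB of VRAM, so llama.cpp has to leave some layers on the CPU. A sketch of how to see the offload penalty, again with llama-bench and a hypothetical quantized file (comma-separated -ngl values run one benchmark per setting):

    # Sweep the number of GPU-offloaded layers to expose the VRAM-spillover cost
    .\llama-bench.exe -m .\models\qwq-32b-q4_k_m.gguf -ngl 30,45,60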

I tried to get Gemma 3 to write a PowerShell script with a terminal GUI interface, and it ran into dead ends and finally gave up. QwQ 32B performed a lot better.
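
For context, the task was roughly of this shape; a minimal sketch of a terminal-GUI PowerShell script, assuming the Microsoft.PowerShell.ConsoleGuiTools module (an illustration of the kind of script I asked for, not the model's output):

    # Requires: Install-Module Microsoft.PowerShell.ConsoleGuiTools
    Import-Module Microsoft.PowerShell.ConsoleGuiTools

    # Show running processes in an interactive console grid (Terminal.Gui under
    # the hood) and report the user's selection.
    $proc = Get-Process | Select-Object Name, Id, WS |
        Out-ConsoleGridView -Title 'Pick a process' -OutputMode Single
    if ($proc) { Write-Host "Selected $($proc.Name) (PID $($proc.Id))" }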

But for most general purposes it is great. My kid has been feeding his school textbooks into it and asking it questions. It is better than anything else currently available.
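
One possible way to feed a textbook in (just an illustration, not necessarily what we do): extract the chapter to plain text and paste it into the context via llama-server's OpenAI-compatible API. A sketch with hypothetical paths, model name, and internal IP:

    # Assumption: chapter already extracted to plain text at a hypothetical path
    $excerpt = Get-Content -Raw 'C:\books\chapter1.txt'
    $body = @{
        model    = 'gemma-3'   # hypothetical model name
        messages = @(
            @{ role = 'system'; content = "Answer questions using only this textbook excerpt:`n$excerpt" }
            @{ role = 'user';   content = 'What are the main ideas of this chapter?' }
        )
    } | ConvertTo-Json -Depth 5
    # Hypothetical internal IP and port from the llama-server sketch above
    Invoke-RestMethod -Uri 'http://192.168.1.50:8080/v1/chat/completions' `
        -Method Post -ContentType 'application/json' -Body $body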

Somehow it is more "uptight" than Llama or the Chinese models like Qwen. I can't put my finger on it, but the Chinese models seem nicer and more talkative.

replies(1): >>43756231 #
1. mdp2021 ◴[] No.43756231
> My kid has been feeding his school textbooks into it and asking it questions

Which method are you employing to feed a textbook into the model?