    602 points by emrah | 13 comments
    1. justanotheratom ◴[] No.43743956[source]
    Has anyone packaged one of these in an iPhone app? I'm sure it's doable, but I'm curious what tokens/sec is possible these days. I would love to ship "private" AI apps if we can get reasonable tokens/sec.
    replies(4): >>43743983 #>>43744244 #>>43744274 #>>43744863 #
    2. Alifatisk ◴[] No.43743983[source]
    If you ever ship a private AI app, don't forget to implement the export functionality, please!
    replies(2): >>43744861 #>>43747697 #
    3. nico ◴[] No.43744244[source]
    What kind of functionality do you need from the model?

    For basic conversation and RAG, you can use tinyllama or qwen-2.5-0.5b, both of which run on a Raspberry Pi at around 5-20 tokens per second.
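
    For a rough sense of what running one of these tiny models looks like in code, here is a minimal sketch using llama-cpp-python; the GGUF filename, thread count, and prompt are placeholder assumptions, not the commenter's actual setup.

        # Minimal local chat with a tiny quantized model via llama-cpp-python.
        # The GGUF filename below is a placeholder; download a quantized model first.
        from llama_cpp import Llama

        llm = Llama(
            model_path="qwen2.5-0.5b-instruct-q4_k_m.gguf",  # placeholder path
            n_ctx=2048,     # keep the context window small on low-RAM devices
            n_threads=4,    # e.g. the four cores on a Raspberry Pi
            verbose=False,
        )

        out = llm.create_chat_completion(
            messages=[
                {"role": "system", "content": "You are a concise assistant."},
                {"role": "user", "content": "In one sentence, what is RAG?"},
            ],
            max_tokens=128,
        )
        print(out["choices"][0]["message"]["content"])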

    replies(1): >>43748252 #
    4. zamadatix ◴[] No.43744274[source]
    There are many such apps, e.g. Mollama, Enclave AI, or PrivateLLM, among dozens of others. But you could tell me it runs at 1,000,000 tokens/second on an iPhone and I wouldn't care, because the largest model version you're going to be able to load is Gemma 3 4B q4 (12B won't fit in 8 GB alongside the OS, and you still need room for context), and it's just not worth the time to use.

    That said, if you really care, it generates faster than reading speed (on an A18-based iPhone, at least).
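
    As a back-of-envelope check on the memory claim (the overhead figures here are rough assumptions, not measurements):

        # Why a 12B model at ~4-bit quantization doesn't fit in an 8 GB phone.
        params = 12e9
        bytes_per_weight = 0.5                        # ~4 bits per weight
        weights_gb = params * bytes_per_weight / 1e9  # ~6 GB for the weights alone
        kv_cache_gb = 1.5    # assumed KV cache for a few thousand tokens of context
        os_and_app_gb = 2.0  # assumed OS + app overhead
        total_gb = weights_gb + kv_cache_gb + os_and_app_gb
        print(f"~{total_gb:.1f} GB needed vs 8 GB available")  # ~9.5 GB: doesn't fit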

    replies(1): >>43744535 #
    5. woodson ◴[] No.43744535[source]
    Some of these small models still have their uses, e.g. for summarization. Don’t expect them to fully replace ChatGPT.
    replies(1): >>43744829 #
    6. zamadatix ◴[] No.43744829{3}[source]
    The use case is less about the application and more about "I'm willing to accept really bad answers that make things up at extremely high rates." The same goes for summarization; a small model doesn't do it nearly as well as a large model would.
    7. ◴[] No.43744861[source]
    8. nolist_policy ◴[] No.43744863[source]
    FWIW, I can run Gemma-3-12b-it-qat on my Galaxy Fold 4 with 12 GB of RAM at around 1.5 tokens/s. I use plain llama.cpp with Termux.
    replies(1): >>43745150 #
    9. Casteil ◴[] No.43745150[source]
    Does this turn your phone into a personal space heater too?
    10. idonotknowwhy ◴[] No.43747697[source]
    You mean conversations? Just JSONL in the standard HF dataset format, to import into other systems?
    replies(1): >>43750298 #
    11. justanotheratom ◴[] No.43748252[source]
    I am looking for structured output at about 100-200 tokens/second on iPhone 14+. Any pointers?
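
    For what it's worth, local structured output is usually handled with grammar- or JSON-schema-constrained decoding rather than by raw sampling speed. A rough sketch with llama-cpp-python follows; the model file and schema are placeholders, and it says nothing about whether 100-200 tokens/second is reachable on an iPhone 14.

        # Sketch: JSON-schema-constrained output with llama-cpp-python.
        # Model filename and schema are placeholders; an iPhone app would
        # typically drive llama.cpp (or another on-device runtime) directly.
        from llama_cpp import Llama

        llm = Llama(model_path="gemma-3-4b-it-q4_0.gguf", n_ctx=2048, verbose=False)

        schema = {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "tags": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["title", "tags"],
        }

        out = llm.create_chat_completion(
            messages=[{"role": "user", "content": "Extract a title and tags from: running Gemma 3 on a phone with llama.cpp."}],
            response_format={"type": "json_object", "schema": schema},  # constrained decoding
            max_tokens=128,
        )
        print(out["choices"][0]["message"]["content"])  # JSON conforming to the schema
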
    replies(1): >>43786330 #
    12. Alifatisk ◴[] No.43750298{3}[source]
    Yeah, I mean conversations.
    13. nico ◴[] No.43786330{3}[source]
    The qwen-2.5-0.5b is the tiniest useful model I've used, and it's pretty easy to fine-tune locally on a Mac. I haven't tried it on an iPhone, but given it runs at about 150-200 tokens/second on a Mac, I'm kinda doubtful it could do the same on an iPhone. But I guess you'd just have to try.
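
    For anyone who wants to just try it, a throwaway benchmark like the sketch below gives a rough tokens/second figure with llama-cpp-python; the model path is a placeholder, and an iPhone number would have to be measured inside the app itself.

        # Rough tokens/second measurement for a local GGUF model.
        # The model path is a placeholder; results vary widely with
        # quantization, context length, and hardware.
        import time
        from llama_cpp import Llama

        llm = Llama(model_path="qwen2.5-0.5b-instruct-q4_k_m.gguf", n_ctx=2048, verbose=False)

        prompt = "Write a short paragraph about running language models on phones."
        start = time.perf_counter()
        out = llm(prompt, max_tokens=200)
        elapsed = time.perf_counter() - start

        generated = out["usage"]["completion_tokens"]
        print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")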