
602 points emrah | 2 comments
justanotheratom ◴[] No.43743956[source]
Has anyone packaged one of these in an iPhone app? I am sure it is doable, but I am curious what tokens/sec is possible these days. I would love to ship "private" AI apps if we can get reasonable tokens/sec.
replies(4): >>43743983 #>>43744244 #>>43744274 #>>43744863 #
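
(For concreteness, a minimal sketch of how an app could measure decode throughput. The TokenGenerator protocol and measureTokensPerSecond function are hypothetical stand-ins for whatever on-device runtime the app embeds, e.g. a llama.cpp or MLX wrapper; they are not from a specific library.)

    import Foundation

    // Hypothetical interface over an on-device runtime that yields
    // one decoded token at a time.
    protocol TokenGenerator {
        mutating func nextToken() -> String?  // nil once generation is done
    }

    // Time the decode loop and report throughput in tokens/second.
    func measureTokensPerSecond<G: TokenGenerator>(_ generator: inout G) -> Double {
        var tokenCount = 0
        let start = Date()
        while generator.nextToken() != nil {
            tokenCount += 1
        }
        let elapsed = Date().timeIntervalSince(start)
        return elapsed > 0 ? Double(tokenCount) / elapsed : 0
    }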
zamadatix ◴[] No.43744274[source]
There are many such apps, e.g. Mollama, Enclave AI, and PrivateLLM, among dozens of others, but you could tell me one runs at 1,000,000 tokens/second on an iPhone and I wouldn't care, because the largest model you're going to be able to load is Gemma 3 4B at q4 (12B won't fit in 8 GB alongside the OS, and you still need room for context), and it's just not worth the time to use.

That said, if you really care: it generates faster than reading speed (on an A18-based iPhone, at least).

replies(1): >>43744535 #
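
(A back-of-envelope sketch of the memory claim above. The bits-per-weight, KV-cache, and OS-headroom figures are assumed ballpark values, not measurements.)

    // Rough fit check for a q4-quantized model on an 8 GB iPhone.
    func estimatedModelMemoryGB(paramsBillions: Double,
                                bitsPerWeight: Double = 4.5,  // q4 weights + quant scales (assumed)
                                kvCacheGB: Double = 1.0) -> Double {  // context/KV cache (assumed)
        let weightsGB = paramsBillions * bitsPerWeight / 8.0  // bits -> bytes
        return weightsGB + kvCacheGB
    }

    let totalRAMGB = 8.0
    let osAndAppGB = 3.0                       // assumed iOS + app headroom
    let availableGB = totalRAMGB - osAndAppGB  // ~5 GB left for the model

    estimatedModelMemoryGB(paramsBillions: 4)   // ~3.25 GB -> fits in ~5 GB
    estimatedModelMemoryGB(paramsBillions: 12)  // ~7.75 GB -> does not fit

Under these assumptions a 4B q4 model fits comfortably while a 12B one overshoots available RAM, which matches the parent comment's Gemma 3 example.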
1. woodson ◴[] No.43744535[source]
Some of these small models still have their uses, e.g. for summarization. Don’t expect them to fully replace ChatGPT.
replies(1): >>43744829 #
2. zamadatix ◴[] No.43744829[source]
The use case is more "I'm willing to accept really bad answers that make things up at an extremely high rate" than anything tied to a particular application. The same goes for summarization: a small model doesn't do it anywhere near as well as a large model would.