(developers.googleblog.com)

602 points emrah | 1 comments | 20 Apr 25 12:22 UTC

justanotheratom ◴[20 Apr 25 14:23 UTC] No.43743956[source]▶

Anyone packaged one of these in an iPhone App? I am sure it is doable, but I am curious what tokens/sec is possible these days. I would love to ship "private" AI Apps if we can get reasonable tokens/sec.

replies(4): >>43743983 #>>43744244 #>>43744274 #>>43744863 #

nico ◴[20 Apr 25 15:05 UTC] No.43744244[source]▶

>>43743956 #

What kind of functionality do you need from the model?

For basic conversation and RAG, you can use tinyllama or qwen-2.5-0.5b, both of which run on a Raspberry Pi at around 5-20 tokens per second.

replies(1): >>43748252 #

justanotheratom ◴[21 Apr 25 03:14 UTC] No.43748252[source]▶

>>43744244 #

I am looking for structured output at about 100-200 tokens/second on iPhone 14+. Any pointers?

replies(1): >>43786330 #

1. nico ◴[24 Apr 25 19:09 UTC] No.43786330[source]▶

>>43748252 #

The qwen-2.5-0.5b is the tiniest useful model I've used, and it's pretty easy to fine-tune locally on a Mac. I haven't tried it on an iPhone, but given that it runs at about 150-200 tokens/second on a Mac, I'm kinda doubtful it could do the same on an iPhone. I guess you'd just have to try.
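
For anyone who does try, here is a minimal sketch of a throughput measurement. It assumes a hypothetical `generate(prompt)` callable that wraps whatever runtime you end up using and returns the list of generated tokens. Neither that callable nor `tokens_per_second` comes from any particular library.

```python
import time
from typing import Callable, List


def tokens_per_second(
    generate: Callable[[str], List[str]], prompt: str, runs: int = 3
) -> float:
    """Average generation throughput over several runs.

    `generate` is a hypothetical stand-in for the real model call; it must
    return the tokens generated for `prompt`.
    """
    if runs < 1:
        raise ValueError("runs must be >= 1")

    total_tokens = 0
    total_seconds = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate(prompt)
        total_seconds += time.perf_counter() - start
        total_tokens += len(tokens)

    if total_seconds == 0.0:
        raise ValueError("elapsed time too small to measure")
    return total_tokens / total_seconds
```

Timing the whole call includes prompt processing, so with short outputs the number will understate pure decode speed.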

Gemma 3 QAT Models: Bringing AI to Consumer GPUs