    602 points by emrah | 13 comments
    1. justanotheratom ◴[] No.43743956[source]
    Has anyone packaged one of these in an iPhone app? I'm sure it's doable, but I'm curious what tokens/sec is possible these days. I would love to ship "private" AI apps if we can get reasonable tokens/sec.
    replies(4): >>43743983 #>>43744244 #>>43744274 #>>43744863 #
    2. Alifatisk ◴[] No.43743983[source]
    If you ever ship a private AI app, don't forget to implement the export functionality, please!
    replies(2): >>43744861 #>>43747697 #
    3. nico ◴[] No.43744244[source]
    What kind of functionality do you need from the model?

    For basic conversation and RAG, you can use tinyllama or qwen-2.5-0.5b, both of which run on a Raspberry Pi at around 5-20 tokens per second.
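
    For a rough sense of what running one of these tiny models looks like in code, here is a minimal sketch using llama-cpp-python; the GGUF filename, thread count, and prompt are placeholder assumptions, not the commenter's actual setup.

        # Minimal local chat with a tiny quantized model via llama-cpp-python.
        # The GGUF filename below is a placeholder; download a quantized model first.
        from llama_cpp import Llama

        llm = Llama(
            model_path="qwen2.5-0.5b-instruct-q4_k_m.gguf",  # placeholder path
            n_ctx=2048,     # keep the context window small on low-RAM devices
            n_threads=4,    # e.g. the four cores on a Raspberry Pi
            verbose=False,
        )

        out = llm.create_chat_completion(
            messages=[
                {"role": "system", "content": "You are a concise assistant."},
                {"role": "user", "content": "In one sentence, what is RAG?"},
            ],
            max_tokens=128,
        )
        print(out["choices"][0]["message"]["content"])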

    replies(1): >>43748252 #
    4. zamadatix ◴[] No.43744274[source]
    There are many such apps, e.g. Mollama, Enclave AI, or PrivateLLM, among dozens of others. But you could tell me it runs at 1,000,000 tokens/second on an iPhone and I wouldn't care, because the largest model version you're going to be able to load is Gemma 3 4B q4 (12B won't fit in 8 GB alongside the OS, and you still need room for context), and it's just not worth the time to use.

    That said, if you really care, it generates faster than reading speed (on an A18-based iPhone, at least).
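
    As a back-of-envelope check on the memory claim (the overhead figures here are rough assumptions, not measurements):

        # Why a 12B model at ~4-bit quantization doesn't fit in an 8 GB phone.
        params = 12e9
        bytes_per_weight = 0.5                        # ~4 bits per weight
        weights_gb = params * bytes_per_weight / 1e9  # ~6 GB for the weights alone
        kv_cache_gb = 1.5    # assumed KV cache for a few thousand tokens of context
        os_and_app_gb = 2.0  # assumed OS + app overhead
        total_gb = weights_gb + kv_cache_gb + os_and_app_gb
        print(f"~{total_gb:.1f} GB needed vs 8 GB available")  # ~9.5 GB: doesn't fit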

    replies(1): >>43744535 #
    5. woodson ◴[] No.43744535[source]
    Some of these small models still have their uses, e.g. for summarization. Don’t expect them to fully replace ChatGPT.
    replies(1): >>43744829 #
    6. zamadatix ◴[] No.43744829{3}[source]
    The use case is less about the application and more about "I'm willing to accept really bad answers that make things up at extremely high rates." The same goes for summarization; a small model doesn't do it nearly as well as a large model would.
    7. ◴[] No.43744861[source]
    8. nolist_policy ◴[] No.43744863[source]
    FWIW, I can run Gemma-3-12b-it-qat on my Galaxy Fold 4 with 12 GB of RAM at around 1.5 tokens/s. I use plain llama.cpp with Termux.
    replies(1): >>43745150 #
    9. Casteil ◴[] No.43745150[source]
    Does this turn your phone into a personal space heater too?
    10. idonotknowwhy ◴[] No.43747697[source]
    You mean conversations? Just JSONL in the standard HF dataset format, to import into other systems?
    replies(1): >>43750298 #
    11. justanotheratom ◴[] No.43748252[source]
    I am looking for structured output at about 100-200 tokens/second on iPhone 14+. Any pointers?
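
    For what it's worth, local structured output is usually handled with grammar- or JSON-schema-constrained decoding rather than by raw sampling speed. A rough sketch with llama-cpp-python follows; the model file and schema are placeholders, and it says nothing about whether 100-200 tokens/second is reachable on an iPhone 14.

        # Sketch: JSON-schema-constrained output with llama-cpp-python.
        # Model filename and schema are placeholders; an iPhone app would
        # typically drive llama.cpp (or another on-device runtime) directly.
        from llama_cpp import Llama

        llm = Llama(model_path="gemma-3-4b-it-q4_0.gguf", n_ctx=2048, verbose=False)

        schema = {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "tags": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["title", "tags"],
        }

        out = llm.create_chat_completion(
            messages=[{"role": "user", "content": "Extract a title and tags from: running Gemma 3 on a phone with llama.cpp."}],
            response_format={"type": "json_object", "schema": schema},  # constrained decoding
            max_tokens=128,
        )
        print(out["choices"][0]["message"]["content"])  # JSON conforming to the schema
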
    replies(1): >>43786330 #
    12. Alifatisk ◴[] No.43750298{3}[source]
    Yeah, I mean conversations.
    13. nico ◴[] No.43786330{3}[source]
    The qwen-2.5-0.5b is the tiniest useful model I've used, and it's pretty easy to fine-tune locally on a Mac. I haven't tried it on an iPhone, but given it runs at about 150-200 tokens/second on a Mac, I'm kinda doubtful it could do the same on an iPhone. But I guess you'd just have to try.
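
    For anyone who wants to just try it, a throwaway benchmark like the sketch below gives a rough tokens/second figure with llama-cpp-python; the model path is a placeholder, and an iPhone number would have to be measured inside the app itself.

        # Rough tokens/second measurement for a local GGUF model.
        # The model path is a placeholder; results vary widely with
        # quantization, context length, and hardware.
        import time
        from llama_cpp import Llama

        llm = Llama(model_path="qwen2.5-0.5b-instruct-q4_k_m.gguf", n_ctx=2048, verbose=False)

        prompt = "Write a short paragraph about running language models on phones."
        start = time.perf_counter()
        out = llm(prompt, max_tokens=200)
        elapsed = time.perf_counter() - start

        generated = out["usage"]["completion_tokens"]
        print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tok/s")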