I think the 32b models are actually good enough that I might stop paying for ChatGPT plus and Claude.
I get around 20 tok/second on my m3 and I can get 100 tok/second on smaller models or quantized. 80-100 tok/second is the best for interactive usage if you go above that you basically can’t read as fast as it generates.
I also really like the QwQ reaoning model, I haven’t gotten around to try out using locally hosted models for Agents and RAG especially coding agents is what im interested in. I feel like 20 tok/second is fine if it’s just running in the background.
Anyways would love to know others experiences, that was mine this weekend. The way it’s going I really dont see a point in paying, I think on-device is the near future and they should just charge a licensing fee like DB provider for enterprise support and updates.
If you were paying $20/mo for ChatGPT 1 year ago, the 32b models are basically at that level but slightly slower and slightly lower quality but useful enough to consider cancelling your subscriptions at this point.
It's really frustrating that I can't just write off Apple as evil monopolists when they put out hardware like this.