486 points | dbreunig | 1 comment
cjbgkagh No.41865626
> We've tried to avoid that by making both the input matrices more square, so that tiling and reuse should be possible.

While it might be possible, it would not surprise me if a number of possible optimizations haven't made it into ONNX. It appears that Qualcomm does not give direct access to the NPU; users are expected to use frameworks to convert models over to it, and in my experience conversion tools generally suck and leave a lot of optimizations on the table. It could be less that NPUs suck and more that the conversion tools suck. I'll wait until I get direct access - I don't trust conversion tools.
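
For intuition on why the squarer shapes they mention should help, here's a back-of-the-envelope roofline sketch (the shapes and fp16 element size are illustrative assumptions on my part, nothing Qualcomm- or ONNX-specific):

    # Arithmetic intensity (flops per byte) of an (M,K) x (K,N) matmul,
    # assuming each of A, B, and C crosses the memory bus exactly once (fp16).
    def arithmetic_intensity(m, k, n, bytes_per_elem=2):
        flops = 2 * m * n * k                                   # multiply-adds
        bytes_moved = bytes_per_elem * (m * k + k * n + m * n)  # A + B + C
        return flops / bytes_moved

    print(arithmetic_intensity(1, 4096, 4096))  # skinny, GEMV-like: ~1.0
    print(arithmetic_intensity(256, 256, 256))  # same flop count, square: ~85

Same number of flops, but the square shape offers ~85x more reuse per byte fetched, which is exactly what tiling needs to exploit.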

My view of NPUs is that they're great for tiny ML models and very fast function approximations, which is my intended use case. While LLMs are the new hotness, there are a huge number of specialized tasks that small models are really useful for.

Hizonner No.41868939
> While LLMs are the new hotness, there are a huge number of specialized tasks that small models are really useful for.

Can you give some examples? Preferably examples that will run continuously enough for even a small model to stay in cache, and are valuable enough to a significant number of users to justify that cache footprint?

I am not saying there aren't any, but I also honestly don't know what they are and would like to.

cjbgkagh No.41871252
I guess these days basically anything in ML prior to LLMs would be considered small. LLMs are rather unusual because of how large they are.

NNs can be used as general function approximators, so any function which can be approximated is a candidate for using a NN in its place. I have a very complex trig function that produces a high-dimensional smooth manifold; I know it will only be used within a narrow range of inputs, and I can sacrifice some accuracy for speed. My inner loops have inner loops, which have inner loops with inner loops, so when you're 4+ inner loops deep, speed becomes essential. I can sweep the entire input domain to make sure the error always stays within limits.
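
A minimal sketch of that workflow (PyTorch assumed; the target function, input range, and network size here are placeholders, not my actual manifold):

    import torch

    def expensive_trig(x):  # stand-in for the real high-dimensional function
        return torch.sin(3 * x[:, :1]) * torch.cos(2 * x[:, 1:]) + torch.sin(x[:, :1] * x[:, 1:])

    net = torch.nn.Sequential(   # small enough to stay cache/NPU resident
        torch.nn.Linear(2, 32), torch.nn.Tanh(),
        torch.nn.Linear(32, 32), torch.nn.Tanh(),
        torch.nn.Linear(32, 1),
    )
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)

    lo, hi = 0.0, 0.5            # the narrow input range we actually care about
    for _ in range(2000):        # fit on random samples from that range only
        x = lo + (hi - lo) * torch.rand(1024, 2)
        loss = torch.mean((net(x) - expensive_trig(x)) ** 2)
        opt.zero_grad(); loss.backward(); opt.step()

    # The crucial step: sweep the whole domain and check the *worst-case* error.
    g = torch.linspace(lo, hi, 200)
    grid = torch.cartesian_prod(g, g)
    with torch.no_grad():
        max_err = (net(grid) - expensive_trig(grid)).abs().max()
    print(f"worst-case abs error: {max_err:.2e}")

The sweep is the part that makes this safe: on a closed, narrow domain you can actually certify the worst-case error, which you can't do for most NN applications.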

If you're doing things such as counting instructions, intrinsics, inline assembly, bit-twiddling, fast math, polynomial approximations, LUTs, fixed-point math, etc., you could probably add NNs to your toolkit.
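
For comparison, the LUT entry from that list, sketched in Python rather than the C it would really live in (the range and table size are illustrative):

    import numpy as np

    # LUT + linear interpolation replacing sin() on a known narrow range.
    LO, HI, N = 0.0, np.pi / 4, 256
    TABLE = np.sin(np.linspace(LO, HI, N))   # built once, outside the hot loop
    SCALE = (N - 1) / (HI - LO)

    def sin_lut(x):
        t = (x - LO) * SCALE                       # fractional table index
        i = np.minimum(t.astype(np.int64), N - 2)  # clamp so i+1 stays in range
        frac = t - i
        return TABLE[i] * (1 - frac) + TABLE[i + 1] * frac

    xs = np.linspace(LO, HI, 100_000)              # sweep the full input domain
    print(np.abs(sin_lut(xs) - np.sin(xs)).max())  # worst case on the order of 1e-6

Same pattern as the NN version: trade a little accuracy for speed, then verify the error bound by sweeping the domain.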

Stockfish uses a 'small' 82K-parameter neural net of three dense integer-only layers (https://news.ycombinator.com/item?id=27734517). I think Stockfish would be a really good candidate for benchmarking NPUs, as there is a time/accuracy tradeoff.
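
To be concrete about what "dense integer-only layers" means there, a toy sketch - random weights and illustrative sizes, not the actual NNUE architecture or its quantization scheme:

    import numpy as np

    # Toy integer-only net: int8 weights, int32 accumulation, and a
    # right-shift in place of floating-point rescaling.
    rng = np.random.default_rng(0)
    W1 = rng.integers(-127, 128, size=(256, 32), dtype=np.int8)
    W2 = rng.integers(-127, 128, size=(32, 1), dtype=np.int8)

    def forward(x_int8):
        # Widen to int32 before the matmul so the int8 products can't overflow.
        h = x_int8.astype(np.int32) @ W1.astype(np.int32)
        h = np.maximum(h >> 6, 0)                # shift-rescale, then ReLU
        return (h @ W2.astype(np.int32)) >> 6

    x = rng.integers(0, 2, size=256, dtype=np.int8)  # sparse 0/1 board features
    print(forward(x))

Nets like this map naturally onto the int8/int16 MAC units that NPUs are built around, which is why the time/accuracy tradeoff would be interesting to measure.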