
486 points dbreunig | 5 comments
1. cjbgkagh ◴[] No.41865626[source]
> We've tried to avoid that by making both the input matrices more square, so that tiling and reuse should be possible.

While it might be possible, it would not surprise me if a number of potential optimizations never made it into ONNX. It appears that Qualcomm does not give direct access to the NPU; users are expected to use frameworks to convert models over to it, and in my experience conversion tools generally suck and leave a lot of optimizations on the table. It could be less that NPUs suck and more that the conversion tools suck. I'll wait until I get direct access - I don't trust conversion tools.
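To make the conversion-path point concrete, the workflow on these machines looks roughly like the sketch below: export to ONNX, then hand the graph to onnxruntime and ask for Qualcomm's QNN execution provider. The model, shapes, and file name are just illustrative assumptions; the point is that you never program the NPU directly, the runtime decides what actually lands on it.

    # Hypothetical sketch: export a small PyTorch model to ONNX, then ask
    # onnxruntime to schedule it on the NPU via the QNN execution provider.
    # Model, shapes, and provider list are illustrative assumptions.
    import torch
    import onnxruntime as ort

    model = torch.nn.Sequential(
        torch.nn.Linear(256, 256),
        torch.nn.ReLU(),
        torch.nn.Linear(256, 64),
    ).eval()

    dummy = torch.randn(1, 256)
    torch.onnx.export(model, dummy, "small_model.onnx", opset_version=17)

    # The runtime decides which ops actually run on the NPU; anything the
    # converter can't map falls back to the CPU provider.
    session = ort.InferenceSession(
        "small_model.onnx",
        providers=["QNNExecutionProvider", "CPUExecutionProvider"],
    )
    out = session.run(None, {session.get_inputs()[0].name: dummy.numpy()})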

My view of NPUs is that they're great for tiny ML models and very fast function approximations, which is my intended use case. While LLMs are the new hotness, there are a huge number of specialized tasks that small models are really useful for.

replies(2): >>41865847 #>>41868939 #
2. jaygreco ◴[] No.41865847[source]
I came here to say this. I haven’t worked with the Elite X, but on the past-gen stuff I’ve used (mostly the 865), the accelerators - the compute DSP and a much smaller NPU - required _very_ specific setup, compilation with a bespoke toolchain, and communication via RPC, to name a few things.

I would hope the NPU on the Elite X is easier to get to considering the whole Copilot+ thing, but I bring this up mainly to make the point that I doubt it’s as easy as “run a general-purpose model and expect it to magically teleport onto the NPU”.

3. Hizonner ◴[] No.41868939[source]
> While LLMs are the new hotness, there are a huge number of specialized tasks that small models are really useful for.

Can you give some examples? Preferably examples that will run continuously enough for even a small model to stay in cache, and are valuable enough to a significant number of users to justify that cache footprint?

I am not saying there aren't any, but I also honestly don't know what they are and would like to.

replies(2): >>41870791 #>>41871252 #
4. consteval ◴[] No.41870791[source]
iPhones use a lot of these. There are a bunch of little features that run on the NPU.

Suggestions, predictive text, smart image search, automatic image classification, text selection in images, image processing. These don't run continuously, but I think they are valuable to a lot of users. The predictive text is quite good, and it's very nice to be able to search for vague terms like "license plate" and get images in my camera roll. Plus, selecting text and copying it from images is great.

For desktop use cases, I'm not sure.

5. cjbgkagh ◴[] No.41871252[source]
I guess these days basically anything in ML prior to LLMs would be considered small. LLMs are rather unusual because of how large they are.

NNs can be used as general function approximators, so any function which can be approximated is a candidate for using a NN in its place. I have a very complex trig function that produces a high-dimensional smooth manifold; I know it will only be used within a narrow range of inputs, and I can sacrifice some accuracy for speed. My inner loops have inner loops which have inner loops with inner loops, so when you're 4+ inner loops deep, speed becomes essential. I can sweep the entire input domain to make sure the error always stays within limits.
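As a rough sketch of that workflow (PyTorch, with a stand-in 1D function since I can't share the real one; the network size and input range are made up): fit a small MLP over the narrow range you care about, then sweep a dense grid to verify the worst-case error.

    # Minimal sketch of NN-as-function-approximator. The target function,
    # network size, and error check are illustrative assumptions.
    import torch

    def target(x):
        # Stand-in for the expensive trig expression (the real one is
        # multivariate and far more complex).
        return torch.sin(3.0 * x) * torch.cos(x * x)

    net = torch.nn.Sequential(
        torch.nn.Linear(1, 32), torch.nn.Tanh(),
        torch.nn.Linear(32, 32), torch.nn.Tanh(),
        torch.nn.Linear(32, 1),
    )

    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    lo, hi = 0.0, 1.5  # narrow input range the function will actually see

    for step in range(5000):
        x = torch.rand(256, 1) * (hi - lo) + lo
        loss = torch.mean((net(x) - target(x)) ** 2)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Sweep the whole input domain to make sure the error stays within limits.
    with torch.no_grad():
        grid = torch.linspace(lo, hi, 100_000).unsqueeze(1)
        max_err = (net(grid) - target(grid)).abs().max().item()
    print(f"max abs error over domain: {max_err:.2e}")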

If you're doing things such as counting instructions, intrinsics, inline assembly, bit-twiddling, fast math, polynomial approximations, LUTs, fixed-point math, etc., you could probably add NNs to your toolkit. A LUT comparison is sketched below.
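For comparison, the classic version of the same trick is a lookup table with linear interpolation over the same narrow range, roughly like this (table size is a guess; in practice you size it to your error budget). The NN starts to win when the input is multidimensional and a table would blow up combinatorially.

    # Sketch of the classic alternative: precomputed LUT + linear interpolation
    # over the same narrow range, using the same stand-in function as above.
    import numpy as np

    lo, hi, n = 0.0, 1.5, 4096
    xs = np.linspace(lo, hi, n)
    table = np.sin(3.0 * xs) * np.cos(xs * xs)

    def lut_eval(x):
        t = (x - lo) / (hi - lo) * (n - 1)
        i = np.clip(t.astype(int), 0, n - 2)
        frac = t - i
        return table[i] * (1 - frac) + table[i + 1] * frac

    grid = np.linspace(lo, hi, 100_000)
    exact = np.sin(3.0 * grid) * np.cos(grid * grid)
    print(f"LUT max abs error: {np.abs(lut_eval(grid) - exact).max():.2e}")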

Stockfish uses a 'small' 82K-parameter neural net of three dense integer-only layers (https://news.ycombinator.com/item?id=27734517). I think Stockfish performance would be a really good candidate for testing NPUs, as there is a time / accuracy tradeoff.
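To make the "integer-only" part concrete, that kind of net boils down to int8 weights, int32 accumulation, and a clipped ReLU between layers, roughly like the sketch below (schematic only: the sizes, shifts, and weights are made up, and this is not Stockfish's actual NNUE layout).

    # Schematic sketch of a small integer-only dense net in the NNUE spirit:
    # int8 weights, int32 accumulation, clipped ReLU, and a right-shift as a
    # stand-in for requantization. All sizes and values are made up.
    import numpy as np

    rng = np.random.default_rng(0)

    def int_layer(x, w, b, shift):
        acc = x.astype(np.int32) @ w.astype(np.int32) + b     # int32 accumulate
        return np.clip(acc >> shift, 0, 127).astype(np.int8)  # clipped ReLU

    # Three dense layers with int8 weights (hypothetical sizes).
    w1, b1 = rng.integers(-64, 64, (512, 32), dtype=np.int8), np.zeros(32, np.int32)
    w2, b2 = rng.integers(-64, 64, (32, 32), dtype=np.int8), np.zeros(32, np.int32)
    w3, b3 = rng.integers(-64, 64, (32, 1), dtype=np.int8), np.zeros(1, np.int32)

    x = rng.integers(0, 2, 512, dtype=np.int8)  # sparse binary input features
    h = int_layer(x, w1, b1, shift=6)
    h = int_layer(h, w2, b2, shift=6)
    score = h.astype(np.int32) @ w3.astype(np.int32) + b3     # raw int32 eval
    print(score)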