
486 points dbreunig | 5 comments
isusmelj No.41863460
I think the results show that, in general, the compute isn't being used well. The CPU taking 8.4 ms versus the GPU's 3.2 ms is a very small gap; I'd expect more like a 10x-20x difference here. I'd assume onnxruntime might be the issue. Some hardware vendors seem to release the compute units without shipping proper software support yet. Let's see how fast that changes.

Also, people often mistake the reason for an NPU as "speed". That's not correct. The whole point of an NPU is low power consumption. To focus on speed you'd need to get rid of the memory bottleneck, and then you end up designing your own ASIC with its own memory. The NPUs we see in most devices are part of the SoC, sitting alongside the CPU to offload AI computations. It would be interesting to run this benchmark in an infinite loop on all three devices (CPU, NPU, GPU) and measure power consumption. I'd expect the NPU to be lowest in power and best in terms of "ops/watt".
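A minimal sketch of such a loop with onnxruntime (the provider names, model path, and input shape here are assumptions that vary by platform; power draw itself has to be read from an external meter or vendor telemetry while the loop runs):

    import time
    import numpy as np
    import onnxruntime as ort

    # Assumed device -> execution provider mapping; differs per vendor/OS.
    PROVIDERS = {
        "cpu": ["CPUExecutionProvider"],
        "gpu": ["DmlExecutionProvider"],  # DirectML on Windows; an assumption
        "npu": ["QNNExecutionProvider"],  # Qualcomm NPU backend; an assumption
    }

    x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # assumed input shape

    for name, providers in PROVIDERS.items():
        sess = ort.InferenceSession("model.onnx", providers=providers)
        input_name = sess.get_inputs()[0].name
        for _ in range(10):  # warm-up runs
            sess.run(None, {input_name: x})
        t0, n = time.perf_counter(), 1000
        for _ in range(n):  # steady-state loop; sample the power meter here
            sess.run(None, {input_name: x})
        ms = (time.perf_counter() - t0) / n * 1e3
        print(f"{name}: {ms:.2f} ms/inference")

Dividing the measured watts by the achieved inferences/second would give the ops/watt comparison.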

theresistor No.41864928
> Also, people often mistake the reason for an NPU is "speed". That's not correct. The whole point of the NPU is rather to focus on low power consumption.

It's also often about offload. Depending on the use case, the CPU and GPU may be busy with other tasks, so the NPU is free bandwidth that can be used without stealing from the others. Consider AI-powered photo filters: the GPU is probably busy rendering the preview, and the CPU is busy drawing UI and handling user inputs.

1. cakoose No.41865137
Offload only makes sense if there are other advantages, e.g. speed, power.

Without those, wouldn't it be better to spend the NPU's silicon budget on more CPU?

2. heavyset_go No.41865175
More CPU means siphoning off more of the power budget on mobile devices. The theoretical value of NPUs is power efficiency on a limited budget.
3. theresistor No.41865703
If you know that you need to offload matmuls, then building matmul hardware is more area efficient than adding an entire extra CPU. Various intermediate points exist along that spectrum, e.g. Cell's SPUs.
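As a rough illustration of why fixed-function matmul is area-cheap: the whole workload is one regular, branch-free loop nest of multiply-accumulates, so the hardware needs no branch prediction, reordering, or general-purpose decode, just an array of MAC units. A sketch of the loop being hardened (illustrative Python, not any vendor's actual dataflow):

    # C = A @ B as a fixed loop nest; every inner step is one
    # multiply-accumulate (MAC), the only primitive the hardware needs.
    def matmul_mac(A, B):
        m, k, n = len(A), len(A[0]), len(B[0])
        C = [[0] * n for _ in range(m)]
        for i in range(m):
            for j in range(n):
                for p in range(k):
                    C[i][j] += A[i][p] * B[p][j]  # one MAC
        return C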
4. avianlyric No.41865735
Not really. Getting extra CPU performance likely means more cores, or some other general-purpose compute silicon. That stuff tends to be quite big, simply because it's so flexible.

NPUs focus on one specific type of computation, matrix multiplication, usually with low-precision integers, because that's all a neural net needs (see the sketch below). That vast reduction in flexibility means you can take lots of shortcuts in your design, allowing you to cram more compute into a smaller footprint.

If you look at the M1 chip[1], you can see the entire 16-core Neural Engine has a footprint about the size of four performance cores (excluding their caches). It's not a perfect comparison without numbers on what the performance cores achieve in ops/second versus the Neural Engine, but it seems reasonable to bet that the Neural Engine handily outperforms the performance core complex at matmul operations.

[1] https://www.anandtech.com/show/16226/apple-silicon-m1-a14-de...
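To make the low-precision point concrete, here's a sketch (scales and shapes are illustrative, not from the thread) of the int8 matmul with int32 accumulation that NPU datapaths typically implement; an 8-bit integer MAC is far smaller than a 32-bit float unit, which is where much of the footprint saving comes from:

    import numpy as np

    # Illustrative quantized matmul: int8 inputs, int32 accumulation to
    # avoid overflow, then rescale the result back down to int8.
    def quantized_matmul(a_q, b_q, scale_a, scale_b, scale_out):
        acc = a_q.astype(np.int32) @ b_q.astype(np.int32)  # widen, accumulate
        real = acc * (scale_a * scale_b)                    # real-valued result
        return np.clip(np.round(real / scale_out), -128, 127).astype(np.int8)

    a_q = np.random.randint(-128, 128, (64, 64), dtype=np.int8)
    b_q = np.random.randint(-128, 128, (64, 64), dtype=np.int8)
    c_q = quantized_matmul(a_q, b_q, 0.02, 0.05, 0.1)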

5. mapt No.41868502
For PC CPUs, there are already so many watts per square millimeter that many top-tier parts of recent generations run thermally throttled 24/7; more cooling improves performance rather than temperatures, because it allows more of the cores to run at "full" or "boost" speed. This undermines their profitable market segmentation.

In this environment it makes sense to use more efficient RISC cores, to spread the cores out with dedicated blocks that either won't be used all the time or will run at lower power draw, and to pair the cores with better on-die memory (very large L2/L3 caches) and other features. Apple even leaves some silicon around the power-delivery section as empty space for thermal reasons.

Emily (formerly Anthony) on LTT had a piece on the Apple CPUs that pointed out some of the inherent advantages of the big-chip ARM SOC versus the x86 motherboard-daughterboard arrangement as we start to hit Moore's Wall. https://www.youtube.com/watch?v=LFQ3LkVF5sM