AI PCs Aren't Good at AI: The CPU Beats the NPU

(github.com)

486 points dbreunig | 1 comments | 16 Oct 24 19:44 UTC


isusmelj ◴[16 Oct 24 20:23 UTC] No.41863460[source]▶

I think the results show that, in general, the compute is not being used well. A CPU time of 8.4ms against a GPU time of 3.2ms is a very small gap; I'd expect more like a 10x - 20x difference here. I'd assume onnxruntime might be the issue. I think some hardware vendors just release the compute units without shipping proper support yet. Let's see how fast that will change.

Also, people often mistakenly assume the reason for an NPU is "speed". That's not correct. The whole point of the NPU is rather low power consumption. To focus on speed you'd need to get rid of the memory bottleneck, and then you end up designing your own ASIC with its own memory. The NPUs we see in most devices are part of the SoC around the CPU, there to offload AI computations. It would be interesting to run this benchmark in an infinite loop on the three devices (CPU, NPU, GPU) and measure power consumption. I'd expect the NPU to be lowest and also best in terms of "ops/watt".
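To make the "ops/watt" comparison concrete, here is a minimal sketch of the arithmetic. It assumes back-to-back inference at a steady average power draw; the function names are hypothetical, and the only numbers used are the 8.4ms and 3.2ms latencies quoted above. No power figures were measured in this thread.

```python
def energy_per_inference_j(latency_ms, avg_power_w):
    """Energy in joules for one inference at a steady average power draw."""
    return avg_power_w * latency_ms / 1000.0


def break_even_power_ratio(latency_a_ms, latency_b_ms):
    """Power ratio P_b / P_a at which devices a and b spend equal energy per inference.

    Energy per inference is power * latency, so the two are equal when
    P_b / P_a == latency_a / latency_b.
    """
    return latency_a_ms / latency_b_ms


# Latencies quoted in the thread: CPU 8.4 ms, GPU 3.2 ms.
ratio = break_even_power_ratio(8.4, 3.2)
print(f"GPU wins on energy only if it draws less than {ratio:.2f}x the CPU's power")
```

With those latencies, the faster device is only more efficient if it draws less than roughly 2.6x the power of the slower one, which is why latency alone says little about the NPU's intended advantage.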

replies(8): >>41863552 #>>41863639 #>>41864898 #>>41864928 #>>41864933 #>>41866594 #>>41869485 #>>41870575 #

godelski ◴[16 Oct 24 23:21 UTC] No.41864898[source]▶

>>41863460 #

They definitely aren't doing the timing properly, but what you might think of as timing is also not what is generally marketed. I will say, though, that the marketed numbers are often easier to compare. One example: if you're using a GPU, have you actually considered that there's an asynchronous operation inside your timing?

If you're naively using `time.time()`, then what happens is this:

  import time
  start = time.time()  # CPU records the time
  pred = model(input.cuda())  # copy the input to GPU memory (if not already there) and launch the computation; this is asynchronous
  end = time.time()  # CPU records the time, whether or not pred has actually been computed yet

You probably aren't expecting that if you don't know systems and hardware. But Python (and really any language) is designed to be smart and compile into more optimized things than what you actually wrote. There's no lock, so CPU tasks aren't blocked while the GPU works. You might ask why it works this way. Well, no one knows what you actually want to do. And do you want the timer library checking for accelerators (i.e. the GPU) every time it records a time? That's going to mess up your timer! (At best you'd need a constructor option to say "enable locking for this accelerator".) So you've got to do something a bit more nuanced.

If you want to actually time GPU tasks, you should look at CUDA event timers (in PyTorch this is `torch.cuda.Event(enable_timing=True)`; I have another comment with boilerplate).
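A minimal sketch of what event-based timing can look like, not the boilerplate from the other comment. The helper name `time_call_ms` and its parameters are hypothetical. It assumes PyTorch with a CUDA device for the event path and otherwise falls back to a plain wall-clock timer, which is adequate when the call is synchronous.

```python
import time

try:
    import torch  # optional: the CUDA event path needs PyTorch
except ImportError:
    torch = None


def time_call_ms(fn, warmup=3, iters=10):
    """Return the mean time per call of fn() in milliseconds."""
    use_cuda = torch is not None and torch.cuda.is_available()

    for _ in range(warmup):  # warm-up: lazy init, caches, first allocations
        fn()

    if use_cuda:
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        torch.cuda.synchronize()  # drain pending GPU work before timing
        start.record()
        for _ in range(iters):
            fn()
        end.record()
        torch.cuda.synchronize()  # wait until the GPU has reached `end`
        return start.elapsed_time(end) / iters  # elapsed_time() is in ms

    # CPU fallback: the call blocks, so wall-clock timing is valid.
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) * 1000.0 / iters
```

Usage would be something like `time_call_ms(lambda: model(x))` with `model` and `x` already on the GPU. The two `synchronize()` calls are what the naive `time.time()` version is missing.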

Edit:

There are also complicated issues like memory size and shape. They definitely are not being nice to the NPU here on either of those. NPUs (and GPUs!!!) want channels last. They used [1,6,1500,1500], but you'd want [1,1500,1500,6]. There's also the issue of how memory is allocated (and they noted IO being an issue). 1500 is a weird number (as is 6), so they aren't doing the NPU any favors, and I wouldn't be surprised if this is a big hit considering how new these things are.
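To illustrate what the layout change means for memory access, here is a small sketch that computes flat element offsets for the benchmark's shape in both layouts. The function names are hypothetical and both buffers are assumed contiguous. Whether a particular NPU runtime actually prefers channels last is the claim above, not something this code demonstrates.

```python
def offset_nchw(shape, n, c, h, w):
    """Flat element offset of (n, c, h, w) in a contiguous NCHW buffer."""
    N, C, H, W = shape
    return ((n * C + c) * H + h) * W + w


def offset_nhwc(shape, n, c, h, w):
    """Flat element offset of the same element in a contiguous NHWC buffer."""
    N, C, H, W = shape
    return ((n * H + h) * W + w) * C + c


shape = (1, 6, 1500, 1500)  # logical N, C, H, W from the benchmark

# Distance in elements between channel 0 and channel 1 of the same pixel.
gap_nchw = offset_nchw(shape, 0, 1, 0, 0) - offset_nchw(shape, 0, 0, 0, 0)
gap_nhwc = offset_nhwc(shape, 0, 1, 0, 0) - offset_nhwc(shape, 0, 0, 0, 0)
print(gap_nchw, gap_nhwc)  # → 2250000 1
```

In [1,6,1500,1500] the six channel values of one pixel sit 2,250,000 elements apart; in [1,1500,1500,6] they are adjacent. In PyTorch, `x.to(memory_format=torch.channels_last)` requests the second layout while keeping the logical NCHW shape.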

And here's my longer comment with more details: https://news.ycombinator.com/item?id=41864828

replies(1): >>41865375 #

artemisart ◴[17 Oct 24 00:49 UTC] No.41865375[source]▶

>>41864898 #

Important clarification: the async part is absolutely not Python-specific. It comes from CUDA, indeed for performance, and you have to use CUDA events in C++ too to time it properly.

For ONNX, the runtimes I know of are synchronous: we don't run each operation individually but whole models at once, so there is no need for async and the timings should be correct.
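If the runtime call really does block until the outputs are ready, a plain wall-clock harness is enough. A minimal sketch, assuming a blocking callable; `bench_sync_ms` is a hypothetical helper, and `fn` stands in for something like `lambda: session.run(None, inputs)` on an onnxruntime `InferenceSession`.

```python
import statistics
import time


def bench_sync_ms(fn, warmup=5, iters=50):
    """Time a blocking call; return (median_ms, min_ms, max_ms)."""
    for _ in range(warmup):  # exclude one-time setup costs from the samples
        fn()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()  # must block until the result is ready
        samples.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(samples), min(samples), max(samples)
```

Reporting the median alongside min and max makes it easier to spot the run-to-run noise that a single measurement hides.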

replies(1): >>41865495 #

1. godelski ◴[17 Oct 24 01:13 UTC] No.41865495[source]▶

>>41865375 #

Yes, it isn't Python, it is... hardware. It's not even CUDA-specific. It is about memory moving around and optimization (remember, even CPUs do speculative execution). I say a little more in the larger comment.

I'm less concerned about the CPU baseline and more concerned about the NPU timing, especially given the other issues.
