
486 points dbreunig | 1 comment | HN request time: 0.239s | source
isusmelj ◴[] No.41863460[source]
I think the results show that just in general the compute is not used well. That the CPU took 8.4ms and GPU took 3.2ms shows a very small gap. I'd expect more like 10x - 20x difference here. I'd assume that the onnxruntime might be the issue. I think some hardware vendors just release the compute units without shipping proper support yet. Let's see how fast that will change.

Also, people often mistake the reason for an NPU as "speed". That's not correct. The whole point of an NPU is low power consumption. To focus on speed you'd need to get rid of the memory bottleneck, at which point you end up designing your own ASIC with its own memory. The NPUs we see in most devices are part of the SoC, sitting alongside the CPU to offload AI computations. It would be interesting to run this benchmark in an infinite loop on all three devices (CPU, NPU, GPU) and measure power consumption. I'd expect the NPU to draw the least power and also come out best in terms of "ops/watt".
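To make the "ops/watt" comparison concrete, here's a minimal sketch of how the tabulation could look. All the numbers below are made-up placeholders except the 8.4 ms / 3.2 ms CPU/GPU latencies echoed from upthread; you'd substitute real measurements from a power meter or the platform's telemetry counters:

```python
# Hypothetical figures for illustration: latency per inference and
# average power draw while looping the benchmark. Not measurements.
devices = {
    # name: (latency_s, power_w)
    "CPU": (0.0084, 15.0),  # latency from upthread, power assumed
    "GPU": (0.0032, 25.0),  # latency from upthread, power assumed
    "NPU": (0.0060, 2.0),   # both values assumed
}

for name, (latency_s, power_w) in devices.items():
    inferences_per_s = 1.0 / latency_s
    # inferences per joule is the "ops/watt" proxy: throughput / power
    inferences_per_joule = inferences_per_s / power_w
    print(f"{name}: {inferences_per_s:6.1f} inf/s, "
          f"{inferences_per_joule:6.2f} inf/J")
```

With these placeholder numbers the NPU loses on raw latency but wins on efficiency, which is exactly the trade-off described above.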

replies(8): >>41863552 #>>41863639 #>>41864898 #>>41864928 #>>41864933 #>>41866594 #>>41869485 #>>41870575 #
spookie ◴[] No.41864933[source]
I've been building an app in pure C using onnxruntime, and it outperforms a comparable one done with python by a substancial amount. There are many other gains to be made.

(In the end Python just calls C, but it's pretty interesting how much performance is lost.)
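As a rough illustration of where some of that loss comes from: every Python-to-C crossing pays interpreter dispatch and argument-boxing cost before any real work happens. A stdlib-only sketch that measures this per-call overhead (math.sqrt here just stands in for any C-backed entry point; the exact number varies by machine):

```python
import math
import time

def seconds_per_call(fn, arg, n=200_000):
    """Average wall-clock seconds per call of a C-backed function,
    including Python's per-call dispatch and boxing overhead."""
    start = time.perf_counter()
    for _ in range(n):
        fn(arg)
    return (time.perf_counter() - start) / n

per_call = seconds_per_call(math.sqrt, 2.0)
print(f"~{per_call * 1e9:.0f} ns per Python -> C call")
```

Tens of nanoseconds per call is noise for one big inference, but it compounds quickly in code that crosses the boundary per element or per layer.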

replies(1): >>41867439 #
dacryn ◴[] No.41867439[source]
Agreed, but then again, using ort in Rust is faster still.

You can't compare Python with an ONNX executor.

I don't know what you used in Python, but if it's PyTorch or similar, those are built with flexibility in mind; for optimal performance you want to export the model to ONNX and use whatever executor is optimized for your environment. onnxruntime is one of them, but definitely not the only one, and given that it's from Microsoft, some prefer to avoid it and choose among the many free alternatives.

replies(1): >>41867667 #
rerdavies ◴[] No.41867667[source]
Why would the two not be entirely comparable? PyTorch may be slower at building the models; but once the model is compiled and loaded on the NPU, there's just not a whole lot of Python involved anymore. A few hundred CPU cycles to push the input data using python; a few hundred CPU cycles to receive the results using python. And everything in-between gets executed on the NPU.
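Back-of-the-envelope, assuming (hypothetically) a few microseconds of Python marshalling per inference against the 3.2 ms GPU time reported upthread, the interpreter's share of wall time is well under a percent:

```python
# Hypothetical: per-inference cost of pushing inputs / pulling outputs
# through Python bindings. The 3.2e-3 s device time is from upthread.
python_overhead_s = 5e-6
device_time_s = 3.2e-3

overhead_fraction = python_overhead_s / (python_overhead_s + device_time_s)
print(f"Python share of wall time: {overhead_fraction:.3%}")
```

Even if the 5 µs assumption is off by an order of magnitude, the conclusion barely moves; the overhead only starts to matter when per-inference device time shrinks toward the marshalling cost.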
replies(1): >>41868158 #
noduerme ◴[] No.41868158[source]
I really wish Python weren't the language controlling all the C code. You need a controller in a scripting language that's easy to modify, but it's a rather hideous choice. It would be like choosing to build the world's largest social network in PHP or something. lol.
replies(2): >>41868287 #>>41873886 #
robertlagrant ◴[] No.41868287[source]
> it's a rather hideous choice

Why?