
486 points dbreunig | 4 comments
isusmelj ◴[] No.41863460[source]
I think the results show that, in general, the compute just isn't being used well. The CPU taking 8.4 ms while the GPU took 3.2 ms is a very small gap; I'd expect more like a 10x-20x difference here. I'd assume onnxruntime might be the issue. I think some hardware vendors just release the compute units without shipping proper support yet. Let's see how fast that changes.
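
To illustrate, here's a minimal sketch of how you'd compare execution providers with onnxruntime. The model path, float32 input, and which providers are available are all assumptions about your particular setup and build:

    # Rough sketch: time the same ONNX model on different execution
    # providers. "model.onnx" is a placeholder; assumes a float32 input.
    import time
    import numpy as np
    import onnxruntime as ort

    def bench(provider, runs=100):
        sess = ort.InferenceSession("model.onnx", providers=[provider])
        inp = sess.get_inputs()[0]
        # Substitute 1 for any dynamic dimensions in the input shape.
        dims = [d if isinstance(d, int) else 1 for d in inp.shape]
        x = np.random.rand(*dims).astype(np.float32)
        sess.run(None, {inp.name: x})  # warm-up
        t0 = time.perf_counter()
        for _ in range(runs):
            sess.run(None, {inp.name: x})
        return (time.perf_counter() - t0) / runs * 1e3  # mean ms

    for p in ["CPUExecutionProvider", "DmlExecutionProvider",
              "QNNExecutionProvider"]:
        if p in ort.get_available_providers():
            print(p, f"{bench(p):.2f} ms")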

Also, people often assume the reason for an NPU is "speed". That's not correct. The whole point of an NPU is low power consumption. To focus on speed you'd need to get rid of the memory bottleneck, and then you end up designing your own ASIC with its own memory. The NPUs we see in most devices are part of the SoC, sitting alongside the CPU to offload AI computations. It would be interesting to run this benchmark in an infinite loop on all three devices (CPU, NPU, GPU) and measure power consumption. I'd expect the NPU to draw the least power and to come out best in terms of ops/watt.
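
A rough sketch of that measurement: pin each device with a sustained loop and read power from whatever your platform exposes (RAPL, a USB power meter, vendor tools), since onnxruntime itself won't report watts. Provider name, model, and input shape are placeholders:

    # Keep one device saturated while you sample power externally.
    # "model.onnx" and the CNN-shaped input are placeholders.
    import time
    import numpy as np
    import onnxruntime as ort

    def sustained_ips(provider, seconds=30.0):
        sess = ort.InferenceSession("model.onnx", providers=[provider])
        inp = sess.get_inputs()[0]
        x = np.random.rand(1, 3, 224, 224).astype(np.float32)
        n, t0 = 0, time.perf_counter()
        while time.perf_counter() - t0 < seconds:
            sess.run(None, {inp.name: x})
            n += 1
        return n / seconds  # sustained inferences per second

    ips = sustained_ips("QNNExecutionProvider")  # run the meter during this
    watts = 2.1                                  # reading from your meter
    print(f"{ips:.1f} inf/s at {watts:.1f} W = {ips / watts:.1f} inf/s/W")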

replies(8): >>41863552 #>>41863639 #>>41864898 #>>41864928 #>>41864933 #>>41866594 #>>41869485 #>>41870575 #
AlexandrB ◴[] No.41863552[source]
> Also, people often assume the reason for an NPU is "speed". That's not correct. The whole point of an NPU is low power consumption.

I have a sneaking suspicion that the real real reason for an NPU is marketing. "Oh look, NVDA is worth $3.3T - let's make sure we stick some AI stuff in our products too."

replies(8): >>41863644 #>>41863654 #>>41865529 #>>41865968 #>>41866150 #>>41866423 #>>41867045 #>>41870116 #
pclmulqdq ◴[] No.41866423[source]
The correct way to make a true "NPU" is to 10x your memory bandwidth and feed a regular old multicore CPU with SIMD/vector instructions (and maybe a matrix multiply unit).

Most of these small NPUs are actually made for CNNs and other models where "stream data through the weights" applies: the weights fit on-chip and you stream activations past them, so they get a huge speedup there. When you instead have to stream the weights across the data (any LLM or other large model), you are almost certain to be bound by memory bandwidth.
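
You can see the bound on the back of an envelope; a sketch with made-up but typical numbers:

    # Why LLM-style matvecs are bandwidth-bound: each weight byte is
    # touched once per token, so arithmetic intensity is tiny.
    d = 4096                    # hidden size (illustrative)
    layers = 32
    bytes_per_weight = 2        # fp16
    # Very rough: ~4 big matmuls per layer.
    weights = layers * 4 * d * d * bytes_per_weight     # total weight bytes
    flops = 2 * (weights // bytes_per_weight)           # 2 FLOPs per weight
    intensity = flops / weights                         # FLOPs per byte moved
    print(f"{weights/1e9:.1f} GB of weights, {intensity:.1f} FLOP/byte")
    # At ~1 FLOP/byte, tokens/s is capped by bandwidth, not TOPS:
    bw = 120e9                  # e.g. ~120 GB/s of LPDDR5 on an SoC
    print(f"max ~{bw / weights:.0f} tokens/s regardless of compute")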

replies(2): >>41866896 #>>41871310 #
bee_rider ◴[] No.41866896[source]
I’m sure we’ll get GPNPU. Low-precision matvecs could be fun to play with.
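
In the meantime it's easy to emulate the idea in NumPy; a toy int8 matvec with per-row weight scales, not any particular NPU's actual format:

    # Toy int8 matvec: int8 multiplies accumulated in int32, dequantized
    # at the end -- the shape of what a low-precision matvec unit does.
    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal((256, 256)).astype(np.float32)
    x = rng.standard_normal(256).astype(np.float32)

    # Per-row scales for the weights, one scale for the activations.
    w_scale = np.abs(W).max(axis=1, keepdims=True) / 127.0
    x_scale = np.abs(x).max() / 127.0
    Wq = np.round(W / w_scale).astype(np.int8)
    xq = np.round(x / x_scale).astype(np.int8)

    y = (Wq.astype(np.int32) @ xq.astype(np.int32)) * (w_scale.ravel() * x_scale)
    print(np.max(np.abs(y - W @ x)))  # quantization error
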
replies(1): >>41868292 #
1. touisteur ◴[] No.41868292[source]
SHAVE from Movidius was fun, before Intel bought them out.
replies(1): >>41870749 #
2. hedgehog ◴[] No.41870749[source]
Did they become un-fun? There are a bunch on the new Intel CPUs.
replies(1): >>41872974 #
3. touisteur ◴[] No.41872974[source]
Most of the toolchain got hidden behind OpenVINO, and there was no new hardware released for years. Keem Bay was 'next year' for years. I have some DSP code targeting it that I can't use anymore. Has Intel actually released new SHAVE cores, with an actual dev environment? I'm curious.
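
For context, as far as I can tell the only public route to the hardware now is treating it as an opaque OpenVINO device; a sketch assuming a recent openvino package, with "model.xml" as a placeholder:

    # The NPU is only reachable as an opaque OpenVINO device; there is
    # no public SHAVE-level toolchain anymore.
    import openvino as ov

    core = ov.Core()
    print(core.available_devices)        # e.g. ['CPU', 'GPU', 'NPU']
    model = core.read_model("model.xml")
    compiled = core.compile_model(model, "NPU")
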
replies(1): >>41873250 #
4. hedgehog ◴[] No.41873250{3}[source]
The politics behind the software issues are complex. At least from the public presentations, the new SHAVE cores aren't much changed beyond bigger vector units. I don't know what it would take to make a lower-level SDK available again, but it sure seems like it would be useful.