
486 points | dbreunig
isusmelj | No.41863460
I think the results show that, in general, the compute isn't being used well. The CPU taking 8.4 ms while the GPU took 3.2 ms is a very small gap; I'd expect more like a 10x-20x difference here. I'd assume onnxruntime might be the issue. I think some hardware vendors just release the compute units without shipping proper support yet. Let's see how fast that changes.
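
A minimal sketch of this kind of comparison using onnxruntime's execution providers (untested; the model path, input shape, and which providers your ORT build actually ships, e.g. QNNExecutionProvider for Qualcomm NPUs or DmlExecutionProvider for the GPU on Windows, are my assumptions, not details from the article):

    import time
    import numpy as np
    import onnxruntime as ort

    def mean_latency_ms(provider, runs=100):
        # Build a session pinned to one execution provider.
        sess = ort.InferenceSession("model.onnx", providers=[provider])
        name = sess.get_inputs()[0].name
        x = np.random.rand(1, 3, 224, 224).astype(np.float32)
        sess.run(None, {name: x})  # warm-up, excluded from timing
        start = time.perf_counter()
        for _ in range(runs):
            sess.run(None, {name: x})
        return (time.perf_counter() - start) / runs * 1000

    for provider in ["CPUExecutionProvider", "DmlExecutionProvider",
                     "QNNExecutionProvider"]:
        print(provider, f"{mean_latency_ms(provider):.1f} ms")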

Also, people often assume the reason for an NPU is speed. That's not correct. The whole point of an NPU is low power consumption. To focus on speed you'd need to get rid of the memory bottleneck, and then you'd end up designing your own ASIC with its own memory. The NPUs we see in most devices are part of the SoC, sitting alongside the CPU to offload AI computations. It would be interesting to run this benchmark in an infinite loop on all three devices (CPU, NPU, GPU) and measure power consumption; I'd expect the NPU to draw the least and to come out best in terms of ops/watt. (A sketch of that loop is below.)
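
Roughly like this, reusing the onnxruntime setup from above. There's no portable way to read power draw from Python, so read_package_watts() is a hypothetical stand-in for whatever meter you have (RAPL on Linux, powermetrics on macOS, or an external meter):

    import time
    import numpy as np
    import onnxruntime as ort

    def inferences_per_watt(provider, seconds=60):
        sess = ort.InferenceSession("model.onnx", providers=[provider])
        name = sess.get_inputs()[0].name
        x = np.random.rand(1, 3, 224, 224).astype(np.float32)
        runs, watt_samples = 0, []
        deadline = time.time() + seconds
        while time.time() < deadline:  # the proposed loop, bounded for the test
            sess.run(None, {name: x})
            runs += 1
            watt_samples.append(read_package_watts())  # hypothetical power readout
        avg_watts = sum(watt_samples) / len(watt_samples)
        return (runs / seconds) / avg_watts  # inferences per second per watt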

kmeisthax | No.41863639
> I think some hardware vendors just release the compute units without shipping proper support yet

This is Nvidia's moat. Everything has optimized kernels for CUDA, and maybe for Apple Accelerate (which was the only way to touch Apple's CPU matrix unit before M4, and is still the only way to touch the NPU at all). If you want to use anything else, either prepare to upstream patches to your ML framework of choice, or prepare to write your own training and inference code.
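
To make that concrete: the same framework-level op only runs fast where an optimized backend exists, and on Nvidia it bottoms out in closed-source kernels. A sketch in PyTorch (device names are PyTorch's standard ones, nothing specific to the article):

    import torch

    x = torch.randn(1024, 1024)

    # The same matmul dispatches to different vendor kernels: cuBLAS on
    # an Nvidia build, Metal shaders via the "mps" backend on Apple
    # Silicon, and a plain CPU fallback everywhere else.
    if torch.cuda.is_available():
        y = x.to("cuda") @ x.to("cuda")
    elif torch.backends.mps.is_available():
        y = x.to("mps") @ x.to("mps")
    else:
        y = x @ x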

noduerme | No.41868138
I'm not sure why this is a moat. Isn't it just a matter of translating CUDA to some other instruction set? If AMD or someone else makes cheaper hardware that does the same thing, it doesn't seem like a stretch for them to release a PyTorch patch or whatever.
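
That's roughly what AMD does at the API level: ROCm builds of PyTorch back the torch.cuda namespace with HIP, so unmodified "CUDA" Python code runs on AMD GPUs. A small sketch; the catch, per the replies below, is everything underneath that API:

    import torch

    # On a ROCm build of PyTorch, torch.cuda is backed by HIP, so this
    # code is unchanged on AMD hardware; the matmul hits rocBLAS there
    # and cuBLAS on Nvidia.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(4096, 4096, device=device)
    y = x @ x
    print(torch.version.hip)  # a version string on ROCm builds, None on CUDA builds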
david-gpu | No.41868786
Most of the computation happens inside Nvidia's proprietary libraries (cuBLAS, cuDNN, and the like), not in open-source CUDA code. And if you saw what goes on inside those libraries, I think you would agree that it is a substantial moat.
theGnuMe | No.41870764
There are clean-room approaches, like AMD's and SCALE.
caeril | No.41871154
Geohot has multiple (and ongoing) rants about the sheer instability of AMD's RDNA3 drivers. Lisa Su engaged directly with him on this, and she didn't seem to give a shit about his problems.

AMD is not taking ML applications seriously, outside of their marketing hype.

fvv | No.41874128
RDNA3 is not CDNA. Geohot's driver complaints concern consumer RDNA3 cards; AMD's serious ML push (ROCm on Instinct accelerators) targets the CDNA line.