486 points dbreunig | 2 comments
1. cloudhan No.41901869
OK, I am one of the developers on the onnxruntime team. I previously worked on the ROCm EP and have now been transferred to the QNN EP. The following is purely a dev rant and the opinions are mine.

So ROCm already sucks whereas QNN sucks even harder!

The conclusion here is that NVIDIA knows how to make software that just works. AMD makes software that might work. Qualcomm, however, doesn't know the first thing about making useful software.

The dev experience with Qualcomm is another level of disaster entirely. Their tools and APIs return absolutely zero useful information about what error you are getting, just an error code that you have to grep for in the SDK's include headers. To debug an error code, you need strace to get the internal error string on the device. Their profiler merely gives you a trace that cannot be associated back to the original computation logic, with very high stddev on the reported runtimes. Their docs website is not indexed by the MF search engine, let alone by LLMs, so if you have any question, good luck!

So if you don't have a reason to use QNN, just don't use it (or any other NPU, for that matter).

Back to the benchmark script. There are a lot of flaws, as far as I can see:

1. The session is not warmed up and the iteration count is too small.

2. The ONNX graph is too small; I suspect the onnxruntime overhead cannot be ignored in this case. Try stacking more GEMMs in the graph instead of naively increasing the iteration count.

3. The "htp_performance_mode": "sustained_high_performance" setting might give lower perf compared to "burst" mode.
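The warm-up and iteration-count fixes from point 1 can be sketched as a small timing harness. This is a minimal sketch, not the original benchmark script; the `run_once` callable and the commented-out session setup are assumptions standing in for whatever the real script runs.

```python
import statistics
import time

def benchmark(run_once, warmup=10, iters=200):
    """Time a callable the way an inference session should be timed:
    discard warm-up runs (graph finalization, allocator growth, DVFS
    ramp-up), then report median and spread over many iterations."""
    for _ in range(warmup):
        run_once()                      # warm-up runs: not timed
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples), statistics.stdev(samples)

# In a real script, run_once would wrap the session call, e.g.:
#   session = onnxruntime.InferenceSession("model.onnx", providers=[...])
#   run_once = lambda: session.run(None, {"input": x})
# Here we use a dummy workload so the sketch is self-contained.
med, sd = benchmark(lambda: sum(range(10_000)))
```

Reporting the median together with the spread also makes the high-stddev problem mentioned above visible instead of hiding it in a single averaged number.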

A more reliable way to benchmark might be to dump the context binary[1] and the context inputs[2], then run them with qnn-net-run to get rid of the onnxruntime overhead entirely.

[1]: https://github.com/cloudhan/misc-nn-test-driver/blob/main/qn... [2]: https://github.com/cloudhan/misc-nn-test-driver/blob/main/qn...

replies(1): >>41901956 #
2. cloudhan No.41901956
NPU folks oftentimes say

> it's not enough time to get new silicon designs specifically for <blahblah>

where <blahblah> stands for whatever model architecture has just caused a paradigm shift.

When you need new silicon for a new model, you have already lost.