AI PCs Aren't Good at AI: The CPU Beats the NPU

We ran qprof (a Qualcomm NPU profiler) on this benchmark. The profiling results indicate that the workload was dispatched to the vector cores (HVX) instead of the tensor core (HMX), which provides the vast majority of the compute power in the NPU (my back-of-the-napkin math suggests that HMX is about 30x stronger than HVX).

The workload is also relatively small, so the fixed overhead of input/output quantization and dequantization and of NCHW-to-NHWC layout conversion dominates the run time and leaves the hardware underutilized. Padding the weights and inputs to a multiple of 64 would also help performance.
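To make the padding suggestion concrete, here is a minimal sketch in plain Python. The helper names (`round_up`, `pad_matrix`) are made up for illustration and are not part of any Qualcomm SDK. I'm also assuming that both dimensions of a 2-D matrix get padded with zeros; which dimensions actually need to be multiples of 64 on the real hardware is something to check against the vendor docs.

```python
def round_up(n, multiple=64):
    """Round n up to the nearest multiple (64 is the figure from the comment above)."""
    return ((n + multiple - 1) // multiple) * multiple


def pad_matrix(rows, multiple=64, fill=0):
    """Zero-pad a 2-D list of lists so both dimensions are multiples of `multiple`.

    Hypothetical helper: assumes a rectangular, non-empty matrix.
    """
    n_rows, n_cols = len(rows), len(rows[0])
    padded_rows, padded_cols = round_up(n_rows, multiple), round_up(n_cols, multiple)
    # Extend each existing row with fill values, then append all-fill rows.
    out = [list(r) + [fill] * (padded_cols - n_cols) for r in rows]
    out += [[fill] * padded_cols for _ in range(padded_rows - n_rows)]
    return out


padded = pad_matrix([[1, 2, 3], [4, 5, 6]])
print(len(padded), len(padded[0]))
```

Zero padding leaves a matrix product unchanged in the original region as long as both operands are padded consistently along the shared dimension, so the extra rows and columns can simply be sliced off the result.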

Edit: Link to the profiling graph https://imgur.com/a/2OKR93e

Estimated HVX compute capability: 4 * 2 * 1.43 * 1024 / 8 = 1464 GOPS, or about 1.46 TOPS in int8,

in which 4 is the number of vector cores

2 is the number of operations per cycle

1.43 GHz is the HVX frequency

1024 bits is the vector register width

8 bits is the precision
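The same arithmetic as a small Python sketch, in case anyone wants to plug in different numbers. The function name `peak_tops` is made up for illustration, the inputs are the figures listed above, and the 30x HMX ratio is my rough estimate from earlier, not a measured value.

```python
def peak_tops(units, ops_per_cycle, freq_ghz, register_bits, precision_bits):
    """Theoretical peak throughput in TOPS for a SIMD unit (hypothetical helper)."""
    # Number of elements processed per vector register at this precision.
    lanes = register_bits // precision_bits
    # GHz * ops gives giga-ops per second; divide by 1000 for tera-ops.
    giga_ops = units * ops_per_cycle * freq_ghz * lanes
    return giga_ops / 1000.0


hvx_tops = peak_tops(units=4, ops_per_cycle=2, freq_ghz=1.43,
                     register_bits=1024, precision_bits=8)
# Rough HMX figure implied by the ~30x back-of-the-napkin ratio (an estimate).
implied_hmx_tops = 30 * hvx_tops

print(f"HVX: {hvx_tops:.2f} TOPS (int8)")
print(f"HMX (implied by ~30x): {implied_hmx_tops:.1f} TOPS")
```

This is a theoretical peak only. It ignores memory bandwidth, the quantization overhead, and the layout conversion, so real throughput will be lower.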