486 points dbreunig | 1 comment
NebulaTrek ◴[] No.41872326[source]
We ran qprof (a Qualcomm NPU profiler) on this benchmark. The profiling results indicate that the workload was dispatched to the vector cores (HVX) instead of the tensor core (HMX), which provides the vast majority of the compute power in the NPU (my back-of-the-napkin math suggests that HMX is 30x stronger than HVX).

The workload is relatively small, so the hardware is underutilized because of the overhead of input/output quantization-dequantization and NCHW-NHWC layout mapping. Padding the weights and inputs to a multiple of 64 would also help performance.
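
To make the padding suggestion concrete, here is a minimal sketch of rounding tensor dimensions up to the next multiple of 64. The helper names `round_up` and `padded_shape` are hypothetical, and padding every dimension is an assumption; which dimensions actually need padding depends on the operator and the runtime.

```python
def round_up(n, multiple=64):
    """Round n up to the nearest multiple (64 by default)."""
    return ((n + multiple - 1) // multiple) * multiple


def padded_shape(shape, multiple=64):
    """Return the shape with every dimension rounded up.

    Assumption: all dimensions are padded. A real deployment may only
    need to pad some of them (e.g. the channel dimension).
    """
    return tuple(round_up(dim, multiple) for dim in shape)


if __name__ == "__main__":
    original = (100, 200)  # hypothetical weight matrix shape
    print(original, "->", padded_shape(original))
```

The padded region would be filled with zeros so the result of the original computation is unchanged, and the extra rows and columns sliced off afterwards.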

Edit: Link to the profiling graph https://imgur.com/a/2OKR93e

Estimated HVX compute capability: 4 * 2 * 1.43 * 1024 / 8 = 1.46 TOPS in int8,

in which 4 is the number of vector cores

2 is the number of operations per cycle

1.43 GHz is the HVX frequency

1024 bits is the vector register width

8 bits is the precision
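
The same napkin estimate as a small script, in case anyone wants to plug in other numbers. The function name `hvx_peak_tops` is made up for illustration; the default values are the ones listed above, and the result is a theoretical peak, not a measured figure.

```python
def hvx_peak_tops(cores=4, ops_per_cycle=2, freq_ghz=1.43,
                  vector_bits=1024, precision_bits=8):
    """Theoretical peak HVX throughput in TOPS (napkin math only)."""
    # Number of elements that fit in one vector register (128 for int8).
    lanes = vector_bits // precision_bits
    # GHz * ops per cycle * lanes * cores gives giga-operations per second.
    giga_ops = cores * ops_per_cycle * freq_ghz * lanes
    return giga_ops / 1000.0  # GOPS -> TOPS


if __name__ == "__main__":
    hvx = hvx_peak_tops()
    print(f"HVX peak: {hvx:.2f} TOPS (int8)")
    # Applying the rough 30x HMX-vs-HVX factor from the comment above;
    # this is an extrapolation, not a spec-sheet number.
    print(f"HMX estimate at 30x: {hvx * 30:.1f} TOPS")
```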

replies(1): >>41873769 #
1. NebulaTrek ◴[] No.41873769[source]
The formula was formatted incorrectly; it should be 4 * 2 * 1.43 * 1024 / 8.