Deploying a model on an NPU requires significant profile-based optimization. Taking a model that runs fine on the CPU but hasn't been optimized for the NPU usually leads to disappointing results.
replies(2):
The reason it doesn't seem that way is that the CPU is so fast we often bottleneck on I/O first. For compute-bound workloads like inference, though, it really does matter.
The difference between gcc -O0 and -O2 is a HUGE performance gain. We don't really have anything to auto-magically do this for models yet. Compilers are intimately familiar with x86.
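To make that concrete, here's a toy kernel (my own illustration, not anything from the thread) where the -O0/-O2 gap is easy to reproduce: at -O0 gcc spills the accumulator to memory every iteration, while at -O2 it keeps it in a register and typically vectorizes the loop.

    /* dot.c -- build both and compare:
     *   gcc -O0 dot.c -o dot_o0 && time ./dot_o0
     *   gcc -O2 dot.c -o dot_o2 && time ./dot_o2
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define N (1 << 24)

    int main(void) {
        float *a = malloc(N * sizeof *a);
        float *b = malloc(N * sizeof *b);
        for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

        float sum = 0.0f;
        /* At -O0 every iteration goes through memory; at -O2 the
         * accumulator stays in a register and the loop may vectorize. */
        for (int i = 0; i < N; i++)
            sum += a[i] * b[i];

        printf("%f\n", sum);
        free(a); free(b);
        return 0;
    }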
Having cache-friendly memory access patterns is perhaps the biggest one. Automatic vectorization is also still not quite there, so where there's a severe bottleneck and the workload is vectorizable, doing it manually can still considerably improve performance (sketch below).
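As a sketch of both points (a generic example, not from the original posts): a row-major matmul where swapping the loop order turns a stride-N access into unit-stride, which is both cache-friendly and exactly the shape that auto- or hand-vectorization needs.

    /* matmul.c -- same result, very different memory behavior. */
    #include <stdio.h>
    #include <string.h>

    #define N 512

    static float A[N][N], B[N][N], C[N][N];

    /* ijk order: the inner loop strides through B by N floats,
     * missing cache on nearly every access. */
    static void matmul_ijk(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                float acc = 0.0f;
                for (int k = 0; k < N; k++)
                    acc += A[i][k] * B[k][j];  /* B[k][j]: stride-N */
                C[i][j] = acc;
            }
    }

    /* ikj order: the inner loop reads B and writes C contiguously
     * (unit stride), so it's cache-friendly and easy to vectorize. */
    static void matmul_ikj(void) {
        memset(C, 0, sizeof C);
        for (int i = 0; i < N; i++)
            for (int k = 0; k < N; k++) {
                float a = A[i][k];
                for (int j = 0; j < N; j++)
                    C[i][j] += a * B[k][j];
            }
    }

    int main(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) { A[i][j] = 1.0f; B[i][j] = 1.0f; }
        matmul_ikj();            /* swap in matmul_ijk() and time it */
        printf("%f\n", C[0][0]); /* expect 512.000000 either way */
        return 0;
    }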