I always thought that the main point of NPUs is energy efficiency (and being able to run ML models without taking over all of the computer's resources, which makes it practical to integrate ML applications into the OS itself without disturbing the user or their workflow) rather than being exceptionally fast. At least that has been my experience running Stable Diffusion on Macs. It's similar to other specialised hardware like media encoders: they are not necessarily faster than a CPU if you throw a dozen or more CPU cores at the task, but they draw a minuscule fraction of the power.
replies(1):