eightysixfour No.41863546
I thought the purpose of these things was not to be fast, but to be able to run small models with very little power usage? I have a newer AMD laptop with an NPU, and my power usage doesn't change when using the video effects that supposedly run on it, but it goes up when using the Nvidia Studio effects.

It seems like the NPUs are for very optimized models that do small tasks, like eye contact, background blur, autocorrect models, transcription, and OCR. In particular, on Windows, I assumed they were running the full screen OCR (and maybe embeddings for search) for the rewind feature.

godelski No.41864828

  > but to be able to run small models with very little power usage
yes

But first, I should say you probably don't want to be programming these things from Python. I doubt you'll get good performance there, especially since the hardware is new enough that optimizations haven't been ported over well (even using something like TensorRT is not going to be as fast as writing it from scratch, and Nvidia is throwing a lot of manpower at that -- for good reason! But it sure as hell will get close and save you a lot of writing time).

They are, like you say, generally optimized for doing repeated similar tasks. That's also where I suspect some of the info gathered here is inaccurate.

  (I have not used these NPU chips, so what follows is educated guessing, but I'll explain my reasoning. Please correct me if I've made an error.)
Second, I don't trust the timing here. I'm certain the CUDA timing (at the end) is wrong, because the code as written doesn't measure it properly. Timing is surprisingly hard to get right. I suspect the advertised operation counts only cover work done directly on the NPU, while the OP's NPU and GPU timings would also include CPU-side work[0]. That said, the docs ship benchmarking tools, so maybe they're doing something similar. I'd be interested to know the variance and how the numbers hold up after warmup runs. They do identify IO as an issue, which supports this reading.

Third, their data is improperly formatted.

  MATRIX_COUNT, MATRIX_A, MATRIX_B, MATRIX_K = (6, 1500, 1500, 256)
  INPUT0_SHAPE = [1, MATRIX_COUNT, MATRIX_A, MATRIX_K]
  INPUT1_SHAPE = [1, MATRIX_COUNT, MATRIX_K, MATRIX_B]
  OUTPUT_SHAPE = [1, MATRIX_COUNT, MATRIX_A, MATRIX_B]
You want "channels last" here. I suspected this (do this in pytorch too!) and the docs they link confirm.

1500 is also an odd choice and could be the cause of extra cache misses. I wonder how things would change with 1536, 2048, or even 256. You might (probably) even want to look smaller, since downscaling is a common preprocessing step: your models are not processing full-resolution images, and if you're going to optimize an architecture for these devices, you're going to use that shape information. Shape optimization is actually pretty important in ML[1]. I suspect this is quite a large miss.
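
If you want to see the shape effect yourself on a GPU, here's a quick sketch (reusing the event-timing pattern from [0]; the sizes are just ones I'd try, not the article's exact setup):

    import torch

    for n in (256, 1500, 1536, 2048):
        a = torch.randn(6, n, 256, device="cuda")
        b = torch.randn(6, 256, n, device="cuda")
        for _ in range(5):  # warmup so we're not timing allocation/compilation
            _ = a @ b
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        c = a @ b
        end.record()
        torch.cuda.synchronize()
        print(n, start.elapsed_time(end), "ms")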

Fourth, from a quick look at the docs I think the setup is improper. Under "Model Workflow" they say they want data in 8- or 16-bit *float*. I'm not going to dig too deep, but note that there are different 16-bit float formats (e.g. PyTorch's bfloat16 is not the same as torch.half / torch.float16). Mixed precision is still a confusing subject, and if you're hitting issues like these it is worth looking into. I very much suggest not just running a standard quantization procedure and calling it a day (start there! But don't end there unless the result is "good enough", which doesn't seem to be the case here).
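
To see why the float format matters (again PyTorch, just for illustration):

    import torch

    for dt in (torch.float16, torch.bfloat16):
        info = torch.finfo(dt)
        print(dt, "max:", info.max, "eps:", info.eps)
    # float16:  max ~65504,   eps ~9.8e-4  (more mantissa bits, tiny range)
    # bfloat16: max ~3.4e38,  eps ~7.8e-3  (float32's range, less precision)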

FWIW, I still think these results are useful, but they need to be improved upon. This type of stuff is surprisingly complex, and a large amount of that is due to things being new and many of the details still being worked out. Remember that when you're comparing against CPUs or GPUs (especially CUDA), those have had hundreds of thousands of man-hours put into them, and at least tens of thousands into the high-level language libraries (e.g. in Python) that sit on top. I don't think these devices are ready for the average user who just wants to work with them from their favorite language's abstraction level, but they're pretty useful if you're willing to work close to the metal.

[0] I don't know what the right timing setup is for the NPU, but I do this in PyTorch a lot, so here's the boilerplate:

    import torch

    # Assumes `model`, `rounds`, `warmup`, `batch_size`, and `data_shape` are already defined
    times = torch.empty(rounds)
    # Dummy data; you don't have to use random data, but keep the shape realistic
    input_data = torch.randn((batch_size, *data_shape), device="cuda")
    # Do some warmup passes first: there's background work (IO, allocation, kernel
    # compilation) we don't want to measure. Remove the warmup loop and look at the
    # distribution of times if you want to see that effect.
    # Make sure you assign the output to a variable (a write) or else this won't do anything
    for _ in range(warmup):
        data = model(input_data)
    for i in range(rounds):
        starter = torch.cuda.Event(enable_timing=True)
        ender = torch.cuda.Event(enable_timing=True)
        starter.record()
        data = model(input_data)
        ender.record()
        torch.cuda.synchronize()  # wait for the GPU to finish before reading the events
        times[i] = starter.elapsed_time(ender) / 1000  # elapsed_time is in ms
    total_time = times.sum()
The reason we do it this way: if we just wrap the model call with a wall-clock timer, we're measuring CPU time, but GPU operations are asynchronous, so you can get deceptively fast (or slow) numbers.
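
For contrast, here's the naive version that looks right but mostly measures how fast the CPU can queue work (illustrative only, using the same names as above):

    import time

    t0 = time.time()
    data = model(input_data)  # returns as soon as the kernels are queued on the GPU
    t1 = time.time()          # without torch.cuda.synchronize(), t1 - t0 is mostly CPU overhead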

[1] https://www.thonking.ai/p/what-shapes-do-matrix-multiplicati...