486 points dbreunig | 40 comments
1. eightysixfour ◴[] No.41863546[source]
I thought the purpose of these things was not to be fast, but to be able to run small models with very little power usage? I have a newer AMD laptop with an NPU, and my power usage doesn't change when using the video effects that supposedly run on it, but it goes up when using the Nvidia Studio effects.

It seems like the NPUs are for very optimized models that do small tasks, like eye contact, background blur, autocorrect models, transcription, and OCR. In particular, on Windows, I assumed they were running the full-screen OCR (and maybe embeddings for search) for the Recall feature.

replies(7): >>41863632 #>>41863779 #>>41863821 #>>41863886 #>>41864628 #>>41864828 #>>41869772 #
2. conradev ◴[] No.41863632[source]
That is my understanding as well: low power and low latency.

You can see this in action when evaluating a CoreML model on a macOS machine: the ANE takes half as long as the GPU, which takes half as long as the CPU (the actual factors are model dependent).
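
A minimal sketch of how you might compare the compute units yourself with coremltools (the "model.mlpackage" path, the input name, and the input shape below are placeholders, not from the article):

    import time
    import numpy as np
    import coremltools as ct

    # Pin the same model to different compute units and time predictions
    units = {
        "cpu": ct.ComputeUnit.CPU_ONLY,
        "gpu": ct.ComputeUnit.CPU_AND_GPU,
        "ane": ct.ComputeUnit.CPU_AND_NE,
    }
    x = {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}  # placeholder input
    for name, unit in units.items():
        model = ct.models.MLModel("model.mlpackage", compute_units=unit)
        model.predict(x)  # warm up: the first call pays the load/compile cost
        start = time.perf_counter()
        for _ in range(100):
            model.predict(x)
        print(name, (time.perf_counter() - start) / 100)

Which unit actually wins, and by how much, depends on the model, as above.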

replies(1): >>41863665 #
3. nickpsecurity ◴[] No.41863665[source]
To take half as long, doesn’t it have to perform twice as fast? Or am I misreading your comment?
replies(2): >>41863726 #>>41865127 #
4. eightysixfour ◴[] No.41863726{3}[source]
No, you can have latency that is independent of compute performance. The CPU/GPU may have other tasks, so the work has to wait for existing threads to finish, for clocks to ramp up, or for slower memory paths, etc.

If you and I have the same calculator, but I'm working on a set of problems and you're not, and we're both asked to do some math, it may take me longer to return an answer, even though the instantaneous speed of the math is the same.

replies(1): >>41863792 #
5. boomskats ◴[] No.41863779[source]
That's especially true because yours is a Xilinx FPGA. The one they've just attached to the latest-gen mobile Ryzens is 5x more capable too.

AMD are doing some fantastic work at the moment; they just don't seem to be shouting about it. This one is particularly interesting: https://lore.kernel.org/lkml/DM6PR12MB3993D5ECA50B27682AEBE1...

edit: not an FPGA. TIL. :'(

replies(5): >>41863852 #>>41863876 #>>41864048 #>>41864435 #>>41865733 #
6. refulgentis ◴[] No.41863792{4}[source]
In isolation, makes sense.

Wouldn't it be odd for OP to present examples that are the opposite of their claim, just to get us thinking "well, the CPU is busy"?

Curious for their input.

7. refulgentis ◴[] No.41863821[source]
You're absolutely right IMO, given what I heard when launching on-device speech recognition on Pixel and, after leaving Google, what I see from e.g. Apple Neural Engine vs. CPU when running ONNX stuff.

I'm a bit suspicious of the article's specific conclusion, because it is Qualcomm's ONNX, and it may be out of date. Also, Android loved talking shit about Qualcomm software engineering.

That being said, it's directionally correct, insomuch as consumer hardware AI acceleration claims are near-universally BS unless you're A) writing 1P software or B) someone in the 1P really wants you to take advantage.

replies(1): >>41864564 #
8. errantspark ◴[] No.41863852[source]
Wait sorry back up a bit here. I can buy a laptop that has a daughter FPGA in it? Does it have GPIO??? Are we seriously building hardware worth buying again in 2024? Do you have a link?
replies(2): >>41863959 #>>41864293 #
9. beeflet ◴[] No.41863876[source]
It would be cool if most PCs had a general purpose FPGA that could be repurposed by the operating system. For example you could use it as a security processor like a TPM or as a bootrom, or you could repurpose it for DSP or something.

It just seems like this would be better in terms of firmware/security/bootloading, because you would be better able to fix it if an exploit gets discovered, and it would be leaner because different operating systems could implement their own stuff (for example, Linux might not want Pluton-style in-chip security, Windows might not want coreboot or Linux-based boot, and bare-metal applications can have a much simpler boot).

replies(1): >>41864617 #
10. eightysixfour ◴[] No.41863959{3}[source]
It isn't as fun as you think - they are set up for specific use cases and quite small. Here's a link to the software page: https://ryzenai.docs.amd.com/en/latest/index.html

The teeny-tiny "NPU," which is actually an FPGA, is 10 TOPS.

Edit: I've been corrected, not an FPGA, just an IP block from Xilinx.

replies(2): >>41864036 #>>41864062 #
11. eightysixfour ◴[] No.41863976[source]
The 7940HS shipped before Recall and doesn't support it because it is not performant enough, so that doesn't make sense.

I just gave you a use case: mine in particular uses it for background blur and eye contact filters with the webcam, and it uses essentially no power to do it. If I do the same filters with Nvidia Broadcast, the power usage is dramatically higher.

replies(2): >>41864053 #>>41864126 #
12. wtallis ◴[] No.41864036{4}[source]
It's not an FPGA. It's an NPU IP block from the Xilinx side of the company. It was presumably originally developed to be run on a Xilinx FPGA, but that doesn't mean AMD did the stupid thing and actually fabbed an FPGA fabric instead of properly synthesizing the design for their laptop ASIC. Xilinx involvement does not automatically mean it's an FPGA.
replies(2): >>41864064 #>>41864111 #
13. pclmulqdq ◴[] No.41864048[source]
It's not an FPGA. It's a VLIW DSP that Xilinx built to go into an FPGA-SoC to help run ML models.
replies(1): >>41864242 #
14. wtallis ◴[] No.41864053{3}[source]
Intel is also about to launch their first desktop processors with an NPU, one which falls far short of Microsoft's performance requirements for a "Copilot+ PC". It should still be plenty for webcam use.
15. boomskats ◴[] No.41864062{4}[source]
Yes, the one on the Ryzen 7000 chips like the 7840U isn't massive, but that's the last-gen model. The one they've just released with the HX 370 chip is estimated at 50 TOPS, which is better than the Qualcomm ARM flagship this post is about. That's a fivefold improvement in a single generation, which is pretty exciting.

~~And it's an FPGA~~ It's not an FPGA

replies(1): >>41864248 #
16. eightysixfour ◴[] No.41864064{5}[source]
Thanks for the correction, edited.
17. boomskats ◴[] No.41864111{5}[source]
Do you have any more reading on this? How come the XDNA drivers depend on Xilinx' XRT runtime?
replies(2): >>41864232 #>>41864296 #
18. moffkalast ◴[] No.41864126{3}[source]
I doubt there's no notable power draw; NPUs in general have always pulled a handful of watts, which should at least roughly match a modern CPU's idle draw. But it does seem odd that your power usage doesn't change at all; it might be always powered on or something.

Eye contact filters seem like a horrible thing, autocorrect won't work better than a dictionary with a tiny model, and I doubt these things can come even close to running Whisper for decent voice transcription. Background blur, alright, but that's kind of stretching it. I always figured Zoom/Teams do these things server-side anyway.

And alright, if it's not MS making them do it, then they're just chasing the fad themselves while also shipping subpar hardware. Not sure if that makes it better.

replies(2): >>41864298 #>>41866381 #
19. almostgotcaught ◴[] No.41864232{6}[source]
because XRT has a plugin architecture: XRT<-shim plugin<-kernel driver. The shims register themselves with XRT. The XDNA driver repo houses both the shim and the kernel driver.
replies(1): >>41864611 #
20. almostgotcaught ◴[] No.41864242{3}[source]
this is the correct answer. one of the compilers for this DSP is https://github.com/Xilinx/llvm-aie.
21. almostgotcaught ◴[] No.41864248{5}[source]
> And it's an FPGA.

nope it's not.

replies(1): >>41864925 #
22. dekhn ◴[] No.41864293{3}[source]
If you want GPIOs, you don't need (or want) an FPGA.

I don't know the details of your use case, but I work with low-level hardware driven by GPIOs, and after a bit of investigation I concluded that having direct GPIO access in a modern PC was not necessary or desirable compared to the alternatives.

replies(1): >>41866390 #
23. wtallis ◴[] No.41864296{6}[source]
It would be surprising and strange if AMD didn't reuse the software framework they've already built for doing AI when that IP block is instantiated on an FPGA fabric rather than hardened in an ASIC.
replies(1): >>41864630 #
24. Dylan16807 ◴[] No.41864298{4}[source]
> I doubt these things can come even close to running whisper for decent voice transcription.

Whisper runs almost realtime on a single core of my very old CPU. I'd be very surprised if it can't fit in an NPU.

25. numpad0 ◴[] No.41864435[source]
Sorry for an OT comment, but what is going on with that ASCII art!? The content fits within 80 columns just fine[1]; is it GPT generated?

1: https://pastebin.com/raw/R9BrqETR

26. kristianp ◴[] No.41864564[source]
1P?
replies(1): >>41864574 #
27. refulgentis ◴[] No.41864574{3}[source]
First party, i.e. Google/Apple/Microsoft
28. boomskats ◴[] No.41864611{7}[source]
Thanks, that makes sense.
29. walterbell ◴[] No.41864617{3}[source]
The Xilinx Artix 7-series PicoEVB fits in an M.2 Wi-Fi slot and has an OSS toolchain: http://www.enjoy-digital.fr/
30. ◴[] No.41864628[source]
31. boomskats ◴[] No.41864630{7}[source]
Well, I'm irrationally disappointed, but thanks. Appreciate the correction.
32. godelski ◴[] No.41864828[source]

  > but to be able to run small models with very little power usage
yes

But first, I should also say you probably don't want to be programming these things with Python. I doubt you'll get good performance there, especially as the newness means optimizations haven't been ported well. (Even using things like TensorRT is not going to be as fast as writing it from scratch, and Nvidia is throwing a lot of manpower at that -- for good reason! But it sure as hell will get close and save you a lot of writing time.)

They are, like you say, generally optimized for doing repeated similar tasks. That's also where I suspect some of the info gathered here is inaccurate.

  (I have not used these NPU chips so what follows is more educated guesses, but I'll explain. Please correct me if I've made an error)
Second, I don't trust the timing here. I'm certain the CUDA timing (at the end) is incorrect, as the code as written wouldn't time it properly; timing is surprisingly hard to get right. I suspect the advertised operations only count operations directly on the NPU, while OP would have included CPU operations in their NPU and GPU timings[0]. But the docs have benchmarking tools, so I suspect those do something similar. I'd be interested to know the variance and how this holds up after doing warmups. They do identify IO as an issue, which I think is evidence of this being a problem.

Third, their data is improperly formatted.

  MATRIX_COUNT, MATRIX_A, MATRIX_B, MATRIX_K = (6, 1500, 1500, 256)
  INPUT0_SHAPE = [1, MATRIX_COUNT, MATRIX_A, MATRIX_K]
  INPUT1_SHAPE = [1, MATRIX_COUNT, MATRIX_K, MATRIX_B]
  OUTPUT_SHAPE = [1, MATRIX_COUNT, MATRIX_A, MATRIX_B]
You want "channels last" here. I suspected this (do this in PyTorch too!) and the docs they link confirm it.
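
To illustrate what channels-last means at the memory level, here's a sketch in PyTorch using the shapes above (just an illustration of the layout, not the NPU API):

    import torch

    # NCHW-shaped tensor; channels_last keeps the same shape but reorders
    # the strides so channels become the innermost (contiguous) dimension.
    x = torch.randn(1, 6, 1500, 256)
    x_cl = x.to(memory_format=torch.channels_last)
    print(x.stride())     # (2304000, 384000, 256, 1) -> last dim is innermost
    print(x_cl.stride())  # (2304000, 1, 1536, 6)     -> channels are innermost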

1500 is also an odd choice and could be a cause of extra cache misses. I wonder how things would change with 1536, 2048, or even 256. You might (probably) even want to look smaller, since this might be a common preprocessing step: your models are not processing full-res images, and if you're going to optimize an architecture for models, you're going to use that shape information. Shape optimization is actually pretty important in ML[1], so I suspect this is quite a large miss.

Fourth, from a quick look at the docs I think the setup is improper. Under "Model Workflow" they mention that they want data in 8- or 16-bit *float*. I'm not going to look too deep, but note that there are different types of floats (e.g. PyTorch's bfloat16 is not the same as torch.half or torch.float16). Mixed precision is still a confusing subject, and if you're hitting issues like these it is worth looking at. I very much suggest not just running a standard quantization procedure and calling it a day (start there! But don't end there unless it's "good enough", which doesn't seem too meaningful here).
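
As a quick illustration (my own, not from the article) of why the specific float type matters: float16 keeps more precision, while bfloat16 keeps the float32 exponent range.

    import torch

    x = torch.tensor(1 / 3)
    print(x.to(torch.float16))     # ~0.3333 (10-bit mantissa)
    print(x.to(torch.bfloat16))    # ~0.3340 (7-bit mantissa)

    big = torch.tensor(70000.0)
    print(big.to(torch.float16))   # inf     (float16 max is ~65504)
    print(big.to(torch.bfloat16))  # 70144.  (range up to ~3.4e38)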

FWIW, I still do think these results are useful, but I think they need to be improved upon. This type of stuff is surprisingly complex, and a large amount of that is because things are new and many of the details are still being worked out. Remember that when you're comparing to things like CPU or GPU (especially CUDA), those have had hundreds of thousands of man-hours put into them, and at least tens of thousands into the high-level language libraries (i.e. Python) on top. I don't think these devices are ready for the average user who wants to work with them from their favorite language's abstraction level, but they're pretty useful if you're willing to work close to the metal.

[0] I don't know what the timing setup is for this device, but I do this in PyTorch a lot, so here's the boilerplate:

    import torch

    # Assumes `model` is already on the GPU and that `warmup`, `rounds`,
    # `batch_size`, and `data_shape` are defined elsewhere.
    times = torch.empty(rounds)
    # Dummy data; real inputs work just as well
    input_data = torch.randn((batch_size, *data_shape), device="cuda")
    # Do some warmups first: there are background actions (IO, lazy init,
    # clocking up) that we don't want to measure.
    #   You can skip the warmup and look at the distribution of times if
    #   you want to see this effect.
    # Make sure you save the output to a variable (a write) or else this
    # won't do anything useful.
    for _ in range(warmup):
        data = model(input_data)
    for i in range(rounds):
        starter = torch.cuda.Event(enable_timing=True)
        ender = torch.cuda.Event(enable_timing=True)
        starter.record()
        data = model(input_data)
        ender.record()
        # Wait for the GPU to finish before reading the event timers
        torch.cuda.synchronize()
        times[i] = starter.elapsed_time(ender) / 1000  # ms -> seconds
    total_time = times.sum()
The reason we do it this way is that if we just wrap the model call with a wall-clock timer, we're measuring CPU time; the GPU operations are asynchronous, so you can get deceptively fast (or slow) times.

[1] https://www.thonking.ai/p/what-shapes-do-matrix-multiplicati...

33. boomskats ◴[] No.41864925{6}[source]
I've just ordered myself a jump to conclusions mat.
replies(1): >>41865072 #
34. almostgotcaught ◴[] No.41865072{7}[source]
Lol, during grad school my advisor would frequently cut me off and try to jump to a conclusion while I was explaining something technical, and often enough he was wrong. So I really did buy him one (off eBay or something). He wasn't pleased.
35. conradev ◴[] No.41865127{3}[source]
The GPU is stateful and requires loading shaders and initializing pipelines before doing any work. That is where its latency comes from. It is also extremely power hungry.

The CPU is zero-latency to get started, but it isn't specialized for any one task and isn't massively parallel, which is why it takes even longer overall.

The NPU often has a simpler bytecode, with more complex things like matrix multiplication implemented in hardware, rather than having to instantiate a generic compute kernel on the GPU.

36. davemp ◴[] No.41865733[source]
Unfortunately FPGA fabric is ~2x less power efficient than equivalent ASIC logic at the same clock speeds last time I checked. So implementing general purpose logic on an FPGA is not usually the right option even if you don’t care about FMAX or transistor counts.
37. kalleboo ◴[] No.41866381{4}[source]
> I doubt these things can come even close to running whisper for decent voice transcription

https://github.com/ggerganov/whisper.cpp/pull/566

"The performance gain is more than x3 compared to 8-thread CPU"

And this is on the 3 year old M1 Pro

38. errantspark ◴[] No.41866390{4}[source]
I get a lot of use out of the PRUs on the BeagleboneBlack, I would absolutely get use out of an FPGA in a laptop.
replies(1): >>41866503 #
39. dekhn ◴[] No.41866503{5}[source]
It makes more sense to me to just use the BeagleboneBlack in concert with the FPGA. Unless you have highly specific compute or data movement needs that can't be satisfied over a USB serial link. If you have those needs, and you need a laptop, I guess an FPGA makes sense but that's a teeny market.
40. monkeynotes ◴[] No.41869772[source]
I believe that low power = cheaper tokens = more affordable and sustainable, and to me that is what a consumer will benefit from overall. Power-hungry GPUs seem to sit better in research, commerce, and enterprise.

The Nvidia killer would be chips and memory that are affordable enough to run a good enough model on a personal device, like a smartphone.

I think the future of this tech, if the general populace buys into LLMs being useful enough to pay a small premium for the device, is personal models that by their nature provide privacy. The amount of personal information folks unload on ChatGPT and the like is astounding. AI virtual girlfriend apps frequently get fed the darkest kinks, vulnerable admissions, and maybe even incriminating conversations, according to Redditors who are addicted to these things. This is all given away to no-name companies that stand up apps on the app store.

Google even states that if you turn Gemini history on then they will be able to review anything you talk about.

For complex token prediction that requires a bigger model, the personal device could switch to consulting a cloud LLM, but privacy really needs to be ensured for consumers.

I don't believe we need cutting edge reasoning, or party trick LLMs for day to day personal assistance, chat, or information discovery.