Deploying a model on an NPU requires significant profile-based optimization. Taking a model that runs fine on the CPU but hasn't been optimized for the NPU usually leads to disappointing results.
replies(2):
The reason it doesn't seem that way is that the CPU is so fast we often bottleneck on I/O first. For compute-bound workloads like inference, though, it really does matter.
The difference between gcc -O0 and -O2 is a HUGE performance gain. We don't really have anything to auto-magically do this for models yet. Compilers are intimately familiar with x86.
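To make that concrete, here's a toy kernel (my own illustration, not anything from the thread) where the -O0/-O2 gap is easy to reproduce: at -O0 gcc spills the accumulator to memory every iteration, while at -O2 it keeps it in a register and typically vectorizes the loop.

    /* dot.c -- build both and compare:
     *   gcc -O0 dot.c -o dot_o0 && time ./dot_o0
     *   gcc -O2 dot.c -o dot_o2 && time ./dot_o2
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define N (1 << 24)

    int main(void) {
        float *a = malloc(N * sizeof *a);
        float *b = malloc(N * sizeof *b);
        for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

        float sum = 0.0f;
        /* At -O0 every iteration goes through memory; at -O2 the
         * accumulator stays in a register and the loop may vectorize. */
        for (int i = 0; i < N; i++)
            sum += a[i] * b[i];

        printf("%f\n", sum);
        free(a); free(b);
        return 0;
    }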
Having cache-friendly memory access patterns is perhaps the biggest one. Automatic vectorization is also still not quite there, so where there's a severe bottleneck and the workload is vectorizable, doing it manually can still considerably improve performance (sketch below).
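As a sketch of both points (a generic example, not from the original posts): a row-major matmul where swapping the loop order turns a stride-N access into unit-stride, which is both cache-friendly and exactly the shape that auto- or hand-vectorization needs.

    /* matmul.c -- same result, very different memory behavior. */
    #include <stdio.h>
    #include <string.h>

    #define N 512

    static float A[N][N], B[N][N], C[N][N];

    /* ijk order: the inner loop strides through B by N floats,
     * missing cache on nearly every access. */
    static void matmul_ijk(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                float acc = 0.0f;
                for (int k = 0; k < N; k++)
                    acc += A[i][k] * B[k][j];  /* B[k][j]: stride-N */
                C[i][j] = acc;
            }
    }

    /* ikj order: the inner loop reads B and writes C contiguously
     * (unit stride), so it's cache-friendly and easy to vectorize. */
    static void matmul_ikj(void) {
        memset(C, 0, sizeof C);
        for (int i = 0; i < N; i++)
            for (int k = 0; k < N; k++) {
                float a = A[i][k];
                for (int j = 0; j < N; j++)
                    C[i][j] += a * B[k][j];
            }
    }

    int main(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) { A[i][j] = 1.0f; B[i][j] = 1.0f; }
        matmul_ikj();            /* swap in matmul_ijk() and time it */
        printf("%f\n", C[0][0]); /* expect 512.000000 either way */
        return 0;
    }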