But regardless, 16 TOPs is no good for LLMs. Though there is a Ryzen AI demo that shows Llama 7B running on these at 8 tokens/sec. A sub-par experience for a sub-par LLM.
At release time, the windows driver for example included few video processing offloads used by Windows Frameworks used for example by MS Teams for background removal - so that such tasks use less battery on laptops and free up CPU/GPU for other tasks on desktop.
For higher end processing you can use the same AIE-ML coprocessors various chips available previously from Xilinx and now under AMD brand.
they're not the same - versal acaps (whatever you want to call them) have AIE1 arch while phoenix has AIE2 arch. there are significant differences between the two arches (local memory, bfloat16, etc.)
Essentially, AMD is making two tile designs optimized for slightly different computations and claims that they are going to offer both in Versal, but NPUs use exclusively the ML-optimized ones.
https://github.com/amd/RyzenAI-SW/blob/main/example/transfor...
What this should tell you is that "15 TOPs" is an irrelevant number in this benchmark. There are exactly two FLOPs per parameter. Loading the parameters takes more time than processing them.
There are people with less than 8GB of VRAM and they can't load these models into their GPU and end up with the exact same performance as on CPU. The 12tflops of the 3060 Ti 8GB are "no good" for LLMs, because the bottleneck for token generation is memory bandwidth.
My Ryzen 2700 gets 7 tokens per second at 50 GFLOPs. What does this tell you? The NPU can saturate the memory bandwidth of the system.
Now here is the gotcha: Have you tried inputting very large prompts? Because that is where the speedup is going to be extremely noticeable. Instead of waiting minutes on a 2000 token prompt, it will be just as fast as on GPUs, because the initial prompt processing is compute bound.
Also, before calling something subpar, you're going to have to tell me how you are going to put larger models like Goliath 70b or 120b models on your GPU.