formalsystem:
I work on PyTorch, and there are many things that make me suspicious about these results. My TL;DR: unless we get a zip file of all the kernels along with how they're benchmarked, results like these are almost impossible to verify.

1. I don't have an M4, but I have an M1 Pro, and when I ran the VisionAttention example with the claimed 18x speedup I got close to identical runtimes. This example has more issues: the main optimization the LLM is doing is a fusion, so not comparing to torch.compile is a bit sus. The numerics are off as well, and I suspect the atols were way too loose. Finally, MultiHeadAttention is a deprecated API, so using neither SDPA nor torch.compile is a strange choice.
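
For reference, a fairer attention baseline on MPS is the fused SDPA op rather than nn.MultiheadAttention. A minimal sketch (the shapes here are made up, not taken from the paper):

    # Sketch of a fusion-aware attention baseline on the MPS backend.
    import torch
    import torch.nn.functional as F

    # (batch, heads, seq_len, head_dim) -- illustrative shapes only
    q, k, v = (torch.randn(8, 16, 128, 64, device="mps") for _ in range(3))

    # Fused attention kernel; this (or a torch.compile'd module, where supported)
    # is what a hand-written Metal attention kernel should be measured against.
    out = F.scaled_dot_product_attention(q, k, v)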

2. In general, 18x (and the ~100x speedups claimed near the end) is a smell that some kernel is incorrect. The typical ways you get speedups like this are that you don't warm up or you forget to synchronize. PyTorch has a lot of benchmarking footguns, which is why sharing the exact eval scripts is helpful.
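
For what it's worth, a minimal timing harness that avoids those two footguns on the MPS backend might look like this (a sketch, not the authors' actual script):

    import time
    import torch

    def bench(fn, *args, warmup=10, iters=100):
        for _ in range(warmup):            # warmup: caching, lazy init, compilation
            fn(*args)
        torch.mps.synchronize()            # drain queued work before starting the clock
        t0 = time.perf_counter()
        for _ in range(iters):
            fn(*args)
        torch.mps.synchronize()            # wait for the last kernel to actually finish
        return (time.perf_counter() - t0) / iters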

3. Speaking of footguns, the shapes I saw in the examples were tiny. In that regime you're mostly measuring noise, because the primary bottleneck is not compute or memory bandwidth but launch and dispatch overhead.
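
As an illustration, timing the same op at a tiny shape and a large one (reusing the bench() sketch above) shows how little a small-shape number says about the kernel itself:

    # At tiny shapes the per-call time is dominated by dispatch overhead,
    # so it barely moves with problem size; the large shape is what matters.
    small = torch.randn(32, 32, device="mps")
    large = torch.randn(4096, 4096, device="mps")
    print(bench(torch.relu, small), bench(torch.relu, large))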

4. Generating many random shapes is also not so safe, because some input distributions make certain kernels trivial to fake. For example, torch.randn() by default samples from a normal distribution with mean 0 and variance 1, so the mean of a large vector is almost exactly 0, and a wrong kernel can still pass if your tolerance is too loose.
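
A concrete toy version of this failure mode:

    # A "kernel" that just returns 0 passes a loose-tolerance check against the
    # true mean of a large randn() vector, whose std is ~1/sqrt(N) ~ 1e-3 here.
    import torch

    x = torch.randn(1_000_000)
    ref = x.mean()                               # on the order of 1e-3
    fake = torch.zeros(())                       # trivially wrong kernel output
    print(torch.allclose(fake, ref, atol=1e-2))  # True almost every time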

5. KernelBench levels measure vastly different things. If you want to compare against PyTorch operators you want to focus on Level 1; Level 2 is fusions, so the right baseline there is torch.compile, which is more reliable on nightlies. The Mamba 2 example (which I didn't run) also acknowledges that the primary thing it does is fusions, which, assuming everything is correct, would still be a strange thing to baseline against eager.
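
For a Level 2-style fused workload, the fair comparison is roughly this (a sketch; torch.compile support on MPS is its own caveat, as the reply below notes):

    # A candidate "fused kernel" should be benchmarked against torch.compile,
    # which already fuses the pointwise ops, not against eager mode.
    import torch

    def block(x, w):
        return torch.relu(x @ w) * 2.0        # matmul + relu + scale: a fusion target

    eager_baseline = block
    compiled_baseline = torch.compile(block)  # the honest baseline for fusions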

So please, for everyone's sanity: if you find a kernel that's 10-100x faster, share the exact code and benchmarking methodology with your smartest performance friends. You should be extremely skeptical of such results; often you can discard numbers outright with a simple speed-of-light analysis. We all desperately want faster kernels, but to get them we have to be really fanatical about correctness.
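
A speed-of-light sanity check can be as simple as this back-of-the-envelope sketch (the bandwidth figure below is an assumed round number, not a measured one):

    # A memory-bound op cannot finish faster than bytes_moved / memory_bandwidth.
    n = 4096 * 4096
    bytes_moved = 3 * n * 4                  # read two fp32 buffers, write one
    bandwidth_bytes_per_s = 400e9            # ~400 GB/s, assumed M-series figure
    floor_ms = bytes_moved / bandwidth_bytes_per_s * 1e3
    print(f"lower bound ~ {floor_ms:.3f} ms")  # any claimed time below this is suspect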

nserrino replies:
Hey, thanks for the thoughtful comments. A lot of big claims have been made in this area, so skepticism is the right default reaction. TL;DR: we agree that we should provide the kernels and benchmark suite so others can evaluate this, and we'll follow up with that.

A few clarifications:

1. Baselines - We didn't compare to torch.compile because, as of PyTorch 2.7, torch.compile doesn't support the MPS backend, and we ran into issues on many of the problems when using it. GitHub issue: https://github.com/pytorch/pytorch/issues/150121. Once it's supported, it will be the obvious baseline.

2. Methodology - We followed KernelBench’s protocol to establish a baseline on Metal, adding more correctness checks. Warmup and synchronization were done. We recognize the limitations here and are expanding the validation suite.

3. Optimizations - Right now most of the optimizations are fusions, but there is some use of Metal-specific primitives/optimizations. We expect that as we make the supervisor more sophisticated, the novelty of the optimized kernels will also increase.

Overall, the goal here is to get some fraction of the benefit of a human expert in kernel engineering without the developer effort. Compiler-based optimizations are great, but hand-tuned implementations are still common for performance-critical models. The hope is that we can automate some of that process.