186 points nserrino | 10 comments
1. turbo_wombat ◴[] No.45119741[source]
They are comparing unoptimized PyTorch inference, something you would never deploy on a device, to a model with custom kernels.

Yes, of course the model with custom kernels is faster, whether it's written by a human or an AI.

Generally, PyTorch inference is meant to be used during the training process, and when running metrics, not when deploying. When deployed, you should export to ONNX, and then compile the ONNX to the native format of the device.

If you aren't familiar with the pipeline for ML deployment, this is the equivalent of comparing interpreted code to compiled code.
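
For anyone who hasn't seen that pipeline, here's a rough sketch of the export step (the model, names, and shapes are placeholders, and onnxruntime is just one possible backend; a device-specific compiler would slot in where the runtime is used):

    import torch
    import onnxruntime as ort

    # placeholder eager-mode model and a sample input
    model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU()).eval()
    example = torch.randn(1, 128)

    # export the traced graph to ONNX
    torch.onnx.export(model, example, "model.onnx",
                      input_names=["x"], output_names=["y"])

    # run it with an ONNX runtime (or hand the .onnx file to a device compiler instead)
    sess = ort.InferenceSession("model.onnx")
    out = sess.run(None, {"x": example.numpy()})[0]
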

replies(7): >>45119755 #>>45120488 #>>45120646 #>>45121096 #>>45121128 #>>45121957 #>>45132362 #
2. ◴[] No.45119755[source]
3. nserrino ◴[] No.45120488[source]
PyTorch is the baseline because that's what people prototype in, and the most common reference point. The aim here is to show that you can start from prototype code and automatically produce lower-level kernels (in this case Metal) that are more usable in real deployments, without additional work from the developer. Frontier models are capable of generating efficient Metal kernels automatically and immediately, and will only get better. We expect to see significant improvements as we refine the approach, but it's enough to show that this seems to be a tractable problem for AI.
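To make "prototype code" concrete, here's roughly the kind of eager PyTorch you'd start from (the op, sizes, and timing harness are illustrative only, not the actual benchmark; it assumes a macOS build with the MPS backend):

    import time
    import torch

    # illustrative eager-mode op on the Apple GPU backend
    assert torch.backends.mps.is_available()
    x = torch.randn(4096, 4096, device="mps")

    torch.mps.synchronize()            # don't let async dispatch hide the work
    t0 = time.perf_counter()
    y = torch.nn.functional.softmax(x, dim=-1)
    torch.mps.synchronize()
    print(f"eager softmax: {time.perf_counter() - t0:.4f}s")
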
4. CapsAdmin ◴[] No.45120646[source]
I have never really worked with PyTorch professionally, but it feels to me like a lot of open source projects, especially generative ones, just use PyTorch like this. It makes hacking on the models a whole lot easier.

ComfyUI is a good example of a project like this.

5. ◴[] No.45121096[source]
6. airforce1 ◴[] No.45121128[source]
> and then compile the ONNX to the native format of the device.

I'm assuming you are talking about https://github.com/onnx/onnx-mlir?

In your experience, how much faster is a "compiled" onnx model vs. using an onnx runtime?

replies(1): >>45121510 #
7. dapperdrake ◴[] No.45121510[source]
For other people reading this:

Back in the day, TensorFlow had tfdeploy, which compiled TensorFlow graphs into NumPy matrix operations. Our synthetic tests saw speedups of a factor of 50.
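
To give a feel for what "compiled into NumPy" means, here's a hand-written illustration (not the actual tfdeploy API; the weights and shapes are made up) of what a small dense layer collapses into once the graph and session machinery are gone:

    import numpy as np

    # weights would be extracted once from the trained graph; random here for illustration
    W = np.random.randn(128, 64).astype(np.float32)
    b = np.zeros(64, dtype=np.float32)

    def dense_relu(x):
        # the whole "compiled" layer is just a matmul, an add, and a clamp
        return np.maximum(x @ W + b, 0.0)

    out = dense_relu(np.random.randn(32, 128).astype(np.float32))
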

8. yieldcrv ◴[] No.45121957[source]
> Yes, of course the model with custom kernels is faster, whether it's written by a human or an AI.

But that’s the thing, I wouldn’t write a custom kernel before AI

I don't do that level of development or operate at that part of the stack, but I'm very experienced in software development

AI significantly augments my skillsets in this area

replies(1): >>45124047 #
9. am17an ◴[] No.45124047[source]
The point is that those kernels already exist; you can just use them off the shelf. If you're trying to write a production-grade kernel without operating at that part of the stack... well, good luck with that.
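As a concrete example of "off the shelf": recent PyTorch ships a fused attention entry point you can call directly instead of writing your own kernel (a rough sketch; the shapes are arbitrary and whether a fused backend is used depends on your build and device):

    import torch
    import torch.nn.functional as F

    device = "cuda" if torch.cuda.is_available() else "cpu"
    q = torch.randn(1, 8, 1024, 64, device=device)
    k = torch.randn(1, 8, 1024, 64, device=device)
    v = torch.randn(1, 8, 1024, 64, device=device)

    # dispatches to a fused implementation (e.g. FlashAttention) when one is available
    out = F.scaled_dot_product_attention(q, k, v)
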
replies(1): >>45128199 #
10. spott ◴[] No.45132362[source]
vLLM is an LLM serving framework written using raw PyTorch.

ONNX doesn’t support a bunch of operations that PyTorch does (it isn’t always possible to convert a PyTorch model to ONNX).

TorchServe runs raw PyTorch.

Generally speaking, PyTorch is pretty well optimized. The Mac has historically been ignored, so the MPS kernels were all missing or just bad, but on CUDA and Linux they are pretty good.
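
If you want to check what your own build does on a Mac, a quick sketch (the fallback environment variable is optional and just routes ops without an MPS kernel to the CPU instead of erroring):

    import torch

    # False on non-macOS builds
    print(torch.backends.mps.is_built(), torch.backends.mps.is_available())

    if torch.backends.mps.is_available():
        x = torch.randn(1024, 1024, device="mps")
        # ops lacking an MPS kernel raise unless PYTORCH_ENABLE_MPS_FALLBACK=1 is set,
        # in which case they silently fall back to the CPU
        y = (x @ x).relu()
        torch.mps.synchronize()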