
186 points | by nserrino | 1 comment
turbo_wombat:
They are comparing unoptimized PyTorch inference, something you would never deploy on a device, to a model with custom kernels.

Yes, of course the model with custom kernels is faster, whether it's written by a human or an AI.

Generally, PyTorch inference is meant for use during training and when running evaluation metrics, not for deployment. When deploying, you should export to ONNX and then compile the ONNX graph to the device's native format.
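A minimal sketch of that export path (the ResNet-18 stand-in model, the "model.onnx" file name, and onnxruntime on the consuming side are my assumptions, not anything specific to the article):

    import numpy as np
    import torch
    import torchvision
    import onnxruntime as ort

    # Stand-in model; in practice this would be whatever you trained.
    model = torchvision.models.resnet18(weights=None).eval()
    dummy = torch.randn(1, 3, 224, 224)

    # Export the eager-mode PyTorch model to an ONNX graph.
    torch.onnx.export(
        model, dummy, "model.onnx",
        input_names=["input"], output_names=["logits"],
        opset_version=17,
    )

    # On the target, run through an optimized runtime instead of eager PyTorch.
    session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
    x = np.random.rand(1, 3, 224, 224).astype(np.float32)
    logits = session.run(None, {"input": x})[0]
    print(logits.shape)  # (1, 1000)

From there, vendor toolchains (TensorRT, Core ML tools, etc.) can compile the ONNX graph further into the device's native format.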

If you aren't familiar with the pipeline for ML deployment, this is the equivalent of comparing interpreted code to compiled code.

yieldcrv:
> Yes, of course the model with custom kernels is faster, whether it's written by a human or an AI.

But that's the thing: I wouldn't have written a custom kernel before AI.

I don't do that level of development or operate at that part of the stack, but I'm very experienced in software development.

AI significantly augments my skill set in this area.

am17an:
The point is that those kernels already exist; you can just use them off the shelf. In the case where you're trying to write a production-grade kernel without operating at that part of the stack... well, good luck with that.
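For example (my illustration, not from the article): PyTorch already ships fused attention kernels behind torch.nn.functional.scaled_dot_product_attention, so you get fused-kernel performance without writing a line of CUDA:

    import torch
    import torch.nn.functional as F

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    # Q/K/V shaped (batch, heads, seq_len, head_dim).
    q = torch.randn(2, 8, 1024, 64, device=device, dtype=dtype)
    k = torch.randn_like(q)
    v = torch.randn_like(q)

    # Dispatches to a fused kernel (FlashAttention or memory-efficient
    # attention) when shapes, dtype, and hardware allow it; otherwise
    # falls back to the plain math implementation.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    print(out.shape)  # torch.Size([2, 8, 1024, 64])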