Show HN: We made our own inference engine for Apple Silicon

(github.com)

186 points darkolorin | 1 comment | 15 Jul 25 11:29 UTC

We wrote our inference engine in Rust; it is faster than llama.cpp in all use cases. Your feedback is very welcome. It is written from scratch with the idea that you can add support for any kernel and platform.

giancarlostoro ◴[15 Jul 25 18:26 UTC] No.44574335[source]▶

>>44570048 (OP) #

Hoping the author can answer; I'm still learning about how this all works. My understanding is that inference is "using the model", so to speak. How is this faster than established inference engines, specifically on Mac? Are models generic enough that if you built, e.g., an inference engine focused on AMD GPUs or even Intel GPUs, it would achieve reasonable performance? I always assumed that because Nvidia is king of AI you had to suck it up, or is it just that most inference engines in use are married to Nvidia?

I would love to understand how universal these models can become.

replies(1): >>44575711 #

1. darkolorin ◴[15 Jul 25 20:54 UTC] No.44575711[source]▶

Basically, “faster” means better performance, e.g. tokens/s, without losing quality (benchmark scores for models). So when we say faster, we mean we produce more tokens per second than llama.cpp. That means we make effective use of the available hardware APIs (for example, we wrote our own kernels) to perform better.
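
For readers new to the metric, here is a minimal sketch of how tokens/s could be measured around a generation call. This is not the engine's actual benchmark code: `tokens_per_second` and `measure_throughput` are hypothetical helper names, and the closure stands in for whatever generation call an engine exposes.

```rust
use std::time::{Duration, Instant};

/// Throughput in tokens per second for `tokens` produced over `elapsed`.
/// Returns 0.0 when no time has elapsed, to avoid dividing by zero.
fn tokens_per_second(tokens: usize, elapsed: Duration) -> f64 {
    let secs = elapsed.as_secs_f64();
    if secs == 0.0 {
        0.0
    } else {
        tokens as f64 / secs
    }
}

/// Times a generation closure (hypothetical stand-in for an engine call)
/// that returns the number of tokens it produced, and reports tokens/s.
fn measure_throughput<F: FnOnce() -> usize>(generate: F) -> f64 {
    let start = Instant::now();
    let tokens = generate();
    tokens_per_second(tokens, start.elapsed())
}
```

For a fair comparison between two engines you would run the same model, prompt, and output length on the same machine, and separately check that quality (benchmark scores) has not dropped.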