186 points darkolorin | 2 comments

We wrote our inference engine in Rust; it is faster than llama.cpp in all of the use cases. Your feedback is very welcome. It is written from scratch with the idea that you can add support for any kernel and platform.
1. giancarlostoro ◴[] No.44574335[source]
Hoping the author can answer; I'm still learning about how this all works. My understanding is that inference is "using the model", so to speak. How is this faster than established inference engines, specifically on Mac? Are models generic enough that an inference engine focused on, e.g., AMD GPUs or even Intel GPUs would achieve reasonable performance? I always assumed that because Nvidia is king of AI you had to suck it up, or is it just that most inference engines being used are married to Nvidia?

I would love to understand how universal these models can become.

replies(1): >>44575711 #
2. darkolorin ◴[] No.44575711[source]
Basically, “faster” means better performance, e.g. tokens/s, without losing quality (benchmark scores for models). So when we say faster, we mean we provide more tokens per second than llama.cpp. That means we effectively utilize the hardware APIs available (for example, we wrote our own kernels) to perform better.
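
For readers new to the metric, tokens/s is just the number of generated tokens divided by wall-clock time. Here is a minimal sketch in Rust of how it could be measured. This is not the engine's actual benchmark code: `tokens_per_second` and `measure_decode` are hypothetical helper names, and the `step` closure is an assumed stand-in for one decode step of whatever engine is being timed.

```rust
use std::time::{Duration, Instant};

/// Throughput in tokens per second for `tokens` generated over `elapsed`.
fn tokens_per_second(tokens: usize, elapsed: Duration) -> f64 {
    let secs = elapsed.as_secs_f64();
    if secs == 0.0 {
        return 0.0; // avoid dividing by zero for an empty measurement
    }
    tokens as f64 / secs
}

/// Calls `step` once per token, `n_tokens` times, and returns
/// (tokens/s, total elapsed time).
/// `step` is a hypothetical placeholder for one decode step of an engine.
fn measure_decode<F: FnMut()>(n_tokens: usize, mut step: F) -> (f64, Duration) {
    let start = Instant::now();
    for _ in 0..n_tokens {
        step();
    }
    let elapsed = start.elapsed();
    (tokens_per_second(n_tokens, elapsed), elapsed)
}
```

For example, 128 tokens generated in 4 seconds is 32 tokens/s. A comparison between two engines is only meaningful when both run the same model, quantization, prompt, and hardware, and when quality is checked separately with benchmark scores, as described above.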