(github.com)

186 points darkolorin | 1 comments | 15 Jul 25 11:29 UTC

We wrote our inference engine in Rust; it is faster than llama.cpp in all of the use cases. Your feedback is very welcome. It is written from scratch with the idea that you can add support for any kernel and platform.

smpanaro ◴[15 Jul 25 15:43 UTC] No.44572333[source]▶

>>44570048 (OP) #

In practice, how often do the models use the ANE? It sounds like you are optimizing for speed, which in my experience always favors the GPU.

replies(1): >>44572508 #

AlekseiSavin ◴[15 Jul 25 15:57 UTC] No.44572508[source]▶

>>44572333 #

You're right. Modern edge devices are powerful enough to run small models, so the real bottleneck for a forward pass is usually memory bandwidth, which sets the theoretical upper limit on inference speed. Right now, we've figured out how to run computations in a granular way on specific processing units, but we expect the real benefits to come later, when we add support for VLMs and advanced speculative decoding, where you process more than one token at a time.
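
A back-of-the-envelope sketch of the bandwidth ceiling described in this comment. It is not the engine's code. It assumes a dense model at batch size 1, where every forward pass has to read all the weights from memory once, and it ignores KV-cache and activation traffic. The function name, the parameters, and the numbers (1B parameters, fp16, 100 GB/s) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def max_tokens_per_second(param_count, bytes_per_param, bandwidth_gb_s,
                          tokens_per_pass=1.0):
    """Rough upper bound on decode speed for a memory-bound forward pass.

    Assumption: each forward pass streams all weights from memory once,
    so passes/sec <= bandwidth / weight_bytes. `tokens_per_pass` models
    schemes such as speculative decoding, where one pass can yield more
    than one accepted token on average.
    """
    weight_bytes = param_count * bytes_per_param
    passes_per_second = bandwidth_gb_s * 1e9 / weight_bytes
    return passes_per_second * tokens_per_pass


# Hypothetical device and model: 1B parameters in fp16 (2 bytes each),
# 100 GB/s of memory bandwidth.
baseline = max_tokens_per_second(1_000_000_000, 2, 100.0)
# Same setup, assuming an average of 3 accepted tokens per pass.
speculative = max_tokens_per_second(1_000_000_000, 2, 100.0, tokens_per_pass=3.0)

print(f"one token per pass:    <= {baseline:.0f} tokens/s")
print(f"three tokens per pass: <= {speculative:.0f} tokens/s")
```

Under these assumptions the ceiling is 50 tokens/s when each pass yields one token, and it scales with the number of tokens accepted per pass. That is why processing more than one token at a time helps when bandwidth, not compute, is the limit.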

replies(1): >>44572973 #

J_Shelby_J ◴[15 Jul 25 16:34 UTC] No.44572973[source]▶

>>44572508 #

VLMs = very large models?

replies(1): >>44573059 #

1. mmorse1217 ◴[15 Jul 25 16:39 UTC] No.44573059{3}[source]▶

>>44572973 #

Probably vision language models.

Show HN: We made our own inference engine for Apple Silicon