
186 points darkolorin | 2 comments

We wrote our inference engine in Rust; it is faster than llama.cpp in all of our use cases. Your feedback is very welcome. It was written from scratch with the idea that you can add support for any kernel and platform.
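
The submission describes a design where new kernels and platforms can be plugged in. As a rough illustration of how such an abstraction often looks in Rust (a hypothetical sketch only; `Backend`, `CpuBackend`, and `Engine` are made-up names, not the project's actual API):

    // Hypothetical sketch of a pluggable-backend design; none of these
    // names come from the project itself.

    /// A compute backend: one implementation per platform (CPU, Metal, CUDA, ...).
    trait Backend {
        fn name(&self) -> &str;
        /// Single matrix-vector multiply kernel, the core op of token generation.
        fn matvec(&self, weights: &[f32], x: &[f32], rows: usize, cols: usize) -> Vec<f32>;
    }

    /// A plain CPU reference implementation.
    struct CpuBackend;

    impl Backend for CpuBackend {
        fn name(&self) -> &str {
            "cpu"
        }

        fn matvec(&self, weights: &[f32], x: &[f32], rows: usize, cols: usize) -> Vec<f32> {
            assert_eq!(weights.len(), rows * cols);
            assert_eq!(x.len(), cols);
            (0..rows)
                .map(|r| {
                    weights[r * cols..(r + 1) * cols]
                        .iter()
                        .zip(x)
                        .map(|(w, xi)| w * xi)
                        .sum()
                })
                .collect()
        }
    }

    /// The engine only talks to the trait, so adding a new platform means
    /// adding one more `Backend` implementation, not touching the inference loop.
    struct Engine {
        backend: Box<dyn Backend>,
    }

    impl Engine {
        fn new(backend: Box<dyn Backend>) -> Self {
            Self { backend }
        }

        fn forward(&self, weights: &[f32], x: &[f32], rows: usize, cols: usize) -> Vec<f32> {
            self.backend.matvec(weights, x, rows, cols)
        }
    }

    fn main() {
        let engine = Engine::new(Box::new(CpuBackend));
        // 2x3 weight matrix times a length-3 input vector.
        let weights = [1.0, 0.0, 1.0, 0.0, 1.0, 1.0];
        let x = [1.0, 2.0, 3.0];
        let y = engine.forward(&weights, &x, 2, 3);
        println!("backend={} output={:?}", engine.backend.name(), y);
    }

Under this kind of design, a Metal or CUDA port would replace only the `matvec` (and similar kernel) implementations behind the trait, which is one plausible reading of "add support for any kernel and platform."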
1. zdw No.44572724
How does this bench compare to MLX?
replies(1): >>44573093
2. jasonjmcghee No.44573093
I use MLX in LM Studio and it doesn't have whatever issues llama.cpp is showing here.

Qwen3-0.6B at 5 t/s doesn't make any sense. Something is clearly wrong for that specific model.