
186 points by darkolorin | 1 comment

We wrote our inference engine in Rust; it is faster than llama.cpp in all of our use cases. Your feedback is very welcome. It is written from scratch with the idea that you can add support for any kernel and any platform.
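
To make the "any kernel, any platform" idea concrete, here is a minimal Rust sketch of one way such a pluggable design can look. The trait and type names are illustrative assumptions, not the engine's actual API:

    // Hypothetical sketch of a pluggable-kernel design; names are
    // assumptions for illustration, not this engine's real API.

    /// A compute kernel that a platform backend can implement.
    trait Kernel {
        fn name(&self) -> &'static str;
        /// Runs the kernel over raw f32 buffers.
        fn run(&self, input: &[f32], output: &mut [f32]);
    }

    /// A platform backend (CPU, Metal, CUDA, ...) holding its kernels.
    struct Backend {
        platform: &'static str,
        kernels: Vec<Box<dyn Kernel>>,
    }

    impl Backend {
        fn new(platform: &'static str) -> Self {
            Self { platform, kernels: Vec::new() }
        }
        /// Adding a new kernel is just registering an implementation.
        fn register(&mut self, k: Box<dyn Kernel>) {
            self.kernels.push(k);
        }
    }

    /// Example kernel: SiLU activation on the CPU.
    struct CpuSilu;

    impl Kernel for CpuSilu {
        fn name(&self) -> &'static str { "silu" }
        fn run(&self, input: &[f32], output: &mut [f32]) {
            // SiLU(x) = x * sigmoid(x) = x / (1 + e^-x)
            for (o, &x) in output.iter_mut().zip(input.iter()) {
                *o = x / (1.0 + (-x).exp());
            }
        }
    }

    fn main() {
        let mut cpu = Backend::new("cpu");
        cpu.register(Box::new(CpuSilu));
        let input = [1.0_f32, -1.0, 0.5];
        let mut output = [0.0_f32; 3];
        cpu.kernels[0].run(&input, &mut output);
        println!("{} on {}: {:?}", cpu.kernels[0].name(), cpu.platform, output);
    }

Swapping in a Metal or CUDA backend would then mean implementing the same trait against that platform's buffers, without touching the rest of the engine.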
nodesocket No.44574806
I just spun up an AWS EC2 g6.xlarge instance to do some LLM work. The GPU is an NVIDIA L4 with 24GB, and it costs $0.8048 per hour. I'm starting to think about switching to an Apple mac2-m2.metal instance at $0.878 per hour. The big question is that the Mac instance only has 24GB of unified memory.
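
For scale, a quick back-of-envelope of what those hourly rates add up to per month (730 hours, the usual cloud approximation), using only the prices quoted above:

    // Monthly cost comparison from the hourly prices quoted above.
    fn main() {
        const HOURS_PER_MONTH: f64 = 730.0;

        let g6_xlarge = 0.8048; // USD/hour, NVIDIA L4 24GB
        let mac2_m2 = 0.878;    // USD/hour, Apple M2, 24GB unified memory

        println!("g6.xlarge:     ${:.2}/month", g6_xlarge * HOURS_PER_MONTH);
        println!("mac2-m2.metal: ${:.2}/month", mac2_m2 * HOURS_PER_MONTH);
        println!(
            "difference:    ${:.2}/month",
            (mac2_m2 - g6_xlarge) * HOURS_PER_MONTH
        );
    }

That comes to roughly $587 vs. $641 per month, a difference of about $53.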
replies(1): >>44576572 #
1. khurs No.44576572
Unified memory doesn't compare to an Nvidia GPU; the latter is much better.

It just depends on what performance level you need.
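
One way to put rough numbers on that: single-token LLM decoding is typically memory-bandwidth-bound, so peak tokens/s is roughly memory bandwidth divided by model size. A minimal sketch assuming the published bandwidth specs (L4: ~300 GB/s; base M2: ~100 GB/s) and a hypothetical 4 GB quantized model:

    // Rough decode-speed ceiling: tokens/s ~= bandwidth / model size.
    // Bandwidth figures are published specs; the model size is an
    // assumption (e.g. a 7B model at ~4-bit quantization).
    fn main() {
        let model_gb = 4.0;

        for (chip, bandwidth_gb_s) in [("NVIDIA L4", 300.0), ("Apple M2", 100.0)] {
            let tokens_per_s = bandwidth_gb_s / model_gb;
            println!("{chip}: ~{tokens_per_s:.0} tokens/s ceiling");
        }
    }

Real throughput lands well below these ceilings, but the ratio (~75 vs. ~25 tokens/s here) is why a discrete GPU usually wins on decode speed even when the memory capacity is the same.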