
899 points by georgehill | 3 comments
nivekney No.36216106
On a similar thread, how does it compare to Hippoml?

Context: https://news.ycombinator.com/item?id=36168666

replies(1): >>36216469 #
brucethemoose2 No.36216469
We don't necessarily know... Hippo is closed source for now.

It's comparable to Apache TVM's Vulkan in speed on CUDA; see https://github.com/mlc-ai/mlc-llm

But honestly, the biggest advantage of llama.cpp for me is being able to split a model between CPU and GPU so performantly. My puny 16GB laptop can just barely, but very practically, run LLaMA 30B at almost 3 tokens/s, and do it right now. That is crazy!
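
For a concrete picture of that kind of split, here is a minimal sketch using the llama-cpp-python bindings; the model path, layer count, and thread count are illustrative assumptions, not a verified config:

    # Minimal sketch: load a 4-bit quantized 30B model and offload part of it to the GPU.
    # The path, layer count, and thread count are placeholders, not a known-good setup.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./models/llama-30b.q4_0.bin",  # hypothetical quantized model file
        n_gpu_layers=17,   # layers pushed to VRAM; the rest run on the CPU from system RAM
        n_ctx=2048,        # context window
        n_threads=8,       # CPU threads for the non-offloaded layers
    )

    out = llm("Q: What does offloading layers do? A:", max_tokens=64)
    print(out["choices"][0]["text"])

The point is that n_gpu_layers controls how many transformer layers live in VRAM, while the remainder is evaluated on the CPU out of system RAM.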

replies(1): >>36217701 #
smiley1437 No.36217701
>> run LLaMA 30B at almost 3 tokens/s

Please tell me your config! I have an i9-10900 with 32GB of RAM that only gets 0.7 tokens/s on a 30B model

replies(3): >>36217877 #>>36217992 #>>36219745 #
1. brucethemoose2 No.36219745
I'm on a Ryzen 4900HS laptop with an RTX 2060.

Like I said, very modest

replies(1): >>36220337 #
2. smiley1437 No.36220337
Are you offloading layers to the RTX 2060?
replies(1): >>36221349 #
3. brucethemoose2 No.36221349
Some of them, yeah. 17 layers, IIRC.
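
As a rough back-of-the-envelope for why a split like that fits a 6GB card; the file size and layer count below are assumptions, not figures from the thread:

    # Back-of-the-envelope, assuming a ~19.5 GB q4_0 30B file and 60 transformer layers
    # (both assumptions, not figures from the thread).
    model_size_gb = 19.5    # assumed size of the quantized weights on disk
    n_layers = 60           # assumed LLaMA 30B layer count
    offloaded = 17          # layers offloaded to the GPU, per the comment above
    vram_gb = model_size_gb / n_layers * offloaded
    print(f"~{vram_gb:.1f} GB of weights in VRAM")  # ~5.5 GB, leaving some headroom on a 6 GB RTX 2060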