899 points georgehill | 1 comment
nivekney ◴[] No.36216106[source]
On a similar thread, how does it compare to Hippoml?

Context: https://news.ycombinator.com/item?id=36168666

replies(1): >>36216469 #
brucethemoose2 ◴[] No.36216469[source]
We don't necessarily know... Hippo is closed source for now.

It's comparable in speed to Apache TVM's Vulkan backend on CUDA; see https://github.com/mlc-ai/mlc-llm

But honestly, the biggest advantage of llama.cpp for me is being able to split a model across CPU and GPU so performantly. My puny 16GB laptop can just barely, but very practically, run LLaMA 30B at almost 3 tokens/s, and do it right now. That is crazy!
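For illustration, that kind of split can be driven from the llama-cpp-python bindings with something like the sketch below; the model filename, layer count, and prompt are placeholders rather than an actual measured config:

    # Minimal sketch: offload part of a quantized 30B model to the GPU and
    # keep the rest on the CPU, then measure rough generation throughput.
    # Assumes the llama-cpp-python bindings; all values here are illustrative.
    import time
    from llama_cpp import Llama

    llm = Llama(
        model_path="ggml-llama-30b-q4_0.bin",  # hypothetical 4-bit 30B model file
        n_gpu_layers=24,   # offload as many layers as fit in VRAM; the rest run on CPU
        n_ctx=2048,
    )

    start = time.time()
    out = llm("Write a haiku about laptops.", max_tokens=64)
    tokens = out["usage"]["completion_tokens"]
    print(f"{tokens / (time.time() - start):.1f} tokens/s")

Roughly speaking, the more layers that fit in VRAM, the less work stays on the CPU, which is where most of the tokens/s difference comes from.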

replies(1): >>36217701 #
smiley1437 ◴[] No.36217701{3}[source]
>> run LLaMA 30B at almost 3 tokens/s

Please tell me your config! I have an i9-10900 with 32GB of RAM that only gets 0.7 tokens/s on a 30B model

replies(3): >>36217877 #>>36217992 #>>36219745 #
LoganDark ◴[] No.36217877{4}[source]
> Please tell me your config! I have an i9-10900 with 32GB of RAM that only gets 0.7 tokens/s on a 30B model

Have you quantized it?

replies(1): >>36218570 #
smiley1437 ◴[] No.36218570{5}[source]
The model I have is q4_0, which I think means 4-bit quantized.

I'm running on Windows using koboldcpp; maybe it's faster on Linux?
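For context, q4_0 stores one scale per block of 32 weights plus a 4-bit integer per weight. A simplified sketch of the idea (not llama.cpp's exact implementation):

    # Back-of-the-envelope sketch of q4_0-style block quantization.
    # Simplified for illustration; the real llama.cpp code differs in details.
    import numpy as np

    def quantize_q4_0(block):
        """Map 32 float weights to one fp16 scale plus 32 4-bit integers."""
        amax = block[np.argmax(np.abs(block))]   # signed value with the largest magnitude
        d = amax / -8.0 if amax != 0 else 1.0    # scale chosen so the extreme maps to -8
        q = np.clip(np.round(block / d) + 8, 0, 15).astype(np.uint8)
        return np.float16(d), q

    def dequantize_q4_0(d, q):
        return (q.astype(np.float32) - 8) * np.float32(d)

    block = np.random.randn(32).astype(np.float32)
    d, q = quantize_q4_0(block)
    print("max abs error:", np.max(np.abs(block - dequantize_q4_0(d, q))))

That works out to roughly 4.5 bits per weight once the per-block scale is counted, which is why a 4-bit 30B model fits comfortably in 32GB of RAM.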

replies(2): >>36219174 #>>36219792 #
brucethemoose2 ◴[] No.36219792{6}[source]
I am running Linux with cuBLAS offload, and I am using the new 3-bit quant that was just pulled in a day or two ago.
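As a rough illustration of what dropping from 4-bit to 3-bit buys and costs, here is a generic block-quantization comparison (not the exact k-quant math llama.cpp uses):

    # Rough sketch: size vs. rounding-error tradeoff of 3-bit and 4-bit
    # block quantization on random weights. Not llama.cpp's actual k-quant code.
    import numpy as np

    def block_quant_rms_error(weights, bits, block=32):
        lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
        err = 0.0
        for i in range(0, len(weights), block):
            w = weights[i:i + block]
            d = np.max(np.abs(w)) / hi if np.any(w) else 1.0  # one scale per block
            q = np.clip(np.round(w / d), lo, hi)
            err += np.sum((w - q * d) ** 2)
        return np.sqrt(err / len(weights))

    w = np.random.randn(1 << 16).astype(np.float32)
    for bits in (4, 3):
        print(f"{bits}-bit: ~{bits / 8:.3f} bytes/weight (plus scales), "
              f"rms error {block_quant_rms_error(w, bits):.4f}")

The 3-bit variant shaves memory, so more layers (or a longer context) fit on the same hardware, at the price of a larger rounding error.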
replies(2): >>36220323 #>>36222560 #
smiley1437 ◴[] No.36220323{7}[source]
Thanks! I'll have to try the 3-bit quant to see if that helps