899 points georgehill | 1 comment
nivekney ◴[] No.36216106[source]
On a similar thread, how does it compare to Hippoml?

Context: https://news.ycombinator.com/item?id=36168666

replies(1): >>36216469 #
brucethemoose2 ◴[] No.36216469[source]
We don't necessarily know... Hippo is closed source for now.

It's comparable in speed to Apache TVM's Vulkan backend on CUDA; see https://github.com/mlc-ai/mlc-llm

But honestly, the biggest advantage of llama.cpp for me is being able to split a model across CPU and GPU so performantly. My puny 16GB laptop can just barely, but very practically, run LLaMA 30B at almost 3 tokens/s, and do it right now. That is crazy!
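For illustration, that kind of split can be driven from the llama-cpp-python bindings with something like the sketch below; the model filename, layer count, and prompt are placeholders rather than an actual measured config:

    # Minimal sketch: offload part of a quantized 30B model to the GPU and
    # keep the rest on the CPU, then measure rough generation throughput.
    # Assumes the llama-cpp-python bindings; all values here are illustrative.
    import time
    from llama_cpp import Llama

    llm = Llama(
        model_path="ggml-llama-30b-q4_0.bin",  # hypothetical 4-bit 30B model file
        n_gpu_layers=24,   # offload as many layers as fit in VRAM; the rest run on CPU
        n_ctx=2048,
    )

    start = time.time()
    out = llm("Write a haiku about laptops.", max_tokens=64)
    tokens = out["usage"]["completion_tokens"]
    print(f"{tokens / (time.time() - start):.1f} tokens/s")

Roughly speaking, the more layers that fit in VRAM, the less work stays on the CPU, which is where most of the tokens/s difference comes from.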

replies(1): >>36217701 #
smiley1437 ◴[] No.36217701{3}[source]
>> run LLaMA 30B at almost 3 tokens/s

Please tell me your config! I have an i9-10900 with 32GB of RAM that only gets 0.7 tokens/s on a 30B model

replies(3): >>36217877 #>>36217992 #>>36219745 #
LoganDark ◴[] No.36217877{4}[source]
> Please tell me your config! I have an i9-10900 with 32GB of RAM that only gets 0.7 tokens/s on a 30B model

Have you quantized it?

replies(1): >>36218570 #
smiley1437 ◴[] No.36218570{5}[source]
The model I have is q4_0, which I think means 4-bit quantized.

I'm running on Windows using koboldcpp; maybe it's faster on Linux?
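For context, q4_0 stores one scale per block of 32 weights plus a 4-bit integer per weight. A simplified sketch of the idea (not llama.cpp's exact implementation):

    # Back-of-the-envelope sketch of q4_0-style block quantization.
    # Simplified for illustration; the real llama.cpp code differs in details.
    import numpy as np

    def quantize_q4_0(block):
        """Map 32 float weights to one fp16 scale plus 32 4-bit integers."""
        amax = block[np.argmax(np.abs(block))]   # signed value with the largest magnitude
        d = amax / -8.0 if amax != 0 else 1.0    # scale chosen so the extreme maps to -8
        q = np.clip(np.round(block / d) + 8, 0, 15).astype(np.uint8)
        return np.float16(d), q

    def dequantize_q4_0(d, q):
        return (q.astype(np.float32) - 8) * np.float32(d)

    block = np.random.randn(32).astype(np.float32)
    d, q = quantize_q4_0(block)
    print("max abs error:", np.max(np.abs(block - dequantize_q4_0(d, q))))

That works out to roughly 4.5 bits per weight once the per-block scale is counted, which is why a 4-bit 30B model fits comfortably in 32GB of RAM.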

replies(2): >>36219174 #>>36219792 #
brucethemoose2 ◴[] No.36219792{6}[source]
I am running Linux with cuBLAS offload, and I am using the new 3-bit quant that was just pulled in a day or two ago.
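As a rough illustration of what dropping from 4-bit to 3-bit buys and costs, here is a generic block-quantization comparison (not the exact k-quant math llama.cpp uses):

    # Rough sketch: size vs. rounding-error tradeoff of 3-bit and 4-bit
    # block quantization on random weights. Not llama.cpp's actual k-quant code.
    import numpy as np

    def block_quant_rms_error(weights, bits, block=32):
        lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
        err = 0.0
        for i in range(0, len(weights), block):
            w = weights[i:i + block]
            d = np.max(np.abs(w)) / hi if np.any(w) else 1.0  # one scale per block
            q = np.clip(np.round(w / d), lo, hi)
            err += np.sum((w - q * d) ** 2)
        return np.sqrt(err / len(weights))

    w = np.random.randn(1 << 16).astype(np.float32)
    for bits in (4, 3):
        print(f"{bits}-bit: ~{bits / 8:.3f} bytes/weight (plus scales), "
              f"rms error {block_quant_rms_error(w, bits):.4f}")

The 3-bit variant shaves memory, so more layers (or a longer context) fit on the same hardware, at the price of a larger rounding error.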
replies(2): >>36220323 #>>36222560 #
smiley1437 ◴[] No.36220323{7}[source]
Thanks! I'll have to try the 3-bit quant to see if that helps