GGML – AI at the Edge

(ggml.ai)

Show context

nivekney ◴[06 Jun 23 17:21 UTC] No.36216106[source]▶

>>36215651 (OP) #

On a similar thread, how does it compare to Hippoml?

Context: https://news.ycombinator.com/item?id=36168666

replies(1): >>36216469 #

1. brucethemoose2 ◴[06 Jun 23 17:45 UTC] No.36216469[source]▶

>>36216106 #

We don't necessarily know... Hippo is closed source for now.

Its comparable to Apache TVM's vulkan in speed on cuda, see https://github.com/mlc-ai/mlc-llm

But honestly, the biggest advantage of llama.cpp for me is being able to split a model so performantly. My puny 16GB laptop can just barely, but very practically, run LLaMA 30B at almost 3 tokens/s, and do it right now. That is crazy!

replies(1): >>36217701 #

2. smiley1437 ◴[06 Jun 23 19:14 UTC] No.36217701[source]▶

>>36216469 (TP) #

>> run LLaMA 30B at almost 3 tokens/s

Please tell me your config! I have an i9-10900 with 32GB of ram that only gets .7 tokens/s on a 30B model

replies(3): >>36217877 #>>36217992 #>>36219745 #

3. LoganDark ◴[06 Jun 23 19:26 UTC] No.36217877[source]▶

>>36217701 #

> Please tell me your config! I have an i9-10900 with 32GB of ram that only gets .7 tokens/s on a 30B model

Have you quantized it?

replies(1): >>36218570 #

4. oceanplexian ◴[06 Jun 23 19:34 UTC] No.36217992[source]▶

>>36217701 #

With a single NVIDIA 3090 and the fastest inference branch of GPTQ-for-LLAMA https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/fastest-i..., I get a healthy 10-15 tokens per second on the 30B models. IMO GGML is great (And I totally use it) but it's still not as fast as running the models on GPU for now.

replies(2): >>36219157 #>>36219874 #

5. smiley1437 ◴[06 Jun 23 20:19 UTC] No.36218570{3}[source]▶

>>36217877 #

The model I have is q4_0 I think that's 4 bit quantized

I'm running in Windows using koboldcpp, maybe it's faster in Linux?

replies(2): >>36219174 #>>36219792 #

6. LoganDark ◴[06 Jun 23 21:10 UTC] No.36219157{3}[source]▶

>>36217992 #

> IMO GGML is great (And I totally use it) but it's still not as fast as running the models on GPU for now.

I think it was originally designed to be easily embeddable—and most importantly, native code (i.e. not Python)—rather than competitive with GPUs.

I think it's just starting to get into GPU support now, but carefully.

7. LoganDark ◴[06 Jun 23 21:11 UTC] No.36219174{4}[source]▶