    899 points georgehill | 11 comments
    1. KronisLV ◴[] No.36218585[source]
    Just today, I finished a blog post (also my latest submission, which felt like it could be useful to some) about how to get something like this working as a bundle: something to run the models, plus a web UI for easier interaction. In my case that was koboldcpp, which can run GGML models both on the CPU (with OpenBLAS) and on the GPU (with CLBlast). Thanks to Hugging Face, getting Metharme, WizardLM or other models is also extremely easy, and the 4-bit quantized ones provide decent performance even on commodity hardware!
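
    Roughly, the whole bundle boils down to something like the sketch below - the Hugging Face repo id, the file name and the koboldcpp flags are illustrative placeholders rather than exact values, so check the model card and koboldcpp's help output for the real ones:

      # Sketch: fetch a 4-bit GGML model from Hugging Face and start koboldcpp.
      import subprocess
      from huggingface_hub import hf_hub_download  # pip install huggingface_hub

      model_path = hf_hub_download(
          repo_id="TheBloke/WizardLM-7B-uncensored-GGML",    # placeholder repo id
          filename="WizardLM-7B-uncensored.ggmlv3.q4_0.bin"  # placeholder 4-bit file
      )

      # Launch the koboldcpp server; --useclblast offloads to the GPU via CLBlast,
      # otherwise it falls back to CPU inference (OpenBLAS).
      subprocess.run([
          "python", "koboldcpp.py", model_path,
          "--threads", "6",
          "--useclblast", "0", "0",
          "--port", "5001",
      ])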

    I tested it out both locally (6c/12t CPU) and on a Hetzner CPX41 instance (8 AMD cores, 16 GB of RAM, no GPU). The latter costs about 25 EUR per month and can still generate decent responses in under half a minute; my local machine needs roughly double that time. While not quite as good as one might expect (decent response times mean maxing out the CPU for a single request if you don't have a compatible GPU with enough VRAM), the technology is definitely at a point where it can make people's lives easier in select use cases with some supervision (e.g. customer support).

    What an interesting time to be alive, I wonder where we'll be in a decade.

    replies(4): >>36218767 #>>36218947 #>>36219214 #>>36220027 #
    2. SparkyMcUnicorn ◴[] No.36218767[source]
    Seems like serverless is the way to go for fast output while remaining inexpensive.

    e.g.

    https://replicate.com/stability-ai/stablelm-tuned-alpha-7b

    https://github.com/runpod/serverless-workers/tree/main/worke...

    https://modal.com/docs/guide/ex/falcon_gptq
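
    For the first of those, for example, generating from the hosted model is a single API call along these lines (requires the replicate Python package and REPLICATE_API_TOKEN in the environment; the version hash is a placeholder - grab the current one from the model page):

      import replicate

      # Call the Replicate-hosted stablelm-tuned-alpha-7b; the output is
      # streamed back as an iterator of text chunks.
      output = replicate.run(
          "stability-ai/stablelm-tuned-alpha-7b:<version-hash>",  # placeholder version
          input={"prompt": "Explain GGML quantization in one paragraph."},
      )
      print("".join(output))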

    replies(2): >>36219155 #>>36227291 #
    3. b33j0r ◴[] No.36218947[source]
    I wish everyone in tech had your perspective. That is what I see, as well.

    There is a lull right now between GPT-4 and GPT-5 (literally and metaphorically). Consumer models are plateauing around 40B parameters, roughly what a barely-reasonable RTX 3090 can handle (ggml made this possible).

    Now is the time to launch your ideas, all!

    4. tikkun ◴[] No.36219155[source]
    I think that's true if you're doing minimal usage / low utilization, otherwise a dedicated instance will be cheaper.
    replies(1): >>36220820 #
    5. digitallyfree ◴[] No.36219214[source]
    The fact that this is commodity hardware makes ggml extremely impressive and puts the tech in the hands of everyone. I recently reported my experience running a 7B model with llama.cpp on a 15-year-old Core 2 Quad [1] - when that machine came out it was a completely different world, and I certainly never imagined what AI would look like today. This was around when the first iPhone was released and everyone began talking about how smartphones would become the next big thing. We saw what happened 15 years later...

    Today, with the new k-quants, users are reporting that 30B models work with 2-bit quantization on machines with 16 GB of RAM or VRAM [2]. That puts these models within reach of millions of consumers, and the optimizations will only improve from there.

    [1] https://old.reddit.com/r/LocalLLaMA/comments/13q6hu8/7b_perf...

    [2] https://github.com/ggerganov/llama.cpp/pull/1684, https://old.reddit.com/r/LocalLLaMA/comments/141bdll/moneros...
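
    To see why a 2-bit 30B model fits in 16 GB, a back-of-the-envelope sketch - assuming roughly 2.6 effective bits per weight for the 2-bit k-quant (block scales add a little on top of the nominal 2 bits) and ignoring KV-cache/context overhead:

      # Does a 2-bit-quantized 30B model fit in 16 GB?
      params = 30e9            # 30B parameters
      bits_per_weight = 2.6    # assumed effective size of the 2-bit k-quant
      model_bytes = params * bits_per_weight / 8

      print(f"~{model_bytes / 2**30:.1f} GiB for the weights alone")  # ~9.1 GiB
      # That leaves several GiB of a 16 GB machine for context and everything else.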

    6. c_o_n_v_e_x ◴[] No.36220027[source]
    What do you mean by commodity hardware? Single-server, single-CPU-socket x86/ARM boxes? Anything that does not have a GPU?
    replies(2): >>36220104 #>>36222947 #
    7. ◴[] No.36220104[source]
    8. menzoic ◴[] No.36220820{3}[source]
    You are correct. The pricing model guarantees this: pay per unit of compute vs. pay for uptime (during which you could get more compute for less).
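
    As a rough sketch with made-up prices (actual per-second and per-month rates vary a lot by provider and hardware), the break-even point is just:

      # Rough break-even between serverless (pay per second of compute) and a
      # dedicated instance (pay for uptime). Prices are made-up placeholders.
      serverless_per_second = 0.0005   # assumed $/s of compute
      dedicated_per_month = 500.0      # assumed $/month for a comparable instance

      seconds_per_month = 30 * 24 * 3600
      breakeven = dedicated_per_month / (serverless_per_second * seconds_per_month)

      print(f"Dedicated wins above ~{breakeven:.0%} average utilization")  # ~39%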
    9. KronisLV ◴[] No.36222947[source]
    > What do you mean by commodity hardware?

    In my case, my local workstation has a Ryzen 5 1600 desktop CPU from 2017 (first-generation Zen, 14 nm), and even that worked decently.

    Of course, response times grow with longer inputs and outputs or with larger models, but getting a response in less than a minute while running purely on the CPU is encouraging in and of itself.

    10. baobabKoodaa ◴[] No.36227291[source]
    I think cold start times will be excessive for serverless in this use case.
    replies(1): >>36228971 #
    11. SparkyMcUnicorn ◴[] No.36228971{3}[source]
    A 3 second cold start is good enough for me.