899 points | georgehill
KronisLV No.36218585
Just today I finished a blog post (also my latest submission; it felt like it could be useful to some) about getting something like this working as a bundle: a runtime for the models plus a web UI for easier interaction. In my case that was koboldcpp, which can run GGML models both on the CPU (with OpenBLAS) and on the GPU (with CLBlast). Thanks to Hugging Face, getting Metharme, WizardLM, or other models is also extremely easy, and the 4-bit quantized ones deliver decent performance even on commodity hardware!

I tested it out both locally (6c/12t CPU) and on a Hetzner CPX41 instance (8 AMD cores, 16 GB of RAM, no GPU). The latter costs about 25 EUR per month and can still generate decent responses in under half a minute; my local machine needs roughly double that time. While not quite as good as one might hope (decent response times mean maxing out the CPU for a single request if you don't have a compatible GPU with enough VRAM), the technology is definitely at a point where it can make people's lives easier in select use cases with some supervision (e.g. customer support).
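For a rough sense of why 4-bit quantization makes this feasible on a box like that, here's a back-of-envelope sketch (the helper function is hypothetical; real GGML files add some overhead for quantization scales and metadata, so treat the numbers as lower bounds):

```python
# Rough RAM needed just to hold a model's weights at a given precision.
# Hypothetical helper for illustration; ignores KV cache and file overhead.
def weight_ram_gb(n_params_billion: float, bits_per_weight: float) -> float:
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# A 7B-parameter model: ~3.5 GB at 4-bit vs ~14 GB at fp16,
# which is why the quantized version fits easily in 16 GB of RAM.
print(weight_ram_gb(7, 4))   # 4-bit quantized
print(weight_ram_gb(7, 16))  # fp16 baseline
```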

What an interesting time to be alive, I wonder where we'll be in a decade.

replies(4): >>36218767 #>>36218947 #>>36219214 #>>36220027 #
c_o_n_v_e_x No.36220027
What do you mean by commodity hardware? Single-server, single-CPU-socket x86/ARM boxes? Anything that doesn't have a GPU?
replies(2): >>36220104 #>>36222947 #