    899 points georgehill | 14 comments
    1. world2vec ◴[] No.36216161[source]
    Might be a silly question but is GGML a similar/competing library to George Hotz's tinygrad [0]?

    [0] https://github.com/geohot/tinygrad

    replies(2): >>36216187 #>>36218539 #
    2. qeternity ◴[] No.36216187[source]
    No, GGML is a CPU-optimized library and quantized weight format that is closely linked to his other project, llama.cpp.
    replies(2): >>36216244 #>>36216266 #
    3. stri8ed ◴[] No.36216244[source]
    How does the quantization happen? Are the weights preprocessed before loading the model?
    replies(2): >>36216303 #>>36216321 #
    4. ggerganov ◴[] No.36216266[source]
    ggml started with a focus on CPU inference, but lately we have been augmenting it with GPU support. Although still in development, it already has partial CUDA, OpenCL, and Metal backend support.
    replies(3): >>36216327 #>>36216442 #>>36219452 #
    5. sebzim4500 ◴[] No.36216303{3}[source]
    Yes, but to my knowledge it doesn't do any of the complicated optimization that SOTA quantisation methods use. It is basically just doing a bunch of rounding.

    There are advantages to simplicity, after all.

    replies(1): >>36216416 #
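    For a concrete sense of what "just a bunch of rounding" means here, a minimal C sketch (assuming a group of 8 weights, a signed 4-bit range of [-7, 7], and absmax scaling; this is not GGML's actual block layout):

        // Round-to-nearest quantization of a small group of floats.
        // Illustrative only; not the actual GGML format.
        #include <math.h>
        #include <stdio.h>

        int main(void) {
            float w[8] = {0.12f, -0.50f, 0.33f, 0.08f, -0.21f, 0.44f, -0.07f, 0.29f};

            // absmax scaling: map the largest magnitude onto the 4-bit range [-7, 7]
            float amax = 0.0f;
            for (int i = 0; i < 8; i++) if (fabsf(w[i]) > amax) amax = fabsf(w[i]);
            float scale = amax / 7.0f;

            for (int i = 0; i < 8; i++) {
                int q = (int)roundf(w[i] / scale);                        // quantize
                printf("%+.2f -> q=%+d -> %+.3f\n", w[i], q, q * scale);  // dequantized value
            }
            return 0;
        }

    Storing a few bits per weight plus one scale per group is what shrinks the model; the rounding error is the accuracy cost that the fancier methods try to minimize.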
    6. ggerganov ◴[] No.36216321{3}[source]
    The weights are preprocessed into integer quants combined with scaling factors, in various configurations (4-, 5- and 8-bit, and recently more exotic 2-, 3- and 6-bit quants). At runtime, we use efficient SIMD implementations to perform the matrix multiplications at the integer level, carefully optimizing for both compute and memory bandwidth. Similar strategies are applied when running GPU inference, using custom kernels for fast matrix x vector multiplications.
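    As a rough scalar sketch of the scheme described above, assuming blocks of 32 weights, one scaling factor per block, 4-bit quants packed two per byte, and activations quantized to 8-bit with their own scale (block_q4, quantize_block_q4 and dot_block are hypothetical names; the real ggml kernels use SIMD intrinsics and different block layouts):

        // Scalar sketch of a block-quantized dot product: 4-bit weights with a
        // per-block scale, 8-bit activations with their own scale, integer
        // multiply-accumulate, scales applied once at the end.
        // Illustrative only; not ggml's actual block layout or kernel code.
        #include <stdint.h>
        #include <math.h>
        #include <stdio.h>

        #define QK 32  // weights per block

        typedef struct {
            float   scale;        // per-block scaling factor
            uint8_t qs[QK / 2];   // 4-bit quants, packed two per byte
        } block_q4;               // hypothetical name, loosely modeled on 4-bit blocks

        static void quantize_block_q4(const float *x, block_q4 *b) {
            float amax = 0.0f;
            for (int i = 0; i < QK; i++) if (fabsf(x[i]) > amax) amax = fabsf(x[i]);
            b->scale = amax / 7.0f;
            for (int i = 0; i < QK; i += 2) {
                int q0 = (int)roundf(x[i]     / b->scale) + 8;  // shift into [1, 15]
                int q1 = (int)roundf(x[i + 1] / b->scale) + 8;
                b->qs[i / 2] = (uint8_t)(q0 | (q1 << 4));
            }
        }

        // Dot product of one 4-bit weight block with QK activations quantized to int8.
        static float dot_block(const block_q4 *b, const int8_t *yq, float yscale) {
            int32_t isum = 0;                        // integer accumulator
            for (int i = 0; i < QK; i += 2) {
                int w0 = (b->qs[i / 2] & 0x0F) - 8;  // unpack low nibble, undo the shift
                int w1 = (b->qs[i / 2] >> 4)   - 8;  // unpack high nibble
                isum += w0 * yq[i] + w1 * yq[i + 1];
            }
            return isum * b->scale * yscale;         // apply both scales once per block
        }

        int main(void) {
            float x[QK], y[QK];
            int8_t yq[QK];
            for (int i = 0; i < QK; i++) { x[i] = sinf(0.1f * i); y[i] = cosf(0.1f * i); }

            block_q4 b;
            quantize_block_q4(x, &b);

            // quantize the activations to int8 with a single absmax scale
            float yamax = 0.0f;
            for (int i = 0; i < QK; i++) if (fabsf(y[i]) > yamax) yamax = fabsf(y[i]);
            float yscale = yamax / 127.0f;
            for (int i = 0; i < QK; i++) yq[i] = (int8_t)roundf(y[i] / yscale);

            float ref = 0.0f;
            for (int i = 0; i < QK; i++) ref += x[i] * y[i];
            printf("quantized dot: %f   float dot: %f\n", dot_block(&b, yq, yscale), ref);
            return 0;
        }

    A SIMD kernel does the same unpack/multiply/accumulate across many lanes at once, which is where the compute and memory-bandwidth tuning mentioned above comes in; the GPU backends map the same per-block work onto custom matrix x vector kernels.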
    7. qeternity ◴[] No.36216327{3}[source]
    Hi Georgi, thanks for all the work; I have been following and using it since the Llama base layers became available!

    Wasn't implying it's CPU-only, just that it started as a CPU-optimized library.

    8. brucethemoose2 ◴[] No.36216416{4}[source]
    It's not so simple anymore; see https://github.com/ggerganov/llama.cpp/pull/1684
    9. freedomben ◴[] No.36216442{3}[source]
    As a person burned by Nvidia, I can't thank you enough for the OpenCL support.
    10. xiphias2 ◴[] No.36218539[source]
    They are competing (although they are very different: tinygrad is full-stack Python, while ggml focuses on a few very important models), but in my opinion George Hotz lost focus a bit by not working more on getting the low-level optimizations perfect.
    replies(1): >>36219975 #
    11. ignoramous ◴[] No.36219452{3}[source]
    (a novice here who knows a couple of fancy terms)

    > ...lately we have been augmenting it with GPU support.

    Would you say you'd then be building an equivalent to Google's JAX?

    Someone even asked if anyone would build a C++-to-JAX transpiler [0]... I am wondering if that's something you might implement? Thanks.

    [0] https://news.ycombinator.com/item?id=35475675

    12. georgehotz ◴[] No.36219975[source]
    Which low level optimizations specifically are you referring to?

    I'm happy with most of the abstractions. We are pushing toward assembly codegen. And if you meant things like matrix accelerators, that's my next priority.

    We are taking more of a breadth-first approach. I think ggml is more depth-first and application-focused. (And I think Mojo is even more breadth-first.)

    replies(1): >>36222732 #
    13. xiphias2 ◴[] No.36222732{3}[source]
    Maybe, but I'd love to see tinygrad beat GGML at its own game (4-bit LLM support on the M1 Mac GPU or Tensor cores) before adding more backends/models.

    It would be easy to debug, because the generated kernels can be compared to GGML's, and it would still give us something practical that we can all play with.

    At this point, breadth-first is a bit boring, because this way we don't know how far tinygrad is from optimal generated output.

    replies(1): >>36235891 #
    14. Art9681 ◴[] No.36235891{4}[source]
    I just deployed tinygrad thanks to this conversation, and I've played with just about every local LLM client and toolchain there is. I ran the examples as listed in the repo with absolutely zero problems; they just worked. I think their goal of prioritizing ease of use far outweighs any performance optimizations at this stage of the game. Nothing is stopping the team from integrating other projects if their performance delta is worth the pivot.

    From what I see, the foundation is there for a great multimodal platform. Very excited to see where this goes.