
255 points by tbruckner | 6 comments
1. rvz (No.37420331)
Totally makes sense to run inference with C++- or Rust-based AI models instead of over-bloated networks run through Python, with sub-optimal inference and fine-tuning costs.

Minimal-overhead or zero-cost abstractions around deep learning libraries implemented in those languages give some hope. People like ggerganov are not afraid of the 'don't roll your own deep learning library' dogma, and now we can see why DL on the edge and local AI are the future of efficiency in deep learning.

We'll see, but Python just can't compete on speed at all; hence Modular's Mojo compiler is another project that attacks the problem properly, with near-1:1 familiarity for Python programmers.

3. brucethemoose2 (No.37420605)
The actual inference in PyTorch is not run in Python, and it's usually not bottlenecked by it.

The problem is CUDA, not Python.

LLMs are uniquely suited to local inference in projects like GGML because they are so RAM-bandwidth-heavy (and hence relatively compute-light), and relatively simple. Your kernel doesn't need to be hyper-optimized by 35 Nvidia engineers across 3 stacks before it's fast enough to start saturating the memory bus generating tokens.
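To make that concrete, here's a back-of-envelope sketch in Python (all bandwidth figures are approximate published peaks, for illustration only): in the memory-bound regime, each generated token streams every weight through memory roughly once, so tokens/s is capped near bandwidth divided by model size.

    # Rough tokens/s ceiling for memory-bandwidth-bound generation.
    # Bandwidth figures are approximate peaks, for illustration only.
    model_bytes = 7e9 * 0.5  # 7B parameters at ~4 bits per weight (Q4)

    peak_bandwidth = {  # bytes/s, approximate
        "dual-channel DDR5": 75e9,
        "Apple M2 Max unified memory": 400e9,
        "RTX 4090 GDDR6X": 1000e9,
    }

    for device, bw in peak_bandwidth.items():
        print(f"{device}: ~{bw / model_bytes:.0f} tokens/s upper bound")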

And yet it's still an issue... For instance, llama.cpp is having trouble getting prompt-ingestion performance in a native implementation comparable to cuBLAS, even though they theoretically have a performance advantage by using the quantization directly.

4. survirtual (No.37420734)
Python is generally just the glue language for underlying, highly optimized C++ libs. The improvements aren't just about languages. I would imagine Facebook is less focused on inference, so it didn't bother to make a highly optimized LLM inference engine. There also just isn't a business case for CPU-bound LLMs at enterprise scale, so why code for that? Additionally, llama.cpp can be called from Python, and Python could still do all the glue, as sketched below.
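A minimal sketch of that glue pattern, assuming the community llama-cpp-python bindings and a local quantized model (the model path below is a placeholder): all the heavy lifting stays in llama.cpp's C++ core while Python just orchestrates.

    from llama_cpp import Llama  # pip install llama-cpp-python

    # Model loading and token generation happen in native code.
    llm = Llama(model_path="./models/llama-2-7b.Q4_K_M.gguf")  # placeholder path
    out = llm("Q: Name the planets in the solar system. A:", max_tokens=48)
    print(out["choices"][0]["text"])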

There is no language war. Use whatever tool is necessary to achieve effective results for accomplishing the mission.

5. PartiallyTyped (No.37421354)
Python is not really the bottleneck in LLM applications. It is for tabular RL, but certainly not for deep RL (I have had discussions about this with DeepMind folks on r/RL, and with the people from Stable Diffusion).

The problem is the bus, CUDA, and the sheer volume of data that needs to be transferred.
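As an illustrative sketch of that bus cost (assumes PyTorch with a CUDA device; the numbers vary by machine), timing a single host-to-device copy makes the bottleneck visible:

    import torch

    x = torch.randn(256 * 1024 * 1024)  # ~1 GiB of float32 on the host
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    x_gpu = x.to("cuda")  # cross the PCIe bus
    end.record()
    torch.cuda.synchronize()

    seconds = start.elapsed_time(end) / 1000.0  # elapsed_time is in ms
    gib = x.element_size() * x.nelement() / 2**30
    print(f"effective transfer rate: {gib / seconds:.1f} GiB/s")
    # PCIe 4.0 x16 peaks around ~25 GiB/s, far below on-device bandwidth.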

PyTorch itself is actually a wrapper around libtorch, which is written in C++.

The compilation step in PyTorch 2.0 provides a sizeable improvement, but not the two orders of magnitude you'd expect from a Python-to-C++ migration. The speedup from compilation comes from the backend more so than from Python itself; see Triton, for example.
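For reference, a minimal sketch of that compilation step using PyTorch 2.0's public torch.compile API (the toy model is just for illustration):

    import torch

    model = torch.nn.Sequential(
        torch.nn.Linear(1024, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 1024),
    )

    # torch.compile hands the captured graph to a backend (TorchInductor
    # by default, which emits Triton kernels on GPU).
    compiled = torch.compile(model)

    x = torch.randn(8, 1024)
    y = compiled(x)  # first call triggers compilation; later calls reuse it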

6. neonsunset (No.37422072)
I'm not sure why this is downvoted, but I wanted to chime in: ML successes are happening first and foremost despite Python's shortcomings, which are many.

The user experience of working with the language is terrible, because most tasks it is used for go way beyond the "scripting" scenario Python was primarily made for (aside from being an easy language to pick up and use).