1311 points msoad | 32 comments

jart ◴[] No.35393615[source]
Author here. For additional context, please read https://github.com/ggerganov/llama.cpp/discussions/638#discu... The loading time performance has been a huge win for usability, and folks have been having the most wonderful reactions after using this change. But we don't have a compelling enough theory yet to explain the RAM usage miracle. So please don't get too excited just yet! Yes things are getting more awesome, but like all things in science a small amount of healthy skepticism is warranted.
replies(24): >>35393868 #>>35393942 #>>35394089 #>>35394097 #>>35394107 #>>35394203 #>>35394208 #>>35394244 #>>35394259 #>>35394288 #>>35394408 #>>35394881 #>>35395091 #>>35395249 #>>35395858 #>>35395995 #>>35397318 #>>35397499 #>>35398037 #>>35398083 #>>35398427 #>>35402974 #>>35403334 #>>35468946 #
intelVISA ◴[] No.35394288[source]
Didn't expect to see two titans today: ggerganov AND jart. Can y'all slow down? You make us mortals look bad :')

Seeing such clever use of mmap makes me dread to imagine how much Python spaghetti probably tanks OpenAI's and other "big ML" shops' infra when they should've trusted in zero copy solutions.

Perhaps SWE is dead after all, but LLMs didn't kill it...

replies(11): >>35395112 #>>35395145 #>>35395165 #>>35395404 #>>35396298 #>>35397484 #>>35398972 #>>35399367 #>>35400001 #>>35400090 #>>35456064 #
1. MontyCarloHall ◴[] No.35395145[source]
>how much Python spaghetti probably tanks OpenAI's and other "big ML" shops' infra when they should've trusted in zero copy solutions

Probably not all that much. All of the Python numeric computing frameworks (NumPy, PyTorch, TensorFlow, etc.) are basically just wrappers around lower-level C++/C/Fortran code. Unless you're doing something boneheaded and converting framework-native tensors to Python objects, passing tensors around within a framework essentially just passes a pointer around, which has marginal overhead even when encapsulated in a bloated Python object.

Indeed, a huge number of PyTorch operations are explicitly zero copy: https://pytorch.org/docs/stable/tensor_view.html
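
To make the "zero copy" point concrete, here is a small sketch (assuming a recent PyTorch; the shapes are arbitrary):

    import torch

    x = torch.randn(4096, 4096)          # tensor data lives in C/C++-managed memory

    # View operations are zero-copy: same underlying storage, new metadata.
    y = x.t()                            # transpose is just a stride change
    assert y.data_ptr() == x.data_ptr()  # both views share one buffer

    # Passing a tensor to a framework op effectively passes a pointer;
    # the work happens in native code, not in the Python interpreter.
    z = torch.relu(x)

    # The boneheaded case: materializing the tensor as Python objects,
    # i.e. ~16M boxed floats instead of one contiguous buffer.
    as_python = x.tolist()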

replies(2): >>35396982 #>>35408170 #
2. oceanplexian ◴[] No.35396982[source]
It's not that performance is the issue; it's that the stack is unmaintainable and prone to breakage. Exceptions aren't handled right, and dependencies are a disaster (proprietary NVIDIA drivers + CUDA + PyTorch + the various versions of everything).

This leads to all sorts of bugs and breaking changes that are cool in an academic or hobbyist setting but a total headache on a large production system.

replies(3): >>35397515 #>>35397551 #>>35398182 #
3. rfoo ◴[] No.35397515[source]
The stack is very volatile and unmaintainable because it doesn't need to be maintainable, which is exactly why we have unmaintainable software in other domains too. Over the last 10 years there has ALWAYS been a totally new model architecture with new operations (or, in the case of CV, new bizarre uses of Conv). By the time you get your performant, perfectly maintainable masterpiece ready, it's not needed anymore. The stack naturally optimizes for flexibility and iteration speed, just like why people use Rails.

In fact I'd love to see the Transformer really dominate. We could then start to converge on the software. And compute-wise, transformers are really simple, too!
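
To illustrate "compute-wise simple": the core of attention is just a couple of matmuls and a softmax. A rough sketch, ignoring multi-head logic, masking, and everything else around it:

    import torch

    def attention(q, k, v):
        # scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        return scores.softmax(dim=-1) @ v

    x = torch.randn(1, 128, 512)   # (batch, sequence length, model dim)
    out = attention(x, x, x)       # self-attention over the sequence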

replies(3): >>35399833 #>>35401019 #>>35401942 #
4. CurrentB ◴[] No.35397551[source]
Yeah, I've been using Python for the first time in a while to try out some of the LLM stuff, and I can't believe how bad the dependency hell is. It's probably particularly bad due to the pace of change in this field, but I spend an hour getting dependencies fixed every time I touch anything. 80% of the Google Colabs I find are just outright broken. I wish there were other viable non-Python options for trying these things out.
replies(3): >>35398107 #>>35399048 #>>35424595 #
5. bboygravity ◴[] No.35398107{3}[source]
No idea what a Google Colab is, but does the code come with an environment, or at least a specification of which packages and versions to use (requirements.txt)?

It sounds unnecessarily weird to me that people would share Python code that simply doesn't work out of the box at all.

replies(2): >>35398243 #>>35402417 #
6. eastWestMath ◴[] No.35398182[source]
I was in a PLT group in grad school going into robotics. I could spend all day ranting about how Python is just completely unsuitable for professional software development. Even something like F# would be an enormous improvement.
7. version_five ◴[] No.35398243{4}[source]
It's rarely as easy as sharing a requirements.txt. There are lots of things that can still break: for example, you get weird situations where different modules require different versions of a third module, or all the CUDA toolkit version issues that seem to come up with GPU stuff. When we share Python, we tend to share a Docker image, and even that isn't foolproof. A big problem, I think, is that nothing incentivizes building something portable, and it's very hard to test across different machines. Add to that all the different practices around virtual environments (venv, conda, etc.): everyone tries to install the dependencies differently or starts from some nonstandard state. It's a mess.
replies(1): >>35400039 #
8. hunta2097 ◴[] No.35399048{3}[source]
You're using virtual environments, right?

ML libraries are particularly bad; most other stuff works well.

Friends don't let friends install pip into /usr/lib.

replies(1): >>35406036 #
9. nunobrito ◴[] No.35399833{3}[source]
Still a poor excuse. Had they written this in Java, things wouldn't be so difficult in terms of either performance or maintainability.

I never understood why people think indentation-based languages are any simpler, when in fact they bring all kinds of trouble for getting things done.

replies(2): >>35400733 #>>35403267 #
10. pablo1107 ◴[] No.35400039{5}[source]
Maybe Nix would be a better experience for creating such an environment, where you also depend on system utilities.
replies(1): >>35401252 #
11. revelio ◴[] No.35400733{4}[source]
There's a Java ML library called Tribuo that might be worth looking at.
replies(1): >>35400868 #
12. dopidopHN ◴[] No.35400868{5}[source]
Thanks, the boring aspect of Java is appealing here.
13. wootland ◴[] No.35401019{3}[source]
Does this mean it would be easy to move off Python altogether? It seems like the problem stems from everyone using PyTorch at the base layer. How realistic is it to recreate those APIs in another, more modern language? Coding in Rust, Go... and then distributing a single binary vs. pip hell seems like it would be worth it.
replies(2): >>35401402 #>>35403195 #
14. superkuh ◴[] No.35401252{6}[source]
Everyone is using llama.cpp because we reject the idea of giving up on system libraries the way Nix does. That kind of tomfoolery (at least in the desktop context) is only required when you use software projects whose libraries/languages break forwards compatibility every 3 years.

If you just write straight C++ (without C++xx or anything like it) you can compile the code on machines from decades ago if you want.

replies(2): >>35402510 #>>35405209 #
15. sroussey ◴[] No.35401402{4}[source]
Go would be interesting because you could ship a single executable.

I'd love for JS/TS to dominate as well. Use 'bun bun' to produce an executable if need be, but also use it in web backends.

16. colinsane ◴[] No.35401942{3}[source]
> The stack optimizes for flexibility and iteration speed naturally

“unmaintainable” (as in “i’m spending an hour each day sorting out which dep update broke my project”) usually gets in the way of the former point.

17. vkou ◴[] No.35402417{4}[source]
> No idea what a Google Collab is

It's ~equivalent to a Jupyter notebook.

18. ◴[] No.35402510{7}[source]
19. rfoo ◴[] No.35403195{4}[source]
Check https://pytorch.org/tutorials/advanced/cpp_frontend.html

You can easily build a standalone binary (well, it would be GiB+ if you use CUDA... but that's the cost of statically linking cu*) if you code your model and training loop in C++.

It then happily runs everywhere as long as an NVIDIA GPU driver is available (no need to install CUDA).

Protip: your AI research team REALLY DOESN'T WANT TO DO THIS, BECAUSE THEY LOVE PYTHON. Having Python, even with the dependency management shit, is a feature, not a bug.

(If you want Rust/Go and don't want to wrap libtorch/TF, then you have a lot of work to do, but yeah, it's possible. There are also the model compiler guys [1], where the promise is model.py in, model.o out; you just link it with your code.)

[1] https://mlc.ai
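
For what it's worth, the usual middle ground is sketched below: keep the research code in Python but export a TorchScript archive, which a C++ program built on the libtorch frontend can load with torch::jit::load, no Python interpreter needed. The model and file name here are made up for illustration.

    import torch

    class TinyModel(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = torch.nn.Linear(512, 10)

        def forward(self, x):
            return self.linear(x).softmax(dim=-1)

    # Compile the module to TorchScript and save a self-contained archive;
    # C++ code linked against libtorch can load "model.pt" directly.
    scripted = torch.jit.script(TinyModel())
    scripted.save("model.pt")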

20. rfoo ◴[] No.35403267{4}[source]
There's deeplearning4j (from the Theano days!); go figure why it didn't take off.
21. remexre ◴[] No.35405209{7}[source]
What's c++xx?
replies(1): >>35407666 #
22. AnthonyMouse ◴[] No.35406036{4}[source]
This just goes to show what a mess this is.

Suppose you have a big piece of compute hardware (e.g. at a university) which is shared by multiple users. They all want to come in and play with these models. Each one is tens to hundreds of gigabytes. Is each user supposed to have their own copy in their home directory?

replies(1): >>35406562 #
23. Accujack ◴[] No.35406562{5}[source]
This is not exactly a new problem.
replies(1): >>35406640 #
24. AnthonyMouse ◴[] No.35406640{6}[source]
That's kind of the point. We solved this problem decades ago. You have a system package manager that installs a system-wide copy of the package that everybody can use.

But now we encounter this broken nonsense because solved problems get unsolved by bad software.

25. opless ◴[] No.35407666{8}[source]
C++11 and greater.
replies(1): >>35416499 #
26. miraculixx ◴[] No.35408170[source]
What a bad take!

Python is not the cause of dependency hell. Deep dependency trees are. The only way to deal with this is to use separate environments and to carefully specify the exact requirements.
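
A minimal sketch of that approach using only the standard library (the paths assume a Linux/macOS layout, and requirements.txt is assumed to already pin exact versions):

    import subprocess
    import venv

    # Create an isolated environment with its own pip.
    venv.create(".venv", with_pip=True)

    # Install the pinned requirements into that environment only.
    subprocess.run([".venv/bin/pip", "install", "-r", "requirements.txt"], check=True)

    # Record the exact resolved versions so the environment is reproducible.
    with open("requirements.lock", "w") as f:
        subprocess.run([".venv/bin/pip", "freeze"], stdout=f, check=True)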

Those who claim some language would be a magical fix clearly lack experience in multiple languages.

replies(1): >>35413971 #
27. wnoise ◴[] No.35413971[source]
It's true that nothing forces or forbids this, but some languages/toolings/communities/ecosystems encourage it more than others.
28. remexre ◴[] No.35416499{9}[source]
Huh, I was proficient in Rust before "properly" learning C++, so maybe that accounts for it, but I didn't realize C++11 was controversial. Is it just move semantics, or are there some library things that are hard to implement?
replies(1): >>35431561 #
29. nyarlathotep_ ◴[] No.35424595{3}[source]
IME the ML world with Python is a whole mess on top of the existing dependency issues.

I've been very _careful_ too (using pyenv, virtualenvs, etc.) with dependency management, but with NVIDIA driver dependencies and "missing sqlite3/bz2" issues related to the underlying interpreter (not to mention issues with different Python 3.x versions), I'm lucky to even be able to run a 'hello world' ML sample after an afternoon of fighting with it.

My Ubuntu install w/ Nvidia card only seems to recognize the GPU in some circumstances even when using the same `conda` env. Often this is remedied by rebooting the machine(?).

No idea how companies manage this stuff in production. Absolute minefield that seems to catastrophically break if you sneeze at it.

I'll admit I am not an expert in managing ML envs, but I've dealt with a lot of python environments for typical CRUD stuff, and while rough at times, it was never this bad.

30. int_19h ◴[] No.35431561{10}[source]
I think what OP is saying is that decades-old systems wouldn't have C++11-compatible compilers on them.
replies(1): >>35441542 #
31. bboygravity ◴[] No.35441542{11}[source]
And maybe that "C++" is now basically a bunch of different incompatible languages instead of just 1 language, depending on what "xx" is (11, 14, 17, 20, 23, etc).

It's like Python 2 vs Python 3 except even worse.

replies(1): >>35444029 #
32. int_19h ◴[] No.35444029{12}[source]
In my experience, C++03 code works just fine without changes on C++11 and C++14 compilers, so no, it's not at all like Python 2/3. The few features that were ripped out were exactly the stuff that pretty much no one was using anyway, for good reasons (e.g. throw specifications).