
1311 points msoad | 7 comments
jart ◴[] No.35393615[source]
Author here. For additional context, please read https://github.com/ggerganov/llama.cpp/discussions/638#discu... The loading-time performance has been a huge win for usability, and folks have been having the most wonderful reactions after using this change. But we don't have a compelling enough theory yet to explain the RAM usage miracle. So please don't get too excited just yet! Yes, things are getting more awesome, but, like all things in science, a small amount of healthy skepticism is warranted.
replies(24): >>35393868 #>>35393942 #>>35394089 #>>35394097 #>>35394107 #>>35394203 #>>35394208 #>>35394244 #>>35394259 #>>35394288 #>>35394408 #>>35394881 #>>35395091 #>>35395249 #>>35395858 #>>35395995 #>>35397318 #>>35397499 #>>35398037 #>>35398083 #>>35398427 #>>35402974 #>>35403334 #>>35468946 #
bestcoder69 ◴[] No.35395995[source]
Thanks for this! I was able to integrate alpaca-30B into a Slack bot & a quick tkinter GUI (coded by GPT-4 tbh) by just shelling out to `./main` in both cases, since model loading is so quick now. (I didn't even have to ask GPT-4 to code me up Python bindings to llama's C-style API!)
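If anyone wants to do something similar, here is a minimal sketch of that shell-out approach in Python. The binary path, model filename, and default flag values are placeholders of my own; llama.cpp's main does accept -m, -p, -n, and -t, but adjust everything to your build and model.

    import subprocess

    def ask_llama(prompt, model="./models/alpaca-30B-q4.bin", n_predict=128):
        # Invoke llama.cpp's ./main and return whatever it prints to stdout.
        # The binary location and model path above are placeholders.
        result = subprocess.run(
            ["./main", "-m", model, "-p", prompt, "-n", str(n_predict)],
            capture_output=True, text=True, check=True,
        )
        return result.stdout

    print(ask_llama("Explain mmap in one sentence."))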
replies(1): >>35402054 #
1. bugglebeetle ◴[] No.35402054[source]
What’s your setup for running these? I’m not seeing performance improvements on off-the-shelf hardware that would allow for this.
replies(1): >>35407026 #
2. MacsHeadroom ◴[] No.35407026[source]
I host a llama-13B IRC chatbot on a spare old Android phone.
replies(1): >>35407147 #
3. bugglebeetle ◴[] No.35407147[source]
Have a repo anywhere?
replies(1): >>35418741 #
4. MacsHeadroom ◴[] No.35418741{3}[source]
It's just the same llama.cpp repo everyone else is using. You just git clone it to your Android phone in Termux, then run make, and you're done. https://github.com/ggerganov/llama.cpp

Assuming you have the model file downloaded (you can use wget to download it), these are the instructions to install and run:

    pkg install git cmake build-essential
    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    make -j
    ./main -m <path/to/your/model.bin>

replies(1): >>35418871 #
5. bugglebeetle ◴[] No.35418871{4}[source]
Yeah, I’ve already been running llama.cpp locally, but I haven’t found it to perform at the level attested in the comment (a 30B model as a chat bot on commodity hardware). 13B runs okay, but inference is generally too slow to do anything useful on my MacBook. I wondered what you might be doing to get usable performance in that context.
replies(1): >>35441374 #
6. MacsHeadroom ◴[] No.35441374{5}[source]
You can change the number of threads llama.cpp uses with the -t argument. By default it only uses 4. For example, if your CPU has 16 physical cores, you can run ./main -m model.bin -t 16.

16 cores would be about 4x faster than the default 4 cores. Eventually you hit memory bandwidth bottlenecks, so 32 cores is unfortunately not twice as fast as 16 cores.
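If you'd rather not hard-code the thread count, here is a rough sketch of picking it automatically. This is my own heuristic, not anything built into llama.cpp: os.cpu_count() reports logical cores, so halving it is a crude stand-in for physical cores on SMT/hyper-threaded machines, and the binary and model paths are placeholders.

    import os
    import subprocess

    logical = os.cpu_count() or 4   # logical cores (includes SMT siblings)
    threads = max(1, logical // 2)  # crude proxy for physical core count

    subprocess.run(["./main", "-m", "model.bin", "-t", str(threads), "-p", "Hello"], check=True)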

replies(1): >>35443868 #
7. bugglebeetle ◴[] No.35443868{6}[source]
Thanks! Will test that out!