jart:
Author here. For additional context, please read https://github.com/ggerganov/llama.cpp/discussions/638#discu... The loading time performance has been a huge win for usability, and folks have been having the most wonderful reactions after using this change. But we don't have a compelling enough theory yet to explain the RAM usage miracle. So please don't get too excited just yet! Yes things are getting more awesome, but like all things in science a small amount of healthy skepticism is warranted.
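For readers who want the gist of the technique: instead of read()ing the weights into freshly allocated buffers, the file is mapped and the tensors point into the mapping. Here is a minimal C sketch of that general pattern, not llama.cpp's actual loader; the file name is made up. No bytes are copied at load time: pages are faulted in lazily and live in the OS page cache, which is why a second load is nearly instant.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("model.bin", O_RDONLY);   /* hypothetical weights file */
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

        /* No copy happens here: the kernel just sets up page tables.
           Pages are read on first touch and stay in the shared page
           cache, so a relaunch maps the same physical pages for free. */
        void *weights = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (weights == MAP_FAILED) { perror("mmap"); return 1; }
        close(fd);   /* the mapping stays valid after close */

        /* ... point tensor structs directly into `weights` ... */

        munmap(weights, st.st_size);
        return 0;
    }

One thing worth noting about the RAM numbers: file-backed pages are accounted and evicted differently from anonymous heap memory, which may be part of why usage looks smaller than expected.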
intelVISA:
Didn't expect to see two titans today: ggerganov AND jart. Can y'all slow down? You make us mortals look bad :')

Seeing such clever use of mmap makes me dread to imagine how much Python spaghetti is probably tanking the infra at OpenAI and other "big ML" shops that should have trusted zero-copy solutions.

Perhaps SWE is dead after all, but LLMs didn't kill it...

rfoo:
Sigh.

It's not like the zero-copy buzzword is going to help you during training: all your weights have to stay on the GPU, you sample your training data randomly, and that data sits on networked storage anyway, so mmap HURTS. You'd be better off with plain O_DIRECT.
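For the unfamiliar, a rough C sketch of the O_DIRECT pattern (Linux-specific; the file name and sizes are made up): reads bypass the page cache and land straight in your own aligned buffer, which is what you want when each sample is read once in random order and caching would only waste RAM.

    #define _GNU_SOURCE            /* for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void) {
        /* Bypass the page cache: each record is sampled randomly and
           read once, so caching it would only evict something useful. */
        int fd = open("train-shard-0042.bin", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        size_t len = 1 << 20;      /* say, one 1 MiB training record */
        void *buf;
        /* O_DIRECT requires buffer, offset, and length to be
           block-aligned; exact rules depend on the filesystem. */
        if (posix_memalign(&buf, 4096, len) != 0) {
            fprintf(stderr, "posix_memalign failed\n");
            return 1;
        }

        off_t offset = 0;          /* in practice: random, block-aligned */
        if (pread(fd, buf, len, offset) < 0) { perror("pread"); return 1; }

        free(buf);
        close(fd);
        return 0;
    }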

Similarly, as long as you run your inference on a GPU, it's not like you can mmap anyway... And I have indeed worked on inference runtimes for mobile devices. In the rare cases where we needed to run CPU-only (hey, your phone has had a GPU since forever), we did have an mmap-able model format at $PREVIOUS_JOB; it also helps in TEE/SGX/whatever enclave tech. Oh, and there was no Python at all.
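What makes a format "mmap-able" is mostly layout discipline. A hypothetical C sketch (this layout is illustrative, not the actual format from $PREVIOUS_JOB): a fixed-size header, offsets instead of pointers, and tensor data padded to page boundaries so it can be used in place with no deserialization step, which is also what lets an enclave measure it directly.

    #include <stdint.h>

    #define DATA_ALIGN 4096        /* page-sized; also safe for SIMD loads */

    struct tensor_entry {          /* fixed-size, position-independent */
        uint32_t dtype;            /* e.g. 0 = f32, 1 = f16, 2 = int8 */
        uint32_t n_dims;
        uint64_t shape[4];
        uint64_t data_offset;      /* from file start; multiple of DATA_ALIGN */
    };

    struct model_header {
        uint32_t magic;            /* identifies the format */
        uint32_t version;
        uint64_t n_tensors;        /* tensor_entry[n_tensors] follows,
                                      then the padded tensor data */
    };

    /* "Loading" a tensor is pointer arithmetic on the mapping, no copies: */
    static inline const void *tensor_data(const void *base,
                                          const struct tensor_entry *t) {
        return (const char *)base + t->data_offset;
    }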

The recent development of ggml is interesting because it catches a moment the "big ML shop infra" folks don't care about: running models on Apple Silicon. M1/M2s are expensive enough that we don't consider deploying them in production instead of those 1000000000 bizarre accelerators, yet everyone on HN seems to have one, and hey, it's fast enough for LMs. They're rather unique: CPU + high-bandwidth RAM + accelerators, with the RAM fully shared with the CPU, instead of some GPU shit.

tl;dr: it's not that the "big ML shop infra" folks are stupid and leave performance on the table; they just don't run their production workloads on MacBooks. That's where the community shines, right?

sroussey:
On a Mac, mmap definitely works for the GPU since it’s all the same unified memory.
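A sketch of why that works (the file name is made up, and the Metal call shown in the comment is Objective-C, not C): Metal can wrap existing pages in a buffer without copying, provided the pointer is page-aligned and the length is a multiple of the page size, both of which an mmap of a whole file gives you almost for free.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("model.bin", O_RDONLY);   /* hypothetical weights file */
        if (fd < 0) { perror("open"); return 1; }
        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

        /* Metal's newBufferWithBytesNoCopy requires a page-aligned
           pointer and a page-multiple length, so round the length up. */
        long page = sysconf(_SC_PAGESIZE);
        size_t len = ((size_t)st.st_size + page - 1) & ~(size_t)(page - 1);

        void *weights = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        if (weights == MAP_FAILED) { perror("mmap"); return 1; }
        close(fd);

        /* On Apple Silicon the GPU reads the very same physical pages;
           from Objective-C, the no-copy handoff is roughly:
             [device newBufferWithBytesNoCopy:weights
                                       length:len
                                      options:MTLResourceStorageModeShared
                                  deallocator:nil];
           so there is no separate upload/copy step at all. */
        munmap(weights, len);
        return 0;
    }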