
1311 points by msoad | 5 comments
jart No.35393615
Author here. For additional context, please read https://github.com/ggerganov/llama.cpp/discussions/638#discu... The loading time performance has been a huge win for usability, and folks have been having the most wonderful reactions after using this change. But we don't have a compelling enough theory yet to explain the RAM usage miracle. So please don't get too excited just yet! Yes things are getting more awesome, but like all things in science a small amount of healthy skepticism is warranted.
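
For anyone wondering what the change does mechanically: it loads the weights via mmap() instead of reading them into freshly allocated buffers. A rough sketch of that general technique follows (this is not the actual llama.cpp loader, and the file name is hypothetical):

    /* Minimal sketch of mmap()-based weight loading, not the llama.cpp code. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        const char *path = "ggml-model-f16.bin";   /* hypothetical weights file */
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

        /* PROT_READ + MAP_SHARED: nothing is copied up front; pages come
         * straight from the OS page cache and fault in only when tensors are
         * actually touched, so a warm cache makes "loading" near-instant and
         * the pages are shared between processes mapping the same file. */
        void *weights = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (weights == MAP_FAILED) { perror("mmap"); return 1; }

        printf("mapped %lld bytes at %p\n", (long long)st.st_size, weights);

        munmap(weights, st.st_size);
        close(fd);
        return 0;
    }
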
1. nynx No.35393868
Why is it behaving sparsely? There are only dense operations, right?
2. w1nk No.35394105
I also have this question; yes, it should be. The forward pass should require accessing all of the weights, AFAIK.
5. HarHarVeryFunny No.35399718
From what I've read, there's no evidence it's "behaving sparsely". That was just offered as a possible explanation for why it might not be loading all the weights, but it makes no sense in terms of the model: it's going to be using all the weights.

Another suggestion is that not all of the word/token embedding table is being used, which would be a function of the input used to test, but that would be easy enough to disprove, since there would then be different memory usage for different inputs.
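
For what it's worth, a crude way to run that comparison on Linux is to sample the process's VmRSS after evaluating different prompts. A sketch only; the helper name is my own, not anything in llama.cpp:

    /* Read the process's resident set size (KiB) from /proc/self/status. */
    #include <stdio.h>
    #include <string.h>

    long vm_rss_kib(void) {
        FILE *f = fopen("/proc/self/status", "r");
        if (!f) return -1;
        char line[256];
        long kib = -1;
        while (fgets(line, sizeof line, f)) {
            if (strncmp(line, "VmRSS:", 6) == 0) {
                sscanf(line + 6, "%ld", &kib);
                break;
            }
        }
        fclose(f);
        return kib;
    }

    /* Usage idea: call vm_rss_kib() after evaluating a short prompt and again
     * after a prompt covering many more distinct tokens; identical numbers
     * would argue against the "unused embedding rows" explanation. */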

It seems possible the reported memory usage is lower than reality, if that's how mmap/top accounting works. In any case it seems a good use of mmap, especially since for a multi-layer model the layer weights are used sequentially, so paged load-on-demand will work relatively well even in a low-memory situation.
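
One direct way to settle that would be to ask the kernel how much of the mapping is actually resident, rather than trusting top. A sketch (mine, not anything in llama.cpp) using mincore(), which reports one residency byte per page of a mapping:

    /* Measure how much of an mmap()ed weights region is resident in RAM. */
    #define _DEFAULT_SOURCE          /* for mincore() on glibc */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    void report_residency(void *addr, size_t length) {
        long page = sysconf(_SC_PAGESIZE);
        size_t npages = (length + (size_t)page - 1) / (size_t)page;
        unsigned char *vec = malloc(npages);   /* one status byte per page */
        if (!vec) return;

        /* addr must be the page-aligned start of the mapping, which mmap()
         * guarantees for the pointer it returned. Low bit = page resident. */
        if (mincore(addr, length, vec) == 0) {
            size_t resident = 0;
            for (size_t i = 0; i < npages; i++) resident += vec[i] & 1;
            printf("%zu of %zu pages resident (%.1f%%)\n",
                   resident, npages, 100.0 * resident / (double)npages);
        }
        free(vec);
    }

    /* Usage idea: call this on the mapped weights right after a forward pass.
     * If far fewer than 100% of the pages are resident, the model really isn't
     * touching all the weights; if essentially all are resident, the low
     * numbers in top are just an accounting artifact. */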