1311 points msoad | 7 comments
1. w1nk No.35394065
Does anyone know how/why this change decreases memory consumption (and isn't a bug in the inference code)?

From my understanding of the issue, mmap'ing the file shows that inference only touches a fraction of the weight data.

Doesn't the forward pass require accessing all of the weights, not just a fraction of them?

replies(4): >>35394751 #>>35396440 #>>35396507 #>>35398499 #
2. matsemann No.35394751
Maybe much of the data is embedding values or tokenizer stuff, of which a single prompt only uses a fraction, and the rest of the model is quite small.
replies(1): >>35394872 #
3. w1nk No.35394872
That shouldn't be the case. 30B directly denotes the number of parameters in the model itself, not the size of other components.
4. jhatemyjob No.35396440
If you read a file the conventional way (read() into a malloc'd buffer), the kernel copies the data into userspace. With mmap there is no copy; the mapping is backed directly by the page cache.
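
To illustrate the difference, here is a minimal sketch, not llama.cpp's actual loader, assuming a POSIX system and a placeholder weights path "model.bin": one path copies the file into a heap buffer, the other just maps it.

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("model.bin", O_RDONLY);  /* placeholder path */
        if (fd < 0) { perror("open"); return 1; }
        struct stat st;
        fstat(fd, &st);

        /* Option A: read() copies every byte from the kernel's page cache
         * into a private heap buffer, so the data ends up held twice. */
        char *buf = malloc(st.st_size);
        off_t off = 0;
        while (off < st.st_size) {
            ssize_t n = read(fd, buf + off, st.st_size - off);
            if (n <= 0) break;
            off += n;
        }

        /* Option B: mmap() only sets up a mapping; pages fault in from the
         * page cache on first access, and clean pages can be dropped again
         * by the kernel under memory pressure. */
        void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }

        munmap(map, st.st_size);
        free(buf);
        close(fd);
        return 0;
    }

With the read() path the weights exist both in the page cache and in the process heap; with the mapping, the page cache copy is the only one.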
5. losteric No.35396507
Yeah, I believe some readers are misinterpreting the report. The OS manages mmap'd memory: it doesn't show up as "regular" memory utilization because the pages are lazily loaded and automatically managed. If the OS can keep the whole file in memory it will, and under memory pressure it will transparently drop those file-backed pages first, prioritizing explicitly allocated memory (malloc).

Sounds like the big win is load time from the optimizations. Also, maybe llama.cpp now supports low-memory systems by letting the OS page the mmap'd weights in and out? ... At the end of the day, the 30B quantized model is still 19GB...
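
The lazy loading is easy to observe directly. A minimal sketch, assuming Linux and the placeholder path "model.bin": map the file read-only and ask mincore() which of its pages are currently resident.

    #define _DEFAULT_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("model.bin", O_RDONLY);  /* placeholder path */
        if (fd < 0) { perror("open"); return 1; }
        struct stat st;
        fstat(fd, &st);

        void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }

        long page = sysconf(_SC_PAGESIZE);
        size_t npages = ((size_t)st.st_size + page - 1) / page;
        unsigned char *vec = malloc(npages);

        /* mincore() sets the low bit of vec[i] if page i is in memory. */
        if (mincore(map, st.st_size, vec) == 0) {
            size_t resident = 0;
            for (size_t i = 0; i < npages; i++)
                resident += vec[i] & 1;
            printf("%zu of %zu pages resident (%.1f%%)\n",
                   resident, npages, 100.0 * resident / npages);
        }

        free(vec);
        munmap(map, st.st_size);
        close(fd);
        return 0;
    }

Run right after mapping, this should report only the pages already sitting in the page cache; the count grows as more of the file is actually touched.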

6. l33tman No.35398499
It's not a bug; it's a misreading of the htop output, since the mmap'd file doesn't show up in the resident set size there. The pages are read-only and never dirtied, so accounting for them is "on the OS", and since the OP had lots of RAM on the machine the model simply resides in his page cache instead.
replies(1): >>35398845 #
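
For the curious, that accounting can be inspected from the process itself. A minimal sketch, assuming Linux with a reasonably recent kernel (smaps_rollup) and the placeholder path "model.bin": map the file, fault its pages in, then dump /proc/self/smaps_rollup and compare the clean versus dirty page counts.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("model.bin", O_RDONLY);  /* placeholder path */
        if (fd < 0) { perror("open"); return 1; }
        struct stat st;
        fstat(fd, &st);

        void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }

        /* Touch every page so the mapping is actually faulted in. */
        volatile unsigned long sum = 0;
        for (off_t i = 0; i < st.st_size; i += 4096)
            sum += ((const unsigned char *)map)[i];

        /* Dump the kernel's per-process memory accounting; the mapped file
         * shows up as clean, file-backed pages the kernel is free to drop. */
        FILE *f = fopen("/proc/self/smaps_rollup", "r");
        char line[256];
        while (f && fgets(line, sizeof line, f))
            fputs(line, stdout);
        if (f) fclose(f);

        munmap(map, st.st_size);
        close(fd);
        return 0;
    }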
7. w1nk No.35398845
Ahh, this would do it, thanks :).