1311 points msoad | 16 comments
1. kccqzy ◴[] No.35395739[source]
I might be missing something, but I actually couldn't reproduce it. I purposefully chose a computer with 16GiB RAM to run the 30B model. Performance was extremely slow, and the process was clearly not CPU-limited, unlike when it's running the 13B model. It's clearly swapping a lot.
replies(5): >>35396367 #>>35396552 #>>35396848 #>>35398023 #>>35398479 #
2. losteric ◴[] No.35396367[source]
Are the weights on NVME? Old SSD? HDD?
replies(2): >>35397005 #>>35402314 #
3. freehorse ◴[] No.35396552[source]
Same here, performance of the quantised 30B model on my M1 16GB Air is absolutely terrible. A couple of things I noticed in Activity Monitor:
1. "memory used" + "cached files" == 16GB (while swap is zero)
2. disk reads run at 500-600 MB/s
3. every token seems to be computed exactly _after every ~20GB read from disk_, which suggests that for each token it re-reads the weights file (instead of keeping it cached). I actually suspect that swapping may have been more efficient.

The last part (3), that it re-reads the whole file for every token, is an assumption; it could just be a coincidence that a new token comes out at every ~20GB read from disk. But it seems plausible, as I do not think swapping would have been that inefficient.
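One way to test the re-reading hypothesis would be to check, between tokens, how much of the weights file is actually resident in the page cache. Below is a minimal sketch using mmap() and mincore(); the default model path is just a placeholder, and on macOS the mincore() prototype takes char * rather than unsigned char *, so a cast may be needed there.

    /* pagecache_residency.c -- rough check of how much of a mapped file is
     * resident in the page cache. Build: cc -O2 pagecache_residency.c */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        /* placeholder default -- pass the real weights file as argv[1] */
        const char *path = argc > 1 ? argv[1] : "ggml-model-q4_0.bin";
        int fd = open(path, O_RDONLY);
        struct stat st;
        if (fd < 0 || fstat(fd, &st) != 0) { perror(path); return 1; }

        void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) { perror("mmap"); return 1; }

        long page = sysconf(_SC_PAGESIZE);
        size_t npages = (st.st_size + page - 1) / page;
        unsigned char *vec = malloc(npages);          /* one byte per page */
        if (!vec || mincore(map, st.st_size, vec) != 0) {
            perror("mincore");
            return 1;
        }

        size_t resident = 0;
        for (size_t i = 0; i < npages; i++) resident += vec[i] & 1;
        printf("%zu of %zu pages resident in page cache (%.1f%%)\n",
               resident, npages, 100.0 * (double)resident / (double)npages);
        return 0;
    }

Running this right after each generated token would show whether the file keeps falling out of the cache on a 16GB machine.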

replies(1): >>35399836 #
4. PragmaticPulp ◴[] No.35396848[source]
> I might be missing something but I actually couldn't reproduce.

Someone in the GitHub comments had the same experience when using a 10GB VM to limit memory usage.

It appears the claims of memory reduction were premature. Perhaps an artifact of how memory usage is being reported by some tools.

5. alchemist1e9 ◴[] No.35397005[source]
It’s interesting how NVMe will become even more critical if this lazy weight-loading approach works out. PCIe 5 has arrived just in time for LLM inference, it seems.
replies(1): >>35397081 #
6. freehorse ◴[] No.35397081{3}[source]
Well, in this case it does not really have to do with SSDs; quite the opposite, the performance gain here seems to come from caching the file in RAM in the first place.
replies(1): >>35399827 #
7. bugglebeetle ◴[] No.35398023[source]
Same here on an M1 MacBook Pro. Zero speedup on loading and inference.
8. versteegen ◴[] No.35398479[source]
The apparent answer is here: https://news.ycombinator.com/item?id=35398012

> mmap-ed memory pages backed by a file that aren't dirty aren't counted in a process's RSS usage, only in the kernel page cache. The mmap-ed regions of virtual memory do get counted in VSZ (virtual memory), but that is just virtual and can be larger than RAM + swap.
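A quick way to see this effect on Linux is to mmap the weights and compare VmSize and VmRSS from /proc/self/status: the mapping shows up in virtual size immediately, while resident size only grows as pages are actually faulted in, and clean file-backed pages can later be evicted without touching swap. A minimal sketch (Linux-only; the weights path is a placeholder):

    /* rss_vs_vsz.c -- map a large file and watch VmSize vs VmRSS (Linux only). */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static void print_mem(const char *label) {
        char line[256];
        FILE *f = fopen("/proc/self/status", "r");
        if (!f) return;
        while (fgets(line, sizeof line, f)) {
            if (!strncmp(line, "VmSize:", 7) || !strncmp(line, "VmRSS:", 6))
                printf("%s %s", label, line);
        }
        fclose(f);
    }

    int main(int argc, char **argv) {
        const char *path = argc > 1 ? argv[1] : "ggml-model-q4_0.bin"; /* placeholder */
        int fd = open(path, O_RDONLY);
        struct stat st;
        if (fd < 0 || fstat(fd, &st) != 0) { perror(path); return 1; }

        print_mem("before mmap:");

        /* the whole file is added to the virtual address space... */
        unsigned char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        print_mem("after mmap: ");

        /* ...but pages only become resident (and show up in RSS) when touched */
        volatile unsigned char sum = 0;
        for (off_t off = 0; off < st.st_size && off < (off_t)1 << 30; off += 4096)
            sum += p[off];                     /* touch up to ~1 GiB of pages */
        print_mem("after touch:");
        (void)sum;
        return 0;
    }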

9. alchemist1e9 ◴[] No.35399827{4}[source]
That’s not my understanding. The entire point is that the model can’t fit in RAM; mmap allows lazy loading from storage.
replies(1): >>35400197 #
10. muyuu ◴[] No.35399836[source]
Can you share the intermediate files? They're taking ages to process on my 16GB-RAM laptop
replies(1): >>35400111 #
11. freehorse ◴[] No.35400111{3}[source]
Which files are you referring to exactly?
replies(1): >>35400360 #
12. freehorse ◴[] No.35400197{5}[source]
Yes, but to compute a token it eventually has to read the data, either cached in RAM or from storage. There is no way a fast SSD can compete with RAM in I/O speed; to get any speed benefit, the whole file has to be cached in RAM. That has other benefits too, e.g. threads can share the memory, and the file does not have to be re-read the next time it is used because it is already in the cache. But in the final analysis you either have the RAM, or you are reading from disk, and reading 20GB for each token means reading 1TB for a paragraph of 50 tokens. My M1, which by no means has a slow SSD, reads the file at 500-600 MB/s, and a Thunderbolt PCIe 4 enclosure reads at 700-800 MB/s; even if you double that, it still takes 10-20 seconds per token. To get under 1 second per token for the 30B model, you would have to read at around 20 GB/s. By the time we can do that, there will be even bigger (v)RAM and even larger models.
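For a rough sense of those numbers: if the full ~20GB of quantised 30B weights really has to be streamed for every token, the time per token is just size divided by sequential read bandwidth. A back-of-the-envelope sketch; the bandwidth figures below are approximate assumptions, not benchmarks:

    /* token_time.c -- seconds per token if the full weights are streamed per token. */
    #include <stdio.h>

    int main(void) {
        const double weights_gb = 20.0;            /* ~quantised 30B model, approx. */
        const struct { const char *name; double gbps; } devices[] = {
            {"SATA / older SSD",      0.55},       /* rough figures, not benchmarks */
            {"Thunderbolt PCIe 4",    0.75},
            {"fast PCIe 4 NVMe",      7.0},
            {"PCIe 5 NVMe",          11.0},
            {"DRAM",                 50.0},
        };
        for (size_t i = 0; i < sizeof devices / sizeof devices[0]; i++)
            printf("%-20s %6.1f GB/s -> %6.1f s/token\n",
                   devices[i].name, devices[i].gbps, weights_gb / devices[i].gbps);
        return 0;
    }

At 500-600 MB/s that works out to well over 30 seconds per token, which matches the observed behaviour above.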
replies(1): >>35402456 #
13. muyuu ◴[] No.35400360{4}[source]
ggml-model-f16.bin and ggml-model-q4_0.bin

those are the outputs of convert-pth-to-ggml.py and quantize, respectively

I had to cancel the 30B run because I needed the computer after some 12 hours, and now I have to fix the ext4 filesystem of the drive I was doing it on. Fun times for the weekend.

Guess I'll settle for 13B. I was using 7B, but the results are pretty lousy compared to GPT4All's LoRA, let alone GPT-3.5-turbo or better.

I'll give quantising 13B a shot; I'm on 16GB of RAM locally.

replies(1): >>35400877 #
14. dekhn ◴[] No.35400877{5}[source]
Yeah, the first time I ran the 30B model, it crashed my machine and I had to reinstall from scratch (linux).
15. kccqzy ◴[] No.35402314[source]
In my case they are on a SATA SSD.
16. alchemist1e9 ◴[] No.35402456{6}[source]
PCIe 5 NVMe drives can do 11+ GB/s, so at least 2x your numbers. We seem to be talking past each other, because the point of the change is to run inference on a CPU for an LLM whose weights are larger than the host RAM can fit.

It looks to me like, if I were planning to build a new machine capable of LLM inference, it would be possible with commodity gamer components, and if lazy weight loading is viable, then such a machine with multiple PCIe 5 NVMe drives in RAID 0 could potentially come close to memory bandwidth.
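Ballpark numbers for that idea, with all figures rough assumptions rather than benchmarks: ideal RAID 0 striping of N PCIe 5 drives at ~11 GB/s each scales sequential reads roughly linearly, which can be compared with dual-channel DDR5 at very roughly 80 GB/s:

    /* raid0_bandwidth.c -- ballpark aggregate read bandwidth of striped NVMe
     * drives versus typical DRAM bandwidth. All numbers are rough assumptions. */
    #include <stdio.h>

    int main(void) {
        const double per_drive_gbps = 11.0;   /* ~PCIe 5.0 x4 NVMe sequential read */
        const double dram_gbps      = 80.0;   /* ~dual-channel DDR5, ballpark      */
        const double weights_gb     = 20.0;   /* ~quantised 30B weights            */

        for (int drives = 1; drives <= 4; drives++) {
            double agg = drives * per_drive_gbps;   /* ideal RAID 0 scaling */
            printf("%d drive(s): %5.1f GB/s (%3.0f%% of DRAM) -> %4.1f s per full sweep\n",
                   drives, agg, 100.0 * agg / dram_gbps, weights_gb / agg);
        }
        return 0;
    }

In practice striping won't scale perfectly for this access pattern, so treat these as upper bounds.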

Next on my list to investigate is inference with GPUs: could multiple smaller GPUs somehow be used with a technique similar to the one in the OP?