
1311 points msoad | 7 comments
kccqzy ◴[] No.35395739[source]
I might be missing something, but I actually couldn't reproduce this. I purposefully chose a computer with 16 GiB of RAM to run the 30B model. Performance was extremely slow, and the process was clearly not CPU-limited, unlike when running the 13B model; it's swapping a lot.
replies(5): >>35396367 #>>35396552 #>>35396848 #>>35398023 #>>35398479 #
1. losteric ◴[] No.35396367[source]
Are the weights on NVME? Old SSD? HDD?
replies(2): >>35397005 #>>35402314 #
2. alchemist1e9 ◴[] No.35397005[source]
It’s interesting how NVMe will become even more critically important if this lazy weight-loading approach works out. PCIe 5 has arrived just in time for LLM inference, it seems.
replies(1): >>35397081 #
3. freehorse ◴[] No.35397081[source]
Well, in this case it does not really have to do with SSDs; quite the opposite, the performance gain here seems to come from caching the file in RAM at the start.
replies(1): >>35399827 #
4. alchemist1e9 ◴[] No.35399827{3}[source]
That’s not my understanding. The entire point is the model can’t fit in RAM. mmap allows lazy loading from storage.
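
Roughly, the trick looks like this, a minimal sketch of mmap-based lazy loading (illustrative only, not llama.cpp's actual code; the file name and sampling loop are just placeholders):

    /* Minimal sketch of lazy weight loading via mmap. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        if (argc < 2) { fprintf(stderr, "usage: %s model.bin\n", argv[0]); return 1; }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

        /* Map the whole file read-only; nothing is read from disk yet. */
        const float *weights = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (weights == MAP_FAILED) { perror("mmap"); return 1; }

        /* Touching a weight triggers a page fault; the kernel reads just that
           page (plus readahead) into the page cache. Untouched tensors never
           occupy RAM, and cold pages can be evicted under memory pressure. */
        double sum = 0.0;
        for (size_t i = 0; i < (size_t)st.st_size / sizeof(float); i += 1 << 20)
            sum += weights[i];
        printf("sampled sum: %f\n", sum);

        munmap((void *)weights, st.st_size);
        close(fd);
        return 0;
    }

So the weights that don't fit in RAM simply stay on disk until the inference loop actually touches them.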
replies(1): >>35400197 #
5. freehorse ◴[] No.35400197{4}[source]
Yes, but to compute a token it eventually has to read the data, either cached in RAM or from storage, and there is no way a fast SSD can compete with RAM on I/O speed. To get any speed benefit, the whole file has to be cached in RAM. That has other benefits, e.g. threads can share the memory, and the file doesn't have to be reread the next time it's loaded because it's already in the cache, but in the final analysis you either have the RAM or you are reading from disk, and reading 20 GB for each token means reading 1 TB for a 50-token paragraph. My M1, which by no means has a slow SSD, reads the file at 500-600 MB/s, and a Thunderbolt PCIe 4 enclosure reads it at 700-800 MB/s; even if you double that, it would still take 10-20 seconds per token. To get under 1 second per token for the 30B model you would have to read at 20 GB/s. By the time we can do that, there will be even bigger (V)RAM and even larger models.
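
The back-of-the-envelope version of that arithmetic, assuming a ~20 GB 4-bit 30B file has to be streamed once per token (the bandwidth figures are the ones quoted in this thread, not fresh benchmarks):

    /* Rough seconds-per-token if the full weight file must be re-read for
       every token (worst case, nothing cached). Illustrative numbers only. */
    #include <stdio.h>

    int main(void) {
        const double weights_gb = 20.0;  /* ~4-bit quantized 30B model file */
        const double tokens     = 50.0;  /* a short paragraph */

        const double bw_gbps[] = {0.6, 1.5, 11.0, 20.0};
        const char  *label[]   = {"M1 internal SSD (~0.6 GB/s measured above)",
                                  "Thunderbolt PCIe 4 enclosure, doubled",
                                  "PCIe 5 NVMe, spec-sheet sequential",
                                  "what you'd need for <1 s/token"};

        printf("Data touched per %d-token paragraph: %.0f GB\n",
               (int)tokens, weights_gb * tokens);
        for (int i = 0; i < 4; i++)
            printf("%-44s -> %5.1f s/token\n", label[i], weights_gb / bw_gbps[i]);
        return 0;
    }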
replies(1): >>35402456 #
6. kccqzy ◴[] No.35402314[source]
In my case they are on a SATA SSD.
7. alchemist1e9 ◴[] No.35402456{5}[source]
PCIe 5 NVMe drives can do 11+ GB/s, so at least 2x your numbers. We seem to be talking past each other, because the point of the change is to run inference on a CPU for an LLM whose weights are larger than the host RAM can hold.

It looks to me like, if I were planning to build a new machine capable of LLM inference, it's going to be possible with commodity gamer components, and if lazy weight loading is viable, then such a machine with multiple PCIe 5 NVMe drives in RAID 0 could potentially come close to memory bandwidth.
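
A rough sketch of that scaling argument, assuming near-ideal RAID 0 striping, spec-sheet drive numbers, and a ballpark DDR5 figure (the ~90 GB/s RAM number is my assumption, not from this thread):

    /* Back-of-the-envelope: aggregate sequential read bandwidth of striped
       PCIe 5 NVMe drives vs. typical system memory bandwidth. */
    #include <stdio.h>

    int main(void) {
        const double drive_gbps = 11.0;  /* one PCIe 5 x4 NVMe, sequential read */
        const double ddr5_gbps  = 90.0;  /* rough dual-channel DDR5-5600 figure */

        for (int n = 1; n <= 4; n++) {
            double raid0 = n * drive_gbps;  /* ideal striping, no overhead assumed */
            printf("%d drive(s): ~%5.1f GB/s (%3.0f%% of ~%.0f GB/s RAM)\n",
                   n, raid0, 100.0 * raid0 / ddr5_gbps, ddr5_gbps);
        }
        return 0;
    }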

Next on my list to investigate is GPU inference: could multiple smaller GPUs somehow be used with a technique similar to the one in the OP post?