
1311 points msoad | 7 comments
kccqzy ◴[] No.35395739[source]
I might be missing something, but I actually couldn't reproduce this. I purposefully chose a computer with 16 GiB of RAM to run the 30B model. Performance was extremely slow, and the process was clearly not CPU-limited, unlike when running the 13B model; it's swapping a lot.
replies(5): >>35396367 #>>35396552 #>>35396848 #>>35398023 #>>35398479 #
1. losteric ◴[] No.35396367[source]
Are the weights on NVME? Old SSD? HDD?
replies(2): >>35397005 #>>35402314 #
2. alchemist1e9 ◴[] No.35397005[source]
It’s interesting how NVMe will become even more critically important if this lazy weight-loading approach works out. PCIe 5 has arrived just in time for LLM inference, it seems.
replies(1): >>35397081 #
3. freehorse ◴[] No.35397081[source]
Well, in this case it does not really have to do with SSDs; quite the opposite, the performance gain here seems to come from caching the file in RAM at the start.
replies(1): >>35399827 #
4. alchemist1e9 ◴[] No.35399827{3}[source]
That’s not my understanding. The entire point is the model can’t fit in RAM. mmap allows lazy loading from storage.
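
Roughly, the trick looks like this, a minimal sketch of mmap-based lazy loading (illustrative only, not llama.cpp's actual code; the file name and sampling loop are just placeholders):

    /* Minimal sketch of lazy weight loading via mmap. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
        if (argc < 2) { fprintf(stderr, "usage: %s model.bin\n", argv[0]); return 1; }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

        /* Map the whole file read-only; nothing is read from disk yet. */
        const float *weights = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (weights == MAP_FAILED) { perror("mmap"); return 1; }

        /* Touching a weight triggers a page fault; the kernel reads just that
           page (plus readahead) into the page cache. Untouched tensors never
           occupy RAM, and cold pages can be evicted under memory pressure. */
        double sum = 0.0;
        for (size_t i = 0; i < (size_t)st.st_size / sizeof(float); i += 1 << 20)
            sum += weights[i];
        printf("sampled sum: %f\n", sum);

        munmap((void *)weights, st.st_size);
        close(fd);
        return 0;
    }

So the weights that don't fit in RAM simply stay on disk until the inference loop actually touches them.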
replies(1): >>35400197 #
5. freehorse ◴[] No.35400197{4}[source]
Yes, but to compute a token it eventually has to read the data, either cached in RAM or from storage, and there is no way a fast SSD can compete with RAM on I/O speed. To get any speed benefit, the whole file has to be cached in RAM. That has other benefits, e.g. threads can share the memory, and the file doesn't have to be reread the next time it's loaded because it's already in the cache, but in the final analysis you either have the RAM or you are reading from disk, and reading 20 GB for each token means reading 1 TB for a 50-token paragraph. My M1, which by no means has a slow SSD, reads the file at 500-600 MB/s, and a Thunderbolt PCIe 4 enclosure reads it at 700-800 MB/s; even if you double that, it would still take 10-20 seconds per token. To get under 1 second per token for the 30B model you would have to read at 20 GB/s. By the time we can do that, there will be even bigger (V)RAM and even larger models.
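
The back-of-the-envelope version of that arithmetic, assuming a ~20 GB 4-bit 30B file has to be streamed once per token (the bandwidth figures are the ones quoted in this thread, not fresh benchmarks):

    /* Rough seconds-per-token if the full weight file must be re-read for
       every token (worst case, nothing cached). Illustrative numbers only. */
    #include <stdio.h>

    int main(void) {
        const double weights_gb = 20.0;  /* ~4-bit quantized 30B model file */
        const double tokens     = 50.0;  /* a short paragraph */

        const double bw_gbps[] = {0.6, 1.5, 11.0, 20.0};
        const char  *label[]   = {"M1 internal SSD (~0.6 GB/s measured above)",
                                  "Thunderbolt PCIe 4 enclosure, doubled",
                                  "PCIe 5 NVMe, spec-sheet sequential",
                                  "what you'd need for <1 s/token"};

        printf("Data touched per %d-token paragraph: %.0f GB\n",
               (int)tokens, weights_gb * tokens);
        for (int i = 0; i < 4; i++)
            printf("%-44s -> %5.1f s/token\n", label[i], weights_gb / bw_gbps[i]);
        return 0;
    }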
replies(1): >>35402456 #
6. kccqzy ◴[] No.35402314[source]
In my case they are on a SATA SSD.
7. alchemist1e9 ◴[] No.35402456{5}[source]
PCIe 5 NVMe drives can do 11+ GB/s, so at least 2x your numbers. We seem to be talking past each other, because the point of the change is to run inference on a CPU for an LLM whose weights are larger than the host RAM can hold.

It looks to me like, if I were planning to build a new machine capable of LLM inference, it's going to be possible with commodity gamer components, and if lazy weight loading is viable, then such a machine with multiple PCIe 5 NVMe drives in RAID 0 could potentially come close to memory bandwidth.
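
A rough sketch of that scaling argument, assuming near-ideal RAID 0 striping, spec-sheet drive numbers, and a ballpark DDR5 figure (the ~90 GB/s RAM number is my assumption, not from this thread):

    /* Back-of-the-envelope: aggregate sequential read bandwidth of striped
       PCIe 5 NVMe drives vs. typical system memory bandwidth. */
    #include <stdio.h>

    int main(void) {
        const double drive_gbps = 11.0;  /* one PCIe 5 x4 NVMe, sequential read */
        const double ddr5_gbps  = 90.0;  /* rough dual-channel DDR5-5600 figure */

        for (int n = 1; n <= 4; n++) {
            double raid0 = n * drive_gbps;  /* ideal striping, no overhead assumed */
            printf("%d drive(s): ~%5.1f GB/s (%3.0f%% of ~%.0f GB/s RAM)\n",
                   n, raid0, 100.0 * raid0 / ddr5_gbps, ddr5_gbps);
        }
        return 0;
    }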

Next on my list to investigate is GPU inference: could multiple smaller GPUs somehow be used with a technique similar to the one in the OP post?