
1311 points msoad | 6 comments
jart ◴[] No.35393615[source]
Author here. For additional context, please read https://github.com/ggerganov/llama.cpp/discussions/638#discu... The loading-time improvement has been a huge win for usability, and folks have been having the most wonderful reactions after using this change. But we don't have a compelling enough theory yet to explain the RAM usage miracle. So please don't get too excited just yet! Yes, things are getting more awesome, but as with all things in science, a small amount of healthy skepticism is warranted.
replies(24): >>35393868 #>>35393942 #>>35394089 #>>35394097 #>>35394107 #>>35394203 #>>35394208 #>>35394244 #>>35394259 #>>35394288 #>>35394408 #>>35394881 #>>35395091 #>>35395249 #>>35395858 #>>35395995 #>>35397318 #>>35397499 #>>35398037 #>>35398083 #>>35398427 #>>35402974 #>>35403334 #>>35468946 #
1. rcarmo ◴[] No.35398083[source]
This is nothing short of legendary. Was following the thread on Twitter and LOLed at the replies of “Praise be Jart”, but there’s something of the sublime here. Great weight wrangling judo :)
replies(1): >>35400066 #
2. shock-value ◴[] No.35400066[source]
It appears that this was just a misreading of how memory usage was being reported and there was actually no improvement here. At least nothing so sensational as being able to run a larger-than-RAM model without swapping from disk on every iteration.
replies(1): >>35400685 #
3. jart ◴[] No.35400685[source]
Please read the original link to the pull request, where I stated my change offered a 2x improvement in memory usage. You really can load models 2x larger without compromising system stability, because pages are no longer being copied. Previously you needed 40GB of RAM to load a 20GB model, in order to keep the kernel's file cache intact so the weights wouldn't have to be reread from disk the next time. Now you only need 20GB to load a 20GB model.
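
(A minimal sketch of the distinction being described, not the actual llama.cpp loader; "model.bin" is a hypothetical stand-in for any weights file. A read()-style load copies every page out of the kernel's file cache into private memory, so the same bytes exist twice, while a read-only MAP_SHARED mapping lets the process use the cached pages directly.)

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("model.bin", O_RDONLY);          /* hypothetical weights file */
    if (fd == -1) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) == -1) { perror("fstat"); return 1; }

    /* Copying approach: read() into malloc'd memory. The kernel keeps the
       file's pages in its cache AND the process holds a private copy, so a
       20GB model can tie up ~40GB until the cache is evicted.

       void *buf = malloc(st.st_size);
       read(fd, buf, st.st_size);                                           */

    /* mmap approach: the mapping aliases the page cache. Pages are faulted
       in lazily and never duplicated, so a 20GB model needs ~20GB.         */
    void *weights = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (weights == MAP_FAILED) { perror("mmap"); return 1; }

    printf("mapped %lld bytes at %p\n", (long long)st.st_size, weights);
    munmap(weights, st.st_size);
    close(fd);
    return 0;
}
```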

The peculiarity here is that tools like htop were reporting an 8x improvement, which is interesting, because RAM use is only 2x better due to my change. The rusage.com page fault reporting was also interesting. This is not due to sparseness. It's because htop was subtracting MAP_SHARED memory. The htop docs on my computer say that purple is used to display shared memory and yellow is used to display kernel file caches, but it turns out htop just uses yellow for both, even though it shouldn't, because mincore() reported that the shared memory had been loaded into the resident set size.
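
(For illustration only: a small sketch of how mincore() reports residency for a file-backed mapping. The file path is hypothetical. Resident MAP_SHARED pages count toward the process's RSS even though they live in the kernel's page cache, which is why a tool that tries to color "shared" and "cached" memory differently can mislead.)

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    const char *path = argc > 1 ? argv[1] : "model.bin";   /* hypothetical file */
    int fd = open(path, O_RDONLY);
    if (fd == -1) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) == -1) { perror("fstat"); return 1; }

    void *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* One byte per page; bit 0 tells us whether that page is resident. */
    long pagesz = sysconf(_SC_PAGESIZE);
    size_t npages = (st.st_size + pagesz - 1) / pagesz;
    unsigned char *vec = malloc(npages);
    if (mincore(p, st.st_size, vec) == -1) { perror("mincore"); return 1; }

    size_t resident = 0;
    for (size_t i = 0; i < npages; i++) resident += vec[i] & 1;
    printf("%zu of %zu pages resident\n", resident, npages);

    free(vec);
    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```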

replies(1): >>35402607 #
4. shock-value ◴[] No.35402607{3}[source]
It's obviously a productive change and kudos for taking it on, but much of the enthusiasm being generated here was driven by the entirely unanticipated prospect of running a model at full speed using less memory than the model's own footprint, and by the notion that inference with a dense model somehow behaved in a sparse manner at runtime. Best to be a bit more grounded here, particularly with regard to claims that defy common understanding.
replies(2): >>35403354 #>>35412531 #
5. jart ◴[] No.35403354{4}[source]
I wanted it to be sparse. Doesn't matter if it wasn't. We're already talking about how to modify the training and evaluation to make it sparser. That's the next logical breakthrough in getting inference for larger models running on tinier machines. If you think I haven't done enough to encourage skepticism, then I'd remind you that we all share the same dream of being able to run these large language models on our own. I can't control how people feel. Especially not when the numbers reported by our tools are telling us what we want to be true.
6. ◴[] No.35412531{4}[source]