jart No.35393615
Author here. For additional context, please read https://github.com/ggerganov/llama.cpp/discussions/638#discu... The loading time performance has been a huge win for usability, and folks have been having the most wonderful reactions after using this change. But we don't have a compelling enough theory yet to explain the RAM usage miracle. So please don't get too excited just yet! Yes things are getting more awesome, but like all things in science a small amount of healthy skepticism is warranted.
conradev No.35394089
> But we don't have a compelling enough theory yet to explain the RAM usage miracle.

My guess would be that the model is faulted into memory lazily page by page (4K or 16K chunks) as the model is used, so only the actual parts that are needed are loaded.

The kernel also removes old pages from the page cache to make room for new ones, especially if the computer is using a lot of its RAM. As with most performance work, this approach trades inference speed for memory usage, but it is likely faster overall because you don't have to read the entire model from disk at the start. Each input takes a different path through the model and will page in more of it as needed.
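A minimal sketch of the mmap pattern being described (not llama.cpp's actual code; the file name is made up): mapping the weights file read-only sets up the mapping without reading anything, and the kernel faults pages in only when they're touched.

    /* Sketch: lazily faulting a weights file in via mmap. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        const char *path = "model.bin";   /* hypothetical file name */
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* No data is read here; the mapping is only set up. */
        void *weights = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (weights == MAP_FAILED) { perror("mmap"); return 1; }

        /* Touching a byte triggers a page fault; the kernel pulls in
         * just that page (plus any readahead) from disk. */
        volatile unsigned char first = ((unsigned char *)weights)[0];
        (void)first;

        munmap(weights, st.st_size);
        close(fd);
        return 0;
    }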

The cool part is that this memory architecture should work just fine with hardware acceleration, too, as long as the computer has unified memory (anything with an integrated GPU). This approach likely won't be possible with dedicated GPUs/VRAM.

This approach _does_ still work for running a dense model with limited memory, the time/memory savings are just smaller: the GPU doesn't multiply every matrix in the file literally simultaneously, so the page cache doesn't need to hold the entire model at once.

jart No.35394335
I don't think it's actually trading away inference speed. You can pass an --mlock flag, which calls mlock() on the entire 20GB model (you need root to do it), and htop still reports only about 4GB of RAM in use. My change also makes inference go faster: for instance, I've been getting 30ms per token on the 7B model since the change, whereas I normally get 200ms per eval on the 30B model.
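Roughly what such a flag amounts to under the hood, as a hedged sketch (the helper name is made up; error handling trimmed): mlock() pins the mapped weights so the kernel can't evict them, which needs CAP_IPC_LOCK or a high RLIMIT_MEMLOCK, hence the root requirement.

    /* Sketch: pinning an mmap'd region so its pages stay resident. */
    #include <stddef.h>
    #include <sys/mman.h>

    int pin_weights(void *weights, size_t len) {
        /* mlock faults in every page of [weights, weights+len) and keeps
         * it resident; it fails with ENOMEM/EPERM if the memlock limit
         * is too low for a mapping this large. */
        return mlock(weights, len);
    }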
microtherion No.35396256
> htop still reports only like 4GB of RAM is in use

I think that's just an accounting thing. Many UNIX variants will not "charge" read-only memory-mapped pages to a process, because they could be shared among many processes and evicted at will.
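One way to check residency independently of how a given tool attributes pages to the process, sketched under the assumption of a Linux-style mincore() (the helper name is made up): ask the kernel directly which pages of the mapping are in core.

    /* Sketch: counting how many pages of a file-backed mapping are
     * actually resident, regardless of what htop reports as RSS. */
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    size_t resident_pages(void *addr, size_t len) {
        long page = sysconf(_SC_PAGESIZE);
        size_t npages = (len + page - 1) / page;
        unsigned char *vec = malloc(npages);
        size_t resident = 0;
        if (vec && mincore(addr, len, vec) == 0) {
            for (size_t i = 0; i < npages; i++)
                resident += vec[i] & 1;   /* low bit set = page in core */
        }
        free(vec);
        return resident;
    }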