1311 points by msoad | 3 comments
jart ◴[] No.35393615[source]
Author here. For additional context, please read https://github.com/ggerganov/llama.cpp/discussions/638#discu... The loading-time performance has been a huge win for usability, and folks have been having the most wonderful reactions after using this change. But we don't have a compelling enough theory yet to explain the RAM usage miracle. So please don't get too excited just yet! Yes, things are getting more awesome, but like all things in science, a small amount of healthy skepticism is warranted.
conradev ◴[] No.35394089[source]
> But we don't have a compelling enough theory yet to explain the RAM usage miracle.

My guess would be that the model is faulted into memory lazily, page by page (4K or 16K chunks), as it is used, so only the parts that are actually needed get loaded.

The kernel also evicts old pages from the page cache to make room for new ones, especially if the machine is using a lot of its RAM. As with all performance things, this trades off inference speed for memory usage, but it is likely faster overall because you don't have to read the entire file from disk up front. Each input takes a different path through the model and will fault in more of it as needed.
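
For anyone unfamiliar with the mechanism, here's a minimal C sketch of that lazy-faulting idea. It is not llama.cpp's actual loader, and "model.bin" is a made-up file name; the point is just that mmap() returns almost immediately and the kernel reads a page from disk only the first time something in it is touched.

    /* Hedged sketch (not llama.cpp's actual loader): lazy model loading via
       mmap(). The mapping is created instantly; 4K (or 16K) pages are read
       from disk only when the corresponding weights are first touched.
       "model.bin" is a hypothetical weights file. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("model.bin", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Read-only shared mapping: the file's pages live in the page cache,
           so separate processes running the same model share one copy. */
        float *weights = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (weights == MAP_FAILED) { perror("mmap"); return 1; }

        /* Touching a weight faults in just that page, not the whole file. */
        printf("first weight: %f\n", weights[0]);

        munmap(weights, st.st_size);
        close(fd);
        return 0;
    }

Because the mapping is read-only and shared, the same physical pages also survive process restarts in the page cache, which is where the fast reload times come from.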

The cool part is that this memory architecture should work just fine with hardware acceleration, too, as long as the computer has unified memory (anything with an integrated GPU). This approach likely won't be possible with dedicated GPUs/VRAM.

This approach _does_ still work for running a dense model with limited memory, but the time/memory savings would just be smaller. The GPU doesn't multiply every matrix in the file literally simultaneously, so the page cache doesn't need to hold the entire model at once.

jart ◴[] No.35394335[source]
I don't think it's actually trading away inference speed. You can pass an --mlock flag, which calls mlock() on the entire 20GB model (you need root to do it), and htop still reports only about 4GB of RAM in use. My change helps inference go faster, too: for instance, I've been getting inference speeds of 30ms per token on the 7B model since my recent change, and I normally get 200ms per eval on the 30B model.
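
For reference, a rough C sketch of the mlock() path being described (not llama.cpp's actual code; "model.bin" is a placeholder). mlock() is documented to fault in and pin every page of the mapping, which is both why it normally needs root or a raised RLIMIT_MEMLOCK and part of why the low resident number reported by htop is so surprising.

    /* Hedged sketch of what the --mlock flag does conceptually: pin an
       mmap'd model in RAM so the kernel can't evict its pages. Needs root
       or a large RLIMIT_MEMLOCK; "model.bin" is hypothetical. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("model.bin", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        void *weights = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (weights == MAP_FAILED) { perror("mmap"); return 1; }

        /* mlock() faults in and pins every page of the mapping. */
        if (mlock(weights, st.st_size) != 0)
            perror("mlock (root or a higher RLIMIT_MEMLOCK needed?)");

        /* ... run inference against the pinned weights here ... */

        munlock(weights, st.st_size);
        munmap(weights, st.st_size);
        close(fd);
        return 0;
    }
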
1. conradev ◴[] No.35394430[source]
Very cool! Are you testing after a reboot / with an empty page cache?
2. jart ◴[] No.35394709[source]
Pretty much. I do my work on a headless workstation that I SSH into, so it's not competing with Chrome tabs or anything like that. But I do it mostly because that's what I've always done. The point of my change is that you won't have to be like me anymore. Many of the devs who contacted me after using my change have been saying things like, "yes! I can actually run LLaMA without having to close all my apps!" and they're so happy.
3. dekhn ◴[] No.35400835[source]
Linux has a command to drop caches at runtime (https://www.tecmint.com/clear-ram-memory-cache-buffer-and-sw...), which is VERY useful during debugging.
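
For anyone reproducing cold-cache timings, the standard mechanism on Linux is writing to /proc/sys/vm/drop_caches as root. A small C sketch of the equivalent of `sync; echo 3 > /proc/sys/vm/drop_caches` (safe to do, but subsequent disk reads are cold until the caches warm back up):

    /* Hedged sketch: drop Linux's page cache, dentries and inodes before a
       cold-cache benchmark -- the C equivalent of
       `sync; echo 3 > /proc/sys/vm/drop_caches`. Must run as root. */
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        sync();  /* flush dirty pages first so nothing is lost */

        FILE *f = fopen("/proc/sys/vm/drop_caches", "w");
        if (!f) { perror("fopen (are you root?)"); return 1; }

        fputs("3\n", f);  /* 1 = page cache, 2 = dentries/inodes, 3 = both */
        fclose(f);
        return 0;
    }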