
1311 points | msoad | 1 comment
jart | No.35393615
Author here. For additional context, please read https://github.com/ggerganov/llama.cpp/discussions/638#discu... The loading time performance has been a huge win for usability, and folks have been having the most wonderful reactions after using this change. But we don't have a compelling enough theory yet to explain the RAM usage miracle. So please don't get too excited just yet! Yes things are getting more awesome, but like all things in science a small amount of healthy skepticism is warranted.
AnthonyMouse | No.35402974
Gosh, thank you for getting to this before I did. The first thing I asked when I saw it reading tens of GB from disk on each run was: is there some reason they're not using mmap?

This isn't just a matter of making the 30B model run in 6GB or whatever. You can now run the largest model, without heavy quantization, and let the OS figure it out. It won't be as fast as having "enough" memory, but it will run.

In theory you could always have done this with swap, but swap is even slower: evicted pages have to be written back to the swap device (wearing out your SSD, or crawling if your swap is on glacially slow spinning rust), whereas clean file-backed pages can simply be discarded, because the OS knows where to read them back from the filesystem.

This should also make it much more efficient to run multiple instances at once because they can share the block cache.

(I wonder if anybody has done this with Stable Diffusion etc.)