
1311 points | msoad | 1 comment
jart | No.35393615
Author here. For additional context, please read https://github.com/ggerganov/llama.cpp/discussions/638#discu... The loading time performance has been a huge win for usability, and folks have been having the most wonderful reactions after using this change. But we don't have a compelling enough theory yet to explain the RAM usage miracle. So please don't get too excited just yet! Yes things are getting more awesome, but like all things in science a small amount of healthy skepticism is warranted.
AnthonyMouse | No.35402974
Gosh, thank you for getting to this before I did. The first thing I asked when I saw it reading tens of GB from disk on each run was: is there some reason they're not using mmap?

This isn't just a matter of making the 30B model run in 6GB or whatever. You can now run the largest model, without heavy quantization, and let the OS figure it out. It won't be as fast as having "enough" memory, but it will run.

In theory you could always have done this with swap, but swap is even slower: evicted pages have to be written back to the swap device (wearing out your SSD, or crawling if your swap is on glacially slow spinning rust), whereas clean file-backed pages can simply be discarded, because the OS knows where to read them back from the filesystem.

This should also make it much more efficient to run multiple instances at once because they can share the block cache.

(I wonder if anybody has done this with Stable Diffusion etc.)