The last part (3), that it rereads the whole file again, is an assumption; it could just be a coincidence that a new token is computed after every ~20 GB read from disk. But that is roughly the size of the model file, and each token requires a full pass over the weights, so it makes sense, and I do not think swapping would have been that inefficient.
Someone in the GitHub comments had the same experience when using a 10GB VM to limit memory usage.
It appears the claims of memory reduction were premature; perhaps they were an artifact of how memory usage is reported by some tools.
> mmap-ed memory pages backed by a file that aren't dirty aren't counted in a process's RSS usage, only in the kernel page cache. The mmap-ed regions of virtual memory do get counted in VSZ (virtual memory), but that is just virtual and can be larger than RAM+swap.
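To see how this accounting looks on a particular system, here is a minimal Linux-only Python sketch (the model filename is hypothetical) that mmaps a file and prints the process's VmRSS before and after faulting its pages in:

```python
import mmap
import os
import re

def vm_rss_kib():
    """Read VmRSS (resident set size) from /proc/self/status (Linux only)."""
    with open("/proc/self/status") as f:
        return int(re.search(r"VmRSS:\s+(\d+)\s+kB", f.read()).group(1))

path = "ggml-model-q4_0.bin"  # hypothetical model file
size = os.path.getsize(path)

with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ) as m:
    # The mapping counts toward VSZ immediately, but no pages are resident yet.
    print("after mmap:       VmRSS =", vm_rss_kib(), "KiB")

    # Touch one byte per page so the clean file-backed pages get faulted in,
    # then compare VmRSS to see how your kernel and tools account for them.
    checksum = sum(m[i] for i in range(0, size, mmap.PAGESIZE))
    print("after page touch: VmRSS =", vm_rss_kib(), "KiB")
```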
Those are the outputs of convert-pth-to-ggml.py and quantize, respectively.
I had to cancel the 30B run as I needed to use the computer after some 12 hours; now I have to fix the ext4 filesystem of the drive where I was doing it. Fun times for the weekend.
Guess I'll settle for 13B. I was using 7B, but the results are pretty lousy compared to GPT4All's LoRA, let alone GPT-3.5-turbo or better.
I'll give quantising 13B a shot; I'm on 16 GB of RAM locally.
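For a rough sense of why 4-bit quantisation matters at 16 GB, here is a back-of-the-envelope estimate in Python; the bits-per-weight figure is an assumption, and KV cache and runtime overhead are ignored:

```python
# Back-of-the-envelope memory estimate for a 13B-parameter model.
# Assumes ~5 bits/weight for 4-bit quantisation once block scales are
# included; ignores KV cache and other runtime overhead.
params = 13e9

f16_gib = params * 2.0   / 2**30   # ~24 GiB: does not fit in 16 GB of RAM
q4_gib  = params * 0.625 / 2**30   # ~7.6 GiB: leaves headroom on 16 GB

print(f"f16 weights:   ~{f16_gib:.1f} GiB")
print(f"4-bit weights: ~{q4_gib:.1f} GiB")
```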
It looks to me that if I were planning to build a new machine capable of LLM inference, it would be possible with commodity gamer components; and if lazy loading of weights is viable, then such a machine with multiple PCIe 5 NVMe drives in RAID 0 could potentially come close to memory bandwidth.
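A rough comparison using theoretical peak numbers; the drive count, RAID scaling, and DDR5 speed are all assumptions, and real sequential reads will land lower:

```python
# Theoretical peak bandwidth comparison (assumed figures; real-world
# sequential reads and RAID 0 scaling will fall below these numbers).
pcie5_x4 = 32e9 * 4 * (128 / 130) / 8   # PCIe 5.0 x4 link: ~15.8 GB/s per drive
drives   = 4
raid0    = pcie5_x4 * drives            # ~63 GB/s aggregate sequential read

ddr5_dual = 6000e6 * 8 * 2              # dual-channel DDR5-6000: ~96 GB/s

print(f"RAID 0 of {drives} PCIe 5 NVMe drives: ~{raid0 / 1e9:.0f} GB/s")
print(f"Dual-channel DDR5-6000:               ~{ddr5_dual / 1e9:.0f} GB/s")
```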
Next on my list to investigate is inference with GPUs: could multiple smaller GPUs somehow be used with a technique similar to the one in the OP?