1311 points by msoad
kccqzy ◴[] No.35395739[source]
I might be missing something, but I couldn't reproduce this. I deliberately chose a computer with 16GiB of RAM to run the 30B model. Performance was extremely slow, and the process was clearly not CPU-limited, unlike when running the 13B model; it was swapping heavily.
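For what it's worth, one way to check whether a run like this is swap-bound rather than CPU-bound (a sketch, assuming a Linux machine; the platform isn't stated here):

    # Sample memory and I/O counters once per second while the model generates.
    # High si/so (swap-in/out) plus high wa (% CPU time waiting on I/O) alongside
    # low us (user CPU) mean the run is limited by swapping, not by compute.
    vmstat 1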
replies(5): >>35396367 #>>35396552 #>>35396848 #>>35398023 #>>35398479 #
freehorse ◴[] No.35396552[source]
Same; performance of the quantised 30B model on my M1 16GB Air is absolutely terrible. A couple of things I noticed in Activity Monitor:

1. "memory used" + "cached files" == 16GB (while swap is zero)

2. disk reading is 500-600MB/s

3. every token is computed *after every ~20GB read from disk*, which suggests that for each token it re-reads the whole weights file (instead of caching it). I actually suspect swapping may have been more efficient.

The last part (3), that it re-reads the whole file for each token, is an assumption; it could just be a coincidence that a new token is computed at every ~20GB read from disk. But it makes sense, as I don't think swapping would have been that inefficient.
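If the re-read hypothesis is right, the arithmetic lines up with the observed speed. A quick back-of-envelope check (a sketch; the ~20GB and ~550MB/s figures are just the observations above):

    # seconds per token if the full ~20GB weights file must be re-read
    # from disk at the observed ~550MB/s sequential read speed
    echo "scale=1; 20000 / 550" | bc    # => 36.3 seconds per token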

replies(1): >>35399836 #
muyuu ◴[] No.35399836[source]
Can you share the intermediate files? They're taking ages to process on my 16GB-RAM laptop
replies(1): >>35400111 #
freehorse ◴[] No.35400111{3}[source]
Which files are you referring to exactly?
replies(1): >>35400360 #
muyuu ◴[] No.35400360{4}[source]
ggml-model-f16.bin and ggml-model-q4_0.bin

those are the outputs of convert-pth-to-ggml.py and quantize, respectively
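(For anyone following along, the whole pipeline at the time looked roughly like this — a sketch based on the llama.cpp README of that era; the paths and the 13B size are illustrative:)

    # convert the PyTorch checkpoint to ggml f16, then quantise it to 4-bit q4_0
    python3 convert-pth-to-ggml.py models/13B/ 1
    ./quantize models/13B/ggml-model-f16.bin models/13B/ggml-model-q4_0.bin 2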

I had to cancel the 30B run because I needed to use the computer after some 12 hours, and now I have to fix the ext4 filesystem of the drive where I was doing it. Fun times for the weekend.

Guess I'll settle for 13B. I was using 7B, but the results are pretty lousy compared to GPT4All's LoRA, let alone GPT-3.5-turbo or better.

I'll give quantising 13B a shot; I'm on 16GB of RAM locally.
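The q4_0 13B weights come out to roughly 8GB on disk, so they should fit in 16GB of RAM without the thrashing described above. A minimal invocation sketch (prompt and token count are illustrative):

    # run the quantised 13B model; -p is the prompt, -n caps generated tokens
    ./main -m models/13B/ggml-model-q4_0.bin -p "Building a website can be done in" -n 128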

replies(1): >>35400877 #
dekhn ◴[] No.35400877{5}[source]
Yeah, the first time I ran the 30B model, it crashed my machine and I had to reinstall from scratch (Linux).