(github.com)

1311 points msoad | 1 comments | 31 Mar 23 20:37 UTC

w1nk ◴[31 Mar 23 21:43 UTC] No.35394065[source]▶

Does anyone know how/why this change decreases memory consumption (and isn't a bug in the inference code)?

From my understanding of the issue, mmap'ing the file is showing that inference is only accessing a fraction of the weight data.

Doesn't the forward pass necessitate accessing all the weights and not a fraction of them?

replies(4): >>35394751 #>>35396440 #>>35396507 #>>35398499 #

l33tman ◴[01 Apr 23 08:42 UTC] No.35398499[source]▶

>>35394065 #

It's not a bug; the htop output is being misread, since the mmap'd file doesn't show up in the resident set size there. The pages are read-only and never dirtied, so it's "on the OS" to account for them, and because the OP had lots of RAM on the computer, the model just resides in the page cache instead.
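
For anyone who wants to poke at the mechanism, here is a minimal sketch in Python (not llama.cpp's actual loader, which is C/C++). It maps a throwaway file read-only and reads one byte from a few pages. The file contents and the helper names `make_weights_file` and `touch_pages` are made up for illustration. The mapping is never written to, so the kernel can keep serving those pages from the page cache rather than from private memory the process allocated.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE  # page size reported by the OS


def make_weights_file(num_pages):
    """Write a throwaway file standing in for a weights file (made-up data)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for i in range(num_pages):
            # Fill each page with a single byte value so pages are easy to tell apart.
            f.write(bytes([i % 256]) * PAGE)
    return path


def touch_pages(path, page_indices):
    """Map the file read-only and read the first byte of each requested page."""
    with open(path, "rb") as f:
        # ACCESS_READ gives a read-only mapping; nothing is copied up front.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return [m[i * PAGE] for i in page_indices]


path = make_weights_file(64)
try:
    values = touch_pages(path, [0, 5, 63])
finally:
    os.remove(path)

print(values)
```

How the touched pages are then reported (resident, shared, cached) depends on the OS and the tool you look at, so treat any single number in htop with some suspicion.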

replies(1): >>35398845 #

1. w1nk ◴[01 Apr 23 09:48 UTC] No.35398845[source]▶

>>35398499 #

Ahh, this would do it, thanks :).

Llama.cpp 30B runs with only 6GB of RAM now