1311 points msoad | 2 comments
w1nk No.35394065
Does anyone know how/why this change decreases memory consumption (and isn't a bug in the inference code)?

From my understanding of the issue, mmap'ing the file shows that inference only touches a fraction of the weight data.

Doesn't the forward pass require accessing all of the weights, not just a fraction of them?
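(For context, a minimal sketch of the mechanism in question, not llama.cpp's actual loader: mmap gives you demand paging, so the kernel only makes a 4 KiB page resident when the process first reads it. If inference only reads part of the mapped file, resident memory only reflects that part. The file name below is a placeholder.)

    /* Sketch: map a weights file read-only and touch only the
     * first 1% of it. Pages become resident on first access, so
     * resident set size tracks what is read, not the file size.
     * "weights.bin" is a placeholder name. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("weights.bin", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

        /* Reserves address space only; no RAM is committed yet. */
        const uint8_t *w = mmap(NULL, st.st_size, PROT_READ,
                                MAP_PRIVATE, fd, 0);
        if (w == MAP_FAILED) { perror("mmap"); return 1; }

        /* Touch one byte per page in the first 1% of the file:
         * only those pages get faulted in from disk. */
        size_t touched = (size_t)st.st_size / 100;
        uint64_t sum = 0;
        for (size_t i = 0; i < touched; i += 4096)
            sum += w[i];

        printf("touched ~%zu of %lld bytes (checksum %llu)\n",
               touched, (long long)st.st_size,
               (unsigned long long)sum);

        munmap((void *)w, (size_t)st.st_size);
        close(fd);
        return 0;
    }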

replies(4): >>35394751 #>>35396440 #>>35396507 #>>35398499 #
1. matsemann No.35394751
Maybe a lot of the data is embedding values or tokenizer stuff, where a single prompt only uses a fraction of those values, and the rest of the model is quite small.
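(A toy illustration of this hypothesis, with made-up sizes: an embedding lookup only reads the table rows for the tokens that actually appear in the prompt, so a mmap'd table would stay mostly non-resident.)

    /* Toy embedding lookup; VOCAB and DIM are invented numbers.
     * In llama.cpp the table lives in the mmap'd model file; a
     * plain static array stands in for it here. */
    #include <stdio.h>

    #define VOCAB 32000   /* rows in the embedding table */
    #define DIM    4096   /* embedding width */

    static float table[VOCAB][DIM];

    int main(void) {
        int prompt[] = { 1, 15043, 3186 };  /* three token ids */
        int n = sizeof prompt / sizeof prompt[0];

        /* Only n of the VOCAB rows are ever read, so with mmap
         * the untouched rows never become resident. */
        float acc = 0.0f;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < DIM; j++)
                acc += table[prompt[i]][j];

        printf("read %d of %d rows (acc=%f)\n", n, VOCAB, acc);
        return 0;
    }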
replies(1): >>35394872 #
2. w1nk No.35394872
That shouldn't be the case: the "30B" directly denotes the model's parameter count (30 billion weights), not the size of any other components, so the weights themselves dominate the file.
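(A back-of-envelope check in the same spirit, assuming LLaMA-30B-ish shapes, vocab 32000 and width 6656 at 4 bits; these numbers are assumptions, not figures from the thread. The token embeddings come to well under 1% of the weights, so embeddings alone can't explain the savings.)

    /* Back-of-envelope: total weights vs. token embeddings,
     * assuming 30e9 parameters at 4 bits, vocab 32000, width
     * 6656. All shape numbers are assumptions. */
    #include <stdio.h>

    int main(void) {
        double params     = 30e9;  /* "30B" = parameter count */
        double bits       = 4.0;   /* q4 quantization */
        double gib        = 1024.0 * 1024.0 * 1024.0;

        double weight_gib = params * bits / 8.0 / gib;
        double embed_gib  = 32000.0 * 6656.0 * bits / 8.0 / gib;

        printf("all weights: %.1f GiB\n", weight_gib);
        printf("embeddings:  %.2f GiB (%.2f%% of the weights)\n",
               embed_gib, 100.0 * embed_gib / weight_gib);
        return 0;
    }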