1311 points msoad | 2 comments
w1nk No.35394065
Does anyone know how/why this change decreases memory consumption (and isn't a bug in the inference code)?

From my understanding of the issue, mmap'ing the file shows that inference only touches a fraction of the weight data.

Doesn't the forward pass require accessing all of the weights, not just a fraction of them?
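(For context, a minimal sketch of the mechanism in question, not llama.cpp's actual loader: mmap gives you demand paging, so the kernel only makes a 4 KiB page resident when the process first reads it. If inference only reads part of the mapped file, resident memory only reflects that part. The file name below is a placeholder.)

    /* Sketch: map a weights file read-only and touch only the
     * first 1% of it. Pages become resident on first access, so
     * resident set size tracks what is read, not the file size.
     * "weights.bin" is a placeholder name. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("weights.bin", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

        /* Reserves address space only; no RAM is committed yet. */
        const uint8_t *w = mmap(NULL, st.st_size, PROT_READ,
                                MAP_PRIVATE, fd, 0);
        if (w == MAP_FAILED) { perror("mmap"); return 1; }

        /* Touch one byte per page in the first 1% of the file:
         * only those pages get faulted in from disk. */
        size_t touched = (size_t)st.st_size / 100;
        uint64_t sum = 0;
        for (size_t i = 0; i < touched; i += 4096)
            sum += w[i];

        printf("touched ~%zu of %lld bytes (checksum %llu)\n",
               touched, (long long)st.st_size,
               (unsigned long long)sum);

        munmap((void *)w, (size_t)st.st_size);
        close(fd);
        return 0;
    }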

replies(4): >>35394751 #>>35396440 #>>35396507 #>>35398499 #
1. matsemann No.35394751
Maybe a lot of the data is embedding values or tokenizer stuff, where a single prompt only uses a fraction of those values, and the rest of the model is quite small.
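(A toy illustration of this hypothesis, with made-up sizes: an embedding lookup only reads the table rows for the tokens that actually appear in the prompt, so a mmap'd table would stay mostly non-resident.)

    /* Toy embedding lookup; VOCAB and DIM are invented numbers.
     * In llama.cpp the table lives in the mmap'd model file; a
     * plain static array stands in for it here. */
    #include <stdio.h>

    #define VOCAB 32000   /* rows in the embedding table */
    #define DIM    4096   /* embedding width */

    static float table[VOCAB][DIM];

    int main(void) {
        int prompt[] = { 1, 15043, 3186 };  /* three token ids */
        int n = sizeof prompt / sizeof prompt[0];

        /* Only n of the VOCAB rows are ever read, so with mmap
         * the untouched rows never become resident. */
        float acc = 0.0f;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < DIM; j++)
                acc += table[prompt[i]][j];

        printf("read %d of %d rows (acc=%f)\n", n, VOCAB, acc);
        return 0;
    }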
replies(1): >>35394872 #
2. w1nk No.35394872
That shouldn't be the case: the "30B" directly denotes the model's parameter count (30 billion weights), not the size of any other components, so the weights themselves dominate the file.
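(A back-of-envelope check in the same spirit, assuming LLaMA-30B-ish shapes, vocab 32000 and width 6656 at 4 bits; these numbers are assumptions, not figures from the thread. The token embeddings come to well under 1% of the weights, so embeddings alone can't explain the savings.)

    /* Back-of-envelope: total weights vs. token embeddings,
     * assuming 30e9 parameters at 4 bits, vocab 32000, width
     * 6656. All shape numbers are assumptions. */
    #include <stdio.h>

    int main(void) {
        double params     = 30e9;  /* "30B" = parameter count */
        double bits       = 4.0;   /* q4 quantization */
        double gib        = 1024.0 * 1024.0 * 1024.0;

        double weight_gib = params * bits / 8.0 / gib;
        double embed_gib  = 32000.0 * 6656.0 * bits / 8.0 / gib;

        printf("all weights: %.1f GiB\n", weight_gib);
        printf("embeddings:  %.2f GiB (%.2f%% of the weights)\n",
               embed_gib, 100.0 * embed_gib / weight_gib);
        return 0;
    }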