Does anyone know how or why this change decreases memory consumption, and whether that is expected rather than a bug in the inference code?
From my understanding of the issue, mmap'ing the file reveals that inference only accesses a fraction of the weight data.
Doesn't the forward pass need to access all of the weights, not just a fraction of them?
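
To make the question concrete, here is a minimal sketch of the behavior I mean. It assumes the usual mmap semantics, where pages are only read from the file when they are first accessed. The helper names (`make_fake_weights`, `touched_fraction`) and the zero-filled "weights" file are hypothetical stand-ins, not anything from the actual inference code.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE  # size of one memory page on this system


def make_fake_weights(n_pages):
    """Write a zero-filled stand-in for a weights file (hypothetical)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"\0" * (n_pages * PAGE))
    return path


def touched_fraction(path, offsets):
    """Map `path` read-only, read one byte at each offset, and return
    the fraction of the file's pages that were accessed."""
    size = os.path.getsize(path)
    total_pages = (size + PAGE - 1) // PAGE
    touched = set()
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        for off in offsets:
            _ = m[off]                # only this page needs to be faulted in
            touched.add(off // PAGE)  # record which page the read landed on
    return len(touched) / total_pages


path = make_fake_weights(64)
# Three reads, but two of them land on the same page.
frac = touched_fraction(path, [0, PAGE * 10, PAGE * 10 + 5])
os.remove(path)
print(f"fraction of pages touched: {frac:.4f}")
```

If the forward pass really reads every weight, I would expect the equivalent of `frac` for the real model file to approach 1.0 after a full inference run, which is why the reported numbers surprise me.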