I don't understand. I thought each parameter was 16 bits (two bytes), which would predict a minimum of 60GB of RAM for a 30-billion-parameter model, not 6GB.
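(For reference, the arithmetic is just parameter count times bytes per parameter. A rough sketch, with the usual bytes-per-parameter figures rather than measurements of any particular build:)

    # Back-of-envelope weight-memory estimate.
    def weight_memory_gb(n_params, bytes_per_param):
        return n_params * bytes_per_param / 1e9

    print(weight_memory_gb(30e9, 2.0))   # fp16: ~60 GB
    print(weight_memory_gb(30e9, 0.5))   # 4-bit quantized: ~15 GB, plus some overhead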
I was thinking something similar. Turns out that you don't need all the weights for any given prompt.
> LLaMA 30B appears to be a sparse model. While there's 20GB of weights, depending on your prompt I suppose only a small portion of that needs to be used at evaluation time [...]
Found the answer from the author of this amazing pull request:
https://github.com/ggerganov/llama.cpp/discussions/638#discu...
Does this mean LLaMA only uses 10% of its brain? An urban legend come to life!
No, the OP is mistaken. All of the model weights have to be accessed for the forward pass. What happened is that using mmap changes where the memory consumption is accounted (kernel page cache vs. the process itself), so the numbers were being misinterpreted. There are still 30B parameters, and you still need that many times the size of your floating-point representation to run the model.
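A minimal sketch of the accounting point (Python, not llama.cpp's actual loader; the file name is hypothetical): when you mmap a weights file, pages become resident only as they're touched, and they're file-backed, so they show up as kernel page cache and can be evicted rather than counting as private process memory. That's why a naive "process memory" reading looks far smaller than the full weight size.

    import mmap, os, resource

    path = "ggml-model-q4_0.bin"   # hypothetical weights file
    fd = os.open(path, os.O_RDONLY)
    size = os.fstat(fd).st_size
    buf = mmap.mmap(fd, size, prot=mmap.PROT_READ)

    def peak_rss_kb():
        # Peak resident set size; reported in KB on Linux.
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

    print(f"mapped {size / 1e9:.1f} GB, peak RSS {peak_rss_kb()} KB")

    total = 0
    for off in range(0, size, mmap.PAGESIZE):   # fault in every page
        total += buf[off]

    print(f"touched all pages, peak RSS {peak_rss_kb()} KB")

Right after the mmap call the resident set is tiny even though tens of GB are mapped; after touching every page, the data is resident, but as reclaimable page cache shared with the kernel, not 60GB of anonymous process memory.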