I don't understand. I thought each parameter was 16 bits (two bytes), which would imply at least 60GB of RAM for a 30-billion-parameter model, not 6GB.
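(For what it's worth, here's that back-of-the-envelope arithmetic as a quick sketch. The 60GB figure assumes fp16 everywhere; the ~20GB file mentioned below would be consistent with llama.cpp's 4-bit quantized formats, roughly 0.5 bytes per parameter plus some overhead. Nothing here is from the PR itself, just the sizes:)

    #include <stdio.h>

    int main(void) {
        // Naive estimate: every parameter resident in RAM at fp16 (2 bytes).
        long long params = 30000000000LL;  // 30 billion
        double fp16_gib = params * 2 / (1024.0 * 1024.0 * 1024.0);
        printf("fp16: %.1f GiB\n", fp16_gib);  // ~55.9 GiB, i.e. the "60GB" above

        // 4-bit quantization (llama.cpp's q4 formats): ~0.5 bytes per parameter.
        double q4_gib = params * 0.5 / (1024.0 * 1024.0 * 1024.0);
        printf("q4:   %.1f GiB\n", q4_gib);    // ~14 GiB; ~20GB on disk with overhead
        return 0;
    }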
replies(2):
> LLaMA 30B appears to be a sparse model. While there's 20GB of weights, depending on your prompt I suppose only a small portion of that needs to be used at evaluation time [...]
Found the answer from the author of this amazing pull request: https://github.com/ggerganov/llama.cpp/discussions/638#discu...
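For anyone curious about the mechanism: as I understand the linked discussion, the low resident memory comes from demand paging rather than the weights being skipped outright. The PR mmap()s the model file instead of read()ing it, so the kernel faults pages in only as evaluation touches them, and tools like `top` report just the touched pages. A minimal sketch of that behavior, assuming a POSIX system (the file name is hypothetical):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        // Hypothetical weights file, mapped read-only.
        const char *path = "ggml-model-q4_0.bin";
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        // Map the whole file; no pages are read from disk yet.
        // The kernel faults them in lazily, only when they are touched.
        const unsigned char *weights =
            mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (weights == MAP_FAILED) { perror("mmap"); return 1; }

        // Touching one byte per 256MB region pages in only a sliver of the
        // file, so resident memory stays far below the mapped size.
        unsigned long sum = 0;
        for (off_t off = 0; off < st.st_size; off += 256L * 1024 * 1024)
            sum += weights[off];
        printf("mapped %lld bytes, touched a handful of pages (sum=%lu)\n",
               (long long)st.st_size, sum);

        munmap((void *)weights, st.st_size);
        close(fd);
        return 0;
    }

That also fits the quote above: for a given prompt, only a portion of the mapped file ever gets touched, so the resident set can be far smaller than the weights file.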