I don't understand. I thought each parameter was 16 bits (two bytes), which would predict a minimum of 60GB of RAM for a 30-billion-parameter model, not 6GB.
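(For reference, the arithmetic is just parameter count times bytes per parameter. A rough sketch, with the usual bytes-per-parameter figures rather than measurements of any particular build:)

    # Back-of-envelope weight-memory estimate.
    def weight_memory_gb(n_params, bytes_per_param):
        return n_params * bytes_per_param / 1e9

    print(weight_memory_gb(30e9, 2.0))   # fp16: ~60 GB
    print(weight_memory_gb(30e9, 0.5))   # 4-bit quantized: ~15 GB, plus some overhead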
I was thinking something similar. Turns out that you don't need all the weights for any given prompt.
> LLaMA 30B appears to be a sparse model. While there's 20GB of weights, depending on your prompt I suppose only a small portion of that needs to be used at evaluation time [...]
Found the answer from the author of this amazing pull request:
https://github.com/ggerganov/llama.cpp/discussions/638#discu...
Does this mean LLaMA only uses 10% of its brain? An urban legend come to life!
No, the OP is mistaken. All of the model weights have to be accessed for the forward pass. What happened is that using mmap changes where the memory consumption is accounted (kernel page cache vs. the process itself), so the numbers were being misinterpreted. There are still 30B parameters, and you still need that many times the size of your floating-point representation to run the model.
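A minimal sketch of the accounting point (Python, not llama.cpp's actual loader; the file name is hypothetical): when you mmap a weights file, pages become resident only as they're touched, and they're file-backed, so they show up as kernel page cache and can be evicted rather than counting as private process memory. That's why a naive "process memory" reading looks far smaller than the full weight size.

    import mmap, os, resource

    path = "ggml-model-q4_0.bin"   # hypothetical weights file
    fd = os.open(path, os.O_RDONLY)
    size = os.fstat(fd).st_size
    buf = mmap.mmap(fd, size, prot=mmap.PROT_READ)

    def peak_rss_kb():
        # Peak resident set size; reported in KB on Linux.
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

    print(f"mapped {size / 1e9:.1f} GB, peak RSS {peak_rss_kb()} KB")

    total = 0
    for off in range(0, size, mmap.PAGESIZE):   # fault in every page
        total += buf[off]

    print(f"touched all pages, peak RSS {peak_rss_kb()} KB")

Right after the mmap call the resident set is tiny even though tens of GB are mapped; after touching every page, the data is resident, but as reclaimable page cache shared with the kernel, not 60GB of anonymous process memory.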