
1311 points msoad | 3 comments
jart No.35393615
Author here. For additional context, please read https://github.com/ggerganov/llama.cpp/discussions/638#discu... The loading time performance has been a huge win for usability, and folks have been having the most wonderful reactions after using this change. But we don't have a compelling enough theory yet to explain the RAM usage miracle. So please don't get too excited just yet! Yes things are getting more awesome, but like all things in science a small amount of healthy skepticism is warranted.
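
To make the loading-time win concrete, here is a minimal POSIX sketch of the mmap approach (illustrative only, not llama.cpp's actual loader; the filename is hypothetical). Mapping the file copies no data up front: pages are faulted in lazily on first access, and the kernel can share them across processes via the page cache.

  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <sys/stat.h>
  #include <unistd.h>

  int main(void) {
      /* Hypothetical weights file; the real loader differs. */
      int fd = open("ggml-model-q4_0.bin", O_RDONLY);
      if (fd < 0) { perror("open"); return 1; }

      struct stat st;
      if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

      /* Map the whole file read-only. No data is copied here: pages
         are faulted in lazily on first access, so "loading" is
         near-instant and the kernel can evict pages under pressure. */
      void *weights = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
      if (weights == MAP_FAILED) { perror("mmap"); return 1; }

      /* ... run inference reading tensors out of `weights` ... */

      munmap(weights, st.st_size);
      close(fd);
      return 0;
  }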
diimdeep No.35395091
Is the title misleading here?

30B quantized requires 19.5 GB, not 6 GB; otherwise you get severe swapping to disk:

  model    original size    quantized size (4-bit)
  7B       13 GB            3.9 GB
  13B      24 GB            7.8 GB
  30B      60 GB            19.5 GB
  65B      120 GB           38.5 GB
renewiltord No.35395206
That's the size on disk, my man. When you quantize it to a smaller float size you lose precision on the weights and so the model is smaller. Then here they `mmap` the file and it only needs 6 GiB of RAM!
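
For what it's worth, here is a rough way to see how resident memory can sit far below the mapped size (an illustrative sketch that uses an anonymous mapping in place of a real weights file): only the pages you actually touch get faulted in.

  #define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on glibc */
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>

  int main(void) {
      size_t len = 1UL << 30; /* pretend this is a 1 GiB weights file */
      unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (p == MAP_FAILED) { perror("mmap"); return 1; }

      /* Touch only the first 64 MiB. Only those pages are faulted in,
         so RSS stays around 64 MiB while VSZ shows the full 1 GiB. */
      memset(p, 1, 64UL << 20);

      getchar(); /* pause: compare RSS vs VSZ in `ps` or /proc/self/status */
      munmap(p, len);
      return 0;
  }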
gliptic No.35400189
The sizes quoted are already the quantized sizes (and the quantization is to integers, not floats). mmap obviously doesn't do any quantization.
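
As a back-of-envelope check on the table above (a sketch assuming ggml's Q4_0 layout of the time: blocks of 32 weights, each storing a 4-byte float scale plus 16 bytes of packed 4-bit values, i.e. 20 bytes per 32 weights; the parameter counts are approximate LLaMA sizes):

  #include <stdio.h>

  int main(void) {
      /* Approximate LLaMA parameter counts (assumed, not exact). */
      const double params[] = {6.7e9, 13.0e9, 32.5e9, 65.2e9};
      const char  *names[]  = {"7B", "13B", "30B", "65B"};

      for (int i = 0; i < 4; i++) {
          /* Q4_0: 20 bytes per block of 32 weights
             (4-byte f32 scale + 16 bytes of 4-bit quants). */
          double bytes = params[i] / 32.0 * 20.0;
          printf("%-4s ~%.1f GiB quantized\n",
                 names[i], bytes / (1024.0 * 1024.0 * 1024.0));
      }
      return 0;
  }

This prints roughly 3.9, 7.6, 18.9, and 38.0 GiB, close to the table; the remaining gap is plausibly file metadata and tensors not stored at 4 bits.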