
1311 points msoad | 2 comments
jart No.35393615
Author here. For additional context, please read https://github.com/ggerganov/llama.cpp/discussions/638#discu... The loading time performance has been a huge win for usability, and folks have been having the most wonderful reactions after using this change. But we don't have a compelling enough theory yet to explain the RAM usage miracle. So please don't get too excited just yet! Yes things are getting more awesome, but like all things in science a small amount of healthy skepticism is warranted.
diimdeep No.35395091
Is the title misleading here?

30B quantized requires 19.5 GB, not 6 GB; otherwise there is severe swapping to disk:

  model    original size    quantized size (4-bit)
  7B       13 GB            3.9 GB
  13B      24 GB            7.8 GB
  30B      60 GB            19.5 GB
  65B      120 GB           38.5 GB
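The 4-bit column in that table is consistent with simple arithmetic: roughly param_count * 4 bits, plus some overhead. A minimal sketch; the parameter counts and the overhead factor (per-block scale factors, unquantized tensors) are assumptions for illustration, not figures from this thread:

```python
# Nominal LLaMA parameter counts (approximate, assumed for illustration).
PARAMS = {"7B": 6.7e9, "13B": 13.0e9, "30B": 32.5e9, "65B": 65.2e9}

def quantized_size_gb(n_params, bits=4, overhead=1.15):
    """Estimated on-disk size in GB for an n-bit quantized model.
    The 1.15 overhead factor is a guess covering quantization scales
    and layers that stay in higher precision."""
    return n_params * bits / 8 * overhead / 1e9

for name, n in PARAMS.items():
    print(f"{name}: ~{quantized_size_gb(n):.1f} GB at 4-bit")
```

The estimates land within about a gigabyte of each table entry, which is as close as a flat overhead factor can be expected to get.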
xiphias2 No.35395944
Now it's clear that there was a bug in the measurement: the author used a machine with lots of RAM, so I guess most of us are still stuck with quantized 13B. Still, the improvement hopefully translates, and I hope that 30B will run with 3-bit quantization in a few days.
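A quick back-of-envelope check of that 3-bit hope, using the nominal LLaMA-30B parameter count (an assumption) and ignoring quantization overhead, so these are lower bounds:

```python
# Minimum weight storage at a given bit width: params * bits / 8 bytes.
params_30b = 32.5e9  # nominal LLaMA-30B parameter count (assumed)
for bits in (4, 3):
    gb = params_30b * bits / 8 / 1e9
    print(f"30B at {bits}-bit: ~{gb:.1f} GB minimum")
```

Even before overhead, 3-bit brings 30B from roughly 16 GB down to roughly 12 GB, which is what makes it interesting for 16 GB machines.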
diimdeep No.35399319
Also, current SSDs achieve 7.5 GB/s+ read speeds, as opposed to older SSDs from 2013 at around 500 MB/s, so performance will differ drastically depending on your system specs when pulling weights from disk to RAM on demand. There is also $ vmmap <pid>, which shows various statistics about process memory and swap usage that are not available in top or htop.
freehorse No.35400659
Even at 7.5 GB/s you would at best achieve 2.7 seconds per token, in the hyperoptimistic scenario that you can actually read the file at that speed, which is too slow to do much. Maybe if one could get the kernel to swap more aggressively or something it could cut that time in half, but it would still be quite slow.
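That per-token figure is just model size divided by read bandwidth: a sketch of the lower bound, assuming every token must stream essentially all of the weights from disk once (sizes from the table upthread, bandwidths from this thread):

```python
# 4-bit quantized model sizes in GB, from the table earlier in the thread.
MODEL_GB = {"7B": 3.9, "13B": 7.8, "30B": 19.5, "65B": 38.5}

def seconds_per_token(size_gb, bandwidth_gb_s):
    """Lower bound on latency when all weights are re-read each token."""
    return size_gb / bandwidth_gb_s

for name, gb in MODEL_GB.items():
    fast = seconds_per_token(gb, 7.5)  # modern NVMe SSD (7.5 GB/s)
    slow = seconds_per_token(gb, 0.5)  # 2013-era SSD (500 MB/s)
    print(f"{name}: {fast:.1f} s/token (NVMe) vs {slow:.0f} s/token (old SSD)")
```

For 30B this gives 2.6 seconds per token on a fast NVMe drive, matching the estimate above, and well over half a minute per token on an older drive.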