1311 points by msoad | 1 comment
jart ◴[] No.35393615[source]
Author here. For additional context, please read https://github.com/ggerganov/llama.cpp/discussions/638#discu... The loading time performance has been a huge win for usability, and folks have been having the most wonderful reactions after using this change. But we don't have a compelling enough theory yet to explain the RAM usage miracle. So please don't get too excited just yet! Yes things are getting more awesome, but like all things in science a small amount of healthy skepticism is warranted.
1. StillBored ◴[] No.35397499[source]
Took a look at it. Did you try MAP_HUGETLB? This looks like the kind of application that can gain very large runtime advantages from avoiding TLB pressure. The mmap() might take a bit longer (or fail entirely) on machines where you can't reserve enough huge pages, but attempting it (or first probing for free huge pages via /proc/meminfo) and then falling back to a normal mapping costs little, and taking an order of magnitude fewer TLB misses (assuming you can get 1G pages) might well be worth it.
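The try-then-fall-back pattern described above could be sketched roughly like this. This is a hypothetical helper, not the actual llama.cpp code; note also that on Linux, MAP_HUGETLB for file-backed mappings generally requires the file to live on hugetlbfs, so for an ordinary model file on a regular filesystem the fallback branch will usually be taken:

```c
#include <stdio.h>
#include <sys/mman.h>

/* Hypothetical helper: try to map a file with explicit huge pages,
 * falling back to a normal mapping if the kernel refuses.
 *
 * MAP_HUGETLB needs preallocated huge pages (check HugePages_Free in
 * /proc/meminfo) and, for file-backed mappings, a hugetlbfs file, so
 * failure is the common case on a default-configured machine. The
 * fallback keeps the program working either way. */
static void *map_weights(int fd, size_t size) {
#ifdef MAP_HUGETLB
    void *p = mmap(NULL, size, PROT_READ,
                   MAP_PRIVATE | MAP_HUGETLB, fd, 0);
    if (p != MAP_FAILED)
        return p;  /* got huge pages: far fewer TLB entries needed */
#endif
    /* Fallback: ordinary (typically 4 KiB) pages. */
    return mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
}
```

For transparent huge pages (THP) rather than explicit ones, `madvise(addr, size, MADV_HUGEPAGE)` on an existing mapping is the usual alternative, though kernel support for THP on file-backed memory is more limited than for anonymous memory.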