
1311 points by msoad | 1 comment
jart
Author here. For additional context, please read https://github.com/ggerganov/llama.cpp/discussions/638#discu... The loading time performance has been a huge win for usability, and folks have been having the most wonderful reactions after using this change. But we don't have a compelling enough theory yet to explain the RAM usage miracle. So please don't get too excited just yet! Yes things are getting more awesome, but like all things in science a small amount of healthy skepticism is warranted.
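
For readers unfamiliar with the change being discussed: the loading-time win appears to come from memory-mapping the weights file with mmap() rather than copying it into freshly allocated buffers (that is a reading of the linked llama.cpp discussion, not a claim made in the comment above). A minimal sketch of the technique in C, with a hypothetical file name and no real tensor layout:

```c
/* Minimal sketch (not the actual llama.cpp code): open a weights file and
   map it read-only, so the OS pages data in lazily instead of read()ing
   the whole file up front. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("model.bin", O_RDONLY);   /* hypothetical file name */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    /* Map the whole file. No copy into heap memory is made: pages are
       faulted in on first access, and clean pages can be evicted and
       re-read by the kernel under memory pressure. */
    void *weights = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (weights == MAP_FAILED) { perror("mmap"); return 1; }
    close(fd);  /* the mapping stays valid after close */

    printf("mapped %lld bytes at %p\n", (long long)st.st_size, weights);

    /* ... tensors would point directly into the mapping here ... */

    munmap(weights, st.st_size);
    return 0;
}
```

Because mapped pages are demand-loaded and reclaimable, tools can report lower resident memory than a full read into the heap would show, which is one plausible (but, per the comment above, not yet confirmed) explanation for the surprising RAM numbers.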
alchemist1e9
I’m hopeful that once especially skilled developers like you have banged on minimizing inference resources, you and others will start looking at distributed training ideas. There is probably a way to decentralize the training so we can all throw our GPUs together to build the most useful models for code generation, models that can be free to use and relatively cheap to run inference on. If you have any thoughts on that side of the LLM space, I’m sure we would all be super curious to hear them.

Thank you for the amazing work. It’s so appreciated by so many here on HN, myself included.
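
On the distributed training idea above, the usual building block for pooling GPUs is data parallelism: each participant computes gradients on its own shard of data, and the gradients are averaged before a shared update is applied. A toy sketch of that averaging step, with hypothetical numbers and none of the networking, compression, or fault tolerance a real decentralized system would need:

```c
/* Toy sketch of data-parallel gradient averaging: each "worker" has
   gradients from its own local batch; we average them (an all-reduce in
   a real system) and take one SGD step on the shared weights. */
#include <stdio.h>

#define N_WORKERS 4
#define N_PARAMS  3

int main(void) {
    /* Hypothetical per-worker gradients. */
    double grads[N_WORKERS][N_PARAMS] = {
        {0.10, -0.20, 0.05},
        {0.12, -0.18, 0.07},
        {0.08, -0.22, 0.04},
        {0.11, -0.19, 0.06},
    };
    double weights[N_PARAMS] = {1.0, 2.0, 3.0};
    double lr = 0.01;  /* learning rate */

    for (int p = 0; p < N_PARAMS; p++) {
        double avg = 0.0;
        for (int w = 0; w < N_WORKERS; w++) avg += grads[w][p];
        avg /= N_WORKERS;          /* average gradient across workers */
        weights[p] -= lr * avg;    /* shared SGD update */
    }

    for (int p = 0; p < N_PARAMS; p++) printf("w[%d] = %f\n", p, weights[p]);
    return 0;
}
```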