(github.com)

1311 points msoad | 1 comments | 31 Mar 23 20:37 UTC


jart ◴[31 Mar 23 21:02 UTC] No.35393615[source]▶

Author here. For additional context, please read https://github.com/ggerganov/llama.cpp/discussions/638#discu... The loading time performance has been a huge win for usability, and folks have been having the most wonderful reactions after using this change. But we don't have a compelling enough theory yet to explain the RAM usage miracle. So please don't get too excited just yet! Yes, things are getting more awesome, but like all things in science, a small amount of healthy skepticism is warranted.

replies(24): >>35393868 #>>35393942 #>>35394089 #>>35394097 #>>35394107 #>>35394203 #>>35394208 #>>35394244 #>>35394259 #>>35394288 #>>35394408 #>>35394881 #>>35395091 #>>35395249 #>>35395858 #>>35395995 #>>35397318 #>>35397499 #>>35398037 #>>35398083 #>>35398427 #>>35402974 #>>35403334 #>>35468946 #

intelVISA ◴[31 Mar 23 22:03 UTC] No.35394288[source]▶

>>35393615 #

Didn't expect to see two titans today: ggerganov AND jart. Can y'all slow down? You make us mortals look bad :')

Seeing such clever use of mmap makes me dread to imagine how much Python spaghetti probably tanks OpenAI's and other "big ML" shops' infra when they should've trusted in zero copy solutions.

Perhaps SWE is dead after all, but LLMs didn't kill it...
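For anyone unfamiliar with the idea, here is a minimal sketch of what mmap-style zero-copy loading looks like, written in Python rather than llama.cpp's C++. The flat float32 file layout and the helper names `write_weights` and `sum_slice_mmap` are hypothetical, made up for illustration; this is not llama.cpp's file format or code.

```python
import array
import mmap
import os
import tempfile


def write_weights(path, values):
    """Write values as native-endian float32 to a flat file (hypothetical layout)."""
    with open(path, "wb") as f:
        array.array("f", values).tofile(f)


def sum_slice_mmap(path, start, count):
    """Sum `count` floats starting at index `start` without reading the whole file.

    The file is mapped read-only; the OS pages in only the parts we touch,
    and the memoryview reads straight from the mapping instead of copying
    the file into a separate Python buffer first.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Reinterpret the mapped bytes as float32 values.
            with memoryview(mm).cast("f") as view:
                return sum(view[start:start + count])


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "weights.bin")
    write_weights(path, [float(i) for i in range(1000)])
    print(sum_slice_mmap(path, 10, 5))
```

The point of the sketch is only that opening the mapping is cheap regardless of file size, and the cost is paid lazily for the pages actually read. Whether that explains the RAM numbers in the linked discussion is a separate question.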

replies(11): >>35395112 #>>35395145 #>>35395165 #>>35395404 #>>35396298 #>>35397484 #>>35398972 #>>35399367 #>>35400001 #>>35400090 #>>35456064 #

1. catchnear4321 ◴[31 Mar 23 23:26 UTC] No.35395165[source]▶

>>35394288 #

Money did.

Why waste developer hours (meaning effort) if you can just scale the infra for a little cash? Do it in small enough increments and the increases only outweigh FTEs if you consider all scaling events and look at a long enough time scale.

Suddenly it takes way too much for way too little, but it cost half as many overpaid developers who can’t be arsed to care about performance.

Edit: in case that sounds like the opposite of intended, ggerganov and jart are the outliers, the exception.


Llama.cpp 30B runs with only 6GB of RAM now