Llama.cpp 30B runs with only 6GB of RAM now

(github.com)

1311 points by msoad | 3 comments | 31 Mar 23 20:37 UTC


detrites [31 Mar 23 20:58 UTC] No.35393558

The pace of collaborative OSS development on these projects is amazing, but the rate of optimisations being achieved is almost unbelievable. What has everyone been doing wrong all these years... cough, sorry, I mean to say weeks?

Ok I answered my own question.

replies(5): >>35393627 #>>35393885 #>>35393921 #>>35394786 #>>35397029 #

kmeisthax [31 Mar 23 21:28 UTC] No.35393921

>>35393558 #

>What has everyone been doing wrong all these years

So it's important to note that all of these improvements are the kinds of things that are cheap to run on a pretrained model. And all of the developments involving large language models recently have been the product of hundreds of thousands of dollars in rented compute time. Once you start putting six digits on a pile of model weights, that becomes a capital cost that the business either needs to recoup or turn into a competitive advantage. So everyone who scales up to this point doesn't release model weights.

The model in question - LLaMA - isn't even a public model. It leaked and people copied[0] it. But because such a large model leaked, now people can actually work on iterative improvements again.

Unfortunately we don't really have a way for the FOSS community to pool together that much money to buy compute from cloud providers. Contributions-in-kind through distributed computing (e.g. a "GPT@home" project) would require significant changes to training methodology[1]. Further compounding this, the state-of-the-art is actually kind of a trade secret now. Exact training code isn't always available, and OpenAI has even gone so far as to refuse to say anything about GPT-4's architecture or training set to prevent open replication.

[0] I'm avoiding the use of the verb "stole" here, not just because I support filesharing, but because copyright law likely does not protect AI model weights alone.

[1] AI training has very high minimum requirements to get in the door. If your GPU has 12GB of VRAM and your model and gradients require 13GB, you can't train the model. CPUs don't have this limitation but they are ridiculously inefficient for any training task. There are techniques like ZeRO to give pagefile-like state partitioning to GPU training, but that requires additional engineering.
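To make the arithmetic in [1] concrete, here is a rough sketch. Everything in it is an assumption for illustration: the 3.25 billion parameter count is a hypothetical picked only to reproduce the 13GB figure, it assumes 2-byte (fp16) weights and gradients, and it ignores activations, optimizer state, and framework overhead, all of which make the real number larger. The even split across GPUs is an idealized stand-in for state partitioning, not a model of how any particular ZeRO stage behaves.

```python
def training_footprint_gb(n_params, bytes_per_weight=2, bytes_per_grad=2):
    """Rough size of weights + gradients in GB (1 GB = 1e9 bytes).

    Ignores activations, optimizer state, and framework overhead,
    so real training needs more than this.
    """
    return n_params * (bytes_per_weight + bytes_per_grad) / 1e9


def fits(required_gb, vram_gb):
    """True if the estimated footprint fits in one GPU's VRAM."""
    return required_gb <= vram_gb


def per_gpu_gb_if_partitioned(required_gb, n_gpus):
    """Idealized even split of the training state across n_gpus.

    Assumption: perfect partitioning with no replication and no
    communication buffers.
    """
    return required_gb / n_gpus


# Hypothetical model size chosen to match the 13GB example in [1].
n_params = 3_250_000_000
required = training_footprint_gb(n_params)

print(f"needs about {required} GB")
print("fits on one 12GB GPU:", fits(required, 12))
print("fits if split over two 12GB GPUs:",
      fits(per_gpu_gb_if_partitioned(required, 2), 12))
```

The point is that the limit is a hard threshold: being 1GB short on a single card means the job does not run at all, while splitting the same state across more cards brings each share back under the limit, at the cost of the extra engineering mentioned above.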

replies(7): >>35393979 #>>35394466 #>>35395609 #>>35396273 #>>35400202 #>>35400942 #>>35573426 #

terafo [31 Mar 23 21:35 UTC] No.35393979

>>35393921 #

>AI training has very high minimum requirements to get in the door. If your GPU has 12GB of VRAM and your model and gradients require 13GB, you can't train the model. CPUs don't have this limitation but they are ridiculously inefficient for any training task. There are techniques like ZeRO to give pagefile-like state partitioning to GPU training, but that requires additional engineering.

You can't if you have one 12GB GPU. You can if you have a couple dozen of them. And then Petals-style training is possible. It is all very, very new and there are many unsolved hurdles, but I think it can be done.

replies(3): >>35394356 #>>35394585 #>>35395800 #

webnrrd2k [31 Mar 23 22:27 UTC] No.35394585

>>35393979 #

Maybe a good candidate for the SETI@home treatment?

replies(1): >>35394635 #

terafo [31 Mar 23 22:31 UTC] No.35394635

>>35394585 #

It is a good candidate. The tech is a good 6-18 months away, though.

replies(1): >>35395229 #

nullsense [31 Mar 23 23:35 UTC] No.35395229

>>35394635 #

How much faster can we develop the tech if we leverage GPT-4 to do it?
