
1311 points | msoad | 1 comment
detrites No.35393558
The pace of collaborative OSS development on these projects is amazing, but the rate of optimisations being achieved is almost unbelievable. What has everyone been doing wrong all these years *cough* sorry, I mean to say, weeks?

Ok I answered my own question.

kmeisthax No.35393921
>What has everyone been doing wrong all these years

So it's important to note that all of these improvements are the kinds of things that are cheap to run on a pretrained model, whereas all of the recent developments involving large language models have been the product of hundreds of thousands of dollars in rented compute time. Once you put six digits into a pile of model weights, that becomes a capital cost the business needs to either recoup or turn into a competitive advantage. So nobody who scales up to this point releases their model weights.

The model in question - LLaMA - isn't even a public model. It leaked and people copied[0] it. But because such a large model leaked, now people can actually work on iterative improvements again.

Unfortunately we don't really have a way for the FOSS community to pool together that much money to buy compute from cloud providers. Contributions-in-kind through distributed computing (e.g. a "GPT@home" project) would require significant changes to training methodology[1]. Further compounding this, the state-of-the-art is actually kind of a trade secret now. Exact training code isn't always available, and OpenAI has even gone so far as to refuse to say anything about GPT-4's architecture or training set to prevent open replication.

[0] I'm avoiding the use of the verb "stole" here, not just because I support filesharing, but because copyright law likely does not protect AI model weights alone.

[1] AI training has very high minimum requirements to get in the door. If your GPU has 12GB of VRAM and your model and gradients require 13GB, you can't train the model at all. CPUs don't have this limitation, but they are ridiculously inefficient for any training task. There are techniques like ZeRO that give pagefile-like state partitioning to GPU training, but they require additional engineering.
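To make that floor concrete, here's a rough back-of-the-envelope sketch of training memory, assuming a common mixed-precision setup (fp16 weights and gradients, fp32 master weights, fp32 Adam moments). The byte counts are illustrative assumptions, not exact figures for any specific model, and activations would come on top of this.

```python
# Rough VRAM estimate for training, under an assumed mixed-precision
# Adam setup. Illustrative only; real frameworks vary.

def training_memory_gb(n_params: float) -> float:
    """Approximate GB of state needed to train a model with n_params parameters."""
    bytes_per_param = (
        2    # fp16 weights
        + 2  # fp16 gradients
        + 4  # fp32 master copy of weights
        + 8  # fp32 Adam moments (m and v)
    )
    return n_params * bytes_per_param / 1024**3

# A 7B-parameter model needs on the order of ~104 GB of optimizer/weight
# state alone -- far beyond a 12 GB consumer GPU, which is the point above.
print(f"{training_memory_gb(7e9):.0f} GB")
```

This is exactly the state that ZeRO-style partitioning spreads across devices (or pages out), at the cost of the extra engineering mentioned above.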

seydor No.35394466
> we don't really have a way for the FOSS community to pool together that much money

There must be open source projects with enough money to pool into such a project. I wonder whether Wikimedia or Apache are considering anything.

sceadu No.35395484{3}
Maybe we can repurpose the SETI@home infrastructure :)
kmeisthax No.35396801{4}
BOINC might be usable, but the existing distributed training setups assume all nodes have very high-speed I/O so they can trade gradients and model updates around quickly. The kind of setup that's feasible for BOINC is "here's a dataset shard, here's the last epoch's weights, send me back gradients and I'll average them with the other ones I get to make the next epoch". This is quite a bit different from, say, the single-node case, which is entirely serial and where model updates happen every step rather than every epoch.
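The BOINC-style loop described above can be sketched in a few lines. This is a toy illustration (the linear "model", function names, and learning rate are all hypothetical), showing volunteers computing shard gradients against the same frozen epoch weights and a coordinator averaging them into one update per epoch:

```python
import numpy as np

def shard_gradient(weights, shard):
    # Done by a volunteer: gradient of mean squared error ||Xw - y||^2 / n
    # on its local dataset shard, against the frozen epoch weights.
    X, y = shard
    return 2 * X.T @ (X @ weights - y) / len(y)

def epoch_update(weights, shards, lr=0.1):
    # Done by the coordinator: average the returned gradients and take
    # one step -- one model update per epoch, not per step.
    grads = [shard_gradient(weights, s) for s in shards]
    return weights - lr * np.mean(grads, axis=0)

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0])
shards = []
for _ in range(4):  # four volunteers, 50 examples each
    X = rng.normal(size=(50, 2))
    shards.append((X, X @ true_w))

w = np.zeros(2)
for _ in range(100):  # 100 epochs
    w = epoch_update(w, shards)
print(np.round(w, 2))  # converges toward true_w
```

Because averaging gradients computed at the *same* weights is mathematically equivalent to one large-batch step, this tolerates slow links; the cost is that you get one update per epoch instead of thousands, which is why it diverges so much from standard single-node training.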