←back to thread

1311 points msoad | 3 comments | | HN request time: 0.732s | source
Show context
detrites ◴[] No.35393558[source]
The pace of collaborative OSS development on these projects is amazing, but the rate of optimisations being achieved is almost unbelievable. What has everyone been doing wrong all these years cough sorry, I mean to say weeks?

Ok I answered my own question.

replies(5): >>35393627 #>>35393885 #>>35393921 #>>35394786 #>>35397029 #
kmeisthax ◴[] No.35393921[source]
>What has everyone been doing wrong all these years

So it's important to note that all of these improvements are the kinds of things that are cheap to run on a pretrained model. And all of the developments involving large language models recently have been the product of hundreds of thousands of dollars in rented compute time. Once you start putting six digits on a pile of model weights, that becomes a capital cost that the business either needs to recuperate or turn into a competitive advantage. So everyone who scales up to this point doesn't release model weights.

The model in question - LLaMA - isn't even a public model. It leaked and people copied[0] it. But because such a large model leaked, now people can actually work on iterative improvements again.

Unfortunately we don't really have a way for the FOSS community to pool together that much money to buy compute from cloud providers. Contributions-in-kind through distributed computing (e.g. a "GPT@home" project) would require significant changes to training methodology[1]. Further compounding this, the state-of-the-art is actually kind of a trade secret now. Exact training code isn't always available, and OpenAI has even gone so far as to refuse to say anything about GPT-4's architecture or training set to prevent open replication.

[0] I'm avoiding the use of the verb "stole" here, not just because I support filesharing, but because copyright law likely does not protect AI model weights alone.

[1] AI training has very high minimum requirements to get in the door. If your GPU has 12GB of VRAM and your model and gradients require 13GB, you can't train the model. CPUs don't have this limitation but they are ridiculously inefficient for any training task. There are techniques like ZeRO to give pagefile-like state partitioning to GPU training, but that requires additional engineering.

replies(7): >>35393979 #>>35394466 #>>35395609 #>>35396273 #>>35400202 #>>35400942 #>>35573426 #
1. JBorrow ◴[] No.35400202[source]
$n00k of compute time is nothing, sorry. This is the kind of thing that academic institutions can give out for free…
replies(1): >>35400271 #
2. linesinthesand ◴[] No.35400271[source]
why don't you write the check then huh
replies(1): >>35406593 #
3. ◴[] No.35406593[source]