Ok I answered my own question.
In other words, the groups of folks working on training models don’t necessarily have access to the sort of optimization engineers who work in other areas.
When all of this leaked into the open, it caused a lot of people knowledgeable in different areas to put their own expertise to the task. Some of those efforts (mmap-based weight loading, for instance) paid off spectacularly. Expect industry to copy the best of these improvements.
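For anyone curious why mmap helps so much, here's a minimal Python sketch of the idea, assuming a raw fp16 weight blob at a hypothetical path "weights.bin" (real formats like GGML add headers and per-tensor metadata on top of this):

    import numpy as np

    # The OS pages weights in lazily instead of copying the whole file
    # into RAM up front. Nothing is read from disk at this point.
    weights = np.memmap("weights.bin", dtype=np.float16, mode="r")

    # Touching a slice faults in only the pages that slice covers, so
    # startup is nearly instant and unused tensors never hit memory.
    first_layer = weights[:4096 * 4096].reshape(4096, 4096)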
It's important to note that all of these improvements are the kinds of things that are cheap to run on a pretrained model. The recent large language models themselves, by contrast, are the product of hundreds of thousands of dollars in rented compute time. Once you put six digits into a pile of model weights, that becomes a capital cost the business needs to either recoup or turn into a competitive advantage. So nobody who scales up to this point releases their model weights.
The model in question - LLaMA - isn't even a public model. It leaked and people copied[0] it. But because such a large model leaked, now people can actually work on iterative improvements again.
Unfortunately we don't really have a way for the FOSS community to pool together that much money to buy compute from cloud providers. Contributions-in-kind through distributed computing (e.g. a "GPT@home" project) would require significant changes to training methodology[1]. Further compounding this, the state-of-the-art is actually kind of a trade secret now. Exact training code isn't always available, and OpenAI has even gone so far as to refuse to say anything about GPT-4's architecture or training set to prevent open replication.
[0] I'm avoiding the use of the verb "stole" here, not just because I support filesharing, but because copyright law likely does not protect AI model weights alone.
[1] AI training has very high minimum requirements to get in the door. If your GPU has 12 GB of VRAM and your model and gradients require 13 GB, you can't train the model. CPUs don't have this limitation, but they are ridiculously inefficient for any training task. There are techniques like ZeRO to give pagefile-like state partitioning to GPU training, but that requires additional engineering.
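To make footnote [1] concrete, here's a back-of-envelope estimate in Python, using the standard accounting from the ZeRO paper (fp16 weights and gradients plus fp32 Adam optimizer state); the numbers are illustrative, not measurements of any particular model:

    # Rough VRAM needed for full mixed-precision training with Adam:
    # 2 bytes/param fp16 weights + 2 bytes/param fp16 gradients +
    # 12 bytes/param fp32 optimizer state (master weights + two moments).
    def training_vram_gb(params_billions: float) -> float:
        bytes_per_param = 2 + 2 + 12
        return params_billions * bytes_per_param  # 1B params * 1 byte = 1 GB

    print(training_vram_gb(7))  # ~112 GB for a 7B model -- far past a
                                # 12 GB card, before counting activations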
You can't if you have one 12 GB GPU. You can if you have a couple dozen of them, and then Petals-style training is possible. It's all very, very new and there are many unsolved hurdles, but I think it can be done.
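As a rough illustration of the idea (not Petals itself, which adds fault tolerance, quantization, and swarm routing on top), here's a naive PyTorch pipeline-parallel sketch that splits a model's layers across two hypothetical smaller devices so neither has to hold all the weights and gradients:

    import torch.nn as nn

    # Naive pipeline (model) parallelism; assumes two CUDA devices are
    # visible. Each device holds only its own stage's parameters.
    class TwoStageModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.stage1 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:0")
            self.stage2 = nn.Linear(4096, 4096).to("cuda:1")

        def forward(self, x):
            x = self.stage1(x.to("cuda:0"))     # first half runs on GPU 0
            return self.stage2(x.to("cuda:1"))  # activations hop to GPU 1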
There must be open source projects with enough money to pool into such a project. I wonder whether wikimedia or apache are considering anything.
It’s several things:
* Cutting-edge code, not overly concerned with optimization
* Code written by scientists, who aren’t known for being the world’s greatest programmers
* The obsession the research world has with using Python
Not surprising that there’s a lot of low-hanging fruit that can be optimized.
How so? Why couldn't we just start a gofundme/kickstarter to fund the training of an open-source model?
This is why I think the patent and copyright system is a failure: the idea that having laws protecting information like this would advance the progress of science.
It doesn't; just look at how much faster an illegally leaked model is advancing. The laws protecting IP merely give a moat to incumbents.
It's not infeasible; in fact, that's essentially how things were done before lots of improvements landed in the various libraries. Many corps still have poorly built pipelines that spend a lot of time in CPU land and not enough in GPU land.
Just an FYI as well: intermediate outputs of models are used in quite a bit of ML; you may see them in some form being used for hyperparameter optimization and search.
You could potentially crowdfund this, though I should point out that this was already tried and Kickstarter shut it down. The effort in question, "Unstable Diffusion", was kinda sketchy, promising a model specifically tuned for NSFW work. What you'd want is an organization that's responsible, knows how to use state of the art model architectures, and at least is willing to try and stop generative porn.
Which just so happens to be Stability AI. Except they're funded as a for-profit on venture capital, not as something you can donate to on Kickstarter or Patreon.
If they were to switch from investor subsidy to crowdfunding, however, I'm not entirely sure people would actually line up to bear the costs of training. To see why, we need to talk about motive. We can broadly subdivide the users of generative AI into a few categories:
- Companies, who view AI as a way to juice stock prices by promising a permanent capitalist revolution that will abolish the creative working class. They do not care about ownership; they care about balancing profit and loss. Inasmuch as they want AI models not controlled by OpenAI, it is a strategic play, not a moral one.
- Artists of varying degrees of competence who use generative AI to skip past creative busywork, such as assembling references, or to hack out something quickly. Inasmuch as they have critiques of how AI is owned, it is specifically that they do not want to be abolished by capitalists using their own labor as ground meat for the linear algebra data blender. So they are unlikely to crowdfund the thing they are angry is going to put them out of a job.
- No-hopers and other creatively bankrupt individuals who have been sold a promise that AI is going to fix their lack of talent by making talent obsolete. This is, of course, a lie[2]. They absolutely would prefer a model unencumbered by filters on cloud servers or morality clauses in licensing agreements, but they do not have the capital in aggregate to fund such an endeavor.
- Free Software types that hate OpenAI's about-face on open AI. Oddly enough, they also have the same hangups artists do, because much of FOSS is based on copyleft/Share-Alike clauses in the GPL, which things like GitHub Copilot are not equipped to handle. On the other hand, they probably would be OK with it if the model were trained on permissive sources and had some kind of regurgitation detector. Consider this one a wildcard.
- Evildoers. This could be people who want a cheaper version of GPT-4 that hasn't been Asimov'd by OpenAI so they can generate shittons of spam. Or people who want a Stable Diffusion model that's really good at making nonconsensual deepfake pornography so they can fuck with people's heads. This was the explicit demographic that "Unstable Diffusion" was trying to target. Problem is, cybercriminals tend to be fairly unsophisticated, because the people who actually know how to crime with impunity would rather make more money in legitimate business instead.
Out of five demographics I'm aware of, two have capital but no motive, two have motive but no capital, and one would have both - but they already have a sour taste in their mouth from the creep-tech vibes that AI gives off.
[0] In practice the only way that profit cap is being hit is if they upend the economy so much that it completely decimates all human labor, in which case they can just overthrow the government and start sending out Terminators to kill the working class[1].
[1] God damn it, why do all the best novel ideas have to come by when I'm halfway through another fucking rewrite of my current one?
[2] Getting generative AI to spit out good writing or art requires careful knowledge of the model's strengths and limitations. Like any good tool.
But a lot of people would rather only have govt or corp control of it...
The interface is designed to be easy to use (Python), and the bit that is actually doing the work is designed to be highly performant (C & CUDA, and it may even be running on a TPU).
Of course it would save them some money if they could run their models on cheaper hardware, but they've raised $11B so I don't think that's much of a concern right now. Better to spend the efforts on pushing the model forward, which some of these optimisations may make harder.
Yes. These laws are bad. We could fix this with a two-line change:
Section 1. Article I, Section 8, Clause 8 of this Constitution is hereby repealed.
Section 2. Congress shall make no law abridging the right of the people to publish information.
To fix this, you'd need to ban trade secrecy entirely. As in, if you have some kind of invention or creative work you must publish sufficient information to replicate it "in a timely manner". This would be one of those absolutely insane schemes that only a villain in an Ayn Rand book would come up with.
That'd be a 10,000-fold depreciation of an asset due to a preventable oversight. Ouchies.
The problem is: how in the world is ChatGPT so good compared to the average human being? The answer is that human beings (except for the 1%) have their left hands tied behind their backs because of copyright law.
You're completely correct that the speed-sensitive parts are written in lower-level libraries, but another way to phrase that is "Python can go really fast, as long as you don't use Python." But this also means ML is effectively hamstrung into only using methods that already exist and have been coded in C++, since anything in Python would be too slow to compete.
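A toy illustration of that phrasing: the same dot product computed in an interpreted Python loop versus one call into NumPy's C/BLAS backend. Exact timings will vary by machine; the gap is typically a couple of orders of magnitude.

    import time
    import numpy as np

    a = np.random.rand(1_000_000)
    b = np.random.rand(1_000_000)

    t0 = time.perf_counter()
    slow = sum(x * y for x, y in zip(a, b))  # interpreted: one Python object per element
    t1 = time.perf_counter()
    fast = a @ b                             # one call into optimized C/BLAS
    t2 = time.perf_counter()

    print(f"pure Python: {t1 - t0:.3f}s, NumPy: {t2 - t1:.5f}s")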
There are lots of languages that make good tradeoffs between performance and usability. Python is not one of them. It is, at best, only slightly harder to use than Julia, yet orders of magnitude slower.