Copyright in its current form is ridiculous, but I support some (much-pared-back) version of copyright that limits rights further, expands fair use, repeals the DMCA, and reduces the copyright term to something on the order of 15-20 years (perhaps with a renewal option as with patents).
I've released a lot of software under the GPL, and the GPL in its current form couldn't exist without copyright.
What copyright should do is protect individual creators, not corporations. And it should protect them even if their work is mixed through complex statistical algorithms such as LLMs.
LLMs wouldn't be possible without _trillions_ of hours of work by the people writing the books, code, music, etc. that they are trained on. The _millions_ of hours of work spent on the training algorithm itself, the chat interface, the scraping scripts, and so on are barely a drop in the bucket.
There is zero reason the people who put in the mere millions of hours should get all the reward while giving nothing to the rest of the world, who put in the trillions.
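To make those orders of magnitude concrete, here's a back-of-envelope sketch. Every figure in it is a made-up illustrative assumption, not a measurement:

```python
# Back-of-envelope only; all numbers below are illustrative assumptions.
contributors = 50_000_000      # people whose books/code/music ended up in training sets
hours_each = 20_000            # rough lifetime creative output, ~10 working years
training_data_hours = contributors * hours_each   # 1e12: a trillion hours

lab_headcount = 10_000         # engineers/researchers on the model, UI, scrapers
lab_hours_each = 10_000        # ~5 working years each
lab_hours = lab_headcount * lab_hours_each        # 1e8: a hundred million hours

print(f"training data : lab work = {training_data_hours // lab_hours:,} : 1")
# -> 10,000 : 1
```

Tweak the assumptions however you like; the ratio stays lopsided by several orders of magnitude.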
It's not only about verbatim regurgitation. That just means it gets caught more easily.
LLMs are just another way the uber rich try to exploit everyone, hoping that if they exploit every single person's work just a little, they will get away with it.
Nobody is 1000x more productive than the average programmer at writing code. There is no reason somebody should make 1000x more money from it either.
This isn't really how derivative works operate.
If you read Harry Potter and you decide you want to write a book about how Harry and Hermione grow up and become professors at Hogwarts, that's probably going to be a derivative work.
If you read Harry Potter and decide you want to write a book about a little Korean girl who lives with abusive parents but has a knack for science and crawls her way out of that family by inventing things for an eccentric businessman, is that a derivative of Harry Potter? Probably not, even if that was the inspiration for it.
To be a derivative work it has to be pretty similar to the original. That's actually the test: it's based on similarity. And mixing a work with so many other things that it's no longer sufficiently similar to any one of them is exactly how you stop it from being a derivative.
Two distinctions matter here: how things work now vs. how they should work, and also how it works when a human does something vs. when an LLM is used to generate something imitating the human work.
A human has limited time and memory. Human time is valuable; computer time is not. For a human, even memorizing something takes time.
When a human is inspired by a work and writes something based on it, he invests a lot of time and energy into it. That is why people have decided that this creative output should be protected by the law.
A human is also limited by how much he can remember from the original work. Even when writing what you described, he would inevitably fall back on his own life experiences, opinions, attitudes, ways of thinking, etc.
When an LLM is used, it generates a statistical mashup of the works it ingested during training. No part of this process has any intrinsic value; it literally costs only what the electricity does. And it's almost infinitely scalable. The law might not call the result derivative because it was written at a time when this kind of mechanical derivation was not feasible.
BTW, I like that you spell it GAI. General artificial intelligence feels more natural to say. I wonder if there's some rule of English I don't know which makes AGI more correct or if all the highly educated people are just trying to avoid sounding like they're saying "gay".
But they are still based on the training data. An untrained model is a random noise generator. A model trained exclusively on GPL code will therefore obviously only generate useful code thanks to the GPL input. The output is literally derived from the "training data" input and the prompt.
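To illustrate the point (a toy sketch, nothing like a real LLM): a character-level "model" with no training samples uniform noise, while the same machinery "trained" on a tiny corpus can only echo that corpus. The corpus string here is just a stand-in for the GPL code in the example above.

```python
import random
from collections import Counter, defaultdict

corpus = "free software is free as in freedom "  # stand-in for the GPL training set

def sample_untrained(n=40):
    # No training signal: every next character is uniform random noise.
    vocab = sorted(set(corpus))
    return "".join(random.choice(vocab) for _ in range(n))

def sample_trained(n=40):
    # "Training": count character bigrams in the corpus, then sample from them.
    bigrams = defaultdict(Counter)
    for a, b in zip(corpus, corpus[1:]):
        bigrams[a][b] += 1
    out = corpus[0]
    for _ in range(n - 1):
        nxt = bigrams[out[-1]]
        out += random.choices(list(nxt), weights=list(nxt.values()))[0]
    return out

print(sample_untrained())  # gibberish: the model itself contributes nothing useful
print(sample_trained())    # fragments recognizably derived from the corpus
```

Swap the corpus and the "useful" output changes with it; the sampling code stays exactly the same. Everything recognizable in the output came in through the training data.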
Now, given that the training input is more substantial than the prompt by orders of magnitude, the prompt is basically irrelevant.
So what the license of the output should be based on is the training data. The big players can only avoid this logical conclusion by pretending that the model ("AI") is some kind of intelligent entity and also by training on everything so any license is only a minority of the input. It's just manipulation.
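For a sense of scale behind "orders of magnitude" above (both figures are ballpark assumptions, not measurements):

```python
# Rough scale comparison; both numbers are ballpark assumptions.
training_tokens = 10 ** 13   # on the order of 10T tokens for a large training run
prompt_tokens = 10 ** 3      # a generous thousand-token prompt

print(f"training data outweighs the prompt ~{training_tokens // prompt_tokens:,}x")
# -> ~10,000,000,000x
```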
An obvious practical problem with this is that the licenses are variously incompatible with one another:
https://en.wikipedia.org/wiki/License_compatibility
> The big players can only avoid this logical conclusion by pretending that the model ("AI") is some kind of intelligent entity and also by training on everything so any license is only a minority of the input.
Whether it's an intelligent entity or not doesn't really enter into it. The real question is whether the output takes enough from some particular input to make it a derivative, which ought to depend on what a given output actually looks like.