Shouldn't the rights extend forward and simply require the LLM code to be deleted?
> Beats me. AI decided to do so and I didn't question it. I did ask AI to look at the OxCaml implementation in the beginning.
This shows that the problem with AI is philosophical, not practical.
Those kinds of cases, although they do happen, are exceptional. A typical output that doesn't line-for-line resemble a single training input is considered a new, but non-copyrightable, work.
That's not how it works. If you ask an LLM to write Harry Potter and it writes something that is 99% the same as Harry Potter, it isn't magically free of copyright. That would obviously be insane.
The legal system is still figuring out exactly what the rules are here but it seems likely that it's going to be on the LLM user to know if the output is protected by copyright. I imagine AI vendors will develop secondary search thingies to warn you (if they haven't already), and there will probably be some "reasonable belief" defence in the eventual laws.
Either way it definitely isn't as simple as "LLM wrote it so we can ignore copyright".
(From what I understand, the amount of human input that's required to make the result copyrightable can be pretty small, perhaps even as little as selecting from multiple options. But this is likely to be quite a gray area.)
(It is, of course, exceptionally lazy to leave such things in if you are using an LLM to assist you with a task, and it can cause problems of false attribution, especially in this case, where it seems to have just picked the name of one of the maintainers of the project.)
To me, this is what seems more insane! If you've never read Harry Potter, and you ask an LLM to write you a story about a wizard boy, and it outputs 80% Harry Potter - how would you even know?
> there will probably be some "reasonable belief" defence in the eventual laws.
This is probably true, but it's irksome to shift all blame away from the LLM producers, who use copyrighted data to peddle copyrighted output. This simply turns the business into copyright infringement as a service - what incentive would they have to actually build those "secondary search thingies" and build them well?
> it definitely isn't as simple as "LLM wrote it so we can ignore copyright".
Agreed. The copyright system is getting stress tested. It will be interesting to see how our legal systems can adapt to this.
Note: I, myself, am guilty of forking projects, quickly adding some simple feature I need with an LLM because I don’t want to take the time to understand the codebase, and using it personally. I don’t attempt to upstream changes like this and waste maintainers’ time until I actually take the time myself to understand the project, the issue, and the solution.
You should be careful about speaking in absolute terms when talking about copyright.
There is nothing that prevents multiple people from owning copyright to identical works. This is also why copyright infringement is such a mess to litigate.
I'd also be interested in knowing why you think code generated by LLMs can't be copyrighted. That's quite a statement.
There's also the problem that copyright law differs between jurisdictions.
The obvious way is by searching the training data for close matches. LLM vendors need to do that and warn you about it. Of course, the problem is they all trained on pirated books and then deleted them...
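For what it's worth, the basic version of such a check isn't exotic. Here's a minimal sketch of the idea in Python, assuming you still have a corpus to search: break the generated text into overlapping word n-grams ("shingles"), compare against each training document with Jaccard similarity, and warn above a threshold. The function names, the toy corpus, and the threshold values are all made up for illustration; a real system would need proper indexing (e.g. MinHash/LSH) to scale beyond a toy corpus.

```python
# Minimal sketch: flag generated text that closely matches documents in a
# corpus, using word n-gram shingles and Jaccard similarity. All names,
# the toy corpus, and the thresholds are illustrative, not any vendor's
# real pipeline.

def shingles(text: str, n: int = 8) -> set:
    """Set of overlapping n-word sequences from the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(0, len(words) - n + 1))}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |intersection| / |union|; 0.0 for empty sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def flag_close_matches(output: str, corpus: dict, threshold: float = 0.3):
    """Return (doc_id, score) pairs above the threshold, highest first --
    i.e. the documents the user should be warned about."""
    out = shingles(output)
    scored = ((doc_id, jaccard(out, shingles(doc)))
              for doc_id, doc in corpus.items())
    return sorted((s for s in scored if s[1] >= threshold),
                  key=lambda s: -s[1])

# Toy usage: in reality the corpus is the training data itself, which is
# exactly the part that got deleted.
corpus = {
    "hp_ch1": "Mr and Mrs Dursley of number four Privet Drive were proud "
              "to say that they were perfectly normal thank you very much",
    "unrelated": "A quicksort partitions the array around a pivot element",
}
print(flag_close_matches(
    "Mr and Mrs Dursley of number four Privet Drive were proud to say",
    corpus, threshold=0.2))
# -> [('hp_ch1', 0.4)]
```

Shingling catches near-verbatim reproduction, which is the case copyright actually bites on; detecting paraphrase is a much harder problem.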
But either way it's kind of a "your problem" thing. You can't really just say "I invented this great tool and it sometimes lets me violate copyright without realising. You don't mind do you, copyright holders?"