That is not to say that we shouldn't do the right thing regardless, but I do think there is a feeling of "who is going to rule the world in the future?" tha underlies governmental decision-making on how much to regulate AI.
That is not to say that we shouldn't do the right thing regardless, but I do think there is a feeling of "who is going to rule the world in the future?" tha underlies governmental decision-making on how much to regulate AI.
For national security reasons I'm perfectly fine with giving LLMs unfettered access to various academic publications, scientific and technical information, that sort of thing. I'm a little more on the fence about proprietary code, but I have a hard time believing there isn't enough code out there already for LLMs to ingest.
Otherwise though, what is an LLM with unfettered access to copyrighted material better at vs one that merely has unfettered access to scientific / technical information + licensed copyrighted material? I would suppose that besides maybe being a more creative writer, the other LLM is far more capable of reproducing copyrighted works.
In effect, the other LLM is a more capable plagiarism machine compared to the other, and not necessarily more intelligent, and otherwise doesn't really add any more value. What do we have to gain from condoning it?
I think the argument I'm making is a little easier to see in the case of image and video models. The model that has unfettered access to copyrighted material is more capable, sure, but more capable of what? Capable of making images? Capable of reproducing Mario and Luigi in an infinite number of funny scenarios? What do we have to gain from that? What reason do we have for not banning such models outright? Not like we're really missing out on any critical security or economic advantages here.
If I'm learning about kinematics maybe it would be more effective to have comparisons to Superman flying faster than a speeding bullet and no amount of dry textbooks and academic papers will make up for the lack of such a comparison.
This is especially relevant when we're talking about science-fiction which has served as the inspiration for many of the leading edge technologies that we use including stuff like LLMs and AI.
A reasonable compromise then is that you can train an AI on Wikipedia, more-or-less. An AI trained this way will have a robust understanding of Superman, enough that it can communicate through metaphor, but it won't have the training data necessary to create a ton of infringing content about Superman (well, it won't be able to create good infringing content anyway. It'll probably have access to a lot of plot summaries but nothing that would help it make a particularly interesting Superman comic or video).
To me it seems like encyclopedias use copyrighted pop culture in a way that constitutes fair use, and so training on them seems fine as long as they consent to it.