They acknowledge the issue is before courts:
> These issues are the subject of intense debate. Dozens of lawsuits are pending in the United States, focusing on the application of copyright’s fair use doctrine. Legislators around the world have proposed or enacted laws regarding the use of copyrighted works in AI training, whether to remove barriers or impose restrictions
Why did they write the finding? I assume because it's their responsibility:
> Pursuant to the Register of Copyrights’ statutory responsibility to “[c]onduct studies” and “[a]dvise Congress on national and international issues relating to copyright,”...
All excerpts are from https://www.copyright.gov/ai/Copyright-and-Artificial-Intell...
Politicians try to crack as few eggs as possible, telling us they are our friends, and we believe them. Now then... some do more good than bad, some do more bad than good. But on the other hand, something that is _good for me_ is _bad for you_ and vice versa. Politicians are just the means to move the needle juuuuuuust a little bit, to show a change, but never make a drastic one. The cost of drastic changes is re-election. And this is the bread and butter of politicians (yes, I am over-over-simplifying, but this is human history and a lot will be left out in a comment).
Sure, the courts may find it's out of their jurisdiction, but they should act as they see fit and let the courts settle that later.
And for the future, here's one heuristic: if there is a profound violation of the law anywhere that (relatively speaking) is ignored or severely downplayed, it is likely that interested parties have arrived at an understanding. Or in other words, a conspiracy.
[1] There are tons of legal arguments on both sides, but for me it is enough to ask: if this is not illegal and is totally fair use (maybe even because, oh no look at what China's doing, etc.), why did they have to resort to & foster piracy in order to obtain this?
If a savant has perfect recall, remembers text perfectly and rearranges that text to create a marginally new text, he'd be sued for breach of copyright.
Only large corporations get away with it.
Why could a copyright office not advise Congress to enact a law that forbids copyrighted material from being used in AI training? This is literally the politicians' job.
* https://www.copyright.gov/ai/Copyright-and-Artificial-Intell...
Yes please.
Delete it for everyone, not just these ridiculous autocrats. It's only helping them in the first place!
European here, but why do you think this is so clear cut? There are other jurisdictions where training on copyrighted data has already been allowed by law/caselaw (Germany and Japan). Why do you need a conspiracy in the US?
AFAICT US copyright law deals with direct reproductions of a copyrighted piece of content (and also carves out some leeway with direct reproduction, like fair use). I think we can all agree by now that LLMs don't fully reproduce "letter perfect" content, right? What then is the "spirit" of the law that you think was broken here? Isn't this the definition of "transformative work"?
Of note is also the other big case involving books: the one where Google was allowed to process mountains of books; they were sued and allowed to continue. How is scanning & indexing tons of books different from scanning & "training" an LLM?
https://www.copyright.gov/ai/Copyright-and-Artificial-Intell...
Any suits would be based on the degree the marginally new copy was fair use. You wouldn't be able to sue the savant for reading and remembering the text.
Using AI to create marginally new copies of copyrighted work is ALREADY a violation. We don't need a dramatic expansion of copyright law that says that just giving the savant the book to read is a copyright violation.
Plagiarism and copyright are two entirely different things. Plagiarism is about citations and intellectual integrity. Copyright is about protecting economic interests; it has nothing to do with intellectual integrity, and it isn't resolved by citing the original work. In fact, most of the contexts where you would be accused of plagiarism (reporting, criticism, education, research) are the ones where fair use arguments are much easier to make.
https://drewdevault.com/2020/08/24/Alice-in-Wonderland.html
https://drewdevault.com/2021/12/23/Sustainable-creativity-po...
Assuming you agree with the idea of inheritance, which is another topic, then it is unfair to deny inheritance of intellectual property. For example, if your father has built a house, it will be yours when he dies; it won't become a public house. So why would a book your father wrote just before he died become public domain the moment he dies? It is unfair to those who are doing intellectual work, especially older people.
If you want short copyright, it would make more sense to make it 20 years, human or corporate, like patents.
If an artist produces a work they should have the rights to that work. If I self-publish a novel and then Penguin decides that novel is really good and they want to publish it, without copyright they'd just do that, totally swamping me with their clout and punishing me for ever putting the work out. That's a bad thing.
Contrast that with AI companies:
They don't necessarily want to assert fair use, the results aren't necessarily publicly accessible, the work used isn't cited, users aren't directed to typical sales channels, and many common usages do meaningfully reduce the market for the original content (e.g. AI summaries for paywalled pages).
It's not obvious to me as a non-lawyer that these situations are analogous, even if there's some superficial similarity.
1. Criticizes a highly useful technology
2. Matches a potentially-outdated, strict interpretation of copyright law
My opinion: I think using copyrighted data to train models for sure seems classically illegal. Despite that, humans can read a book, get inspiration, and write a new book without being litigated against. When I look at the litany of derivative fantasy novels, it's obvious they're not all fully independent works.
Since AI is and will continue to be so useful and transformative, I think we just need to acknowledge that our laws did not accommodate this use-case, and then we should change them.
Humans get litigated against for this all the time. There is such a thing as, charitably, being too inspired.
https://en.wikipedia.org/wiki/List_of_songs_subject_to_plagi...
the internet demands it.
the people demand free Megaupload for everybody, why? because we can (we seem to NOT want to, but that should be a politically solvable problem)
Comparing intellectual property to real or physical property makes no sense. Intellectual property is different because it is non-exclusive. If you are living in your father's house, no one else can be living there. If I am reading your father's book, that has nothing to do with whether anyone else can read the book.
And that's not even touching the spurious lawsuits about musical similarity. That's what musicians call a genre...
It makes some sense for a very short term literal right to reproduction of a singular work, but any time the concept of derivative works comes into play, it's just a bizarrely dystopian suppression of art, under the supposition that art is commercial activity rather than an innate part of humanity.
The hold US companies have on the world will be dead too.
I also suspect that media piracy will be labelled as the only reason we need copyright, an existing agency will be bolstered to address this concern and then twisted into a censorship bureau.
AI is fine as long as the work it generates is substantially new and transformative. If it breaks and starts spitting out other peoples work verbatim (or nearly verbatim) there is a problem.
Yes, I'm aware that machines aren't people and can't be "inspired", but if the functional results are the same the law should be the same. Vaguely defined ideas like your soul or "inspiration" aren't real. The output is real, measurable, and quantifiable and that's how it should be judged.
If you draw a Venn Diagram of plagiarism and copyright violations, there's a big intersection. For example: if I take your paper, scratch off your name, make some minor tweaks, and submit it; I'm guilty of both plagiarism and copyright violation.
That doesn't make piracy legal, even though I get a lot of use out of it.
Also, a person isn't a computer so the "but I can read a book and get inspired" argument is complete nonsense.
The average copyright holder would like you to think that the law only allows use of their works in ways that they specifically permit, i.e. that which is not explicitly permitted is forbidden.
But the law is largely the reverse; it only denies use of copyright works in certain ways. That which is not specifically forbidden is permitted.
I mean, owning an idea is kinda gross, I agree. I also personally think that owning land is kinda gross. But we live in a capitalist society right now. If we allow AI companies to train LLMs on copyrighted works without paying for that access, we are choosing to reward these companies instead of the humans who created the data upon which these companies are utterly reliant for said LLMs. Sam Altman, Elon Musk, and all the other tech CEOs will benefit in place of all of the artists I love and admire.
That, to me, sucks.
I understand people who create IP of any sort being upset that software might be able to recreate their IP or stuff adjacent to it without permission. It could be upsetting. But I don't understand how people jump to "Copyright Violation" for the act of reading. Or even downloading in bulk. Copyright controls, and has always controlled, the creation and distribution of a work. Even in the nature of the copyright notice is embedded the concept that the work will be read.
Reading and summarizing have only ever been controlled in western countries via State's secrets type acts, or alternately, non-disclosure agreements between parties. It's just way, way past reality to claim that we have existing laws to cover AI training ingesting information. Not only do we not, such rules would seem insane if you substitute the word human for "AI" in most of these conversations.
"People should not be allowed to read the book I distributed online if I don't want them to."
"People should not be allowed to write Harry Potter fanfic in my writing style."
"People should not be allowed to get formal art training that involves going to museums and painting copies of famous paintings."
We just will not get to a sensible societal place if the dialogue around these issues has such a low bar for understanding the mechanics, the societal tradeoffs we've made so far, and is able to discuss where we might want to go, and what would be best.
That would indeed be nice, but as the article says, that's usually not the case. The rights holder and the author are almost never the same entity in commercial artistic endeavors. I know I'm not the rights holder for my erroneously-considered-art work (software).
> If I self-publish a novel and then penguin decides that novel is really good and they want to publish it, without copyright they'd just do that, totally swamping me with their clout and punishing my ever putting the work out. That's a bad thing.
Why? You created influential art and its influence was spread. Is that not the point of (good) art?
Copyright only comes into play on publication. It's only concerned about publication of the models and publication of works. The machine itself doesn't have agency to publish anything at this point.
abstracting llms from their operators and owners and possible (and probable) ends and the territories they trample upon is nothing short of eye-popping to me. how utterly negligent and disrespectful of fellow people must one be at the heart to give any credence to such arguments
There's definitely problems with corporatization of ownership of these things, I won't disagree.
> Why? You created influential art and its influence was spread. Is that not the point of (good) art?
Why do we expect artists to be selfless? Do you think Stephen King is still writing only because he loves the art? You don't simply make software because you love it, right? Should people not be able to make money off their effort?
Plus, all art is derivative in some sense, it's almost always just a matter of degree.
https://chatgptiseatingtheworld.com/2025/05/12/opinion-why-t...
Consider how many books exist on how to care for trees. Each one of them has similar ideas, but the way those ideas are expressed differ. Copyright protects the content of the book; it doesn’t protect the ideas of how to care for trees.
Pre-publication reports aren't unusual. https://www.federalregister.gov/public-inspection/current
https://www.federalregister.gov/reader-aids/using-federalreg...
> The Federal Register Act requires that the Office of the Federal Register (we) file documents for public inspection at our office in Washington, DC at least one business day before publication in the Federal Register.
What we do know though is that LLMs, similar to humans, do not directly copy information into their "storage". LLMs, like humans, are pretty lossy with their recall.
Compare this to something like a search indexed database, where the recall of information given to it is perfect.
I believe cover song licensing is available mechanically; you don't need permission, you just need to follow the procedures including sending the licensing fees to a rights clearing house. Music has a lot of mechanical licenses and clearing houses, as opposed to other categories of works.
The thing that'd set apart these companies are the services + quality of their work.
If we enter a world where anyone can create a new Mario game and there are thousands of them released on the public web it would be impossible for the rights holders to do anything, and it would be a PR bad move to go after individuals doing it for fun.
In our current society, that means they need some sort of means to make money from their work. Copyright, at least in theory, exists to incentivize the creation of art by protecting an artists ability to monetize it.
If you abolish copyright today, under our current economic framework, what will happen is that people create less art because it goes from a (semi-)viable career to just being completely worthless to pursue. It's simply not a feasible option unless you fundamentally restructure society (which is a different argument entirely.)
Bad PR? The entire copyright enforcement industry has had bad PR pretty much since easy copying enabled grassroots piracy - i.e. since before computers even. It never stopped them. What are you going to do about it? Vote? But all the mainstream parties are onboard with the copyright lobby.
Copyright is about control. If you know a song and you sing it to yourself, somebody overhears it and starts humming it, they have not deprived you of the ability to still know and sing that song. You can make economic arguments, of deprived profit and financial incentives, and that's fine; I'm not arguing against copyright here (I am not a fan of copyright, it's just not my point at the moment), I'm just saying that inheritance does not naturally apply to copyright, because data and ideas are not scarce, finite goods. They are goods that feasibly everybody in the world can inherit rapidly without lessening the amount that any individual person gets.
If real goods could be freely and easily copied the way data can, we might be having some very interesting debates about the logic and morality of inheriting your parents' house and depriving other people of having a copy.
You can find lots of people talking about training, and you can find lots (way more) of people talking about AI training being a violation of copyright, but you can't find anyone talking about both.
Edit: Let me just clarify that I am talking about training, not inference (output).
Also Big Tech: We added 300,000,000 users' worth of GTM because we trained on the 10 specific anime movies of Studio Ghibli and are selling their style.
It's less clear whether taking vast amounts of copyrighted material and using it to generate other things rises to the level of copyright violation or not. It's the kind of thing that people would have prevented if it had occurred to them, by writing terms of use that explicitly forbid it. (Which probably means that the Web becomes a much smaller place.)
Your comment seems to suggest that writers and artists have absolutely no conceivable stake in products derived from their work, and that it's purely a misunderstanding on their part. But I'm both a computer scientist and an artist and I don't see how you could reach that conclusion. If my work is not relevant then leave it out.
There are two reasons why it's a problem. The first reason is that any such abstraction is leaky, and those leaks are ripe for abuse. For example, in case of copyright on information, we made it behave like physical property for the consumers, but not for the producers (who still only need to expend resources to create a single work from scratch, and then duplicate it for free while still selling each copy for $$$). This means that selling information is much more lucrative than selling physical things, which is a big reason why our economy is so distorted towards the former now - just look at what the most profitable corporations on the market do.
The second reason is that it artificially entrenches capitalism by enmeshing large parts of the economy into those mechanics, even if they aren't naturally a good fit. This then gets used as an argument to prop up the whole arrangement - "we can't change this, it would break too much!".
IP maximalism is requiring DRM tech in every computer and media-capable device that won't play anything without checking into a central server and also making it illegal to reverse or break that DRM. IP maximalism is extending the current bonkers time interval of copyright (over 100 years) to forever. If AI concerns manage to get this down to a reasonable, modern timeframe it'll be awesome.
Record companies in the 90s tied the noose around their own necks, which is just as well because they're very useless now except for supporting geriatric bands. They should have started selling mp3s for 99 cents in 1997 and maybe they would have made a couple of dollars before their slide into irrelevance.
The specific thing people don't want, which a few weirdos keep pushing, is AI-generated stuff passed off as new creative material. It's fine for fun and games, but no one wants a streaming service of AI-generated music, even if you can't tell it's AI-generated. And the minute you think you have that cracked, that an AI can create music/art as good as a human and that humans can't tell, the humans will start making bad music/art in rebellion, and it'll be the cool new thing, and the armies of 10 kW GPUs will be wasting their energy on stuff a 1 MHz 8-bit machine could do in the 80s.
You can probably find a good number of expert programmer + patent lawyers. And presumably some of those osmose enough copyright knowledge from their coworkers to give a knowledgeable answer.
At the end of the day though, the intersection of both doesn't matter. The lawyers win, so what really matters is who has the pulse on how the Fed Circuit will rule on this
Also in this specific case from the article, it's irrelevant?
Is that a problem with the tool, or the person using it? A photocopier can copy an entire book verbatim. Should that be illegal? Or is it the problem that the "training" process can produce a model that has the ability to reproduce copyrighted work? If so, what implication does that hold for human learning? Many people can recite an entire song's lyrics from scratch, and reproducing an entire song's lyrics verbatim is probably enough to be considered copyright infringement. Does that mean the process of a human listening to music counts as copyright infringement?
The article specifically talks about the creation and distribution of a work. Creation and distribution of a work alone is not a copyright violation. However, if you take in input from something you don't own and genAI outputs something, it could be considered a copyright violation.
Let's make this clear: genAI is not a copyright issue by itself. However, genAI becomes an issue when your source is material you don't have the copyright or a license to. So context here is important. If you see people jumping to copyright violation, it's not out of reading alone.
> "People should not be allowed to read the book I distributed online if I don't want them to."
This is already done. It's been done for decades. See any case where content is locked behind an account. Only select people can view the content. The license to use the site limits who or what can use things.
So it's odd you would use "insane" to describe this.
> "People should not be allowed to write Harry Potter fanfic in my writing style."
Yeah, fan fiction is generally not legal. However, there are some cases where fair use covers it. Most cases of fan fiction are allowed because the author allows it. But no, generally, fan fiction is illegal. This is well known in the fan fiction community. Obviously, if you don't distribute it, that's fine. But we aren't talking about non-distribution cases here.
> "People should not be allowed to get formal art training that involves going to museums and painting copies of famous paintings."
Same as with fan fiction. If you replicate a copyrighted piece of art and distribute it, that's illegal. If you simply do it for practice, that's fine. But no, if you go around replicating a painting and distributing it, that's illegal.
Of course, technically speaking, none of this is what gen AI models are doing.
> We just will not get to a sensible societal place if the dialogue around these issues has such a low bar for understanding the mechanics
I agree. Personifying gen AI is useless. We should stick to the technical aspects of what it's doing, rather than trying to pretend it's doing human things when it's 100% not doing that in any capacity. I mean, that's fine for the layman, but anyone with any ounce of technical skill knows that's not true.
Most artists can readily violate copyright; that doesn't mean we block them from seeing copyrighted works.
If it's illegal to know the entire contents of a book it is arbitrary to what degree you are able to codify that knowing itself into symbols.
If judges are permitted to rule here it is not about reproduction of commercial goods but about control of humanity's collective understanding.
What about loosely memorizing the gist of a copyrighted text. Is that a breach or fair use? What if a machine does something similar?
This falls under a rather murky area of the law that is not well defined.
(Raises $10 billion based on estimated worth of the resulting models.)
"We can't share the GPT4 pretraining data or weights because they're trade secrets that generate over a billion in revenue for us."
I'll believe they're worth nothing when (a) nobody is buying AI models or (b) AI companies stop using the copyrighted works to train models they sell. So far, it looks like they're lying about the worth of the training data.
"When a model is deployed for purposes such as analysis or research… the outputs are unlikely to substitute for expressive works used in training. But making commercial use of vast troves of copyrighted works to produce expressive content that competes with them in existing markets, especially where this is accomplished through illegal access, goes beyond established fair use boundaries."
While AIs don't reproduce things verbatim like pirates, I can see how they really undermine the market, especially for non-fiction books. If people can get the facts without buying the original book, there's much less incentive for the original author to do the hard research and writing.
Which is a clear failure of the copyright system. Millions of people are expanding our cultural artifacts with their own additions, but all of it is illegal, because they haven't waited another 100 years.
People are interested in these pieces of culture, but they're not going to remain interested in them forever. At least not interested enough to make their own contributions.
Processing IP without a license AND offering it as a model for money doesn't seem like an unknown use-case to me.
It's that the space of intellectual property LAW does not handle the robust capabilities of LLMs. Legislators NEED to pass laws to reflect the new realities or else all prior case law relies on human analogies which fail in the obvious ways you alluded to.
If there was no law governing the use of death stars and mass murder, and the only legal analogy is to environmental damage, then the only crime the legal system can ascribe is mass environmental damage.
Maybe the government should set up a fund to pay all the copyright holders whose works were used to train the AI models. And if it's a pain to track down the rights holders, I'll play a tiny violin.
"Humans are allowed to breathe, so our machine is too, because it is operated by humans!"
That's also why I'm really not worried about the "AI singularity" folks. The hype is IMO blatantly unsubstantiated by the actual capabilities, but gets pushed anyway only because it speaks to this deep-seated faith held across the industry. "AI" is the culmination of an innate belief that people should be replaceable, fungible, perfectly obedient objects, and such a psychosis blinds decision-makers to its actual limits. Only trouble is whether they have the political power to try to force it anyway.
If I were to take an image, and compress it or encrypt it, and then show you data file, you would not be able to see the original copyrighted material anywhere in the data.
But if you had the right computer program, you could use it to regenerate the original image flawlessly.
I think most people would easily agree that distributing the encrypted file without permission is still a distribution of a copyrighted work and against the law.
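The lossless half of this argument can be demonstrated in a few lines. Here's a minimal sketch using Python's standard zlib module (the sample text is just a placeholder for any copyrighted bytes):

```python
import zlib

# Stand-in for a copyrighted work (any bytes would do).
original = b"It was the best of times, it was the worst of times."

# The compressed bytes bear no visible resemblance to the source text...
compressed = zlib.compress(original)
assert compressed != original

# ...yet with the right program, the work is regenerated flawlessly.
restored = zlib.decompress(compressed)
assert restored == original
```

Distributing `compressed` is still distributing the work; that the bytes are unrecognizable on inspection changes nothing.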
What if you used _lossy_ compression, and can merely reproduce a poor-quality JPEG of the original image? I think that's still copyright infringement, right?
Would it matter if you distributed it with an executable that only rendered the image non-deterministically? Maybe one out of 10 times? Or if the command to reproduce it was undocumented?
Okay, so now we have AI. We can ignore the algorithm entirely and how it works, because it's not relevant. There is a large amount of data that it operates on: the weights of the model and so on. You _can_, with the correct prompts, sometimes generate a copy of a copyrighted work, to some degree of fidelity or another.
I do not think it is meaningfully different from the simpler example, just with a lot of extra steps.
I think, legally, it's pretty clear that it is illegally distributing copyrighted material without permission. I think calling it an "ai" just needlessly anthropomorphizes everything. It's a computer program that distributes copyrighted work without permission. It doesn't matter if it's the primary purpose or not.
I think probably there needs to be some kind of new law to fix this situation, but under the current law as it exists, it seems to me to be clearly illegal.
I think you're overstating the legal uniqueness of LLMs. They're covered just fine by the existing legal precedents around copyrighted and derived works, just as building a death star would be covered by existing rules around outer space use and WMDs. Pretending they should be treated differently is IMO the entire lie told by the "AI" companies about copyright.
Point being, laws aren't some God-ordained rules, beautiful in their fractal recursive abstraction, perfectly covering everything that will ever happen in the universe. No, laws are more or less crude hacks that deal with here and now. Intellectual property rights were questionable from the start and only got worse; they've been barely keeping up with digital media in the past couple decades, and they're entirely ill-equipped to deal with generative AI. This is a new situation, and laws need to be updated to cover it.
The direction we're going, it seems more likely it'll be recycling to murder a human.
I find the shift of some right wing politicians and companies from "TPB and megaupload are criminals and its owners must be extradited from foreign countries!" to "Information wants to be free!" much more illuminating.
That is not to say that we shouldn't do the right thing regardless, but I do think there is a feeling of "who is going to rule the world in the future?" that underlies governmental decision-making on how much to regulate AI.
Corporations are not humans. (It's ridiculous that they have some legal protections in the US like humans, but that's a different issue). AI is also not human. AI is also not a chipmunk.
Why the comparison?
Absolute horse shit. I can start a 1-900 answer line and use any reference I want to answer your question.
However, when an LLM does the same, people now want it to be illegal. It seems pretty straightforward to apply existing copyright law to LLMs in the same way we apply it to humans. If the actual text they generate is substantially similar enough to source material that it would constitute a copyright violation had a human produced it, then it should be illegal. Otherwise it should not.
edit: and in fact it's not even whether an LLM reproduces text, it's whether someone subsequently publishes that text. The person publishing that text should be the one taking the legal hit.
Why is that? Seems all logic gets thrown out the window when invoking AI around here. References are given. If the user publishes the output without attribution, NOW you have a problem. People are being so rabid and unreasonable here. Totally bat shit.
Huh? If you agree that "learning from copyrighted works to make new ones" has traditionally not been considered infringement, then can you elaborate on why you think it fundamentally changes when you do it with bots? That would, if anything, seem to be a reversal of classic copyright jurisprudence. Up until 2022, pretty much everyone agreed that "learning from copyrighted works to make new ones" is exactly how it's supposed to work, and would be horrified at the idea of having to separately license that.
Sure, some fundamental dynamic might change when you do it with bots, but you need to make that case in an enforceable, operationalized way.
I've come to think of this as the "Performatively failing to recognize the difference between an organism and a machine" rhetorical device that people employ here and elsewhere.
The person making the argument is capable of distinguishing the two things, they just performatively choose not to do so.
The Google news snippets case is, in my non-lawyer opinion, the most obvious touch point. And in that case, it was decided that providing large numbers of snippets in search results was non-infringing, despite being a case of copying text from other people at scale... And the reasons this was decided are worth reading and internalizing.
There is not an obvious right answer here. Copyright rules are, in fact, Calvinball, and we're deep in uncharted territory.
I'm not sure at all what China will do. I find it likely that they'll forbid AI at least for minors so that they do not become less intelligent.
Military applications are another matter that are not really related to these copyright issues.
Interesting, but everyone is mining copyrighted works to train AI models.
Suppose we accept all of the above. What does that hold for human learning?
Even (especially?) the military is a dumpster fire but it's at least very good at doing what it exists to do.
Their weights are derived from copyrighted works. Evaluating them preserves the semantic meaning and character of the source material. And the output directly competes against the copyrighted source materials.
The fact they're smudgy and non-deterministic doesn't change how they relate to the rights of authors and artists.
>> The fatal flaw in your reasoning: machines aren't humans. You can't reason that a machine has rights from the fact a human has them. Otherwise it's murder to recycle a car.
> We are talking about the rights of the humans training the models and the humans using the models to create new things.
Then that's even easier, because that prevents appeals to things humans do, like learning, from muddying the waters.
If "training the models" entails loading up copyrighted works into your system (e.g. encoded them during training), you've just copied them into a retrieval system and violated copyright based on established precedent. And people have prompted verbatim copyrighted text out of well-known LLMs, which makes it even clearer.
And then to defend LLM training you're left with BS akin to claiming an ASCII-encoded copy of a book is not a copyright violation, because the book is paper and ASCII is numbers.
I mean, name 2 things anyone owns that aren't dumpster fires?
A long time ago, industrial engineers used to say, "Even Toyota has recalls."
Something being a dumpster fire is so common nowadays that you really need a better reason to argue in support of a given entity's ownership. (Or even non-ownership for that matter.)
That said, there are plenty of successful government actions across the world, where Europe or Japan probably have a good advantage with solid public services. Think streets, healthcare, energy infrastructure, water infrastructure, rail, ...
I understand what you're saying, but the way you're framing it isn't what I really have a problem with. I still don't agree with the idea that I can't make my own physical copies of the Harry Potter books, identical word for word. I think people can choose to buy the physical books from the original publisher because they want to support them or like the idea that it's the "true" physical copy. And I'm going to push back on that a million times less than on the concept of things like Moana comic books. But still, it's infringing copyright for me to make Moana comic books in my own home, in private, never showing them to anyone. And that's ridiculous.
The law is supposed to be impartial. So if the answer is different, then it's not really a law problem we're talking about.
[0] https://en.wikipedia.org/wiki/Dowling_v._United_States_(1985...
1. The National Weather Service. Crown jewel and very effective at predicting the weather and forecasting life threatening events.
2. IRS, generally very good at collecting revenue.
3. National Interagency Fire Center / US Forest Service tactical fire suppression
4. NTSB/US Chemicals Safety Board - Both highly regarded.
5. Medicare - Basically clung to with talons by seniors, revealed preference is that they love it.
6. DOE National Labs
7. NIH (spicy pick)
8. Highway System
There are valid critiques of all of these but I don’t think any of them could be universally categorized as a complete dumpster fire.
My proposal is that it's a Luddite knee-jerk reaction to things people don't understand and don't like. They sense and fear change. For instance, here you say it's an issue when AI uses something as a source that you don't have copyright to. Allow me to update your sentence: "Every paper every scientist or academic wrote that references any copyrighted work becomes an issue". What you said just isn't true. The copyright refers to the right to copy a work.
Distribution: Sure. License your content however you want. That said, in the US a license prohibiting you from READING something just wouldn't be possible. You can limit distribution, copying, etc. This is how journalists can write about sneak previews or leaked information or misfiled court documents released when they should be under seal. The leaking, i.e. the distribution, might violate a contract or a license, but the reading thereof is really not something that US law or common law thinks it has a right to control, except in the case of the state classifying secrets. As well, here we have people saying "my song in 1983 that I put out on the radio, I don't want AI listening to that song." Did your license in 1983 prohibit computers from processing your song? Does that mean digital radio can't send it out? Essentially that ship has sailed, full stop, without new legislation.
On my last points, I think you're missing my point. Fan fiction is legal if you're not trying to profit from it. It is almost impossible to perfectly copy a painting, although some people are pretty good at it. I think it's perfectly legal to paint a super close copy of, say, Starry Night, and sell it as "Starry Night by Jason Lotito." In any event, the discourse right now claims it's wrong for AI to look at and learn from paintings and photographs.
Except in this case, we already have the equivalent of "laws about oxygen consumption": copyright.
> Intellectual property rights were questionable from the start and only got worse; they've been barely keeping up with digital media in the past couple decades, and they're entirely ill-equipped to deal with generative AI.
The laws are not "entirely ill-equipped to deal with generative AI," unless your interests lie in breaking them. All the hand-waving about the laws being "questionable" and "entirely ill-equipped" is just noise.
Under current law OpenAI, Google, etc. have no right to cheap training data, because someone made that data and may have the reasonable interest in getting paid for their efforts. Like all businesses, those companies would ideally like the law to be unfairly biased towards them: to protect them when they charge as much as they can, but not protect anyone else so they can pay as little as possible.
I can't speak for Stephen but I absolutely do. I program for fun all the time.
> Should people not be able to make money off their effort?
Is anyone arguing otherwise?
Of course, if you start your thought by dismissing anybody who doesn't share your position as not sane, it's easy to see how you could fail to capture any of that.
^[1] https://arstechnica.com/tech-policy/2025/05/judge-on-metas-a...
Even saying the military is a dumpster fire isn't accurate. The military has led trillions of dollars worth of extraction for the wealthy and elite across the globe.
In no sane world can you call the ability to protect GLOBAL shipping lanes a failure. That one service alone has probably paid for itself thousands of times.
We aren't even talking about things like public education (high school used to be privatized and something only the elites enjoyed 100 years ago; yes, public high school education isn't even 100 years old) or libraries or public parks.
---
I really don't understand this "gobermint iz bad" meme you see in tech circles.
I get so much more out of my taxes compared to equivalent corporate bills that it's laughable.
Government is comprised of people, and for the last 50 years the government has mostly been giving money and establishing programs for the small cohorts that have been hoarding all the wealth. Somehow this is never an issue with the government, however.
Also, I never understand the arguments from these types, because if you think the government is bad then you should want it to be better. Better mostly meaning having more money to redistribute and more personnel to run programs, but it's never about those things. It's always attacking the government to make it worse at the expense of the people.
There is fair use, but fair use is an affirmative defense to infringing copyright. By claiming fair use you are simultaneously admitting infringement. The idea that you have to defend your own private expression of ideas based on other ideas is still wrong in my view.
For national security reasons I'm perfectly fine with giving LLMs unfettered access to various academic publications, scientific and technical information, that sort of thing. I'm a little more on the fence about proprietary code, but I have a hard time believing there isn't enough code out there already for LLMs to ingest.
Otherwise though, what is an LLM with unfettered access to copyrighted material better at vs one that merely has unfettered access to scientific / technical information + licensed copyrighted material? I would suppose that besides maybe being a more creative writer, the other LLM is far more capable of reproducing copyrighted works.
In effect, the unrestricted LLM is a more capable plagiarism machine than the restricted one, not necessarily a more intelligent one, and otherwise doesn't really add any more value. What do we have to gain from condoning it?
I think the argument I'm making is a little easier to see in the case of image and video models. The model that has unfettered access to copyrighted material is more capable, sure, but more capable of what? Capable of making images? Capable of reproducing Mario and Luigi in an infinite number of funny scenarios? What do we have to gain from that? What reason do we have for not banning such models outright? Not like we're really missing out on any critical security or economic advantages here.
Instead of the understanding that copyrights and patents are temporary state-granted monopolies meant to benefit society they are instead framed as real perpetual property rights. This framing fuels support for draconian laws and obscures the real purpose of these laws: to promote innovation and knowledge sharing and not to create eternal corporate fiefdoms.
I agree, what followed was.
> I can start a 1-900 answer line and use any reference I want to answer your question
Yeah, that's not what we are talking about. If you think it was, you should probably do some more research on the topic.
This can only be referring to training, the models themselves are a rounding error in size compared to their training sets.
> If we allow AI companies to train LLMs on copyrighted works without paying for that access, we are choosing to reward these companies instead of the humans who created the data upon which these companies are utterly reliant for said LLMs.
It's interesting how much parallel there is here to the idea that company owners reap the rewards of their employees' labor while doing no additional work themselves. The fruits of labor should go to the individuals who labor; I 100% agree.
Those who were immune were put under the scalpel."
There are college kids with bigger "copyright collections" than that...
If you don't have a tape recording of Trump saying "Fire Shira, I don't like what she did and she needs to get out," then you are making assumptions about both his reasons and his involvement. No one has that tape. Which means any claim that this is what happened is entirely speculation. We've seen a decade of people claiming these assumptions as fact, and it's really tiresome.
https://www.theguardian.com/technology/2012/sep/11/minnesota... [1]
Your proposal is moving goal posts.
> Allow me to update your sentence: "Every paper every scientist or academic wrote that references any copyrighted work becomes an issue".
No, I never said that. Fair Use exists.
> Fan fiction is legal if you're not trying to profit from it.
No, it's not.[1] You can make arguments that it should be, but, no.
[1] https://jipel.law.nyu.edu/is-fanfiction-legal/
> I think you're missing my point
I think you got called out, and you are now trying to reframe your original comment so it comes across as having accounted for the things you were called out on.
You think you know what you are talking about, but you don't. And you rely on the fact that you think you do, which is how you lose the money you do.
If you consider it right to get value from the work of your family, and you consider that intellectual work (such as writing a book) to be valuable, then as an inheritor, you should get value from it. And since the way we give value to intellectual work is though copyright, then inheritors should inherit copyright.
If you think that copyright should not exceed lifetime, then the logical consequences would be one of:
- inheritance should be abolished
- intellectual work is less valuable than other forms of work
- intellectual property / copyright is not how intellectual work should be rewarded
There are arguments for abolishing inheritance, it is after all one of the greatest sources of inequality. Essentially, it means 100% inheritance tax in addition to all the work going into the public domain. Problematic in practice.
For the value of intellectual work, well, hard to argue against it on Hacker News without being a massive hypocrite.
And there are alternatives to copyright (i.e. artificial scarcity) for compensating intellectual work like there are alternatives to capitalism. Unfortunately, it often turns out poorly in practice. One suggestion is to have some kind of tax that is fairly distributed between authors in exchange for having their work in the public domain. Problem is: define "fairly".
Note that I am not saying that copyright should last long, you can make copyright 20 years, humans or corporate, inheritable. Simple, gets in the public domain sooner, fairer to older authors, already works for patents. Why insist on "lifetime"?
AI is capable of reproducing copyright (motte) therefore training on copyright is illegal (bailey).
Or C) large corporations (and the wealthy) do whatever they want while you still get extortion letters because your kid torrented a movie.
They really do get to have their cake and eat it too, and I don't see any end to it.
>The RIAA accused her of downloading and distributing more than 1,700 music files on file-sharing site KaZaA
Emphasis mine. I think most people would agree that whatever AI companies are doing with training AI models is different than sending verbatim copies to random people on the internet.
I think most artists who had their works "trained on by AI" without compensation would disagree with you.
That might be true but I don't see how it's relevant. There's no provision in copyright law that gives a free pass to humans vs machines, or makes a distinction between them.
This is exactly wrong. You can copy all of Harry Potter into your journal as many times as you want legally (creating copies) so long as you do not distribute it.
I'm not talking about learning. I'm talking about the complete reproduction of a copyrighted work. It doesn't matter how it happens.
"have their cake and eat it too" allegations only work if you're talking about the same entity. The copyright maximalist corporations (ie. publishers) aren't the same as the permissive ones (ie. AI companies). Making such characterizations make as much sense as saying "citizens don't get to eat their cake and eat it too", when referring to the fact that citizens are anti-AI, but freely pirate movies.
Nothing. You don't even need the LLC. I don't think anyone got prosecuted for only downloading. All prosecutions were for distribution. Note that if you're torrenting, even if you stop the moment it's finished (and thus never goes to "seeding"), you're still uploading, and would count as distribution for the purposes of copyright law.
If I'm learning about kinematics maybe it would be more effective to have comparisons to Superman flying faster than a speeding bullet and no amount of dry textbooks and academic papers will make up for the lack of such a comparison.
This is especially relevant when we're talking about science-fiction which has served as the inspiration for many of the leading edge technologies that we use including stuff like LLMs and AI.
Disk size is irrelevant. If you lossy-compress a copyrighted bitmap image to small JPEG image and then sell the JPEG image, it's still copyright infringement.
Library of Congress
National Park Service
U.S. Geological Survey (USGS)
NASA
Smithsonian Institution
Centers for Disease Control and Prevention (CDC)
Social Security Administration (SSA)
Federal Aviation Administration (FAA) air traffic control
U.S. Postal Service (USPS)
[1] used purely as an example
Musicians remain subject to abuse by the recording industry; they're making pennies on each dollar you spend on buying CDs^W^W streaming services. I used to say, don't buy that; go to a concert, buy beer, buy merch, support directly. Nowadays live shows are being swallowed whole through exclusivity deals (both for artists and venues). I used to say, support your favourite artist on Bandcamp, Patreon, etc. But most of these new middlemen are ready for their turn to squeeze.
And now on top of all that, these artists' work is being swallowed whole by yet another machine, disregarding what was left of their rights.
What else do you do? Go busking?
Those extra steps are meaningfully different. In your description, a casual observer could compare the two JPEGs and recognize the inferior copy. However, AI has become so advanced that such detection is becoming impossible. It is clearly voodoo.
Can you link to the exact comments he made? My impression was that he was upset at the fact that they broke the T&C of OpenAI, and DeepSeek's claim of being much cheaper to train than OpenAI didn't factor in the fact that it required OpenAI's model to bootstrap the training process. Neither of them directly contradicts the claim that training is copyright infringement.
???
Did you not literally comment the following?
>A new research paper is obviously materially different from "rearranging that text to create a marginally new text".
What did you mean by that, if that's not your claim?
Another way would be to train an internal model directly on published works, use that model to generate a corpus of sanitary rewritten/reformatted data about the works still under copyright, then use the sanitized corpus to train a final model. For example, the sanitized corpus might describe the Harry Potter books in minute detail but not contain a single sentence taken from the originals. Models trained that way wouldn't be able to reproduce excerpts from Harry Potter books even if the models were distributed as open weights.
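The two-stage pipeline described above can be sketched roughly like this (a minimal illustration, not a real training system; `describe_without_quoting` is a hypothetical stand-in for the internal model's paraphrasing step, here reduced to word statistics):

```python
from collections import Counter

def describe_without_quoting(text: str) -> str:
    # Hypothetical stand-in for the internal model: describe the work
    # statistically instead of quoting it, so no original sentence survives.
    counts = Counter(text.lower().split())
    top = ", ".join(word for word, _ in counts.most_common(5))
    return f"A work of {len(text.split())} words, most frequently using: {top}."

def build_sanitized_corpus(originals: list[str]) -> list[str]:
    # Stage 1: rewrite every source into a description that carries
    # information about the work but none of its expression.
    corpus = [describe_without_quoting(doc) for doc in originals]
    # Safety check: no original sentence may survive verbatim.
    for doc, rewritten in zip(originals, corpus):
        for sentence in doc.split("."):
            assert not sentence.strip() or sentence.strip() not in rewritten
    # Stage 2 (not shown): train the final model on `corpus` only.
    return corpus
```

A real sanitizer would be a model rather than word counts, but the structural point holds: the final model only ever sees the rewritten corpus, so it cannot emit excerpts it was never trained on.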
Some try to make the argument of "but that's what humans do and it's allowed," but that's not a real argument, as it has not been proven, nor is it easy to prove, that machine learning equates to human reasoning. In the absence of evidence, the law assumes NO.
Also in general, grey areas don't mean those things are legal.
Edit: this remains true even if you don't like it, ¯\_(ツ)_/¯.
Nope.
You have a right to not publish any work that you own. This is protected by Copyright law.
Copyright covers you from the moment you create some sort of original work (in a tangible medium).
Society doesn't need to measure my mind, they need to measure the output of it. If I behave like a conscious being, I am a conscious being. Alternatively you might phrase it such that "Anything that claims to be conscious must be assumed to be conscious."
It's the only answer to the p-zombie problem that makes sense. None of this is new, philosophers have been debating it for ages. See: https://en.wikipedia.org/wiki/Philosophical_zombie
However, for copyright purposes we can make it even simpler. If the work is new, it's not covered by the original copyright. If it is substantially the same, it isn't. Forget the arguments about the ghost in the machine and the philosophical mumbo-jumbo. It's the output that matters.
In that case I don't think there's anything controversial here? Nobody thinks that if you ask AI to reproduce something verbatim, that you should get a pass because it's AI. All the controversy in this thread seems to be around the training process and whether that breaks copyright laws.
AI companies claim it falls under fair use. Pirates use the same excuse too. Just look at all the clips uploaded to YouTube with an "it's fair use guys!" note in the description. The only difference between the two is that the former is novel enough that there are plausible arguments for both sides, and the latter has been so thoroughly litigated that you'd be laughed out of the courtroom for claiming that your torrenting falls under fair use.
That's the thing though: intuitively, they do - training the model != generating from the model, and it's the output of a generation that violates copyright (and the user-supplied prompt is a crucial ingredient in getting the potentially copyrighted material to appear). And legally, that's AFAIK still an open question.
> Like all businesses, those companies would ideally like the law to be unfairly biased towards them: to protect them when they charge as much as they can, but not protect anyone else so they can pay as little as possible.
That's 100% true. I know that, I'm not denying that. But in this particular case, I find my own views align with their case. I'm not begrudging them for raking in heaps of money offering generative AI services, because they're legitimately offering value that's at least commensurate (IMHO it's much greater) to what they charge, and that value comes entirely from the work they're uniquely able to do, and any individual work that went into training data contributes approximately zero to it.
(GenAI doesn't rely on any individual work in training data; it relies on the breadth and amount being a notable fraction of humanity's total intellectual output. It so happens that almost all knowledge and culture is subject to copyright, so you couldn't really get to this without stepping on some legal landmines.)
(Also, much like AI companies would like the law to favor them, their opponents in this case would like the law to dictate they should be compensated for their works being used in training data, but compensated way beyond any value their works bring in, which in reality is, again, approximately zero.)
Copyright laws were themselves created by the printing press making it easy to duplicate works, whereas previously if you half-remembered something that was just "inspiration".
But that only gave the impression of helping creative people: today, any new creative person has to compete with the entire reproducible canon of all of humanity before them. Can you write fantasy so well that new readers pick you up over Pratchett or Tolkien?
Now we have AI which are "inspired" (perhaps) by what they read, and half-remember it, in a way that seems similar to pre-printing-press humans sharing stories even if the mechanism is different.
How this is seen according to current law likely varies by jurisdiction; but the law as it is today matters less than what the law will be when the new ones are drafted to account for GenAI.
What that will look like, I am unsure. Could be that for training purposes, copyright becomes eternal… but it's also possible that copyright may cease to exist entirely — laws to protect the entire creative industry may seem good, but if AI displaces all humans from economic activity, will it continue to matter?
I didn't mean to imply that the AI can't quote Shakespeare in context, just that it shouldn't try to pass off Shakespeare as its own or plagiarize huge swathes of the source text.
> People are being so rabid and unreasonable here.
People here are more reasonable than average. Wait until mainstream society starts to really feel the impact of all this.
Those procedures are how you ask for permission. As you say, it usually involves a fee but doesn't have to.
The same will happen with AI, no one will go to jail but perhaps it is ruled out that LLMs infringe copyright.
(Same thing happened in the early days of YouTube as well, the solution was stuff like MusicDNA, etc...)
Your radical behaviourism seems an advantage to you when you want to delete one disfavoured part of copyright law, but I assure you, it isn't in your interest. It doesn't universalise well at all. You do not want to be defined by how you happen to verbalise anything, unmoored from your intentions, goals, and so on.
The law, and society, imparts much to you that is never measured and much that is unmeasurable. What can be measured is, at least, extremely ambiguous with respect to those mental states which are being attributed. Because we do not attribute mental states by what people say -- this plays very little role (consider what a mess this would make of watching movies). And none of course in the large number of animals which share relevant mental states.
Nothing of relevance is measured by an LLM's output. It is highly unambiguous: the LLM has no mental states, and is thus irrelevant to the law, morality, society and everything else.
It's an obscene sort of self-injury to assume that whatever kind of radical behaviourism is necessary to hype the LLM is the right sort. Hype for LLMs does not lead to a credible theory of minds.
The current illegality of the piracy website prevents them from offering a service as nice as Steam. It has to be a sketchy torrent hub that changes URLs every few months. If it was as easy as changing the url to freesteampowered.com or installing an extension inside the steam launcher, the whole "piracy is a service issue" argument loses all relevance. The industry would become unsustainable without DRM (which would be technically legal to crack, but also more incentivized to make harder to crack).
That is even worse without copyright, as then every previous work would be free and you would have to compete with better works that are also free for people.
Go back to the roots of copyright and the answers should be obvious. According to the US constitution, copyright exists "To promote the Progress of Science and useful Arts" and according to the EU, "Copyright ensures that authors, composers, artists, film makers and other creators receive recognition, payment and protection for their works. It rewards creativity and stimulates investment in the creative sector."
If I publish a book and tech companies are allowed to copy it, use it for "training", and later regurgitate the knowledge contained within to their customers then those people have no reason to buy my book. It is a market substitute even though it might not be considered such under our current copyright law. If that is allowed to happen then investment will stop and these books simply won't get written anymore.
The US Supreme Court disagrees, the right of publicity and intellectual property law are explicitly linked.
> The broadcast of a performer’s entire act may undercut the economic value of that performance in a manner analogous to the infringement of a copyright or patent. — Justice White
Oh really? They didn't have any problem coming after people who installed copyrighted Windows. BSA. But now Microsoft turns a blind eye because it suits them.
Humans are also very useful and transformative.
People would just delete the malware (DRM) out of the source code that is no longer restricted by copyright.
If your argument is that copyright is good because it discourages DRM, I think you have a very evidently weak argument.
There is also the fact that copyright holders will pressure your ISP into sending threatening letters and shutting off your Internet for piracy, even without you seeding. I haven't gotten the impression that you are in the clear for pirating as long as you don't distribute.
I agree. If you can pay the judge, the congress or the president, it is definitely not stealing. It is (the best) democracy (money can buy). /s
Maybe selling books? Maybe other jobs? The same way that they made money for thousands of years before copyright, really. Books and other arts did exist before copyright!
> and why would someone pay them, if their work is free to be copied at will?
I don't think it's really a matter of if people will pay them. If their art is good, of course people will pay them. People feel good about paying for an original piece of art.
The question is really more about if people will be able to get obscenely rich over being the original creator of some piece of art, to which the answer is it would indeed be less likely.
Again, show me an example where an artist's style was used for copyright infringement in court. Can you produce even one example?
We didn't have modern novelists a thousand years ago. We didn't have mass production until ~500 years ago, and copyright came in in the 1700s. We didn't have mass-produced pulp fiction like we do today until the 20th century. There is little copyright-less historical precedent to refer to here; even if we carve out the few hundred years between the printing press and copyright, it's not as though everyone was mass-consuming novels, as the literacy rate was abysmal. I wonder what artist yearns for the 1650s.
> If their art is good, of course people will pay them.
You say this as if it were a fact, but that's not axiomatic. Once the first copy is in the wild it's fair game for anyone to copy it as they will. Who is paying them? Should the artists return to the days of needing a wealthy patron? Is patreon the answer to all of our problems?
> Maybe selling books?
But how? To who? A publishing house isn't going to pick them up, knowing that any other publishing house can start selling the same book the minute it shows to be popular, and if you're self publishing and you're starting to make good numbers then the publishing houses can eat you alive.
> The question is really more about if people will be able to get obscenely rich over being the original creator of some piece of art, to which the answer is it would indeed be less likely.
No, the question is if ordinary people could make a living off their novels without copyright. It's very hard today, but not impossible. Without copyright it wouldn't be.
"copyright law assigns a set of exclusive rights to authors: to make and sell copies of their works, to create derivative works, and to perform or display their works publicly"
"The owner of a copyright has the exclusive right to do and authorize others to do the following: To reproduce the work in copies or phonorecords;To prepare derivative works based upon the work;"
"Commonly, this involves someone creating or distributing"
https://www.copyright.gov/what-is-copyright/
"U.S. copyright law provides copyright owners with the following exclusive rights: Reproduce the work in copies or phonorecords. Prepare derivative works based upon the work."
https://internationaloffice.berkeley.edu/students/intellectu...
"Copyright infringement occurs when a work is reproduced, distributed, displayed, performed or altered without the creator’s permission."
There are endless legitimate sources for this. Copyright protects many things, not just distribution. It very clearly disallows the creation and production of copyrighted works.
Source: https://futurism.com/the-byte/facebook-trained-ai-pirated-bo...
The general public has been lectured for decades about how piracy is morally wrong, but as soon as startups and corporations are in it for profit, everybody looks away?
I don't mean to say that they literally have to speak the words by using their meat to make the air vibrate. Just that, presuming it has some physical means, it be capable (and willing) to express it in some way.
> It's an obscene sort of self-injury to assume that whatever kind of radical behaviourism is necessary to hype the LLM is the right sort.
I appreciate why you might feel that way. However, I feel it's far worse to pretend we have some undetectable magic within us that allows us to perceive the "realness" of others peoples consciousness by other than physical means.
Fundamentally, you seem to be arguing that something with outputs identical to a human is not human (or even human like), and should not be viewed within the same framework. Do you see how dangerous an idea that is? It is only a short hop from "Humans are different than robots, because of subjective magic" to "Humans are different than <insert race you don't like>, because of subjective magic."
I feel like you're shoving all information under the same label. The most profitable corporations are trading in information that isn't subject to copyright, and it's facts - how you drive, what you eat, where you live. It's newly generated ideas. Maybe it is in how the data is sorted, but they aren't copyrighting that either.
If we're going to overthrow artificial entrenchments of capitalism, I feel like there are better places to start than copyright. Does it need changes? Absolutely, there's certainly exploitation, but I still don't see "get rid of copyright entirely" as a good approach. Weirdly, copyright is one of the few places where people argue for exactly that. Sometimes the criminal justice system convicts the wrong person, and there should be reform. It's also often criticized as a measure of control for capitalistic oligarchs. Should step one be getting rid of the legal system entirely?
The fact that copyright law is easy to violate and hard to enforce doesn't stop Nintendo from burning millions of dollars on legal fees to engage in life-ruining enforcement actions against randos making fangames.
"Democratization" with respect to copyright law would be changing the law to put Mario in the public domain, either by:
- Reducing term lengths to make Mario literally public domain. It's unclear whether or not such an act would survive the Takings Clause of the US Constitution. Perhaps you could get around that by just saying you can't enforce copyrights older than 20 years even though they nominally exist. Which brings us to...
- Adding legal exceptions to copyright to protect fans making fan games. Unlikely, since in the US we have common law, which means our exceptions have to be legislated from the judicial bench, and judges are extremely leery of 'fair use' arguments that basically say 'it is very inconvenient for me to get permission to use the thing'.
- Creating some kind of social copyright system that "just handles" royalty payments. This is probably the most literal interpretation of 'democratize'. I know of few extant systems for this, though - like, technically ASCAP is this, but NOBODY would ever hold up ASCAP as an example of how to do licensing right. Furthermore without legal backing, Nintendo can just hold out and retain traditional "my way or the highway" licensing rights.
- Outright abolishing copyright and telling artists to fend for themselves. This is the kind of solution that would herald either a total system collapse or extreme authoritarianism. It's like the local furniture guy selling sofas at 99% off because the Mafia is liquidating his gambling debts. Sure, I like free shit, but I also know that furniture guy is getting a pair of cement shoes tonight.
None of these are what AI companies talk about. Adding an exception just for AI training isn't democratizing IP, because you can't democratize AI training. AI is hideously memory-hungry and the accelerators you need to make it work are also expensive. I'm not even factoring in the power budget. They want to replace IP with something worse. The world they want is one where there are three to five foundation models, all owned and controlled by huge tech megacorps, and anyone who doesn't agree with them gets cut off.
All right of publicity laws are intellectual property laws but not all intellectual property laws are right of publicity laws.
All copyright laws are intellectual property laws but not all intellectual property laws are copyright laws.
Right of publicity laws are intellectual property laws because the right of publicity is intellectual property. I don't know how else to articulate this over the internet, maybe it's time to consult an AI?
A reasonable compromise then is that you can train an AI on Wikipedia, more-or-less. An AI trained this way will have a robust understanding of Superman, enough that it can communicate through metaphor, but it won't have the training data necessary to create a ton of infringing content about Superman (well, it won't be able to create good infringing content anyway. It'll probably have access to a lot of plot summaries but nothing that would help it make a particularly interesting Superman comic or video).
To me it seems like encyclopedias use copyrighted pop culture in a way that constitutes fair use, and so training on them seems fine as long as they consent to it.
As for the zeitgeist, I'm not sure anything has materially changed. Recently, creators have been very upset over Silicon Valley AI companies ingesting their output. Is this really reflective of "general internet sentiment"? Would those same people have supported abolition of copyright in the past? I doubt it.
It's still copyright infringement if I download a pirated movie and never watch it (writing the bytes to the disk == "training" the disk's "model", reading the bytes back == "generating" from the disk's "model").
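A trivial sketch of that analogy (pure illustration, not a claim about how LLMs actually work; the helper names and file path here are made up):

```python
# Illustration only: a disk is a lossless "model" of whatever you write to it,
# so "generating" from it reproduces the work bit for bit.
import os
import tempfile

def train_disk_model(path, data):
    # "Training": write the bytes verbatim.
    with open(path, "wb") as f:
        f.write(data)

def generate_from_disk_model(path):
    # "Generating": read the bytes back.
    with open(path, "rb") as f:
        return f.read()

movie = b"pretend these are the bytes of a movie" * 1000
path = os.path.join(tempfile.mkdtemp(), "movie.bin")
train_disk_model(path, movie)
assert generate_from_disk_model(path) == movie  # bit-exact reproduction
```

The point of the analogy is that the disk "model" is lossless, so storage alone already constitutes reproduction, whether or not anyone ever reads the copy back.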
> That's 100% true. I know that, I'm not denying that. But in this particular case, I find my own views align with their case.
IMHO, unless you're massively wealthy and/or running a bigcorp, people like you benefit a lot more from copyright than they are harmed by it. In a world without copyright protection, some bigcorp will be able to use its size to extract the value from the works that are out there (i.e. Amazon and Netflix will stop paying royalties instantly, but they'll still have customers because they have the scale to distribute). Copyright just means the little guy who's actually creating has some claim to get some of the value directed back to them.
> and any individual work that went into training data contributes approximately zero to it.
Then cut all those works out of the training set. I don't think it's an excuse that the infringement has to happen on a massive scale to be of value to the generative AI company.
It's not related to copyright. It is an example of your hypothetical standard required to attribute something to Trump. My point is that even when he is on camera saying something, that does not prevent the post facto rationalizations. Even if he was on tape firing this person, people would rationalize this away too.
Moana and Moana 2 are both animated movies that have already been made. They're not just figments of one's imagination.
> If I made a Moana comic book, with an entirely original storyline and original art and it was all drawn in my own style and not using 3D assets similar to their movies, that is violating copyright
It might be, or it might not. Copyright protects the creation of derivative works (17 USC 101, 17 USC 103, 17 USC 106), but it's the copyright holder's burden to persuade the court that the allegedly infringing work with the character Moana in it is derivative of their protected work.
Ask yourself the question: what is the value of Moana to you in this hypothetical? What if you used a different name for the character and the character had a different backstory and personality?
> I still don't agree with the idea that I can't make my own physical copies of Harry Potters books
You might think differently if you had sunk thousands of hours into creating a new novel and creative work was your primary form of income.
> But still, it's infringing copyright for me to make Moana comic books in my own home, in private, and never showing them to anyone.
It seems unlikely that Disney would go after you for that. Kids do it all the time.
Eh. I don't know the history, but my understanding was they were created because the printing press allowed others to deny the original creators the profits to their work, and direct those profits to others who had no hand in it.
After all, in market terms: a publisher that pays its authors can't compete with another publisher that publishes the same works without paying any authors. A world without copyright is one where some publisher still makes money, but it's a race to the bottom for authors.
> But that only gave the impression of helping creative people: today, any new creative person has to compete with the entire reproducible canon of all of humanity before them — can you write fantasy so well that new readers pick you up over Pratchett or Tolkien?
Here's a hole in your thinking: if you like fantasy, would you be content to just re-read Tolkien over and over, forever? Don't you think that'd get boring no matter how good he was?
And empirically, "new creative [people]" manage to compete with Pratchett or Tolkien all the time, as new fantasy works are still being published and read. Do you remember that "Game of Thrones" was a mass cultural phenomenon not too long ago?
I'm worried because decision-makers genuinely don't seem to be bothered very much by actual capabilities, and are perfectly happy to trade massive reductions in quality for cost savings. In other words, I don't think the limits of LLMs will actually constrain the decision-makers.
Compulsory licenses are interesting, aren't they? It just feels wrong. If Metallica doesn't want me to butcher their songs, why should they be forced to allow it?
I honestly can't see how this directly addresses fair use; it's an odd sweeping statement. It implies that inventing something that borrows a little from many different copyrighted items is somehow not fair use. If the borrowing were one-for-one, yes, but it's not, so this is basically saying creativity is not fair use. And if it's not saying that, and instead refers to competition in the existing market, then it's a statement about the public good, not fair use: basically a matter for legislators and what the purpose of copyright is.
(Leaving aside whether the weights of an LLM actually encode the content of any random snippet of training text. Some stuff does get memorized, but how much and how exactly? That's not the point of the LLM, unlike the jpeg or database.)
And, again, look at the search snippets case - these were words produced by other people, directly transcribed, so open-and-shut from a certain point of view. But the decision went the other way.
I don't see how that affects the argument. The machines are being used by humans. Your argument then boils down to the idea that you can do something manually but it becomes illegal if you use a tool to do it efficiently.
That sounds like you're arguing that they should be legal. Copyright law protects specific expressions, not handwavy "smudgy and non-deterministic" things.
The difference here is that we have people like yourself: those who have zero faith in our government and as such act as double agents or saboteurs. When people such as yourself gain power in the legislature, they "starve the beast", meaning they purposefully deconstruct sections of our government so that they have justification for their ideological belief that our government doesn't work.
You guys work backwards. The foregone conclusion is that government programs never work, and then you develop convoluted strategies to prove that.
I think that sort of assumption of insincerity is worse than what you're accusing them of. You might not like their argument, but it's not inherently incorrect for them to argue that because humans have the right to do something, humans have the right to use tools to do that something and humans have the right to group together and use those tools to do something at a large scale.
I can go through and manually compress "Revenge of the Sith" and then post it online. Or, I can use a compression program like handbrake. Regardless, it is copyright infringement.
Can AI reproduce almost* the same things that exist in its training data? Sometimes, so sometimes it's copyright infringement. Doesn't help that it's explicitly for-profit and seeks to obsolesce and siphon value from its training material.
If it were that cut and dried we wouldn't have this conversation at all, so clearly your position isn't objectively true.
If we're going to be giving some rights to LLMs for convenient for-profit ventures, I expect some in-depth analysis on whether that is or is not slavery. You can't just anthropomorphize a computer program when it makes you money but then conveniently ignore the hundreds of years of development of human rights. If that seems silly, then I think LLMs are probably not like humans and the comparisons to human learning aren't justified.
If it's like a human, that makes things very complicated.
Either force AI companies to compensate the artists they're being "inspired" by, or let people torrent a copywashed Toy Story 5.
As a consumer, it would be amazing if there were compulsory licenses for film and TV; then we wouldn't have to subscribe to 70 different services to get to the things we want to see. And there would likely be services that spring up to redistribute media where the rightsholders aren't able to or don't care to; it might be pulled from VHS that fans recorded off of TV in the old days, but at least it'd be something.
LLMs seek to be a for-profit replacement for a variety of paid sources. They say "hey, you can get the same thing as Service X for less money with us!"
That's a problem, regardless of how you go about it. It's probably fine if I watch a movie with my friends, who cares. But distributing it over the internet for free is a different issue.
So in those cases, the original authors might have a case. Generally you don't see these LLMs doing that, though.
>Doesn't help that it's explicitly for-profit and seeks to obsolesce and siphon value from its training material.
Doesn't hurt either. That's a reason to be butthurt, but that's not a legal argument.
It is a legal argument, fair use specifically takes into account the intention. Just using it for commercial ventures makes the water hotter.
>Meta allegedly tried to conceal the seeding by not using Facebook servers while downloading the dataset to "avoid" the "risk" of anyone "tracing back the seeder/downloader" from Facebook servers
Sounds like they used a VPN, set the upload speed to 1kb/s and stopped after the download is done. If the average Joe copied that setup there's 0% chance he'd get sued, so I don't really see a double standard here. If anything, Meta might get additional scrutiny because they're big enough of a target that rights holders will go through the effort of suing them.
This is the case anyway; there are many writers competing for the opportunity to be published, so the publishers have a massive advantage, and it is the technology of printing (and cheap paper) that makes this a one-sided relationship — if every story teller had to be heard in person, with no recordings or reproductions possible, then story tellers would be found in every community, and they would be valued by their community.
> Here's a hole in your thinking: if you like fantasy, would you be content to just re-read Tolkien over and over, forever? Don't you think that'd get boring no matter how good he was?
The examples aren't meant to be exclusive, and Pratchett has a lot of books.
There are far more books on the market right now than a human can read in a lifetime. At some point (we may have already passed it) there will be far more good books on the market than a human can read in a lifetime, at which point it's not quality, it's fashion.
> And empirically, "new creative [people]" manage to compete with Pratchett or Tolkien all the time, as new fantasy works are still being published and read.
At some point, there will be more books at least as good as Pratchett, Tolkien, Le Guin, McCaffrey, Martin, Heinlein, Niven etc. in each genre, than anyone can read.
> Do you remember that "Game of Thrones" was a mass cultural phenomenon not too long ago?
Published: August 1, 1996 — concurrently with Pratchett.
Better example would have been The Expanse — worth noting that SciFi has a natural advantage over (high) fantasy or romance, as the nature of speculative science fiction means it keeps considering futures that are rendered as obsolete as the worn-down buttons on the calculator that Hari Seldon was rumoured to keep under his pillow.
>LLMs seek to be a for-profit replacement for a variety of paid sources. They say "hey, you can get the same thing as Service X for less money with us!"
What's an LLM supposed to be a substitute for? Are people using them to generate entire books or news articles, rather than buying a book or an issue of the New York Times? Same goes for movies. No one is substituting Marvel movies with Sora video.
Yes.
> No one is substituting marvel movies with sora video.
Yeah because sora kind of sucks. It's great technology, but turns out text is just a little bit easier to generate than 3D videos.
Once sora gets good, you bet your ass they will.
This article is literally about the copyright office finding AI companies violating copyright law by training their models on copyrighted material. I'm not even sure what you're arguing about anymore.
Humans are capable of reproducing copyright illegally, but we allow them to train on copyrighted material legally.
Perhaps measures should be taken to prevent illegal reproduction, but if that's impossible, or too onerous, there should be utilitarian considerations.
Then the crux becomes a debate over utility, which often becomes a religious debate.
To begin with, this very case of Perlmutter getting fired after her office's report is interesting enough, but let's keep it aside. [0]
First, plenty of lobbying has been afoot, pushing DC to allow training on this data to continue. No intention to stop or change course. [1]
Next, when regulatory attempts were in fact made to act against this open theft, those proposed rules were conveniently watered down by Google, Microsoft, Meta, OpenAI and the US government lobbying against the copyright & other provisions. [2]
If you still think, "so what? maybe by strict legal interpretation it's still fair use" -- then explain why OpenAI is selectively signing deals with the likes of Conde Nast if they truly believe this to be the case. [3]
Lastly, when did you last see any US entity or person face no punitive action whatsoever despite illegally downloading (and uploading) millions of books & journal articles; do you remember Aaron Swartz? [4]
You might not agree with my assessment of 'conspiracy', but are you denying there is even an alignment of incentives contrary to the spirit of the law?
[0] https://www.reuters.com/legal/government/trump-fires-head-us...
[1] https://techcrunch.com/2025/03/13/openai-calls-for-u-s-gover...
[2] https://www.euronews.com/next/2025/04/30/big-tech-watered-do...
[3] https://www.reuters.com/technology/openai-signs-deal-with-co...
[4] https://cybernews.com/tech/meta-leeched-82-terabytes-of-pira...
In the end this all comes down to needing the people to care enough.
This isn't some new phenomenon. We do indeed seize assets from buyers if the seller stole them.
My opinion on the matter at hand is this: Artists who complain about GenAI use the hypothetical that you mentioned, where if you can accurately recreate a copyrighted work through specific model usage, then any distribution of the model is a copyright violation. That's why, according to the argument, fair use does not apply.
The real problem with that is that there's a mismatch between the fair use analysis and the actual use at issue. The complaining artists want the fair use inquiry to focus on the damage to the potential market for works in their particular style. That's where the harm is, according to them. However, what they use to even get to that stage is the copyright infringement allegation that I described earlier: that the models contain their works in a fixed manner which can be derived without permission.
Not to mention that this position elevates malicious use of the models for outright copyright infringement at the output level above the entire class of new works that their use makes possible. It's effectively saying "because these models can technically be used in an infringing way, they infringe our copyright, and any creative potential they could unlock is insignificant in comparison to that simple fact". Of course, that's not the actual real problem, which is that they output completely new works that compete with the originals, even when those outputs aren't derivatives of, nor substantially similar to, any individual copyrighted work.
Here's a very good article outlining my position in a more articulate way: https://andymasley.substack.com/p/a-defense-of-ai-art
Yes, that's why we judge on a case by case basis. The line is blurry.
I think when you're storing copies of such assets in your database that you're well past the line, though.
If we can agree that taking away of your time is theft (wage theft, to be precise), we as those who rely on intellect in our careers should be able to agree that the taking of our ideas is also theft.
>moved to the Ninth Circuit Court of Appeals, where he argued that the goods he was distributing were not "stolen, converted or taken by fraud", according to the language of 18 U.S.C. 2314 - the interstate transportation statute under which he was convicted. The court disagreed, affirming the original decision and upholding the conviction. Dowling then took the case to the Supreme Court, which sided with his argument and reversed the convictions.
This just tells me that the definition is highly contentious. Having the Supreme Court reverse an appeals court ruling already shows misalignment.
Even with NFTs it still was a full year+ of everyone trying to shill them out before the sentiment turned. Machine learning, meanwhile, is actually useful but is being shoved into every hole.
Citation needed. RIAA used to just watch torrents and sent cease and desists to everyone who connected, whether for a minute or for months. It was very much a dragnet, and I highly doubt there was any nuance of "but Your Honor, I only seeded 1MB back so it's all good".
Like Napster et al, their data sets make copies of hundreds of GB of copyrighted works without authors' permission. Ex: The Pile, Common Crawl, RefinedWeb, GitHub Pages. Many copyrighted works on the Internet also have strict terms of use. Some have copyright licenses that say personal use only or non-commercial use.
So, like many prior cases, just posting what isn't yours on Hugging Face is already infringement. Copying it from HF to your training cluster is also infringement. It's already illegal until we get laws like Singapore's that allow training on copyrighted works. Even those have a weakness in the access requirement, which might require following terms of use or licenses in the sources.
Only safe routes are public domain, permissive code, and explicit licenses from copyright holders (or those with sub-license permissions).
So, what do you think about the argument that making copies of copyrighted works violates copyright law? That these data sets are themselves copyright violations?
The AI companies will likely be arguing that they don’t need a license, so any terms of use in the license are irrelevant.
You can try and argue that a compression algorithm is some kind of copy of the training data, but that’s an untested legal theory.
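For contrast, the sense in which a lossless compressor is uncontroversially a "copy" can be shown in a few lines (standard-library zlib; the sample text is arbitrary public-domain prose):

```python
# Lossless compression is a smaller representation from which the exact
# original bytes are provably recoverable. Whether lossy, statistical model
# weights are analogous is the untested legal question.
import zlib

original = b"It is a truth universally acknowledged, " * 200

compressed = zlib.compress(original)
assert len(compressed) < len(original)          # smaller than the original...
assert zlib.decompress(compressed) == original  # ...yet fully recoverable
```

The legal theory in question is whether model weights, which generally cannot round-trip arbitrary training text like this, count as such a representation.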
It’s very unlikely that she would (or even could) have devoted herself to writing fiction in her free time as a passion project without hope of monetary reward and without any way to live from her writing for the ten years it took to finish the Potter series.
And even if she had somehow managed, you’d never hear about it, because without publishers to act as gatekeepers it’d have been lost in the mountains of fanfic and whatever other slop amateur writers upload to the internet.
So what is the equivalent of "digging too much" in a beach for AI? What fundamentally changes when you learn hyper-fast vs just read a bunch of horror novels to inform better horror novel-writing? What's unfair about AI compared to learning from published novels about how to properly pace your story?
These are the things you need to figure out before making a post equating AI learning with copyright infringement. "It's different" doesn't cut it.
It's annoying to see the current pushback against China focusing so much on inconsequential matters with so much nonsense mixed in, because I do think we do need to push back against China on some things.
My issue is with the rhetoric, if that isn't the rhetoric you are using I am not talking about you.
And honestly there is truth to it. Some people (at work, in real life, wherever) might come off as very intelligent, but the moment they say "oh I just read that relevant fact on reddit/twitter/news site 5 minutes ago" you realize they are just like you, repeating relevant information that was consumed recently.
The second one is the "just solve capitalism and we can abolish copyright entirely" argument which is... a total non-starter. Yes, in an idealized utopia, we don't need capitalism or copyright and people can do things just because they want to and society provides for the artist just because humans all value art just that much. It's a fun utopic ideal, but there's many steps between the current state of the world and "you can abolish the idea of copyright", and we aren't even close to that state yet.
Steam is the classic example of how this is effective. You compete with pirates by offering what they can't: a reliable, convenient service. DRM becomes more of a hindrance than a benefit in this situation.
Allowing pirates to offer reliable convenient pirate websites that are "so easy a normie can do it" would be a disaster for all the creative industries. You would need to radically change the rest of society to prevent a total collapse of people making money off art.
Is your argument simply about your interpretation of copyright law and your mentality being that laws are good and breaking them is bad? Because that doesn't seem to be a very informed position to take.
As a private person I no longer feel incentivised to create new content online because I think that all I create will eventually be stolen from me...
"There is nothing inherently wrong with training machines on existing data..." doesn't really conflate a machine with an organism and isn't what I'm talking about.
If you instead had written "I can read the Cat in the Hat to teach my kid to read why can't I use it to train an LLM?"
Then I do think you would be asking with a certain degree of bad faith. You are perfectly capable of distinguishing those two things, in practice, in your everyday life. You do not in fact see them as equivalent.
Your rhetorical choice to be unable to tell the difference would be performative.
You seem to think I'm arguing copyright policy. I really am discussing rhetoric.
Yoinking their work and mass producing slop sure is a line to cross, though.
As did Disney, apparently.
>what use is regulation if you can just buy it?
I don't like it either, but it still comes down to the same issues. We vote in people who can be bought and don't make a scandal out of it when it happens. The first step to fixing that corruption is to make congress afraid of being ousted if discovered. With today's communication structure, that's easier than ever.
But if the people don't care, we see the obvious victor.
If we end up saying it is not illegal, then I demand, that it will not be illegal for everyone. No double standards please. Let us all launder copyrighted material this way, labeling it "AI".
> its popularity is indicative of its quality, even if it doesn't match the standards of a literature PhD for "good writing"
This is a false dichotomy. Literature PhDs are not the only people out there who enjoy high-quality literature more than light entertainment, and anyway, you seem to be admitting that there's a type of fiction that doesn't exist unpaid, so isn't this just proving my point correct?
All that said, even if I accept for the sake of argument that the existence of popular free genre fiction would be enough to prove your point (because, in fairness to you, we were originally talking about Harry Potter, which is as genre as it gets)... I went looking, and there are at most a few sporadic examples. A few minutes of research suggest that some books by Cory Doctorow are among the most popular ones. Also, The Martian by Andy Weir used to be freely available, but isn't anymore as far as I can find.
Sorry, but Cory Doctorow and (formerly) Andy Weir represent a pretty small body of work compared to the entire canon of paid novels, so I'm going to have to call BS on your claim unless you provide some examples of your own.