Stealing is stealing. Let's stop with the double standards.
We've held China accountable for counterfeiting products for decades and regulated their exports. So why should Anthropic be allowed to export their products and services after engaging in the same illegal activity?
That's not what's happening here. People weren't downloading music illegally and reselling it on Claude.ai. And while P2P networks led to some great tech, there's no solid proof they actually improved the music industry.
Against companies like Elsevier locking up the world's knowledge.
Authors are no different from scientists: many had government funding at some point, and it's the publishing companies that took most of the sales.
You can disagree and think Aaron Swartz was evil, but you can't have it both ways.
You can take what Anthropic have shown you is possible and do this yourself now.
isohunt: freedom of information
Saying "they have the money" is not an argument. It's about the amount of effort needed to individually buy, scan, and process millions of pages. If that's already been done for you, why re-do it all?
Right, guys? We don't have "rules for thee but not for me" in the land of the free?
105B+ is more than Anthropic is worth on paper.
Of course they’re not going to be charged to the fullest extent of the law, they’re not a teenager running Napster in the early 2000s.
I'm against Anthropic stealing teachers' work and discouraging them from ever writing again. Some teachers are already saying this (though probably not in California).
The difference is, Aaron Swartz wasn't planning to build massive datacenters with expensive Nvidia servers all over the world.
They make money off the model weights, which is fair use (as confirmed by recent case law).
This was the result of a cruel and zealous overreach by the prosecutor to try to advance her political career. It should never have gone that far.
The failure of MIT to rally in support of Aaron will never be forgiven.
https://torrentfreak.com/spotifys-beta-used-pirate-mp3-files...
Funky quote:
> Rumors that early versions of Spotify used ‘pirate’ MP3s have been floating around the Internet for years. People who had access to the service in the beginning later reported downloading tracks that contained ‘Scene’ labeling, tags, and formats, which are the tell-tale signs that content hadn’t been obtained officially.
https://gizmodo.com/early-spotify-was-built-on-pirated-mp3-f...
Daniel Ek said: "my mission is to make music accessible and legal to everyone, while ensuring artists and rights holders got paid"
Also, the Swedish government has zero tolerance for piracy.
Society underestimates the chasm that exists between an idea and raising sufficient capital to act on those ideas.
Plenty of people have ideas.
We only really see those that successfully cross it.
Small things: EULA breaches, consumer licenses being used commercially, for example.
The lesson is simple: if you want to break a law, make sure it is very profitable, because then you can find investors and get away with it. If you play Robin Hood, you will be met with a hammer.
There is no equality, and seemingly there are worker bees who can be exploited, and there are privileged ones, and of course there are the queens.
I use an adblocker, and tbh I find it odd that so many people on HN are okay with ad blocking but not piracy, when basically both just stop the people at the other end from earning money.
I kind of believe that if you really like a piece of software, or really like someone's work, you should just ask them what their favourite charity is and donate there, or join their Patreon or some other direct way of supporting them.
In this context, stealing is often used as a pejorative term to make piracy sound worse than it is. Except for mass distribution, piracy is often regarded as a civil wrong, and not a crime.
Writers that have an authentic human voice and help people think about things in a new way will be fine for a while yet.
This is what every company using media is doing (think Spotify, Netflix, but also journals, ad agencies, ...). I don't know why people on HN are giving AI companies a pass for this kind of behavior.
"Anthropic cut up millions of used books to train Claude — and downloaded over 7 million pirated ones too, a judge said."
A not-so-subtle difference.
That said, in a sane world, they shouldn't have needed to cut up all those used books yet again when there's obviously already an existing file that does all the work.
Please keep in mind, copyright is intended as a compromise between benefit to society and to the individual.
A thought experiment: what about students pirating textbooks and applying that knowledge later on in their work?
When it comes to a lot of these teachers, I'll say this: copyright works hand in hand with college and school course-book mandates. I've seen plenty of teachers making crazy money off students' backs due to these mandates.
A lot of the content taught in undergrad and school hasn't changed in decades or even centuries. I think we have all the books we'll ever need in certain subjects already, but copyright keeps enriching people who write new versions of these.
The comment is more about the pseudo-ethical high ground.
Pirating 7 million books, remixing their content, and using that to power Claude.ai is like counterfeiting 7 million branded products and selling them on your personal website. The original creators don't get credit or payment, and someone’s profiting off their work.
All this happens while authors, many of them teachers, are left scratching their heads with four kids to feed
https://www.forbes.com/2009/08/04/online-anime-video-technol...
https://venturebeat.com/business/crunchyroll-for-pirated-ani...
There are many small artists who do this not for money but for fun, and who have their own renowned styles. Even their styles are ripped off by these generative AI companies and turned into a slot machine to earn money for themselves. These artists didn't consent to that, and it affects their (mental) well-being.
With that context in mind, what do you think about these people, who are not in it for money, having their years of achievement ripped off and their hard work exploited for profit by generative AI companies?
It's not about IP (with whatever expansion you prefer) or laws, but ethics in general.
Substitute any other medium for comics: code, music, painting, illustration, literature, short movies, etc.
There are so many possible texts, and they are so sparsely distributed, that if I could copyright a work and never publish it, the restriction would be irrelevant. The probability that you would accidentally come up with something close enough for copyright to be relevant is almost infinitesimal.
Because of this, copyright is an incredibly weak restriction, and the fact that it is so weak shows clearly that any use of a copyrighted work happens because of the convenience of its being available.
That is, it's about making use of the work somebody else has done, not about that restricting you somehow.
Therefore copyright is much more legitimate than ordinary property. Ordinary property, especially ownership of land, can actually limit other people. But since copyright is so sparse, infringing on it is like going to a world with near-infinite space, picking the precise place where somebody has planted a field, and deciding to harvest from that particular field.
Consequently I think copyright infringement might actually be worse than stealing.
I do support intellectual property reform that would be considered radical by some, as I imagine you do. But my highest hopes for this situation are more modest: if AI companies are told that their data must be in the public domain to train against, we will finally have a powerful faction among capitalists with a strong incentive to push back against the copyright monopolists when it comes to the continuous renewal of copyright terms.
If the "path of least resistance" for companies like Google, Microsoft, and Meta becomes enlarging the public domain, we might finally begin to address the stagnation of the public domain, and that could be a good thing.
But I think even such a modest hope as that one is unlikely to be realized. :-\
Have you never noticed the hypocritical behavior all over society?
* Oh, you drunk-drive? Big fine, lots of trouble. Oh, you drunk-drive and are a senator, cop, mayor, ...? Well, let's look the other way.
* You have anger management issues and slam somebody to the ground? Jail time. You're a cop with anger management issues who slams somebody to the ground? Well, paid time off while we investigate, and maybe a reprimand. Qualified immunity, boy!
* You commit tax fraud for 10k? Felony record, maybe jail time. You're an exec of a company who commits tax fraud for 100 million? After 10 years of lawyering around, maybe you get something, maybe... oh, here's a fine of 5 million.
I am sorry, but the idea of everybody being equal under the law has always been an illusion.
We are holding China accountable for counterfeiting products because it hurts OUR companies and their income. But when it's "us vs us", well, then it becomes a bit messier, and in general those with the biggest backing (as in $$$, economic value, and lawyers) tend to win.
Wait: if somebody steals my book, I can sue that person in court and get a payout (the lawyers will cost me more, but that is not the point). If some AI company steals my book, well, the chance I win is close to 1%, simply because lots of well-paid lawyers will make winning hard to impossible.
Our society has always been based upon power, wealth and influence. The more of it you have, the more you get away with (or get reduced penalties for) things that get others fined or jailed.
That's a statement carefully crafted to be impossible to disprove. Of course they shipped pirated music (I've seen the files). Of course anyone paying attention knew. Nothing in the music industry was "clean" in those days. But, sure, no credible evidence because any evidence anyone shows you you'll decide is not credible. It's not in anyone's interests to say anything and none of it matters.
My response to this whole thread is just “good”
Aaron Swartz is a saint and a martyr.
And it just so happens that that belief says they can burn whatever they want down because something in the future might happen that absolves them of those crimes.
https://investors.autodesk.com/news-releases/news-release-de...
As long as you buy the book, it should still be legal; that is, if you actually buy the book and not a "read only" eBook.
But the 7,000,000 pirated books are a huge issue, and one that we have a lot of reason to believe isn't specific just to Anthropic.
Yes, style copying is generally considered legal, but as another commenter posted in a related thread "scale matters".
Maybe this will be reconsidered in the near future, as the scale with generative AI is on a completely different level. While there can be no technological solution to this (since it's a social problem to begin with), maybe public opinion about this issue will evolve over time.
To be crystal clear: I'm not against the tech. I'm against abusing and exploiting people for solely monetary profit.
If you're an individual pirating software or media, then from the rights owners' perspective, the most rational thing to do is to make an example of you. It doesn't happen every day, but it does happen, and it can destroy lives.
If you're a corporation doing the same, the calculation is different. If you're small but growing, future revenues are worth more than the money that can be extracted out of you right now, so you might get a legal nastygram with an offer of a reasonable payment to bring you into compliance. And if you're already big enough to be scary, litigation might be just too expensive to the other side even if you answer the letter with "lol, get lost".
Even in the worst case - if Anthropic loses and the company is fined or even shuttered (unlikely) - the people who participated in it are not going to be personally liable and they've in all likelihood already profited immensely.
We have? Are we from different multi-verses?
The one I've lived in to date has not done anything against Chinese counterfeits beyond occasionally seizing counterfeit goods during import. But that's merely occasionally enforcing local counterfeit law, a far cry from punishing the entity producing it.
As a matter of fact, the companies started outsourcing everything to China, making further IP theft and quasi-copies even easier
- riding a wave of change
- not caring too much about legal constraints (or, as they would say now, "disrupting" the market, which very often means doing illegal stuff that brings them far more money than any penalties they will ever face for it)
- nor caring too much about ethics
- and in recent years (starting with Amazon), a lot of technically illegal financing: undercutting competitors' prices long-term using money from elsewhere (e.g. investors) is an unfair competitive advantage that is (theoretically) clearly not allowed by anti-monopoly laws. And before that you often still had other monopoly issues (e.g. see Wintel)
So yes, not complying with the law to get an unfair competitive advantage, knowing that many of those laws are, in the bigger picture, toothless when applied to huge companies, is bread-and-butter work for US tech giants:
not just small things, but systematic, widespread, big things, and often many of them, giving US giants an unfair competitive advantage.
And don't think that if you are an EU company you can do the same in the US. Nope, nope.
But naturally the US insists that US companies can do that in the EU, and complains every time a US company is fined for not complying with EU law.
that isn't "just" stealing, it's organized crime
Did you read the article? The judge literally just legally recognized it.
You are often allowed to make a digital copy of a physical work you bought. There are tons of used, physical works that would be good for training LLMs. They'd also be good for training OCR, which could do many things, including improve book scanning for training.
This could be reduced to a single act of book destruction per copyrighted work or made unnecessary if copyright law allowed us to share others' works digitally with their licensed customers. Ex: people who own a physical copy or a license to one. Obviously, the implementation could get complex but we wouldn't have to destroy books very often.
"What serves me personally the best for any given situation" for 95% of people.
My theory is that once they saw how much traffic they were getting, they realized how big of a market (subbed/dubbed) anime was.
Don’t have legal access to training data? Simply steal it, but move fast enough to keep ahead of the law. By the time lawsuits hit the company is worth billions and the product is embedded in everyday life.
Fake altruistic mindset. Super sociopathic.
Someone correct me if I am wrong but aren't these works being digitized and transformed in a way to make a profit off of the information that is included in these works?
It would be one thing for an individual to make personal use of one or more books, but you've got to have some special blindness not to see that a for-profit company's use of this information to improve a for-profit model is clearly going against what copyright stands for.
It's quite the mafia operation over at Amazon.
This of course cannot be allowed to happen, so the legal system is just a limbo bar: one that regular individuals must strain to pass under, but that corporations regularly step right over.
Can anyone make a compelling argument that any of these AI companies have the public's best interest in mind (alignment/superalignment)?
Simply, if the models can think then it is no different than a person reading many books and building something new from their learnings. Digitization is just memory. If the models cannot think then it is meaningless digital regurgitation and plagiarism, not to mention breach of copyright.
The quotes "consistent with copyright's purpose in enabling creativity and fostering scientific progress." and "Like any reader aspiring to be a writer" say, from what I can tell, that the judge has legally ruled the model can think as a human does, and therefore has the legal protections afforded to "creatives."
“Piracy” is mostly a rhetorical term in the context of copyright. Legally, it’s still called infringement or unauthorized copying. But industries and lobbying groups (e.g., RIAA, MPAA) have favored “piracy” for its emotional weight.
Luigi was peanuts in comparison.
“THERE were two “Reigns of Terror,” if we would but remember it and consider it; the one wrought murder in hot passion, the other in heartless cold blood; the one lasted mere months, the other had lasted a thousand years; the one inflicted death upon ten thousand persons, the other upon a hundred millions; but our shudders are all for the “horrors” of the minor Terror, the momentary Terror, so to speak; whereas, what is the horror of swift death by the axe, compared with lifelong death from hunger, cold, insult, cruelty, and heart-break? What is swift death by lightning compared with death by slow fire at the stake? A city cemetery could contain the coffins filled by that brief Terror which we have all been so diligently taught to shiver at and mourn over; but all France could hardly contain the coffins filled by that older and real Terror—that unspeakably bitter and awful Terror which none of us has been taught to see in its vastness or pity as it deserves.”
- Mark Twain
Copyright infringement is unauthorized reproduction - you have made a copy of something, but you have not deprived the original owner of it. At most, you denied them revenue although generally less than the offended party claims, since not all instances of copying would have otherwise resulted in a sale.
Also, there are various incentives for teachers to publish books. Money is just one of them (I wonder how much revenue books bring to the teachers). Prestige and academic recognition is another. There are probably others still. How realistic is the depiction of a deprived teacher whose livelihood depended on the books he published once every several years?
No, that's fallacious. Using anthropomorphic words to describe a machine does not give it the same kinds of rights and affordances we give real people.
This is reaching at best.
I can read 100 books and write a book based on the inspiration I got from the 100 books without any issue. However, if I pirate the 100 books I've still committed copyright infringement despite my new book being fully legal/fair use.
Although, there’s an exception for fictional characters:
https://en.m.wikipedia.org/wiki/Copyright_protection_for_fic...
It's not a common business practice. That's why it's considered newsworthy.
People on the internet have forgotten that the news doesn't report everyday, normal, common things, or it would be nothing but a listing of people mowing their lawns or applying for business loans. The reason something is in the news is because it is unusual or remarkable.
"I saw it online, so it must happen all the time" is a dopy lack of logic that infects society.
> Alsup ruled that Anthropic's use of copyrighted books to train its AI models was "exceedingly transformative" and qualified as fair use
> "All Anthropic did was replace the print copies it had purchased for its central library with more convenient space-saving and searchable digital copies for its central library — without adding new copies, creating new works, or redistributing existing copies"
It was always somewhat obvious that pirating a library would be copyright infringement. The interesting findings here are that scanning and digitizing a library for internal use is OK, and using it to train models is fair use.
The AI sector, famously known for its inability to raise funding. Anthropic has raised 17 billion dollars in the last four years.
What you really should be asking is whether they infringed on the copyrights of the rippers. /s
You're saying copying a book is worse than robbing a farmer of his food and/or livelihood, which cannot be replaced or duplicated. Meanwhile, someone who copies a book does not deprive the author of selling the book again (or of the tasty proceeds of a harvest).
I can't say I agree, for obvious reasons.
Come up with a better comparison.
> But Alsup drew a firm line when it came to piracy.
> "Anthropic had no entitlement to use pirated copies for its central library," Alsup wrote. "Creating a permanent, general-purpose library was not itself a fair use excusing Anthropic's piracy."
That is, he ruled that
- buying, physically cutting up, physically digitizing books, and using them for training is fair use
- pirating the books for their digital library is not fair use.
> I honestly feel bad for some of these AI companies because the rules around copyright are changing just to target them
The ruling would be a huge win for AI companies if upheld. It's really weird that you reached the opposite conclusion.
If I didn’t license all the books I trained on, am I not depriving the publisher of revenue, given people will pay me for the AI instead of buying the book?
Because, for example, if you buy a movie on disc, that's a personal license and you can watch it yourself at home. But you can't play it at a large public venue that sells tickets to watch it. You need a different and more expensive license to make money off the usage of the content in a larger capacity like that.
Training a generative model on a book is the mechanical equivalent of having a human read the book and learn from it. Is it stealing if a person reads the book and learns from it?
I really think we need to understand this as a society and also realize that moneyed interests will downplay this as much as possible. A lot of the problems we're having today are due to insufficient regulation differentiating between individuals and systems at scale.
Real piracy always involves booty.
Naturally booty is wealth that has been hoarded.
It has nothing to do with wealth that may or may not come in the future, regardless of whether any losses due to piracy have already taken place or not.
Judges consider a four-factor test when examining fair use[1]. For search engines:
1) The use is transformative: a tool for finding content serves a very different purpose than the content itself.
2) The nature of the original works runs the full gamut, so search engines don't get points for only consuming factual data, but it was all publicly viewable by anyone, as opposed to books, which require payment.
3) The search engine stores significant portions of the works in its index, but it only redistributes small portions.
4) Search engines, as originally devised, don't compete with the original; in fact they can improve the potential market for the original by helping more people find it. This has changed over time, though, and search engines are increasingly competing with the content they index, intentionally trying to show the information people want on the search page itself.
So traditional search, which was transformative, only republished small amounts of the originals, and didn't compete with the originals, fell firmly on the side of fair use.
Google News and Books on the other hand weren't so clear cut, as they were showing larger portions of the works and were competing with the originals. They had to make changes to those products as a result of lawsuits.
So now let's look at LLMs:
1) LLMs are absolutely transformative. Generating new text at a user's request is a very different purpose and character from the original works.
2) Again, the nature of the works runs the full gamut (setting aside the clearly infringing downloading of illegally distributed books, which is a separate issue).
3) For training purposes, LLMs don't typically preserve entire works, so the model is in a better place legally than a search index, for which there is precedent that storing entire works privately can be fair use depending on the other factors. For inference, even though LLMs are less likely to reproduce the originals in their outputs than search engines are, there are failure cases where an LLM has over-trained on a work and a significant amount of the original can be reproduced (a rough way to check for such verbatim overlap is sketched after the reference below).
4) LLMs have tons of uses, some of which complement the original works and some of which compete directly with them. Because of this, it is likely that whether LLMs are fair use will depend on how they are being used: e.g. ignore the LLM altogether and consider solely the output and whether it would be infringing if a human had created it.
This case was solely about whether training on books is fair use, and did not consider any uses of the LLM. Because LLMs are a very transformative use, and because they don't store the originals verbatim, it weighs strongly in favor of fair use.
I think the real problems that LLMs face will be in factors 3 and 4, which is very much context specific. The judge himself said that the plaintiffs are free to file additional lawsuits if they believe the LLM outputs duplicate the original works.
[1] https://fairuse.stanford.edu/overview/fair-use/four-factors/
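On the factor 3 point above, here's a minimal sketch of the kind of n-gram overlap check one might run to see whether an output reproduces verbatim chunks of a source. The texts and the 8-word window are made up for illustration; this is not how any court or vendor actually measures memorization.

    # Illustrative only: estimate what fraction of an output's 8-grams appear verbatim in a source text.
    def ngrams(text, n=8):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def verbatim_overlap(source, output, n=8):
        src, out = ngrams(source, n), ngrams(output, n)
        return len(src & out) / max(len(out), 1)  # fraction of output n-grams found in the source

    source_text = ("deep in the stacks the custom debinder tore through "
                   "page after page of the old encyclopedia")
    model_output = ("the custom debinder tore through page after page "
                    "of the old encyclopedia before anyone noticed")
    print(f"{verbatim_overlap(source_text, model_output):.1%} of the output's 8-grams appear in the source")
    # prints: 62.5% of the output's 8-grams appear in the source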
What you're proposing is considering LLMs to be equal to humans when considering how original works are created. You could make the argument that LLM training data is no different from a human "training" themself over a lifetime of consuming content, but that's a philosophical argument that is at odds with our current legal understanding of copyright law.
Yes, but copying isn't stealing, because the person you "take" from still has their copy.
If you're allowed to call copying stealing, then I should be allowed to call hysterical copyright rabblerousing rape. Quit being a rapist, pyman.
Alsup detailed Anthropic's training process with books: The OpenAI rival
spent "many millions of dollars" buying used print books, which the
company or its vendors then stripped of their bindings, cut the pages,
and scanned into digital files.
I've noticed an increase in used book prices in the recent past and now wonder if there is an LLM effect in the market.
Learning from the book is, well, learning from the book. Yes, they intended to make money off of that learning... but then I guess a medical student reading medical textbooks intends to profit off of what they learn from them. Guess that's not fair use either (well, it's really just use, as in the intended use for all books since they were first invented).
Once a person comes to believe that copyright has any moral weight at all, I guess all rational thought becomes impossible for them. Somehow, they're not capable of entertaining the idea that copyright policy was only ever supposed to be this pragmatic thing to incentivize creative works... and that whatever little value it has disappears entirely once the policy is twisted to consolidate control.
Just as the farmer obtains his livelihood from the investment-of-energy-to-raise-crops-to-energy cycle, the author obtains his livelihood from the investment-of-energy-to-create-a-useful-work-to-energy cycle.
So he is in fact robbed in a very similar way.
So what is the right interpretation of the law with regards to how AI is using it? What better incentivizes innovation? Do we let AI companies scan everything because AI is innovative? Or do we think letting AI vacuum up creative works to then stochastically regurgitate tiny (or not so tiny) slices of them at a time will hurt innovation elsewhere?
But obviously the real answer here is money. Copyright is powerful because monied interests want it to be. Now that copyright stands in the way of monied interests for perhaps the first time, we will see how dedicated we actually were to whatever justifications we've been seeing for DRM and copyright for the last several decades.
First, Authors argue that using works to train Claude’s underlying LLMs was like using
works to train any person to read and write, so Authors should be able to exclude Anthropic
from this use (Opp. 16). But Authors cannot rightly exclude anyone from using their works for
training or learning as such. Everyone reads texts, too, then writes new texts. They may need
to pay for getting their hands on a text in the first instance. But to make anyone pay
specifically for the use of a book each time they read it, each time they recall it from memory,
each time they later draw upon it when writing new things in new ways would be unthinkable.
For centuries, we have read and re-read books. We have admired, memorized, and internalized
their sweeping themes, their substantive points, and their stylistic solutions to recurring writing
problems.
...
In short, the purpose and character of using copyrighted works to train LLMs to generate
new text was quintessentially transformative. Like any reader aspiring to be a writer,
Anthropic’s LLMs trained upon works not to race ahead and replicate or supplant them — but
to turn a hard corner and create something different. If this training process reasonably
required making copies within the LLM or otherwise, those copies were engaged in a
transformative use.
https://authorsguild.org/app/uploads/2025/06/gov.uscourts.ca...

(2) Once you make something publicly available, anyone can learn from it. No consent necessary.
(3) Being upset does not grant you special privileges under the law.
(4) If you don't like the idea of paying for AI art, free software is both plentiful and competitive with just about anything proprietary.
Now, in theory, you learning from an author's works and competing with them in the same market could meaningfully deprive them of income, but it's a very difficult argument to prove.
On the other hand, with AI companies it's an easier argument to make. If Anthropic trained on all of your books (which is somewhat likely if you're a fairly popular author) and you saw a substantial loss of income after the release of one of their better models (presumably because people are just using the LLM to write their own stories rather than buy your stuff), then it's a little bit easier to connect the dots. A company used your works to build a machine that competes with you, which arguably violates the fair use principle.
Gets to the very principle of copyright, which is that you shouldn't have to compete against "yourself" because someone copied you.
Also, please don't use the word "learning"; use "creating software using copyrighted materials".
Also, let's think together about how we can prevent AI companies from using our work through technical measures, if the law doesn't work.
Found it: https://www.nbcnews.com/tech/tech-news/federal-judge-rules-c...
> “That Anthropic later bought a copy of a book it earlier stole off the internet will not absolve it of liability for the theft,” [Judge] Alsup wrote, “but it may affect the extent of statutory damages.”
> In fact this business was the ultimate in deconstruction: First one and then the other would pull books off the racks and toss them into the shredder's maw. The maintenance labels made calm phrases of the horror: The raging maw was a "NaviCloud custom debinder." The fabric tunnel that stretched out behind it was a "camera tunnel...." The shredded fragments of books and magazine flew down the tunnel like leaves in tornado, twisting and tumbling. The inside of the fabric was stitched with thousands of tiny cameras. The shreds were being photographed again and again, from every angle and orientation, till finally the torn leaves dropped into a bin just in front of Robert. Rescued data. BRRRRAP! The monster advanced another foot into the stacks, leaving another foot of empty shelves behind it.
First, Authors argue that using works to train Claude’s underlying LLMs
was like using works to train any person to read and write, so Authors
should be able to exclude Anthropic from this use (Opp. 16).
Second, to that last point, Authors further argue that the training was
intended to memorize their works’ creative elements — not just their
works’ non-protectable ones (Opp. 17).
Third, Authors next argue that computers nonetheless should not be
allowed to do what people do.
https://media.npr.org/assets/artslife/arts/2025/order.pdf

But this analogy seems wrong. First, an LLM is not a human and cannot "learn" or "train"; only a human can do that. And LLM developers are not aspiring to become writers and do not learn anything, they just want to profit by making software using copyrighted material. Also people do not read millions of books to become a writer.
No, that doesn't undo the infringement. At most, that would mitigate actual damages, but actual damages aren't likely to be important, given that statutory damages are an alternative and are likely to dwarf actual damages. (It may also figure into how the court assigns statutory damages within the very large range available for those, but that range does not go down to $0.)
> They will have ceased and desisted.
"Cease and desist" is just to stop incurring additional liability. (A potential plaintiff may accept that as sufficient to not sue if a request is made and the potential defendant complies, because litigation is uncertain and expensive. But "cease and desist" doesn't undo wrongs and neutralize liability when they've already been sued over.)
The whole point of copyright is to ensure you're paid for your work. AI companies shouldn't pirate, but if they pay for your work, they should be able to use it however they please, including training an LLM on it.
If that LLM reproduces your work, then the AI company is violating copyright, but if the LLM doesn't reproduce your work, then you have not been harmed. Trying to claim harm when you haven't been due to some philosophical difference in opinion with the AI company is an abuse of the courts.
This is one of those mental gymnastics exercises that makes copyright law so obtuse and effectively unenforceable.
As an alternative, imagine a scriptwriter buys a textbook on orbital mechanics, while writing Gravity (2013). A large number of people watch the finished film, and learn something about orbital mechanics, therefore not needing the textbook anymore, causing a loss of revenue for the textbook author. Should the author be entitled to a percentage of Gravity's profit?
We'd be better off abolishing everything related to copyright and IP law altogether. These laws might've made sense back in the days of the printing press, but they're just nonsensical nowadays.
I could agree with exceptions for non-commercial activity like scientific research, but AI companies are made for extracting profits and not for doing research.
> AI companies shouldn't pirate, but if they pay for your work, they should be able to use it however they please, including training an LLM on it.
It doesn't work this way. If you buy a movie it doesn't mean you can sell goods with movie characters.
> then you have not been harmed.
I am harmed because fewer people will buy the book if they can simply get an answer from an LLM. Fewer people will hire me to write code if an LLM trained on my code can do it. Maybe instead of books we should start making applications that protect the content and do not allow copying text or making screenshots. And instead of open-source code we should provide binary WASM modules.
> Alsup ruled that Anthropic's use of copyrighted books to train its AI models was "exceedingly transformative" and qualified as fair use, a legal doctrine that allows certain uses of copyrighted works without the copyright owner's permission.
And the harm you describe is not a recognized harm. You don't own information, you own creative works in their entirety. If your work is simply a reference, then the fact being referenced isn't something you own, thus you are not harmed if that fact is shared elsewhere.
It is an abuse of the courts to attempt to prevent people who have purchased your works from using those works to train an LLM. It's morally wrong.
That said, there are tools for digitizing books that don't require destroying them.
And to be clear, we use the word infringement precisely because it is not theft.
Beyond the deprived revenue, piracy also improves the general relevance the author has, or may have, in the public sphere. Essentially, one of the side effects of piracy is basically advertising.
Doctorow was one of the early ones to bring this aspect of it up.
Or even: is an individual operating within the law under fair use the same, in spirit, as a voracious, all-consuming AI training bot ingesting everything?
Consider a single person in a National Park, allowed to pick and eat berries, compared to bringing a combine harvester to take it all.
No. The point of copyright is that the author gets to decide under what terms their works are copied. That's the essence of copyright. In many cases, authors will happily sell you a copy of their work, but they're under no obligation to do so. They can claim a copyright and then never release their work to the general public. That's perfectly within their rights, and they can sue to stop anybody from distributing copies.
Referring to this? (Wikipedia's disambiguation page doesn't seem to have a more likely article.)
https://en.wikipedia.org/wiki/Richard_Stallman#Copyright_red...
From there, the cases would likely focus on whether that fits in established criteria for digitized copies, whether they're allowed in the training process itself, and the copyright status of the resulting model. Some countries allow all of that if you legally obtained the material in the first place. Also, they might factor whether it's for commercial use or not.
> pirating the books for their digital library is not fair use.
"Pirating" is a fuzzy word and has no real meaning. Specifically, I think this is the cruz:
> without adding new copies, creating new works, or redistributing existing copies
Essentially: downloading is fine, sharing/uploading is not. Which makes sense. The assertion here is that Anthropic (per this line) did not distribute the files they downloaded.
Now places like Flea markets have been known to have a counterfeit DVD or two.
And there is more than one way to compare to non-digital content.
Regular books and periodicals can be sold out and/or out-of-print, but digital versions do not have these same exact limitations.
A great deal of the time though, just the opposite occurs, and a surplus is printed that no one will ever read, and which will eventually be disposed of.
Newspapers are mainly in the extreme category where almost always a significant number of surplus copies are intentionally printed.
It's all part of the same publication, a huge portion of which no one has ever rightfully expected to earn anything at all, much less for a return on every single copy to make it back to the original creator.
Which is one reason why so much material is supported by ads. Even if you didn't pay a high enough price to cover the cost of printing, it was all paid for well before it got into your hands.
Digital copies which are going unread are something like that kind of surplus. If you save it from the bin you should be able to do whatever you want with it either way, scan it how you see fit.
You just can't say you wrote it. That's what copyright is supposed to be for.
Like at the flea market, when two different vendors are selling the same items but one has legitimately purchased them wholesale and the other vendor obtained theirs as the spoils of a stolen 18-wheeler.
How do you know which ones are the pirated items?
You can tell because the original owners of the pirated cargo suffered a definite loss, and have none of it remaining any more.
OTOH, with things like fake Nikes at the flea market, you can be confident they are counterfeit whether they were stolen from anybody in any way or not.
Which of the following are true?
(a) the legal industry is susceptible to influence and corruption
(b) engineers don't understand how to legally interpret legal text
(c) AI tech is new, and judges aren't technically qualified to decide these scenarios
Most likely option is C, as we've seen this pattern many times before.
Turns out this doesn't quite mitigate downloading them first. (Though frankly, I'm very much against people having to buy 7 million books when someone has already scanned them)
When you do it for a transformative purpose (turning it into an LLM model) it's certainly fair use.
But more importantly, it's ethical to do so, as the agreement you've made with the person you've purchased the book from included permission to do exactly that.
I think the overly liberal, non-tech crowd has become really vocal on HN as of late and your sample is likely biased by these people.
If the author didn't want their work to be included in an LLM, they should not have sold it, just like if an author didn't want their work to inspire someone else's work, they should not have sold it.
It's not an issue because it's not currently illegal because nobody could have foreseen this years ago.
But it is profiting off of the unpaid work of millions. And there's very little chance of change because it's so hard to pass new protection laws when you're not Disney.
For anyone else who wants to do the same thing though this is likely all they need to do.
Cutting up and scanning books is hard work, and doing the same thing digitally to ebooks isn't labor-free either, especially when they have to be downloaded from random sites and cleaned up from different formats. Torrenting a bunch of epubs and paying for the individual books is probably cheaper.
Where are you getting your data from? My conclusions are the exact opposite.
(Also, aren't judges by definition the only ones qualified to declare if it is actually fair use? You could make a case that it shouldn't be fair use, but that's different from it being not fair use.)
We've only dealt with the fairly straight-forward legal questions so far. This legal battle is still far from being settled.
I've fixed your question so that it accurately represents what I said and doesn't put words in my mouth.
If I click on a link and download a document, is that illegal?
I do not know if the person has the right to distribute it or not. IANAL, but when people were getting sued by the RIAA years back, it was never about downloading, but also distribution.
As I said, IANAL, but feel free to correct me, but my understanding is that downloading a document from the internet is not illegal.
https://www.courtlistener.com/docket/67569326/598/kadrey-v-m...
Note: I am not a lawyer.
It would be more clear if you stick to either legal or colloquial variants, instead of switching back and forth. (Tbf, the judge in this case also used the term “piracy” colloquially).
I don't think humans learn via backprop or in rounds/batches, our learning is more "online".
If I input text into an LLM it doesn't learn from that unless the creators consciously include that data in the next round of teaching their model.
Humans also don't require samples of every text in history to learn to read and write well.
Hunter S Thompson didn't need to ingest the Harry Potter books to write.
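A toy sketch of that point (purely illustrative, nothing like a real LLM pipeline): a forward pass on user input leaves the weights untouched, and only an explicit training step over a batch changes them.

    # Toy model: inference never updates weights; new text matters only if it's put into a later training batch.
    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 4))              # stand-in for "model weights"

    def infer(x):
        return W @ x                         # forward pass: reads W, never writes it

    def train_step(batch, lr=0.01):
        global W
        for x, target in batch:              # weights change only here, in explicit rounds
            grad = np.outer(infer(x) - target, x)
            W -= lr * grad

    prompt = rng.normal(size=4)
    before = W.copy()
    infer(prompt)                            # a user "prompting" the model
    assert np.array_equal(W, before)         # nothing was learned from the prompt

    train_step([(prompt, np.zeros(4))])      # an explicit training round does update W
    assert not np.array_equal(W, before)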
Stallman places great importance on the words and labels people use to talk about the world, including the relationship between software and freedom. He asks people to say free software and GNU/Linux, and to avoid the terms intellectual property and piracy (in relation to copying not approved by the publisher). One of his criteria for giving an interview to a journalist is that the journalist agrees to use his terminology throughout the article.
If by "what is stored and the manner which it is stored" is intended to signal model weights, I'm not sure what the argument is? The four factors of copyright in no way mention a storage medium for data, lossless or loss-y.
(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes; (2) the nature of the copyrighted work; (3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and (4) the effect of the use upon the potential market for or value of the copyrighted work.
In my opinion, this will likely see a supreme court ruling by the end of the decade.
Any reasonable reading of the current state of fair use doctrine makes it obvious that the process between Harry Potter and the Sorcerer's Stone and "A computer program that outputs responses to user prompts about a variety of topics" is wildly transformative, and thus the usage of the copyrighted material is probably covered by fair use.
People don't view moral issues in the abstract.
A better perspective on this is the fact that human individuals have created works which megacorps are training on for free or for the price of a single book and creating models which replace individuals.
The megacorps are only partially replacing individuals now, but when the models get good enough they could replace humans entirely.
When such a future happens will you still be siding with them or with individual creators?
I think this is a fantasy. My father cowrote a Springer book about physics. For the effort, he got like $400 and 6 author copies.
Now, you might say he got a bad deal (or the book was bad), but I don't think hundreds of thousands of authors do significantly better. The reality is, people overwhelmingly write because they want to, not because of money.
People already use pirated software for product creation.
Hypothetical:
I know a guy who learned photoshop on a pirated copy of Photoshop. He went on to be a graphic designer. All his earnings are ‘proceeds from crime’
He never used the pirated software to produce content.
I remember when piracy wasn't theft, and information wanted to be free.
If that were the case then this court case would not be ongoing
Those damn kind readers and libraries. Giving their single copy away when they just paid for the single.
Reasonable minds could debate the ethics of how the material was used, this ruling judged the usage was legal and fair use. The only problem is the material was in effect stolen.
It's a bit surprising that you can suddenly download copyrighted materials for personal use and it's kosher as long as you don't share them with others.
edit/addendum: considering this a bit more, the extent to which the original party is deprived of the stolen thing is pertinent for awarding damages. For example, imagine a small entity stealing from a large one, like a small creator stealing Dungeons & Dragons rules. That doesn't deprive Hasbro of D&D, but it is still theft (we're assuming a verbatim copy here, lifted directly from D&D books).
The example I was pondering was the shows in Russia that were almost literally "the Sampsons." Did that stop the Simpsons from airing in the US, its primary market? No, but it was still theft: something was taken without permission.
> That doesn't seem to chime with the copyright notices I have read in books.
You shouldn't get your legal advice from someone with skin in the game.
Please, please differentiate between pirating books (which Anthropic is liable for, and which is still illegal) and training on copyrighted material (which was found to be legal, for both corporations and average people).
So here's the thing, I don't think a textbook author going against a purveyor of online courseware has much of a chance, nor do I think it should have much of a chance, because it probably lacks meaningful proof that their works made a contribution to the creation of the courseware. Would I feel differently if the textbook author could prove in court that a substantial amount of their material contributed to the creation of the courseware, and when I say "prove" I mean they had receipts to prove it? I think that's where things get murky. If you can actually prove that your works made a meaningful contribution to the thing that you're competing against, then maybe you have a point. The tricky part is defining meaningful. An individual author doesn't make a meaningful contribution to the training of an LLM, but a large number of popular and/or prolific authors can.
You bring up a good point, interpretation of fair use is difficult, but at the end of the day I really don't think we should abolish copyright and IP altogether. I think it's a good thing that creative professionals have some security in knowing that they have legal protections against having to "compete against themselves"
And the system set up by society doesn't truly account for this or care.
Are all music creators better off now than before Spotify?
I get the sentiment, but that statement as is, is absurdly reductive. Details matter. Even if someone takes merchandise from a store without paying, their sentence will vary depending on the details.
No, it's not.
It's the maximum statutory damages for willful infringement, which this has not been adjudicated to be. It is not a fine; it's an alternative basis of recovery to actual damages plus the infringer's profits attributable to the infringement.
Of course, there's also a very wide range of statutory damages, the minimum (if it is not "innocent" infringement) is $750/work.
> 105B+ is more than Anthropic is worth on paper.
The actual amount of 7 million works times $150,000/work is $1.05 trillion, not $105 billion.
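For reference, a quick back-of-the-envelope of that statutory range, assuming (purely for illustration) one award per work across all 7 million works and no reductions:

    works = 7_000_000
    floor_per_work = 750        # statutory minimum per work (non-"innocent" infringement)
    ceiling_per_work = 150_000  # statutory maximum per work (willful infringement)

    print(f"${works * floor_per_work:,}")    # $5,250,000,000 -- about $5.25 billion
    print(f"${works * ceiling_per_work:,}")  # $1,050,000,000,000 -- about $1.05 trillion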
Did you mean to write "but about distribution" here?
(As an aside, it seems pointless to decry it as a "talking point". The reason it was brought up is presumably because the author agrees with it and thinks it's relevant. It's also entirely possible that the author, like me, made this argument without being aware that it was popularized by Richard Stallman. If it makes sense then you can hear the argument without hearing the person and still find it agreeable.)
"Piracy" is used to refer to copyright violation to make it sound scary and dangerous to people who don't know better or otherwise don't think about it too hard. Just imagine if they called it "banditry" instead; now tell me that pirates are not bandits with boats. They may as well have called it banditry and it's worth correcting that. (I also think it's worth ridiculing but that doesn't appear to be Stallman's primary point.) It's not banditry (how ridiculous would it be to call it that?), it's copyright infringement.
Edit:
Reading my comment again in the context of other things you wrote, I suspect the argument will not pass muster because you do not seem to see piracy's change in meaning as manufactured by PR work purchased by media industry leaders. I'm not really trying to convince you that it's true but it may be worth considering that it is the fundamental disagreement you seem to have with others on Stallman's point; again, not saying you're wrong, just that's where the disagreement is.
When I clicked the link, I got an article about a business that was selling millions of dollars of pirated software.
This guy made millions of dollars in profit by selling pirated software. This wasn't a case of transformative works, nor of an individual doing something for themselves. He was plainly stealing and reselling something.
Here's an article explaining in more detail [1].
Most experts say that if Swartz had gone to trial and the prosecution had proved everything they alleged and the judge had decided to make an example of Swartz and sentence harshly it would have been around 7 years.
Swartz's own attorney said that if they had gone to trial and lost, he thought it was unlikely that Swartz would get any jail time.
Swartz also had at least two plea bargain offers available. One was for a guilty plea and 4 months. The other was for a guilty plea and the prosecutors would ask for 6 months but Swartz could ask the judge for less or for probation instead and the judge would pick.
[1] https://www.popehat.com/2013/02/05/crime-whale-sushi-sentenc...
If a human uses a voting machine, they still have a right to vote.
Machines don't have rights. The human using the machine does.
> First, Authors argue that using works to train Claude’s underlying LLMs was like using works to train any person to read and write, so Authors should be able to exclude Anthropic from this use (Opp. 16). But Authors cannot rightly exclude anyone from using their works for training or learning as such. Everyone reads texts, too, then writes new texts. They may need to pay for getting their hands on a text in the first instance. But to make anyone pay specifically for the use of a book each time they read it, each time they recall it from memory, each time they later draw upon it when writing new things in new ways would be unthinkable. For centuries, we have read and re-read books. We have admired, memorized, and internalized their sweeping themes, their substantive points, and their stylistic solutions to recurring writing problems.
Couldn't have put it better myself (though $deity knows I tried many times on HN). Glad to see Judge Alsup continues to be the voice of common sense in legal matters around technology.
As an extreme example, consider murder. Obviously it should be illegal, but if it's legal for one group and not for another, the group for which it's illegal will probably be wiped out, having lost the ability to avenge deaths in the group.
It's much more important that laws are applied impartially and equally than that they are even a tiny bit reasonable.
A trillion parameter SOTA model is not substantially comprised of the one copyrighted piece. (If it was a Harry Potter model trained only on Harry Potter books this would be a different story).
Embeddings are not copy paste.
The last point about market impact would be where they make their argument but it's tenuous. It's not the primary use of AI models and built in prompts try to avoid this, so it shouldn't be commonplace unless you're jail breaking the model, most folk aren't.
Yeah, you’re probably right, I’m not a lawyer. The point is that it doesn’t matter what number the law says they should pay, Anthropic can afford real lawyers and will therefore only pay a pittance, if anything.
I’m old enough to remember what the feds did to Aaron Swartz, and I don’t see what Anthropic did that was so different, ethically speaking.
Sorry for the long quote, but basically this, yeah. A major point of free software is that creators should not have the power to impose arbitrary limits on the users of their works. It is unethical.
It's why the GPL allows the user to disregard any additional conditions, why it's viral, and why the FSF spends so much effort on fighting "open source but..." licenses.
As mentioned in The Fucking Article, there's a legal difference between training an AI which largely doesn't repeat things verbatim (ala Anthropic) and redistributing media as a whole (ala Spotify, Netflix, journal, ad agency).
Isn't that what a lot of companies are doing, just through employees? I read a lot of books, and took a lot of courses, and now a company is profiting off that information.
"Pirates" also transform the works they distribute. They crack it, translate it, compress it to decrease download times, remove unnecessary things, make it easier to download by splitting it in chunks (essential with dial-up, less so nowadays), change distribution formats, offer it trough different channels, bundle extra software and media that they themselves might have coded like trainers, installers, sick chiptunes and so on. Why is the "transformation" done by a big corpo more legal in your views?
So Suno would only really need to buy the physical albums and rip them to be able to generate music at an industrial scale?
The analogy refers to humans using machines to do what would already be legal if they did it manually.
> And LLM developers are not aspiring to become writers and do not learn anything, they just want to profit by making software using copyrighted material.
[Citation needed], and not a legal argument.
> Also people do not read millions of books to become a writer.
But people do hear millions of words as children.
Don’t we already have laws covering this? For example, sometimes excess books can be thrown in the bin. Often, they have the covers removed. Some will say something to the effect that “if you’ve received this without a cover it is a copyright violation.” I think one of the points of the lawsuit is it gives copyright holders discretion as to how their works are used/sold etc. The idea that “if you saved it from the bin you can do with it whatever you want” strips them of that right.
> - buying, physically cutting up, physically digitizing books, and using them for training is fair use
> - pirating the books for their digital library is not fair use.
That seems inconsistent with one another. If it's fair use, how is it piracy?
It also seems pragmatically trash. It doesn't do the authors any good for the AI company to buy one copy of their book (and a used one at that), but it does make it much harder for smaller companies to compete with megacorps for AI stuff, so it's basically the stupidest of the plausible outcomes.
In short the post is bait.
I could, right now in just a few minutes, go download a perfectly functional pirated copy of nearly any Adobe program, nearly any Microsoft program, and a whole range of books and movies, yet I see zero real financial troubles affecting any of the companies behind these. Quite the contrary, in fact.
If you give people a claim for damages which is an order of magnitude larger than their actual damages, it encourages litigiousness and becomes a vector for shakedowns because the excessive cost of losing pressures innocent defendants to settle even if there was a 90% chance they would have won.
Meanwhile both parties have the incentive to settle in civil cases when it's obvious who is going to win, because a settlement to pay the damages is cheaper than the cost of going to court and then having to pay the same damages anyway. Which also provides a deterrent to doing it to begin with, because even having to pay lawyers to negotiate a settlement is a cost you don't want to pay when it's clear that what you're doing is going to have that result.
And when the result isn't clear, penalizing the defendant in a case of first impression isn't just, either, because it wasn't clear, and punitive measures should be reserved for instances of unambiguous wrongdoing.
Not only that but all of us are guilty too because I'm positive we've all clicked on search results that contained copyrighted content that was copied without permission. You may not have even known it was such.
Remember: Intent is irrelevant when it comes to copyright infringement! It's not that kind of law.
Intent can guide a judge when they determine damages but that's about it.
You are allowed to buy and scan books, and then use those scanned books to create products. I guess you are also allowed to pirate books and use the knowledge to create products, if you are willing to pay the damages to the rights holders for the copyright violations.
If I were China, I would buy every lawyer to drown Western AI companies in lawsuits, because it's an easy way to win the AI race.
Do keep in mind, though: this is only for the wealthy. They're still going to send the Pinkertons to your house if you dare copy a Blu-ray.
That's a point I normally use to argue against authors being entitled to royalties on LLM outputs. An individual author's marginal contribution to an LLM is essentially nil, and could be removed from the training set with no meaningful impact on the model. It's only the accumulation of a very large amount of works that turns into a capable LLM.
No, it's not. Have you ever heard of a publishing house? They don't need to negotiate with every single author individually. That's preposterous.
* They downloaded a massive online library of pirated books that someone else was distributing illegally. This was not fair use.
* They then digitised a bunch of books that they physically owned copies of. This was fair use.
This part of the ruling is pretty much existing law. If you have a physical book (or own a digital copy of a book), you can largely do what you like with it within the confines of your own home, including digitising it. But you are not allowed to distribute those digital copies to others, nor are you allowed to download other people's digital copies that you don't own the rights to.
The interesting part of this ruling is that once Anthropic had a legal digital copy of the books, they could use it for training their AI models and then release the AI models. According to the judge, this counts as fair use (assuming the digital copies were legally sourced).
It's not the only reason fair use exists, but it's the thing that allows e.g. search engines to exist, and that seems pretty important.
> Have you ever heard of a publishing house? They don't need to negotiate with every single author individually. That's preposterous.
There are thousands of publishing houses and millions of self-published authors on top of that. Many books are also out of print or have unclear rights ownership.
Seven years for thumbing your nose at Autodesk when armed robbery would get you less time says some interesting things about the state of legal practice.
You'd have to steal the author's ownership of the intellectual property in order for the comparison to be valid, just as you stole ownership of his crop.
Separately, there is a reason why theft and copyright infringement are two distinct concepts in law.
You could split hairs over whether saving an item from the bin happened after the covers had been removed and it had already been dumped, or before any decision had been made about if or when dumping would take place.
Saving either way would be preserving what would otherwise be lost, even if it was well premeditated in advance of any imminent risk.
What if it was the last remaining copy?
Or even the only copy ever in existence of an original manuscript?
It's just not a concept suitable for a black & white judgment.
That's a very good sign that probably an entire book of regulations needs to be thrown out instead, and a new law written to replace it with something more sensible.
“We have admired, memorized, and internalized their sweeping themes, their substantive points, and their stylistic solutions to recurring writing problems.”
Claude is not doing any of these things. There is no admiration, no internalizing of sweeping themes. There’s a network encoding data.
We’re talking about a machine that accepts content and then produces more content. It’s not a person, it’s owned by a corporation that earns money on literally every word this machine produces. If it didn’t have this large corpus of input data (copyrighted works) it could not produce the output data for which people are willing to pay money. This all happens at a scale no individual could achieve because, as we know, it is a machine.
Can you point me to the US Supreme Court case where this is existing law?
It's pretty clear that if you have a physical copy of a book, you can lend it to someone. It also seems pretty reasonable that the person borrowing it could make fair use of it, e.g. if you borrow a book from the library to write a book review and then quote an excerpt from it. So the only thing that's left is, what if you do the same thing over the internet?
Shouldn't we be able to distinguish this from the case where someone is distributing multiple copies of a work without authorization and the recipients are each making and keeping permanent copies of it?
As a researcher I've been furious that we publish papers where the research data is unknown. To add insult to injury, we have the audacity to make claims about "zero-shot", "low-shot", "OOD", and other such things. It is utterly laughable. These would be tough claims to make *even if we knew the data*, simply because of its size; not knowing the data, it is outlandish, especially when the presumption is "everything on the internet." It would be like training on all of GitHub and then writing your own simple programming questions to test an LLM[0]. Analyzing that amount of data is intractable, and we currently do not have the mathematical tools to do so. It is an even harder problem to crack when we're only conjecturing about what the data contains, and ultimately that makes interpretability more difficult.
On top of all of that, we've been playing this weird legal game, where it seems that every company has had to cheat. I can understand how smaller companies turn to torrenting to compete, but when it is big names like Meta, Google, Nvidia, OpenAI (Microsoft), etc., it is just wild. This isn't even following the highly controversial advice of Eric Schmidt, "Steal everything, then if you get big, let the lawyers figure it out."[1] This is just "steal everything, even if you could pay for it." We're talking about the richest companies in the entire world. Some of the, if not the, richest companies to ever exist.
Look, can't we just try to be a little ethical? There is, in fact, enough money to go around. We've seen unprecedented growth in the last few years: it was only 2018 when Apple became the first trillion dollar company, 2020 when it passed two trillion, and 2022 when it became the first three trillion dollar company. Now we have 10 companies north of the trillion dollar mark[3] (5 above $2T and 3 above $3T). These values have exploded in the last 5 years! It feels difficult to say that we don't have enough money to do things better, to at least not completely screw over "the little guy." I am unconvinced that these companies would be hindered if they had to broker some deal for training data. Hell, they're already going to war over data access.
My point here is that these two things align. We're talking about how this technology is so dangerous (every single one of those CEOs has made that statement) and yet we can't remain remotely ethical? How can you shout "ONLY I CAN MAKE SAFE AI" while acting so unethically? There's always moral gray areas but is this really one of them? I even say this as someone who has torrented books myself![4] We are holding back the data needed to make AI safe and interpretable while handing the keys to those who actively demonstrate that they should not hold the power. I don't understand why this is even that controversial.
[0] Yes, this is a snipe at HumanEval. Yes, I will make the strong claim that the dataset was spoiled from day 1. If you doubt it, go read the paper and look at the questions (HuggingFace).
[1] https://www.theverge.com/2024/8/14/24220658/google-eric-schm...
[2] https://en.wikipedia.org/wiki/List_of_public_corporations_by...
[3] https://companiesmarketcap.com/
[4] I can agree it is wrong, but can we agree there is a big difference between a student torrenting a book and a billion/trillion dollar company torrenting millions of books? I even lean on the side of free access to information, and am a fan of Aaron Swartz and SciHub. I make all my works available on ArXiv. But we can recognize there's a big difference between a singular person doing this at a small scale and a huge multi-national conglomerate doing it at a large scale. I can't even believe we so frequently compare these actions!
https://www.computerworld.com/article/1447323/google-reporte...
This argument is more along the lines of blaming Microsoft Word for someone typing characters into the word processor's algorithm and outputting a copy of an existing book. (Yes, it is a lot easier, but the rationale is the same.) In my mind, the end user prompting the model would be the one potentially infringing.
Ordinary property is much worse than copyright: copyright is time limited, while ordinary property is not necessarily obtained through work and is far more limited in availability than the number of possible sequences.
When someone owns land, that's an actual place you can stumble upon and be barred from entering, whereas you're never going to stumble upon the story of even 'Nasse hittar en stol' (Swedish for 'Nasse finds a chair'), a very short book for very small children.
Also, I don't quite understand how your example is relevant to the case. If you give a book to a friend, they are now the owner of that book and can do what they like with it. If you photocopy that book and give them the photocopy, they are not the owner of the book and you have reproduced it without permission. The same is, I believe, true of digital copies - this is how ebook libraries work.
In this case, Anthropic were the legal owners of the physical books, and so could do what they wanted with them. They were not the legal owners of the digital books, which means they can get prosecuted for copyright infringement.
This absolutely falls under copyright law as I understand it (not a lawyer). E.g. the disclaimer that rolls before every NFL broadcast. The notice states that the broadcast is copyrighted and any unauthorized use, including pictures, descriptions, or accounts of the game, is prohibited. There is wiggle room for fair use by news organizations, critics, artists, etc.
No, it kinda isn't. Show me anything that supports this idea beyond your own immediate conjecture right now.
>It's not the only reason fair use exists, but it's the thing that allows e.g. search engines to exist, and that seems pretty important.
No, that's the transformative element of what a search engine provides. Search engines are not legal because they can't contact each licensor, they are legal because they are considered hugely transformative features.
>There are thousands of publishing houses and millions of self-published authors on top of that. Many books are also out of print or have unclear rights ownership.
Okay, and? How many customers does Microsoft bill on a monthly basis?
https://copyright.gov/about/1790-copyright-act.html
Specified in dollars because dollars had been invented (in 1789), but in the amount of one half of one dollar, i.e. $0.50. That's 1790 dollars, of course, so a little under $20 today. (There was basically no inflation for the first 100+ years of that because the US dollar was still backed by precious metals then; a dollar was worth slightly more in 1900 than in 1790.)
That seems more like an attempt to codify some amount of plausible actual damages so people aren't arguing endlessly about valuations, rather than an attempt to impose punitive damages. Most notably because -- unlike the current method -- it scales with the number of sheets reproduced.
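For a rough sense of scale, here is a back-of-the-envelope sketch comparing the 1790 per-sheet regime described above with the modern per-work statutory range (commonly cited as roughly $750 to $150,000 per work). The inflation multiplier and the example page counts are assumptions of mine, not figures from the thread.

    # Back-of-the-envelope only; the multiplier and sheet counts below are assumptions.
    SHEET_DAMAGES_1790 = 0.50   # 1790 Act: fifty cents per sheet found in the infringer's possession
    INFLATION_MULTIPLIER = 40   # assumed ~40x purchasing-power change, matching the "$0.50 then ~= $20 now" estimate

    def damages_1790_style(sheets: int) -> float:
        """Scales with how much was reproduced, expressed in today's dollars."""
        return sheets * SHEET_DAMAGES_1790 * INFLATION_MULTIPLIER

    def damages_modern_style(works: int, per_work: float = 750.0) -> float:
        """Modern statutory damages scale per work (here the commonly cited $750 minimum)."""
        return works * per_work

    # A hypothetical book spanning 150 sheets:
    print(damages_1790_style(150))   # 3000.0 -> roughly $3,000 today, proportional to the amount copied
    print(damages_modern_style(1))   # 750.0  -> a flat per-work figure, regardless of the book's length

The sketch only shows the structural difference: one regime approximates actual damages by scaling with the amount copied, the other attaches a fixed (and potentially punitive) range to each work.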
I do think that a big part of the reason Anthropic downloaded millions of books from pirate torrents was because they needed that input data in order to generate the output, their product.
I don’t know what that is, but, IMHO, not sharing those dollars with the creators of the content is clearly wrong.
Make no mistake, they’re seeking to exploit the contents of that material for profits that are orders of magnitude larger than what any shady pirated-material reseller would make. The world looks the other way because these companies are “visionary” and “transformational.”
Maybe they are, and maybe they should even have a right to these buried works, but what gives them the right to rip up the rule book and (in all likelihood) suffer no repercussions in an act tantamount to grand theft?
There’s certainly an argument to be had about whether this form of research and training is a moral good and beneficial to society. My first impression is that the companies are too opaque in how they use and retain these files, albeit for some legitimate reasons, but nevertheless the archival achievements are hidden from the public, so all that’s left is profit for the company on the backs of all these other authors.
I.e., this is not a big deal. The only difference now is that people are frothing to be outraged at the mere sniff of new tech on the horizon. Overton window in effect.
> Or even the only copy ever in existence of an original manuscript?
I think these scenarios would still strip rights from the author. As it stands, I have the right to write the best novel about the human condition ever conceived and also the right (if copyrighted) to not allow anyone to read it. I can light it on fire if I wish. I am not obligated to sell it to anyone. In the context of the above, I can stipulate that nobody may distribute excess copies even if they would otherwise be destroyed. You may think that's wasteful or irrational, but we have all kinds of rights that protect our ability to do irrational things with our own property.
>That's a very good sign that probably an entire book of regulations needs to be thrown out instead, and a new law written to replace it with something more sensible.
This sentiment implies that you do not think the owner has those rights. That’s fine, but there are plenty of people (myself included) who think those are reasonable rights. The intellectual property clause is in the first article of the US Constitution for a good reason, although I do think it can be abused.
What you’re saying is like calling Al Capone a tax cheat. Nonsense.
They went after Aaron over copyright.
Some previous discussions:
https://news.ycombinator.com/item?id=44367850
It's inherent in the nature of the test. The most important fair use factor is the effect on the market for the work, so if the use would be uneconomical without fair use then the effect on the market is negligible because the alternative would be that the use doesn't happen rather than that the author gets paid for it.
> No, that's the transformative element of what a search engine provides. Search engines are not legal because they can't contact each licensor, they are legal because they are considered hugely transformative features.
To make a search engine you have to do two things. One is to download a copy of the whole internet, the other is to create a search index. I'm talking about the first one, you're talking about the second one.
> Okay, and? How many customers does Microsoft bill on a monthly basis?
Microsoft does this with an automated system. There is no single automated system where you can get every book ever written, and separately interfacing with all of the many systems needed in order to do it is the source of the overhead.
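To make the crawl-versus-index distinction concrete, here is a minimal, purely illustrative sketch (the URLs and helper names are invented for this example): step one stores verbatim copies of pages, while step two produces the transformative artifact, an index that maps terms to locations rather than reproducing the works.

    # Illustrative toy only: step 1 is the stored copy of the crawled pages,
    # step 2 is the transformative index built from those copies.
    from collections import defaultdict

    def crawl(pages: dict[str, str]) -> dict[str, str]:
        """Step 1: keep verbatim page text, keyed by URL (a real crawler would fetch over HTTP)."""
        return dict(pages)

    def build_index(corpus: dict[str, str]) -> dict[str, set[str]]:
        """Step 2: map each term to the pages containing it; this is no longer the works themselves."""
        index = defaultdict(set)
        for url, text in corpus.items():
            for term in text.lower().split():
                index[term].add(url)
        return dict(index)

    corpus = crawl({
        "example.com/a": "fair use has four factors",
        "example.com/b": "search engines index the web",
    })
    index = build_index(corpus)
    print(index["index"])   # {'example.com/b'}

Whether the intermediate copies in step one are themselves justified, or only the index produced in step two, is exactly what the two commenters are disagreeing about.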
If you watch a YouTube video to learn something and it's later taken down for using copyrighted images, you learned from illegal content.
I never saw any of these. All the cases I saw were related to people using torrents or other P2P software (which aren't just downloading). These might exist, but I haven't seen them.
> It's a bit surprising that you can suddenly download copyrighted materials for personal use and it's kosher as long as you don't share them with others.
Every click on a link is a risk of downloading copyrighted material you don't have the rights to.
Searching the internet, it appears that it's a civil infraction, but it's also confused with the notion that "piracy" is illegal, a term that's used for many different purposes. I see "It is illegal to download any music or movies that are copyrighted." under legal advice, which I know as a statement is not true.
Hence my confusion.
I should note: I'm not arguing from the perspective of whether it's morally or ethically right. Only that even in the context of this thread, things are phrased that aren't clear.
That the mechanism performing these things is a network encoding data is… well, that description, at that level of abstraction, is a similarity with the way a human does it, not even a difference.
My network is a 3D mess made of pointy lipid-bilayer bags exchanging ions across gaps moderated by the presence of neurochemicals, rather than flat sheets of silicon exchanging electrons across tuned energy band-gaps moderated by other electrons, but it's still a network.
> We’re talking about a machine that accepts content and then produces more content. It’s not a person, it’s owned by a corporation that earns money on literally every word this machine produces. If it didn’t have this large corpus of input data (copyrighted works) it could not produce the output data for which people are willing to pay money. This all happens at a scale no individual could achieve because, as we know, it is a machine.
My brain is a machine that accepts content in the form of job offers and JIRA tickets (amongst other things), and then produces more content in the form of pull requests (amongst other things). For the sake specifically of this question, do the other things make a difference? While I count as a person and am not owned by any corporation, when I work for one, they do earn money on the words this biological machine produces. (And given all the models which are free to use, the LLMs definitely don't earn money on "literally" every word those models produce). If I didn't have the large corpus of input data — and there absolutely was copyright on a lot of the school textbooks and the TV broadcast educational content of the 80s and 90s when I was at school, and the Java programming language that formed the backbone of my university degree — I could not produce the output data for which people are willing to pay money.
Should corporations who hire me be required to pay Oracle every time I remember and use a solution that I learned from a Java course, even when I'm not writing Java?
That the LLMs do this at a scale no individual could achieve because it is a machine, means it's got the potential to wipe me out economically. Economics threat of automation has been a real issue at least since the luddites if not earlier, and I don't know how the dice will fall this time around, so even though I have one layer of backup plan, I am well aware it may not work, and if it doesn't then government action will have to happen because a lot of other people will be in trouble before trouble gets to me (and recent history shows that this doesn't mean "there won't be trouble").
Copyright law is one example of government action. So is mandatory education. So is UBI, but so too is feudalism.
Good luck to us all.
Statutory damages were added to reduce the burden on plaintiffs. Which encourages people to stay in line. How well that worked out, and what it means when some company nobody had heard of four years ago downloads a billion copyrighted pages and raises $3.5 billion against a $60 billion valuation...
Well suddenly $20/page still sounds about right.
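Taking the figures in that comment at face value (they are the commenter's numbers, not anything from a filing), the arithmetic looks like this:

    # Using the numbers from the comment above; nothing here comes from an actual court filing.
    pages = 1_000_000_000           # "a billion copyrighted pages"
    per_page = 20                   # the ~$20-per-sheet-in-today's-dollars figure from the 1790 Act discussion
    valuation = 60_000_000_000      # "$60 billion valuation"

    exposure = pages * per_page
    print(f"${exposure:,}")              # $20,000,000,000
    print(f"{exposure / valuation:.0%}") # 33% of the valuation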
We're talking about lending rather than ownership transfers, though of course you could regard lending as a sort of ownership transfer with an agreement to transfer it back later.
> If you photocopy that book and give them the photocopy, they are not the owner of the book and you have reproduced it without permission.
But then the question is whether the copy is fair use, not who the owner of the original copy was, right? For example, you can make a fair use photocopy of a page from a library book.
> They were not the legal owners of the digital books, which means they can get prosecuted for copyright infringement.
Even if the copy they make falls under fair use and the person who does own that copy of the book has no objection to their doing this?
That suggests otherwise.
Do you mean:
A) It's not a criminal offence?
B) The copyright owner cannot file a civil suit for damages?
C) Something else?
> Statutory damages were added to reduce the burden on plaintiffs. Which encourages people to stay in line.
It encourages people to not spend a lot of resources speculating about damages. That doesn't mean you need the amount to be punitive rather than compensatory.
Claude has been considered transformative given that it's not really meant to generate books, but Suno or Midjourney are absolutely in another category.
Yep, that name's a blast from the past! He was the judge on the big Google/Oracle case about Android and Java years ago, IIRC. I think he even learned to write some Java so he could better understand the case.
If the output from said model uses the voice of another person, for example, we already have a legal framework in place for determining if it is infringing on their rights, independent of AI.
Courts have heard cases of individual artists copying melodies, because melodies themselves are copyrightable: https://www.hypebot.com/hypebot/2020/02/every-possible-melod...
Copyright law is a lot more nuanced than anyone seems to have the attention span for.
No, that's not the most important factor. The transformative factor is the most important. Effect on market for the work doesn't even support your argument anyway. Your argument is about the cost of making the end product, which is totally distinct from the market effects on the copyright holder when the infringer makes and releases the infringing product.
>To make a search engine you have to do two things. One is to download a copy of the whole internet, the other is to create a search index. I'm talking about the first one, you're talking about the second one.
So? That doesn't make you right. Go read the opinions, dude. This isn't something that's actually up for debate. Search engines are fair uses because of their transformative effect, not because they are really expensive otherwise. Your argument doesn't even make sense. By that logic, anything that's expensive becomes a fair use. It's facially ridiculous. Them being expensive is neither sufficient nor necessary for them to be a fair use. Their transformative nature is both sufficient and necessary to be found a fair use. Full stop.
>Microsoft does this with an automated system. There is no single automated system where you can get every book ever written, and separately interfacing with all of the many systems needed in order to do it is the source of the overhead.
Okay, and? They don't need to get every single book ever written. The libraries they pirated do not consist of "every single book ever written". It's hard to take this argument in good faith because you're being so ridiculous.
But Suno is definitely not training models in their basement for fun.
They are a private company selling music, using music made by humans to train their models, to replace human musicians and artists.
We'll see what the courts say but that doesn't sound like fair use.
In the UK it's a criminal offense if you distribute a copyrighted work with the intent to make gain or with the expectation that the owner will make a loss.
Gain and loss are only financial in this context.
Meaning that in both countries the copyright owner can sue you for copyright infringement.
We will have a trial on the pirated copies used to create Anthropic’s central library and the resulting damages, actual or statutory (including for willfulness).
This is an uncharitable interpretation. The ostensible point of the comment, or at least a stronger and still-reasonable interpretation, is that they are trying to point out that this specific word choice confuses concepts, which it does. Richard Stallman and the commenter in question are absolutely correct to point that out. You actually seem to be agreeing with Stallman, at least in the abstract.
It should be acknowledged how and why the meaning of the word changed. As I said, that seems to have been manufactured, which suggests, at least to me, that their (and Richard Stallman's) point is essentially the same as yours. That is to say, the US media industry started paying PR firms to use "piracy" to mean something other than its normal definition until that became the common definition.
They should not purposely use a different definition like that. That is Stallman's point, and why he refuses to say "piracy" instead of "copyright infringement"; ocean banditry is not copyright infringement and it is confusing -- intentionally so -- to say that it is.
Four factor test seems to be working, even in this case. Don't love it (it goes against my values and what I need to do in my job) but I get it.
Edit: we've triggered HN's patience for this discussion and it's now blocking replies. You do seem a bit long on Google and short on practical experience here. How else would you propose these types of disagreements get sorted? ("Anyone can be sued for anything" notwithstanding.)
There are explicitly no punitive damages in US copyright law. And the "willful" provision in practice means demonstrating ongoing disregard after being informed. It's a long walk to the end of that plank.
It's a four factor test because all of the factors are relevant, but if the use has negligible effect on the market for the work then it's pretty hard to get anywhere with the others. For example, for cases like classroom use, even making verbatim copies of the entire work is often still fair use. Buying a separate copy for each student to use for only a few minutes would make that use uneconomical.
> Effect on market for the work doesn't even support your argument anyway. Your argument is about the cost of making the end product, which is totally distinct from the market effects on the copyright holder when the infringer makes and releases the infringing product.
We're talking about the temporary copies they make during training. Those aren't being distributed to anyone else.
> So? That doesn't make you right.
Making a copy of everything on the internet is a prerequisite to making a search engine. It's something you have to do as a step to making the index, which is the transformative step. Are you suggesting that doing the first step is illegal or what do you propose justifies it?
> By that logic, anything that's expensive becomes a fair use. It's facially ridiculous.
Anything with unreasonably high transaction costs. Why is that ridiculous? It doesn't exempt any of the normal stuff like an individual person buying an individual book.
> They don't need to get every single book ever written.
They need to get as many books as possible, with the platonic ideal being every book. Whether or not the ideal is feasible in practice, the question is whether it's socially beneficial to impose a situation with excessively high transaction costs in order to require something with only trivial benefit to authors (potentially selling one extra copy).
You did something where it's not clear whether it's fair use or not. Willfulness is whether you knew you were doing it, not whether you knew it was fair use, which in many cases nobody knows until a court decides it; hence the problem.
You have to do it in order to get into court and find out if you're allowed to do it (a ridiculous prerequisite to begin with), and then if it goes against you, you have to pay punitive damages?