    397 points pyman | 17 comments
    1. trinsic2 ◴[] No.44491270[source]
    I'm not seeing how this is fair use in either case.

    Someone correct me if I am wrong but aren't these works being digitized and transformed in a way to make a profit off of the information that is included in these works?

    It would be one thing for an individual to make personal use of one or more books, but you've got to have some special blindness not to see that a for-profit company's use of this information to improve a for-profit model clearly goes against what copyright stands for.

    replies(8): >>44491399 #>>44491424 #>>44491457 #>>44491657 #>>44492008 #>>44492099 #>>44493528 #>>44495414 #
    2. jimbob21 ◴[] No.44491399[source]
    They clearly were being digitized, but I think whether or not it is fair use is a more philosophical question, one we're only banging our heads against for the first time.

    Simply, if the models can think then it is no different than a person reading many books and building something new from their learnings. Digitization is just memory. If the models cannot think then it is meaningless digital regurgitation and plagiarism, not to mention breach of copyright.

    The quotes "consistent with copyright's purpose in enabling creativity and fostering scientific progress." and "Like any reader aspiring to be a writer" say, from what I can tell, that the judge has legally ruled the model can think as a human does, and therefore has the legal protections afforded to "creatives."

    replies(1): >>44491503 #
    3. wrs ◴[] No.44491424[source]
    Copyright is not on “information”; it's on the tangible expression (i.e., the actual words). “Transformative use” is a defense to copyright infringement.
    4. kristofferR ◴[] No.44491457[source]
    What do you think fair use is? The whole point of the fair use clauses is that if you transform copyrighted works enough you don't have to pay the original copyright holder.
    replies(1): >>44493373 #
    5. palmotea ◴[] No.44491503[source]
    > Simply, if the models can think then it is no different than a person reading many books and building something new from their learnings.

    No, that's fallacious. Using anthropomorphic words to describe a machine does not give it the same kinds of rights and affordances we give real people.

    replies(2): >>44491612 #>>44492174 #
    6. jimbob21 ◴[] No.44491612{3}[source]
    Actually, it does, at least for this case. The judge just said so.
    replies(2): >>44492115 #>>44494364 #
    7. skybrian ◴[] No.44491657[source]
    Copyright is largely about distributing copies. It’s not about making something vaguely similar or about referencing copyrighted work to make something vaguely similar.

    Although, there’s an exception for fictional characters:

    https://en.m.wikipedia.org/wiki/Copyright_protection_for_fic...

    8. pavon ◴[] No.44492008[source]
    There is another case that makes a good comparison, where companies slurped up the entire internet and profited off the information: search engines.

    Judges consider a four-factor test when examining fair use[1]. For search engines:

    1) The use is transformative, as a tool to find content serves a very different purpose than the content itself.

    2) The nature of the original works runs the full gamut, so search engines don't get points for only consuming factual data, but it was all publicly viewable by anyone, as opposed to books, which require payment.

    3) The search engine stores significant portions of the works in its index, but it only redistributes small portions.

    4) Search engines, as originally devised, don't compete with the originals; in fact, they can improve the potential market for the originals by helping more people find them. This has changed over time, though, and search engines are increasingly competing with the content they index, intentionally trying to show the information people want on the search page itself.

    So traditional search, which was transformative, only republished small amounts of the originals, and didn't compete with the originals, fell firmly on the side of fair use.

    Google News and Books, on the other hand, weren't so clear-cut, as they showed larger portions of the works and competed with the originals. Google had to make changes to those products as a result of lawsuits.

    So now let's look at LLMs:

    1) LLMs are absolutely transformative. Generating new text at a user's request is a very different purpose and character from that of the original works.

    2) Again, the works run the full gamut (setting aside the clearly infringing downloading of illegally distributed books, which is a separate issue).

    3) For training purposes, LLMs don't typically preserve entire works, so the model is in a better place legally than a search index, for which there is precedent that storing entire works privately can be fair use depending on the other factors. For inference, even though LLMs are less likely than search engines to reproduce the originals in their outputs, there are failure cases where an LLM over-trained on a work and a significant amount of the original can be reproduced.

    4) LLMs have tons of uses, some of which complement the original works and some of which compete directly with them. Because of this, it is likely that whether LLMs are fair use will depend on how they are being used - e.g., ignore the LLM altogether, consider solely the output, and ask whether it would be infringing if a human had created it.

    This case was solely about whether training on books is fair use, and did not consider any uses of the LLM. Because LLMs are a very transformative use, and because they don't store the originals verbatim, it weighs strongly as being fair use.

    I think the real problems that LLMs face will be in factors 3 and 4, which are very much context-specific. The judge himself said that the plaintiffs are free to file additional lawsuits if they believe the LLM outputs duplicate the original works.

    [1] https://fairuse.stanford.edu/overview/fair-use/four-factors/

    9. NoMoreNicksLeft ◴[] No.44492099[source]
    Digitizing the books is the equivalent of a blind person doing something to the book to make it readable to them... the software can't read analog pages.

    Learning from the book is, well, learning from the book. Yes, they intended to make money off of that learning... but then I guess a medical student reading medical textbooks intends to profit off of what they learn from them. Guess that's not fair use either (well, it's really just use, as in the intended use for all books since they were first invented).

    Once a person has to believe that copyright has any moral weight at all, I guess all rational thought becomes impossible for them. Somehow, they're not capable of entertaining the idea that copyright policy was only ever supposed to be this pragmatic thing to incentivize creative works... and that whatever little value it has disappears entirely once the policy is twisted to consolidate control.

    10. NoOn3 ◴[] No.44492115{4}[source]
    People have rights, machines don't. Otherwise, maybe give machines the right to vote, for example?...
    replies(2): >>44493332 #>>44495432 #
    11. pavon ◴[] No.44492174{3}[source]
    The judge did use some language that analogized the training to human learning. I don't read it as basing the legal judgement on anthropomorphizing the LLM, though, but rather as reasoning that if it would be legal for a human to do the same thing, then it is legal for a human to use a computer to do it.

      First, Authors argue that using works to train Claude’s underlying LLMs was like using
      works to train any person to read and write, so Authors should be able to exclude Anthropic
      from this use (Opp. 16). But Authors cannot rightly exclude anyone from using their works for
      training or learning as such. Everyone reads texts, too, then writes new texts. They may need
      to pay for getting their hands on a text in the first instance. But to make anyone pay
      specifically for the use of a book each time they read it, each time they recall it from memory,
      each time they later draw upon it when writing new things in new ways would be unthinkable.
      For centuries, we have read and re-read books. We have admired, memorized, and internalized
      their sweeping themes, their substantive points, and their stylistic solutions to recurring writing
      problems.
    
      ...
    
      In short, the purpose and character of using copyrighted works to train LLMs to generate
      new text was quintessentially transformative. Like any reader aspiring to be a writer,
      Anthropic’s LLMs trained upon works not to race ahead and replicate or supplant them — but
      to turn a hard corner and create something different. If this training process reasonably
      required making copies within the LLM or otherwise, those copies were engaged in a
      transformative use.
    
    [1] https://authorsguild.org/app/uploads/2025/06/gov.uscourts.ca...
    12. kube-system ◴[] No.44493332{5}[source]
    This case is more like:

    If a human uses a voting machine, they still have a right to vote.

    Machines don't have rights. The human using the machine does.

    13. kube-system ◴[] No.44493373[source]
    Fair use is not, at its core, about transformation. It's about many types of uses that do not interfere with the reasons for the rights we ascribe to authors. Fair use doesn't require transformation.
    14. kenmacd ◴[] No.44493528[source]
    > to make a profit off of the information that is included in these works?

    Isn't that what a lot of companies are doing, just through employees? I read a lot of books, and took a lot of courses, and now a company is profiting off that information.

    15. ◴[] No.44494364{4}[source]
    16. protocolture ◴[] No.44495414[source]
    >clearly going against what copyright stands for.

    Copyright isn't a digital moat. It's largely an agreement that the work is available to the public, but the creator has a limited amount of time to exploit it at market.

    If you sell an AI model, or access to an AI model, there's usually around 0% of the training data redistributed with the model. You can't decompile it and find the book. As you aren't redistributing the original work, copyright is barely relevant.

    Imagine suggesting that because you own the design of a hammer, all works created with the hammer belong to you and can't be sold.

    That someone came up with a new method of using books as a tool to create a different work does not entitle the original book author to a cut of the pie.

    17. protocolture ◴[] No.44495432{5}[source]
    If I can use my brain to learn, I as a human can use my computer to learn.

    It's like taking notes, or Google Image Search caching thumbnails. Honestly, we don't even need the learning metaphor to see this is obviously not an infringement.