393 points pyman | 2 comments
bgwalter ◴[] No.44490836[source]
Here is how individuals are treated for massive copyright infringement:

https://investors.autodesk.com/news-releases/news-release-de...

replies(8): >>44490942 #>>44491257 #>>44491526 #>>44491536 #>>44491907 #>>44493281 #>>44493918 #>>44493925 #
farceSpherule ◴[] No.44491907[source]
Peterson was copying and selling pirated software.

Come up with a better comparison.

replies(1): >>44491926 #
organsnyder ◴[] No.44491926[source]
Anthropic is selling a service that incorporates these pirated works.
replies(1): >>44492293 #
adolph ◴[] No.44492293[source]
That a service incorporating the authors' works exists is not at issue. The plaintiffs' claims are, as summarized by Alsup:

  First, Authors argue that using works to train Claude’s underlying LLMs 
  was like using works to train any person to read and write, so Authors 
  should be able to exclude Anthropic from this use (Opp. 16). 

  Second, to that last point, Authors further argue that the training was 
  intended to memorize their works’ creative elements — not just their 
  works’ non-protectable ones (Opp. 17).

  Third, Authors next argue that computers nonetheless should not be 
  allowed to do what people do. 
https://media.npr.org/assets/artslife/arts/2025/order.pdf
replies(4): >>44492411 #>>44492758 #>>44492890 #>>44493381 #
TeMPOraL ◴[] No.44493381[source]
The first paragraph sounds absurd, so I looked into the PDF, and here's the full version I found:

> First, Authors argue that using works to train Claude’s underlying LLMs was like using works to train any person to read and write, so Authors should be able to exclude Anthropic from this use (Opp. 16). But Authors cannot rightly exclude anyone from using their works for training or learning as such. Everyone reads texts, too, then writes new texts. They may need to pay for getting their hands on a text in the first instance. But to make anyone pay specifically for the use of a book each time they read it, each time they recall it from memory, each time they later draw upon it when writing new things in new ways would be unthinkable. For centuries, we have read and re-read books. We have admired, memorized, and internalized their sweeping themes, their substantive points, and their stylistic solutions to recurring writing problems.

Couldn't have put it better myself (though $deity knows I tried many times on HN). Glad to see Judge Alsup continues to be the voice of common sense in legal matters around technology.

replies(2): >>44493990 #>>44494761 #
1. cmiles74 ◴[] No.44493990[source]
For everyone arguing that there’s no harm in anthropomorphizing an LLM, witness this rationalization. They talk about training and learning as if these were somehow comparable to human activities. The idea that LLM training is comparable to a person learning seems way out there to me.

“We have admired, memorized, and internalized their sweeping themes, their substantive points, and their stylistic solutions to recurring writing problems.”

Claude is not doing any of these things. There is no admiration, no internalizing of sweeping themes. There’s a network encoding data.

We’re talking about a machine that accepts content and then produces more content. It’s not a person; it’s owned by a corporation that earns money on literally every word this machine produces. If it didn’t have this large corpus of input data (copyrighted works) it could not produce the output data for which people are willing to pay money. This all happens at a scale no individual could achieve because, as we know, it is a machine.

replies(1): >>44494516 #
2. ben_w ◴[] No.44494516[source]
There may be no admiration, but there definitely is an internalising of sweeping themes, and all the other things in your quotation, which anyone can fetch by asking it for the themes/substantive points/stylistic solutions of one of the books it has (for lack of a better verb) read.

That the mechanism performing these things is a network encoding data is… well, that description, at that level of abstraction, is a similarity with the way a human does it, not even a difference.

My network is a 3D mess made of pointy lipid-bilayer bags exchanging ions across gaps moderated by the presence of neurochemicals, rather than flat sheets of silicon exchanging electrons across tuned energy band-gaps moderated by other electrons, but it's still a network.

> We’re talking about a machine that accepts content and then produces more content. It’s not a person, it’s owned by a corporation that earns money on literally every word this machine produces. If it didn’t have this large corpus of input data (copyrighted works) it could not produce the output data for which people are willing to pay money. This all happens at a scale no individual could achieve because, as we know, it is a machine.

My brain is a machine that accepts content in the form of job offers and JIRA tickets (amongst other things), and then produces more content in the form of pull requests (amongst other things). For the sake specifically of this question, do the other things make a difference? While I count as a person and am not owned by any corporation, when I work for one, they do earn money on the words this biological machine produces. (And given all the models which are free to use, the LLMs definitely don't earn money on "literally" every word those models produce). If I didn't have the large corpus of input data — and there absolutely was copyright on a lot of the school textbooks and the TV broadcast educational content of the 80s and 90s when I was at school, and the Java programming language that formed the backbone of my university degree — I could not produce the output data for which people are willing to pay money.

Should corporations who hire me be required to pay Oracle every time I remember and use a solution that I learned from a Java course, even when I'm not writing Java?

That the LLMs do this at a scale no individual could achieve, because they are machines, means they have the potential to wipe me out economically. The economic threat of automation has been a real issue at least since the Luddites, if not earlier, and I don't know how the dice will fall this time around. So even though I have one layer of backup plan, I am well aware it may not work, and if it doesn't, government action will have to happen, because a lot of other people will be in trouble before trouble gets to me (and recent history shows that this doesn't mean "there won't be trouble").

Copyright law is one example of government action. So is mandatory education. So is UBI, but so too is feudalism.

Good luck to us all.