397 points by pyman | 14 comments
    1. guywithahat ◴[] No.44491931[source]
If you own a book, it should be legal for your computer to take a picture of it. I honestly feel bad for some of these AI companies, because the rules around copyright are being changed just to target them. I don't owe royalties for every book I read just because I may subconsciously incorporate its ideas into my future work.
    replies(6): >>44491968 #>>44491997 #>>44492019 #>>44492128 #>>44492134 #>>44492187 #
    2. raincole ◴[] No.44491968[source]
    Are we reading the same article? The article explicitly states that it's okay to cut up and scan the books you own to train a model from them.

    > I honestly feel bad for some of these AI companies because the rules around copyright are changing just to target them

The ruling would be a huge win for AI companies if upheld. It's really weird that you reached the opposite conclusion.

    3. rapind ◴[] No.44491997[source]
    Everything is different at scale. I'm not giving a specific opinion on copyright here, but it just doesn't make sense when we try to apply individual rights and rules to systems of massive scale.

    I really think we need to understand this as a society and also realize that moneyed interests will downplay this as much as possible. A lot of the problems we're having today are due to insufficient regulation differentiating between individuals and systems at scale.

    4. organsnyder ◴[] No.44492019[source]
    The difference here is that an LLM is a mechanical process. It may not be deterministic (at least, in a way that my brain understands determinism), but it's still a machine.

    What you're proposing is considering LLMs to be equal to humans when considering how original works are created. You could make the argument that LLM training data is no different from a human "training" themself over a lifetime of consuming content, but that's a philosophical argument that is at odds with our current legal understanding of copyright law.

    replies(2): >>44492057 #>>44492121 #
    5. ◴[] No.44492057[source]
    6. kevinpet ◴[] No.44492121[source]
That's not a philosophical argument at odds with our current understanding of copyright law. It's exactly what this judge found copyright law currently to be, and it's quoted in the article being discussed.
    replies(1): >>44492563 #
    7. zerotolerance ◴[] No.44492128[source]
    "Judge says training Claude on books was fair use, but piracy wasn't."
    8. atomicnumber3 ◴[] No.44492134[source]
The core problem here is that copyright already doesn't follow any consistent logical reasoning ("information wants to be free" and so on). So whether anything counts as fair use, or as copyrighted, or as infringement will always come down to a judge's personal take on that pile of logical contradictions. Remember, nominally, the sole purpose of copyright is not rooted in any notion of fairness or profitability; it's specifically to incentivize innovation.

So what is the right interpretation of the law with regard to how AI uses it? What better incentivizes innovation? Do we let AI companies scan everything because AI is innovative? Or do we think letting AI vacuum up creative works and then stochastically regurgitate tiny (or not-so-tiny) slices of them at a time will hurt innovation elsewhere?

But obviously the real answer here is money. Copyright is powerful because moneyed interests want it to be. Now that copyright stands in the way of moneyed interests for perhaps the first time, we'll see how dedicated we actually were to the justifications we've been hearing for DRM and copyright over the last several decades.

    9. Bjorkbat ◴[] No.44492187[source]
Something missed in arguments like these is that the fair use analysis considers the impact on the potential market for a rightsholder's present and future works. In other words, can it be proven that what you are doing meaningfully deprives the author of future income?

    Now, in theory, you learning from an author's works and competing with them in the same market could meaningfully deprive them of income, but it's a very difficult argument to prove.

On the other hand, with AI companies it's an easier argument to make. If Anthropic trained on all of your books (which is somewhat likely if you're a fairly popular author) and you saw a substantial loss of income after the release of one of their better models (presumably because people are just using the LLM to write their own stories rather than buying your books), then it's a little easier to connect the dots: a company used your works to build a machine that competes with you, which arguably weighs against a finding of fair use.

This gets to the very principle of copyright: you shouldn't have to compete against "yourself" because someone copied you.

    replies(1): >>44492431 #
    10. parliament32 ◴[] No.44492431[source]
    > a consideration of impact on the potential market for a rightsholder's present and future works

    This is one of those mental gymnastics exercises that makes copyright law so obtuse and effectively unenforceable.

As an alternative, imagine a scriptwriter buys a textbook on orbital mechanics while writing Gravity (2013). A large number of people watch the finished film, learn something about orbital mechanics, and therefore no longer need the textbook, causing a loss of revenue for the textbook author. Should the author be entitled to a percentage of Gravity's profits?

We'd be better off abolishing everything related to copyright and IP law altogether. These laws might've made sense back in the days of the printing press, but they're just nonsensical nowadays.

    replies(1): >>44493199 #
    11. organsnyder ◴[] No.44492563{3}[source]
    Thanks for pointing that out. Obviously I hadn't read the whole article. That is an interesting determination the judge made:

    > Alsup ruled that Anthropic's use of copyrighted books to train its AI models was "exceedingly transformative" and qualified as fair use, a legal doctrine that allows certain uses of copyrighted works without the copyright owner's permission.

    replies(1): >>44492606 #
    12. JoeAltmaier ◴[] No.44492606{4}[source]
    There are still questions: is an AI a 'user' in the copyright sense?

Or even: is an individual operating within the law under fair use the same, in spirit, as a voracious, all-consuming AI training bot ingesting everything?

Consider a single person in a National Park who is allowed to pick and eat berries, compared to someone bringing in a combine harvester to take it all.

    13. Bjorkbat ◴[] No.44493199{3}[source]
    Personally I think a more effective analogy would be if someone used a textbook and created an online course / curriculum effective enough that colleges stop recommending the purchase of said textbook. It's honestly pretty difficult to imagine a movie having a meaningful impact on the sale of textbooks since they're required for high school / college courses.

So here's the thing: I don't think a textbook author going up against a purveyor of online courseware has much of a chance, nor do I think they should, because they probably lack meaningful proof that their work contributed to the creation of the courseware. Would I feel differently if the textbook author could prove in court, with receipts, that a substantial amount of their material contributed to the courseware? I think that's where things get murky. If you can actually prove that your works made a meaningful contribution to the thing you're now competing against, then maybe you have a point. The tricky part is defining "meaningful." An individual author doesn't make a meaningful contribution to the training of an LLM, but a large number of popular and/or prolific authors can.

You bring up a good point: interpretation of fair use is difficult. But at the end of the day I really don't think we should abolish copyright and IP altogether. I think it's a good thing that creative professionals have some security in knowing they have legal protections against having to "compete against themselves."

    replies(1): >>44493871 #
    14. TeMPOraL ◴[] No.44493871{4}[source]
> An individual author doesn't make a meaningful contribution to the training of an LLM, but a large number of popular and/or prolific authors can.

That's a point I normally use to argue against authors being entitled to royalties on LLM outputs. An individual author's marginal contribution to an LLM is essentially nil; their work could be removed from the training set with no meaningful impact on the model. It's only the accumulation of a very large number of works that produces a capable LLM.