Anthropic cut up millions of used books, and downloaded 7M pirated ones – judge

(www.businessinsider.com)

pyman ◴[07 Jul 25 09:20 UTC] No.44488332[source]▶

Anthropic's cofounder, Ben Mann, downloaded millions of copies of books from Library Genesis in 2021, fully aware that the material was pirated.

Stealing is stealing. Let's stop with the double standards.

replies(8): >>44488391 #>>44488540 #>>44488816 #>>44490720 #>>44491032 #>>44491583 #>>44492035 #>>44493242 #

originalvichy ◴[07 Jul 25 09:55 UTC] No.44488540[source]▶

>>44488332 #

At least most pirates just consume for personal use. Profiting from piracy is a whole other level beyond just pirating a book.

replies(4): >>44488621 #>>44488853 #>>44489003 #>>44490718 #

KoolKat23 ◴[07 Jul 25 11:13 UTC] No.44489003[source]▶

>>44488540 #

This isn't really profiting from piracy. They don't make money off the raw input data. It's no different to consuming for personal use.

They make money off the model weights, which is fair use (as confirmed by recent case law).

replies(1): >>44489216 #

1. j_w ◴[07 Jul 25 11:36 UTC] No.44489216[source]▶

>>44489003 #

This is absurd. Remove all of the content from the training data that was pirated and what is the quality of the end product now?

replies(2): >>44489279 #>>44489283 #

2. pyman ◴[07 Jul 25 11:43 UTC] No.44489279[source]▶

>>44489216 (TP) #

With Claude, people are paying Anthropic to access answers that are generated from pirated books, without the authors' permission, credit, or compensation.

replies(1): >>44489304 #

3. KoolKat23 ◴[07 Jul 25 11:43 UTC] No.44489283[source]▶

>>44489216 (TP) #

That's the law.

Please keep in mind, copyright is intended as a compromise between benefit to society and to the individual.

A thought experiment: what about students who pirate textbooks and apply that knowledge later on in their work?

replies(2): >>44489587 #>>44495512 #

4. KoolKat23 ◴[07 Jul 25 11:45 UTC] No.44489304[source]▶

>>44489279 #

There is no copyright on knowledge.

If it outputs parts of the book verbatim then that's a different story.

replies(2): >>44489612 #>>44492025 #

5. j_w ◴[07 Jul 25 12:19 UTC] No.44489587[source]▶

>>44489283 #

When you say that's the law, as far as I'm aware a single ruling by a lower court has been issued which upholds that application. Hardly settled case law.

replies(1): >>44489760 #

6. pyman ◴[07 Jul 25 12:22 UTC] No.44489612{3}[source]▶

>>44489304 #

Let's not change the focus of the debate.

Pirating 7 million books, remixing their content, and using that to power Claude.ai is like counterfeiting 7 million branded products and selling them on your personal website. The original creators don't get credit or payment, and someone’s profiting off their work.

All this happens while authors, many of them teachers, are left scratching their heads with four kids to feed.

replies(1): >>44489775 #

7. KoolKat23 ◴[07 Jul 25 12:38 UTC] No.44489760{3}[source]▶

>>44489587 #

True, but until then it's best to act as if it is the case.

In my opinion, it will be upheld.

Looking at what is stored and the manner in which it is stored, it makes sense that it's fair use.

replies(1): >>44492896 #

8. KoolKat23 ◴[07 Jul 25 12:40 UTC] No.44489775{4}[source]▶

>>44489612 #

That may be the case, but you'd have to have laws changed.

9. SirMaster ◴[07 Jul 25 16:31 UTC] No.44492025{3}[source]▶

>>44489304 #

>If it outputs parts of the book verbatim then that's a different story.

But it does...

10. j_w ◴[07 Jul 25 17:50 UTC] No.44492896{4}[source]▶

>>44489760 #

We're talking about a summary judgment that has been issued and not yet appealed. That doesn't make it "settled."

If "what is stored and the manner which it is stored" is intended to mean model weights, I'm not sure what the argument is. The four fair use factors in no way mention a storage medium for data, lossless or lossy.

(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes; (2) the nature of the copyrighted work; (3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and (4) the effect of the use upon the potential market for or value of the copyrighted work.

In my opinion, this will likely see a Supreme Court ruling by the end of the decade.

replies(1): >>44493389 #

11. KoolKat23 ◴[07 Jul 25 18:40 UTC] No.44493389{5}[source]▶

>>44492896 #

The use is to train an AI model.

A trillion-parameter SOTA model is not substantially comprised of the one copyrighted piece. (If it were a Harry Potter model trained only on Harry Potter books, this would be a different story.)

Embeddings are not copy paste.

The last point, about market impact, would be where they make their argument, but it's tenuous. Reproducing the work is not the primary use of AI models, and built-in prompts try to avoid it, so it shouldn't be commonplace unless you're jailbreaking the model, and most people aren't.

replies(1): >>44495528 #

12. nwienert ◴[07 Jul 25 23:12 UTC] No.44495512[source]▶

>>44489283 #

It's the law (for now: this is very early in the process of deciding the law, untested, appealable, and likely to be appealed and tested many times in many ways).

Meanwhile, other cases have been less friendly to it being fair use, and AI companies are already paying vast sums to publishers, which presumably they wouldn’t do if they felt confident it was “the law”, and on and on.

I don’t like arguing from “it’s the law”. A lot of law is terrible. What’s right? It’s clear to me that if AI gets good enough, as it nearly is now, it sucks a lot of profit away from creators. That is unbalanced. The AI doesn’t exist without the creators, and the creators need to exist for our society to be great (we want new creative works, more if anything). Law tends to start conservatively, based on historical precedent, and when a new technology comes along it often errs on the side of letting it do some damage to avoid setting a bad precedent. In time it catches up as society gets a better view of things.

The right thing is likely not to let our creative class be decimated so that a few tech companies can become fantastically wealthy. In the long run, it’s the right thing even for the techies.

13. nwienert ◴[07 Jul 25 23:16 UTC] No.44495528{6}[source]▶

>>44493389 #

I bet it’s pretty easy to reproduce enough of Harry Potter from these models that any judge would see it as not fair use. You’d just have to prompt it in the right way. I’d bet a large sum that when this eventually shakes through the Supreme Court, it won’t be deemed entirely fair use, for the better of the world.
