
419 points | pyman | 1 comment
dehrmann No.44491718
The important parts:

> Alsup ruled that Anthropic's use of copyrighted books to train its AI models was "exceedingly transformative" and qualified as fair use

> "All Anthropic did was replace the print copies it had purchased for its central library with more convenient space-saving and searchable digital copies for its central library — without adding new copies, creating new works, or redistributing existing copies"

It was always somewhat obvious that pirating a library would be copyright infringement. The interesting findings here are that scanning and digitizing a library for internal use is OK, and using it to train models is fair use.

6gvONxR4sf7o No.44491944
You skipped quotes about the other important side:

> But Alsup drew a firm line when it came to piracy.

> "Anthropic had no entitlement to use pirated copies for its central library," Alsup wrote. "Creating a permanent, general-purpose library was not itself a fair use excusing Anthropic's piracy."

That is, he ruled that

- buying, physically cutting up, physically digitizing books, and using them for training is fair use

- pirating the books for their digital library is not fair use.

pier25 No.44493580
> buying, physically cutting up, physically digitizing books, and using them for training is fair use

So Suno would only really need to buy the physical albums and rip them to be able to generate music at an industrial scale?

conradev No.44494779
Yes! Training and generation are fair use. You are free to train and generate whatever you want in your basement for whatever purpose you see fit. Build a music collection, go ham.

If the output from said model uses the voice of another person, for example, we already have a legal framework in place for determining if it is infringing on their rights, independent of AI.

Courts have heard cases of individual artists copying melodies, because melodies themselves are copyrightable: https://www.hypebot.com/hypebot/2020/02/every-possible-melod...

Copyright law is a lot more nuanced than anyone seems to have the attention span for.

pier25 No.44494822
> Yes!

But Suno is definitely not training models in their basement for fun.

They are a private company selling music, using music made by humans to train their models, to replace human musicians and artists.

We'll see what the courts say but that doesn't sound like fair use.

conradev No.44495390
My understanding is that Suno does not sell music, but instead makes a tool for musicians to generate music and sells access to this tool.

The law doesn't distinguish between basement and cloud – it's a service. You can sell access to the service without selling songs to consumers.

pyman No.44495608
What does "fair use" even mean in a world where models can memorise and remix every book and song ever written? Are we erasing ownership?

The problem is, copyright law wasn't written for machines. It was written for humans who create things.

In the case of songs (or books, paintings, etc.), only humans and companies can legally own copyright; a machine can't. If an AI-powered tool generates a song, there's no author in the legal sense, unless the person using the tool claims authorship by saying they operated the tool.

So we're stuck in a grey zone: the input is human, the output is AI generated, and the law doesn't know what to do with that.

For me the real debate is: Do we need new rules for non-human creation?

markhahn No.44495950
Why are you saying "memorize"? Are people training AIs to regurgitate exact copies? If so, that's just copying. If they return something that is not a literal copy of the whole work, then there is established case law about how much is permitted. Some copying clearly is allowed, but not entire works.

When you buy a book, you are not acceding to a license to only ever read it with human eyes, forbearing to memorize it, never to quote it, never to be inspired by it.

mwarkentin No.44496065
> Specifically, the paper estimates that Llama 3.1 70B has memorized 42 percent of the first Harry Potter book well enough to reproduce 50-token excerpts at least half the time. (I’ll unpack how this was measured in the next section.)

> Interestingly, Llama 1 65B, a similar-sized model released in February 2023, had memorized only 4.4 percent of Harry Potter and the Sorcerer's Stone. This suggests that despite the potential legal liability, Meta did not do much to prevent memorization as it trained Llama 3. At least for this book, the problem got much worse between Llama 1 and Llama 3.

> Harry Potter and the Sorcerer's Stone was one of dozens of books tested by the researchers. They found that Llama 3.1 70B was far more likely to reproduce popular books—such as The Hobbit and George Orwell’s 1984—than obscure ones. And for most books, Llama 3.1 70B memorized more than any of the other models.
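For context, the probe described in the quoted article amounts to a prefix-continuation test: feed the model the opening of a passage and check whether it reproduces the next 50 tokens. Here is a toy sketch of that bookkeeping, with a stand-in `fake_model` function in place of a real LLM (the actual study scored continuations by their probability under Llama's weights, not by exact string matching):

```python
# Toy sketch of a prefix-continuation memorization probe.
# fake_model is a hypothetical stand-in for an LLM completion call;
# the real study used token probabilities from Llama, not a lookup table.

def fake_model(prefix: str) -> str:
    """Pretend LLM: 'remembers' one famous passage, babbles otherwise."""
    memorized = {
        "Mr. and Mrs. Dursley, of number four,": " Privet Drive,",
    }
    return memorized.get(prefix, " [unrelated text]")

def memorization_rate(passages, model) -> float:
    """Fraction of (prefix, continuation) pairs the model reproduces.

    A passage counts as memorized if the model's completion starts with
    the exact continuation; the paper instead required the continuation
    to have at least 50% probability under the model.
    """
    hits = 0
    for prefix, continuation in passages:
        if model(prefix).startswith(continuation):
            hits += 1
    return hits / len(passages)

passages = [
    ("Mr. and Mrs. Dursley, of number four,", " Privet Drive,"),
    ("It was a bright cold day in April,", " and the clocks"),
]
rate = memorization_rate(passages, fake_model)  # 1 of 2 passages matched
```

Run against a real model, the same loop over many excerpts of a book yields the per-book percentages the researchers report (e.g. 42 percent for the first Harry Potter book under Llama 3.1 70B).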