
Anthropic cut up millions of used books, and downloaded 7M pirated ones – judge

(www.businessinsider.com)

397 points pyman | 1 comments | 07 Jul 25 09:20 UTC


nickpsecurity ◴[07 Jul 25 15:00 UTC] No.44491093[source]▶

>>44488331 (OP) #

Buying, scanning, and discarding was in my proposal to train under copyright restrictions.

You are often allowed to make a digital copy of a physical work you bought. There are tons of used, physical works that would be good for training LLMs. They'd also be good for training OCR, which could do many things, including improving book scanning for training.

This could be reduced to a single act of book destruction per copyrighted work, or made unnecessary if copyright law allowed us to share digital copies of others' works with their licensed customers, e.g. people who own a physical copy or a license to one. Obviously, the implementation could get complex, but we wouldn't have to destroy books very often.

replies(1): >>44491433 #

asadotzler[dead post] ◴[07 Jul 25 15:33 UTC] No.44491433[source]▶

[flagged]

1. nickpsecurity ◴[07 Jul 25 17:28 UTC] No.44492655[source]▶

That's true and was the distinction I was making. In my proposal, and maybe part of what Anthropic did, the digitized copies are used as training data for a new work, the model. That reduces the risk of legal rulings against using the copyrighted works.

From there, the cases would likely focus on whether that fits established criteria for digitized copies, whether they're allowed in the training process itself, and the copyright status of the resulting model. Some countries allow all of that if you legally obtained the material in the first place. Also, they might factor in whether it's for commercial use or not.