←back to thread

989 points acomjean | 1 comments | | HN request time: 1.146s | source
Show context
aeon_ai ◴[] No.45143392[source]
To be very clear on this point - this is not related to model training.

It’s important in the fair use assessment to understand that the training itself is fair use, but the pirating of the books is the issue at hand here, and is what Anthropic “whoopsied” into in acquiring the training data.

Buying used copies of books, scanning them, and training on it is fine.

Rainbows End was prescient in many ways.

replies(36): >>45143460 #>>45143461 #>>45143507 #>>45143513 #>>45143567 #>>45143731 #>>45143840 #>>45143861 #>>45144037 #>>45144244 #>>45144321 #>>45144837 #>>45144843 #>>45144845 #>>45144903 #>>45144951 #>>45145884 #>>45145907 #>>45146038 #>>45146135 #>>45146167 #>>45146218 #>>45146268 #>>45146425 #>>45146773 #>>45146935 #>>45147139 #>>45147257 #>>45147558 #>>45147682 #>>45148227 #>>45150324 #>>45150567 #>>45151562 #>>45151934 #>>45153210 #
1. nickpsecurity ◴[] No.45150324[source]
I don't believe that's true. Most work I've read on fair use suggests it has to be a small amount, selectively used, substantially transformed, and not compete with content creators. These AI's training are the opposite of all that. I was surprised of a ruling like this but Alsup is a unique judge.

Additionally, sharing copyrighted works without permission... the data sets or data lakes... is its own tort. You're guilty just for sharing copies before even training. Some copyrighted works are also commercial, copyright with ban on others' commercial use, and patented. Some are NDA'd but 3rd party leaked them. Sources like Common Crawl probably have plenty of such content.

Additionally, there's often contractual terms of use on accessing the content. Even Singapore's and others laws allowing training on copyrighted content usually require that you lawfully accessed that content in the first place. The terms of use are the weakest link there.

I'd like to see these two issues turned by law into a copyright exception that no contract can override. It needs to specifically allow sharing scraped, publicly-visible content. Anything you can just view or download which the copyright owner put up. The law might impose or allow limits on daily scraping quantity, volume, etc to avoid damage scrapers are doing.