
989 points acomjean | 2 comments
petralithic ◴[] No.45143482[source]
This is sad for open source AI. Piracy for the purpose of model training should also be fair use, because otherwise only big companies like Anthropic, which can afford to pay off publishers, will be able to do so. There is no way to buy billions of books just for model training; it simply can't happen.
replies(9): >>45143523 #>>45143780 #>>45143876 #>>45144861 #>>45145004 #>>45145076 #>>45146993 #>>45147328 #>>45148584 #
sefrost ◴[] No.45143780[source]
I wonder how much it would cost to buy every book that you'd want to train a model on.
replies(1): >>45144140 #
GMoromisato ◴[] No.45144140[source]
500,000 x $20 = $10 million

Obviously there would be handling costs + scanning costs, so that’s the floor.

Maybe $20 million total? Plus, of course, the time it would take to execute.
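
A rough sketch of that arithmetic, for anyone who wants to play with the numbers. The book count and price come from the estimate above; the $20-per-book handling and scanning overhead is only an assumption backed out of the "$20 million total" guess, and the function names are made up for illustration.

```python
# Back-of-the-envelope cost estimate using the figures from this comment.
BOOK_COUNT = 500_000        # number of books, from the comment
PRICE_PER_BOOK_USD = 20     # purchase price per book, from the comment

# Assumption: per-book handling + scanning overhead, chosen so the total
# matches the "maybe $20 million" guess. Not a measured figure.
OVERHEAD_PER_BOOK_USD = 20


def purchase_floor(book_count: int, price_per_book: int) -> int:
    """Cost of buying the books alone (the 'floor')."""
    return book_count * price_per_book


def total_cost(book_count: int, price_per_book: int, overhead_per_book: int) -> int:
    """Purchase cost plus the assumed per-book handling and scanning overhead."""
    return book_count * (price_per_book + overhead_per_book)


floor = purchase_floor(BOOK_COUNT, PRICE_PER_BOOK_USD)
total = total_cost(BOOK_COUNT, PRICE_PER_BOOK_USD, OVERHEAD_PER_BOOK_USD)

print(f"floor: ${floor:,}")
print(f"total: ${total:,}")
```

Changing `OVERHEAD_PER_BOOK_USD` shows how sensitive the total is to the scanning guess; the purchase floor stays at $10 million either way.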

replies(1): >>45152288 #
1. riskable ◴[] No.45152288[source]
The real expense is in the data centers/hardware.

The cost of the books is negligible in comparison.

replies(1): >>45152990 #
2. Scoundreller ◴[] No.45152990[source]
Somewhere, a gritty warehouse in a developing country is receiving shipping containers of old books, with massive teams manually flipping pages as a 2nd hand Canon digicam takes a pic of each one, to be OCR’d by the same AI being trained.

Once the books are done, 99% of them go into the furnace at the district heating boiler next door. The other 1% go back to a developed country for resale.