Anthropic agrees to pay $1.5B to settle lawsuit with book authors

I think western companies will be just fine -- Anthropic is settling because they pirated books from LibGen back in 2021 and subsequently trained models on them. They realized internally that this was an issue and pivoted to buying books en masse and scanning them into digital formats, destroying the original copies in the process (they actually hired a former lead in the Google Books project to help them in this endeavor!). And a federal judge ruled a couple months ago that training on these legally-acquired scanned copies does constitute fair use -- that the LLM training process is sufficiently transformative.

So the data/copyright issue that you might be worried about is actually completely solved already! Anthropic is just paying a settlement here for the pirating that they did way in the past. Anthropic is allowed to train on books that they legally acquire.

And sure, Chinese AI companies could probably scrape from LibGen just like Anthropic did without getting in hot water, and potentially access a bit more data that way for cheap, but it doesn't seem like the buying/scanning process costs that much in the grand scheme of things. And Anthropic has likely already legally acquired most of the useful texts on LibGen and scanned them into its internal library anyway.

(Furthermore, the scanning setup might actually give Anthropic an advantage, as they're able to digitize more niche texts that might be hard to find outside of print form)