So the data/copyright issue that you might be worried about is actually completely solved already! Anthropic is just paying a settlement here for the illegal pirating that they did way in the past. Anthropic is allowed to train on books that they legally acquire.
And sure, Chinese AI companies could probably scrape from LibGen just like Anthropic did without getting in hot water, and potentially access a bit more data that way for cheap, but it doesn't really seem like the buying/scanning process really costs that much in the grand scheme of things. And Anthropic likely already has legally acquired most of the useful texts on LibGen and scanned them into its internal library anyways.
(Furthermore, the scanning setup might actually give Anthropic an advantage, as they're able to digitize more niche texts that might be hard to find outside of print form)
You mean does constitute fair use?