This weirdly seems like its the best mechanism to buy this much data.
Imagine going to 500k publishers to buy it individually. 3k per book is way cheaper. The copyright system is turning into a data marketplace in front of our eyes
replies(2):
Imagine going to 500k publishers to buy it individually. 3k per book is way cheaper. The copyright system is turning into a data marketplace in front of our eyes
The main cost of doing this would be the time - even if you bought up all the available scanning capacity it would probably take months. In the meantime your competition who just torrented everything would have more high-quality training data than you. There are probably also a fair number of books in libgen which are out of print and difficult to find used.