Book authors may see some settlement checks down the line. So might newspapers and other parties that can organize and throw enough $$$ at the problem. But I'll eat my hat if your average blogger ever sees a single cent.
Book authors may see some settlement checks down the line. So might newspapers and other parties that can organize and throw enough $$$ at the problem. But I'll eat my hat if your average blogger ever sees a single cent.
More broadly, I think that's a goofy argument. The books were "freely available" too. Just because something is out there, doesn't necessarily mean you can use it however you want, and that's the crux of the debate.
Other people have said that Anthropic bought the books later on, but I haven't found any official records for that. Where would I find that?
Also, does anyone know which Anthropic models were NOT trained on the pirated books. I want to avoid such models.
https://storage.courtlistener.com/recap/gov.uscourts.cand.43....
"Similarly, different sets or “subsets” or “parts of” or “portions” of the collections sourced from Books3, LibGen, and PiLiMi were used to train different LLMs..." Page 5
"In sum, the copies of books pirated or purchased-and-destructively-scanned were placed into a central “research library” or “generalized data area,” sets or subsets were copied again to create training copies for data mixes, the training copies were successively copied to be cleaned, tokenized, and compressed into any given trained LLM, and once trained an LLM did not output through Claude to the public any further copies." Page 7
The phrase "Finally, once Anthropic decided a copy of a pirated or scanned book in the library would not be used for training at all or ever again, Anthropic still retained that work as a “hard resource” for other uses or future uses" implies to me Anthropic excluded certain books from training, not that they excluded all the pirated books from training.