Anthropic cut up millions of used books, and downloaded 7M pirated ones – judge

There is another case where companies slurped up all of the internet and profited off the information, and it makes a good comparison: search engines.

Judges consider four factors when examining fair use[1]. For search engines:

1) The use is transformative, as a tool to find content has a very different purpose than the content itself.

2) The nature of the original works runs the full gamut, so search engines don't get points for only consuming factual data. However, it was all publicly viewable by anyone, as opposed to books, which require payment.

3) Search engines store significant portions of each work in the index, but only redistribute small portions.

4) Search engines, as originally devised, don't compete with the original; in fact they can improve the potential market for the original by helping more people find it. This has changed over time though: search engines are increasingly competing with the content they index, and intentionally trying to show the information that people want on the search page itself.

So traditional search, which was transformative, only republished small amounts of the originals, and didn't compete with the originals, fell firmly on the side of fair use.

Google News and Books, on the other hand, weren't so clear cut, as they were showing larger portions of the works and were competing with the originals. They had to make changes to those products as a result of lawsuits.

So now let's look at LLMs:

1) LLMs are absolutely transformative. Generating new text at a user's request is a very different purpose and character from the original works.

2) Again, the training data runs the full gamut (setting aside the clear copyright infringement of downloading illegally distributed books, which is a separate issue).

3) For training purposes, LLMs don't typically preserve entire works, so the model is in a better place legally than a search index, and there is precedent that even a search index storing entire works privately can be fair use depending on the other factors. For inference, even though LLMs are less likely than search engines to reproduce the originals in their outputs, there are failure cases where an LLM is over-trained on a work and a significant amount of the original can be reproduced.

4) LLMs have tons of uses, some of which complement the original works and some of which compete directly with them. Because of this, it is likely that whether LLMs are fair use will depend on how they are being used, e.g. ignore the LLM altogether and consider solely the output and whether it would be infringing if a human had created it.

This case was solely about whether training on books is fair use, and did not consider any uses of the LLM. Because LLMs are a very transformative use, and because they don't store the originals verbatim, training weighs strongly toward being fair use.

I think the real problems that LLMs face will be in factors 3 and 4, which are very much context-specific. The judge himself said that the plaintiffs are free to file additional lawsuits if they believe the LLM outputs duplicate the original works.

[1] https://fairuse.stanford.edu/overview/fair-use/four-factors/