And this is just the stuff you notice.
1) Verbatin copy is first-order plagiarism.
2a) Second-order plagiarism of written text would be replacing words with synonyms. Or taking a book paragraph by paragraph and for each one of them, rephrasing it in your own words. Yes, it might fool automated checkers but the structure would still be a copy of the original book. And most importantly, it would not contain any new information. No new positive-sum work was done. It would have no additional value.
Before LLMs almost nobody did this because the chance that it would help in a lawsuit vs the amount of work was not a good tradeoff. Now it is. But LLMs can do "better":
2b) A different kind of second-order plagiarism is using multiple sources and plagiarizing each of them only in part. Find multiple books on the same topic, take 1 chapter from each and order them in a coherent manner. Make it more granular. Find paragraphs or phrases which fit into the structure of your new book but are verbatim from other books. See how granular you can make it.
The trick here is that doing this by hand is more work than just writing your own book. So nobody did it and copyright law does not really address this well. But with LLMs, it can be automated. You can literally instruct an LLM to do this and it will do it cheaper than any human could. However, how LLMs work internally is yet different:
n) Higher-order plagiarism is taking multiple source books, identifying patterns, and then reproducing them in your "new" book.
If the patterns are sufficiently complex, nobody will ever be able to prove what specifically you did. What previously took creative human work now became a mechanical transformation of input data.
The point is this ability to detect and reproduce patterns is an impressive innovation but it's built on top of the work of hundreds of millions[0] of humans whose work was used without consent. The work done by those employed by the LLM companies is minuscule compared to that. Yet all of the reward goes to them.
Not to mention LLMs completely defear the purpose of (A)GPL. If you can take AGPL code and pass it through a sufficiently complex mechanical transformation that the output does the same thing but copyright no longer applies, then free software is dead. No more freedom to inspect and modify.
[0]: Github alone has 100 million users ( https://expandedramblings.com/index.php/github-statistics/ ) and we have reason to believe all of their data was used in training.