←back to thread

290 points nobody9999 | 1 comments | | HN request time: 0s | source
Show context
jawns ◴[] No.45187038[source]
I'm an author, and I've confirmed that 3 of my books are in the 500K dataset.

Thus, I stand to receive about $9,000 as a result of this settlement.

I think that's fair, considering that two of those books received advances under $20K and never earned out. Also, while I'm sure that Anthropic has benefited from training its models on this dataset, that doesn't necessarily mean that those models are a lasting asset.

replies(22): >>45187319 #>>45187366 #>>45187519 #>>45187839 #>>45188602 #>>45189683 #>>45189684 #>>45190184 #>>45190223 #>>45190237 #>>45190555 #>>45190731 #>>45191633 #>>45192016 #>>45192191 #>>45192348 #>>45192404 #>>45192630 #>>45193043 #>>45195516 #>>45201246 #>>45218895 #
visarga ◴[] No.45187519[source]
How is it fair? Do you expect 9,000 from Google, Meta, OpenAI, and everyone else? Were your books imitated by AI?

Infringement was supposed to imply substantial similarity. Now it is supposed to mean statistical similarity?

replies(4): >>45187577 #>>45187677 #>>45187811 #>>45187853 #
gruez ◴[] No.45187577[source]
>Were your books imitated by AI?

Given that books can be imitated by humans with no compensation, this isn't as strong as an argument as you think. Moreover AFAIK the training itself has been ruled legal, so Anthropic could have theoretically bought the book for $20 (or whatever) and be in the clear, which would obviously bring less revenue than the $9k settlement.

replies(2): >>45187621 #>>45188044 #
visarga ◴[] No.45187621[source]
Copyright should be about copying rights, not statistical similarities. Similarity vs causal link - a different standard all together.
replies(3): >>45187751 #>>45187806 #>>45187851 #
Retric ◴[] No.45187751[source]
The entire purpose of training materials is to copy aspects of them. That’s the causal link.
replies(2): >>45187830 #>>45193880 #
Dylan16807 ◴[] No.45187830[source]
The aspect it's supposed to copy is the statistics of how words work.

And in general, when an LLM is able to recreate text that's a training error. Recreating text is not the purpose. Which is not to excuse it happening, but the distinction matters.

replies(1): >>45188144 #
program_whiz ◴[] No.45188144[source]
In training, the model is trained to predict the exact sequence of words of a text. In other words, it is reproducing the text repeatedly for its own trainings. The by-product of this training is that it influences model weights to make the text more likely to be produced by the model -- that is its explicit goal. A perfect model would be able to reproduce the text perfectly (0 loss).

Real-world absurd example: A company hires a bunch of workers. They then give them access to millions of books and have the workers reading the books all day. The workers copy the books word by word, but after each word try to guess the next word that will appear. Eventually, they collectively become quite good at guessing the next word given a prompt text, even reproducing large swaths of text almost verbatim. The owner of company Y claims they owe nothing to the book owners, because it doesn't count as reading the book, and any reproduction is "coincidental" (even though this is the explicit task of the readers). They then use these workers to produce works to compete with the authors of the books, which they never paid for.

It seems many people feel this is "fair use" when it happens on a computer, but would call it "stealing" if I pirated all the books of JK Rowling to train myself to be a better mimicker of her style. If you feel this is still fair use, then you should agree all books should be free to everyone (as well as art, code, music, and any other training material).

replies(2): >>45188179 #>>45188571 #
gruez ◴[] No.45188179[source]
>but would call it "stealing" if I pirated all the books of JK Rowling to train myself to be a better mimicker of her style

Can you provide an example of someone being successfully sued for "mimicking style", presumably in the US judicial system?

replies(3): >>45188472 #>>45188557 #>>45189798 #
1. Retric ◴[] No.45188557{3}[source]
Style in an ambiguous term here as it doesn’t directly map to what’s being considered. The case between “Blurred Lines” and “Got to Give It Up” is often considered one of style and the Court of Appeals for the Ninth Circuit upheld copyright infringement.

However, AI has been show to copy a lot more than what people consider style.