452 points croes | 24 comments
Workaccount2 ◴[] No.43963737[source]
I have yet to see someone explain in detail how transformer model training works (showing they understand the technical nitty gritty and the overall architecture of transformers) and also lay out a case for why it is clearly a violation of copyright.

You can find lots of people talking about training, and you can find lots (way more) of people talking about AI training being a violation of copyright, but you can't find anyone talking about both.

Edit: Let me just clarify that I am talking about training, not inference (output).

replies(10): >>43963777 #>>43963792 #>>43963801 #>>43963816 #>>43963830 #>>43963874 #>>43963886 #>>43963955 #>>43964102 #>>43965360 #
1. jfengel ◴[] No.43963816[source]
I'm not sure I understand your question. It's reasonably clear that transformers get caught reproducing material that they have no right to. The kind of thing that would potentially result in a lawsuit if you did it by hand.

It's less clear whether taking vast amounts of copyrighted material and using it to generate other things rises to the level of copyright violation or not. It's the kind of thing that people would have prevented if it had occurred to them, by writing terms of use that explicitly forbid it. (Which probably means that the Web becomes a much smaller place.)

Your comment seems to suggest that writers and artists have absolutely no conceivable stake in products derived from their work, and that it's purely a misunderstanding on their part. But I'm both a computer scientist and an artist and I don't see how you could reach that conclusion. If my work is not relevant then leave it out.

replies(4): >>43963887 #>>43963911 #>>43964402 #>>43969383 #
2. gruez ◴[] No.43963887[source]
>I'm not sure I understand your question. It's reasonably clear that transformers get caught reproducing material that they have no right to. The kind of thing that would potentially result in a lawsuit if you did it by hand.

Is that a problem with the tool, or the person using it? A photocopier can copy an entire book verbatim. Should that be illegal? Or is it the problem that the "training" process can produce a model that has the ability to reproduce copyrighted work? If so, what implication does that hold for human learning? Many people can recite an entire song's lyrics from scratch, and reproducing an entire song's lyrics verbatim is probably enough to be considered copyright infringement. Does that mean the process of a human listening to music counts as copyright infringement?

replies(1): >>43964178 #
3. Workaccount2 ◴[] No.43963911[source]
My comment is about training models, not model inference.

Most artists can readily violate copyright; that doesn't mean we block them from seeing copyrighted work.

replies(1): >>43963993 #
4. gitremote ◴[] No.43963993[source]
The judgement was about model inference, not training.
replies(1): >>43964817 #
5. empath75 ◴[] No.43964178[source]
Let's start with a case that I think everyone agrees on.

If I were to take an image, and compress it or encrypt it, and then show you the data file, you would not be able to see the original copyrighted material anywhere in the data.

But if you had the right computer program, you could use it to regenerate the original image flawlessly.
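
A minimal sketch of that round trip, assuming Python's standard `zlib` module stands in for "the right computer program" and a synthetic byte string stands in for the image (both are illustrative choices, not anything from a real case):

```python
import zlib

# Hypothetical stand-in for the bytes of a copyrighted image file.
original = bytes(range(256)) * 64

# The "data file" someone would distribute: it does not equal the original bytes.
packed = zlib.compress(original)

# With the right program, the original comes back exactly.
restored = zlib.decompress(packed)

print("data file differs from original:", packed != original)
print("restored copy is identical:", restored == original)
```

The packed bytes look nothing like the source, yet the decompressor recovers every byte of it.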

I think most people would easily agree that distributing the encrypted file without permission is still a distribution of a copyrighted work and against the law.

What if you used _lossy_ compression, and could merely reproduce a poor quality jpeg of the original image? I think that is still copyright infringement, right?

Would it matter if you distributed it with an executable that only rendered the image non-deterministically? Maybe one out of 10 times? Or if the command to reproduce it was undocumented?

Okay, so now we have AI. We can ignore the algorithm entirely and how it works, because it's not relevant. There is a large amount of data that it operates on, the weights of the model and so on. You _can_ with the correct prompts, sometimes generate a copy of a copyrighted work, to some degree of fidelity or another.

I do not think it is meaningfully different from the simpler example, just with a lot of extra steps.

I think, legally, it's pretty clear that it is illegally distributing copyrighted material without permission. I think calling it an "ai" just needlessly anthropomorphizes everything. It's a computer program that distributes copyrighted work without permission. It doesn't matter if it's the primary purpose or not.

I think probably there needs to be some kind of new law to fix this situation, but under the current law as it exists, it seems to me to be clearly illegal.

replies(4): >>43964545 #>>43964933 #>>43965230 #>>43969413 #
6. tensor ◴[] No.43964402[source]
If I write a math book, and you read it, then tell someone about the math within it, you are not violating copyright. In fact, you could write your OWN math book, or history book, or whatever, and as long as you're not copying my actual text, you are not violating copyright.

However, when an LLM does the same, people now want it to be illegal. It seems pretty straightforward to apply existing copyright law to LLMs in the same way we apply it to humans. If the actual text they generate is so similar to a source material that it would constitute a copyright violation had a human produced it, then it should be illegal. Otherwise it should not.

edit: and in fact it's not even whether an LLM reproduces text, it's whether someone subsequently publishes that text. The person publishing that text should be the one taking on the legal hit.

replies(1): >>43965025 #
7. gruez ◴[] No.43964545{3}[source]
>Okay, so now we have AI. We can ignore the algorithm entirely and how it works, because it's not relevant. There is a large amount of data that it operates on, the weights of the model and so on. You _can_ with the correct prompts, sometimes generate a copy of a copyrighted work, to some degree of fidelity or another.

Suppose we accept all of the above. What does that hold for human learning?

replies(1): >>43965098 #
8. Workaccount2 ◴[] No.43964817{3}[source]
>"But making commercial use of vast troves of copyrighted works to produce expressive content"

This can only be referring to training; the models themselves are a rounding error in size compared to their training sets.

9. Workaccount2 ◴[] No.43964933{3}[source]
The crux of the debate is a motte-and-bailey.

AI is capable of reproducing copyrighted work (motte), therefore training on copyrighted work is illegal (bailey).

replies(2): >>43968228 #>>43969157 #
10. rrook ◴[] No.43965025[source]
That mathematical formulas already cannot be copyrighted makes this a kinda nonsense example?
replies(1): >>44055775 #
11. empath75 ◴[] No.43965098{4}[source]
If a human were to reproduce, from memory, a copyrighted work, that would be illegal as well, and multiple people have been sued over it, even when they did it unintentionally.

I'm not talking about learning. I'm talking about the complete reproduction of a copyrighted work. It doesn't matter how it happens.

replies(1): >>43965454 #
12. halkony ◴[] No.43965230{3}[source]
> I do not think it is meaningfully different from the simpler example, just with a lot of extra steps.

Those extra steps are meaningfully different. In your description, a casual observer could compare the two JPEGs and recognize the inferior copy. However, AI has become so advanced that such detection is becoming impossible. It is clearly voodoo.

13. gruez ◴[] No.43965454{5}[source]
>I'm not talking about learning. I'm talking about the complete reproduction of a copyrighted work. It doesn't matter how it happens.

In that case I don't think there's anything controversial here? Nobody thinks that if you ask AI to reproduce something verbatim, that you should get a pass because it's AI. All the controversy in this thread seems to be around the training process and whether that breaks copyright laws.

replies(2): >>43966775 #>>43969193 #
14. empath75 ◴[] No.43966775{6}[source]
No -- the controversy is also over whether distributing the weights and software is a copyright violation. I believe that it is. The copyrighted material is present in the software in some form, even if the process for regenerating it is quite convoluted.
replies(1): >>43967161 #
15. gruez ◴[] No.43967161{7}[source]
It's not as clear-cut as you think. The courts have held that both Google thumbnails and Google Books are fair use, even though they're far closer to verbatim copies than an AI model.
replies(1): >>43967616 #
16. const_cast ◴[] No.43967616{8}[source]
The reason those are allowed is because they don't compete with the source material. A thumbnail of a movie is never a substitute for a movie.

LLMs seek to be a for-profit replacement for a variety of paid sources. They say "hey, you can get the same thing as Service X for less money with us!"

That's a problem, regardless of how you go about it. It's probably fine if I watch a movie with my friends, who cares. But distributing it over the internet for free is a different issue.

replies(1): >>43967807 #
17. gruez ◴[] No.43967807{9}[source]
>The reason those are allowed is because they don't compete with the source material. A thumbnail of a movie is never a substitute for a movie.

>LLMs seek to be a for-profit replacement for a variety of paid sources. They say "hey, you can get the same thing as Service X for less money with us!"

What's an LLM supposed to be a substitute for? Are people using them to generate entire books or news articles, rather than buying a book or an issue of the new york times? Same goes for movies. No one is substituting marvel movies with sora video.

replies(1): >>43967863 #
18. const_cast ◴[] No.43967863{10}[source]
> Are people using them to generate entire books or news articles, rather than buying a book or an issue of the new york times?

Yes.

> No one is substituting marvel movies with sora video.

Yeah because sora kind of sucks. It's great technology, but turns out text is just a little bit easier to generate than 3D videos.

Once sora gets good, you bet your ass they will.

19. kevlened ◴[] No.43968228{4}[source]
This critique deserves more attention.

Humans are capable of reproducing copyrighted work illegally, but we allow them to train on copyrighted material legally.

Perhaps measures should be taken to prevent illegal reproduction, but if that's impossible, or too onerous, there should be utilitarian considerations.

Then the crux becomes a debate over utility, which often becomes a religious debate.

20. nickpsecurity ◴[] No.43969157{4}[source]
That's just the reproducing part. They also shared copies of scraped web sites, etc., without the authors' permission. Unauthorized copying has been widely known to be illegal for a long time. They've already broken the law before the training process even begins.
21. nickpsecurity ◴[] No.43969193{6}[source]
Whereas, my report showed they were breaking copyright before the training process. Meta was sued for what I said they'd be sued for, too.

Like Napster et al, their data sets make copies of hundreds of GB of copyrighted works without authors' permission. Ex: The Pile, Common Crawl, RefinedWeb, GitHub Pages. Many copyrighted works on the Internet also have strict terms of use. Some have copyright licenses that say personal use only or non-commercial use.

So, like many prior cases, just posting what isn't yours on Hugging Face is already infringement. Copying it from HF to your training cluster is also infringement. It's already illegal until we get laws like Singapore's that allow this use of copyrighted works. Even they have a weakness in the access requirement which might require following terms of use or licenses in the sources.

The only safe routes are public domain, permissive code, and explicit licenses from copyright holders (or those with sub-license permissions).

So, what do you think about the argument that making copies of copyrighted works violates copyright law? That these data sets are themselves copyright violations?

22. mr_toad ◴[] No.43969383[source]
> It's the kind of thing that people would have prevented if it had occurred to them, by writing terms of use that explicitly forbid it.

The AI companies will likely be arguing that they don’t need a license, so any terms of use in the license are irrelevant.

23. mr_toad ◴[] No.43969413{3}[source]
The model is not compressed data, it’s the compression algorithm. The prompt is compressed data. When you feed it a prompt it produces the uncompressed result (usually with some loss). This is not an analogy by the way, it’s a mathematical equivalence.

You can try and argue that a compression algorithm is some kind of copy of the training data, but that’s an untested legal theory.
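
A rough sketch of that split between "algorithm" and "compressed data", with one big caveat: this uses `zlib`'s preset-dictionary feature as a loose stand-in, not an actual language model, and it illustrates the idea rather than proving the equivalence claimed above. The names `shared_dictionary`, `compress_with` and `decompress_with` are hypothetical, chosen just for this example.

```python
import zlib

# Hypothetical stand-in for the "model": data both sides already hold.
shared_dictionary = b"the quick brown fox jumps over the lazy dog"

# The text we want to transmit.
message = b"the quick brown fox jumps over the lazy dog"


def compress_with(zdict: bytes, data: bytes) -> bytes:
    """Compress data using a preset dictionary known to both sides."""
    compressor = zlib.compressobj(zdict=zdict)
    return compressor.compress(data) + compressor.flush()


def decompress_with(zdict: bytes, blob: bytes) -> bytes:
    """Decompress a blob that was produced with the same preset dictionary."""
    decompressor = zlib.decompressobj(zdict=zdict)
    return decompressor.decompress(blob) + decompressor.flush()


with_dict = compress_with(shared_dictionary, message)  # plays the role of the "prompt"
without_dict = zlib.compress(message)                  # same text, no shared knowledge
roundtrip = decompress_with(shared_dictionary, with_dict)

print(len(with_dict), len(without_dict), roundtrip == message)
```

The more the decompressor side already "knows", the less the input has to carry, which is the sense in which the short input is the compressed data.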

24. tensor ◴[] No.44055775{3}[source]
Math textbooks have words in them. Do you think math books are just bunches of formulas?