Workaccount2 ◴[] No.43963737[source]
I have yet to see someone explain in detail how transformer model training works (showing they understand the technical nitty-gritty and the overall architecture of transformers) and also lay out a case for why it is clearly a violation of copyright.

You can find lots of people talking about training, and you can find lots (way more) of people talking about AI training being a violation of copyright, but you can't find anyone talking about both.

Edit: Let me just clarify that I am talking about training, not inference (output).

replies(10): >>43963777 #>>43963792 #>>43963801 #>>43963816 #>>43963830 #>>43963874 #>>43963886 #>>43963955 #>>43964102 #>>43965360 #
jfengel ◴[] No.43963816[source]
I'm not sure I understand your question. It's reasonably clear that transformers get caught reproducing material that they have no right to. The kind of thing that would potentially result in a lawsuit if you did it by hand.

It's less clear whether taking vast amounts of copyrighted material and using it to generate other things rises to the level of copyright violation or not. It's the kind of thing that people would have prevented if it had occurred to them, by writing terms of use that explicitly forbid it. (Which probably means that the Web becomes a much smaller place.)

Your comment seems to suggest that writers and artists have absolutely no conceivable stake in products derived from their work, and that any objection is purely a misunderstanding on their part. But I'm both a computer scientist and an artist, and I don't see how you could reach that conclusion. If my work is not relevant, then leave it out.

replies(4): >>43963887 #>>43963911 #>>43964402 #>>43969383 #
gruez ◴[] No.43963887[source]
>I'm not sure I understand your question. It's reasonably clear that transformers get caught reproducing material that they have no right to. The kind of thing that would potentially result in a lawsuit if you did it by hand.

Is that a problem with the tool, or with the person using it? A photocopier can copy an entire book verbatim. Should that be illegal? Or is the problem that the "training" process can produce a model that has the ability to reproduce copyrighted work? If so, what implication does that hold for human learning? Many people can recite an entire song's lyrics from memory, and reproducing an entire song's lyrics verbatim is probably enough to be considered copyright infringement. Does that mean the process of a human listening to music counts as copyright infringement?

replies(1): >>43964178 #
empath75 ◴[] No.43964178[source]
Let's start with a case that I think everyone agrees with.

If I were to take an image, compress or encrypt it, and then show you the resulting data file, you would not be able to see the original copyrighted material anywhere in the data.

But if you had the right computer program, you could use it to regenerate the original image flawlessly.

I think most people would easily agree that distributing the encrypted file without permission is still a distribution of a copyrighted work and against the law.
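
A minimal sketch of that first, lossless case (the filename is just illustrative, and any reversible transform would do):

    import zlib

    # Lossless round trip: the intermediate blob shows no recognizable image,
    # yet the original work is perfectly recoverable from it.
    with open("photo.png", "rb") as f:        # stand-in for any copyrighted image
        original = f.read()

    blob = zlib.compress(original)            # looks like noise; no pixels visible here
    assert zlib.decompress(blob) == original  # ...but it regenerates the work exactly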

What if you used _lossy_ compression instead, and could only reproduce a poor-quality JPEG of the original image? I think that's still copyright infringement, right?
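
Roughly the same idea in the lossy case, sketched here with Pillow (the quality setting is arbitrary):

    from PIL import Image

    # Lossy round trip: re-encode at very low JPEG quality. The result is
    # degraded, but it is still recognizably the same picture.
    img = Image.open("photo.png").convert("RGB")
    img.save("degraded.jpg", format="JPEG", quality=10)  # heavy, visible loss
    approx = Image.open("degraded.jpg")                   # a poor-quality copy of the work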

Would it matter if you distributed it with an executable that only rendered the image non-deterministically? Maybe one out of 10 times? Or if the command to reproduce it was undocumented?
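
The non-deterministic variant is only a few more lines (a toy sketch, continuing the zlib example above):

    import random
    import zlib

    def render(blob: bytes) -> bytes | None:
        # Only "works" roughly one time in ten; the copy is still in there.
        if random.random() < 0.1:
            return zlib.decompress(blob)
        return None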

Okay, so now we have AI. We can ignore the algorithm entirely and how it works, because it's not relevant. There is a large amount of data that it operates on: the weights of the model and so on. You _can_, with the correct prompts, sometimes generate a copy of a copyrighted work, to some degree of fidelity or another.
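
One way to make "some degree of fidelity" concrete is to measure how much of the original comes back verbatim. A toy sketch (the model call is hypothetical; only the overlap measure is real):

    import difflib

    def verbatim_overlap(model_output: str, original_work: str) -> float:
        # Fraction of the original reproduced in the longest matching run.
        m = difflib.SequenceMatcher(None, model_output, original_work)
        match = m.find_longest_match(0, len(model_output), 0, len(original_work))
        return match.size / max(len(original_work), 1)

    # output = some_model.generate("recite the opening of <work>")  # hypothetical call
    # print(verbatim_overlap(output, original_work))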

I do not think it is meaningfully different from the simpler example, just with a lot of extra steps.

I think, legally, it's pretty clear that it is illegally distributing copyrighted material without permission. I think calling it an "AI" just needlessly anthropomorphizes everything. It's a computer program that distributes copyrighted work without permission. It doesn't matter whether that's its primary purpose or not.

I think probably there needs to be some kind of new law to fix this situation, but under the current law as it exists, it seems to me to be clearly illegal.

replies(4): >>43964545 #>>43964933 #>>43965230 #>>43969413 #
1. mr_toad ◴[] No.43969413[source]
The model is not the compressed data; it's the compression algorithm. The prompt is the compressed data. When you feed it a prompt, it produces the uncompressed result (usually with some loss). This is not an analogy, by the way; it's a mathematical equivalence.
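
A toy illustration of the predictor-as-codec idea (the character model here is made up; a real arithmetic coder driven by an LLM's next-token probabilities works the same way in principle):

    import math
    from collections import Counter

    def ideal_code_length_bits(text: str, model_probs: dict) -> float:
        # Shannon: a coder driven by a predictive model spends -log2 p(symbol)
        # bits per symbol. The model defines the codec; the emitted bitstream
        # (here playing the role of the prompt) is the compressed data.
        return sum(-math.log2(model_probs.get(ch, 1e-9)) for ch in text)

    text = "the quick brown fox jumps over the lazy dog"
    model = {ch: n / len(text) for ch, n in Counter(text).items()}  # toy unigram "model"
    print(round(ideal_code_length_bits(text, model)), "bits under this model")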

You can try and argue that a compression algorithm is some kind of copy of the training data, but that’s an untested legal theory.