451 points croes | 38 comments
1. Workaccount2 ◴[] No.43963737[source]
I have yet to see someone explain in detail how transformer model training works (showing they understand the technical nitty gritty and the overall architecture of transformers) and also lay out a case for why it is clearly a violation of copyright.

You can find lots of people talking about training, and you can find lots (way more) of people talking about AI training being a violation of copyright, but you can't find anyone talking about both.

Edit: Let me just clarify that I am talking about training, not inference (output).

replies(10): >>43963777 #>>43963792 #>>43963801 #>>43963816 #>>43963830 #>>43963874 #>>43963886 #>>43963955 #>>43964102 #>>43965360 #
2. anhner ◴[] No.43963777[source]
because people who understand how training works also understand that it's not a violation of copyright...
3. autobodie ◴[] No.43963792[source]
I have yet to see someone explain in detail how writing the same words as another person works (showing they understand the technical nitty gritty and the overall architecture of the human mind) and also lay out a case for why it is clearly a violation of copyright. You can find lots of people talking about reading, and you can find lots (way more) of people talking about plagiarism being a violation of copyright, but you can't find anyone talking about both.
replies(1): >>43963965 #
4. jsiepkes ◴[] No.43963801[source]
This isn't about training AI on a book, but about AI companies never paying for the book at all. As in: they "downloaded the e-book from a warez site" and then used it for training.
replies(1): >>43964081 #
5. jfengel ◴[] No.43963816[source]
I'm not sure I understand your question. It's reasonably clear that transformers get caught reproducing material that they have no right to. The kind of thing that would potentially result in a lawsuit if you did it by hand.

It's less clear whether taking vast amounts of copyrighted material and using it to generate other things rises to the level of copyright violation or not. It's the kind of thing that people would have prevented if it had occurred to them, by writing terms of use that explicitly forbid it. (Which probably means that the Web becomes a much smaller place.)

Your comment seems to suggest that writers and artists have absolutely no conceivable stake in products derived from their work, and that it's purely a misunderstanding on their part. But I'm both a computer scientist and an artist and I don't see how you could reach that conclusion. If my work is not relevant then leave it out.

replies(4): >>43963887 #>>43963911 #>>43964402 #>>43969383 #
6. dmoy ◴[] No.43963830[source]
Not a ton of expert programmer + copyright lawyers, but I bet they're out there

You can probably find a good number of expert programmer + patent lawyers. And presumably some of those osmose enough copyright knowledge from their coworkers to give a knowledgeable answer.

At the end of the day though, the intersection of both doesn't matter. The lawyers win, so what really matters is who has the pulse on how the Fed Circuit will rule on this

Also in this specific case from the article, it's irrelevant?

7. nickpsecurity ◴[] No.43963874[source]
I did here with proofs of infringement:

https://gethisword.com/tech/exploringai/

8. belorn ◴[] No.43963886[source]
I would also like to see such an explanation, especially one that explains how it differs from the regular transforms found in video codecs. Why is lossy compression a clear violation of copyright, but not generative AI?
9. gruez ◴[] No.43963887[source]
>I'm not sure I understand your question. It's reasonably clear that transformers get caught reproducing material that they have no right to. The kind of thing that would potentially result in a lawsuit if you did it by hand.

Is that a problem with the tool, or the person using it? A photocopier can copy an entire book verbatim. Should that be illegal? Or is it the problem that the "training" process can produce a model that has the ability to reproduce copyrighted work? If so, what implication does that hold for human learning? Many people can recite an entire song's lyrics from scratch, and reproducing an entire song's lyrics verbatim is probably enough to be considered copyright infringement. Does that mean the process of a human listening to music counts as copyright infringement?

replies(1): >>43964178 #
10. Workaccount2 ◴[] No.43963911[source]
My comment is about training models, not model inference.

Most artists can readily violate copyright; that doesn't mean we block them from seeing copyrighted works.

replies(1): >>43963993 #
11. gitremote ◴[] No.43963955[source]
They never said model training is a violation of copyright. The ruling says model training on copyrighted material for analysis and research is NOT copyright infringement, but the commercial use of the resulting model is:

"When a model is deployed for purposes such as analysis or research… the outputs are unlikely to substitute for expressive works used in training. But making commercial use of vast troves of copyrighted works to produce expressive content that competes with them in existing markets, especially where this is accomplished through illegal access, goes beyond established fair use boundaries."

replies(1): >>43964872 #
12. xhkkffbf ◴[] No.43963965[source]
A big part of copyright law is protecting the market for the original creator. Not guaranteeing them anything. Just preventing someone else from coming along and copying someone else's work in a way that hurts their sales.

While AIs don't reproduce things verbatim like pirates, I can see how they really undermine the market, especially for non-fiction books. If people can get the facts without buying the original book, there's much less incentive for the original author to do the hard research and writing.

13. gitremote ◴[] No.43963993{3}[source]
The judgement was about model inference, not training.
replies(1): >>43964817 #
14. xhkkffbf ◴[] No.43964081[source]
This is what's most offensive about it. At least buy one friggin copy.
15. kranke155 ◴[] No.43964102[source]
It doesn’t matter how they work, it only matters what they do.
16. empath75 ◴[] No.43964178{3}[source]
Let's start with I think a case that everyone agrees with.

If I were to take an image, and compress it or encrypt it, and then show you the data file, you would not be able to see the original copyrighted material anywhere in the data.

But if you had the right computer program, you could use it to regenerate the original image flawlessly.

I think most people would easily agree that distributing the encrypted file without permission is still a distribution of a copyrighted work and against the law.

What if you used _lossy_ compression, and can merely reproduce a poor quality jpeg of the original image? I think still copyright infringement, right?

Would it matter if you distributed it with an executable that only rendered the image non-deterministically? Maybe one out of 10 times? Or if the command to reproduce it was undocumented?

Okay, so now we have AI. We can ignore the algorithm entirely and how it works, because it's not relevant. There is a large amount of data that it operates on, the weights of the model and so on. You _can_ with the correct prompts, sometimes generate a copy of a copyrighted work, to some degree of fidelity or another.

I do not think it is meaningfully different from the simpler example, just with a lot of extra steps.

I think, legally, it's pretty clear that it is illegally distributing copyrighted material without permission. I think calling it an "ai" just needlessly anthropomorphizes everything. It's a computer program that distributes copyrighted work without permission. It doesn't matter if it's the primary purpose or not.

I think probably there needs to be some kind of new law to fix this situation, but under the current law as it exists, it seems to me to be clearly illegal.
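The lossy step in this chain can be sketched in a few lines of Python (a toy quantizer, not any real image format; the sample values are invented): the stored representation contains no original value verbatim, yet it decodes to a recognizable approximation.

```python
# Toy lossy "codec": quantize samples, keep only the coarse representation,
# then decode to an approximate copy.

def encode(samples, step=16):
    """Lossy step: keep only which 'bucket' each sample falls in."""
    return [v // step for v in samples]

def decode(buckets, step=16):
    """Rebuild each sample as the midpoint of its bucket."""
    return [b * step + step // 2 for b in buckets]

original = [23, 25, 24, 120, 118, 119, 60, 61]  # stand-in for pixel values
restored = decode(encode(original))

# Not identical to the original, but every value lands within step/2 of it,
# which is why a viewer would still recognize the "work".
assert restored != original
assert max(abs(a - b) for a, b in zip(original, restored)) <= 8
```

The point of the sketch: nothing in `encode`'s output is a verbatim copy, yet distributing it alongside `decode` still hands the recipient a recognizable approximation of the original.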

replies(4): >>43964545 #>>43964933 #>>43965230 #>>43969413 #
17. tensor ◴[] No.43964402[source]
If I write a math book, and you read it, then tell someone about the math within it, you are not violating copyright. In fact, you could write your OWN math book, or history book, or whatever, and as long as you're not copying my actual text, you are not violating copyright.

However, when an LLM does the same, people now want it to be illegal. It seems pretty straightforward to apply existing copyright law to LLMs in the same way we apply it to humans. If the actual text they generate is so substantially similar to a source material that it would constitute a copyright violation had a human done it, then it should be illegal. Otherwise it should not.

edit: and in fact it's not even whether an LLM reproduces text, it's whether someone subsequently publishes that text. The person publishing that text should be the one taking on the legal hit.

replies(1): >>43965025 #
18. gruez ◴[] No.43964545{4}[source]
>Okay, so now we have AI. We can ignore the algorithm entirely and how it works, because it's not relevant. There is a large amount of data that it operates on, the weights of the model and so on. You _can_ with the correct prompts, sometimes generate a copy of a copyrighted work, to some degree of fidelity or another.

Suppose we accept all of the above. What does that hold for human learning?

replies(1): >>43965098 #
19. Workaccount2 ◴[] No.43964817{4}[source]
>"But making commercial use of vast troves of copyrighted works to produce expressive content"

This can only be referring to training; the models themselves are a rounding error in size compared to their training sets.

20. Workaccount2 ◴[] No.43964872[source]
The vast trove of copyrighted work has to refer to training. ChatGPT is likely on the order of 5-10TB in size. (Yes, terabytes.)

There are college kids with bigger "copyright collections" than that...

replies(1): >>43965192 #
21. Workaccount2 ◴[] No.43964933{4}[source]
The crux of the debate is a motte and bailey.

AI is capable of reproducing copyrighted works (motte), therefore training on copyrighted works is illegal (bailey).

replies(2): >>43968228 #>>43969157 #
22. rrook ◴[] No.43965025{3}[source]
That mathematical formulas already cannot be copyrighted makes this a kinda nonsense example?
23. empath75 ◴[] No.43965098{5}[source]
If a human were to reproduce, from memory, a copyrighted work, that would be illegal as well, and multiple people have been sued over it, even doing it unintentionally.

I'm not talking about learning. I'm talking about the complete reproduction of a copyrighted work. It doesn't matter how it happens.

replies(1): >>43965454 #
24. gitremote ◴[] No.43965192{3}[source]
No. The paragraph as a whole refers to the "outputs" of vast troves of copyrighted work.

Disk size is irrelevant. If you lossy-compress a copyrighted bitmap image to a small JPEG image and then sell the JPEG image, it's still copyright infringement.

replies(1): >>43969201 #
25. halkony ◴[] No.43965230{4}[source]
> I do not think it is meaningfully different from the simpler example, just with a lot of extra steps.

Those extra steps are meaningfully different. In your description, a casual observer could compare the two JPEGs and recognize the inferior copy. However, AI has become so advanced that such detection is becoming impossible. It is clearly voodoo.

26. moralestapia ◴[] No.43965360[source]
Because it's a machine that reproduces other people's work, which is copyrighted. Copyright protects the essence of an original work even after it's present in or turned into a derivative work.

Some try to make the argument of "but that's what humans do and it's allowed", but that's not a real argument, as it has not been proven, nor is it easy to prove, that machine learning equates to human reasoning. In the absence of evidence, the law assumes NO.

27. gruez ◴[] No.43965454{6}[source]
>I'm not talking about learning. I'm talking about the complete reproduction of a copyrighted work. It doesn't matter how it happens.

In that case I don't think there's anything controversial here? Nobody thinks that if you ask AI to reproduce something verbatim, that you should get a pass because it's AI. All the controversy in this thread seems to be around the training process and whether that breaks copyright laws.

replies(2): >>43966775 #>>43969193 #
28. empath75 ◴[] No.43966775{7}[source]
No -- the controversy is also over whether distributing the weights and software is a copyright violation. I believe it is. The copyrighted material is present in the software in some form, even if the process for regenerating it is quite convoluted.
replies(1): >>43967161 #
29. gruez ◴[] No.43967161{8}[source]
It's not as clear-cut as you think. The courts have held that both Google thumbnails and Google Books are fair use, even though they're far closer to verbatim copies than an AI model.
replies(1): >>43967616 #
30. const_cast ◴[] No.43967616{9}[source]
The reason those are allowed is because they don't compete with the source material. A thumbnail of a movie is never a substitute for a movie.

LLMs seek to be a for-profit replacement for a variety of paid sources. They say "hey, you can get the same thing as Service X for less money with us!"

That's a problem, regardless of how you go about it. It's probably fine if I watch a movie with my friends, who cares. But distributing it over the internet for free is a different issue.

replies(1): >>43967807 #
31. gruez ◴[] No.43967807{10}[source]
>The reason those are allowed is because they don't compete with the source material. A thumbnail of a movie is never a substitute for a movie.

>LLMs seek to be a for-profit replacement for a variety of paid sources. They say "hey, you can get the same thing as Service X for less money with us!"

What's an LLM supposed to be a substitute for? Are people using them to generate entire books or news articles, rather than buying a book or an issue of the New York Times? Same goes for movies. No one is substituting marvel movies with sora video.

replies(1): >>43967863 #
32. const_cast ◴[] No.43967863{11}[source]
> Are people using them to generate entire books or news articles, rather than buying a book or an issue of the new york times?

Yes.

> No one is substituting marvel movies with sora video.

Yeah because sora kind of sucks. It's great technology, but turns out text is just a little bit easier to generate than 3D videos.

Once sora gets good, you bet your ass they will.

33. kevlened ◴[] No.43968228{5}[source]
This critique deserves more attention.

Humans are capable of reproducing copyrighted works illegally, but we allow them to train on copyrighted material legally.

Perhaps measures should be taken to prevent illegal reproduction, but if that's impossible, or too onerous, there should be utilitarian considerations.

Then the crux becomes a debate over utility, which often becomes a religious debate.

34. nickpsecurity ◴[] No.43969157{5}[source]
That's just the reproducing part. They also shared copies of scraped web sites, etc without the authors' permission. Unauthorized copying has been widely known to be illegal for a long time. They've already broken the law before the training process even begins.
35. nickpsecurity ◴[] No.43969193{7}[source]
Whereas, my report showed they were breaking copyright before the training process. Meta was sued for what I said they'd be sued for, too.

Like Napster et al, their data sets make copies of hundreds of GB of copyrighted works without authors' permission. Ex: The Pile, Common Crawl, RefinedWeb, GitHub Pages. Many copyrighted works on the Internet also have strict terms of use. Some have copyright licenses that say personal use only or non-commercial use.

So, like many prior cases, just posting what isn't yours on Hugging Face is already infringement. Copying it from HF to your training cluster is also infringement. It's already illegal until we get laws like Singapore's that allow training on copyrighted works. Even they have a weakness in the access requirement, which might require following terms of use or licenses in the sources.

Only safe routes are public domain, permissive code, and explicit licenses from copyright holders (or those with sub-license permissions).

So, what do you think about the argument that making copies of copyrighted works violates copyright law? That these data sets are themselves copyright violations?

36. nickpsecurity ◴[] No.43969201{4}[source]
I won't say it's irrelevant. How much you use is part of fair use considerations. Their huge collections of copyrighted works make them look worse in legal analyses.
37. mr_toad ◴[] No.43969383[source]
> It's the kind of thing that people would have prevented if it had occurred to them, by writing terms of use that explicitly forbid it.

The AI companies will likely be arguing that they don’t need a license, so any terms of use in the license are irrelevant.

38. mr_toad ◴[] No.43969413{4}[source]
The model is not compressed data, it’s the compression algorithm. The prompt is compressed data. When you feed it a prompt it produces the uncompressed result (usually with some loss). This is not an analogy by the way, it’s a mathematical equivalence.

You can try and argue that a compression algorithm is some kind of copy of the training data, but that’s an untested legal theory.
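This model-as-codec framing can be made concrete with a toy sketch (a character-level lookup table, nowhere near a real LLM; the example text and function names are invented for illustration): the "model" here is purely a decoding procedure, and a 9-character seed acts as the compressed form of a much longer string.

```python
# Toy sketch: the "model" is a decoder; the short seed is the compressed data.

def train(text, order=9):
    """Map each length-`order` context to the character that followed it."""
    return {text[i:i + order]: text[i + order] for i in range(len(text) - order)}

def generate(model, seed, length, order=9):
    """Greedily extend the seed one character at a time using the model."""
    out = seed
    while len(out) < length and out[-order:] in model:
        out += model[out[-order:]]
    return out

text = "the model is the codec; the prompt is the key that unlocks the output"
model = train(text)                          # the "compression algorithm"
seed = text[:9]                              # the "prompt": just 9 characters
restored = generate(model, seed, len(text))
assert restored == text                      # seed + model rebuild the whole text
```

In this toy, every 9-character context in the training text is unique, so decoding is exact; real models are lossy and probabilistic, which is the "usually with some loss" caveat.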