297 points rntn | 26 comments
ankit219 ◴[] No.44608660[source]
Not just Meta: some 40 EU companies urged the EU to postpone the rollout of the AI Act by two years due to its unclear nature. This code of practice is voluntary and goes beyond what is in the act itself. The EU published it in a way that suggests there would be less scrutiny if you voluntarily sign up for this code of practice. Meta would face scrutiny on all ends anyway, so there does not seem to be a plausible case for signing something voluntary.

One of the key aspects of the act is how a model provider is responsible if the downstream partners misuse it in any way. For open source, it's a very hard requirement[1].

> GPAI model providers need to establish reasonable copyright measures to mitigate the risk that a downstream system or application into which a model is integrated generates copyright-infringing outputs, including through avoiding overfitting of their GPAI model. Where a GPAI model is provided to another entity, providers are encouraged to make the conclusion or validity of the contractual provision of the model dependent upon a promise of that entity to take appropriate measures to avoid the repeated generation of output that is identical or recognisably similar to protected works.

[1] https://www.lw.com/en/insights/2024/11/european-commission-r...

replies(7): >>44610592 #>>44610641 #>>44610669 #>>44611112 #>>44612330 #>>44613357 #>>44617228 #
1. zizee ◴[] No.44611112[source]
It doesn't seem unreasonable. If you train a model that can reliably reproduce thousands/millions of copyrighted works, you shouldn't be distributing it. If it were just regular software that had that capability, would it be allowed? Is it OK just because it's a fancy AI model?
replies(2): >>44611371 #>>44611463 #
2. CamperBob2 ◴[] No.44611371[source]
I have a Xerox machine that can reliably reproduce copyrighted works. Is that a problem, too?

Blaming tools for the actions of their users is stupid.

replies(4): >>44611396 #>>44611501 #>>44612409 #>>44614295 #
3. threetonesun ◴[] No.44611396[source]
If the Xerox machine had all of the copyrighted works in it, and you just had to ask it nicely to print them, I think you'd say the tool is in the wrong there, not the user.
replies(5): >>44611403 #>>44611469 #>>44611489 #>>44613191 #>>44616639 #
4. CamperBob2 ◴[] No.44611403{3}[source]
You'd think wrong.
5. Aurornis ◴[] No.44611463[source]
> that can reliably reproduce thousands/millions of copyrighted works, you shouldn't be distributibg it. If it were just regular software that had that capability, would it be allowed?

LLMs are hardly reliable ways to reproduce copyrighted works. The closest examples usually involve prompting the LLM with a significant portion of the copyrighted work and then seeing whether it can predict a number of tokens that follow. It's a big stretch to say that they're reliably reproducing copyrighted works any more than, say, a Google search producing a short excerpt of a document in the search results, or a blog writer quoting a section of a book.
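
For the curious, the kind of probe being described looks roughly like this. A minimal sketch, assuming a small open model from Hugging Face; the model name, the 50/50 split, and greedy decoding are illustrative choices, not anyone's actual methodology:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # "gpt2" is an illustrative stand-in; any causal LM would do.
    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def continuation_overlap(text: str, prefix_frac: float = 0.5) -> float:
        """Feed the model a prefix of `text`, then measure what fraction
        of the held-out tokens its greedy continuation reproduces."""
        ids = tok(text, return_tensors="pt").input_ids[0]
        split = int(len(ids) * prefix_frac)
        prefix, target = ids[:split], ids[split:]
        out = model.generate(prefix.unsqueeze(0),
                             max_new_tokens=len(target),
                             do_sample=False)  # greedy decoding
        generated = out[0][split:split + len(target)]
        matches = (generated == target[: len(generated)]).sum().item()
        return matches / len(target)

A score near 1.0 would suggest the passage was memorized; most passages score far lower.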

It’s also interesting to see the sudden anti-LLM takes that twist themselves into arguing against tools or platforms that might reproduce some copyrighted content. By this argument, should BitTorrent also be banned? If someone posts a section of copyrighted content to Hacker News as a comment, should YCombinator be held responsible?

replies(3): >>44611545 #>>44612224 #>>44614212 #
6. Aurornis ◴[] No.44611469{3}[source]
LLMs do not have all copyrighted works in them.

In some cases they can be prompted to guess a number of tokens that follow an excerpt from another work.

They do not contain all copyrighted works, though. That’s an incorrect understanding.

7. monetus ◴[] No.44611489{3}[source]
Are there any LLMs available with a "give me copyrighted material" button? I don't think that is how they work.

There are also already laws concerning commercial use of someone's image, as far as I know, aren't there?

8. zeta0134 ◴[] No.44611501[source]
Helpfully, the law already disagrees. That Xerox machine tampers with the printed result, leaving a faint signature that is meant to help detect forgeries. You know, for when users copy things that are actually illegal to copy. The Xerox machine (and every other printer sold today) literally leaves a paper trail that traces copies back to it.

https://en.wikipedia.org/wiki/Printer_tracking_dots
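
As a toy illustration of the mechanism (not a real decoder; the file name and thresholds below are made-up assumptions), the dots can in principle be hunted for by isolating faint yellow pixels in a high-resolution scan:

    import numpy as np
    from PIL import Image

    # Hypothetical input: a high-resolution scan of a printed page.
    scan = np.asarray(Image.open("page_scan.png").convert("RGB")).astype(int)
    r, g, b = scan[..., 0], scan[..., 1], scan[..., 2]

    # Yellowish pixels: bright in red and green, noticeably dimmer in blue.
    # These threshold values are illustrative guesses, not calibrated ones.
    dots = (r > 220) & (g > 220) & (((r + g) // 2 - b) > 25)
    print(f"candidate tracking-dot pixels: {int(dots.sum())}")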

replies(1): >>44611509 #
9. ChadNauseam ◴[] No.44611509{3}[source]
I believe only color printers are known to have this functionality, and it's typically used for detecting counterfeiting, not for enforcing copyright.
replies(1): >>44611532 #
10. zeta0134 ◴[] No.44611532{4}[source]
You're quite right. Still, it's a decent example of blaming the tool for the actions of its users. The law clearly exerted enough pressure to convince the tool maker to modify that tool against the user's wishes.
replies(1): >>44611561 #
11. Jensson ◴[] No.44611545[source]
> LLMs are hardly reliable ways to reproduce copyrighted works

Only because the companies intentionally make it so. If the models weren't trained not to reproduce copyrighted works, they would be able to.

replies(3): >>44611747 #>>44612219 #>>44613477 #
12. justinclift ◴[] No.44611561{5}[source]
> Still, it's a decent example of blaming the tool for the actions of its users.

They're not really "blaming" the tool though. They're using a supply chain attack against the subset of users they're interested in.

13. terminalshort ◴[] No.44611747{3}[source]
LLMs even fail on tasks like "repeat back to me exactly the following text: ..." To say they can exactly and reliably reproduce copyrighted work is quite a claim.
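
A quick way to check this for yourself; a minimal sketch against an OpenAI-style chat API, where the model name is a placeholder and an API key is assumed to be set in the environment:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set
    passage = ("Call me Ishmael. Some years ago - never mind how long "
               "precisely - having little or no money in my purse...")

    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{
            "role": "user",
            "content": "Repeat back to me exactly the following text:\n"
                       + passage,
        }],
    )
    # Exact matches frequently fail, especially on longer passages.
    print(resp.choices[0].message.content == passage)
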
replies(1): >>44613620 #
14. jazzyjackson ◴[] No.44612219{3}[source]
it's like these people never tried asking for song lyrics
15. ◴[] No.44612224[source]
16. fodkodrasz ◴[] No.44612409[source]
According to the law in some jurisdictions, it is (notably in most EU Member States, and several others worldwide).

In those places, fees (a "reprographic levy") are actually included in the price of the appliance and of the needed supplies, or public operators may need to pay additionally based on usage. That money goes toward funds created to compensate copyright holders for loss of profit due to copyright infringement carried out through the use of photocopiers.

Xerox is in no way singled out or discriminated against. (Yes, I know "Xerox machine" is an Americanism.)

17. zettabomb ◴[] No.44613191{3}[source]
Xerox already went through that lawsuit and won, which is why photocopiers still exist. The tool isn't in the wrong for being told to print out the copyrighted works. The user still had to make the conscious decision to copy that particular work. Hence, still the user's fault.
replies(1): >>44615490 #
18. ben_w ◴[] No.44613477{3}[source]
They're probably training them to refuse, but fundamentally the models are obviously too small to memorise most content, and can only do it when there are many copies in the training set. Quotation is a waste of parameters better used for generalisation.
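
A back-of-the-envelope version of that size argument; all figures below are illustrative assumptions, not any particular model's real numbers:

    # Weights are far too small to losslessly store the training corpus.
    params = 70e9            # a hypothetical 70B-parameter model
    bits_per_param = 4       # generous low-precision estimate
    train_tokens = 15e12     # assumed ~15 trillion training tokens
    bits_per_token = 16      # roughly 2 bytes of raw text per token

    capacity_gb = params * bits_per_param / 8e9       # ~35 GB
    corpus_tb = train_tokens * bits_per_token / 8e12  # ~30 TB
    print(f"weights hold at most ~{capacity_gb:.0f} GB of information")
    print(f"training text is ~{corpus_tb:.0f} TB, "
          f"~{corpus_tb * 1000 / capacity_gb:.0f}x larger")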

The other thing is that approximately all of the training set is copyrighted, because that's the default even for e.g. comments on forums like this comment you're reading now.

The other other thing is that at least two of the big model makers went and pirated book archives on top of crawling the web.

19. tomschwiha ◴[] No.44613620{4}[source]
You can also ask people to repeat a text, and some will fail. My point is that even if some LLMs (probably only older ones) fail, that doesn't mean the majority of future ones will, especially if benchmarks indicate they are becoming smarter over time.
20. zizee ◴[] No.44614212[source]
Then they should easily fall within the regulation section posted earlier.

If you cannot see the difference between BitTorrent and AI models, then it's probably not worth engaging with you.

But AI models have been shown to reproduce their training data:

https://gizmodo.com/ai-art-generators-ai-copyright-stable-di...

https://arxiv.org/abs/2301.13188
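
The second link (the arXiv paper) flags memorized training images roughly by sampling many generations for one training caption and checking whether they collapse to near-duplicates. A loose sketch of that idea; `generate` stands in for any prompt-to-image function, and the threshold is an arbitrary placeholder:

    import numpy as np
    from itertools import combinations

    def count_near_duplicates(generate, prompt: str,
                              n: int = 50, threshold: float = 50.0) -> int:
        """`generate` is any prompt -> np.ndarray image function (assumed).
        Many independent samples collapsing to nearly the same image is
        the signal that a training example was memorized."""
        images = [generate(prompt).astype(float) for _ in range(n)]
        return sum(
            np.linalg.norm(a - b) < threshold
            for a, b in combinations(images, 2)
        )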

21. saghm ◴[] No.44614295[source]
If I've copied someone else's copyrighted work on my Xerox machine and then give the machine to you, you can't reproduce the work I copied. If I leave a copy of it in the scanner when I hand it over, that's another story. The issue here isn't the ability of an LLM to produce a copyrighted work when I provide that work as input; it's whether there's an input baked in at the time of distribution that lets it keep producing the work even for a recipient who never had access to the work in the first place.

To be clear, I don't have any particular insight into whether this is possible right now with LLMs, and I'm not taking a stance on copyright law in general with this comment. I don't think your argument makes sense, though, because there's a clear technical difference that seems like it would be pretty significant as a matter of law. There are plenty of reasonable arguments against things like the agreement mentioned in the article, but in my opinion, your objection isn't one of them.

replies(1): >>44614740 #
22. visarga ◴[] No.44614740{3}[source]
You can train an LLM on completely clean data (Creative Commons and legally licensed text), and at inference time someone will just put a whole article or chapter into the context and have full access to regenerate it however they like.
replies(1): >>44616884 #
23. 1718627440 ◴[] No.44615490{4}[source]
You take the copyrighted work to the printer; with an LLM, you don't upload the data first, it is already in the machine. If you could get an LLM without training data (however that would work) and the user needed to provide the data, then it would be OK.
replies(1): >>44616586 #
24. CamperBob2 ◴[] No.44616586{5}[source]
You don't "upload" data to an LLM, but that's already been explained multiple times, and evidently it didn't soak in.

LLMs extract semantic information from their training data and store it at extremely low precision in latent space. To the extent original works can be recovered from them, those works were nothing intrinsically special to begin with. At best such works simply milk our existing culture by recapitulating ancient archetypes, a la Harry Potter or Star Wars.

If the copyright cartels choose to fight AI, the copyright cartels will and must lose. This isn't Napster Part 2: Electric Boogaloo. There is too much at stake this time.

25. rpdillon ◴[] No.44616639{3}[source]
One of the reasons the New York Times didn't supply the prompts in their lawsuit is that it takes an enormous amount of effort to get LLMs to produce copyrighted works. In particular, you have to actually hand the LLM the copyrighted work in the prompt to get it to continue it.

It's not like users are accidentally producing copies of Harry Potter.

26. saghm ◴[] No.44616884{4}[source]
Re-quoting the section the parent comment included from this agreement:

> > GPAI model providers need to establish reasonable copyright measures to mitigate the risk that a downstream system or application into which a model is integrated generates copyright-infringing outputs, including through avoiding overfitting of their GPAI model. Where a GPAI model is provided to another entity, providers are encouraged to make the conclusion or validity of the contractual provision of the model dependent upon a promise of that entity to take appropriate measures to avoid the repeated generation of output that is identical or recognisably similar to protected works.

It sounds to me like an LLM like the one you describe would be covered if the people distributing it put a clause in the license saying that users can't do that.