297 points by rntn | 8 comments
ankit219 No.44608660
Not just Meta: 40 EU companies urged the EU to postpone the rollout of the AI Act by two years because of its unclear nature. This code of practice is voluntary and goes beyond what is in the Act itself. The EU published it with the message that there would be less scrutiny for providers who voluntarily sign up. Meta would face scrutiny on all fronts anyway, so it does not seem plausible that it would sign something voluntary.

One of the key aspects of the Act is that a model provider is responsible if downstream partners misuse the model in any way. For open-source models, that is a very hard requirement to satisfy[1].

> GPAI model providers need to establish reasonable copyright measures to mitigate the risk that a downstream system or application into which a model is integrated generates copyright-infringing outputs, including through avoiding overfitting of their GPAI model. Where a GPAI model is provided to another entity, providers are encouraged to make the conclusion or validity of the contractual provision of the model dependent upon a promise of that entity to take appropriate measures to avoid the repeated generation of output that is identical or recognisably similar to protected works.
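
To make that concrete, here is a minimal sketch of the kind of output-side measure the quoted text gestures at (illustrative only; the code of practice prescribes no mechanism, and the names `ngrams`/`CopyrightFilter` are hypothetical): refuse completions that share a long verbatim n-gram with a registry of protected works.

    # Illustrative sketch of one possible "reasonable copyright measure":
    # reject model outputs containing a long verbatim n-gram from a registry
    # of protected works. Hypothetical; not what any provider actually ships.

    def ngrams(words, n):
        """All contiguous n-word sequences in a word list."""
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    class CopyrightFilter:
        def __init__(self, protected_texts, n=12):
            self.n = n
            self.index = set()
            for text in protected_texts:
                self.index |= ngrams(text.split(), n)

        def allows(self, output: str) -> bool:
            """False if the output shares any n-gram with a protected work."""
            return self.index.isdisjoint(ngrams(output.split(), self.n))

Even this toy version shows why the open-source case is hard: once the weights are released, the provider has nowhere to run such a filter and can only ask downstream deployers, contractually, to take equivalent measures.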

[1] https://www.lw.com/en/insights/2024/11/european-commission-r...

zizee No.44611112
It doesn't seem unreasonable. If you train a model that can reliably reproduce thousands or millions of copyrighted works, you shouldn't be distributing it. If it were just regular software that had that capability, would it be allowed? Is it OK just because it's a fancy AI model?
Aurornis No.44611463
> that can reliably reproduce thousands or millions of copyrighted works, you shouldn't be distributing it. If it were just regular software that had that capability, would it be allowed?

LLMs are hardly reliable ways to reproduce copyrighted works. The closest examples usually involve prompting the LLM with a significant portion of the copyrighted work and then seeing whether it can predict a number of tokens that follow. It’s a big stretch to say that they’re reliably reproducing copyrighted works any more than, say, a Google search producing a short excerpt of a document in the search results or a blog writer quoting a section of a book.

It’s also interesting to see the sudden anti-LLM takes that twist themselves into arguing against tools or platforms that might reproduce some copyrighted content. By this argument, should BitTorrent also be banned? If someone posts a section of copyrighted content to Hacker News as a comment, should Y Combinator be held responsible?
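
For what it's worth, that probing setup is simple to sketch (hypothetical code; `generate` stands in for any completion API): give the model the first N words of a text and measure how much of the true continuation comes back verbatim.

    # Sketch of the prefix-continuation probe described above. `generate`
    # is a stand-in for any LLM completion call (hypothetical, not a real API).

    from typing import Callable

    def verbatim_overlap(generate: Callable[[str], str], text: str,
                         prefix_words: int = 50) -> float:
        """Fraction of the held-out words the model reproduces exactly."""
        words = text.split()
        prefix = " ".join(words[:prefix_words])
        held_out = words[prefix_words:]
        completion = generate(prefix).split()
        matched = 0
        for expected, got in zip(held_out, completion):
            if expected != got:
                break
            matched += 1
        return matched / len(held_out) if held_out else 0.0

In published extraction results, high overlap shows up mainly for text that was duplicated many times in the training data, which is why "reliably" is doing a lot of work in the claim above.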

Jensson No.44611545
> LLMs are hardly reliable ways to reproduce copyrighted works

Only because the companies intentionally make it so. If the models weren't trained not to reproduce copyrighted works, they would be able to.

terminalshort No.44611747
LLMs even fail on tasks like "repeat back to me exactly the following text: ..." To say they can exactly and reliably reproduce copyrighted work is quite a claim.
jazzyjackson No.44612219
it's like these people never tried asking for song lyrics
ben_w No.44613477
They're probably training them to refuse, but fundamentally the models are too small to memorise most of their training content; they can usually only reproduce a work when many copies of it appear in the training set. Quotation is a waste of parameters better used for generalisation.
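
A rough sketch of the usual countermeasure (a generic, textbook approach; not any particular lab's pipeline): deduplicate the training corpus by hashed word shingles, so that no passage appears often enough to be memorised.

    # Generic near-duplicate detection via hashed word shingles (assumed,
    # textbook technique -- not any specific lab's actual pipeline).

    import hashlib

    def shingles(text: str, k: int = 8) -> set[str]:
        """Hashes of all k-word windows in the text."""
        words = text.lower().split()
        return {hashlib.sha1(" ".join(words[i:i + k]).encode()).hexdigest()
                for i in range(max(len(words) - k + 1, 1))}

    def near_duplicates(docs: list[str], threshold: float = 0.5):
        """Yield index pairs of documents with high shingle overlap (Jaccard)."""
        sigs = [shingles(d) for d in docs]
        for i in range(len(docs)):
            for j in range(i + 1, len(docs)):
                union = sigs[i] | sigs[j]
                if union and len(sigs[i] & sigs[j]) / len(union) >= threshold:
                    yield i, j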

The other thing is that approximately all of the training set is copyrighted, because that's the default even for e.g. comments on forums like this comment you're reading now.

The other other thing is that at least two of the big model makers went and pirated book archives on top of crawling the web.

tomschwiha No.44613620
You can also ask people to repeat a text, and some will fail. My point is that even if some LLMs (probably only the older ones) fail at this, it doesn't mean the majority of future ones will, especially if benchmarks indicate they are becoming smarter over time.
zizee No.44614212
Then they should easily fall within the regulation section posted earlier.

If you cannot see the difference between BitTorrent and AI models, then it's probably not worth engaging with you.

But AI models have been shown to reproduce their training data:

https://gizmodo.com/ai-art-generators-ai-copyright-stable-di...

https://arxiv.org/abs/2301.13188