
728 points freetonik | 28 comments
1. jedbrown ◴[] No.44980180[source]
Provenance matters. An LLM cannot certify a Developer Certificate of Origin (https://en.wikipedia.org/wiki/Developer_Certificate_of_Origi...) and a developer of integrity cannot certify the DCO for code emitted by an LLM, certainly not an LLM trained on code of unknown provenance. It is well-known that LLMs sometimes produce verbatim or near-verbatim copies of their training data, most of which cannot be used without attribution (and may have more onerous license requirements). It is also well-known that they don't "understand" semantics: they never make changes for the right reason.

We don't yet know how courts will rule on cases like Does v Github (https://githubcopilotlitigation.com/case-updates.html). LLM-based systems are not even capable of practicing clean-room design (https://en.wikipedia.org/wiki/Clean_room_design). For a maintainer to accept code generated by an LLM is to put the entire community at risk, as well as to endorse a power structure that mocks consent.

replies(5): >>44980234 #>>44980300 #>>44980455 #>>44982369 #>>44990599 #
2. jojobas ◴[] No.44980234[source]
There are only so many ways to code quite a few things. My classmate and I once got in trouble in high school for having identical code for one of the tasks at a coding competition, down to variable names and indentation. There is no way he could or would steal my code, and I sure didn't steal his.
3. raggi ◴[] No.44980300[source]
For a large LLM I think the science will, in the end, demonstrate that verbatim reproduction is not coming from verbatim recording, as the models in question really aren't structured that way.

This is similar to the ruling by Alsup in the Anthropic books case that the training is “exceedingly transformative”. I would expect a reinterpretation or disagreement on this front from another case to be both problematic and likely eventually overturned.

I don't actually think provenance is a problem on the axis you suggest if Alsup's ruling holds. That said, I don't think that's the only copyright issue afoot - the Copyright Office's writing on the copyrightability of machine outputs essentially requires that the output fail the Feist tests for human copyrightability.

More interesting to me is how this might realign the notion of copyrightability of human works as time goes on, moving from every trivial derivative bit of trash being potentially copyrightable toward some stronger notion, following the Feist test, of independence and creativity. It also raises a fairly immediate question in an open-source setting: do many individual small patch contributions even pass those tests themselves? They may well not, although the general guidance is to set the bar low - but does a typo fix? There is so far to go down this rabbit hole.

replies(4): >>44980456 #>>44980801 #>>44981672 #>>44982112 #
4. Borealid ◴[] No.44980455[source]
An LLM can be used for a clean room design so long as all (ALL) of its training data is in the clean room (and consequently does not contain the copyrighted work being reverse engineered).

An LLM trained on the Internet-at-large is also presumably suitable for a clean room design if it can be shown that its training completed prior to the existence of the work being duplicated, and thus could not have been contaminated.

This doesn't detract from the core of your point: LLM output may be copyright-contaminated by LLM training data. Yes - but that doesn't necessarily mean an LLM output cannot be a valid clean-room reverse-engineering result.

replies(1): >>44982092 #
5. j4coh ◴[] No.44980456[source]
So if you can get an LLM to produce music lyrics, for example, or sections from a book, those would be considered novel works given the encoding as well?
replies(2): >>44980611 #>>44980990 #
6. GCUMstlyHarmls ◴[] No.44980611{3}[source]
Depends if the music is represented by the RIAA or not :)
7. strogonoff ◴[] No.44980801[source]
In the West you are free to make something that everyone thinks is a “derivative piece of trash” and still call it yours; and sometimes it will turn out to be a hit because, well, it turns out that in real life no one can reliably tell what is and what isn’t trash[0]—if it was possible, art as we know it would not exist. Sometimes what is trash to you is a cult experimental track to me, because people are different.

On that note, I am not sure why creators in so many industries are sitting around while they are being more or less ripped off by massive corporations, when music has got it right.

— Do you want to make a cover song? Go ahead. You can even copyright it! The original composer still gets paid.

— Do you want to make a transformative derivative work (change the composition, really alter the style, edit the lyrics)? Go ahead, you'd just better damn well make sure you license it first. …and you can copyright your derivative work, too. …and the original composer still gets credit in your copyright.

The current wave of LLM-induced AI hype really has the tech crowd bending itself in knots: trying to paint this as an unsolvable problem that requires IP abuse, or as not a problem at all because it's mostly "derivative bits of trash" (at least the bits they don't like, anyway), arguing in courts that it's transformative, etc., while the most straightforward solution keeps staring them in the face. The only problem is that this solution does not scale - and if there's anything the industry in which "Do Things That Don't Scale" is the title of a hit essay hates, it's doing things that don't scale.

[0] It should be clarified that if art is considered (as I do) fundamentally a mechanism of self-expression then there is, of course, no trash and the whole point is moot.

replies(1): >>44981703 #
8. raggi ◴[] No.44980990{3}[source]
"an LLM" could imply an LLM of any size, for sufficiently small or focused training sets an LLM may not be transformative. There is some scale at which the volume and diversity of training data and intricacy of abstraction moves away from something you could reasonably consider solely memorization - there's a separate issue of reproduction though.

"novel" here depends on what you mean. Could an LLM produce output that is unique that both it and no one else has seen before, possibly yes. Could that output have perceived or emotional value to people, sure. Related challenge: Is a random encryption key generated by a csprng novel?

In the case of the US copyright office, if there wasn't sufficient human involvement in the production then the output is not copyrightable and how "novel" it is does not matter - but that doesn't necessarily impact a prior production by a human that is (whether a copy or not). Novel also only matters in a subset of the many fractured areas of copyright laws affecting the space of this form of digital replication. The copyright office wrote: https://www.copyright.gov/ai/Copyright-and-Artificial-Intell....

Where I imagine this approximately ends up is some set of tests oriented around how relevant the "copy" is to the whole. That is, it may not matter whether the method of production involved "copying"; what may matter more is whether the whole work in which it is included is at large a copy. And if the contested area could be replaced with something novel, and is a small enough piece of the whole, it may fail to meet some bar of material value to the whole to be relevant - no harmful infringement, or something that similarly crosses into some notion of fair use.

I don't see much sanity in a world where small snippets become an issue. I think if models were regularly producing thousands of tokens of exactly duplicated content, that's probably an issue.

I've not seen evidence of the latter outside of research that very deliberately searches for high-probability cases (such as building suffix-tree indices over training sets and then searching for outputs guided by the index). That's very different from arbitrary work prompts doing the same, and the models have various defensive trainings and wrappings attempting to further minimize reproductive behavior. On the one hand you have research metrics like 3.6 bits per parameter of recoverable input; on the other, that represents a very small slice of the training set, and many such reproductions require long, strongly crafted prompts - meaning that for arbitrary real-world interaction the chance of large-scale overlap is small.
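The detection approach alluded to above (index the training set, then search model outputs for long verbatim runs) can be approximated with a crude hash-based sketch. Everything here is illustrative, not any paper's actual method - real studies use suffix automata or suffix arrays over the actual training corpus:

```python
# Illustrative sketch: flag verbatim overlap between a model output and a
# training corpus by hashing every fixed-length character window of the
# corpus, then counting consecutive matching windows in the output.

def ngram_set(text: str, n: int = 50) -> set[int]:
    """Hash every n-character window of the corpus."""
    return {hash(text[i:i + n]) for i in range(len(text) - n + 1)}

def longest_verbatim_overlap(output: str, corpus_hashes: set[int],
                             n: int = 50) -> int:
    """Approximate length (in characters) of the longest run of `output`
    whose every n-char window also appears in the corpus."""
    best = run = 0
    for i in range(len(output) - n + 1):
        if hash(output[i:i + n]) in corpus_hashes:
            run += 1
            best = max(best, run)
        else:
            run = 0
    # a run of k consecutive matching windows covers n + k - 1 characters
    return 0 if best == 0 else n + best - 1

corpus = "def add(a, b):\n    return a + b\n" * 4
hashes = ngram_set(corpus, n=20)
print(longest_verbatim_overlap(corpus[:40], hashes, n=20))  # → 40
```

Note the asymmetry this makes concrete: with the index in hand, finding reproductions is cheap, but an arbitrary prompt has no such guidance - which is why deliberate extraction research and everyday usage give such different pictures.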

replies(1): >>44981233 #
9. j4coh ◴[] No.44981233{4}[source]
By novel, I mean: if I ask a model to write some lyrics or code and it produces pre-existing code or lyrics, is the output novel and legally safe to use because the pre-existing work isn't precisely encoded in a large enough model - and therefore legally not a reproduction, just coincidentally identical?
replies(1): >>44981379 #
10. raggi ◴[] No.44981379{5}[source]
No. I don't think "novelty" would be relevant in such a case. How much risk you carry depends on many factors, including what you mean by "use". If you mean sell, and you're successful, you're at risk - and that would be true even if your content is not identical but merely similar. Copyright provides little to no protection from legal costs if someone is motivated to bring a case against you.
11. snickerbockers ◴[] No.44981672[source]
I'd be fine with that if that was the way copyright law had been applied to humans for the last 30+ years, but it's not. Look into the OP's link on clean-room reverse engineering: I come from an RE background, and people are terrified of accidentally absorbing "tainted" information through extremely indirect means, because it can potentially be used against them in court.

I swear the ML community is able to rapidly change their mind as to whether "training" an AI is comparable to human cognition based on whichever one is beneficial to them at any given instant.

12. 0points ◴[] No.44981703{3}[source]
There's a whole genre of musicians focused solely on creating royalty-free covers of popular songs so the music can be used in suggestive ways while avoiding royalties.

It's not art. It's parasitism of art.

replies(2): >>44982176 #>>44982746 #
13. account42 ◴[] No.44982092[source]
> An LLM trained on the Internet-at-large is also presumably suitable for a clean room design if it can be shown that its training completed prior to the existence of the work being duplicated, and thus could not have been contaminated.

This assumes you are only concerned with one particular work. In reality you need to be sure you are not copying any work that might be copyrighted at all, unless you have a valid license that you are abiding by.

replies(1): >>44982617 #
14. camgunz ◴[] No.44982112[source]
> For a large LLM I think the science in the end will demonstrate that verbatim reproduction is not coming from verbatim recording

We don't need all this (seemingly pretty good) analysis. We already know what everyone thinks: no relevant AI company has had their codebase or other IP scraped by AI bots they don't control, and there's no way they'd allow that to happen, because they don't want an AI bot they don't control to reproduce their IP without constraint. But they'll turn right around and be like, "for the sake of the future, we have to ingest all data... except no one can ingest our data, of course". :rolleyes:

15. strogonoff ◴[] No.44982176{4}[source]
> There's an whole genre of musicians focusing only on creating royalty free covers

There is no such thing as a “royalty free cover”. Either it is a full on faithful cover, which you can perform as long as license fees are paid, and in which case both the performer and the original songwriter get royalties, or it is a “transformative cover” which requires negotiation with the publisher/rights owner (and in that case IP ownership will probably be split between songwriter and performer depending on their agreement).

(Not an IP lawyer myself so someone can correct me.)

Furthermore, in countries where I know how it works, as a venue owner you pay the rights organization a fixed sum per month or year and you are good to go: play any track you want. It thus makes no difference to you whether you play the original or a cover.

Have you considered that these are simply singer-performers who like to sing and would like to earn a bit of money from it, but don't have many original songs of their own?

> It's parasitism of art

If we assume covers are parasitism of art, by that logic would your comment, which is very similar to dozens I have seen on this topic in recent months, be parasitism of discourse?

Jokes aside, a significant number of covers I have heard at cafes over years are actually quite decent, and I would certainly not call that parasitic in any way.

Even pretending they were, if you compare between artists specialising in covers and big tech trying to expropriate IP, insert itself as a middleman and arbiter for information access, devalue art for profit, etc., I am not sure they are even close in terms of the scale of parasitism.

replies(1): >>44982751 #
16. Aeolun ◴[] No.44982369[source]
Or you know, they just feel like code should be free. Like beer should be free.

We didn't have this whole issue 20 years ago because nobody gave a shit. If your code was public, and on the internet, it was free for everyone to use by definition.

17. Borealid ◴[] No.44982617{3}[source]
The "clean room" in "clean room reverse engineering" refers to a particular set of trade secrets, yes. You could have a clean room and still infringe if an employee in the room copied any work they had ever seen.

The clean room has to do with licenses and trade secrets, not copyright.

18. withinboredom ◴[] No.44982746{4}[source]
There are several sides to music copyright:

1. The lyrics

2. The composition

3. The recording

These can all be owned by different people or by the same person. The "royalty-free covers" you mention are people abusing the rights of one of those. They're not avoiding royalties; they just haven't been caught yet.

replies(1): >>44983639 #
19. 0points ◴[] No.44982751{5}[source]
> Have you considered that it is simply singers-performers who like to sing and would like to earn a bit of money from it, but don’t have many original songs if their own?

Or, maybe you start to pay attention?

They are selling their songs cheaper for TV, radio or ads.

> Even pretending they were, if you compare between artists specialising in covers and big tech trying to expropriate IP

They're literally working for spotify.

replies(1): >>44983459 #
20. strogonoff ◴[] No.44983459{6}[source]
> They are selling their songs cheaper for TV, radio or ads.

I guess that somehow refutes the points I made, I just can’t see how.

Radio stations, like the aforementioned venue owners, pay the rights organizations a flat annual fee. TV programs do need to license these songs (as, unlike a simple cover, that use is substantially transformative), but again: 1) it does not rip off songwriters - the holder of the songwriter rights for a song gets royalties for performances of its covers, and the songwriter has a say in any such licensing agreement; and 2) often a cover is a specifically considered and selected choice: it can fit a scene miles better than the original (just remember Motion Picture Soundtrack in that Westworld scene), and unlike the original it does not tend to make the scene all about itself. It feels like you have yet to demonstrate how this is particularly parasitic.

Edit: I mean honest covers; modifying a song a little and passing it off as original should be very suable by the rights holder, and I would be very surprised if Spotify decided to do that even if they fired their entire legal department and replaced it with one LLM chatbot.

replies(1): >>44984804 #
21. strogonoff ◴[] No.44983639{5}[source]
I believe performance of a cover still results in the relevant royalties being paid to the original songwriter, just without the performance fee, which does not strike me as a terrible ripoff (after all, a cover did take effort to arrange and perform).
replies(1): >>44984639 #
22. withinboredom ◴[] No.44984639{6}[source]
What this person is talking about is writing "tvinkle tvinkle ittle stawr" instead of the real lyrics (basically just writing the words phonetically and/or misspelled) to try to bypass the law through "technicalities" that wouldn't stand up in court.
replies(1): >>44984699 #
23. strogonoff ◴[] No.44984699{7}[source]
I doubt that, for a few reasons based on how they described this alleged parasitic activity, but mainly because the commenter alluded to Spotify doing it. It would be very surprising for Spotify to do something so blatantly illegal when they can keep extracting money by the truckload with their regular shady shenanigans that don't cross the legality line so obviously.

Regarding what you described, I don't think I've encountered it in the wild enough to remember. IANAL, but if not cleared/registered properly as a cover, it doesn't seem to be a workaround or abuse - it would probably be found straight-up illegal if the rights holder or relevant rights organization cared to sue. In that case, all I can say is "yes, some people do illegal stuff". The system largely works.

24. zvr ◴[] No.44984804{7}[source]
I know of restaurants and bars that choose to play cover versions of well-known songs because the costs are so much less.
replies(1): >>44984940 #
25. strogonoff ◴[] No.44984940{8}[source]
I really doubt you would ever license specific songs as a cafe business. You should be able to pay a fixed fee to a PRO and get a blanket license to play almost anything. Is it so expensive in the US, or do they perhaps not know this is an option? If the former, and those cover artists help those bars keep expenses low and offer you a better experience while charging less - working within the system, without ripping off the original artists, who still get paid their royalties - does that seem particularly parasitic?
replies(1): >>44994942 #
26. rovr138 ◴[] No.44990599[source]
This is how SQLite handles it:

> Contributed Code

> In order to keep SQLite completely free and unencumbered by copyright, the project does not accept patches. If you would like to suggest a change and you include a patch as a proof-of-concept, that would be great. However, please do not be offended if we rewrite your patch from scratch.

Source: https://www.sqlite.org/copyright.html

27. zvr ◴[] No.44994942{9}[source]
The example I was referring to was not in the US.

A restaurant / cafe may pay a fixed fee and get access to a specific catalog of songs (performances). The fee depends on what the catalog contains. As you can imagine, paying for the right to only play instrumental versions of songs (no singers, no lyrics) is significantly cheaper. Or, having performances of songs by unknown people.

replies(1): >>44995575 #
28. strogonoff ◴[] No.44995575{10}[source]
Two countries where I know how it works from a venue-owner perspective work this way. The fees seemed pretty mild; that's why I asked whether it's too expensive in your country (which I gather is not the US).