
728 points freetonik | 1 comments
jedbrown ◴[] No.44980180[source]
Provenance matters. An LLM cannot certify a Developer Certificate of Origin (https://en.wikipedia.org/wiki/Developer_Certificate_of_Origi...) and a developer of integrity cannot certify the DCO for code emitted by an LLM, certainly not an LLM trained on code of unknown provenance. It is well known that LLMs sometimes produce verbatim or near-verbatim copies of their training data, most of which cannot be used without attribution (and may have more onerous license requirements). It is also well known that they don't "understand" semantics: they never make changes for the right reason.

We don't yet know how courts will rule on cases like Doe v. GitHub (https://githubcopilotlitigation.com/case-updates.html). LLM-based systems are not even capable of practicing clean-room design (https://en.wikipedia.org/wiki/Clean_room_design). For a maintainer to accept code generated by an LLM is to put the entire community at risk, as well as to endorse a power structure that mocks consent.

replies(5): >>44980234 #>>44980300 #>>44980455 #>>44982369 #>>44990599 #
Borealid ◴[] No.44980455[source]
An LLM can be used for a clean room design so long as all (ALL) of its training data is in the clean room (and consequently does not contain the copyrighted work being reverse engineered).

An LLM trained on the Internet-at-large is also presumably suitable for a clean room design if it can be shown that its training completed prior to the existence of the work being duplicated, and thus could not have been contaminated.

This doesn't detract from the core of your point, that LLM output may be copyright-contaminated by LLM training data. Yes, but that doesn't necessarily mean that LLM output cannot be part of a valid clean-room reverse-engineering effort.

replies(1): >>44982092 #
account42 ◴[] No.44982092[source]
> An LLM trained on the Internet-at-large is also presumably suitable for a clean room design if it can be shown that its training completed prior to the existence of the work being duplicated, and thus could not have been contaminated.

This assumes that you are only concerned with one particular work, when in fact you need to be sure that you are not copying any copyrighted work at all unless you hold a valid license for it and are abiding by that license.

replies(1): >>44982617 #
1. Borealid ◴[] No.44982617[source]
The "clean room" in "clean room reverse engineering" refers to a particular set of trade secrets, yes. You could have a clean room and still infringe if an employee in the room copied any work they had ever seen.

The clean room has to do with licenses and trade secrets, not copyright.