AI tooling must be disclosed for contributions

(github.com)

728 points freetonik | 1 comments | 21 Aug 25 18:49 UTC

jedbrown ◴[22 Aug 25 01:30 UTC] No.44980180[source]▶

Provenance matters. An LLM cannot certify a Developer Certificate of Origin (https://en.wikipedia.org/wiki/Developer_Certificate_of_Origi...) and a developer of integrity cannot certify the DCO for code emitted by an LLM, certainly not an LLM trained on code of unknown provenance. It is well-known that LLMs sometimes produce verbatim or near-verbatim copies of their training data, most of which cannot be used without attribution (and may have more onerous license requirements). It is also well-known that they don't "understand" semantics: they never make changes for the right reason.

We don't yet know how courts will rule on cases like Doe v. GitHub (https://githubcopilotlitigation.com/case-updates.html). LLM-based systems are not even capable of practicing clean-room design (https://en.wikipedia.org/wiki/Clean_room_design). For a maintainer to accept code generated by an LLM is to put the entire community at risk, as well as to endorse a power structure that mocks consent.

replies(5): >>44980234 #>>44980300 #>>44980455 #>>44982369 #>>44990599 #

raggi ◴[22 Aug 25 01:57 UTC] No.44980300[source]▶

>>44980180 #

For a large LLM, I think the science will in the end demonstrate that verbatim reproduction is not coming from verbatim recording, as the structure really isn’t set up that way in the models in question here.

This is similar to the ruling by Alsup in the Anthropic books case that the training is “exceedingly transformative”. I would expect a reinterpretation or disagreement on this front from another case to be both problematic and likely eventually overturned.

I don’t actually think provenance is a problem on the axis you suggest if Alsup’s ruling holds. That said, I don’t think that’s the only copyright issue afoot: the Copyright Office’s writing on the copyrightability of outputs from the machine essentially requires that the output fails the Feist tests for human copyrightability.

More interesting to me is how this might further realign the notion of copyrightability of human works as time goes on, moving from every trivial derivative bit of trash potentially being copyrightable to some stronger notion of, to follow the Feist test, independence and creativity. Further, it raises a fairly immediate question in an open source setting: do many individual small patch contributions even pass those tests? They may well not, although the general guidance is to set the bar low. But does a typo fix pass either? There is so far to go down this rabbit hole.

replies(4): >>44980456 #>>44980801 #>>44981672 #>>44982112 #

1. snickerbockers ◴[22 Aug 25 06:51 UTC] No.44981672[source]▶

>>44980300 #

I'd be fine with that if that was the way copyright law had been applied to humans for the last 30+ years, but it's not. Look into the OP's link on clean room reverse engineering. I come from an RE background, and people are terrified of accidentally absorbing "tainted" information through extremely indirect means because it can potentially be used against them in court.

I swear the ML community is able to rapidly change their mind as to whether "training" an AI is comparable to human cognition based on whichever one is beneficial to them at any given instant.
