728 points by freetonik | 1 comment
neilv ◴[] No.44976959[source]
There is also IP taint when using "AI". We're just pretending that there's not.

If someone came to you and said "good news: I memorized the code of all the open source projects in this space, and can regurgitate it on command", you would be smart to ban them from working on code at your company.

But with "AI", we make up a bunch of rationalizations. ("I'm doing AI agentic generative AI workflow boilerplate 10x gettin it done AI did I say AI yet!")

And we pretend the person never said that they're just loosely laundering GPL and other code in a way that rightly would be existentially toxic to an IP-based company.

ineedasername ◴[] No.44977317[source]
Courts (at least in the US) have already ruled that use of ingested data for training is transformative. There are lots of details still to work out, but the genie is out of the bottle.

Sure, it's a big hill to climb to rethink IP law so that generating IP remains a viable economic work product, but that is what's necessary.

alfalfasprout ◴[] No.44977525[source]
> Courts (at least in the US) have already ruled that use of ingested data for training is transformative

This is far from settled law. Let's not mischaracterize it.

Even so, an AI regurgitating proprietary code that's licensed in some other way is a very real risk.

popalchemist ◴[] No.44977685[source]
No more so than regurgitating an entire book. While it could technically be possible in the case of certain repos that are ubiquitous on the internet (and therefore overrepresented in training data to the point that they are "regurgitated" verbatim, in whole), it is extremely unlikely and would only occur after deliberate prompting. Discovery in the NYT suit against OpenAI showed that the NYT was only able to get partial reproductions by deliberately prompting the model with portions of the very text it was trying to force the model to regurgitate.
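For the curious, that style of probe is easy to picture in code: hand the model a verbatim prefix of a known passage and measure how much of the true continuation comes back word for word. A rough sketch in Python, where complete() and the local text file are stand-ins, not any real API:

    def verbatim_overlap(reference: str, completion: str) -> float:
        """Fraction of the reference continuation reproduced word for word."""
        ref = reference.split()
        out = completion.split()
        hits = sum(1 for r, o in zip(ref, out) if r == o)
        return hits / len(ref) if ref else 0.0

    # Hypothetical probe: prompt with the first half of a known passage,
    # then score the model's output against the real second half.
    passage = open("known_text.txt").read()  # assumed local copy of the text
    half = len(passage) // 2
    prompt, continuation = passage[:half], passage[half:]
    output = complete(prompt)                # complete() is a stand-in, not a real API
    print(f"verbatim overlap: {verbatim_overlap(continuation, output):.0%}")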

So: yes, technically possible, but effectively impossible by accident. Furthermore, when you make this argument you reveal that you don't understand how these models work. They do not simply compress all the data they were trained on into a tiny storable version. They are effectively stacks of weight matrices that allow math to be done to predict the most likely next token (read: a few characters of text) given some input.

So the model does not "contain" code. It "contains" a way of doing calculations for predicting what text comes next.
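If that's abstract, here is the whole idea in miniature: an invented five-word vocabulary and random weights standing in for a trained model, with one matrix multiply picking the most likely next token. None of these numbers mean anything; it's just the shape of the computation:

    import numpy as np

    # Toy illustration only: a hidden state times an output weight matrix
    # scores every token in a made-up five-word vocabulary.
    vocab = ["def", "return", "x", "+", "1"]
    hidden = np.array([0.2, -0.1, 0.7])                   # state after reading the prompt
    W_out = np.random.default_rng(0).normal(size=(3, 5))  # invented weights

    logits = hidden @ W_out                        # one matrix multiply -> scores
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax -> probabilities
    print(vocab[int(np.argmax(probs))])            # most likely next token

A real model repeats that step billions of times with learned weights, but the principle is the same: calculation, not lookup.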

Finally, let's say the model does spit out not an entire work, but a handful of lines of code that appear in some codebase.

This does not constitute copyright infringement, because the lines in question a) represent a tiny portion of the whole work (and copyright only protects against duplication of whole works or significant portions of a work), and b) there are a limited number of ways to accomplish a certain function, so it is not only possible but inevitable that two devs working independently could arrive at the same implementation. Therefore using an identical implementation of a small part of a work (which is what this case would be) is no more illegal than using a certain chord progression, melodic phrasing, or drum rhythm. Courts have ruled on this thoroughly.
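To make the "limited number of ways" point concrete: ask any two devs for a clamp function in Python and you will very likely get this exact code back, with zero copying involved:

    # Two developers writing a clamp independently will very likely converge
    # on exactly this, because the idiom admits few variations.
    def clamp(value, lo, hi):
        return max(lo, min(value, hi))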

aspenmayer ◴[] No.44981038[source]
> No more so than regurgitating an entire book.

Like this?

Meta's Llama 3.1 can recall 42 percent of the first Harry Potter book - https://news.ycombinator.com/context?id=44972296 - 67 days ago (313 comments)

popalchemist ◴[] No.44992990[source]
Yes, that is one of those works that is over-represented in the training data, as I explained in the part of the comment you clearly did not comprehend.
aspenmayer ◴[] No.44993001[source]
> you clearly did not comprehend

I comprehend it just fine; I was adding context for those who may not.