549 points by thecr0w | 125 comments
thuttinger ◴[] No.46184466[source]
Claude/LLMs in general are still pretty bad at the intricate details of layouts and visual things. There are a lot of problems that are easy to get right for a junior web dev but impossible for an LLM. On the other hand, I was able to write a C program that added gamma color profile support to Linux compositors that don't support it (in my case Hyprland) within a few minutes! A - for me - seemingly hard task, which would have taken me at least a day or more if I hadn't let Claude write the code. With one prompt Claude generated C code that compiled on the first try and that:

- Read an .icc file from disk

- parsed the file and extracted the VCGT (video card gamma table)

- wrote the VCGT to the video card for a specified display via amdgpu driver APIs

The only thing I had to fix was the ICC parsing, where it would parse header strings in the wrong byte-order (they are big-endian).
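
For context on the byte-order fix: ICC files store every multi-byte field big-endian, so the header and tag table have to be decoded byte by byte. A minimal illustrative sketch of that part (not the code Claude generated, just the general shape of locating the 'vcgt' tag):

    #include <stdint.h>
    #include <string.h>

    /* ICC files are big-endian; decode a 32-bit field byte by byte. */
    static uint32_t read_be32(const uint8_t *p) {
        return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
               ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
    }

    /* Return the file offset of the 'vcgt' tag data, or 0 if absent.
       buf holds the whole profile, len is its size in bytes. */
    static uint32_t find_vcgt(const uint8_t *buf, size_t len) {
        if (len < 132) return 0;                 /* 128-byte header + tag count */
        uint32_t tag_count = read_be32(buf + 128);
        for (uint32_t i = 0; i < tag_count; i++) {
            const uint8_t *entry = buf + 132 + (size_t)i * 12;  /* sig, offset, size */
            if ((size_t)(entry - buf) + 12 > len) return 0;
            if (memcmp(entry, "vcgt", 4) == 0)
                return read_be32(entry + 4);     /* offset of the gamma table */
        }
        return 0;
    }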

replies(3): >>46184840 #>>46185379 #>>46185476 #
1. jacquesm ◴[] No.46185379[source]
Claude didn't write that code. Someone else did and Claude took that code without credit to the original author(s), adapted it to your use case and then presented it as its own creation to you and you accepted this. If a human did this we probably would have a word for them.
replies(16): >>46185404 #>>46185408 #>>46185442 #>>46185473 #>>46185478 #>>46185791 #>>46185885 #>>46185911 #>>46186086 #>>46186326 #>>46186420 #>>46186759 #>>46187004 #>>46187058 #>>46187235 #>>46188771 #
2. Mtinie ◴[] No.46185404[source]
> If a human did this we probably would have a word for them.

I don’t think it’s fair to call someone an asshole for using Stack Overflow to find a similar answer with code samples to copy into their project.

replies(3): >>46185427 #>>46185437 #>>46185517 #
3. giancarlostoro ◴[] No.46185408[source]
You mean like copying and pasting code from Stack Overflow?
4. jacquesm ◴[] No.46185427[source]
Who brought Stack Overflow up? Stack Overflow does not magically generate code, someone has to actually provide it first.
replies(1): >>46185532 #
5. sublinear ◴[] No.46185437[source]
Using Stack Overflow recklessly is definitely asshole behavior.
replies(1): >>46187257 #
6. idiotsecant ◴[] No.46185442[source]
Yes, the word for that is software developer.
7. FanaHOVA ◴[] No.46185473[source]
Are you saying that every piece of code you have ever written contains a full source list of every piece of code you previously read to learn specific languages, patterns, etc?

Or are you saying that every piece of code you ever wrote was 100% original and not adapted from any previous codebase you ever worked in or any book / reference you ever read?

replies(2): >>46185606 #>>46188678 #
8. bsaul ◴[] No.46185478[source]
That's an interesting hypothesis: that LLMs are fundamentally unable to produce original code.

Do you have papers to back this up? That was also my reaction when I saw some really crazy accurate comments on some vibe-coded piece of code, but I couldn't prove it, and thinking about it now I think my intuition was wrong (i.e., LLMs do produce original complex code).

replies(7): >>46185592 #>>46185822 #>>46186708 #>>46187030 #>>46187456 #>>46188840 #>>46191020 #
9. bluedino ◴[] No.46185517[source]
It has been for the last 15 years.
10. Mtinie ◴[] No.46185532{3}[source]
I generally agree with your underlying point concerning attribution and intellectual property ownership, but your follow-up comment reframes your initial statement: LLMs generate recombinations of code created by humans, without giving credit.

Stack Overflow offers access to other people’s work, and developers combine those snippets and patterns into their own projects. I suspect attribution is low.

replies(1): >>46185601 #
11. jacquesm ◴[] No.46185592[source]
We can solve that question in an intuitive way: if human input is not what is driving the output, then it would be sufficient to present it with a fraction of the current inputs, say everything up to 1970, and have it generate all of the input data from 1970 onwards as output.

If that does not work, then the moment you introduce AI you cap its capabilities unless humans continue to create original works to feed the AI. The conclusion - to me, at least - is that these pieces of software regurgitate their inputs; they are effectively whitewashing plagiarism, or, alternatively, their ability to generate new content is capped by some arbitrary limit relative to the inputs.

replies(5): >>46185770 #>>46185916 #>>46185934 #>>46186728 #>>46188343 #
12. jacquesm ◴[] No.46185601{4}[source]
Stack Overflow deals with that issue by having a license agreement.
replies(2): >>46186720 #>>46187287 #
13. jacquesm ◴[] No.46185606[source]
What's with the bad takes in this thread. That's two strawmen in one comment, it's getting a bit crowded.
replies(1): >>46185838 #
14. andrepd ◴[] No.46185770{3}[source]
Excellent observation.
15. mlinsey ◴[] No.46185791[source]
Certainly if a human wrote code that solved this problem, and a second human copied and tweaked it slightly for their use case, we would have a word for them.

Would we use the same word if two different humans wrote code that solved two different problems, but one part of each problem was somewhat analogous to a different aspect of a third human's problem, and the third human took inspiration from those parts of both solutions to create code that solved a third problem?

What if it were ten different humans writing ten different-but-related pieces of code, and an eleventh human piecing them together? What if it were 1,000 different humans?

I think "plagiarism", "inspiration", and just "learning from" fall on some continuous spectrum. There are clear differences when you zoom out, but they are in degree, and it's hard to set a hard boundary. The key is just to make sure we have laws and norms that provide sufficient incentive for new ideas to continue to be created.

replies(6): >>46186125 #>>46186199 #>>46187063 #>>46188272 #>>46189797 #>>46194087 #
16. fpoling ◴[] No.46185822[source]
Pick up a book about programming from the seventies or eighties that was unlikely to be scanned and fed into an LLM. Take a task from it that even a student can solve within 10 minutes and ask the LLM to write a program for it. If the problem was not really published before, the LLM fails spectacularly.
replies(4): >>46185881 #>>46185976 #>>46186648 #>>46187999 #
17. DangitBobby ◴[] No.46185838{3}[source]
Or the original point doesn't actually hold up to basic scrutiny and is indistinguishable from straw itself.
replies(2): >>46186158 #>>46189150 #
18. crawshaw ◴[] No.46185881{3}[source]
This does not appear to be true. Six months ago I created a small programming language. I had LLMs write hundreds of small programs in the language, using the parser, interpreter, and my spec as a guide. The vast majority of these programs were either very close to or exactly what I wanted. No prior source existed for the programming language because I had created it from whole cloth days earlier.
replies(2): >>46186205 #>>46186214 #
19. ekropotin ◴[] No.46185885[source]
> If a human did this we probably would have a word for them.

What do you mean? The programmer's work is literally combining existing patterns into solutions for problems.

20. Aeolun ◴[] No.46185911[source]
Software engineer? You think I cite all the code I’ve ever seen before when I reproduce it? That I even remember where it comes from?
replies(1): >>46189125 #
21. andsoitis ◴[] No.46185934{3}[source]
I like your test. Should we also apply it to specific humans?

We all stand on the shoulders of giants and learn by looking at others’ solutions.

replies(2): >>46186146 #>>46186413 #
22. anjel ◴[] No.46185976{3}[source]
Sometimes it's generated, and many times it's not. Trivial to denote, but it's been deemed none of your business.
23. fooker ◴[] No.46186086[source]
> If a human did this we probably would have a word for them.

Humans do this all the time.

24. jacquesm ◴[] No.46186123{4}[source]
I think my track record belies your very low value and frankly cowardly comment. If you have something to say at least do it under your real username instead of a throwaway.
25. whatshisface ◴[] No.46186125[source]
The key difference between plagiarism and building on someone's work is whether you say, "this is based on code by linsey at github.com/socialnorms" or "here, let me write that for you."
replies(2): >>46186302 #>>46187094 #
26. jacquesm ◴[] No.46186146{4}[source]
That's true. But if we take your implied rebuttal, then current-level AI would be able to learn from current AI as well as it would learn from humans, just like humans learn from other humans. But so far that does not seem to be the case; in fact, AI companies do everything they can to avoid eating their own tail. They'd love eating their own tail if it were worth it.

To me that's proof positive that they know their output is mangled inputs: they need that originality, otherwise they will sooner or later drown in nonsense and noise. It's essentially a very complex game of Chinese whispers.

replies(2): >>46186385 #>>46187981 #
27. jacquesm ◴[] No.46186158{4}[source]
HN has guidelines for a reason.
replies(1): >>46186295 #
28. nextos ◴[] No.46186199[source]
In the case of LLMs, due to RAG, very often it's not just learning but almost direct real-time plagiarism from concrete sources.
replies(2): >>46186877 #>>46186903 #
29. jazzyjackson ◴[] No.46186205{4}[source]
Obviously you accidentally recreated a language from the 70s :P

(I created a template language for JSON and added branching and conditionals and realized I had a whole programming language. Really proud of my originality until I was reading Ted Nelson's Computer Lib/Dream Machines and found out I had reinvented TRAC, and to some extent, XSLT. Anyway, LLMs are very good at reasoning about it because it can be constrained by a JSON schema. People who think LLMs only regurgitate haven't given them a fair shot)

replies(1): >>46186346 #
30. fpoling ◴[] No.46186214{4}[source]
Languages with reasonable semantics are rather similar and LLMs are good at detecting that and adapting from other languages.
replies(1): >>46188712 #
31. incr_me ◴[] No.46186295{5}[source]
You're adhering to an excess of rules, methinks!
32. CognitiveLens ◴[] No.46186302{3}[source]
but as mlinsey suggests, what if it's influenced in small, indirect ways by 1000 different people, kind of like the way every 'original' idea from trained professionals is? There's a spectrum, and it's inaccurate to claim that Claude's responses are comparable to adapting one individual's work for another use case - that's not how LLMs operate on open-ended tasks, although they can be instructed to do that and produce reasonable-looking output.

Programmers are not expected to add an addendum to every file listing all the books, articles, and conversations they've had that have influenced the particular code solution. LLMs are trained on far more sources that influence their code suggestions, but it seems like we actually want a higher standard of attribution because they (arguably) are incapable of original thought.

replies(2): >>46186363 #>>46186951 #
33. zahlman ◴[] No.46186346{5}[source]
FWIW, I think a JSON-based XSLT-like thing sounds far more enjoyable to use than actual XSLT, so I'd encourage you to show it off.
34. sarchertech ◴[] No.46186363{4}[source]
If the problem you ask it to solve has only one or a few examples, or if there are many cases of people copy pasting the solution, LLMs can and will produce code that would be called plagiarism if a human did it.
replies(1): >>46186668 #
35. andsoitis ◴[] No.46186385{5}[source]
I share that perspective.
36. ◴[] No.46186413{4}[source]
37. nvllsvm ◴[] No.46186420[source]
> Someone else did

Who?

38. ahepp ◴[] No.46186648{3}[source]
You've done this? I would love to read more about it
39. _heimdall ◴[] No.46186708[source]
I have a very anecdotal, but interesting, counterexample.

I recently asked Gemini 3 Pro to create an RSS feed reader type of experience by using XSLT to style and layout an OPML file. I specifically wanted it to use a server-side proxy for CORS, pass through caching headers in the proxy to leverage standard HTTP caching, and I needed all feed entries for any feed in the OPML to be combined into a single chronological feed.

It initially told me multiple times that it wasn't possible (it also reminded me that Google is getting rid of XSLT). Regardless, after I reiterated multiple times that it is possible, it finally decided to make a temporary POC. That POC worked on the first try, with only one follow-up to standardize date formatting with support for Atom and RSS.

I obviously can't say the code was novel, though I would be a bit surprised if it had trained on that task enough to remember roughly the full implementation and still claim it was impossible.

replies(1): >>46186744 #
40. mbesto ◴[] No.46186720{5}[source]
To be fair, their license agreement is pretty much impossible to enforce.
41. measurablefunc ◴[] No.46186728{3}[source]
This is known as the data processing inequality. Non-invertible functions cannot create more information than what is available in their inputs: https://blog.blackhc.net/2023/08/sdpi_fsvi/. Whatever arithmetic operations are involved in laundering the inputs by stripping original sources & references cannot lead to novelty that wasn't already available in some combination of the inputs.

Neural networks can at best uncover latent correlations that were already available in the inputs. Expecting anything more is basically just wishful thinking.
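
For reference, the standard statement of the inequality (my paraphrase, not a quote from the linked post):

    % Data processing inequality: for a Markov chain X -> Y -> Z,
    % no processing of Y can increase the information Z carries about X.
    X \to Y \to Z \quad \Longrightarrow \quad I(X;Z) \le I(X;Y)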

replies(3): >>46187544 #>>46188783 #>>46191426 #
42. jacquesm ◴[] No.46186744{3}[source]
Why do you believe that to be a counterexample? In fragmentary form all of these elements must have been present in the input; the question is really how large the largest reusable fragment was and whether or not, barring some transformations, you could trace it back to the original. I've done some experiments along the same lines to see what it spits out, and what I noticed is that from example to example the programming style changed drastically, to the point that I suspect it was mimicking even the style and not just the substance of the input data, and this over chunks of code long enough that it would definitely clear the bar for plagiarism.
replies(1): >>46188006 #
43. kevinsync ◴[] No.46186759[source]
I've been struggling with this throughout the entire LLM-generated-code arc we're currently living -- I agree that it is wack in theory to take existing code and adapt it to your use-case without proper accreditation, but I've also been writing code since Pulp Fiction was in theaters and a lot of it is taking existing code and adapting it to my use-case, sometimes without a fully-documented paper trail.

Not to mention the moral vagaries of "if you use a library, is the complete articulation of your thing actually 100% your code?"

Is there a difference between loading and using a function from ImageMagick, and a standalone copycat function that mimics a function from ImageMagick?

What if you need it transliterated from one language to another?

Is it really that different than those 1200 page books from the 90's that walk you through implementing a 3D engine from scratch (or whatever the topic might be)? If you make a game on top of that book's engine, is your game truly yours?

If you learn an algorithm in some university class and then just write it again later, is that code yours? What if your code is 1-for-1 a copy of the code you were taught?

It gets very murky very quick!

Obviously I would encourage proper citation, but I also recognize the reality of this stuff -- what if you're fully rewriting something you learned decades ago and don't know who to cite? What if you have some code snippet from a website long forgotten that you saved and used? What if you use a library that also uses a library that you're not aware of because you didn't bother to check, and you either cite the wrapper lib or cite nothing at all?

I don't have some grand theory or wise thoughts about this shit, and I enjoy the anthropological studies trying to ascertain provenance / assign moral authority to remarkable edge cases, but end of the day I also find it exhausting to litigate the use of a tool that exploited the fact that your code got hoovered up by a giant robot because it was public, and might get regurgitated elsewhere.

To me, this is the unfortunate and unfair story of Gregory Coleman [0] -- drummer for The Winstons, who recorded "Amen, Brother" in 1969 (which gave us the most-sampled drum break in the world, spawned multiple genres of music, and changed human history) -- the man never made a dime from it, never even knew, and died completely destitute, despite his monumental contribution to culture. It's hard to reconcile the unjustness of it all, yet not that hard to appreciate the countless positive things that came out of it.

I don't know. I guess at the end of the day, does the end justify the means? Feels pretty subjective!

[0] https://en.wikipedia.org/wiki/Amen_break

replies(1): >>46187109 #
44. sholain ◴[] No.46186877{3}[source]
RAG and LLMs are not the same thing, but 'Agents' incorporate both.

Maybe we could resolve the conundrum raised by the OP by requiring 'agents' to give credit for things if they did RAG them or pull them off the web?

It still doesn't resolve the 'inherent learning' problem.

It's reasonable to suggest that if 'one person did it, we should give credit' - at least in some cases - and also reasonable that if 1K people have done similar things and the AI learns from that, well, I don't think credit is something that should apply.

But a couple of considerations:

- It may not be that common for an LLM to 'see one thing one time' and then have such an accurate assessment of the solution. It helps, but LLMs tend not to 'learn' things that way.

- Some people might consider this the OSS dream - any code that's public is public and it's in the public domain. We don't need to 'give credit' to someone because they solved something relatively arbitrary - or, if they are concerned with that, then we can have a separate mechanism for it, i.e. they can put it on GitHub or even Wikipedia, and then we can worry about 'who thought of it first' as a separate consideration. But in terms of engineering application, that would be a bit of a distraction.

replies(1): >>46187380 #
45. doix ◴[] No.46186903{3}[source]
Isn't RAG used for your code rather than other people's code? If I ask it to implement some algorithm, I'd be very surprised if RAG was involved.
46. saalweachter ◴[] No.46186951{4}[source]
It's not uncommon, in a well-written code base, to see documentation on different functions or algorithms noting where they came from.

This isn't just giving credit; it's valuable documentation.

If you're later looking at this function and find a bug or want to modify it, the original source might not have the bug, might have already fixed it, or might have additional functionality that is useful when you copy it to a third location that wasn't necessary in the first copy.

replies(1): >>46189813 #
47. martin-t ◴[] No.46187004[source]
Programmers are willingly blind to this, at least until it's their code being stolen or they lose their job.

_LLMs are lossily compressed archives of stolen code_.

Trying to achieve AI through compression is nothing new.[0] The key innovation[1] is that the model[2] does not output only the first-order input data but also the higher-order patterns from the input data.

That is certainly one component of intelligence, but we need to recognize that the tech companies didn't build AI; they built a compression algorithm which, combined with the stolen input text, can reproduce the input data and its patterns in an intelligent-looking way.

[0]: http://prize.hutter1.net/

[1]: Oh, god, this phrase is already triggering my generated-by-LLM senses.

[2]: Model of what? Of the stolen text. If 99.9999% of the work to achieve AI wasn't done by people whose work was stolen, they wouldn't be called models.

48. martin-t ◴[] No.46187030[source]
The whole "reproduces training data verbatim" debate is a red herring.

It reproduces _patterns from the training data_, sometimes including verbatim phrases.

The work (to discover those patterns, to figure out what works and what does not, to debug some obscure heisenbug and write a blog post about it, ...) was done by humans. Those humans should be compensated for their work, not owners of mega-corporations who found a loophole in copyright.

49. ineedasername ◴[] No.46187058[source]
>we probably would have a word for them

Student? Good learner? Pretty much what everyone does can be boiled down to reading lots of other code that’s been written and adapting it to a use case. Sure, to some extent models are regurgitating memorized information, but for many tasks they’re regurgitating a learned method of doing something and backfilling the specifics as needed— the memorization has been generalized.

50. nitwit005 ◴[] No.46187063[source]
Ask for something like "a first person shooter using software rendering", and search GitHub for the names of the rendering functions. Using Copilot I found code simply lifted from implementations of Doom, except that "int" was replaced with "int32_t" and similar.

It's also fun to tell Copilot that the code will violate a license. It will seemingly always tell you it's fine. Safe legal advice.

replies(2): >>46187330 #>>46190083 #
51. ineedasername ◴[] No.46187094{3}[source]
Do you have a source for that being the key difference? Where did you learn your words? I don’t see the names of your teachers cited here. The English language has existed a while; why aren’t you giving a citation every time you use a word that already exists in a lexicon somewhere? We have a name for people who don’t coin their own words for everything and rip off the words that others painstakingly evolved over millennia of history. Find your own graphemes.
replies(1): >>46187201 #
52. jacquesm ◴[] No.46187109[source]
What amazes me is how many programmers have absolutely no concept about copyright at all. This should be taught as a basic component of any programming course.
replies(1): >>46192736 #
53. latexr ◴[] No.46187201{4}[source]
What a profoundly bad faith argument. We all understand that singular words are public domain, they belong to everyone. Yet when you arrange them in a specific pattern, of which there are infinite possibilities, you create something unique. When someone copies that arrangement wholesale and claims they were the first, that’s what we refer to as plagiarism.

https://www.youtube.com/watch?v=K9huNI5sBd8

replies(3): >>46187434 #>>46188381 #>>46189676 #
54. FeepingCreature ◴[] No.46187235[source]
This is not how LLMs work.
55. Mtinie ◴[] No.46187257{3}[source]
"Recklessly" is a strong word. I’ll give you the benefit of the doubt and assume your comment was made in good faith.

How do you describe the “reckless” use of information?

56. Mtinie ◴[] No.46187287{5}[source]
GitHub, Bitbucket, GCE, AWS…all have licensing agreements for user contributions which the user flagged as “public”, so I’m not exactly clear on your point if you are holding SO up as a bastion of intellectual property rights different from the other places LLM training sets were scraped from.
replies(1): >>46187464 #
57. martin-t ◴[] No.46187330{3}[source]
And this is just the stuff you notice.

1) Verbatim copying is first-order plagiarism.

2a) Second-order plagiarism of written text would be replacing words with synonyms. Or taking a book paragraph by paragraph and for each one of them, rephrasing it in your own words. Yes, it might fool automated checkers but the structure would still be a copy of the original book. And most importantly, it would not contain any new information. No new positive-sum work was done. It would have no additional value.

Before LLMs almost nobody did this because the chance that it would help in a lawsuit vs the amount of work was not a good tradeoff. Now it is. But LLMs can do "better":

2b) A different kind of second-order plagiarism is using multiple sources and plagiarizing each of them only in part. Find multiple books on the same topic, take 1 chapter from each and order them in a coherent manner. Make it more granular. Find paragraphs or phrases which fit into the structure of your new book but are verbatim from other books. See how granular you can make it.

The trick here is that doing this by hand is more work than just writing your own book. So nobody did it and copyright law does not really address this well. But with LLMs, it can be automated. You can literally instruct an LLM to do this and it will do it cheaper than any human could. However, how LLMs work internally is yet different:

n) Higher-order plagiarism is taking multiple source books, identifying patterns, and then reproducing them in your "new" book.

If the patterns are sufficiently complex, nobody will ever be able to prove what specifically you did. What previously took creative human work now became a mechanical transformation of input data.

The point is this ability to detect and reproduce patterns is an impressive innovation but it's built on top of the work of hundreds of millions[0] of humans whose work was used without consent. The work done by those employed by the LLM companies is minuscule compared to that. Yet all of the reward goes to them.

Not to mention LLMs completely defeat the purpose of the (A)GPL. If you can take AGPL code and pass it through a sufficiently complex mechanical transformation such that the output does the same thing but copyright no longer applies, then free software is dead. No more freedom to inspect and modify.

[0]: Github alone has 100 million users ( https://expandedramblings.com/index.php/github-statistics/ ) and we have reason to believe all of their data was used in training.

replies(2): >>46187445 #>>46190385 #
58. martin-t ◴[] No.46187380{4}[source]
> if 1K people have done similar things ad the AI learns from that, well, I don't think credit is something that should apply.

I think it should.

Sure, if you make a small amount of money and divide it among the 1000 people who deserve credit due to their work being used to create ("train") the model, it might be too small to bother.

But if actual AGI is achieved, then it has nearly infinite value. If said AGI is built on top of the work of the 1000 people, then almost infinity divided by 1000 is still a lot of money.

Of course, the real numbers are way larger: LLMs were trained on the work of at least 100M and perhaps over a billion people. But the value they provide over a long enough timespan is also claimed to be astronomical (evidenced by the valuations of those companies). It's not just their employees who deserve a cut but everyone whose work was used to train them.

> Some people might consider this the OSS dream

I see the opposite. Code that was public but protected by copyleft can now be reused in private/proprietary software. All you need to do is push it through enough matmuls and some nonlinearities.

replies(1): >>46191218 #
59. jacquesm ◴[] No.46187434{5}[source]
This particular user does that all the time. It's really tiresome.
replies(1): >>46188474 #
60. jacquesm ◴[] No.46187445{4}[source]
If a human did 2a or 2b we would consider that a larger infraction than (1) because it shows intent to obfuscate the origins.

As for your 'free software is dead' argument: I think it is worse than that. It takes away the one payment that free software authors get: recognition. If a commercial entity can take the code, obfuscate it, and pass it off as their own copyrighted work to then embrace and extend it, then that is the worst possible outcome.

replies(1): >>46187588 #
61. moron4hire ◴[] No.46187456[source]
No, the thing needing proof is the novel idea: that LLMs can produce original code.
replies(3): >>46187777 #>>46188015 #>>46194023 #
62. jacquesm ◴[] No.46187464{6}[source]
I was not the person that introduced SO to the discussion.
63. xyzzy123 ◴[] No.46187544{4}[source]
Using this reasoning, would you argue that a new proof of a theorem adds no new information that was not present in the axioms, rules of inference and so on?

If so, I'm not sure it's a useful framing.

For novel writing, sure, I would not expect much truly interesting progress from LLMs without human input because fundamentally they are unable to have human experiences, and novels are a shadow or projection of that.

But in math – and a lot of programming – the "world" is chiefly symbolic. The whole game is searching the space for new and useful arrangements. You don’t need to create new information in an information-theoretic sense for that. Even for the non-symbolic side (say diagnosing a network issue) of computing, AIs can interact with things almost as directly as we can by running commands so they are not fundamentally disadvantaged in terms of "closing the loop" with reality or conducting experiments.

replies(1): >>46187895 #
64. martin-t ◴[] No.46187588{5}[source]
> shows intent to obfuscate the origins

Good point. Reminds me of how if you poison one person, you go to prison, but when a company poisons thousands, it gets a fine... sometimes.

> it takes away the one payment that free software authors get: recognition

I keep flip-flopping on this. I did most of my open source work not caring about recognition but about the principles of GPL and later AGPL. However, I came to realize it was a mistake - people don't judge you by the work you actually do but by the work you appear to do. I have zero respect for people who do something just for the approval of others but I am aware of the necessity of making sure people know your value.

One thing is certain: credit/recognition affect all open source code, user rights (e.g. to inspect and modify) affect only the subset under (A)GPL.

Both are bad in their own right.

65. marcus_holmes ◴[] No.46187777{3}[source]
LLMs can definitely produce original other stuff: ask one to create an original poem on an extremely specific niche subject and it will do so. You can specify the niche subject to the point where it is incredibly unlikely that there is a poem on that subject in its training data, and it will still produce an original poem on that subject [0]. The well-known "otter using wifi on a plane" series of images [1] is another example: this is not in the training data (well, it is now, because well-known, but you get the idea).

Is there something unique about code, that is different from language (or images), that would make it impossible for an LLM to produce original code? I don't believe so, but I'm willing to be convinced.

I think this switches the burden of proof: we know LLMs can produce original content in other contexts. Why would they not be able to create original code?

[0] Ever curious, I tested this assumption. I got Claude to write an original limerick about goats oiling their beards with olive oil, which was the first reasonable thing I could think of as a suitably niche subject. I googled the result and could not find anything close to it. I then asked it to produce another limerick on the same subject, and it produced a different limerick, so obviously not just repeating training data.

[1] https://www.oneusefulthing.org/p/the-recent-history-of-ai-in...

replies(1): >>46189117 #
66. measurablefunc ◴[] No.46187895{5}[source]
Sound deductive rules of logic cannot create novelty that exceeds the inherent limits of their foundational axiomatic assumptions. You cannot expect novel results from neural networks that exceed the inherent information capacity of their training corpus & the inherent biases of the neural network (encoded by its architecture). So if the training corpus is semantically unsound & inconsistent then there is no reason to expect that it will produce logically sound & semantically coherent outputs (i.e. garbage inputs → garbage outputs).
replies(1): >>46189246 #
67. handoflixue ◴[] No.46187981{5}[source]
Equally, of course, all six-year-olds need to be trained by other six-year-olds; we must stop this crutch of using adult teachers.
replies(1): >>46190648 #
68. handoflixue ◴[] No.46187999{3}[source]
It's telling that you can't actually provide a single concrete example - because, of course, anyone skilled with LLMs would be able to trivially solve any such example within 10 minutes.

Perhaps the occasional program that relies heavily on precise visual alignment will fail - but I dare say if we give the LLM the same grace we'd give a visually impaired designer, it can do exactly as well.

replies(1): >>46189065 #
69. handoflixue ◴[] No.46188006{4}[source]
> In fragmentary form all of these elements must have been present in the input

Yes, and Shakespeare merely copied the existing 26 letters of the English alphabet. What magical process do you think students are using when they read and re-combine learned examples to solve assignments?

replies(1): >>46192987 #
70. handoflixue ◴[] No.46188015{3}[source]
What's your proof that the average college student can produce original code? I'm reasonably certain I can get an LLM to write something that will pass any test that the average college student can, as far as that goes.
replies(1): >>46194796 #
71. geniium ◴[] No.46188272[source]
Thanks for writing this - love the way you explain the POV. I wish people would consider this angle more.
72. ineedasername ◴[] No.46188381{5}[source]
It’s not a bad-faith argument. It’s an attempt to shake thinking that is profoundly stuck by taking that thinking to an absurd extreme. Until that’s done, quite a few people aren’t able to see past the assumptions they don’t know they’re making. And by quite a few people I mean everyone, at different times. A strong appreciation for the absurd will keep a person’s thinking much sharper.
replies(1): >>46191241 #
73. ineedasername ◴[] No.46188474{6}[source]
It’s tiresome to see unexamined assumptions and self-contradictions tossed out by a community that can and often does do much better. Some light absurdism often goes further, and makes clear that I’m not just trying to set up a strawman, since I’ve already gone and made a parody of my own point.
74. pests ◴[] No.46188678[source]
While I generally agree with you, these "LLM is a human" comparisons really are tiresome, I feel. It hasn't been proven, and I don't know how many other legal issues could have been solved if adding "like a human" made it okay. Google v Oracle? "Oh, you've never learned an API??!?" Or take the original Google Books controversy - "it's reading books and memorizing them, like humans can". I do agree it's different, but I don't like this line of argument at all.
replies(1): >>46188983 #
75. pertymcpert ◴[] No.46188712{5}[source]
Sounds like creativity and intelligence to me.
replies(1): >>46192341 #
76. raincole ◴[] No.46188771[source]
This is why ragebait is chosen as the word of 2025.

> took that code without credit to the original author(s), adapted it to your use case

Aka software engineering.

77. cornel_io ◴[] No.46188783{4}[source]
Theoretical "proofs" of limitations like this are always unhelpful because they're too broad, and apply just as well to humans as they do to LLMs. The result is true but it doesn't actually apply any limitation that matters.
replies(1): >>46189062 #
78. checker659 ◴[] No.46188840[source]
I think the burden of proof is on the people making the original claim (that LLMs are indeed spitting out original code).
79. FanaHOVA ◴[] No.46188983{3}[source]
I agree, that's why I was trying to point out that saying "if a person did that we'd have a word for them" is useless. They are not people, and people don't behave like that anyway. It adds nothing to the discussion.
80. measurablefunc ◴[] No.46189062{5}[source]
You're confused about what applies to people & what applies to formal systems. You will continue to be confused as long as you keep thinking formal results can be applied in informal contexts.
81. tovej ◴[] No.46189065{4}[source]
I recently asked an LLM to give me one of the most basic and well-documented algorithms in the world: a blocked matrix multiply. It's essentially a few nested loops and some constants for the block size.
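
For reference, the kind of thing I was asking for is roughly this (an illustrative C sketch, with N and BS picked arbitrarily; it assumes N is a multiple of BS and that C starts zeroed):

    #define N  512
    #define BS 64   /* block size */

    /* Blocked (tiled) matrix multiply: C += A * B, processed one
       BS x BS tile at a time for cache locality. */
    void blocked_matmul(const double A[N][N], const double B[N][N], double C[N][N])
    {
        for (int ii = 0; ii < N; ii += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int jj = 0; jj < N; jj += BS)
                    for (int i = ii; i < ii + BS; i++)
                        for (int k = kk; k < kk + BS; k++) {
                            double a = A[i][k];
                            for (int j = jj; j < jj + BS; j++)
                                C[i][j] += a * B[k][j];
                        }
    }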

It failed massively, spitting out garbage code, where the comments claimed to use blocking access patterns, but the code did not actually use them at all.

LLMs are, frankly, nearly useless for programming. They may solve a problem every once in a while, but once you look at the code, you notice it's either directly plagiarized or bad quality (or both, I suppose, in the latter case).

82. jacquesm ◴[] No.46189117{4}[source]
No, it transformed your prompt. Another person giving it the same prompt will get the same result when starting from the same state. f('your prompt here') is a transformation of your prompt based on hidden state.
replies(1): >>46190854 #
83. tovej ◴[] No.46189125[source]
You don't?

If you reproduce something, usually you have to check the earlier implementation for it and copy it over. This would inevitably require you to look at the license and author of said code.

Assuming of course, you're talking about nontrivial functionality, because obviously we're not talking about trivial one-liners etc.

84. tovej ◴[] No.46189150{4}[source]
The original point, that LLMs are plagiarising inputs, is a very common and common sense opinion.

There are court cases where this is being addressed currently, and if you think about how LLMs operate, a reasonable person typically sees that it looks an awful lot like plagiarism.

If you want to claim it is not plagiarism, that requires a good argument, because it is unclear that LLMs can produce novelty, since they're literally trying to recreate the input data as faithfully as possible.

replies(1): >>46192004 #
85. xyzzy123 ◴[] No.46189246{6}[source]
Maybe? But it also seems like you are not accounting for new information at inference time. Let's pretend I agree the LLM is a plagiarism machine that can produce no novelty in and of itself that didn't come from what it was trained on, and produces mostly garbage (I only half agree lol, and I think "novelty" is under-specified here).

When I apply that machine (with its giant pool of pirated knowledge) _to my inputs and context_ I can get results applicable to my modestly novel situation which is not in the training data. Perhaps the output is garbage. Naturally if my situation is way out of distribution I cannot expect very good results.

But I often don't care if the results are garbage some (or even most!) of the time if I have a way to ground-truth whether they are useful to me. This might be via running a compile, a test suite, a theorem prover or mk1 eyeball. Of course the name of the game is to get agents to do this themselves and this is now fairly standard practice.

replies(1): >>46189394 #
86. measurablefunc ◴[] No.46189394{7}[source]
I'm not here to convince you whether Markov chains are helpful for your use cases or not. I know from personal experience that even in cases where I have a logically constrained query I will receive completely nonsensical responses¹.

¹https://chatgpt.com/share/69367c7a-8258-8009-877c-b44b267a35...

replies(1): >>46189749 #
87. tscherno ◴[] No.46189676{5}[source]
It is possible that the concept of intellectual property could be classified as a mistake of our era by the history teachers of future generations.
replies(1): >>46190024 #
88. jacquesm ◴[] No.46189749{8}[source]
> Here is a correct, standard correction:

It does this all the time, but as often as not it then outputs nonsense again, just different nonsense, and if you keep it running long enough it starts repeating previous errors (presumably because some sliding window is exhausted).

replies(1): >>46190162 #
89. jacquesm ◴[] No.46189797[source]
> we have laws and norms that provide sufficient incentive for new ideas to continue to be created

Indeed, and up until the advent of 'AI' we did. But that incentive is being killed right now and I don't see any viable replacement on the horizon.

90. jacquesm ◴[] No.46189813{5}[source]
This is why I'm still, even after decades of seeing it fail in the marketplace, a fan of literate programming.
91. latexr ◴[] No.46190024{6}[source]
Intellectual property is a legal concept; plagiarism is ethical. We’re discussing the latter.
92. fransje26 ◴[] No.46190083{3}[source]
> It's also fun to tell Copilot that the code will violate a license. It will seemingly always tell you it's fine. Safe legal advice.

Perfectly embodies the AI "startup" mentality. Nice.. /s

93. measurablefunc ◴[] No.46190162{9}[source]
That's been my general experience, and that was the most recent example. People keep forgetting that unless they can independently verify the outputs they are essentially paying OpenAI for the privilege of being very confidently gaslighted.
replies(1): >>46192945 #
94. fc417fc802 ◴[] No.46190385{4}[source]
You make several good points, and I appreciate that they appear well thought out.

> What previously took creative human work now became a mechanical transformation of input data.

At which point I find myself wondering if there's actually a problem. If it was previously permitted due to the presence of creative input, why should automating that process change the legal status? What justifies treating human output differently?

> then free software is dead. No more freedom to inspect and modify.

It seems to me that depends on the ideological framing. Consider a (still entirely hypothetical) world where anyone can receive approximately any software they wish with little more than a Q&A session with an expert AI agent. Rather than free software being dead, such a scenario would appear to obviate the vast majority of needs that free software sets out to serve in the first place.

It seems a bit like worrying that free access to a comprehensive public transportation service would kill off a ride sharing service. It probably would, and the end result would also probably be a net benefit to humanity.

replies(2): >>46192893 #>>46192990 #
95. subscribed ◴[] No.46190648{6}[source]
Beautiful, thank you.
96. marcus_holmes ◴[] No.46190854{5}[source]
This is also true of humans, see every debate on free will ever.

The trick, of course, is getting to the exact same starting state.

97. checkmatez ◴[] No.46191020[source]
> that LLMs are fundamentally unable to produce original code.

What about humans? Are humans capable of producing completely original code or ideas or thoughts?

As the saying goes, if you want to create something from scratch, you have to start by inventing the universe.

The human mind works by noticing patterns and applying them in different contexts.

98. sholain ◴[] No.46191218{5}[source]
- I don't think it's even reasonable to suggest that 1000 people all coming up with variations of some arbitrary bit of code deserve credit - or, certainly, 'financial remuneration' - just because they wrote some arbitrary piece of code.

That scenario is already very well accepted today, legally, morally, etc., as public domain.

- Copyleft is not OSS; it's a tiny variation of it, which is both highly ideological and impractical. Less than 2% of OSS projects are copyleft. It's a legit perspective obviously, but it hasn't been representative for 20 years.

Whatever we do with AI, we already have a basic understanding of public domain, at least we can start from there.

replies(1): >>46213108 #
99. stOneskull ◴[] No.46191241{6}[source]
>> The key difference between plagiarism and building on someone's work is whether you say, "this is based on code by linsey at github.com/socialnorms" or "here, let me write that for you."

> [I want to] shake thinking that is profoundly stuck [because they] aren’t able to see past the assumptions they don’t know they’re making

what is profoundly stuck, and what are the assumptions?

replies(1): >>46192275 #
100. nl ◴[] No.46191426{4}[source]
This is simply not true.

Modern LLMs are trained by reinforcement learning, where they try to solve a coding problem and receive a reward if they succeed.

Data Processing Inequalities (from your link) aren't relevant: the model is learning from the reinforcement signal, not from human-written code.

replies(1): >>46192911 #
101. DangitBobby ◴[] No.46192004{5}[source]
I need you to prove to me that it's not plagiarism when you write code that uses a library after reading documentation, I guess.

> since they're literally trying to recreate the input data as faithfully as possible.

Is that how they are able to produce unique code based on libraries that didn't exist in their training set? Or that they themselves wrote? Is that how you can give them the documentation for an API and it writes code that uses it? Your desire to make LLMs "not special" has made you completely blind to reality. Come back to us.

replies(2): >>46192252 #>>46193014 #
102. tovej ◴[] No.46192252{6}[source]
What?

The LLM is trained on a corpus of text, and when it is given a sequence of tokens, it finds the set of tokens that, when one of them is appended, makes the resulting sequence most like the text in that corpus.

If it is given a sequence of tokens that is unlike anything in its corpus, all bets are off and it produces garbage, just like machine learning models in general: if the input is outside the learned distribution, quality goes downhill fast.

The fact that they've added a Monte Carlo feature to the sequence generation, which makes it sometimes select a token that is slightly less like the most exact match in the corpus, does not change this.

LLMs are fuzzy lookup tables for existing text, that hallucinate text for out-of-distribution queries.

This is LLM 101.

If the LLM were only trained using documentation, then there would be no problem: it would generate a design, look at the documentation, understand the semantics of both, and translate the design to code by using the documentation as a guide.

But that's not how it works. It has open source repositories in its corpus that it then recreates by chaining together examples in the stochastic-parrot method I described above.

replies(1): >>46197240 #
103. macinjosh ◴[] No.46192275{7}[source]
That your brain training on all the inputs it sees and creating output is fundamentally more legitimate than a computer doing the same thing.
replies(1): >>46194468 #
104. tatjam ◴[] No.46192341{6}[source]
I think the key is that the LLM has no trouble mapping from one "embedding" of the language to another (the task they are best at!), and that appears extremely intelligent to us humans, but it certainly is not all there is to intelligence.

But just take a look at how LLMs struggle to handle dynamical, complex systems such as the one in the "vending machine" paper published some time ago. Those kinds of tasks, which we humans tend to think of as "less intelligent" than, say, converting human language to a C++ implementation, seem to have some kind of higher (or at least, different) complexity than the embedding mapping done by LLMs. Maybe that's what we typically refer to as creativity? And if so, modern LLMs certainly struggle with that!

Quite sci-fi that we have created a "mind" so alien we struggle to even agree on the word to define what it's doing :)

105. bluesign ◴[] No.46192736{3}[source]
Copyright itself is a complex subject; when you apply it to code it gets even more complex.
106. jacquesm ◴[] No.46192893{5}[source]
> At which point I find myself wondering if there's actually a problem. If it was previously permitted due to the presence of creative input, why should automating that process change the legal status? What justifies treating human output differently?

Copyright law... automated transformation preserves copyright. It makes the output a derivative of the input.

replies(1): >>46196859 #
107. jacquesm ◴[] No.46192911{5}[source]
Ok, then we can leave the training data out of the input, everybody happy.
108. jacquesm ◴[] No.46192945{10}[source]
It would be a really nice exercise - for which I unfortunately do not have the time - to have a non-trivial conversation with the best models of the day and then to rigorously fact-check every bit of output to determine the output quality. Judging from my own (probably not a representative sample) experience it would be a very meager showing.

I use AI only as a means of last resort now, and then mostly as a source of inspiration rather than as a direct tool aimed at solving an issue. Used like that it has been useful on occasion, but it has at least as often been a tremendous waste of time.

109. jacquesm ◴[] No.46192987{5}[source]
This same argument has now been made a couple of times in this thread (in different guises) and does absolutely nothing to move the conversation forward.

Words and letters are not copyrightable patterns in and of themselves. It is the composition of words and letters that we consider to be original creations and 'the bard' put them in a meaningful and original order not seen before, which established his reputation as a playwright.

110. martin-t ◴[] No.46192990{5}[source]
> What justifies treating human output differently?

Human time is inherently valuable, computer time is not.

One angle:

The real issue is how this is made possible. Imagine an AI being created by a lone genius or a team of really good programmers and researchers by sitting down and just writing the code. From today's POV, it would be almost unimaginably impressive but that is how most people envisioned AI being created a few decades ago (and maybe as far as 5 years ago). These people would obviously deserve all the credit for their invaluable work and all the income from people using their work. (At least until another team does the same, then it's competition as normal.)

But that's not how AI is being created. What the programmers and researchers really do is create a highly advanced lossy compression algorithm which then takes nearly all publicly available human knowledge (disregarding licenses/consent) and creates a model of it which can reproduce both the first-order data (duh) and the higher-order patterns in it (cool). Do they still deserve all the credit and all the income? What if there are 1k researchers and programmers working on the compression algorithm (= training algorithm) and 1B people whose work ("content") is compressed by it (= used to train it)? I will freely admit that the work done to build the algorithm is higher skilled than most of the work done by the 1B people. Maybe even 10x or 100x more expensive. But if you multiply those numbers (1k * 100 vs 1B), you have to come to the conclusion that the 1B people deserve the vast majority of the credit and the vast majority of the income generated by the combined work. (And notice when another team creates a competing model based on the same data, the share by the 1B stays the same and the 1k have to compete for their fraction.)

Another angle:

If you read a book, learn something from it and then apply the knowledge to make money, you currently don't pay a share to the author of the book. But you paid a fixed price for the book, hopefully. We could design a system where books are available for free, we determine how much the book helped you make that money, and you pay a proportional share to the author. This is not as entirely crazy as it might sound. When you cause an injury to someone, a court will determine how much each party involved is liable and there are complex rules (e.g. https://en.wikipedia.org/wiki/Joint_and_several_liability) determining the subsequent exchange of money. We could in theory do the same for material you learn from (though the fractions would probably be smaller than 1%). We don't because it would be prohibitively time consuming, very invasive, and often unprovable unless you (accidentally) praise a specific blog post or say you learned a technique from a book. Instead, we use this thing called market capitalism where the author sets a price and people either buy the book or not (depending on whether they think it's worth it for them), some of them make no money as a result, some make a lot, and we (choose to) believe that in aggregate, the author is fairly compensated.

Even if your blog is available for anyone to read freely, you get compensated in alternative ways by people crediting you and/or by building an audience you can influence to a degree.

With LLMs, there is no way to get the companies training the models to credit you or build you an audience. And even if they pay for the books they use for training, I don't believe they pay enough. The price was determined before the possibility of LLM training was known to the author and the value produced by a sufficiently sophisticated AI, perhaps AGI (which they openly claim to want to create) is effectively unlimited. The only way to compensate authors fairly is to periodically evaluate how much revenue the model attracted and pay a dividend to the authors as long as that model continues to be used.

Best of all, unlike with humans, the inner workings of a computer model, even a very complex one, can be analyzed in their entirety. So it should be possible to track (fractional) attribution throughout the whole process. There's just no incentive for the companies to invest into the tooling.

---

> approximately any software they wish with little more than a Q&A session with an expert AI agent

Making software is not just about writing code, it's about making decisions. Not just understanding problem and designing a solution but also picking tradeoffs and preferences.

I don't think most people are gonna do this just like most people today don't go to a program's settings and tweak every slider/checkbox/dropdown to their liking. They will at most say they want something exactly like another program with a few changes. And then it's clearly based on that original program and all the work performed to find out the users' preferences/likes/dislikes/workflows which remain unchanged.

But even if they genuinely recreate everything, then if it's done by an LLM, it's still based on work of others as per the argument above.

---

> the end result would also probably be a net benefit to humanity.

Possibly. But in the case of software fully written by sufficiently advanced LLMs, that net benefit would be created only by using the work of a hundred million or possibly a billion of people for free and without (quite often against) their consent.

Forced work without compensation is normally called slavery. (The only difference is that our work has already been done and we're "only" forced to accept that we cannot prevent LLM companies from using it, despite using licenses which, by their intent and by the logic above, absolutely should prevent it.)

The real question is how to achieve this benefit without exploiting people.

And don't forget such a model will not be offered for free to everyone as a public good. Not even to those people whose data was used to train it. It will be offered as a paid service. And most of the revenue won't even go to the researchers and programmers who worked on the model directly and who made it possible. It will go to the people who contributed the least (often zero) technical work.

---

This comment (and its GP), which contains arguments I have not seen anywhere else, was written over an hour-long train ride. I could have instead worked remotely to make more than enough money to pay for the train ride. Instead, I write this training data, which will be compressed and some patterns from it reproduced, allowing people I will never know and who will never know me to make an amount of money I have no chance of quantifying and will get nothing from. Now, I have to work some other hour to pay for the train ride. Make of that what you will.

replies(2): >>46193370 #>>46197440 #
111. jacquesm ◴[] No.46193014{6}[source]
No, you need to prove that it is not plagiarism when you use an LLM to produce a piece of code that you then claim as yours.

You have the whole burden of proof thing backwards.

replies(1): >>46197274 #
112. jacquesm ◴[] No.46193370{6}[source]
One of your remarks regarding attribution and compensation goes back to 'Xanadu' by the way, if you are not familiar with it that might be worth reading up on (Ted Nelson). He obviously did this well before the current AI age but a lot of the ideas apply.

A meta-comment:

I absolutely love your attention to detail in this discussion and the way you avoid taking 'the easy way out' of some of the hairier concepts embedded in it. This is exactly the kind of interaction that I love HN for, and it is interesting how this thread seems to bring out the best in you at the same time that it seems to bring out the worst in others.

Most likely they are responding as strongly as they do because they've bought into this matter to the degree that they are passing off works that they did not create as their own novel output; they got paid for it, and they - like a religious person - are now so invested in this that it has become their crutch and a part of their identity.

If you have another train ride to make I'd love for you to pick apart that argument and to refute it.

replies(1): >>46227255 #
113. ChromaticPanic ◴[] No.46194023{3}[source]
This just reeks of a lack of understanding of how transformers work. Unlike Markov chains, which can only regurgitate known sequences, transformers can actually make new combinations.
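For illustration only, a minimal sketch of the Markov-chain half of that claim (toy corpus, everything here is made up): a bigram chain can only ever emit word pairs it has literally seen during training, whereas a transformer samples from a learned function over the whole context and can produce combinations that never appeared verbatim.

    # Toy bigram Markov chain: it can only reproduce transitions observed in training.
    import random
    from collections import defaultdict

    def train_bigrams(text):
        table = defaultdict(list)
        words = text.split()
        for a, b in zip(words, words[1:]):
            table[a].append(b)  # every possible continuation was seen verbatim
        return table

    def generate(table, start, n=8):
        out = [start]
        for _ in range(n):
            successors = table.get(out[-1])
            if not successors:
                break
            out.append(random.choice(successors))  # only previously seen successors
        return " ".join(out)

    table = train_bigrams("the cat sat on the mat the dog sat on the rug")
    print(generate(table, "the"))  # every adjacent word pair occurs in the training text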
114. grayhatter ◴[] No.46194087[source]
> What if it were ten different humans writing ten different-but-related pieces of code, and an eleventh human piecing them together? What if it were 1,000 different humans?

What if it was just a single person? I take it you didn't read any of the code in the OCaml vibe PR that was posted a bit ago? The one where Claude copied not just implementation specifics, but even the copyright headers from a named, specific person.

It's clear that you can have no idea if the magic black box is copying from a single source, or from many.

So your comment boils down to: plagiarism is fine as long as I don't have to think about it. Are you really arguing that's OK?

replies(1): >>46195634 #
115. Arelius ◴[] No.46194468{8}[source]
Copyright isn't some axiom, but to quote wikipedia: "Copyright laws allow products of creative human activities, such as literary and artistic production, to be preferentially exploited and thus incentivized."

It's a tool to incentivize human creative expression.

Thus it's entirely sensible to consider and treat the output from computers and humans differently.

Especially when you consider the large differences between computers and humans, such as how trivial it is to create perfect duplicates of a trained model.

116. moron4hire ◴[] No.46194796{4}[source]
I'm not asking about averages. I'm asking about any. There is no need to perform an academic research study to prove that humans are capable of writing original code, because the existence of our conversation right now is the counterexample that disproves the negation.

Yes, it is true that a lot of humans remix existing code. But not all. It has yet to be proven that any LLM is doing something more than remixing code.

I would submit as evidence for this idea (that LLMs are not capable of writing original code) the fact that not a single company using LLM-based AI coding has developed a novel product that has outpaced its competition. In any category. If AI really makes people "10x" more productive, then companies that adopted AI a year ago should be 10 years ahead of their competition. Substitute any value N > 1 you want and you won't see it. Indeed, given the stories we're seeing of the massive amounts of waste occurring within AI startups and companies adopting AI, it would suggest that N < 1.

117. jacquesm ◴[] No.46195634{3}[source]
> So your comment boils down to: plagiarism is fine as long as I don't have to think about it.

It is actually worse: plagiarism is fine if I'm shielded from such claims by using a digital mixer. When criminals use crypto tumblers to hide their involvement, we tend to see that as proof of intent, not as absolution.

LLMs are copyright tumblers.

https://en.wikipedia.org/wiki/Cryptocurrency_tumbler

118. fc417fc802 ◴[] No.46196859{6}[source]
Yes, that's what the law currently says. I'm asking if it ought to say that in this specific scenario.

Previously there was no way for a machine to do large swaths of things that have only recently become possible. Thus a law predicated on the assumption that a machine can't do certain things might need to be revisited.

replies(1): >>46213315 #
119. DangitBobby ◴[] No.46197240{7}[source]
K
120. DangitBobby ◴[] No.46197274{7}[source]
Oh wild, I was operating under the assumption that the law requires you to prove that a law was broken, but it turns out you need to prove it wasn't. Thanks!
121. fc417fc802 ◴[] No.46197440{6}[source]
Human time is certainly valuable to a particular human. However, if I choose to spend time doing something that a machine can do, people will not generally choose to compensate me more for it just because it was me doing it instead of a machine.

I think it's worth remembering that IP law is generally viewed (at least legally) as existing for the net benefit of society, as opposed to for ethical reasons. Certainly many authors feel like they have (or ought to have) some moral right to control their work, but I don't believe that was ever the foundation of IP law.

Nor do I think it should be! If we are to restrict people's actions (ex copying) then it should be for a clear and articulable net societal benefit. The value proposition of IP law is that it prevents degenerate behavior that would otherwise stifle innovation. My question is thus, how do these AI developments fit into that?

So I completely agree that (for example) laundering a full work more or less verbatim through an AI should not be permissible. But when it comes to the higher order transformations and remixes that resemble genuine human work I'm no longer certain. I definitely don't think that "human exceptionalism" makes for a good basis either legally or ethically.

Regarding FOSS licenses, I'm again asking how AI relates back to the original motivations. Why does FOSS exist in the first place? What is it trying to accomplish? A couple of ideological motivations that come to mind are preventing someone from building on top of the work and then profiting from it, and ensuring user freedom and the ability to tinker.

Yes, the current crop of AI tools seems to pose an ideological issue. However! That's only because the current iteration can't truly innovate and also (as you note) the process still requires lots of painstaking human input. That's a far cry from the hypothetical that I previously posed.

122. martin-t ◴[] No.46213108{6}[source]
> I don't think it's even reasonable to suggest that 1000 people all coming up with variations of some arbitrary bit of code either deserve credit

There are 8B people on the planet; probably ~100M can code to some degree[0]. Something only 1k people write is actually pretty rare.

Where would you draw the line? How many out of how many?

If I take a leaked bit of Google or MS or, god forbid, Oracle code and manage to find a variation of each small block in a few other projects, does it mean I can legally take the leaked code and use it for free?

Do you even realize to what lengths the tech companies went just a few years ago to protect their IP? People who ever even glanced at leaked code were prohibited from working on open source reimplementations.

> That scenario is already today very well accepted legally and morally etc as public domain.

1) Public domain is a legal concept; it has 0 relevance to morality.

2) Can you explain how you think this works? Can a person's work just automatically become public domain somehow by being too common?

> Copyleft is not OSS, it's a tiny variation of it, which is both highly ideological and impractical.

This sentence seems highly ideological. Linux is GPL; in fact, probably most software on my non-work computer is GPL. It is very practical and works much better than commercial alternatives for me.

> Less than 2% of OSS projects are copyleft.

Where did you get this number? Using search engines, I get 20-30%.

[0]: That's the number of GitHub users, though there are reportedly only ~25M professional SW devs; many more people can code but don't do it professionally.

replies(1): >>46214987 #
123. martin-t ◴[] No.46213315{7}[source]
This is the first technology in human history which only works if you use an exorbitant amount of other people's work (without their consent, often without even their knowledge) to automate their work.

There have been previous tech revolutions but they were based on independent innovation.

> Thus a law predicated on the assumption that a machine can't do certain things might need to be revisited.

Perhaps using copyright law for software and other engineering work was a mistake, but it worked to a degree that I and most devs were OK with.

Sidenote: There is no _one_ copyright law. IANAL, but reportedly, for example, datasets are treated differently in the US vs the EU, with greater protection in the EU for the work that went into creating a database. And of course, China does what is best for China at a given moment.

There are 2 approaches:

1) Either we follow the current law. Its spirit and (again, IANAL) probably its letter say that mechanical transformation preserves copyright. Therefore the LLMs and their output must be licensed under the same license as the training data (if all the training data use compatible licenses) or are illegal (if they mixed incompatible licenses). The consequence is that, very roughly, a) most proprietary code cannot be used for training, b) using only permissive code gives you a permissively licensed model and output, and c) permissive and copyleft code can be combined, as long as the resulting model and output are copyleft (see the sketch after point 2 below). It still completely ignores attribution, but this is a compromise I would at least consider accepting.

(But if I don't even get credit for 10 years of my free work being used to build this innovation, then there should be a limit on how much the people building the training algorithms get out of it as well.)

2) We design a new law. Legality and morality are, sadly, different and separate concepts. Now, call me a naive sucker, but I think legality should try to approximate morality as closely as possible, only deviating due to the real-world limitations of provability. (E.g. some people deserve to die, but the state shouldn't have the right to kill them because the chance of error is unacceptably high.) In practice, the law is determined by what the people writing it can get away with before the people forced to follow it revolt. I don't want a revolution, but I think, for example, that a bloody revolution is preferable to slavery.
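As referenced in point 1 above, a rough sketch of that license-combination logic (the license names and the two-bucket categorization are simplifications for illustration, not legal advice):

    # Hypothetical sketch of point 1: the model/output license is the strictest
    # compatible combination of the training data licenses, or no license at all.
    PERMISSIVE = {"MIT", "BSD-2-Clause", "Apache-2.0"}
    COPYLEFT = {"GPL-3.0", "AGPL-3.0"}

    def combined_license(training_licenses):
        licenses = set(training_licenses)
        if not licenses <= (PERMISSIVE | COPYLEFT):
            return None  # proprietary or unknown licenses in the mix -> no legal training
        if licenses & COPYLEFT:
            return "copyleft"   # permissive + copyleft is fine, result must be copyleft
        return "permissive"     # only permissive inputs -> permissive model and output

    print(combined_license({"MIT", "Apache-2.0"}))        # permissive
    print(combined_license({"MIT", "GPL-3.0"}))           # copyleft
    print(combined_license({"MIT", "proprietary-EULA"}))  # None (cannot be trained on)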

Either way, there are established processes both for handling violations of laws and for writing new laws. This should not be decided by private for-profit corporations seeing whether they can get away with it scot-free, or trembling that they might have to pay a fine which is a near-zero fraction of their revenue, with almost universally no repercussions for their owners.

124. sholain ◴[] No.46214987{7}[source]
+ Once again: 1,000 people coming up with some arbitrary bit of content is already understood in basically every legal regime in the world as 'public domain'.

"Can you explain how you think this works? Can a person's work just automatically become public domain somehow by being too common?"

Please ask ChatGPT for the breakdown, but start with this: if someone writes something and does not copyright it, it's already in the 'public domain', and what the other 999 people do does not matter. Moreover, a lot of things are not copyrightable in the first place.

FYI, I've worked at Fortune 50 tech companies, with 'Legal', and I know how sensitive they are - this is not a concern for them.

It's not a concern for anyone.

'One Person' reproduction -> now that is definitely a concern. That's what this is all about.

+ For OSS, I think the 20% number may come from repos that are explicitly licensed. Out of 'all repos' it's a very tiny amount; of those that have specific licensing details, it's closer to 20%. You can verify this yourself just by cruising repos. The breakdown could be different for popular projects, but in the context of AI and IP rights we're more concerned about 'small entities' being overstepped, as the more institutional entities may have recourse and protections.

I think the way this will play out is: if LLMs are producing material that could be considered infringing, then they'll get sued. If they don't - they won't.

And that's it.

It's why they don't release the training data - it's full of stuff that is in a legal grey area.

125. martin-t ◴[] No.46227255{7}[source]
> Xanadu

I've heard about it in the past but never really looked into what it is. It's now on my to-read list, so I hope to get to it sometime this century...

> I absolutely love your attention to detail

Thanks. I've been thinking about this for almost two years, and my position seems like it should be the obvious one for anyone who takes the time to understand what is happening with this tech, both technologically and politically. Yet a lot of people seem to be supportive, oblivious to both the exploitation already happening and the coming consequences.

So I try to articulate this position as best I can in the hope that I can convince at least a few people. And if they do the same, maybe we can have some impact. TBH, I am using HN as a proving ground for how to phrase ideas. Sadly, people react to the way something is written, not to what is written. So even supporting a good idea can actually harm it if it's phrased poorly.

I started a blog and wanted to write about tech, but this IP theft ramped up before I finished my first real article, and after that I didn't feel like pouring my energy into something which would be scraped and stripped of any connection to me just to make some rich asshole richer. I figured people were gonna come to their senses, but most are indifferent or have accepted the reality, with a loud minority cheering for it. Probably because, finally, they're not the ones with a boot on their neck; it's the pesky white-collar programmers and artists who had a good life through nothing more than lucky genetics which made them smart. I mean, I am already noticing that some people who used to ask me for tech advice are starting to treat me differently since I am no longer useful (unless the problem is some super obscure bug that hasn't made it into the training data), and therefore no longer valuable to them.

Society is built on hierarchical power structures where each upper layer maintains or further entrenches its position by making people on the lower layers fight each other, sometimes literally.

WW1 was the peak of _visible_ human stupidity and submissiveness, with random people killing each other by the tens of thousands a day while not a single one of them stood to gain anything from "winning". For most men, the moment they were handed a rifle was the moment they held the most power they would ever have in their lives. Yet they chose indirect suicide over direct murder. (Murder, being a legal term, carries no judgement of the morality of the act. I used it because "murder" is what opting out of this system would have been called by the people in power. Or "treason" if it had any chance of success. Or "revolution" if enough people did it.)

Since then, the power structures have evolved; the innovation is that the fighting among us "commoners" is no longer so spectacularly visible.

So I gotta do something. These comments have almost 0 reach, but they allow me to organize the ideas in my head, and hopefully I'll bring myself to write proper blog posts. They'll also have almost 0 reach, but it'll hopefully be a bit further from 0.

I have a lot of ideas and opinions which I have never heard expressed anywhere else. Certainly, I can't be that unique; many people must have thought them or even written about them before, but it's nearly impossible to find anything.

---

> the best in you

This is probably not the kind of reply you meant when you wrote that; I rant when I am tired.

---

Anyway, I added my blog to my profile. There's nothing of value there; I spent a decade reading about tech, excited for the bright future to come, and talking about starting my own tech blog. And right as I finally did, LLMs happened, and that was the event which made me realize tech is just another tool of oppression and exploitation. Not that it wasn't before, but I was naive and stupid, and this was the event which catalyzed a large rethinking for me personally. So if you wanna read something from me in the future, the RSS hopefully works.