
Using LLMs at Oxide

(rfd.shared.oxide.computer)
694 points steveklabnik | 24 comments
1. csb6 ◴[] No.46179547[source]
Strange to see no mention of potential copyright violations found in LLM-generated code (e.g. LLMs reproducing code from Github verbatim without respecting the license). I would think that would be a pretty important consideration for any software development company, especially one that produces so much free software.
replies(4): >>46179678 #>>46179797 #>>46179941 #>>46188231 #
2. dboreham ◴[] No.46179678[source]
Do current generation LLMs do this? I suppose I mean "do this any more than human developers do".
replies(1): >>46180015 #
3. don-bright ◴[] No.46179797[source]
Also, since LLM-generated content is not copyrightable, what happens to code you publish under a copyleft license? The entire copyleft system is based on the idea of a human holding copyright to copyleft code. Is a big chunk of it, the LLM part, basically public domain? How do you ensure there's enough human content to make it copyrightable, and hence copyleftable?
replies(1): >>46181201 #
4. fastball ◴[] No.46179941[source]
Has anything like this worked its way through the courts yet?
replies(1): >>46180382 #
5. theresistor ◴[] No.46180015[source]
A very recent example: https://github.com/ocaml/ocaml/pull/14369
replies(2): >>46180102 #>>46180413 #
6. phyzome ◴[] No.46180102{3}[source]
...what a remarkable thread.
replies(1): >>46180509 #
7. adastra22 ◴[] No.46180382[source]
Yes, training is considered fair use, and output is non-copyrightable / public domain. With many asterisks and footnotes, of course.
replies(1): >>46180409 #
8. Madmallard ◴[] No.46180409{3}[source]
I don't see how output being public domain makes sense when they could be outputting copyrighted code.

Shouldn't the rights extend forward and simply require the LLM code to be deleted?

replies(2): >>46180495 #>>46180516 #
9. yard2010 ◴[] No.46180413{3}[source]
>> Here's my question: why did the files that you submitted name Mark Shinwell as the author?

> Beats me. AI decided to do so and I didn't question it. I did ask AI to look at the OxCaml implementation in the beginning.

This shows that the problem with AI is philosophical, not practical.

10. menaerus ◴[] No.46180495{4}[source]
First, you have to prove that it produced the copyrighted code. And the question is: what counts as copyrighted code in the first place? A literal copy-paste from the source is easy, but I think 99% of the time this isn't the case.
11. menaerus ◴[] No.46180509{4}[source]
Right? If it's really true that some random person without compiler engineering experience implemented a completely new feature in the OCaml compiler by prompting the LLM to produce the code for him, then I think it really is remarkable.
replies(2): >>46181474 #>>46182364 #
12. adastra22 ◴[] No.46180516{4}[source]
With many asterisks and footnotes. One of which is that if it literally output the exact code, of course that would be copyright infringement. Something that greatly resembled it but with minor changes would be a gray area.

Those kinds of cases, although they do happen, are exceptional. A typical output that doesn't line-for-line resemble any single training input is considered a new, but non-copyrightable, work.

replies(1): >>46183504 #
13. IshKebab ◴[] No.46181201[source]
> since LLM generated content is not copyrightable

That's not how it works. If you ask an LLM to write Harry Potter and it writes something that is 99% the same as Harry Potter, it isn't magically free of copyright. That would obviously be insane.

The legal system is still figuring out exactly what the rules are here but it seems likely that it's going to be on the LLM user to know if the output is protected by copyright. I imagine AI vendors will develop secondary search thingies to warn you (if they haven't already), and there will probably be some "reasonable belief" defence in the eventual laws.

Either way it definitely isn't as simple as "LLM wrote it so we can ignore copyright".

replies(2): >>46181907 #>>46181977 #
14. ccortes ◴[] No.46181474{5}[source]
Oh wow, is that what you got from this?

It seems more like an inexperienced guy asked the LLM to implement something, and the LLM just output what an experienced guy did before, and it even gave him the credit.

replies(2): >>46181940 #>>46182770 #
15. rcxdude ◴[] No.46181907{3}[source]
I think the poster is looking at it from the other way: purely machine-generated content is not generally copyrightable, even if it can violate copyright. So it's more a question of: can a copyleft license like the GPL actually protect something that's original but primarily LLM-generated? Should it do so?

(From what I understand, the amount of human input required to make the result copyrightable can be pretty small, perhaps even as little as selecting from multiple options. But this is likely to be quite a gray area.)

16. rcxdude ◴[] No.46181940{6}[source]
Copyright notices and signatures in generative AI output are generally a result of the expectation created by the training data that such things exist, and are generally unrelated to how much the output corresponds to any particular piece of training data, and especially to who exactly produced that work.

(It is, of course, exceptionally lazy to leave such things in if you are using the LLM to assist you with a task, and can cause problems of false attribution. Especially in this case where it seems to have just picked a name of one of the maintainers of the project)

17. rafterydj ◴[] No.46181977{3}[source]
> it seems likely that it's going to be on the LLM user to know if the output is protected by copyright.

To me, this is what seems more insane! If you've never read Harry Potter, and you ask an LLM to write you a story about a wizard boy, and it outputs 80% Harry Potter - how would you even know?

> there will probably be some "reasonable belief" defence in the eventual laws.

This is probably true, but it's irksome to shift all blame away from the LLM producers, who are using copyrighted data to peddle copyrighted output. This simply turns the business into copyright infringement as a service - what incentive would they have to actually build those "secondary search thingies", and build them well?

> it definitely isn't as simple as "LLM wrote it so we can ignore copyright".

Agreed. The copyright system is getting stress tested. It will be interesting to see how our legal systems can adapt to this.

replies(1): >>46184337 #
18. kfajdsl ◴[] No.46182364{5}[source]
It’s one thing for you (yes, you, the user using the tool) to generate code you don’t understand for a side project or one off tool. It’s another thing to expect your code to be upstreamed into a large project and let others take on the maintenance burden, not to mention review code you haven’t even reviewed yourself!

Note: I, myself, am guilty of forking projects, adding some simple feature I need with an LLM quickly because I don’t want to take the time to understand the codebase, and using it personally. I don’t attempt to upstream changes like this and waste maintainers’ time until I actually take the time myself to understand the project, the issue, and the solution.

replies(1): >>46182814 #
19. menaerus ◴[] No.46182770{6}[source]
Did you take a look at the code? Given your response, I figure you did not, because if you had, you would see that the code was _not_ cloned but genuinely composed by the LLM.
20. menaerus ◴[] No.46182814{6}[source]
What are you talking about? It was a ridiculously useful debugging feature that nobody in their right mind would block because of "added maintenance". The MR was rejected purely for political/social reasons.
21. vegardx ◴[] No.46183504{5}[source]
(I'm not a lawyer)

You should be careful about speaking in absolute terms when talking about copyright.

There is nothing that prevents multiple people from owning copyright to identical works. This is also why copyright infringement is such a mess to litigate.

I'd also be interested in knowing why you think code generated by LLMs can't be copyrighted. That's quite a statement.

There's also the problem of copyright law varying across jurisdictions.

replies(1): >>46188198 #
22. IshKebab ◴[] No.46184337{4}[source]
> how would you even know?

The obvious way is by searching the training data for close matches. LLMs need to do that and warn you about it. Of course the problem is they all trained on pirated books and then deleted them...

But either way it's kind of a "your problem" thing. You can't really just say "I invented this great tool and it sometimes lets me violate copyright without realising. You don't mind, do you, copyright holders?"
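
For the curious, here's a minimal sketch of what such a close-match check could look like: shingle the output into n-grams and compare against a reference corpus. Everything here is illustrative, not any vendor's actual system - the corpus layout, token-level shingling, and the 0.5 threshold are all made up, and a real system would need scalable indexing (e.g. MinHash) rather than a linear scan:

    # Hypothetical sketch: flag generated text whose n-gram overlap
    # with a reference corpus crosses an arbitrary threshold.
    from pathlib import Path

    def shingles(text, n=8):
        """Overlapping n-token shingles of the text."""
        tokens = text.split()
        return {tuple(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 0))}

    def jaccard(a, b):
        """Jaccard similarity: 0.0 = disjoint shingle sets, 1.0 = identical."""
        return len(a & b) / len(a | b) if a and b else 0.0

    def flag_close_matches(output, corpus_dir, threshold=0.5):
        """Yield (file, similarity) for corpus files suspiciously close to the output."""
        out = shingles(output)
        for path in Path(corpus_dir).rglob("*.txt"):
            sim = jaccard(out, shingles(path.read_text(errors="ignore")))
            if sim >= threshold:
                yield path, sim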

23. adastra22 ◴[] No.46188198{6}[source]
It is the official stance of the US copyright office.

It was upheld by Thaler v. Perlmutter.

Bartz v. Anthropic and Kadrey v. Meta confirmed this with similar rulings.

24. cdaringe ◴[] No.46188231[source]
Perhaps, given the target audience and the state of the world, "it goes without saying" applies, or it's wrapped up implicitly already through the mentioned checks and balances (human firmly in the loop, etc.).