
Using LLMs at Oxide

(rfd.shared.oxide.computer)
694 points steveklabnik | 4 comments
csb6 ◴[] No.46179547[source]
Strange to see no mention of potential copyright violations found in LLM-generated code (e.g. LLMs reproducing code from GitHub verbatim without respecting the license). I would think that would be a pretty important consideration for any software development company, especially one that produces so much free software.
replies(4): >>46179678 #>>46179797 #>>46179941 #>>46188231 #
don-bright ◴[] No.46179797[source]
Also, since LLM-generated content is not copyrightable, what happens to code you publish under a copyleft license? The entire copyleft system is based on the idea of a human holding copyright to copyleft code. Is a big chunk of it, the LLM part, basically public domain? How do you ensure there's enough human content to make it copyrightable and hence copyleftable?
replies(1): >>46181201 #
1. IshKebab ◴[] No.46181201[source]
> since LLM generated content is not copyrightable

That's not how it works. If you ask an LLM to write Harry Potter and it writes something that is 99% the same as Harry Potter, it isn't magically free of copyright. That would obviously be insane.

The legal system is still figuring out exactly what the rules are here but it seems likely that it's going to be on the LLM user to know if the output is protected by copyright. I imagine AI vendors will develop secondary search thingies to warn you (if they haven't already), and there will probably be some "reasonable belief" defence in the eventual laws.

Either way it definitely isn't as simple as "LLM wrote it so we can ignore copyright".

replies(2): >>46181907 #>>46181977 #
2. rcxdude ◴[] No.46181907[source]
I think the poster is looking at it from the other way: purely machine-generated content is not generally copyrightable, even if it can violate copyright. So it's more a question of whether a copyleft license like the GPL can actually protect something that's original but primarily LLM-generated. Should it do so?

(From what I understand, the amount of human input that's required to make the result copyrightable can be pretty small, perhaps even as little as selecting from multiple options. But this is likely to be quite a gray area.)

3. rafterydj ◴[] No.46181977[source]
> it seems likely that it's going to be on the LLM user to know if the output is protected by copyright.

To me, this is what seems more insane! If you've never read Harry Potter, and you ask an LLM to write you a story about a wizard boy, and it outputs 80% Harry Potter - how would you even know?

> there will probably be some "reasonable belief" defence in the eventual laws.

This is probably true, but it's irksome to shift all blame away from the LLM producers, who use copyrighted data to peddle copyrighted output. This simply turns the business into copyright infringement as a service - what incentive would they have to actually build those "secondary search thingies" and build them well?

> it definitely isn't as simple as "LLM wrote it so we can ignore copyright".

Agreed. The copyright system is getting stress tested. It will be interesting to see how our legal systems can adapt to this.

replies(1): >>46184337 #
4. IshKebab ◴[] No.46184337[source]
> how would you even know?

The obvious way is by searching the training data for close matches. LLMs need to do that and warn you about it. Of course the problem is they all trained on pirated books and then deleted them...
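A minimal sketch of what that kind of "close match" search could look like, purely illustrative (the function names are made up, and no vendor's actual tooling is implied): shingle the generated text and a candidate source into word n-grams and flag outputs whose shingles mostly come from one document.

```python
# Hypothetical near-duplicate check: word n-gram shingles + overlap ratio.
# All names here are illustrative, not any real vendor's API.

def shingles(text: str, n: int = 8) -> set:
    """Return the set of n-word shingles in text (case-folded)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(output: str, source: str, n: int = 8) -> float:
    """Fraction of the output's shingles that also appear in the source.
    A high ratio suggests near-verbatim reproduction."""
    out = shingles(output, n)
    if not out:
        return 0.0
    return len(out & shingles(source, n)) / len(out)

# Usage: warn when generated text closely matches a training document.
generated = "the quick brown fox jumps over the lazy dog near the river bank"
corpus_doc = "the quick brown fox jumps over the lazy dog near the river bank today"
if overlap_ratio(generated, corpus_doc) > 0.8:
    print("warning: output closely matches a training document")
```

A real system would need an index (e.g. MinHash or an inverted shingle index) to make this tractable over a training corpus, but the comparison itself is this simple.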

But either way it's kind of a "your problem" thing. You can't really just say "I invented this great tool and it sometimes lets me violate copyright without realising. You don't mind, do you, copyright holders?"