451 points croes | 19 comments
mattxxx ◴[] No.43962976[source]
Well, firing someone for this is super weird. It seems like an attempt to censor an interpretation of the law that:

1. Criticizes a highly useful technology
2. Matches a potentially-outdated, strict interpretation of copyright law

My opinion: I think using copyrighted data to train models for sure seems classically illegal. Despite that, humans can read a book, get inspiration, and write a new book and not be litigated against. When I look at the litany of derivative fantasy novels, it's obvious they're not all fully independent works.

Since AI is and will continue to be so useful and transformative, I think we just need to acknowledge that our laws did not accommodate this use case, and then change them.

replies(19): >>43963017 #>>43963125 #>>43963168 #>>43963214 #>>43963243 #>>43963311 #>>43963423 #>>43963517 #>>43963612 #>>43963721 #>>43963943 #>>43964079 #>>43964280 #>>43964365 #>>43964448 #>>43964562 #>>43965792 #>>43965920 #>>43976732 #
palmotea[dead post] ◴[] No.43963168[source]
[flagged]
1. jobigoud ◴[] No.43963464[source]
We are talking about the rights of the humans training the models and the humans using the models to create new things.

Copyright only comes into play on publication. It's only concerned with publication of the models and publication of works. The machine itself doesn't have agency to publish anything at this point.

replies(5): >>43963564 #>>43964130 #>>43964131 #>>43964631 #>>43965405 #
2. MyOutfitIsVague ◴[] No.43963564[source]
It's not only publication; otherwise people couldn't be successfully sued for downloading and consuming copyrighted content, and only the uploaders would get into trouble.
replies(1): >>43963945 #
3. HappMacDonald ◴[] No.43963945[source]
Do you have any links to cases where people were sued for downloading and consuming content without also uploading (e.g., BitTorrent), hosting, sharing the copyrighted works, etc.?
replies(2): >>43965951 #>>43966372 #
4. bgwalter ◴[] No.43964130[source]
Does the distinction matter? If humans build a machine that uses so much oxygen that the oxygen levels on earth drop by half, can they say:

"Humans are allowed to breathe, so our machine is too, because it is operated by humans!"

replies(1): >>43964279 #
5. spacemadness ◴[] No.43964131[source]
Sounds like we’re talking about the right of AI company founders and people on HN to acquire wealth from creative works due to some weak argument concerning similarity to the human mind and creation of art. Since we’ve now veered into armchair philosophy territory, I think one could argue that the way human memory works and creates, both physically and mentally, from inspiration is vastly different from how AI works. So saying they’re the same and that’s it is both lazy and takes interesting questions off the table to squash debate.
6. TeMPOraL ◴[] No.43964279[source]
Yes, and then the response would be, "what have you done, we now need to pass laws about oxygen consumption where before we didn't".

Point being, laws aren't some God-ordained rules, beautiful in their fractal recursive abstraction, perfectly covering everything that will ever happen in the universe. No, laws are more or less crude hacks that deal with here and now. Intellectual property rights were questionable from the start and only got worse; they've been barely keeping up with digital media in the past couple decades, and they're entirely ill-equipped to deal with generative AI. This is a new situation, and laws need to be updated to cover it.

replies(1): >>43964747 #
7. palmotea ◴[] No.43964631[source]
>>> Despite that, humans can read a book, get inspiration, and write a new book and not be litigated against.

>> The fatal flaw in your reasoning: machines aren't humans. You can't reason that a machine has rights from the fact a human has them. Otherwise it's murder to recycle a car.

> We are talking about the rights of the humans training the models and the humans using the models to create new things.

Then that's even easier, because that prevents appeals to things humans do, like learning, from muddying the waters.

If "training the models" entails loading up copyrighted works into your system (e.g. encoding them during training), you've just copied them into a retrieval system and violated copyright based on established precedent. And people have prompted verbatim copyrighted text out of well-known LLMs, which makes it even clearer.

And then to defend LLM training you're left with BS akin to claiming an ASCII-encoded copy of a book is not a copyright violation, because the book is paper and ASCII is numbers.

8. palmotea ◴[] No.43964747{3}[source]
> Yes, and then the response would be, "what have you done, we now need to pass laws about oxygen consumption where before we didn't".

Except in this case, we already have the equivalent of "laws about oxygen consumption": copyright.

> Intellectual property rights were questionable from the start and only got worse; they've been barely keeping up with digital media in the past couple decades, and they're entirely ill-equipped to deal with generative AI.

The laws are not "entirely ill-equipped to deal with generative AI," unless your interests lie in breaking them. All the hand-waving about the laws being "questionable" and "entirely ill-equipped" is just noise.

Under current law OpenAI, Google, etc. have no right to cheap training data, because someone made that data and may have a reasonable interest in getting paid for their efforts. Like all businesses, those companies would ideally like the law to be unfairly biased towards them: to protect them when they charge as much as they can, but not protect anyone else so they can pay as little as possible.

replies(3): >>43965500 #>>43965515 #>>43967544 #
9. moralestapia ◴[] No.43965405[source]
>Copyright only comes into play on publication.

Nope.

You have a right to not publish any work that you own. This is protected by copyright law.

Copyright covers you from the moment you create some sort of original work (in a tangible medium).

10. TeMPOraL ◴[] No.43965500{4}[source]
> Under current law OpenAI, Google, etc. have no right to cheap training data, because someone made that data and may have a reasonable interest in getting paid for their efforts.

That's the thing though: intuitively, they do - training the model != generating from the model, and it's the output of a generation that violates copyright (and the user-supplied prompt is a crucial ingredient in getting the potentially copyrighted material to appear). And legally, that's AFAIK still an open question.

> Like all businesses, those companies would ideally like the law to be unfairly biased towards them: to protect them when they charge as much as they can, but not protect anyone else so they can pay as little as possible.

That's 100% true. I know that, I'm not denying that. But in this particular case, I find my own views align with their case. I'm not begrudging them for raking in heaps of money offering generative AI services, because they're legitimately offering value that's at least commensurate (IMHO it's much greater) to what they charge, and that value comes entirely from the work they're uniquely able to do, and any individual work that went into training data contributes approximately zero to it.

(GenAI doesn't rely on any individual work in training data; it relies on the breadth and amount being a notable fraction of humanity's total intellectual output. It so happens that almost all knowledge and culture is subject to copyright, so you couldn't really get to this without stepping on some legal landmines.)

(Also, much like AI companies would like the law to favor them, their opponents in this case would like the law to dictate they should be compensated for their works being used in training data, but compensated way beyond any value their works bring in, which in reality is, again, approximately zero.)

replies(1): >>43966813 #
11. ben_w ◴[] No.43965515{4}[source]
> Except in this case, we already have the equivalent of "laws about oxygen consumption": copyright.

Copyright laws were themselves created by the printing press making it easy to duplicate works, whereas previously if you half-remembered something that was just "inspiration".

But that only gave the impression of helping creative people: today, any new creative person has to compete with the entire reproducible canon of all of humanity before them. Can you write fantasy so well that new readers pick you up over Pratchett or Tolkien?

Now we have AI which are "inspired" (perhaps) by what they read, and half-remember it, in a way that seems similar to pre-printing-press humans sharing stories even if the mechanism is different.

How this is seen according to current law likely varies by jurisdiction; but the law as it is today matters less than what the law will be when the new ones are drafted to account for GenAI.

What that will look like, I am unsure. Could be that for training purposes, copyright becomes eternal… but it's also possible that copyright may cease to exist entirely — laws to protect the entire creative industry may seem good, but if AI displaces all humans from economic activity, will it continue to matter?

replies(2): >>43965733 #>>43966984 #
12. Jensson ◴[] No.43965733{5}[source]
> But that only gave the impression of helping creative people: today, any new creative person has to compete with the entire reproducible canon of all of humanity before them. Can you write fantasy so well that new readers pick you up over Pratchett or Tolkien?

That is even worse without copyright, as then every previous work would be free and you would have to compete with better works that are also free for people.

replies(1): >>43967536 #
13. MyOutfitIsVague ◴[] No.43965951{3}[source]
There were the famous Napster cases, the kids and old ladies who got sued by the RIAA for using LimeWire to download some music.

There is also the fact that copyright holders will pressure your ISP into sending threatening letters and shutting off your Internet for piracy, even without you seeding. I haven't gotten the impression that you are in the clear for pirating as long as you don't distribute.

14. lavezzi ◴[] No.43966372{3}[source]
There are tonnes; this is a baffling question.
15. palmotea ◴[] No.43966813{5}[source]
> That's the thing though: intuitively, they do - training the model != generating from the model, and it's the output of a generation that violates copyright (and the user-supplied prompt is a crucial ingredient in getting the potentially copyrighted material to appear). And legally, that's AFAIK still an open question.

It's still copyright infringement if I download a pirated movie and never watch it (writing the bytes to the disk == "training" the disk's "model", reading the bytes back == "generating" from the disk's "model").

> That's 100% true. I know that, I'm not denying that. But in this particular case, I find my own views align with their case.

IMHO, unless you're massively wealthy and/or running a bigcorp, people like you benefit a lot more from copyright than are harmed by it. In a world without copyright protection, some bigcorp will be able to use its size to extract the value from the works that are out there (i.e. Amazon and Netflix will stop paying royalties instantly, but they'll still have customers because they have the scale to distribute). Copyright just means the little guy who's actually creating has some claim to get some of the value directed back to them.

> and any individual work that went into training data contributes approximately zero to it.

Then cut all those works out of the training set. I don't think it's an excuse that the infringement has to happen on a massive scale to be of value to the generative AI company.

16. palmotea ◴[] No.43966984{5}[source]
> Copyright laws were themselves created by the printing press making it easy to duplicate works, whereas previously if you half-remembered something that was just "inspiration".

Eh. I don't know the history, but my understanding was they were created because the printing press allowed others to deny the original creators the profits from their work, and direct those profits to others who had no hand in it.

After all, in market terms: a publisher that pays its authors can't compete with another publisher that publishes the same works without paying any authors. A world without copyright is one where some publisher still makes money, but it's a race to the bottom for authors.

> But that only gave the impression of helping creative people: today, any new creative person has to compete with the entire reproducible canon of all of humanity before them. Can you write fantasy so well that new readers pick you up over Pratchett or Tolkien?

Here's a hole in your thinking: if you like fantasy, would you be content to just re-read Tolkien over and over, forever? Don't you think that'd get boring no matter how good he was?

And empirically, "new creative [people]" manage to compete with Pratchett or Tolkien all the time, as new fantasy works are still being published and read. Do you remember that "Game of Thrones" was a mass cultural phenomenon not too long ago?

replies(1): >>43967772 #
17. Suppafly ◴[] No.43967536{6}[source]
>that are also free for people

sounds like a good deal if you're people.

18. Suppafly ◴[] No.43967544{4}[source]
>Under current law OpenAI, Google, etc. have no right to cheap training data, because someone made that data and may have a reasonable interest in getting paid for their efforts.

If it were that cut and dried we wouldn't have this conversation at all, so clearly your position isn't objectively true.

19. ben_w ◴[] No.43967772{6}[source]
> A world without copyright is one where some publisher still makes money, but it's a race to the bottom for authors.

This is the case anyway; there are many writers competing for the opportunity to be published, so the publishers have a massive advantage, and it is the technology of printing (and cheap paper) that makes this a one-sided relationship — if every story teller had to be heard in person, with no recordings or reproductions possible, then story tellers would be found in every community, and they would be valued by their community.

> Here's a hole in your thinking: if you like fantasy, would you be content to just re-read Tolkien over and over, forever? Don't you think that'd get boring no matter how good he was?

The examples aren't meant to be exclusive, and Pratchett has a lot of books.

There are far more books on the market right now than a human can read in a lifetime. At some point (we may have already passed it) there will be far more good books on the market than a human can read in a lifetime, at which point it's not quality, it's fashion.

> And empirically, "new creative [people]" manage to compete with Pratchett or Tolkien all the time, as new fantasy works are still being published and read.

At some point, there will be more books at least as good as Pratchett, Tolkien, Le Guin, McCaffrey, Martin, Heinlein, Niven etc. in each genre, than anyone can read.

> Do you remember that "Game of Thrones" was a mass cultural phenomenon not too long ago?

Published: August 1, 1996 — concurrently with Pratchett.

A better example would have been The Expanse. It's worth noting that SciFi has a natural advantage over (high) fantasy or romance, as the nature of speculative science fiction means it keeps considering futures that are rendered as obsolete as the worn-down buttons on the calculator that Hari Seldon was rumoured to keep under his pillow.