
747 points porridgeraisin | 13 comments
1. JCM9 ◴[] No.45063064[source]
Not a surprise. All the major players have reached the limits of training on existing data—they’re already training on essentially the whole internet plus a bunch of content they allegedly stole (hence various lawsuits). There haven’t been any major breakthroughs in model architecture from the major players recently and thus they’re now in a battle for more data to train on. They need data, and they want YOUR data, now, and are gonna do increasingly shady things to get it.
replies(5): >>45063645 #>>45063676 #>>45063696 #>>45064759 #>>45064804 #
2. klabb3 ◴[] No.45063645[source]
> They need data, and they want YOUR data, now, and are gonna do increasingly shady things to get it.

But unlike the hundreds of data brokers that also want your data, these companies already have an operational funnel of data that you voluntarily give them every day. All they need are dark-pattern ToS changes and to manage the minor PR issue. People will forget about this in a week.

replies(1): >>45064463 #
3. cube00 ◴[] No.45063676[source]
It's nice to see the newer models are suffering after being exposed to training on their own slop.

If they had done this in a more measured way, such as striking legal deals with publishers, they might have been able to keep human content separate from AI content.

However they couldn't wait to just take it all to be first and now the well is poisoned for everyone.

replies(1): >>45064214 #
4. xyst ◴[] No.45063696[source]
Further proof why guardrails/regulation is needed.
5. theshackleford ◴[] No.45064214[source]
> It's nice to see the newer models are suffering after being exposed to training on their own slop.

I've seen zero evidence that anything of the sort is occurring, or that, if it is, it's due to what you claim. I'd be highly interested in research suggesting either or both are occurring, however.

replies(1): >>45067192 #
6. threetonesun ◴[] No.45064463[source]
Seems hard to believe legal teams at corporations are going to forget this in a week. I've always assumed the market play for these companies was spinning off an "Amazon Basics" version of other companies' software; this seems like another step toward that.
7. freejazz ◴[] No.45064759[source]
It's not merely alleged that they stole the content: they told the courts they pirated the materials.
replies(1): >>45066204 #
8. imiric ◴[] No.45064804[source]
Yeah, this is hardly surprising.

To AI companies, data is even more of a gold mine than to adtech companies. It is existentially important.

The truly evil behavior will emerge at the intersection of these two industries. I'm sure Google and Facebook are already using data from one to power the other, even if it's currently behind closed doors. I can hardly wait for the use cases these geniuses will think of once this is publicly acceptable and in widespread use by all companies.

9. whamlastxmas ◴[] No.45066204[source]
Infringement, not theft :)
replies(1): >>45069563 #
10. cube00 ◴[] No.45067192{3}[source]
"AI models collapse when trained on recursively generated data"

https://news.ycombinator.com/item?id=41058194
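The effect that paper describes can be illustrated with a toy self-training loop (my own sketch, not the paper's actual setup): repeatedly fit a Gaussian to samples drawn from the previous generation's fit. Each round loses a bit of the tails, and the variance collapses over generations.

```python
import random
import statistics

# Toy model-collapse demo (illustrative only): each "generation" is
# trained (fit) on samples generated by the previous generation's model.
random.seed(0)
mu, sigma = 0.0, 1.0  # generation 0: the "real data" distribution

for gen in range(200):
    # Small sample per generation exaggerates the effect.
    samples = [random.gauss(mu, sigma) for _ in range(10)]
    mu = statistics.fmean(samples)      # refit the model on its own output
    sigma = statistics.stdev(samples)
    if gen % 50 == 0:
        print(f"gen {gen:3d}: mu={mu:+.4f} sigma={sigma:.4g}")

print(f"final sigma: {sigma:.3g}")  # far below the original 1.0
```

The shrinkage is the point: finite samples under-represent the tails, so each refit narrows the distribution, and the error compounds across generations.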

replies(1): >>45070408 #
11. freejazz ◴[] No.45069563{3}[source]
Reread my post.
12. theshackleford ◴[] No.45070408{4}[source]
That's not what I asked for as it's not relevant.

The claim was made that the models are "suffering", at this exact moment, because they have been recursively feeding themselves, RIGHT now.

I want evidence the current models are "suffering" right now, and I want further evidence that suggests this suffering is due to recursive data ingestion.

A year-old article with no relevance to today, talking about hypotheticals of indiscriminately gorging on recursive data, is not evidence of either of the things I asked for.

replies(1): >>45082311 #
13. cube00 ◴[] No.45082311{5}[source]
Did you mean the current models that are still stuck in 2023?

> what's the latest year of data you're trained on

> ChatGPT said: My training goes up to April 2023.

There's a reason they're not willing to update the training corpus even with GPT-5.

> Some year old article with no relevance to today

The current models are based on even older training data, so I guess you should disregard them too if you're choosing to judge things purely by age.