Not even close to everything.
E.g., training on the NY Times and Wikipedia involves essentially no meaningful AI-generated content. Training on books from reputable publishers similarly involves essentially none. Any LLM usage there was to polish prose or assist with research, and shouldn't affect the factual quality in any significant way.
The web hasn't been polluted with AI any more than email has been polluted with spam. Which is to say it's there, but it's also entirely viable to filter out. Nobody's worried that the group email chain with friends is being overrun with spam, or with AI.
It's not about a heuristic on text of unknown provenance -- it's about publishers that exert a certain level of editorial control and quality verification. Or social reputation mechanisms that achieve the same.
That's what is preventing your "model collapse": reputations of provenance, not pure-text heuristics.