321 points by jhunter1016 | 1 comment
mikeryan ◴[] No.41878605[source]
Technical AI and LLMs are not something I'm well versed in, so as I sit on the sidelines and watch the current proliferation of AI startups, I'm starting to wonder where the moats are, outside of access to raw computing power. OpenAI seemed to have a massive lead in this space, but that lead seems to be shrinking every day.
replies(10): >>41878784 #>>41878809 #>>41878843 #>>41880703 #>>41881606 #>>41882000 #>>41885618 #>>41886010 #>>41886133 #>>41887349 #
weberer ◴[] No.41878784[source]
Obtaining high-quality training data is the biggest moat right now.
replies(2): >>41882699 #>>41883992 #
segasaturn ◴[] No.41882699[source]
Where are they going to get that data? Everything on the open web after 2023 is polluted with low-quality AI slop that poisons the data sets. My prediction: aggressive dragnet surveillance of users. As in, Google recording your phone calls on Android, Windows sending screen recordings from Recall to OpenAI, Meta training off WhatsApp messages... It sounds dystopian, but the Line Must Go Up.
replies(3): >>41883095 #>>41883850 #>>41885531 #
crazygringo ◴[] No.41883850[source]
> Everything on the open web after 2023 is polluted with low-quality AI slop that poisons the data sets.

Not even close to everything.

E.g. training on the NY Times and Wikipedia picks up essentially no meaningful AI-generated text. Training on books from reputable publishers is similarly clean. Any LLM usage there was to polish prose or assist with research or whatever, but it shouldn't affect the factual quality in any significant way.

The web hasn't been polluted with AI any more than e-mail has been polluted with spam. Which is to say it's there, but it's also entirely viable to separate. Nobody's worried that the group email chain with friends is being overrun with spam or with AI.
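The spam comparison points at a concrete mechanism: the same bag-of-words Bayesian filtering that keeps spam manageable could in principle be used to score text as human vs. AI. A minimal sketch in Python; the tiny corpus is invented purely for illustration, and a real filter would need a large labeled dataset:

    # Minimal naive-Bayes sketch: score text against per-label word
    # frequency profiles, the same idea classic spam filters use.
    import math
    from collections import Counter

    def train(docs_by_label):
        # Per-label word counts; add-one smoothing is applied at scoring time.
        counts = {label: Counter() for label in docs_by_label}
        for label, docs in docs_by_label.items():
            for doc in docs:
                counts[label].update(doc.lower().split())
        return counts

    def score(text, counts):
        # Log-probability of each label under a unigram model, uniform prior.
        vocab = set().union(*counts.values())
        scores = {}
        for label, ctr in counts.items():
            total = sum(ctr.values()) + len(vocab)
            scores[label] = sum(math.log((ctr[w] + 1) / total)
                                for w in text.lower().split())
        return scores

    corpus = {  # invented toy examples, not real training data
        "human": ["grabbed coffee then the train was late again",
                  "honestly the patch broke my build twice"],
        "ai": ["in conclusion it is important to note that",
               "as a large language model i cannot"],
    }
    print(score("it is important to note that the build broke", train(corpus)))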

replies(3): >>41884785 #>>41884911 #>>41887686 #
mplewis ◴[] No.41884911[source]
There is no heuristic that reliably separates human-written text from GPT text. It is impossible to avoid model collapse in such an environment.
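As a toy illustration of the model-collapse dynamic (a sketch, not anyone's actual training setup): each generation below fits a Gaussian to samples drawn from the previous generation's model, so estimation noise compounds instead of averaging out, and the fitted distribution drifts and narrows over time:

    # Toy sketch of model collapse: every "generation" is trained only
    # on output sampled from the previous generation's fitted model.
    import random, statistics

    random.seed(1)
    data = [random.gauss(0.0, 1.0) for _ in range(20)]  # original "human" data

    for gen in range(1, 31):
        mu = statistics.mean(data)
        sigma = statistics.stdev(data)
        # Next generation sees only the previous model's samples.
        data = [random.gauss(mu, sigma) for _ in range(20)]
        if gen % 5 == 0:
            print(f"gen {gen:2d}: mu={mu:+.3f} sigma={sigma:.3f}")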
replies(3): >>41887460 #>>41887598 #>>41887839 #
Der_Einzige ◴[] No.41887598[source]
An anti-slop sampler and bigram/trigram stylometric analysis can absolutely give evidence that someone used a particular LLM with particular settings.

But in general it’s hard to separate human from AI text.
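A rough sketch of what the bigram/trigram stylometry mentioned above can look like: build character n-gram frequency profiles and compare a sample to each reference profile by cosine similarity. The reference texts here are placeholders; real stylometric evidence needs large samples per author or model.

    # Rough stylometry sketch: character trigram profiles compared by
    # cosine similarity. Higher score = closer stylistic match.
    import math
    from collections import Counter

    def ngram_profile(text, n=3):
        text = text.lower()
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    def cosine(p, q):
        dot = sum(p[g] * q[g] for g in set(p) & set(q))
        norm = (math.sqrt(sum(v * v for v in p.values()))
                * math.sqrt(sum(v * v for v in q.values())))
        return dot / norm if norm else 0.0

    references = {  # placeholder reference texts, purely illustrative
        "model_A": "it is important to note that in conclusion overall",
        "human": "ugh my train was late and the coffee machine is broken",
    }
    sample = ngram_profile("in conclusion it is important to note the result")
    for name, ref in references.items():
        print(name, round(cosine(sample, ngram_profile(ref)), 3))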