
321 points by jhunter1016 | 9 comments
mikeryan ◴[] No.41878605[source]
Technical AI and LLMs are not something I'm well versed in, so as I sit on the sidelines and watch the current proliferation of AI startups, I'm starting to wonder where the moats are, outside of access to raw computing power. OpenAI seemed to have a massive lead in this space, but that lead seems to be shrinking every day.
replies(10): >>41878784 #>>41878809 #>>41878843 #>>41880703 #>>41881606 #>>41882000 #>>41885618 #>>41886010 #>>41886133 #>>41887349 #
weberer ◴[] No.41878784[source]
Obtaining high quality training data is the biggest moat right now.
replies(2): >>41882699 #>>41883992 #
segasaturn ◴[] No.41882699[source]
Where are they going to get that data? Everything on the open web after 2023 is polluted with low-quality AI slop that poisons the data sets. My prediction: aggressive dragnet surveillance of users. As in, Google recording your phone calls on Android, Windows sending screen recordings from Recall to OpenAI, Meta training off WhatsApp messages... It sounds dystopian, but the Line Must Go Up.
replies(3): >>41883095 #>>41883850 #>>41885531 #
crazygringo ◴[] No.41883850[source]
> Everything on the open web after 2023 is polluted with low-quality AI slop that poisons the data sets.

Not even close to everything.

E.g. training on the NY Times and Wikipedia involves zero meaningful AI-generated content. Training on books from reputable publishers similarly involves zero meaningful AI-generated content. Any LLM usage was to polish prose or assist with research or whatever, but it shouldn't affect the factual quality in any significant way.

The web hasn't been polluted with AI any more than e-mail has been polluted with spam. Which is to say it's there, but it's also entirely viable to separate. Nobody's worried that the group email chain with friends is being overrun with spam or with AI.

replies(3): >>41884785 #>>41884911 #>>41887686 #
1. intended ◴[] No.41887686[source]
Training on the NY Times and Wikipedia…

The Wikipedia part, at least, is incorrect. Currently, Wikipedia mods/admins are dealing with AI-generated articles being uploaded.

As for the NYT, I am assuming that lots of those stories are already available in some blog or other.

The e-mail and web forums are 100% polluted with spam, which takes constant effort to remove. GenAI-based content is far harder to identify and remove.

This example assumes that the effort required to keep the web functional can deal with AI-created content. Speaking from experience, our filters (human and otherwise) cannot. They are failing to do so even now.

PS: Even in your example of closed email chains, the information in them depends on the sources people read. Like plastic pollution in the food chain, this is inescapable.

replies(1): >>41887854 #
2. crazygringo ◴[] No.41887854[source]
> Currently Wikipedia mods/admins are dealing with AI generated articles being uploaded.

And they've always dealt with spam and low-quality submissions before. The system is working.

> As for the NYT, I am assuming that lots of those stories are already available in some blog or other.

I don't know what relevance that has to what we're talking about. The point is, train on the NYT. Blogs don't change what's on the NYT.

> The e-mail and web forums are 100% polluted with spam, which takes constant effort to remove.

They've always been polluted with low-quality content. So yes, either don't train on them, or only train on highly upvoted solutions, etc.

AI pollution isn't fundamentally any different from previous low-quality content and spam. It's not terribly difficult to determine which parts of the internet are known to be high-quality and train only on those. LLMs can't spam the NY Times.
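
Concretely, "only train on highly upvoted solutions" could be as simple as this sketch (the record fields, score threshold, and cutoff date are made up for illustration, not anyone's actual pipeline):

    from datetime import datetime

    CUTOFF = datetime(2023, 1, 1)
    MIN_SCORE = 50

    # Hypothetical forum records: text, upvote score, creation time.
    posts = [
        {"text": "detailed answer ...", "score": 240, "created": datetime(2021, 5, 2)},
        {"text": "low-effort reply ...", "score": 3, "created": datetime(2024, 8, 9)},
    ]

    def keep(post):
        # Keep only posts above the upvote threshold and written before the
        # period when generated content became widespread.
        return post["score"] >= MIN_SCORE and post["created"] < CUTOFF

    corpus = [p["text"] for p in posts if keep(p)]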

replies(1): >>41887925 #
3. intended ◴[] No.41887925[source]
> The system is working.

Given that this is an issue I've heard from Wikipedia admins themselves, I am impressed by your confidence.

>The point is, train on the NYT. Blogs don't change what's on the NYT.

The counterpoint is that NYT content is already in the training data because it's already replicated or copied into random blogs.

>So yes, either don't train on them, or only train on highly upvoted solutions, etc.

Highly upvoted messages on Reddit are very often bots copying older top comments. Mods already have issues with AI comments.

----

TLDR: Pollution is already happening. Verification does not scale, while generation scales.

replies(1): >>41887995 #
4. crazygringo ◴[] No.41887995{3}[source]
> The counterpoint is that NYT content is already in the training data

That's not a counterpoint. My point is, train on things like the NYT, not random blogs. You can also whitelist the blogs you know are written by people, rather than randomly spidering the whole internet.

Also, no -- most of the NYT hasn't been copied into blogs. A small proportion of top articles, maybe.

> Highly upvoted messages on Reddit are very often bots copying older top comments.

What does that matter if the older top comment was written by a person? Also, Reddit is not somewhere you want to train in the first place if you're trying to generate a model where factual accuracy matters.

> Verification does not scale, while generation scales.

You don't need to verify everything -- you just need to verify enough stuff to train a model on. We're always going to have plenty of stuff that's sufficiently verified, whether from newspapers or Wikipedia or whitelisted blogs or books from verified publishers or whatever. It's not a problem.

You shouldn't be training on blogspam from random untrusted domains in the first place. So it doesn't matter if that junk is AI-generated or not.
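
A minimal sketch of that kind of whitelisting (the domain list and page records here are made up for the example):

    from urllib.parse import urlparse

    # Hypothetical whitelist of domains treated as human-written / trusted.
    TRUSTED_DOMAINS = {"nytimes.com", "en.wikipedia.org"}

    def from_trusted_domain(url):
        # Accept the domain itself or any of its subdomains.
        host = urlparse(url).netloc.lower()
        return any(host == d or host.endswith("." + d) for d in TRUSTED_DOMAINS)

    pages = [
        {"url": "https://en.wikipedia.org/wiki/Stockholm", "text": "..."},
        {"url": "https://random-blog.example/post-123", "text": "..."},
    ]
    kept = [p for p in pages if from_trusted_domain(p["url"])]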

replies(1): >>41888484 #
5. intended ◴[] No.41888484{4}[source]
> What does that matter if the older top comment was written by a person?

That is the entire issue: LLMs fail when they are trained on GenAI-based content.

> Also, Reddit is not somewhere you want to train in the first place if you're trying to generate a model where factual accuracy matters.

There is no model that can create factual accuracy. This would basically contravene the laws of physics. LLMs predict the next token.

>You shouldn't be training on blogspam from random untrusted domains in the first place. So it doesn't matter if that junk is AI-generated or not

Afaik, all the current models are trained on this corpus. That is how they work.

replies(1): >>41888993 #
6. crazygringo ◴[] No.41888993{5}[source]
> There is no model that can create factual accuracy.

Factual accuracy is not binary, it is a matter of degrees. Obviously training on content that is more factually correct will result in more factually correct next tokens. This is a pretty fundamental aspect of LLMs.

> Afaik, all the current models are trained on this corpus.

Then apologies for being so blunt, but you know wrong. There is a tremendous amount of work done by LLM companies to verify, sanitize, and structure their training corpora, using a wide array of techniques. They are absolutely not just throwing in blogspam and hoping for the best.
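
To sketch one published example of that work: the GPT-3 paper describes filtering Common Crawl with a classifier trained to score pages by their similarity to a trusted reference corpus. Very roughly, and with toy stand-in documents, that looks like:

    # Train a simple quality classifier: trusted reference docs vs. raw crawl,
    # then keep only the crawl pages it scores as "reference-like".
    # (Real pipelines also deduplicate, filter by language, etc.)
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import LogisticRegression

    trusted_docs = ["encyclopedia article text ...", "newspaper article text ..."]
    crawl_docs = ["spammy page text ...", "another crawled page ..."]

    vec = HashingVectorizer(n_features=2**18, alternate_sign=False)
    X = vec.transform(trusted_docs + crawl_docs)
    y = [1] * len(trusted_docs) + [0] * len(crawl_docs)
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    def quality_score(doc):
        # Probability that the document looks like the trusted reference set.
        return clf.predict_proba(vec.transform([doc]))[0, 1]

    kept = [d for d in crawl_docs if quality_score(d) > 0.5]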

replies(1): >>41889428 #
7. intended ◴[] No.41889428{6}[source]
Thank you for being blunt. Let me attempt to speak in the same earnest tone.

You are contradicting the papers and published statements of the people who make the models. Alternatively, you are looking at the dataset curation process through rose-tinted glasses.

> There is a tremendous amount of work done by LLM companies to verify, sanitize, and structure their training corpora, using a wide array of techniques.

Common Crawl is instrumental in building our models; 60% of GPT-3's training data was Common Crawl (https://arxiv.org/pdf/2005.14165, pg. 9).

CC, in turn, was never intended for LLM training. This misalignment in goals results in downstream issues like hate speech, NYT content, copyrighted content, and more getting used to train models.

https://foundation.mozilla.org/en/research/library/generativ... (This article establishes the issues with CC as a source of LLM training.)

https://facctconference.org/static/papers24/facct24-148.pdf (This paper details those issues.)

Firms such as the NYT are now stopping Common Crawl from archiving their pages. https://www.wired.com/story/the-fight-against-ai-comes-to-a-...

-----

TLDR: 'NYT' and other high-quality content has largely been ingested by models already. Reddit and other such sources play a large part in training current models.

While I appreciate your being blunt, blunt also means not being sharp and incisive. Some precision is required here to clarify your point.

Finally -

>Factual accuracy is not binary, it is a matter of degrees. Obviously training on content that is more factually correct will result in more factually correct next tokens

What? Come on, I think you wouldn't agree with your own statement after reading it once more: factual correctness is not a matter of degrees.

Furthermore, facts don't automatically create facts. Calculation, processing, testing, and verification create more facts. Just putting facts together creates content.

replies(1): >>41895879 #
8. crazygringo ◴[] No.41895879{7}[source]
Re: corpus content, I think we're talking past each other. I'm saying that current models aren't being blindly trained on untrusted blogspam, and that there's a lot of work done to verify, structure, transform, etc. Earlier models were trained on lower-quality content, as companies were trying to figure out how much scale mattered. Now they're paying huge amounts of money to improve the quality of what they ingest, to better shape the quality of the output. As for what they take from Reddit, they're not blindly ingesting every comment from every user. My overall main point stands: we have viable, working, scalable mechanisms to avoid the "pollution" you're worried about.

> What? Come on, I think you wouldn't agree with your own statement after reading it once more: factual correctness is not a matter of degrees.

Of course it is. An LLM can be correct 30% of the time, 80% of the time, 95% of the time, 99% of the time. If that's not a matter of degrees, I don't know what is. If you're looking for 100% perfection, I think you'll find that not even humans can do that. ;)

replies(1): >>41901123 #
9. intended ◴[] No.41901123{8}[source]
> I think we're talking past each other

Likely.

Re: > we have viable, working, scalable mechanisms to avoid the "pollution" you're worried about.

Do note: it's the scalable mechanisms that I am looking at. I don't think the state of the art has shifted much since the last paper by OpenAI.

Can you link me to some new information or sources that lend credence to your claim?

> An LLM can be correct 30% of the time, 80% of the time, 95%…

That would be the error rate, which can be a matter of degrees.

However, factual correctness largely cannot: the capital of Sweden today is Stockholm, with 0% variation in that answer.