321 points jhunter1016 | 23 comments
mikeryan ◴[] No.41878605[source]
While technical AI and LLMs are not something I’m well versed in. So as I sit on the sidelines and see the current proliferation of AI startups I’m starting to wonder where the moats are outside of access to raw computing power. Open AI seemed to have a massive lead in this space but that lead seems to be shrinking every day.
replies(10): >>41878784 #>>41878809 #>>41878843 #>>41880703 #>>41881606 #>>41882000 #>>41885618 #>>41886010 #>>41886133 #>>41887349 #
weberer ◴[] No.41878784[source]
Obtaining high quality training data is the biggest moat right now.
replies(2): >>41882699 #>>41883992 #
1. segasaturn ◴[] No.41882699[source]
Where are they going to get that data? Everything on the open web after 2023 is polluted with low-quality AI slop that poisons the data sets. My prediction: aggressive dragnet surveillance of users. As in, Google recording your phone calls on Android, Windows sending screen recordings from Recall to OpenAI, Meta training off WhatsApp messages... It sounds dystopian, but the Line Must Go Up.
replies(3): >>41883095 #>>41883850 #>>41885531 #
2. jazzyjackson ◴[] No.41883095[source]
I'm really curious if Microsoft will ever give in to the urge to train on private business data - since transitioning office to o365, they hold the world's and even governments word documents and emails. I'm pretty sure they've promised never to touch it but they can certainly read it so... Information wants to be free.
replies(3): >>41883349 #>>41886762 #>>41898416 #
3. jhickok ◴[] No.41883349[source]
Microsoft "trains" on business data already, but typically for things like fine-tuning security automation and recognizing malicious signals. It sure would be a big step to reading chats and email and feeding them into a model.
4. crazygringo ◴[] No.41883850[source]
> Everything on the open web after 2023 is polluted with low-quality AI slop that poisons the data sets.

Not even close to everything.

E.g. training on the NY Times and Wikipedia involves essentially zero meaningful AI-generated content, and training on books from reputable publishers similarly involves essentially none. Any LLM usage was to polish prose or assist with research or whatever, but shouldn't affect the factual quality in any significant way.

The web hasn't been polluted with AI any more than e-mail has been polluted with spam. Which is to say it's there, but it's also entirely viable to separate. Nobody's worried that the group email chain with friends is being overrun with spam or with AI.

replies(3): >>41884785 #>>41884911 #>>41887686 #
5. mplewis ◴[] No.41884911[source]
There is no heuristic that reliably separates human-written text from GPT text. It is impossible to avoid model collapse in such an environment.
replies(3): >>41887460 #>>41887598 #>>41887839 #
6. lfmunoz4 ◴[] No.41885531[source]
Would think most quality data is books, news articles, and scientific journals -- not the crap people are texting each other.

These companies will never admit it, but AI is built on the back of piracy archives -- the easiest and cheapest way to get massive amounts of quality data.

replies(2): >>41885758 #>>41887361 #
7. mcmcmc ◴[] No.41885758[source]
That entirely depends on what quality you’re going for. If the goal is to passably simulate human conversation, texts and DMs are probably more desirable.
8. ENGNR ◴[] No.41886762[source]
Slack tried it but the backlash got them, this time anyway.
replies(1): >>41890399 #
9. mswen ◴[] No.41887361[source]
A friend and I built a proof-of-concept of using a variation of Latent Semantic Analysis to automatically build up conceptual maps and loadings of individual words against the latent conceptual vectors back in 2000. In exploring what it would take to scale, I concluded, like you, that we should use professionally written and edited content like books, news articles, and scientific journals as the corpus against which to build up the core knowledge graph.

Twenty-four years later I still regret not being able to raise money to enable us to keep working on that nascent startup. In most ways it was still too early. Google was still burning through VC money at that point, and the midwestern investors we had some access to didn't get it. And honestly, they were probably correct. Compute power was still too expensive, and quality data sources like published text were mostly locked up and generally not available to harvest.
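
For anyone curious, here's a minimal sketch of the same idea with today's off-the-shelf tools -- the tiny corpus, the 2-dimensional latent space, and the scikit-learn pipeline are purely illustrative choices of mine, not what we actually built back then:

    # Minimal latent semantic analysis: TF-IDF term-document matrix + truncated SVD.
    # Corpus and latent dimensionality are toy-sized, for illustration only.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    corpus = [
        "The central bank raised interest rates to curb inflation.",
        "Inflation eased after the bank tightened monetary policy.",
        "The spacecraft entered orbit around the red planet.",
        "Engineers adjusted the probe's trajectory toward Mars.",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(corpus)      # documents x terms

    svd = TruncatedSVD(n_components=2, random_state=0)
    doc_vecs = svd.fit_transform(X)           # documents x latent concepts

    # Loadings of individual words against the latent concept vectors.
    term_loadings = svd.components_.T         # terms x latent concepts
    for term, loading in zip(vectorizer.get_feature_names_out(), term_loadings):
        print(f"{term:12s} {loading.round(2)}")

The rows of doc_vecs place each document in concept space, and the loadings give each word's weight against those concepts -- the same kind of structure we were trying to build up automatically.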

10. ValentinA23 ◴[] No.41887460{3}[source]
Unless you’re OpenAI and provided like 98% of the AI-generated text that was posted to the internet.
11. Der_Einzige ◴[] No.41887598{3}[source]
An anti-slop sampler and bigram/trigram stylometric analysis absolutely can give evidence that someone used a particular LLM with particular settings.

But in general it’s hard to separate human from AI text.
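
For a rough illustration of the bigram/trigram side of that (a toy sketch of mine, not any particular detection tool): build character trigram frequency profiles and compare them with cosine similarity. A suspect text whose profile sits much closer to a known model's output than to a known human baseline is evidence, not proof.

    # Toy stylometric comparison: character trigram profiles + cosine similarity.
    # Real stylometry uses much larger samples and many more features.
    from collections import Counter
    from math import sqrt

    def trigram_profile(text):
        text = " ".join(text.lower().split())  # normalize case and whitespace
        return Counter(text[i:i + 3] for i in range(len(text) - 2))

    def cosine(a, b):
        dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
        norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    human_sample = "idk, the whole thing felt off to me, but maybe that's just me."
    model_sample = "It is important to note that there are several key factors to consider."
    suspect_text = "It is worth noting that there are a number of important factors at play."

    print("suspect vs human:", round(cosine(trigram_profile(suspect_text), trigram_profile(human_sample)), 3))
    print("suspect vs model:", round(cosine(trigram_profile(suspect_text), trigram_profile(model_sample)), 3))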

12. intended ◴[] No.41887686[source]
Training on the NY Times and Wikipedia…

The Wikipedia part, at least, is incorrect. Wikipedia mods/admins are currently dealing with AI-generated articles being uploaded.

As for the NYT -- I am assuming that lots of those stories are already available in some blog or other.

E-mail and web forums are 100% polluted with spam, which takes constant effort to remove. GenAI-based content is far harder to identify and remove.

Your example assumes the effort required to keep the web functional can deal with AI-created content. Speaking from experience, our filters (human and otherwise) cannot. They fail to do so even now.

PS: Even in your example of closed email chains -- the information in them depends on the sources people read. Like plastic pollution in the food chain, this is inescapable.

replies(1): >>41887854 #
13. crazygringo ◴[] No.41887839{3}[source]
You're ignoring the information about the source.

It's not about a heuristic on text of unknown provenance -- it's about publishers that exert a certain level of editorial control and quality verification. Or social reputation mechanisms that achieve the same.

That's what is preventing your "model collapse". Reputations of provenance. Not pure-text heuristics.

14. crazygringo ◴[] No.41887854{3}[source]
> Currently Wikipedia mods/admins are dealing with AI generated articles being uploaded.

And they've always dealt with spam and low-quality submissions before. The system is working.

> As for NYT - I am assuming that lots of those stories are already available in some blog or the other.

I don't know what relevance that has to what we're talking about. The point is, train on the NYT. Blogs don't change what's on the NYT.

> The e-mail and web forums are 100% polluted with spam, which takes constant effort to remove.

They've always been polluted with low-quality content. So yes, either don't train on them, or only train on highly upvoted solutions, etc.

AI pollution isn't fundamentally any different from previous low-quality content and spam. It's not terribly difficult to determine which parts of the internet are known to be high-quality and train only on those. LLMs can't spam the NY Times.
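
To make that concrete, here's a minimal sketch of the kind of source filter I mean -- the domain allowlist and record format are made up for illustration, and real curation pipelines layer deduplication, quality classifiers, and human review on top of this:

    # Toy provenance filter: keep only documents from an allowlist of trusted domains.
    from urllib.parse import urlparse

    TRUSTED_DOMAINS = {"nytimes.com", "en.wikipedia.org"}  # illustrative only

    def is_trusted(url):
        host = urlparse(url).netloc.lower()
        return any(host == d or host.endswith("." + d) for d in TRUSTED_DOMAINS)

    documents = [
        {"url": "https://en.wikipedia.org/wiki/Stockholm", "text": "Stockholm is the capital of Sweden."},
        {"url": "https://random-blogspam.example/post", "text": "10 shocking facts, number 7 will..."},
    ]

    corpus = [doc for doc in documents if is_trusted(doc["url"])]
    print(len(corpus), "of", len(documents), "documents kept")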

replies(1): >>41887925 #
15. intended ◴[] No.41887925{4}[source]
>The system is working.

Given that this is an issue I've heard from Wikipedia admins themselves, I am impressed by your confidence.

>The point is, train on the NYT. Blogs don't change what's on the NYT.

The counterpoint is that NYT content is already in the training data, because it's already replicated or copied into random blogs.

>So yes, either don't train on them, or only train on highly upvoted solutions, etc.

Highly upvoted messages on Reddit are very often bots copying older top comments. Mods already have issues with AI comments.

----

TLDR: Pollution is already happening. Verification does not scale, while generation scales.

replies(1): >>41887995 #
16. crazygringo ◴[] No.41887995{5}[source]
> The counter point is that NYT content is already in the training data

That's not a counterpoint. My point is, train on things like the NYT, not random blogs. You can also whitelist the blogs you know are written by people, rather than randomly spidering the whole internet.

Also, no -- most of the NYT hasn't been copied into blogs. A small proportion of top articles, maybe.

> Highly upvoted messages on reddit are very regular bots copying older top comments.

What does that matter if the older top comment was written by a person? Also, Reddit is not somewhere you want to train in the first place if you're trying to generate a model where factual accuracy matters.

> Verification does not scale, while generation scales.

You don't need to verify everything -- you just need to verify enough stuff to train a model on. We're always going to have plenty of stuff that's sufficiently verified, whether from newspapers or Wikipedia or whitelisted blogs or books from verified publishers or whatever. It's not a problem.

You shouldn't be training on blogspam from random untrusted domains in the first place. So it doesn't matter if that junk is AI-generated or not.

replies(1): >>41888484 #
17. intended ◴[] No.41888484{6}[source]
>What does that matter if the older top comment was written by a person?

That is the entire issue: LLMs fail when they are trained on GenAI-based content.

> Also, Reddit is not somewhere you want to train in the first place if you're trying to generate a model where factual accuracy matters.

There is no model that can create factual accuracy. This would basically contravene the laws of physics. LLMs predict the next token.

>You shouldn't be training on blogspam from random untrusted domains in the first place. So it doesn't matter if that junk is AI-generated or not

Afaik, all the current models are trained on this corpus. That is how they work.

replies(1): >>41888993 #
18. crazygringo ◴[] No.41888993{7}[source]
> There is no model that can create factual accuracy.

Factual accuracy is not binary, it is a matter of degrees. Obviously training on content that is more factually correct will result in more factually correct next tokens. This is a pretty fundamental aspect of LLMs.

> Afaik, all the current models are trained on this corpus.

Then apologies for being so blunt, but you know wrong. There is a tremendous amount of work that goes on at the LLM companies in verifying, sanitizing, and structuring the training corpuses, using a wide array of techniques. They are absolutely not just throwing in blogspam and hoping for the best.

replies(1): >>41889428 #
19. intended ◴[] No.41889428{8}[source]
Thank you for being blunt. Let me attempt to speak in the same earnest tone.

You are contradicting the papers and public statements of the people who make the models. Alternatively, you are looking at the dataset curation process with rose-tinted glasses.

>There is a tremendous amount of work that goes on at the LLM companies in verifying, sanitizing, and structuring the training corpuses, using a wide array of techniques.

Common Crawl is instrumental in building our models; 60% of GPT-3's training mix was Common Crawl (https://arxiv.org/pdf/2005.14165, pg. 9).

CC, in turn, was never intended for LLM training; this misalignment in goals results in downstream issues like hate speech, NYT content, copyrighted content, and more getting used to train models.

https://foundation.mozilla.org/en/research/library/generativ... (This article establishes the issues with CC as a source of LLM training data.)

https://facctconference.org/static/papers24/facct24-148.pdf (This paper details those issues.)

Firms such as the NYT are now stopping Common Crawl from archiving their pages. https://www.wired.com/story/the-fight-against-ai-comes-to-a-...

-----

TLDR: NYT and other high-quality content has largely been ingested by models already. Reddit and other sources play a large part in training current models.

While I appreciate your being blunt, blunt also means not sharp and incisive. Perhaps more precision is needed here to clarify your point.

Finally -

>Factual accuracy is not binary, it is a matter of degrees. Obviously training on content that is more factually correct will result in more factually correct next tokens

What. Come on, I think you wouldn't agree with your own statement after reading it once more: factual correctness is not a matter of degrees.

Furthermore, facts don't automatically create facts. Calculation, processing, testing, and verification create more facts. Just putting facts together creates content.

replies(1): >>41895879 #
20. bossyTeacher ◴[] No.41890399{3}[source]
Elaborate? Never heard of this.
21. crazygringo ◴[] No.41895879{9}[source]
Re: corpus content, I think we're talking past each other. I'm saying that current models aren't being blindly trained on untrusted blogspam, and that there's a lot of work done to verify, structure, transform, etc. Earlier models were trained with lower-quality content, as companies were trying to figure out how much scale mattered; now they're paying huge amounts of money to improve the quality of what they ingest, to better shape the quality of output.

As for what they take from Reddit, they're not blindly ingesting every comment from every user. My overall main point stands: we have viable, working, scalable mechanisms to avoid the "pollution" you're worried about.

> What. Come on, I think you wouldnt agree with your own statement after reading it once more -Factual correctness is not a matter of degrees.

Of course it is. An LLM can be correct 30% of the time, 80% of the time, 95% of the time, 99% of the time. If that's not a matter of degrees, I don't know what is. If you're looking for 100% perfection, I think you'll find that not even humans can do that. ;)

replies(1): >>41901123 #
22. wkat4242 ◴[] No.41898416[source]
Tbh it would also make the model much more accurate for corporate uses. It's not a bad idea for that reason.

But security and DLP teams will never accept it.

23. intended ◴[] No.41901123{10}[source]
> I think we're talking past each other

Likely.

Re: > we have viable, working, scalable mechanisms to avoid the "pollution" you're worried about.

Do note -- it's the scalable mechanisms that I am looking at. I don't think the state of the art has shifted much since the last paper by OpenAI.

Can you link me to some new information or sources that lend credence to your claim?

> An LLM can be correct 30% of the time, 80% of the time, 95%…

That would be the error rate, which can be a matter of degrees.

However, factual correctness largely cannot -- the capital of Sweden today is Stockholm, with 0% variation in that answer.