321 points by jhunter1016 | 61 comments
1. mikeryan ◴[] No.41878605[source]
Technical AI and LLMs are not something I'm well versed in, so as I sit on the sidelines and watch the current proliferation of AI startups, I'm starting to wonder where the moats are, outside of access to raw computing power. OpenAI seemed to have a massive lead in this space, but that lead seems to be shrinking every day.
replies(10): >>41878784 #>>41878809 #>>41878843 #>>41880703 #>>41881606 #>>41882000 #>>41885618 #>>41886010 #>>41886133 #>>41887349 #
2. weberer ◴[] No.41878784[source]
Obtaining high quality training data is the biggest moat right now.
replies(2): >>41882699 #>>41883992 #
3. wongarsu ◴[] No.41878809[source]
Data. You want huge amounts of high-quality data with a diverse range of topics, writing styles, and languages. Everyone seems to balance those requirements a bit differently, and different actors have access to different training data.

There is also some moat in the refinement process (RLHF, model "safety", etc.)

4. InkCanon ◴[] No.41878843[source]
You hit the nail on the head. Companies are scrambling for an edge. Not a real edge, an edge to convince investors to keep giving them money. Perplexity is going all in on convincing VCs it can create a "data flywheel".
replies(1): >>41884196 #
5. runeblaze ◴[] No.41880703[source]
In addition to data, having the infra to scale up training robustly is very very non-trivial.
6. YetAnotherNick ◴[] No.41881606[source]
> Open AI seemed to have a massive lead in this space but that lead seems to be shrinking every day.

The lead is as strong as ever. They are 34 Elo above anyone else in blind testing, and 73 Elo above in coding [1]. They also seem to artificially constrain the lead, as they already have a stronger model like o1 which they haven't released. Consistent with the past, they seem to release just <50 Elo above anyone else, and upgrade the model within weeks when someone gets closer.

[1]: https://lmarena.ai/
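
For a rough sense of what those gaps mean, the standard Elo expected-score formula converts a rating difference into an expected head-to-head preference rate. A minimal sketch, using only the 34 and 73 figures quoted above:

```python
# Standard Elo expected-score formula; 34 and 73 are the gaps quoted above,
# not new measurements.
def expected_win_rate(elo_diff: float) -> float:
    return 1.0 / (1.0 + 10 ** (-elo_diff / 400))

print(f"{expected_win_rate(34):.1%}")  # ~54.9% of blind pairwise votes
print(f"{expected_win_rate(73):.1%}")  # ~60.4% in the coding category
```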

replies(2): >>41881810 #>>41884236 #
7. adventured ◴[] No.41881810[source]
It's rather amusing that people have said this about OpenAI - that they essentially had no lead - for about two years non-stop.

The moat, as usual, is extraordinary scale, resources, and time. Nobody is putting $10 billion into the 7th OpenAI clone. Big tech isn't aggressively partnering with the 7th OpenAI clone. The door is already shut to that 7th OpenAI clone (they can never succeed or catch up); there's just an enormous amount of naivety in tech circles about how things work in the real world: I can just spin up a ChatGPT competitor over the weekend on my 5090, therefore OpenAI has no barriers to entry, etc.

HN used to endlessly talk about how Uber could be cloned in a weekend. It's just people talking about something they don't actually understand. They might understand writing code (or similar), and their bias stems from the premise that their thing is the hard part of the equation (writing the code and building an app are very far from the hardest part of the equation for an Uber).

replies(1): >>41882938 #
8. Der_Einzige ◴[] No.41882000[source]
How can anyone say the lead is shrinking when still no one has any good competitor to Strawberry? DSPy has been out for how long, and how many folks have shown better reasoning models than Strawberry built with literally anything else? Oh yeah, zero.
replies(2): >>41884918 #>>41887996 #
9. segasaturn ◴[] No.41882699[source]
Where are they going to get that data? Everything on the open web after 2023 is polluted with low-quality AI slop that poisons the data sets. My prediction: aggressive dragnet surveillance of users. As in, Google recording your phone calls on Android, Windows sending screen recordings from Recall to OpenAI, Meta training off WhatsApp messages... It sounds dystopian, but the Line Must Go Up.
replies(3): >>41883095 #>>41883850 #>>41885531 #
10. TeaBrain ◴[] No.41882938{3}[source]
No one was saying this 2 or even 1.5 years ago.
11. jazzyjackson ◴[] No.41883095{3}[source]
I'm really curious if Microsoft will ever give in to the urge to train on private business data - since transitioning Office to O365, they hold the world's, and even governments', Word documents and emails. I'm pretty sure they've promised never to touch it, but they can certainly read it, so... Information wants to be free.
replies(3): >>41883349 #>>41886762 #>>41898416 #
12. jhickok ◴[] No.41883349{4}[source]
Microsoft "trains" on business data already, but typically for things like fine-tuning security automation and recognizing malicious signals. It sure would be a big step to start reading chats and email and feeding them into a model.
13. crazygringo ◴[] No.41883850{3}[source]
> Everything on the open web after 2023 is polluted with low-quality AI slop that poisons the data sets.

Not even close to everything.

E.g., training on the NY Times and Wikipedia picks up essentially zero meaningful AI content, and training on books from reputable publishers likewise picks up essentially none. Any LLM usage there was to polish prose or assist with research or whatever, but shouldn't affect the factual quality in any significant way.

The web hasn't been polluted with AI any more than e-mail has been polluted with spam. Which is to say it's there, but it's also entirely viable to separate. Nobody's worried that the group email chain with friends is being overrun with spam or with AI.

replies(3): >>41884785 #>>41884911 #>>41887686 #
14. staticautomatic ◴[] No.41883992[source]
I’m in this space and no it isn’t.
replies(1): >>41884027 #
15. sumedh ◴[] No.41884027{3}[source]
What is the moat then?
replies(1): >>41884043 #
16. staticautomatic ◴[] No.41884043{4}[source]
Idk but it’s not lack of training data.
replies(2): >>41884094 #>>41887047 #
17. sumedh ◴[] No.41884094{5}[source]
You work in this space and you don't know what the moat is?
replies(4): >>41884130 #>>41884226 #>>41884915 #>>41887971 #
18. staticautomatic ◴[] No.41884130{6}[source]
The market is two-sided.
19. disqard ◴[] No.41884196[source]
Perhaps I've missed something, but where will the infinite amounts of training data come from, for future improvements?

If these models are to be trained on the outputs of themselves (and other models), then it's not so much a "flywheel" as a perpetual motion machine.

replies(2): >>41885368 #>>41886361 #
20. sangnoir ◴[] No.41884226{6}[source]
Nobody working in CS knows whether P = NP; working in a field doesn't mean you have all the answers.
21. epolanski ◴[] No.41884236[source]
Idc about lmarena benchmarks. I test different models every day in Cursor, and Sonnet is way better at coding web applications than GPT-4o.
replies(1): >>41885498 #
22. mplewis ◴[] No.41884911{4}[source]
There is no heuristic that reliably separates human-written text from GPT text. It is impossible to avoid model collapse in such an environment.
replies(3): >>41887460 #>>41887598 #>>41887839 #
23. mplewis ◴[] No.41884915{6}[source]
Tells you a lot about the state of the industry.
24. mplewis ◴[] No.41884918[source]
Wow, this thing can reason now? How come it keeps getting my basic word problems wrong?
replies(1): >>41885965 #
25. Tier3r ◴[] No.41885368{3}[source]
Perplexity has a dubious idea based around harvesting user chats -> making the service better -> getting more user prompts. I am quite unconvinced that user prompts and stored chats will materially improve an LLM that is trained on a trillion high-quality tokens.

The second idea being kicked around is that synthetic data will create a new fountain of youth for data and will also fix reasoning abilities.

26. Zetaphor ◴[] No.41885498{3}[source]
Completely agree. It's well known that the LMSys arena benchmarks are heavily biased toward whatever is new and exciting. Meanwhile, even OpenAI have acknowledged Sonnet as being a superior coding model.

This is clearly evident to anyone who spends any amount of time working on non-trivial projects with both models.

27. lfmunoz4 ◴[] No.41885531{3}[source]
I would think most quality data is books, news articles, and scientific journals, not the crap people are texting each other.

These companies will never admit it, but AI is built on the back of piracy archives - the easiest and cheapest way to get massive amounts of quality data.

replies(2): >>41885758 #>>41887361 #
28. _giorgio_ ◴[] No.41885618[source]
Quite the opposite.

The new o1 models from OpenAI are surprisingly good. So good that you can see intelligence at work in the solutions as you go from:

GPT-4 → o1-mini → o1-preview

Way better than the competition. And they have a mobile app with Advanced Voice, which is phenomenal.

29. mcmcmc ◴[] No.41885758{4}[source]
That entirely depends on what quality you're going for. If the goal is to simulate passably human conversation, texts and DMs are probably more desirable.
30. Der_Einzige ◴[] No.41885965{3}[source]
Tokenization
replies(1): >>41886060 #
31. PeterStuer ◴[] No.41886010[source]
Those extremely detailed models they can build of you (a digital twin) from all your conversations, coupled with intelligent, subtle, likable, relatable, and very personal artificial salesmen in service of the highest bidder, have enormous financial potential.

Even if the latter becomes commoditized (and we are far from that in practice), the former is a serious moat. Just as there is no secret to building a search engine or a social network platform (which is not to say there are no technical challenges), operating one profitably requires massive aggregate user-profile exploitation potential, which in turn requires huge upfront loss leaders.

32. youoy ◴[] No.41886060{4}[source]
Is it really a tested conclusion? Or a plausible conclusion to try to hide the limitations of the model architecture?

I'm asking because I know that with some prompts it gets the answer correct, and in those cases nothing in the tokenization has changed.

replies(1): >>41887614 #
33. bboygravity ◴[] No.41886133[source]
The lead has already been taken over by xAI as far as I know. They seem to have 100k H100s up and running.

OpenAI hasn't really been leading the LLM world anyway since Claude 3.5 Sonnet came out.

replies(1): >>41887052 #
34. LarsDu88 ◴[] No.41886361{3}[source]
There's pretraining, which is just raw text from the internet, but there's also supervised preference data sourced from users.

Right now the edge is in acquiring the latter, where OpenAI has a slight lead.

35. ENGNR ◴[] No.41886762{4}[source]
Slack tried it but the backlash got them, this time anyway.
replies(1): >>41890399 #
36. fastball ◴[] No.41887047{5}[source]
If you don't know what it is, how do you know it's not training data?
37. belter ◴[] No.41887052[source]
> OpenAI hasn't really been leading the LLM world anyway since Claude 3.5 Sonnet came out.

Something many times repeated. It just takes a few minutes with the different models to find out it's not true.

replies(3): >>41887315 #>>41887320 #>>41888317 #
38. csomar ◴[] No.41887315{3}[source]
I'd say it's 50/50 at the moment. Half the time, Sonnet gives a definitive and better answer.
39. daghamm ◴[] No.41887320{3}[source]
I usually use them side by side, and very often (not always) Claude is better.
40. browningstreet ◴[] No.41887349[source]
OpenAI just raised a huge pile of money. I'm sure the moat was a consideration during the funding raise. I'm guessing the issues between OpenAI and Microsoft go beyond just the moat and competitiveness.

Every OpenAI thread focuses on the moat, but surely that was baked into their business dealings of the last 60 days.

replies(1): >>41887660 #
41. mswen ◴[] No.41887361{4}[source]
A friend and I built a proof-of-concept of using a variation of Latent Semantic Analysis to automatically build up conceptual maps and loadings of individual words against the latent conceptual vectors back in 2000. In exploring what it would take to scale I concluded, like you, that we should use professionally written and edited content like books, news articles and scientific journals as the corpus against which to build up the core knowledge graph.

Twenty-four years later I still regret not being able to raise money to enable us to keep working on that nascent startup. In most ways it was still too early. Google was still burning through VC money at that point and the midwestern investors we had some access to didn't get it. And, honestly they were probably correct. Compute power was still too expensive and quality data sources like published text were mostly locked up and generally not available to harvest.
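
For readers unfamiliar with the technique, here is a toy sketch of the Latent Semantic Analysis idea described above, written with scikit-learn (which of course did not exist in 2000); the documents and component count are purely illustrative, not anything from the original project:

```python
# Toy Latent Semantic Analysis: factor a TF-IDF term-document matrix with
# truncated SVD so each vocabulary word gets loadings on latent "concept"
# vectors. The corpus and the number of dimensions are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the bank raised interest rates",
    "the river bank flooded after heavy rain",
    "interest rates affect mortgage payments",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)   # documents x terms matrix
svd = TruncatedSVD(n_components=2)   # 2 latent concept dimensions
svd.fit(X)

# Loadings of each vocabulary word against the latent concept vectors
for word, loadings in zip(vectorizer.get_feature_names_out(), svd.components_.T):
    print(f"{word:12s} {loadings}")
```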

42. ValentinA23 ◴[] No.41887460{5}[source]
Unless you're OpenAI and provided something like 98% of the AI-generated text that was posted to the internet.
43. Der_Einzige ◴[] No.41887598{5}[source]
Anti-slop samplers and bigram/trigram stylometric analysis absolutely can give evidence that someone used a particular LLM with particular settings.

But in general it's hard to separate human from AI text.
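
To make the stylometry half of that concrete, here is a minimal sketch of a word-bigram frequency comparison; the texts and names below are placeholders I've invented for illustration, and real detectors and anti-slop samplers use much richer features than this:

```python
# Minimal word-bigram stylometry: build frequency profiles and compare a
# sample against a reference corpus of known LLM output. High similarity is
# weak evidence of shared provenance, not proof -- which is the point above.
from collections import Counter
import math

def bigram_profile(text: str) -> Counter:
    words = text.lower().split()
    return Counter(zip(words, words[1:]))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Illustrative placeholders; a real comparison needs large corpora.
sample = "it is important to note that the landscape is rapidly evolving"
llm_reference = "it is important to note that in a rapidly evolving landscape"
print(cosine(bigram_profile(sample), bigram_profile(llm_reference)))
```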

44. Der_Einzige ◴[] No.41887614{5}[source]
Yes, this is 100% tested and proven ad nauseam within the field. I have some of my own papers on this, but you can look at literally any major AI conference and find dozens of papers analyzing yet more issues caused by byte-pair tokenization.

Honestly, the folks who don't want to admit that it's tokenization are just extremely salty that AI is actually good right now. Your "AI couldn't tell me how many Rs are in strawberry" stuff is extreme cope about your job prospects evaporating because of a system that can't spell correctly.
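
For anyone unfamiliar with the argument, a small sketch of what byte-pair tokenization does to the famous example (this assumes the tiktoken library; the exact token split varies by model and the one shown in comments is only a plausible example):

```python
# BPE tokenization sketch: the model sees subword chunks, not letters, so
# "how many r's in strawberry" asks about units it never directly observes.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
ids = enc.encode("strawberry")
print(ids)                               # a short list of token ids
print([enc.decode([i]) for i in ids])    # e.g. ['str', 'aw', 'berry'] -- split varies
# None of those chunks is the letter 'r', so counting letters has to be
# inferred from learned associations rather than read off the input.
```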

replies(1): >>41889402 #
45. appplication ◴[] No.41887660[source]
> I'm sure the moat was a consideration during the funding raise

Ha. Really though, the entirety of venture capital could be summarized as “this probably won’t pay out but if it does it’s gonna be epic”. I wouldn’t read too much into America’s thirstiest capitalists spending their discretionary billions on the latest hype cycle.

46. intended ◴[] No.41887686{4}[source]
Training on the NY Times and Wikipedia…

The Wikipedia part, at least, is incorrect. Wikipedia mods/admins are currently dealing with AI-generated articles being uploaded.

As for the NYT - I am assuming that lots of those stories are already available in some blog or other.

E-mail and web forums are 100% polluted with spam, which takes constant effort to remove. GenAI-generated content is far harder to identify and remove.

This example assumes that the effort required to keep the web functional can deal with AI-created content. Speaking from experience, our filters (human and otherwise) cannot. They fail to do so even now.

PS: Even in your example of closed email chains, the information in them depends on the sources people read. Like plastic pollution in the food chain, this is inescapable.

replies(1): >>41887854 #
47. crazygringo ◴[] No.41887839{5}[source]
You're ignoring the information about the source.

It's not about a heuristic on text of unknown provenance -- it's about publishers that exert a certain level of editorial control and quality verification. Or social reputation mechanisms that achieve the same.

That's what is preventing your "model collapse". Reputations of provenance. Not pure-text heuristics.

48. crazygringo ◴[] No.41887854{5}[source]
> Currently Wikipedia mods/admins are dealing with AI generated articles being uploaded.

And they've always dealt with spam and low-quality submissions before. The system is working.

> As for NYT - I am assuming that lots of those stories are already available in some blog or the other.

I don't know what relevance that has to what we're talking about. The point is, train on the NYT. Blogs don't change what's on the NYT.

> The e-mail and web forums are 100% polluted with spam, which takes constant effort to remove.

They've always been polluted with low-quality content. So yes, either don't train on them, or only train on highly upvoted solutions, etc.

AI pollution isn't fundamentally any different from previous low-quality content and spam. It's not terribly difficult to determine which parts of the internet are known to be high-quality and train only on those. LLMs can't spam the NY Times.

replies(1): >>41887925 #
49. intended ◴[] No.41887925{6}[source]
> The system is working.

Given that this is an issue I've heard from Wikipedia admins themselves, I am impressed by your confidence.

>The point is, train on the NYT. Blogs don't change what's on the NYT.

The counterpoint is that NYT content is already in the training data because it's already replicated or copied into random blogs.

>So yes, either don't train on them, or only train on highly upvoted solutions, etc.

Highly upvoted messages on Reddit are very often bots copying older top comments. Mods already have issues with AI comments.

----

TLDR: Pollution is already happening. Verification does not scale, while generation scales.

replies(1): >>41887995 #
50. tim333 ◴[] No.41887971{6}[source]
There doesn't seem to be much of a moat. OpenAI, Gemini, Meta, xAI, and Anthropic all seem to be able to do much the same stuff. o1 is novel at the moment, but I bet it'll be copied soon.
51. crazygringo ◴[] No.41887995{7}[source]
> The counter point is that NYT content is already in the training data

That's not a counter point. My point is, train on things like the NYT, not random blogs. You can also whitelist the blogs you know are written by people, rather than randomly spidering the whole internet.

Also, no -- most of the NYT hasn't been copied into blogs. A small proportion of top articles, maybe.

> Highly upvoted messages on reddit are very regular bots copying older top comments.

What does that matter if the older top comment was written by a person? Also, Reddit is not somewhere you want to train in the first place if you're trying to generate a model where factual accuracy matters.

> Verification does not scale, while generation scales.

You don't need to verify everything -- you just need to verify enough stuff to train a model on. We're always going to have plenty of stuff that's sufficiently verified, whether from newspapers or Wikipedia or whitelisted blogs or books from verified publishers or whatever. It's not a problem.

You shouldn't be training on blogspam from random untrusted domains in the first place. So it doesn't matter if that junk is AI-generated or not.

replies(1): >>41888484 #
52. tim333 ◴[] No.41887996[source]
It's been out a little over a month. Give it a bit longer and everyone will have reverse engineered clones.
53. mnky9800n ◴[] No.41888317{3}[source]
I think the main issue with these metrics, which you implicitly highlight, is that they are not a one-size-fits-all approach. In fact, they are often treated, at least casually, like some kind of model fit, like an r-squared value, which is maybe a fair description when narrowly constrained to the task or set of tasks the metric actually evaluates. But the complexity of the user experience, combined with the small sample any individual person can experience, leads to conclusions like these. And they are perfectly valid conclusions: if the model doesn't work for you, why use it? It also suggests, though, that personal experience can't be used to decide whether the model performs well in aggregate. But that doesn't matter to the individual user or problem space, because they should of course use whatever works best for them.
54. intended ◴[] No.41888484{8}[source]
> What does that matter if the older top comment was written by a person?

That is the entire issue: LLMs fail when they are trained on GenAI-based content.

> Also, Reddit is not somewhere you want to train in the first place if you're trying to generate a model where factual accuracy matters.

There is no model that can create factual accuracy. This would basically contravene the laws of physics. LLMs predict the next token.

>You shouldn't be training on blogspam from random untrusted domains in the first place. So it doesn't matter if that junk is AI-generated or not

Afaik, all the current models are trained on this corpus. That is how they work.

replies(1): >>41888993 #
55. crazygringo ◴[] No.41888993{9}[source]
> There is no model that can create facutal accuracy.

Factual accuracy is not binary; it is a matter of degrees. Obviously, training on content that is more factually correct will result in more factually correct next tokens. This is a pretty fundamental aspect of LLMs.

> Afaik, all the current models are trained on this corpus.

Then apologies for being so blunt, but you know wrong. There is a tremendous amount of work that goes on by the LLM companies in verifying, sanitizing, and structuring the training corpuses, using a wide array of techniques. They are absolutely not just throwing in blogspam and hoping for the best.

replies(1): >>41889428 #
56. youoy ◴[] No.41889402{6}[source]
But does a different prompt get the answer correct? I find that surprising. Can you share a link? I'm not saying this out of saltiness; I would be very grateful. If you don't want to, I will try the shitty Google search, no problem.
57. intended ◴[] No.41889428{10}[source]
Thank you for being blunt. Let me attempt to speak in the same earnest tone.

You are contradicting the papers and the statements of the very people who make the models. Alternatively, you are looking at the dataset curation process through rose-tinted glasses.

> There is a tremendous amount of work that goes on by the LLM companies in verifying, sanitizing, and structuring the training corpuses, using a wide array of techniques.

Common Crawl is instrumental in building our models; 60% of GPT-3's training data was Common Crawl (https://arxiv.org/pdf/2005.14165, pg. 9).

CC, in turn, was never intended for LLM training; this misalignment in goals results in downstream issues like hate speech, NYT content, copyrighted content, and more getting used to train models.

https://foundation.mozilla.org/en/research/library/generativ... (This article is to establish the issues with CC as a source of LLM training)

https://facctconference.org/static/papers24/facct24-148.pdf (this details those issues.)

Firms such as the NYT are now stopping Common Crawl from archiving their pages. https://www.wired.com/story/the-fight-against-ai-comes-to-a-...

-----

TLDR: 'NYT' and other high-quality content has largely been ingested by models. Reddit and other sources play a large part in training current models.

While I appreciate your being blunt, this also means not being sharp and incisive. Perhaps precision would be required here to clarify your point.

Finally -

>Factual accuracy is not binary, it is a matter of degrees. Obviously training on content that is more factually correct will result in more factually correct next tokens

What. Come on, I think you wouldn't agree with your own statement after reading it once more: factual correctness is not a matter of degrees.

Furthermore, facts don't automatically create facts. Calculation, processing, testing, and verification create more facts. Just putting facts together creates content.

replies(1): >>41895879 #
58. bossyTeacher ◴[] No.41890399{5}[source]
Elaborate? never heard of this
59. crazygringo ◴[] No.41895879{11}[source]
Re: corpus content, I think we're talking past each other. I'm saying that current models aren't being blindly trained on untrusted blogspam, and that there's a lot of work done to verify, structure, transform, etc. And earlier models were trained with lower-quality content, as companies were trying to figure out how much scale mattered. Now they're paying huge amounts of money to improve the quality of what they ingest, to better shape the quality of output. What they take from Reddit, they're not blindly ingesting every comment from every user. My overall main point stands: we have viable, working, scalable mechanisms to avoid the "pollution" you're worried about.

> What. Come on, I think you wouldnt agree with your own statement after reading it once more -Factual correctness is not a matter of degrees.

Of course it is. An LLM can be correct 30% of the time, 80% of the time, 95% of the time, 99% of the time. If that's not a matter of degrees, I don't know what is. If you're looking for 100% perfection, I think you'll find that not even humans can do that. ;)

replies(1): >>41901123 #
60. wkat4242 ◴[] No.41898416{4}[source]
Tbh it would also make the model much more accurate for corporate uses. It's not a bad idea for that reason.

But security and DLP teams will never accept it.

61. intended ◴[] No.41901123{12}[source]
> I think we're talking past each other

Likely.

Re: > we have viable, working, scalable mechanisms to avoid the "pollution" you're worried about.

Do note - it's the scalable mechanisms that I am looking at. I don't think the state of the art has shifted much since the last paper by OpenAI.

Can you link me to some new information or sources that lend credence to your claim?

> An LLM can be correct 30% of the time, 80% of the time, 95%…

That would be the error rate, which can be a matter of degrees.

However, factual correctness largely cannot: the capital of Sweden today is Stockholm, with 0% variation in that answer.