mikeryan ◴[] No.41878605[source]
Technical AI and LLMs are not something I'm well versed in. So as I sit on the sidelines and watch the current proliferation of AI startups, I'm starting to wonder where the moats are, outside of access to raw computing power. OpenAI seemed to have a massive lead in this space, but that lead seems to be shrinking every day.
replies(10): >>41878784 #>>41878809 #>>41878843 #>>41880703 #>>41881606 #>>41882000 #>>41885618 #>>41886010 #>>41886133 #>>41887349 #
weberer ◴[] No.41878784[source]
Obtaining high quality training data is the biggest moat right now.
replies(2): >>41882699 #>>41883992 #
segasaturn ◴[] No.41882699[source]
Where are they going to get that data? Everything on the open web after 2023 is polluted with low-quality AI slop that poisons the data sets. My prediction: aggressive dragnet surveillance of users. As in, Google recording your phone calls on Android, Windows sending screen recordings from Recall to OpenAI, Meta training off WhatsApp messages... It sounds dystopian, but the Line Must Go Up.
replies(3): >>41883095 #>>41883850 #>>41885531 #
lfmunoz4 ◴[] No.41885531[source]
I would think most quality data is in books, news articles, and scientific journals, not the crap people are texting each other.

These companies will never admit it, but AI is built on the back of piracy archives: the easiest and cheapest way to get massive amounts of quality data.

replies(2): >>41885758 #>>41887361 #
mswen ◴[] No.41887361[source]
A friend and I built a proof-of-concept back in 2000 that used a variation of Latent Semantic Analysis to automatically build up conceptual maps and loadings of individual words against latent conceptual vectors. In exploring what it would take to scale, I concluded, like you, that we should use professionally written and edited content like books, news articles, and scientific journals as the corpus against which to build up the core knowledge graph.
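
For anyone curious, here is a minimal sketch of the core mechanism: classic LSA factors a TF-IDF term-document matrix with a truncated SVD, and each word's loadings on the resulting latent vectors form the kind of conceptual map described above. The toy corpus, the choice of two components, and the use of scikit-learn are my illustrative assumptions, not the original system:

    # LSA sketch: word loadings against latent concept vectors.
    # Assumptions: toy corpus, k=2 concepts, scikit-learn's stock API.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    corpus = [
        "the court ruled on the copyright case",
        "the journal published a peer reviewed study",
        "the study measured protein folding rates",
        "the newspaper covered the court ruling",
    ]

    # TF-IDF weighted term-document matrix (docs x terms).
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)

    # Truncated SVD over that matrix is the heart of LSA.
    svd = TruncatedSVD(n_components=2, random_state=0)
    doc_vecs = svd.fit_transform(X)  # documents in concept space

    # Each row of components_ is a latent concept; each column is a
    # word's loading on it -- the conceptual map of the vocabulary.
    terms = vectorizer.get_feature_names_out()
    for i, concept in enumerate(svd.components_):
        top = concept.argsort()[::-1][:4]
        print(f"concept {i}:", [terms[j] for j in top])

(Runs as-is with scikit-learn installed; on a real corpus the SVD is where the compute cost lands, which is part of why scaling this in 2000 was so hard.)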

Twenty-four years later, I still regret not being able to raise the money that would have let us keep working on that nascent startup. In most ways it was still too early. Google was still burning through VC money at that point, and the Midwestern investors we had some access to didn't get it. And honestly, they were probably correct. Compute power was still too expensive, and quality data sources like published text were mostly locked up and generally not available to harvest.