Most active commenters
  • trott(7)
  • LoganDark(4)
  • lostmsu(4)
  • spiralk(3)
  • stubish(3)

56 points trott | 59 comments
1. ofrzeta ◴[] No.40714106[source]
"If trends continue, language models will fully utilize this stock between 2026 and 2032" - that will require data centers with their own nuclear reactors (or other power plants) as hinted at by Marc Zuckerberg?
replies(2): >>40714248 #>>40714295 #
2. trott ◴[] No.40714248[source]
If you take Llama-3-400B, scale its data 30x (hitting the data ceiling, AFAICT) and its size 30x to match, and hardware improves by, say, 3x, then you'll use up about a year's worth of energy from a typical nuclear power plant.
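A rough sanity check in Python; every round number below (parameter count, token count, cluster efficiency, plant size) is my own assumption, not a sourced figure:

    # Back-of-envelope version of the estimate above.
    params = 30 * 405e9                # 30x Llama-3-405B's parameters
    tokens = 30 * 15e12                # 30x its ~15T-token dataset
    flops = 6 * params * tokens        # standard ~6*N*D training-FLOPs rule

    flop_per_joule = 3 * 1.5e11        # assumed cluster efficiency, 3x today's
    energy = flops / flop_per_joule    # joules needed for the training run

    plant_year = 2e9 * 3.15e7          # ~2 GW plant running for one year
    print(energy / plant_year)         # ~1.2 plant-years -- same ballpark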
replies(3): >>40714269 #>>40714294 #>>40714809 #
3. mathsmath ◴[] No.40714269{3}[source]
I don’t know much about LLMs, but is it possible to throttle their training?

Solar has gotten pretty cheap, and I’m just wondering if you can throttle up and down based on how much output the panels are producing.

replies(1): >>40714416 #
4. spiralk ◴[] No.40714294{3}[source]
If it's for training a new foundation model, it is not that bad. It's still only a fraction of the energy used by many human industries. I did rough math some time ago and found that training Llama-3-70B used the equivalent energy of 1/30 of a fully loaded container ship going from China to the US. Even scaled up 100x and trained 10x longer, the energy consumption seems relatively small compared to other industries. The fact that people are considering nuclear power for AI training is an advantage, not a downside, imo. It should have a much lower CO2 footprint.
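For anyone who wants to redo the rough math, a sketch under my own assumptions (GPU-hours, per-GPU power, fuel burn, and crossing length are all guessed round figures); it lands at roughly a tenth of one crossing, the same order of magnitude as the 1/30 above:

    # Training energy: ~6.4M GPU-hours is the figure I recall from
    # Meta's Llama-3-70B model card, at ~700 W per GPU.
    train_joules = 6.4e6 * 3600 * 700

    # Ship energy: assume ~200 t of heavy fuel oil per day for a
    # ~20-day Pacific crossing, at ~40 MJ/kg.
    ship_joules = 200e3 * 20 * 40e6    # kg of fuel * joules per kg

    print(train_joules / ship_joules)  # ~0.1 of one crossing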
replies(1): >>40715462 #
5. bamboozled ◴[] No.40714295[source]
Remember tackling climate change? Remember all the Silicon Valley types pushing for us to tackle climate change?
replies(1): >>40715630 #
6. throwaway48476 ◴[] No.40714342[source]
We're not even close to running out of human-generated data. The reason it seems this way is because it's so hard to find old data. There are tons of whole magazine scans on some obscure website that's not even indexed. Most of this is the fault of Google, which has been an atrocious steward of search. Why is it that I still can't do full-text search of the Internet Archive dataset? Forever copyright of commercially de minimis works also plays a large role.

There's a monumental amount of quality data out there that's not indexed, not searchable, and abandoned and unused. We just need to value it enough to use it.

replies(1): >>40714649 #
7. moi2388 ◴[] No.40714416{4}[source]
Of course it is, but the trade-off is time.
8. _boffin_ ◴[] No.40714494[source]
The amount of data that all the different government agencies have tucked away in their different file cabinets has to be orders of magnitude more than what's on the public internet. The amount of data in the military... I couldn't even fathom.

One data source I've been thinking about, which I don't know if they've hit yet, is all the different local and state agencies and their private and public meetings, ordinances, discourse, etc...

replies(2): >>40715221 #>>40716466 #
9. stubish ◴[] No.40714649[source]
So much old, out-of-date, factually incorrect, racist, sexist and even illegal information. I think it is already clear that training systems on everything is not the way forward, and is about as reliable as the set of 80s encyclopedias my mother refuses to throw out. The current tech needs to be trained on good data to produce good results, as it can't reason and gauge reliability, or even pick up when its output is self-contradictory.
replies(3): >>40714750 #>>40716624 #>>40720349 #
10. makapuf ◴[] No.40714795[source]
Funny that it does not take that much data to train your average 20th-century human genius. I'd say that if we are dreaming about the future of AI, learning and reasoning seem the greatest issues, not data. That said, the article title is about LLMs, so that's what will need changing, I guess.
replies(3): >>40715430 #>>40715643 #>>40716666 #
11. monero-xmr ◴[] No.40714809{3}[source]
If someone is willing to pay, who cares? Energy has a price. Focus on regulating how energy is generated, and when prices climb the market will solve the problem.

If instead you focus on using the government to outlaw demand, only failure will follow. I mean, didn't the government's outlawing of demand for illegal drugs fail miserably? I believe drugs are cheaper, more potent, and more available than ever.

Similarly, if there is demand for compute, then compute will occur. There is always a clearing price commensurate with the risk.

replies(1): >>40715727 #
12. aaron695 ◴[] No.40715173[source]
They haven't even translated non-English material and mixed it all in yet (that I know of)

This is big because it would hold novel data the West doesn't have access to. What is the 'mood' of the average Chinese farmer on Taiwan?

Otherwise it's hard to see how adding more text of the same thing is going to create a revolution.

Video will be something new. But if like "Her" it watches every Twitch stream simultaneously for a month, and is talking to a billion people for a month and still doesn't get it what else is going to happen?

replies(1): >>40715724 #
13. freilanzer ◴[] No.40715221[source]
> The amount of data that all the different government agencies has tucked away in their different file cabinets has to be magnitudes more than what's on the public internet. The amount of data in the military... i couldn't even fathom.

Definitely not when it comes to text. The internet is the largest resource. I'd like to see all the books in the Vatican digitised, if they're not already - probably not, though.

14. jstanley ◴[] No.40715430[source]
Humans aren't just text interfaces though. The majority of your input is not textual but is sights, sounds, feelings, etc., that LLMs don't (yet?) have access to.

Humans receive an enormous amount of training data in forms not currently available to LLMs.

If you locked baby Einstein in a room with the collected works of humanity and left him there for a lifetime, I doubt he'd have even learnt to read on his own.

replies(6): >>40715609 #>>40715647 #>>40715822 #>>40715950 #>>40716247 #>>40716485 #
15. adrianN ◴[] No.40715462{4}[source]
You always have to compare the cost to the value it generates. A year of power from a nuclear plant might be used in more productive ways.
replies(2): >>40715605 #>>40719559 #
16. bambax ◴[] No.40715513[source]
The paradox is that the amount of data available for LLM training is going down, not up, because earlier models made ample use of copyrighted works that later models won't have access to.
replies(1): >>40717153 #
17. fifteen1506 ◴[] No.40715539[source]
I thought having the ex-NSA chief on board would mean a military AI would have lots of info to feed on, in near real-time.
18. ofrzeta ◴[] No.40715605{5}[source]
Yes, and consider that with the current hype around AI and enough (venture) capital, there will be several corporations competing for the best AI, and suddenly we are at several "nuclear plants" worth of energy consumption.
replies(1): >>40719695 #
19. devoutsalsa ◴[] No.40715609{3}[source]
I remember listening to an AI researcher in some interview over 20 years ago. He said that in his quest to create an AI, he realized at some point he could just have kids instead.
20. surfingdino ◴[] No.40715630{3}[source]
Yeah, where's that app that was supposed to fix it?
21. verdverm ◴[] No.40715643[source]
This has become my thinking as well. There is plenty of data for humans, so this is really a hardware/algorithms problem, not a data one. Using lots of text and energy is just a step on the path. We are already moving on to text+image, and it won't be long until video, and later robots (with sensors) too.
22. mrweasel ◴[] No.40715647{3}[source]
We also have a funny way of applying solutions and lessons learned in one field to problems in completely unrelated areas. Given the statistical nature of LLMs, I'm not convinced that they are able to look across fields in the same way as a human brain; they lack creativity.

The greatest advantage you can have in life is a creative mind and I don't believe that is something that can be taught. It can be stomped out of you as a child, but it's not learnable.

replies(1): >>40716407 #
23. surfingdino ◴[] No.40715724[source]
> They haven't even translated non-English material and mixed it all in yet (that I know of)

The current performance of LLMs on non-English languages is disappointing. Feeding them more non-English material is not guaranteed to help.

> This is big because it would hold novel data the West doesn't access. What is the 'mood' of the average Chinese farmer on Taiwan.

The average Chinese farmer does not produce textual output of that kind. It is generally not advisable to put your thoughts in writing in oppressive regimes. It could be a life-ending mistake.

> Otherwise it's hard to see how adding more text of the same thing is going to create a revolution.

The LLM gang are like the people who think they can get slimmer by eating more.

> Video will be something new. But if like "Her" it watches every Twitch stream simultaneously for a month, and is talking to a billion people for a month and still doesn't get it what else is going to happen?

Since when does Twitch carry broadcasts that have any value to humanity? Is it used to hold scientific discussions? Or for shooting the shit and pushing paid products and services?

24. amanaplanacanal ◴[] No.40715727{4}[source]
A carbon tax would be the most free market way to do it: tax fossil carbon as it comes out of the ground. The market can handle the rest. Can’t seem to make that happen politically, though.
replies(1): >>40717387 #
25. trott ◴[] No.40715822{3}[source]
The stream of data from vision does NOT explain why humans learn 1000x faster: children who lost their sight early on can grow up to be intelligent. They can learn English, for example. They don't need to hear 200B words, like GPT-3.
replies(3): >>40716628 #>>40716999 #>>40720531 #
26. nope1000 ◴[] No.40715950{3}[source]
We also inherit a lot of network topology already
27. nkrisc ◴[] No.40716247{3}[source]
> I doubt he'd have even learnt to read on his own.

Of course not, not any more than he’d learn to program a computer on his own.

Reading and writing are not natural skills. They are a technology that was invented by humans and they must be taught in some capacity. As we learn to read and write, parts of our brain related to language and other skills (which we do possess naturally) are co-opted to enable reading and writing, an unnatural skill.

Intelligence and language, however, are natural human abilities that we have evolved, likely over millions of years. Any parent will tell you this is obvious. It’s amazing how much children are able to infer and learn on their own.

28. lassoiat ◴[] No.40716407{4}[source]
I have come to the point that it is not really fair to the LLM to statistically train it on human output and expect it to come up with something more than the average.

There will be much value in automating the tedious and the routine.

Of course, that doesn't make for a great science fiction story. We first have to placate all these science fiction fantasies and in the process we will automate the tedious and the routine as a side effect of trying to figure out how many AGI can dance on the head of a pin.

Then human creativity will just be worth all the more.

29. dgoodell ◴[] No.40716466[source]
As someone who works for NASA, I'm not so sure. You'd be surprised how much stuff gets randomly thrown away to save space.

And it's going to get worse now that paper files are disappearing.

I wanted some info and data from a test we did 9 years ago. It was a pretty big deal, lots of people involved, many millions of dollars, multiple nasa centers contributing. Every single person on the test randomly kept their own files for the portion of the test they were responsible for. And the only copy of the raw test data was deleted by one of them to save some space when upgrading. There is no record anywhere of what equipment was used for the test.

One of my coworkers has 4 TB external HDD that he keeps everything he has ever worked on. It’s not backed up anywhere else. It just failed and he thought he lost everything, luckily I was able to recover most of it. Wtf.

30. bluedevilzn ◴[] No.40716485{3}[source]
Helen Keller is the counterpoint to this. She was obviously intelligent and had no sight or sound as input, only text.
31. glimshe ◴[] No.40716624{3}[source]
The geniuses and stewards of our civilization of just a couple of decades ago were trained on this very data. We don't yet know what outcome we'll get by handing the world over to people trained on "new, up-to-date, factually correct, egalitarian and legal" data.
replies(1): >>40723207 #
32. bhickey ◴[] No.40716628{4}[source]
The human brain isn't randomly initialized. It's undergone 500m years of pretraining.
replies(2): >>40717032 #>>40719440 #
33. bastien2 ◴[] No.40716666[source]
That's because humans learn in stages of growing complexity and semantic depth, and LLMs don't.

The chatbots do what infant humans do: mimic what they "see" until they get the pattern consistently matching what they saw, without any capacity to understand what they are doing.

Once humans have that part done, whole new layers of semantic learning kick in and create the critical analyses we perceive as "intelligence".

LLMs, as a consequence of their design, lack those deeper layers. They are not artificially intelligent at all. Rather, they're the latest iteration of what centuries ago gave us steam-powered songbirds.

34. LoganDark ◴[] No.40716999{4}[source]
Humans use bottom-up reinforcement learning, but nearly all LLMs use gradient descent. Not only are those completely different directions (bottom-up as in humans versus top-down as in gradient descent) with completely different emergent behavior, but minimizing loss is not in the reward function of a human, even if schools like to think it makes for an effective education. (I'd argue it doesn't.)
35. LoganDark ◴[] No.40717032{5}[source]
This makes me wonder if human brains can be genetically predisposed to a particular dominant language. I'd imagine not since that isn't typically a factor in selection, but I still wonder.
replies(2): >>40717173 #>>40717313 #
36. LoganDark ◴[] No.40717153[source]
Not only that, but a dataset that includes LLM-generated content has been known to reduce model quality. I remember there being a paper on it but I can't seem to find it now. Essentially, the internet now being chock full of LLM garbage means that any model you train on it is going to end up quite a bit worse than it could have been, simply because of the dataset being "poisoned" by preexisting LLMs. I bet OpenAI's only real advantage is having a dataset that was gathered before LLM use was widespread.
37. bhickey ◴[] No.40717173{6}[source]
I doubt it. Language and human evolution operate on different time scales. We wouldn't be able to converse with someone from 13th-century England. If anything, I would expect selective pressure on languages: those that are easy to use are more likely to be adopted.

Secondarily, I would expect this effect to be swamped by other factors (e.g. conquest).

38. Grimblewald ◴[] No.40717313{6}[source]
From what I have read and have come to understand, it is more that we are generally predisposed to human language in general, specific portions of our brain especially so.
replies(1): >>40717555 #
39. DrNosferatu ◴[] No.40717327[source]
Also, important to keep in mind the inevitable contamination with AI-generated content in datasets from now on.
40. DrNosferatu ◴[] No.40717387{5}[source]
Careful with carbon tax plans, they can be regressive:

https://blogs.worldbank.org/en/energy/what-carbon-tax-can-do...

41. LoganDark ◴[] No.40717555{7}[source]
> it is more that we are generally predisposed to human language in general

I understand, that's not what I was wondering.

42. trott ◴[] No.40719440{5}[source]
> The human brain isn't randomly initialized. It's undergone 500m years of pretraining.

All of the information accumulated by evolution gets passed through DNA. For humans, that's well under 1GB. Probably a very tiny fraction of that determines how the brain works at the algorithmic level. You should think of this information as the "software" of the brain, not pretrained LLM weights (350GB for GPT-3).
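A quick version of the arithmetic, with round figures assumed for the genome length and for GPT-3's weight precision:

    base_pairs = 3.1e9                   # approximate human genome length
    genome_gb = base_pairs * 2 / 8 / 1e9 # 2 bits per base (A/C/G/T)
    print(genome_gb)                     # ~0.8 GB per copy, "well under 1GB"

    gpt3_gb = 175e9 * 2 / 1e9            # 175B params at 16 bits each
    print(gpt3_gb)                       # ~350 GB of weights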

43. spiralk ◴[] No.40719559{5}[source]
Sure, I agree, but if we compare the value generated per unit of energy, it would still probably be better than many non-essential industries: the entertainment industry, fashion industry, alcohol, etc. Even in their current state, LLMs can provide more practical value than industries with higher energy and CO2 footprints.
44. spiralk ◴[] No.40719695{6}[source]
I don't see increasing demand for nuclear power as a disadvantage. We have nuclear material that can last humanity thousands of years at least. The CO2 footprint is an issue, but nuclear is much better than the alternatives. Personally, I think it's better that we utilize more energy and discover new breakthroughs while society is relatively stable and functioning, because there's no guarantee that it will last. Population collapse seems imminent in more educated societies; even China and India are trending this way now. Without some level of AI assistance, humanity would likely lose a great deal of productive output.

Also, if this path to AGI does not work out, it's not as though the nuclear reactors will be wasted. People will find something else to do with the energy.

45. BriggyDwiggs42 ◴[] No.40720349{3}[source]
I'd be willing to bet the proportion that is factual is roughly the same as in today's data. We love to think we're better, but most people aren't expert scientists who know all of the latest literature. Also, what information is illegal?
replies(2): >>40721595 #>>40723173 #
46. lostmsu ◴[] No.40720531{4}[source]
Even audio is several orders of magnitude larger. Uncompressed stereo is about 100 kilobytes per second, so an hour is already over a third of a gigabyte, and a year is ~3 TB.
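The arithmetic, taking the round 100 KB/s figure at face value (CD-quality stereo is actually ~176 KB/s, which only makes the totals larger):

    bytes_per_sec = 100_000            # round figure for uncompressed stereo
    per_hour_gb = bytes_per_sec * 3600 / 1e9
    per_year_tb = bytes_per_sec * 3.15e7 / 1e12
    print(per_hour_gb, per_year_tb)    # ~0.36 GB/hour, ~3.15 TB/year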
replies(1): >>40720817 #
47. resiros ◴[] No.40720599[source]
Lots of assumptions here. First, that we will only be training on text data; if we take into consideration all the video and audio shared, I am quite sure we would have one or two orders of magnitude more data. Second, that it even matters: there has been some early research showing that training on the right data improves prediction more than training on more data (which intuitively makes sense; training on papers and books is much more useful than training on YouTube comments). Additionally, lots of the improvements in quality are because of RLHF, which is basically manual human labeling. And last, my guess is that improvements in architecture are what will unlock the next level of performance, not just scaling.
replies(1): >>40721542 #
48. trott ◴[] No.40720817{5}[source]
> Uncompressed stereo is 100 kilobytes per second.

How much of that is cognitively useful for learning English? On top of the textual content, audio gives you emphasis and mood. Not a lot of information in that -- a few bits per sentence.

replies(1): >>40721824 #
49. trott ◴[] No.40721542[source]
> Lots of assumption here. First, that we will only be training on text data, if we take into considerations all the videos and audios shared I am quite sure we would have one or two orders of magnitude more of data.

1GB of text is way more useful for generating text than 1GB of video is.

> training on the right data improves prediction more than training on more data

Books are more useful than Facebook rants. But this is an argument for data scarcity rather than for data abundance.

50. throwaway48476 ◴[] No.40721595{4}[source]
I'd argue in an era of hyper competition the incentives to lie are higher than before.
51. lostmsu ◴[] No.40721824{6}[source]
Nearly all of it. You need a lot of pictures without cats to explain what a cat is.
replies(2): >>40722319 #>>40723538 #
52. nojvek ◴[] No.40721846[source]
The argument about running out of data is kind of stupid.

We have billions of cameras, microphones and IMU/GPS sensors. In fact, one in almost every pocket and on every desk.

Survival requires intelligence being energy and resource efficient.

Those who build the most powerful and useful models that run locally on edge and are data efficient have a higher chance of winning.

Whoever provides the cheapest, fastest, most useful models will keep on winning.

53. makapuf ◴[] No.40722319{7}[source]
But you don't need millions of pictures of lions as a kid to know what a lion is.
replies(1): >>40723371 #
54. stubish ◴[] No.40723173{4}[source]
I'd bet that today's data is worse. Data used to be expensive, because you needed to print it or store it on expensive media. Now, bullshit is cheaper than ever to reproduce. And more publishing than ever is about making money rather than disseminating information. But I think the point holds that adding even more, different bullshit into the training data with a few new facts isn't going to improve the quality of output.

Illegal information depends on jurisdiction. Even before you get to governments restricting access to facts, unflattering opinions, or information they consider immoral, you have information considered fraudulent, defamatory and perhaps even treasonous. Secret information might also count here, since governments don't want that ending up in your training data even if they do trust your cloud storage. Illegal information can get you thrown in prison if you seek it out or publish it or sometimes even just read it.

55. stubish ◴[] No.40723207{4}[source]
We hope they used their reason to maintain their knowledge over the years, or at least updated their poor fashion choices. Or maybe not, given that so much effort is made to enforce moral opinions from biblical times.
56. lostmsu ◴[] No.40723371{8}[source]
Neither do CNNs, so I don't quite see your point. You are throwing numbers around without good estimates. Get decent estimates for both children and NNs, then make categorical conclusions.

Better even measure in bytes. And remember that kids look at video, not at individual pictures (even if these are videos of pictures).

57. trott ◴[] No.40723538{7}[source]
> Nearly all of it.

Maybe you misunderstood me. I'm not talking about learning to understand spoken English.

You don't need hearing or vision at all to grow up to be intelligent (and able to write English).

replies(1): >>40725184 #
58. lostmsu ◴[] No.40725184{8}[source]
What is your point exactly? Did you estimate the raw amount of data received by people to make claims about data efficiency?