
56 points trott | 25 comments
1. makapuf ◴[] No.40714795[source]
Funny that it doesn't take that much data to train your average 20th-century human genius. I'd say that if we're dreaming about the future of AI, learning and reasoning seem like the greatest issue, not data. That said, the article title is about LLMs, so I guess that's what will need changing.
replies(3): >>40715430 #>>40715643 #>>40716666 #
2. jstanley ◴[] No.40715430[source]
Humans aren't just text interfaces though. The majority of your input is not textual but is sights, sounds, feelings, etc., that LLMs don't (yet?) have access to.

Humans receive an enormous amount of training data in forms not currently available to LLMs.

If you locked baby Einstein in a room with the collected works of humanity and left him there for a lifetime, I doubt he'd have even learnt to read on his own.

replies(6): >>40715609 #>>40715647 #>>40715822 #>>40715950 #>>40716247 #>>40716485 #
3. devoutsalsa ◴[] No.40715609[source]
I remember listening to an AI researcher in some interview over 20 years ago. He said that in his quest to create an AI, he realized at some point he could just have kids instead.
4. verdverm ◴[] No.40715643[source]
This has become my thinking as well. There is plenty of data for humans, so this is really a hardware/algorithms problem, not a data one. Using lots of text and energy is just a step on the path. We are already moving on to text+image, and it won't be long until video, and later robots (with sensors) too.
5. mrweasel ◴[] No.40715647[source]
We also have a funny way of applying solutions and lessons learned in one field to problems in completely unrelated areas. Given the statistical nature of LLMs, I'm not convinced that they are able to look across fields the way a human brain can; they lack creativity.

The greatest advantage you can have in life is a creative mind and I don't believe that is something that can be taught. It can be stomped out of you as a child, but it's not learnable.

replies(1): >>40716407 #
6. trott ◴[] No.40715822[source]
The stream of data from vision does NOT explain why humans learn 1000x faster: children who lost their sight early on can grow up to be intelligent. They can learn English, for example. They don't need to hear 200B words, like GPT-3.
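As a back-of-envelope check on that 1000x figure (the per-year word count below is an assumed round number, not a measurement):

```python
# Rough comparison: words heard by a child vs. GPT-3's training data.
# Assumptions (hypothetical round numbers):
#   - a child hears on the order of 10 million words per year
#   - GPT-3 trained on ~200 billion words, per the figure in this thread

words_per_year = 10_000_000           # assumed speech exposure per year
years = 20                            # time to grow into a fluent adult
human_words = words_per_year * years  # ~200 million words

gpt3_words = 200_000_000_000          # ~200B words

ratio = gpt3_words / human_words
print(f"Human exposure: ~{human_words:,} words")
print(f"GPT-3 exposure: ~{gpt3_words:,} words")
print(f"Ratio: ~{ratio:,.0f}x")       # ~1000x, matching the claim above
```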
replies(3): >>40716628 #>>40716999 #>>40720531 #
7. nope1000 ◴[] No.40715950[source]
We also inherit a lot of network topology already
8. nkrisc ◴[] No.40716247[source]
> I doubt he'd have even learnt to read on his own.

Of course not, not any more than he’d learn to program a computer on his own.

Reading and writing are not natural skills. They are a technology that was invented by humans and they must be taught in some capacity. As we learn to read and write, parts of our brain related to language and other skills (which we do possess naturally) are co-opted to enable reading and writing, an unnatural skill.

Intelligence and language, however, are natural human abilities that we have evolved, likely over millions of years. Any parent will tell you this is obvious. It’s amazing how much children are able to infer and learn on their own.

9. lassoiat ◴[] No.40716407{3}[source]
I have come to the point where I think it's not really fair to the LLM to train it statistically on human output and then expect it to come up with something more than the average.

There will be much value in automating the tedious and the routine.

Of course, that doesn't make for a great science fiction story. We first have to play out all these science fiction fantasies, and in the process we will automate the tedious and the routine as a side effect of trying to figure out how many AGIs can dance on the head of a pin.

Then human creativity will just be worth all the more.

10. bluedevilzn ◴[] No.40716485[source]
Helen Keller is the counterpoint to this. She was obviously intelligent and had no sight or sound as input, only text.
11. bhickey ◴[] No.40716628{3}[source]
The human brain isn't randomly initialized. It's undergone 500m years of pretraining.
replies(2): >>40717032 #>>40719440 #
12. bastien2 ◴[] No.40716666[source]
That's because humans learn in stages of growing complexity and semantic depth, and LLMs don't.

The chatbots do what infant humans do: mimic what they "see" until they can reproduce the pattern consistently, without any capacity to understand what they are doing.

Once humans have that part done, whole new layers of semantic learning kick in and create the critical analyses we perceive as "intelligence".

LLMs, as a consequence of their design, lack those deeper layers. They are not artificially intelligent at all. Rather, they're the latest iteration of what centuries ago gave us steam-powered songbirds.

13. LoganDark ◴[] No.40716999{3}[source]
Humans use bottom-up reinforcement learning, but nearly all LLMs use gradient descent. Not only are those completely different directions (bottom-up as in humans versus top-down as in gradient descent) with completely different emergent behavior, but minimizing loss is not in the reward function of a human, even if schools like to think it makes for an effective education. (I'd argue it doesn't.)
14. LoganDark ◴[] No.40717032{4}[source]
This makes me wonder if human brains can be genetically predisposed to a particular dominant language. I'd imagine not since that isn't typically a factor in selection, but I still wonder.
replies(2): >>40717173 #>>40717313 #
15. bhickey ◴[] No.40717173{5}[source]
I doubt it. Language and human evolution operate on different time scales. We wouldn't be able to converse with someone from 13th-century England. If anything, I'd expect selective pressure on languages themselves: those that are easier to use are more likely to be adopted.

Secondarily, I would expect this effect to be swamped by other factors (e.g. conquest).

16. Grimblewald ◴[] No.40717313{5}[source]
From what I have read and have come to understand, it is more that we are predisposed to human language in general, with specific portions of our brain especially so.
replies(1): >>40717555 #
17. LoganDark ◴[] No.40717555{6}[source]
> it is more that we are generally predisposed to human language in general

I understand, that's not what I was wondering.

18. trott ◴[] No.40719440{4}[source]
> The human brain isn't randomly initialized. It's undergone 500m years of pretraining.

All of the information accumulated by evolution gets passed through DNA. For humans, that's well under 1GB. Probably a very tiny fraction of that determines how the brain works at the algorithmic level. You should think of this information as the "software" of the brain, not pretrained LLM weights (350GB for GPT-3).
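The sizes in that comparison work out roughly as follows (using the standard ~3.1B base-pair genome length and fp16 storage for GPT-3's 175B parameters; both are approximations):

```python
# Rough sizes: information in the human genome vs. GPT-3's weights.
base_pairs = 3.1e9                    # approximate human genome length
bits_per_base = 2                     # 4 possible bases -> 2 bits each
genome_bytes = base_pairs * bits_per_base / 8
print(f"Genome: ~{genome_bytes / 1e6:.0f} MB")           # well under 1 GB

params = 175e9                        # GPT-3 parameter count
bytes_per_param = 2                   # fp16 storage
weights_bytes = params * bytes_per_param
print(f"GPT-3 weights: ~{weights_bytes / 1e9:.0f} GB")   # ~350 GB
```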

19. lostmsu ◴[] No.40720531{3}[source]
Even audio is several orders of magnitude larger. Uncompressed stereo is roughly 100 kilobytes per second, so an hour is already ~0.5 gigabytes, and a year is ~3 TB.
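At CD-quality PCM (44.1 kHz, 16-bit stereo) the arithmetic comes out somewhat higher than those round figures, but in the same ballpark:

```python
# Data rate of uncompressed CD-quality stereo audio.
sample_rate = 44_100                  # samples per second
channels = 2
bytes_per_sample = 2                  # 16-bit PCM
rate = sample_rate * channels * bytes_per_sample   # bytes per second
print(f"{rate / 1000:.0f} kB/s")                   # 176 kB/s

per_hour = rate * 3600
per_year = per_hour * 24 * 365
print(f"~{per_hour / 1e9:.2f} GB per hour")        # ~0.64 GB
print(f"~{per_year / 1e12:.1f} TB per year")       # ~5.6 TB
```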
replies(1): >>40720817 #
20. trott ◴[] No.40720817{4}[source]
> Uncompressed stereo is 100 kilobytes per second.

How much of that is cognitively useful for learning English? On top of the textual content, audio gives you emphasis and mood. Not a lot of information in that -- a few bits per sentence.

replies(1): >>40721824 #
21. lostmsu ◴[] No.40721824{5}[source]
Nearly all of it. You need a lot of pictures without cats to explain what a cat is.
replies(2): >>40722319 #>>40723538 #
22. makapuf ◴[] No.40722319{6}[source]
But you don't need millions of pictures of lions as a kid to know what a lion is.
replies(1): >>40723371 #
23. lostmsu ◴[] No.40723371{7}[source]
Neither do CNNs, so I don't quite see your point. You are throwing around numbers without good estimates. Get decent estimates for both children and NNs, then draw categorical conclusions.

Better yet, measure in bytes. And remember that kids watch video, not individual pictures (even if those are videos of pictures).

24. trott ◴[] No.40723538{6}[source]
> Nearly all of it.

Maybe you misunderstood me. I'm not talking about learning to understand spoken English.

You don't need hearing or vision at all to grow up to be intelligent (and able to write English).

replies(1): >>40725184 #
25. lostmsu ◴[] No.40725184{7}[source]
What is your point exactly? Did you estimate the raw amount of data received by people before making claims about data efficiency?