56 points trott | 6 comments
makapuf No.40714795
Funny that it doesn't take that much data to train your average 20th-century human genius. I'd say that if we're dreaming about the future of AI, learning and reasoning seem like the bigger issue, not data. That said, the article title is about LLMs, so I guess that's what will need changing.
replies(3): >>40715430 >>40715643 >>40716666
jstanley No.40715430
Humans aren't just text interfaces, though. The majority of your input isn't textual; it's sights, sounds, feelings, etc. that LLMs don't (yet?) have access to.

Humans receive an enormous amount of training data in forms not currently available to LLMs.

If you locked baby Einstein in a room with the collected works of humanity and left him there for a lifetime, I doubt he'd have even learnt to read on his own.
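
Rough numbers to illustrate the scale gap (these are my own back-of-envelope assumptions, not measured figures: ~10 Mbit/s out of each retina, a commonly cited estimate; 12 waking hours a day; ~5 bytes per word; and the ~200B-word GPT-3 figure mentioned in this thread):

    # Back-of-envelope: raw visual input vs. an LLM's text corpus.
    # Assumptions (mine, for illustration only): ~10 Mbit/s per retina,
    # ~12 waking hours/day for 5 years, ~5 bytes per word.
    bits_per_sec = 10e6 * 2                # two eyes
    seconds = 5 * 365 * 12 * 3600          # five years of waking hours
    visual_terabytes = bits_per_sec * seconds / 8 / 1e12

    text_terabytes = 200e9 * 5 / 1e12      # ~200B words * ~5 bytes/word

    print(f"visual input over 5 years: ~{visual_terabytes:.0f} TB")
    print(f"GPT-3-scale text corpus:   ~{text_terabytes:.1f} TB")

Even with conservative assumptions, a few years of raw visual input comes out a couple of orders of magnitude larger than a GPT-3-scale text corpus.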

replies(6): >>40715609 >>40715647 >>40715822 >>40715950 >>40716247 >>40716485
trott No.40715822
The stream of data from vision does NOT explain why humans learn 1000x faster: children who lost their sight early on can still grow up to be intelligent. They can learn English, for example, and they don't need to hear 200B words like GPT-3 did.
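
To put a rough, illustrative number on that 1000x (assuming, as a ballpark of my own, that a child hears on the order of 10M words a year):

    # Illustrative ratio behind the "1000x" claim.
    # Assumption (mine): a child hears roughly ~10M words per year.
    child_words = 10e6 * 20    # ~2e8 words heard by age 20
    gpt3_words = 200e9         # the GPT-3 figure used in this comment

    print(f"GPT-3 text / a child's lifetime of speech: ~{gpt3_words / child_words:,.0f}x")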
replies(3): >>40716628 >>40716999 >>40720531
1. bhickey No.40716628
The human brain isn't randomly initialized. It's undergone 500m years of pretraining.
replies(2): >>40717032 >>40719440
2. LoganDark No.40717032
This makes me wonder if human brains can be genetically predisposed to a particular dominant language. I'd imagine not since that isn't typically a factor in selection, but I still wonder.
replies(2): >>40717173 >>40717313
3. bhickey No.40717173
I doubt it. Language and human evolution operate on different time scales; we wouldn't be able to converse with someone from 13th-century England. If anything, I'd expect the selective pressure to be on the languages themselves: those that are easy to use are more likely to be adopted.

Secondly, I'd expect any such effect to be swamped by other factors (e.g. conquest).

4. Grimblewald No.40717313
From what I've read and come to understand, it's more that we're predisposed to human language in general, with specific portions of our brain especially so.
replies(1): >>40717555
5. LoganDark No.40717555
> it's more that we're predisposed to human language in general

I understand, but that's not what I was wondering.

6. trott No.40719440
> The human brain isn't randomly initialized. It's undergone 500m years of pretraining.

All of the information accumulated by evolution gets passed on through DNA. For humans, that's well under 1 GB, and probably only a tiny fraction of that determines how the brain works at the algorithmic level. You should think of this information as the "software" of the brain, not as pretrained LLM weights (350 GB for GPT-3).
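
A quick sanity check on those two figures (genome size and the 2-bits-per-base encoding are standard numbers; treating GPT-3's 175B parameters as 2-byte fp16 values is my assumption):

    # Upper bound on heritable information vs. GPT-3's weights.
    base_pairs = 3.1e9                    # human genome, ~3.1 billion base pairs
    genome_mb = base_pairs * 2 / 8 / 1e6  # 2 bits per base (A/C/G/T) -> ~775 MB
    # ...and only a small fraction of that plausibly specifies brain wiring rules.

    gpt3_gb = 175e9 * 2 / 1e9             # 175B params at 2 bytes (fp16) -> 350 GB

    print(f"genome, raw upper bound: ~{genome_mb:.0f} MB")
    print(f"GPT-3 weights:           ~{gpt3_gb:.0f} GB")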