Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data

(epochai.org)

56 points trott | 2 comments | 18 Jun 24 02:04 UTC

makapuf ◴[18 Jun 24 06:52 UTC] No.40714795[source]▶

Funny that it does not take that much data to train your average 20th century human genius. I'd say that if we are dreaming of the future of AI, learning and reasoning seem to be the greatest issue, not data. That said, the article title is about LLMs, so that's what will need changing I guess.

replies(3): >>40715430 #>>40715643 #>>40716666 #

jstanley ◴[18 Jun 24 08:45 UTC] No.40715430[source]▶

>>40714795 #

Humans aren't just text interfaces though. The majority of your input is not textual but is sights, sounds, feelings, etc., that LLMs don't (yet?) have access to.

Humans receive an enormous amount of training data in forms not currently available to LLMs.

If you locked baby Einstein in a room with the collected works of humanity and left him there for a lifetime, I doubt he'd have even learnt to read on his own.

replies(6): >>40715609 #>>40715647 #>>40715822 #>>40715950 #>>40716247 #>>40716485 #

trott ◴[18 Jun 24 09:56 UTC] No.40715822[source]▶

>>40715430 #

The stream of data from vision does NOT explain why humans learn 1000x faster: children who lost their sight early on can grow up to be intelligent. They can learn English, for example. They don't need to hear 200B words, like GPT-3.

replies(3): >>40716628 #>>40716999 #>>40720531 #

lostmsu ◴[18 Jun 24 18:10 UTC] No.40720531[source]▶

>>40715822 #

Even audio is several orders of magnitude larger. Uncompressed stereo is 100 kilobytes per second. So an hour is already about 0.36 gigabytes. A year is ~3 TB.
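
For anyone who wants to check the arithmetic, here is a rough back-of-the-envelope sketch. It takes the 100 kilobytes per second figure at face value and assumes audio arrives around the clock, with no allowance for sleep. The CD-quality rate (44.1 kHz, 16-bit, two channels) is not from the comment and is included only as a comparison point; the names `audio_bytes` and `CD_BYTES_PER_SECOND` are made up for this illustration.

```python
# Back-of-the-envelope check of the audio data volumes discussed above.

# Round figure taken from the comment: 100 kilobytes per second.
BYTES_PER_SECOND = 100_000

# Assumption (not from the comment): CD-quality stereo,
# 44,100 samples/s * 2 bytes per sample * 2 channels.
CD_BYTES_PER_SECOND = 44_100 * 2 * 2

SECONDS_PER_HOUR = 60 * 60
# Assumption: 24 hours a day, 365 days a year, no allowance for sleep.
SECONDS_PER_YEAR = SECONDS_PER_HOUR * 24 * 365


def audio_bytes(seconds: int, rate: int = BYTES_PER_SECOND) -> int:
    """Return the raw byte count for `seconds` of audio at `rate` bytes/s."""
    return seconds * rate


per_hour = audio_bytes(SECONDS_PER_HOUR)
per_year = audio_bytes(SECONDS_PER_YEAR)
cd_per_hour = audio_bytes(SECONDS_PER_HOUR, CD_BYTES_PER_SECOND)
cd_per_year = audio_bytes(SECONDS_PER_YEAR, CD_BYTES_PER_SECOND)

print(f"100 kB/s:   {per_hour / 1e9:.2f} GB per hour, {per_year / 1e12:.2f} TB per year")
print(f"CD quality: {cd_per_hour / 1e9:.2f} GB per hour, {cd_per_year / 1e12:.2f} TB per year")
```

At 100 kilobytes per second this gives 0.36 GB per hour and about 3.15 TB per year. Under the CD-quality assumption the totals are somewhat under twice that. Either way this counts raw bytes only and says nothing about how much of it is useful for learning.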

replies(1): >>40720817 #

trott ◴[18 Jun 24 18:47 UTC] No.40720817[source]▶

>>40720531 #

> Uncompressed stereo is 100 kilobytes per second.

How much of that is cognitively useful for learning English? On top of the textual content, audio gives you emphasis and mood. Not a lot of information in that -- a few bits per sentence.

replies(1): >>40721824 #

lostmsu ◴[18 Jun 24 20:20 UTC] No.40721824[source]▶

>>40720817 #

Nearly all of it. You need a lot of pictures without cats to explain what a cat is.

replies(2): >>40722319 #>>40723538 #

1. trott ◴[19 Jun 24 00:37 UTC] No.40723538[source]▶

>>40721824 #

> Nearly all of it.

Maybe you misunderstood me. I'm not talking about learning to understand spoken English.

You don't need hearing or vision at all to grow up to be intelligent (and able to write English).

replies(1): >>40725184 #

2. lostmsu ◴[19 Jun 24 05:41 UTC] No.40725184[source]▶

>>40723538 (TP) #

What is your point exactly? Did you estimate the raw amount of data received by people before making claims about data efficiency?