Humans receive an enormous amount of training data in forms not currently available to LLMs.
If you locked baby Einstein in a room with the collected works of humanity and left him there for a lifetime, I doubt he'd have even learnt to read on his own.
How much of that is cognitively useful for learning English? On top of the textual content, audio gives you emphasis and mood, but there is not a lot of information in that: a few bits per sentence.
It might even be better to measure it in bytes. And remember that kids watch video, not individual pictures (even if it is video of pictures).