There’s no real rule worthy of any respect, imho, that says LLMs can’t be configured to take additional input from images, audio, proprioception sensors, or any other modality. I could easily write a script to convert such data into tokens in any number of ways that would let it be fed in as tokens of a “language” — convolutions over image patches, for example, turn pixels into a sequence of embeddings. A real expert could do it more easily, or do a better job. And then don’t LeCun’s objections just evaporate? I don’t see why he thinks he has some profound point. For god’s sake, our own senses are heavily attenuated and mediated; we never actually experience raw reality, we just feel like we do. LLMs can be extended to be situated. So much can be done. It’s like seeing HTTP in 1993 and saying it won’t be enough for the full web… well, duh, but it’s a great start. Now go build on it.
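To make the “convolutions as tokenizer” point concrete, here’s a minimal sketch (not anyone’s production pipeline): a single strided convolution chops an image into patches and projects each patch to an embedding, producing a sequence of image “tokens” a transformer could consume next to text embeddings. The hyperparameters (patch_size=16, d_model=768) are illustrative assumptions on my part.

```python
import torch
import torch.nn as nn

class ConvImageTokenizer(nn.Module):
    def __init__(self, d_model: int = 768, patch_size: int = 16, in_chans: int = 3):
        super().__init__()
        # One strided convolution = "cut into patches and linearly project each one"
        self.proj = nn.Conv2d(in_chans, d_model, kernel_size=patch_size, stride=patch_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, 3, H, W) -> (batch, num_patches, d_model)
        x = self.proj(images)                 # (B, d_model, H/ps, W/ps)
        return x.flatten(2).transpose(1, 2)   # (B, num_patches, d_model)

tokenizer = ConvImageTokenizer()
img = torch.randn(1, 3, 224, 224)
image_tokens = tokenizer(img)
print(image_tokens.shape)  # torch.Size([1, 196, 768]) -- 196 image "tokens"
```

That’s the whole trick: once the sensor data is a sequence of embeddings, the model doesn’t care where it came from.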
If anything, the flaw in LLMs is that they maintain only one primary thread of prediction. But this is changing: having a bunch of threads work on the same problem and check each other from different angles will be an obvious fix for a lot of issues.
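Here’s a minimal sketch of what I mean by threads checking each other: sample several independent attempts at the same problem and keep the answer they converge on (self-consistency style voting). `sample_answer` is a stand-in for whatever model call you actually have; here it’s a fake sampler so the sketch runs on its own.

```python
import random
from collections import Counter

def sample_answer(question: str, temperature: float = 0.8) -> str:
    # Placeholder for a real model call with sampling turned on.
    # The fake distribution just simulates noisy-but-mostly-right threads.
    return random.choices(["42", "41"], weights=[8, 1])[0]

def answer_by_consensus(question: str, n_threads: int = 9) -> str:
    # Run several independent "threads" of prediction on the same problem...
    attempts = [sample_answer(question) for _ in range(n_threads)]
    # ...then let them check each other: the majority answer wins.
    winner, count = Counter(attempts).most_common(1)[0]
    print(f"{count}/{n_threads} threads agree on {winner!r}")
    return winner

answer_by_consensus("What is 6 x 7?")
```

Majority voting is the crudest version; the more interesting variants have the threads critique each other’s reasoning rather than just their final answers, but the principle is the same.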