Previously, he very publicly and strongly said:
a) LLMs can't do math. They trick us in poetry but that's subjective. They can't do objective math.
b) they can't plan
c) by the very nature of the autoregressive arch, errors compound. So the longer you go in your generation, the higher the error rate, and at long contexts the answers become utter garbage.
All of these were proven wrong, 1-2 years later. "a" at the core (gold at IMO), "b" w/ software glue and "c" with better training regimes.
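For anyone who hasn't seen it, the compounding argument in "c" boils down to something like the sketch below, as I understand it. It assumes every token has the same error probability and that errors are independent and unrecoverable. Those assumptions are mine for illustration (as are the function name and the example numbers), and they are exactly the part that didn't hold up in practice.

```python
def p_all_correct(per_token_error: float, n_tokens: int) -> float:
    """Probability that none of n_tokens is wrong.

    ASSUMPTION (illustrative only): each token is wrong independently with
    the same probability, and a single wrong token cannot be recovered from.
    """
    if not 0.0 <= per_token_error <= 1.0:
        raise ValueError("per_token_error must be in [0, 1]")
    if n_tokens < 0:
        raise ValueError("n_tokens must be non-negative")
    # Independent events multiply: (1 - e) for each of the n tokens.
    return (1.0 - per_token_error) ** n_tokens


if __name__ == "__main__":
    # Hypothetical per-token error rate, chosen only to show the trend.
    for n in (10, 100, 1_000, 10_000):
        print(n, p_all_correct(0.001, n))
```

Under those assumptions the success probability decays exponentially with length, which is why the claim sounded plausible. A model that can notice and correct its own mistakes breaks the "unrecoverable" assumption, so the simple product no longer applies.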
I'm not interested in the will-it-won't-it debates about AGI. I'm happy with what we have now, and I think these things are good enough now for several use cases. But it's important to note when people making strong claims get them wrong. Again, I think I get where he's coming from, but public stances aren't the place to get into the deep research minutiae.
That being said, I hope he gets to find whatever it is that he's looking for, and wish him success in his endeavours. Between him, Fei-Fei Li and Ilya, something cool has to come out of the small shops. Heck, I'm even rooting for the "let's commoditise LoRA training" that Mira's startup seems to be going for.
I think transformers have been proven to be general purpose, but that doesn't mean that we can't use new fundamental approaches.
To me it's obvious that researchers are acting like sheep as they always do. He's trying to come up with a real innovation.
LeCun has seen how new paradigms have taken over. Variations of LLMs are not the type of new paradigm that serious researchers should be aiming for.
I wonder if there can be a unification of spatial-temporal representations and language. I am guessing diffusion video generators already achieve this in some way. But I wonder if new techniques can improve the efficiency and capabilities.
I assume the Nested Learning stuff is pretty relevant.
Although I've never totally grokked transformers and LLMs, I always felt that MoE was the right direction and besides having a strong mapping or unified view of spatial and language info, there also should somehow be the capability of representing information in a non-sequential way. We really use sequences because we can only speak or hear one sound at a time. Information in general isn't particularly sequential, so I doubt that's an ideal representation.
So I guess I am kind of proposing variations of transformers myself, to be honest.
But besides being able to convert between sequential discrete representations and less discrete non-sequential representations (maybe you have tokens but every token has a scalar attached), there should be lots of tokenizations, maybe for each expert. Then you have experts that specialize in combining and translating between different scalar-token tokenizations.
Like automatically clustering problems or world model artifacts or something and automatically encoding DSLs for each sub problem.
I wish I really understood machine learning.