I learned it from: https://youtube.com/playlist?list=PLLHTzKZzVU9eaEyErdV26ikyo...
Yann LeCun on one side, and Michael Bronstein and his colleagues on the other, are similar in that both are trying to properly "sciencify" Deep Learning.
Yann LeCun's approach, at least for vision, has one core tenet: energy minimization, just like in physics. In his course, he also shows several current architectures/algorithms to be special cases of energy-based models (EBMs).
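To make that concrete, here is a minimal sketch (my own toy example, not taken from the course) of the simplest such special case: an ordinary softmax classifier read as an EBM, where the energy of an (input, label) pair is the negative logit, inference is energy minimization, and softmax is the corresponding Gibbs distribution.

```python
import numpy as np

# Toy sketch: a linear softmax classifier viewed as an energy-based model.
# E(x, y) = -logit_y(x); prediction = argmin_y E(x, y);
# softmax over logits = Gibbs distribution exp(-E) / sum(exp(-E)).

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))   # hypothetical weights: 3 classes, 5 features
x = rng.normal(size=5)        # one input vector

def energy(x, y):
    # Negative logit of class y: low energy = good (x, y) compatibility.
    return -(W[y] @ x)

E = np.array([energy(x, y) for y in range(W.shape[0])])
y_hat = int(np.argmin(E))                # inference = energy minimization
probs = np.exp(-E) / np.exp(-E).sum()    # Gibbs distribution over labels
print(y_hat, probs.round(3))
```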
Yann believes that understanding the whys of DL algorithms' behavior will be more beneficial in the long term than playing around with hyper-params.
He also makes a case that language is too low-dimensional to lead to AGI, even if it is fully solved. For example, in a recent video he said that the total amount of data in all digitized books and on the internet is about the same as what a human child takes in during its first 4-5 years. He considers this low.
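As a rough sanity check on that comparison (my own illustrative figures, not necessarily the ones he used): depending on the assumed bandwidth, a few years of raw visual input comes out comparable to, or far larger than, the entire text corpus, which is his point.

```python
# Back-of-envelope, with assumed figures: public text corpora are often
# estimated around 1e13-2e13 bytes; a child is awake roughly 16,000 hours
# in its first 4 years, with optic-nerve throughput on the order of MB/s.

text_bytes = 2e13                        # assumed size of books + web text
wake_seconds = 16_000 * 3600             # ~16k waking hours in 4 years
visual_bytes = wake_seconds * 20e6       # assumed ~20 MB/s via the optic nerve

print(f"text: {text_bytes:.1e} B, vision: {visual_bytes:.1e} B, "
      f"ratio: {visual_bytes / text_bytes:.0f}x")  # vision dominates
```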
There are also epistemological arguments that language alone cannot lead to AGI, but I haven't heard him talk about those.
He also believes that vision is a more important aspect of intelligence, one reason being that it is very high-dimensional. (Edit) Consider an example. Take 4 monochrome pixels, each ranging from 0 to 255: together they can take 256^4 = 2^32 (about 4.3 billion) distinct configurations. Arranging 4 fixed words, by contrast, gives only 4! = 24 orderings. Solving language is easier and therefore lower-stakes. Remember the monkey producing a Shakespeare play by randomly punching typewriter keys? If that takes an astronomically long time, think how obscenely long it would take a monkey to paint the Mona Lisa by randomly assigning pixel values. Left as an exercise to the reader.
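If you don't want to do the exercise by hand, the arithmetic checks out in a couple of lines (the 100x100 canvas for the Mona Lisa is my own made-up size):

```python
import math

# Verify the toy counts above.
print(256 ** 4)            # 4_294_967_296 = 2**32 pixel configurations
print(math.factorial(4))   # 24 orderings of 4 fixed words

# Monkey-paints-Mona-Lisa version: even a tiny 100x100 8-bit monochrome
# canvas admits 256**10_000 configurations -- a number with ~24,083 digits.
print(math.floor(10_000 * math.log10(256)) + 1)
```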
Juergen Schmidhuber has gone rather quiet now. But he has also argued that a world model, explicitly included in training and reasoning, is better than relying on text or images alone. He has a good paper on this with David Ha ("World Models", 2018).
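For flavor, here's a minimal sketch of what "explicitly learning a world model" can mean (a toy linear version of the idea, nothing like the paper's actual VAE-plus-RNN setup): fit a transition function s' ≈ f(s, a) from logged experience, which a planner can then roll forward to reason about consequences.

```python
import numpy as np

# Toy world model: learn linear dynamics s' ~= A s + B a by least squares
# on logged (state, action, next_state) transitions.

rng = np.random.default_rng(0)
A_true = rng.normal(scale=0.3, size=(4, 4))    # hypothetical true dynamics
B_true = rng.normal(scale=0.3, size=(4, 2))

S = rng.normal(size=(1000, 4))                 # logged states
U = rng.normal(size=(1000, 2))                 # logged actions
S_next = S @ A_true.T + U @ B_true.T + 0.01 * rng.normal(size=(1000, 4))

X = np.hstack([S, U])                          # regress s' on [s, a]
W, *_ = np.linalg.lstsq(X, S_next, rcond=None)

def world_model(s, a):
    # Predicted next state under the learned dynamics (used for planning).
    return np.concatenate([s, a]) @ W

print(np.abs(world_model(S[0], U[0]) - S_next[0]).max())  # small residual
```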