Though honestly I don’t think new neural network architectures are going to get us over this local maximum. I think the next steps forward involve something that’s:
1. Non-lossy
2. Readily interpretable
I remain highly skeptical. I doubt that transformers are the best architecture possible, but they set a high bar. And it sure seems like people who keep making the suggestion that "transformers aren't the future" aren't good enough to actually clear that bar.
https://arcprize.org/blog/hrm-analysis#analyzing-hrms-contri...
Nothing about the human brain is "readily interpretable", and artificial neural networks, which, unlike brains, can be instrumented and experimented on easily, tend to resist interpretation nonetheless.
If there were an easy way to reduce ML to "readily interpretable" representations, someone would have done it already. If there were architectures that performed similarly but were orders of magnitude more interpretable, they would be used, because interpretability is desirable. Instead, we get what we get.
If any midwit can say "X is deeply flawed" but no one can put together a Y that beats X, then clearly, pointing out the flaws was never the bottleneck at all.
It's not a linear process, so I'm not sure the "bottleneck" analogy holds here.
We're not limited to only talking about "the bottleneck". I think the argument is more that we're very close to optimal results for the current approach/architecture, so getting superior outcomes from AI will actually require meaningfully different approaches.
The preceding seq2seq architectures had been RNN (LSTM) based, then RNN + attention (Bahdanau et al., "Jointly Learning to Align and Translate"); the Transformer "Attention Is All You Need" paper then showed you could drop the RNN altogether and just use attention.
Of course, NOT using RNNs was the key motivator behind the new Transformer architecture: not only did you not NEED an RNN, the authors explicitly wanted to avoid one, since the goal was parallel rather than sequential processing for better performance on the available highly parallel hardware.
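For anyone who hasn't looked at the details, here's a rough NumPy sketch of that difference (illustrative only, random untrained weights, shapes made up): the RNN step is a loop whose iterations depend on each other, while attention is just a few big matrix multiplies over the whole sequence at once.

```python
# Illustrative sketch (NumPy): why dropping the RNN enables parallelism.
# An RNN processes tokens one at a time, each step depending on the previous
# hidden state; scaled dot-product attention computes all pairwise
# interactions in one shot of matrix math, which maps well onto parallel hardware.
import numpy as np

def rnn_forward(x, W_h, W_x):
    """Sequential: h_t depends on h_{t-1}, so the loop cannot be parallelized."""
    h = np.zeros(W_h.shape[0])
    hs = []
    for x_t in x:                                # one step per token, in order
        h = np.tanh(W_h @ h + W_x @ x_t)
        hs.append(h)
    return np.stack(hs)

def attention_forward(x, W_q, W_k, W_v):
    """Parallel: every position attends to every other via batched matmuls."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # (seq, seq) computed all at once
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)    # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))                      # 6 tokens, model dim 8
print(rnn_forward(x, rng.normal(size=(8, 8)), rng.normal(size=(8, 8))).shape)
print(attention_forward(x, *(rng.normal(size=(8, 8)) for _ in range(3))).shape)
```

Both produce a (6, 8) output here, but only the attention path is a fixed number of dense ops regardless of sequence order, which is exactly what the GPUs of the time were good at.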
For CNNs, we know very well how the early layers work - edge detectors, curve detectors, etc. This understanding decays further into the model. In the brain, V1/V2 are similarly well studied, but it breaks down deeper into the visual cortex - and the sheer architectural complexity there sure doesn't help.
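The edge-detector point is easy to check yourself by plotting a pretrained model's first-layer filters. A rough sketch, assuming torchvision and matplotlib are installed (the choice of resnet18 and its ImageNet weights here is just for illustration, and the weights download on first use):

```python
# Hedged sketch: visualize the first conv layer of a pretrained CNN.
# The 7x7 RGB filters come out looking mostly like oriented edges and
# color blobs, which is where the "early layers are edge detectors" claim comes from.
import matplotlib.pyplot as plt
import torchvision

model = torchvision.models.resnet18(weights="DEFAULT")
filters = model.conv1.weight.detach()            # shape (64, 3, 7, 7)

# Normalize all filters to [0, 1] so each can be displayed as an RGB patch.
f_min, f_max = filters.min(), filters.max()
filters = (filters - f_min) / (f_max - f_min)

fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for ax, f in zip(axes.flat, filters):
    ax.imshow(f.permute(1, 2, 0).numpy())        # (7, 7, 3) for imshow
    ax.axis("off")
plt.show()
```

Doing the same for layer three or four gives you nothing you can eyeball, which is the "understanding decays further into the model" part.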
Ironically, the same could be said about Attention Is All You Need in 2017. It didn’t drive any improvements immediately; actually decent Transformer models took a few years to arrive after that.