That's baloney. The old ML adage "there's no data like more data" is as old as mankind itself.
That's baloney. The old ML adage "there's no data like more data" is as old as mankind itself.
My only surprise is how long it took to get to imagenet, but in retrospect, I appreciate that a number of conditions had to be met (much more data, much better algorithms, much faster computers). I also didn't recognize just how poorly MLPs were for sequence modelling, compared to RNNs and transformers.
I haven't invested the time to take the loss function from our paper and implement in a modern framework, but IIUC, I wouldn't need to provide the derivatives manually. That would be a satisfying outcome (indicating I had wasted a lot of effort learning math that simply wasn't necessary, because somebody had automated it better than I could do manually, in a way I can understand more easily).