If we had, there would be no reason to train a model with more parameters than are strictly necessary to represent the space's semantic structure. But then it should be impossible for distilled models with fewer parameters to come close to the original model's performance.
Yet this is exactly what happens - distilled or quantized models often come very close to the original.
So I think there is still a lot of low-hanging fruit to pick.
"My name is <?>" without distillation has only one valid answer (from the dataset) and everything else is wrong.
Whereas with distillation, you get lots of other names too (from the teacher), and you can add some weight to them too. That way, model learns faster, because it gets more information in each update.
(So instead of "My name is Foo", the model learns "My name is <some name, but in this case Foo>")
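Roughly, in code - a minimal PyTorch sketch of the standard soft-target distillation loss, where names like `distillation_loss`, `T`, and `alpha` are my own illustrative choices, not from any particular codebase:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, target, T=2.0, alpha=0.5):
    # Hard-label term: the single "correct" answer from the dataset ("Foo").
    hard = F.cross_entropy(student_logits, target)
    # Soft-label term: the teacher's full distribution over names, softened
    # with temperature T so that low-probability names still carry signal.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Blend the two: the student learns "some name, but in this case Foo".
    return alpha * soft + (1 - alpha) * hard
```

The soft term is where the extra information per update comes from: instead of a one-hot target, the student sees how the teacher spreads probability across the whole vocabulary.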