TLDR: In order to get good performance you need to use vendor-specific extensions that result in the same lock-in Modular has been claiming they will enable you to avoid.
replies(2):
Basically, you need a good description of the hardware and the compiler automatically generates the state of the art GEMM kernel.
Maybe it's 20% worse than Nvidia's hand written kernels, but you can switch hardware vendors or build arbitrary fused kernels at will.