TL;DR: To get good performance you need to use vendor-specific extensions, which bring back the same lock-in Modular has been claiming it will let you avoid.
replies(2):
I have not used Triton/CuTe/CUTLASS though, so I can't really compare against anything other than CUDA.