Modular has been pushing the idea that they are building technology for writing hardware-vendor-neutral solutions, so that users can break free of NVIDIA's hold on high-performance kernels.
From their own writing:
> We want a unified, programmable system (one small binary!) that can scale across architectures from multiple vendors—while providing industry-leading performance on the most widely used GPUs (and CPUs).
Basically, you give the compiler a good description of the hardware and it automatically generates a state-of-the-art GEMM kernel.
Maybe it's 20% worse than NVIDIA's hand-written kernels, but you can switch hardware vendors or build arbitrary fused kernels at will.
So you can support either vendor with performance about as good as that vendor's own library. That’s not lock-in, to me at least.
It’s not as good as a compiler that can magically produce optimized kernels for arbitrary hardware, fully agree there. But it’s a big step forward from CUDA/HIP.
I have not used Triton/CuTe/CUTLASS though, so I can't really compare against anything other than CUDA.
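To make the "hardware description in, GEMM kernel out" idea a bit more concrete, here is a toy sketch in plain Python. It is not Modular's API or how their compiler works: `HardwareDesc`, its tile fields, and the two vendor entries are all made up for illustration. The point is only that a single kernel source can be specialized by a per-target description, with the result staying the same while the loop tiling changes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardwareDesc:
    """Hypothetical hardware description; a real one would carry far more detail."""
    name: str
    tile_m: int  # rows of C computed per tile
    tile_n: int  # columns of C computed per tile
    tile_k: int  # depth of the reduction handled per tile


def tiled_gemm(a, b, hw):
    """Compute C = A @ B on lists of lists, tiling the loops according to hw."""
    m, k = len(a), len(a[0])
    n = len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, hw.tile_m):
        for j0 in range(0, n, hw.tile_n):
            for k0 in range(0, k, hw.tile_k):
                # Inner loops cover one tile; min() handles ragged edges.
                for i in range(i0, min(i0 + hw.tile_m, m)):
                    for j in range(j0, min(j0 + hw.tile_n, n)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + hw.tile_k, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c


# Two made-up targets with different preferred tile shapes.
VENDOR_A = HardwareDesc("vendor_a", tile_m=2, tile_n=2, tile_k=2)
VENDOR_B = HardwareDesc("vendor_b", tile_m=4, tile_n=1, tile_k=3)

if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    for hw in (VENDOR_A, VENDOR_B):
        print(hw.name, tiled_gemm(a, b, hw))
```

A real system would of course pick tile shapes, memory layouts, and instructions from a much richer description, and that selection is where the hard work (and the possible 20% gap) lives.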