Matmul on Blackwell: Part 2 – Using Hardware Features to Optimize Matmul

(www.modular.com)

23 points robertvc | 2 comments | 05 Sep 25 18:16 UTC


subharmonicon ◴[07 Sep 25 00:29 UTC] No.45154175[source]▶

TLDR: In order to get good performance you need to use vendor-specific extensions that result in the same lock-in Modular has been claiming they will enable you to avoid.

replies(2): >>45154429 #>>45156295 #

totalperspectiv ◴[07 Sep 25 01:16 UTC] No.45154429[source]▶

>>45154175 #

I don’t follow your logic. Mojo can target multiple GPU vendors. What is the Modular-specific lock-in?

replies(2): >>45154650 #>>45156105 #

1. subharmonicon ◴[07 Sep 25 07:20 UTC] No.45156105[source]▶

>>45154429 #

The blog post is about using an NVIDIA-specific tensor core API that they have built to get good performance.

Modular has been pushing the notion that they are building technology that allows writing hardware-vendor-neutral solutions so that users can break free of NVIDIA's hold on high-performance kernels.

From their own writing:

> We want a unified, programmable system (one small binary!) that can scale across architectures from multiple vendors—while providing industry-leading performance on the most widely used GPUs (and CPUs).

replies(1): >>45158760 #

2. totalperspectiv ◴[07 Sep 25 14:58 UTC] No.45158760[source]▶

>>45156105 (TP) #

They allow you to write a kernel for NVIDIA, or for AMD, that takes full advantage of that vendor's hardware, then add a compile-time if statement to select which kernel to use based on the hardware being targeted.

So you can support either vendor with performance as good as the vendor's own library. That’s not lock-in, to me at least.

It’s not as good as the compiler magically producing optimized kernels for arbitrary hardware, fully agree there. But it’s a big step forward from CUDA/HIP.
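To make the compile-time switch concrete, here is a minimal sketch of the pattern in C++ rather than Mojo, using `if constexpr`. It is only an analogy: the `Vendor` enum, the two `*_path` functions, and the `matmul` wrapper are hypothetical names made up for illustration, and the "kernels" are plain CPU loops standing in for hand-tuned vendor code. Nothing here is Modular's, NVIDIA's, or AMD's actual API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical vendor tag; an assumption for this sketch, not a real API.
enum class Vendor { Nvidia, Amd };

using Matrix = std::vector<float>;  // row-major n x n

// Stand-in for an NVIDIA-tuned kernel (here just an i-k-j CPU loop).
inline void matmul_nvidia_path(const Matrix& a, const Matrix& b,
                               Matrix& c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

// Stand-in for an AMD-tuned kernel (here just an i-j-k CPU loop).
inline void matmul_amd_path(const Matrix& a, const Matrix& b,
                            Matrix& c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

// One entry point; the branch is resolved at compile time, so only the
// selected vendor path is instantiated for a given V.
template <Vendor V>
std::string matmul(const Matrix& a, const Matrix& b,
                   Matrix& c, std::size_t n) {
    assert(a.size() == n * n && b.size() == n * n);
    c.assign(n * n, 0.0f);  // zero the output before accumulating
    if constexpr (V == Vendor::Nvidia) {
        matmul_nvidia_path(a, b, c, n);
        return "nvidia";
    } else {
        matmul_amd_path(a, b, c, n);
        return "amd";
    }
}
```

Calling `matmul<Vendor::Nvidia>(a, b, c, n)` and `matmul<Vendor::Amd>(a, b, c, n)` gives the same result through different code paths. The caller's code stays the same across vendors, but each vendor-specific body still has to be written and tuned by hand, which is the point both commenters are circling.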
