Matmul on Blackwell: Part 2 – Using Hardware Features to Optimize Matmul

(www.modular.com)

23 points robertvc | 3 comments | 05 Sep 25 18:16 UTC


subharmonicon ◴[07 Sep 25 00:29 UTC] No.45154175[source]▶

TLDR: In order to get good performance you need to use vendor-specific extensions that result in the same lock-in Modular has been claiming they will enable you to avoid.

replies(2): >>45154429 #>>45156295 #

totalperspectiv ◴[07 Sep 25 01:16 UTC] No.45154429[source]▶

>>45154175 #

I don’t follow your logic. Mojo can target multiple GPU vendors. What is the Modular-specific lock-in?

replies(2): >>45154650 #>>45156105 #

smilekzs ◴[07 Sep 25 01:49 UTC] No.45154650[source]▶

>>45154429 #

Not OP, but I think this could be an instance of a leaky abstraction at work. Most of the time you hand-write an accelerator kernel to optimize runtime performance. If the abstraction/compiler does not fully insulate you from micro-architectural details that affect performance in non-trivial ways (e.g. memory bank conflicts, as mentioned in the article), then you still end up with per-vendor implementations, or compile-time if-else blocks all over the place. This is less than ideal, but still arguably better than working with separate vendor APIs or, worse, completely separate toolchains.
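To make the bank-conflict point concrete, here is a rough sketch in plain Python (not Mojo, and not code from the article). It models shared memory as `num_banks` word-wide banks and counts how many threads land on the same bank when each thread reads one element of a column from a row-major tile. The function names, the tile shape, and the default of 32 banks are assumptions chosen for illustration; real hardware has more rules than this model captures.

```python
from collections import Counter


def max_bank_conflict(row_stride_words: int, num_banks: int = 32,
                      num_threads: int = 32) -> int:
    """Worst-case number of threads mapped to a single bank.

    Thread t reads word offset t * row_stride_words, i.e. element [t][0]
    of a row-major tile. A result of 1 means the access is conflict-free
    in this simplified model.
    """
    hits = Counter((t * row_stride_words) % num_banks
                   for t in range(num_threads))
    return max(hits.values())


def padded_stride(tile_width_words: int, num_banks: int = 32,
                  num_threads: int = 32) -> int:
    """Smallest row stride >= tile width that makes column reads conflict-free."""
    stride = tile_width_words
    while max_bank_conflict(stride, num_banks, num_threads) > 1:
        stride += 1
    return stride


if __name__ == "__main__":
    # A 32-word-wide tile: every thread in the column hits the same bank.
    print(max_bank_conflict(32))
    # Padding each row by one word spreads the column across all banks.
    print(max_bank_conflict(33))
    print(padded_stride(32))
```

The bank count and thread-group width are parameters here precisely because they are hardware details. A kernel that hard-codes the padding for one target is the kind of code that ends up behind a per-vendor branch.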

replies(1): >>45154893 #

whimsicalism ◴[07 Sep 25 02:38 UTC] No.45154893[source]▶

>>45154650 #

Yes, it looks like they have some sort of metaprogramming setup (nicer than C++) for doing this: https://www.modular.com/mojo

replies(1): >>45158770 #

1. totalperspectiv ◴[07 Sep 25 14:59 UTC] No.45158770[source]▶

>>45154893 #

I can confirm, it’s quite nice.

replies(1): >>45160567 #

2. whimsicalism ◴[07 Sep 25 17:56 UTC] No.45160567[source]▶

>>45158770 (TP) #

Just wondering: why do you use Mojo here over Triton or the new Pythonic CuTe/CUTLASS?

replies(1): >>45167931 #

3. totalperspectiv ◴[08 Sep 25 13:22 UTC] No.45167931[source]▶

>>45160567 #

Because I was originally writing some very CPU-intensive SIMD stuff, which Mojo is also fantastic for. Once I got that working and running nicely, I decided to try getting the same algo running on GPU since, at the time, they had just open-sourced the GPU parts of the stdlib. It was really easy to get going with.

I have not used Triton/CuTe/CUTLASS though, so I can't compare against anything other than CUDA really.
