80 points homarp | 5 comments
1. semessier ◴[] No.44611583[source]
Two years ago I'd have bet that by now matmul would live in transformer-optimized hardware costing a fraction of GPUs, with first-class support in torch and no reason to use GPUs any more. Wrong.
replies(2): >>44611628 #>>44613343 #
2. almostgotcaught ◴[] No.44611628[source]
> matmul would be in transformer-optimized hardware

It is... it's in GPUs lol

> first class in torch

It is
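It has been for a long time. A minimal sketch of what "first class" means here (assuming a recent PyTorch install; CPU is enough for the example):

```python
import torch

# Dense matmul is a first-class op in torch: its own kernel dispatch,
# the @ operator, and full autograd support.
a = torch.randn(128, 64, requires_grad=True)
b = torch.randn(64, 32)

c = torch.matmul(a, b)   # equivalent to a @ b
c.sum().backward()       # gradients flow straight through the matmul

print(c.shape)           # torch.Size([128, 32])
print(a.grad.shape)      # torch.Size([128, 64])
```

On CUDA devices, the same call is routed to tensor-core kernels where precision settings allow it; nothing in user code changes.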

> costing a fraction of GPUs

Why would anyone give you this for cheaper than GPUs lol?

replies(1): >>44611975 #
3. atty ◴[] No.44611975[source]
I think they’re referring to hardware like TPUs and other ASICs. Which also exist, of course :)
replies(1): >>44612010 #
4. almostgotcaught ◴[] No.44612010{3}[source]
Sure but GPUs literally have MMA engines now
5. gchadwick ◴[] No.44613343[source]
The real bottleneck is memory: optimize your matmul architecture all you like, but whilst it's still connected to a big chunk of HBM (or whatever your chosen high-bandwidth memory is) you can only do so much.

So really GPU vs. non-GPU (e.g. TPU) doesn't matter a whole lot if you've got fundamentally the same memory architecture.
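A quick roofline-style sanity check makes the point concrete: whether a matmul is compute-bound or memory-bound depends on its arithmetic intensity versus the chip's compute/bandwidth ratio, not on whether the chip is branded a GPU. The hardware numbers below are illustrative assumptions, not any real accelerator's specs.

```python
def matmul_intensity(n: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for C = A @ B with square n x n fp16 operands,
    assuming each matrix crosses the high-bandwidth memory once."""
    flops = 2 * n**3                           # n^3 multiply-adds
    bytes_moved = 3 * n * n * bytes_per_elem   # read A and B, write C
    return flops / bytes_moved

# Hypothetical accelerator: 300 TFLOP/s of matmul compute, 3 TB/s of HBM.
peak_flops = 300e12
peak_bw = 3e12
machine_balance = peak_flops / peak_bw  # FLOPs/byte needed to be compute-bound

for n in (256, 1024, 8192):
    ai = matmul_intensity(n)
    bound = "compute-bound" if ai >= machine_balance else "memory-bound"
    print(f"n={n:5d}: {ai:8.1f} FLOP/byte -> {bound}")
```

Small matmuls land on the memory side of the roofline no matter how fat the MMA engines are, which is why two chips with the same HBM configuration end up looking much the same.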