
80 points homarp | 5 comments
1. Lerc No.44610720
Has there been much research into slightly flawed matrix multiplications?

If you have a measure of correctness and a measure of performance, is there a maximum value of correctness per unit of processing that lies below a full matrix multiply?

Obviously it can be done with precision, since that is what floating point is. But is there anything where you can save x% of the computation and have fewer than x% incorrect values in a matrix multiplication?

Gradient descent wouldn't really care about a few (reliably) dud values.
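
A toy numpy experiment to make the trade-off concrete (hypothetical numbers, plain random dropping, not a method from the literature): skip a random fraction of the inner-product terms and count how many output entries drift noticeably.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((128, 512))
    B = rng.standard_normal((512, 128))
    exact = A @ B

    for drop in (0.1, 0.3, 0.5):
        keep = rng.random(A.shape[1]) > drop              # randomly skip this share of the inner dimension
        approx = (A[:, keep] @ B[keep, :]) / keep.mean()  # rescale so the estimate stays unbiased
        wrong = np.mean(np.abs(approx - exact) > 0.1 * exact.std())
        print(f"dropped ~{drop:.0%} of the work, {wrong:.1%} of entries off by >0.1*std")

For unstructured random matrices the dropped terms add noise of order sqrt(drop * k) to every entry, so naive dropping gives far more than x% bad values; the replies below point at sketching, quantisation and importance-ordered variants instead.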

replies(4): >>44610899 #>>44614746 #>>44614820 #>>44617249 #
2. wuubuu No.44610899
Randomized matrix sketching is one way to get at this (see https://arxiv.org/abs/2302.11474); the problem is that hardware is heavily optimized for dense multiplies, so what you save in FLOPs doesn't translate into real runtime speed-ups.
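
A minimal numpy sketch of one classic randomized scheme (importance-sampled column/row pairs, in the style of Drineas-Kannan-Mahoney), not the specific sketches surveyed in the linked paper:

    import numpy as np

    def sampled_matmul(A, B, k, seed=0):
        """Unbiased Monte Carlo estimate of A @ B from k sampled column/row pairs."""
        rng = np.random.default_rng(seed)
        n = A.shape[1]
        p = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)   # sampling probabilities
        p /= p.sum()
        idx = rng.choice(n, size=k, p=p)
        scale = 1.0 / (k * p[idx])                                  # importance-sampling weights
        return (A[:, idx] * scale) @ B[idx, :]

    A = np.random.randn(256, 1024)
    B = np.random.randn(1024, 256)
    exact = A @ B
    approx = sampled_matmul(A, B, k=256)                            # ~25% of the inner dimension
    print(np.linalg.norm(exact - approx) / np.linalg.norm(exact))   # relative Frobenius error

Even though this does roughly a quarter of the FLOPs, the gather plus a smaller GEMM rarely beats the dense kernel on current hardware, which is the point above.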
3. WithinReason No.44614746
If you do it in 8-bit it's usually 2x as fast as 16-bit on Tensor Cores.
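
Roughly what that trades away, as a numpy sketch of symmetric per-tensor int8 quantisation (the 2x throughput itself comes from the hardware and isn't modelled here):

    import numpy as np

    def quantize_int8(x):
        """Symmetric per-tensor quantisation: x ~= scale * q, with q in [-127, 127]."""
        scale = np.abs(x).max() / 127.0
        q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
        return q, scale

    rng = np.random.default_rng(0)
    A = rng.standard_normal((64, 256)).astype(np.float32)
    B = rng.standard_normal((256, 64)).astype(np.float32)
    qa, sa = quantize_int8(A)
    qb, sb = quantize_int8(B)
    # int32 accumulation, one dequantisation at the end -- the shape of an int8 GEMM kernel
    approx = (qa.astype(np.int32) @ qb.astype(np.int32)) * (sa * sb)
    exact = A @ B
    print(np.linalg.norm(exact - approx) / np.linalg.norm(exact))   # relative error from 8-bit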
4. MayeulC No.44614820
Well, approximate computing seems to be a superset of the field you describe here, with many different approaches, including analog computation. As you say, some algorithms care a bit less about precision, especially for LSBs.
5. kolinko No.44617249
I did research on vector-matrix multiplication last year:

https://kolinko.github.io/effort/

For semi-random weights you can get down to 20-30% of the multiplications/memory reads and maintain ~0.98 cosine similarity between the approximated and the full result.

As far as LLM inference goes, the speedup from removing multiplications is at best comparable to the speedup from quantisation (that is, you get at best a similar KL divergence score whether you remove calculations or quantise).
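
A rough sketch of the measurement (not the bucketed implementation from the link; it computes every product just to build the mask, so it only checks accuracy, not speed): keep the largest |x_i * W_ij| contributions and compare against the full result.

    import numpy as np

    def topk_vecmat(x, W, frac):
        """Keep only the top `frac` of products by |x_i * W_ij| magnitude."""
        prods = x[:, None] * W                               # every individual product
        k = int(frac * prods.size)
        thresh = np.partition(np.abs(prods).ravel(), -k)[-k]  # magnitude cutoff for the top frac
        return (prods * (np.abs(prods) >= thresh)).sum(axis=0)

    rng = np.random.default_rng(0)
    x = rng.standard_normal(4096)
    W = rng.standard_normal((4096, 1024))
    full = x @ W
    approx = topk_vecmat(x, W, frac=0.25)                    # keep ~25% of the products
    cos = approx @ full / (np.linalg.norm(approx) * np.linalg.norm(full))
    print(f"cosine similarity: {cos:.3f}")

Turning the skipped products into an actual speedup needs a sparse/bucketed kernel, which is what the linked project is about.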