←back to thread

507 points martinald | 1 comments | | HN request time: 0s | source
Show context
qrios ◴[] No.45053956[source]
For sure an interesting calculation. Only one remark from someone with GPU metal experience:

> But compute becomes the bottleneck in certain scenarios. With long context sequences, attention computation scales quadratically with sequence length.

Even if the statement about quadratically scales is right, the bottleneck we are talking about is somewhere north by factor 1000. If 10k cores do only simple matrix operations each needs to have new data (up to 64k) available every 500 cycles (let's say). Getting these amount of data (without _any_ collision) means something like 100+GByte/s per core. Even 2+TByte/s on HBM means the bottleneck is the memory transfer rate, by something like 500 times. With collision, we talk about an additional factor like 5000 (last time I've done some tests with a 4090).

replies(1): >>45054207 #
Onavo ◴[] No.45054207[source]
What do you mean by collision?
replies(2): >>45054438 #>>45054585 #
1. agf ◴[] No.45054438[source]
I believe it's that the bus can only serve one chip at a time, so it has to actually be faster since sometimes one chip's data will have to wait for the data of another chip to finish first.