507 points martinald | 2 comments
qrios ◴[] No.45053956[source]
An interesting calculation, for sure. Just one remark from someone with bare-metal GPU experience:

> But compute becomes the bottleneck in certain scenarios. With long context sequences, attention computation scales quadratically with sequence length.

Even if the statement about quadratic scaling is right, the bottleneck we are talking about is north of that by a factor of about 1000. If 10k cores each do only simple matrix operations, every core needs fresh data (up to 64 kB) available every 500 cycles (let's say). Feeding that amount of data (without _any_ collision) means something like 100+ GB/s per core, i.e. over 1 PB/s in aggregate. Even 2+ TB/s of HBM bandwidth leaves the memory transfer rate as the bottleneck, by a factor of something like 500. With collisions, we are talking about an additional factor of around 5000 (the last time I ran tests on a 4090).
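
Rough numbers, as a quick Python sketch (the 1 GHz clock is my assumption; the other figures are the ones above):

    # Back-of-envelope version of the numbers above. The clock speed
    # is an assumption; the rest are the figures from the text.
    cores = 10_000
    bytes_per_refill = 64 * 1024      # up to 64 kB fresh data per core
    cycles_per_refill = 500           # "let's say"
    clock_hz = 1e9                    # assumed ~1 GHz

    refill_s = cycles_per_refill / clock_hz
    per_core = bytes_per_refill / refill_s    # ~131 GB/s per core
    aggregate = per_core * cores              # ~1.3 PB/s total
    hbm = 2e12                                # 2 TB/s HBM bandwidth

    print(per_core / 1e9, aggregate / 1e12, aggregate / hbm)
    # -> ~131 GB/s per core, ~1311 TB/s aggregate, ~655x over HBM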

replies(1): >>45054207 #
Onavo ◴[] No.45054207[source]
What do you mean by collision?
replies(2): >>45054438 #>>45054585 #
qrios ◴[] No.45054585[source]
If multiple cores try to read the same memory address, the MMU feeds only one core; the others have to wait. Depending on the type of RAM, this can cost a lot of cycles.

GPU MMUs can handle multiple lines in parallel, but not 10k cores at the same time. And the HBM is not able to transfer 3.5 TB/s sequentially.
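
A toy model of what a collision costs (my sketch; the 32-bank figure is purely illustrative, not the spec of any particular GPU):

    # Requests that map to the same bank in the same cycle get
    # serialized, one per cycle; the busiest bank dominates.
    from collections import Counter

    def cycles_to_serve(addresses, num_banks=32):
        hits = Counter(addr % num_banks for addr in addresses)
        return max(hits.values())

    print(cycles_to_serve(range(32)))   # 32 distinct banks -> 1 cycle
    print(cycles_to_serve([0] * 32))    # all the same address -> 32 cycles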

replies(1): >>45054631 #
whatshisface ◴[] No.45054631[source]
Why is that? It seems like multiple cores requesting the same address would be easier for the MMU to fetch for, not harder.
replies(3): >>45054700 #>>45054872 #>>45057841 #
reliabilityguy ◴[] No.45054872[source]
It’s not the fetching that is the problem, but serving the data to many cores at the same time from a single source.
replies(1): >>45056813 #
supersour ◴[] No.45056813{3}[source]
I'm not familiar with GPU architecture; is there not a shared L2/L3 data cache from which this data could be served?
replies(1): >>45066968 #
reliabilityguy ◴[] No.45066968[source]
The MMU has a finite number of ports that drive the data to the consumers. An extreme case: all 32 cores want the same piece of data at the same time.
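
As a rough sketch (the 4-port figure here is made up purely for illustration):

    # With P read ports and no broadcast path, handing one value to
    # M consumers takes ceil(M / P) cycles, even if it is cached.
    import math

    def fan_out_cycles(consumers, ports):
        return math.ceil(consumers / ports)

    print(fan_out_cycles(32, 4))    # 32 cores, 4 ports -> 8 cycles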