
507 points | martinald | 1 comment
ekelsen No.45052850
The math on the input tokens is definitely wrong. It claims each instance (8 GPUs) can handle 1.44 million tokens/sec of input. Let's check that out.

1.44e6 tokens/sec * 37e9 bytes/token / 3.3e12 bytes/sec/GPU = ~16,000 GPUs

And that's assuming the more likely figure of 1 byte per parameter.

So the article is only off by a factor of at least 1,000. I didn't check any of the rest of the math, but that probably has some impact on their conclusions...
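
For concreteness, the same back-of-the-envelope check as a few lines of Python, reusing the numbers above (~37e9 active parameters at 1 byte each, ~3.3e12 bytes/sec of HBM bandwidth per GPU; these are my assumptions, not figures from the article):

    # If every input token had to stream all the active parameters from HBM,
    # how many GPUs would the claimed input rate require?
    claimed_tokens_per_sec = 1.44e6        # per 8-GPU instance, per the article
    bytes_per_token = 37e9 * 1             # assumed active params * 1 byte/param
    hbm_bytes_per_sec_per_gpu = 3.3e12     # assumed HBM bandwidth per GPU

    gpus_needed = claimed_tokens_per_sec * bytes_per_token / hbm_bytes_per_sec_per_gpu
    print(f"{gpus_needed:,.0f} GPUs")      # ~16,000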

replies(5): >>45052936 >>45052942 >>45052964 >>45053047 >>45053166
GaggiX No.45052942
Your calculations make no sense. Why are you loading the model for each token independently? You can process all the input tokens at the same time as long as they can fit in memory.

You are doing the calculation as if they were output tokens in a single batch; it would not make sense even in the decode phase.
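
A rough sketch of the amortization, reusing the assumed weight size and bandwidth from the parent comment (this ignores KV-cache and activation traffic):

    # In prefill the weights are read once per forward pass for the whole batch
    # of prompt tokens, so the weight-bandwidth cost per token shrinks as the
    # batch grows.
    weight_bytes = 37e9                    # assumed active params at 1 byte each
    hbm_bytes_per_sec_per_gpu = 3.3e12     # assumed HBM bandwidth per GPU

    for tokens_per_pass in (1, 1_000, 10_000):
        bytes_per_token = weight_bytes / tokens_per_pass
        tokens_per_sec = hbm_bytes_per_sec_per_gpu / bytes_per_token
        print(f"{tokens_per_pass:>6} tokens/pass -> {tokens_per_sec:,.0f} tok/s/GPU "
              "(weight bandwidth only)")

With large prefill batches the weight-read bandwidth stops being the limit and compute takes over.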

replies(2): >>45053669 >>45055050
ekelsen No.45055050
Then the right calculation is to use FLOPs, not bandwidth as they did.
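
Something like the following, as a minimal sketch; the peak FLOPS and utilization figures are placeholders of mine, not numbers from the article:

    # Compute-bound (FLOPs) prefill estimate instead of a bandwidth one.
    active_params = 37e9                      # assumed active params per token
    flops_per_token = 2 * active_params       # ~2 FLOPs per parameter per token
    peak_flops_per_gpu = 1.0e15               # placeholder dense FP8 peak
    mfu = 0.4                                 # placeholder utilization
    gpus = 8

    prefill_tokens_per_sec = gpus * peak_flops_per_gpu * mfu / flops_per_token
    print(f"{prefill_tokens_per_sec:,.0f} tok/s (compute-bound estimate)")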