
507 points by martinald | 6 comments
chillee:
This article's math is wrong on many fundamental levels. One of the most obvious ones is that prefill is nowhere near bandwidth bound.

If you work out the MFU the author's numbers imply, it's 1.44 million input tokens per second * 37 billion active params * 2 (FMA) / 8 [GPUs per instance] ≈ 13.3 petaFLOP/s per GPU. That's roughly 7x the absolute peak FLOPS of the hardware. Obviously, that's impossible.
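
For concreteness, a quick back-of-the-envelope check of that number (a minimal sketch; the ~2 PFLOP/s peak is my assumed H100-class dense FP8 figure, not something stated above):

    # Sanity check of the implied compute. Throughput, params and GPU count are the
    # numbers above; the ~2 PFLOP/s peak is an assumed H100-class dense FP8 figure.
    input_tokens_per_sec = 1.44e6   # prefill throughput implied by the article's math
    active_params = 37e9            # active parameters per token (MoE)
    gpus_per_instance = 8

    flops_per_token = 2 * active_params                 # one multiply-add per active param
    implied_pflops_per_gpu = (input_tokens_per_sec * flops_per_token
                              / gpus_per_instance / 1e15)
    assumed_peak_pflops = 2.0                           # assumed per-GPU peak, dense FP8

    print(f"implied: {implied_pflops_per_gpu:.1f} PFLOP/s per GPU")                     # ~13.3
    print(f"ratio:   {implied_pflops_per_gpu / assumed_peak_pflops:.1f}x assumed peak")  # ~6.7x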

There are many other issues with this article, such as assuming only 32 concurrent requests(?), assuming only 8 GPUs per instance rather than the more efficient/standard prefill-decode disaggregated setups, and assuming that attention computation is the main thing that makes models compute-bound. It's a bit of an indictment of HN's understanding of LLMs that most people are bringing up issues with the article that aren't any of these fundamental misunderstandings.

Den_VR:
So, bottom line: do you think it's probable that either OpenAI or Anthropic is "losing money on inference"?
chillee:
No. In some sense, the article comes to the right conclusion haha. But it's probably >100x off on its central premise about output tokens costing more than input.
doctorpangloss:
I'm pretty sure input tokens are cheap because they want to ingest the data for training later, no? They want huge contexts to slice up.
martinald:
Thanks for the correction (author here). I'll update the article - very fair point on the input-token compute, which I messed up. Tbh I'm pleased my napkin math was only 7x off the laws of physics :).

Even rerunning the math on my use cases with way higher input token cost doesn't change much though.

chillee:
The choice of 32 parallel sequences is also arbitrary and significantly changes your conclusions. For example, if they run with 256 parallel sequences, that would make both prefill and decode roughly 8x cheaper in your calculations.
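
To make the batch-size effect concrete, here's a minimal sketch of bandwidth-bound decode; the HBM bandwidth and FP8 weight size are my assumptions, not numbers from the article or the comment above:

    # Rough sketch: in bandwidth-bound decode the active weights are streamed from
    # HBM once per step, and that cost is shared by every sequence in the batch.
    # Bandwidth and bytes-per-param are assumed (H100-class, FP8 weights); KV-cache
    # traffic is ignored to keep the point simple.
    ACTIVE_PARAMS = 37e9     # active parameters per token
    BYTES_PER_PARAM = 1      # FP8 weights (assumed)
    HBM_BW = 3.35e12         # bytes/s per GPU (assumed)
    GPUS = 8                 # GPUs per instance (the article's assumption)

    def decode_tokens_per_sec(batch_size: int) -> float:
        """One weight pass per decode step, amortized over the whole batch."""
        step_time = ACTIVE_PARAMS * BYTES_PER_PARAM / (HBM_BW * GPUS)  # seconds per step
        return batch_size / step_time

    for bs in (32, 256):
        tps = decode_tokens_per_sec(bs)
        print(f"batch={bs:3d}: ~{tps:,.0f} tokens/s -> per-token cost ~{32 / bs:.3f}x the batch-32 cost")

In that regime the per-token cost falls roughly linearly with batch size until KV-cache traffic or compute becomes the limiter, which is why 32 vs 256 swings the estimate by ~8x.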

The claim that attention needs long context lengths before the model becomes compute-bound is also quite misleading.
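
Here's a roofline-style sketch of why, with assumed H100-class numbers (my assumptions, not figures from the article): the dense matmuls alone make prefill compute-bound after only a few hundred prompt tokens, long before long-context attention matters.

    # Roofline-style sketch with assumed H100-class numbers (dense FP8 peak, HBM bandwidth).
    # In prefill the linear layers process all prompt tokens per weight read, so
    # arithmetic intensity grows with token count; attention length is not the driver.
    PEAK_FLOPS = 2e15        # FLOP/s per GPU (assumed)
    HBM_BW = 3.35e12         # bytes/s per GPU (assumed)
    BYTES_PER_PARAM = 1      # FP8 weights (assumed)

    # time_compute = 2 * P * T / PEAK_FLOPS   (T tokens through P active params)
    # time_memory  = P * BYTES_PER_PARAM / HBM_BW
    # compute dominates once T > PEAK_FLOPS * BYTES_PER_PARAM / (2 * HBM_BW)
    threshold_tokens = PEAK_FLOPS * BYTES_PER_PARAM / (2 * HBM_BW)
    print(f"linear layers go compute-bound above ~{threshold_tokens:.0f} tokens per pass")  # ~300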

Barbing:
Anyone up to publishing their own guess range?
awwaiid (replying to doctorpangloss):
AFAIK all the large providers flipped the default to contractually NOT train on your data. So no, harvesting large contexts for training data is not a factor.