
507 points martinald | 2 comments
chillee ◴[] No.45057409[source]
This article's math is wrong on many fundamental levels. One of the most obvious ones is that prefill is nowhere near bandwidth-bound.

If you work out the MFU implied by the author's numbers, it's 1.44 million input tokens per second * 37 billion active params * 2 (FMA) / 8 [GPUs per instance] ≈ 13 PFLOP/s per GPU. That's approximately 7x the absolute peak FLOPS of the hardware. Obviously, that's impossible.
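
Quick sanity check of that arithmetic (a rough sketch; the ~2e15 FLOP/s peak is my assumed H100-class dense FP8 figure, not something from the article):

    # Numbers from the comment above; peak is an assumed H100-class FP8 figure.
    input_tokens_per_s = 1.44e6   # article's implied prefill throughput
    active_params = 37e9          # active params per token
    flops_per_param = 2           # multiply + accumulate (FMA)
    gpus_per_instance = 8

    per_gpu_flops = input_tokens_per_s * active_params * flops_per_param / gpus_per_instance
    peak_fp8 = 2e15               # assumed dense FP8 peak per GPU

    print(f"implied: {per_gpu_flops:.2e} FLOP/s per GPU")       # ~1.3e16
    print(f"vs assumed peak: {per_gpu_flops / peak_fp8:.1f}x")  # ~6.7x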

There are many other issues with this article, such as assuming only 32 concurrent requests(?), only 8 GPUs per instance as opposed to the more efficient/standard prefill-decode disaggregated setups, and assuming that attention computation is the main thing that makes models compute-bound. It's a bit of an indictment of HN's understanding of LLMs that most people are bringing up issues with the article that aren't any of these fundamental misunderstandings.
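
For intuition on why prefill is compute-bound while small-batch decode isn't, here's a minimal roofline sketch (my own back-of-envelope; the H100-ish peak/bandwidth numbers and FP8 weights are assumptions):

    # Weights get read roughly once per forward pass, so arithmetic intensity
    # scales with tokens processed per pass. FP8 weights ~= 1 byte/param.
    peak_flops = 2e15             # assumed dense FP8 FLOP/s per GPU
    hbm_bw = 3.35e12              # assumed HBM bytes/s per GPU
    ridge = peak_flops / hbm_bw   # ~600 FLOP/byte needed to be compute-bound

    def intensity(tokens_per_pass, bytes_per_param=1):
        # FLOPs ~ 2 * params * tokens; bytes moved ~ params * bytes_per_param
        return 2 * tokens_per_pass / bytes_per_param

    print(intensity(8192), ridge)  # prefill, long prompt: ~16k >> ~600 -> compute-bound
    print(intensity(32), ridge)    # decode, 32 concurrent seqs: 64 << ~600 -> bandwidth-bound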

replies(5): >>45057603 #>>45057767 #>>45057801 #>>45058397 #>>45060353 #
Den_VR ◴[] No.45057603[source]
So, bottom line, do you think it’s probable that either OpenAI or Anthropic is “losing money on inference”?
replies(2): >>45057664 #>>45061050 #
chillee ◴[] No.45057664[source]
No. In some sense, the article comes to the right conclusion haha. But it's probably >100x off on its central premise about output tokens costing more than input.
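
To make that concrete, a toy cost model (all numbers are my own illustrative assumptions: a DeepSeek-V3-style MoE with ~671B total / 37B active params, H100-ish specs, FP8 weights, a decode batch of 256, ignoring KV-cache reads and attention FLOPs). Prefill is compute-bound, decode is bandwidth-bound because every step re-reads the weights, and the gap between those two rates is roughly the real input-vs-output cost ratio:

    # Toy cost model; every number here is an illustrative assumption.
    total_params = 671e9      # full MoE weights, read ~once per decode step at large batch
    active_params = 37e9      # params active per token
    gpus = 8
    prefill_flops = 2e15 * 0.4    # assume ~40% MFU per GPU during prefill
    hbm_bw = 3.35e12              # assumed HBM bytes/s per GPU
    batch = 256                   # concurrent decode sequences (assumed)

    prefill_tok_s = gpus * prefill_flops / (2 * active_params)  # ~86k input tok/s per instance
    decode_step = (total_params / gpus) / hbm_bw                # ~25 ms per step (FP8: 1 byte/param)
    decode_tok_s = batch / decode_step                          # ~10k output tok/s per instance

    print(prefill_tok_s / decode_tok_s)   # ~8x: output tokens pricier than input under these assumptions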
replies(2): >>45057722 #>>45057791 #
1. doctorpangloss ◴[] No.45057722[source]
I’m pretty sure input tokens are cheap because they want to ingest the data for training later, no? They want huge contexts to slice up.
replies(1): >>45062499 #
2. awwaiid ◴[] No.45062499[source]
Afaik all the large providers flipped the default to contractually NOT train on your data. So no, training data context size is not a factor.