
507 points by martinald
chillee No.45057409
This article's math is wrong on many fundamental levels. One of the most obvious ones is that prefill is nowhere near bandwidth bound.

If you compute out the MFU implied by the author's numbers, you get 1.44 million input tokens per second * 37 billion active params * 2 (FMA) / 8 [GPUs per instance] ≈ 13 petaFLOPS per GPU. That's roughly 7x the absolute peak FLOPS of the hardware. Obviously, that's impossible.
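A back-of-the-envelope version of that check in Python; the ~2 PFLOPS per-GPU peak is an assumed H100-class FP8 dense figure (my assumption, not a number from the article):

    input_tokens_per_s = 1.44e6    # prefill throughput claimed in the article
    active_params = 37e9           # active parameters per token (DeepSeek-V3-style MoE)
    flops_per_mac = 2              # one multiply-accumulate = 2 FLOPs
    gpus_per_instance = 8

    implied_flops_per_gpu = input_tokens_per_s * active_params * flops_per_mac / gpus_per_instance
    assumed_peak_flops_per_gpu = 2e15   # ~2 PFLOPS: rough H100 FP8 dense peak (assumption)

    print(f"implied: {implied_flops_per_gpu / 1e15:.1f} PFLOPS per GPU")              # ~13.3
    print(f"implied MFU: {implied_flops_per_gpu / assumed_peak_flops_per_gpu:.0%}")   # ~670%, i.e. impossible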

There are many other issues with this article, such as assuming only 32 concurrent requests(?), assuming only 8 GPUs per instance as opposed to the more efficient and now-standard prefill/decode disaggregated setups, and assuming that attention computation is the main thing that makes models compute-bound. It's a bit of an indictment of HN's understanding of LLMs that most people are bringing up issues with the article that aren't any of these fundamental misunderstandings.
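On the batching point, here is a rough roofline sketch of when weight-streaming decode stops being bandwidth-bound; it ignores attention, KV-cache traffic, and MoE routing, and the peak FLOPS / HBM bandwidth are assumed H100-class figures, so treat it as an order-of-magnitude sketch only:

    peak_flops = 2e15        # FLOPs/s, assumed FP8 dense peak
    hbm_bandwidth = 3.35e12  # bytes/s, assumed HBM3 bandwidth
    bytes_per_param = 1      # FP8 weights

    # Per decode step the weights are streamed once (memory time), and each
    # sequence in the batch does ~2 FLOPs per active parameter (compute time).
    # Compute time exceeds memory time once:
    #   batch * 2 / peak_flops > bytes_per_param / hbm_bandwidth
    critical_batch = peak_flops * bytes_per_param / (2 * hbm_bandwidth)
    print(f"decode turns compute-bound around batch ≈ {critical_batch:.0f} sequences")  # ~300

Under these assumptions, a batch of only 32 sequences leaves decode deep in the bandwidth-bound regime, which is part of why real deployments batch far more aggressively than the article assumes.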

replies(5): >>45057603 #>>45057767 #>>45057801 #>>45058397 #>>45060353 #
pama No.45057767
Agree that the writeup is very wrong, especially for the output tokens. Here is how anyone with enough money to allocate a small cluster of powerful GPUs has been able to decode huge models at scale for nearly 4 months now, at a cost of about 0.2 USD per million output tokens (rough math sketched at the end of this comment).

https://lmsys.org/blog/2025-05-05-large-scale-ep/

This has gotten significantly cheaper still since then, through additional code optimizations and by using B200s.
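To sanity-check the 0.2 USD figure, here is the per-GPU decode throughput it implies; the ~2 USD/hour H100 rental price is my assumption for illustration, not a number from the blog post:

    gpu_cost_per_hour = 2.0        # USD/hour, assumed H100 rental price
    target_cost_per_mtok = 0.20    # USD per million output tokens (claim above)

    tokens_per_hour = 1e6 * gpu_cost_per_hour / target_cost_per_mtok
    tokens_per_second_per_gpu = tokens_per_hour / 3600
    print(f"needs ≈ {tokens_per_second_per_gpu:,.0f} output tokens/s per GPU")  # ~2,800

That is a couple of thousand output tokens per second per GPU, which is the scale of decode throughput that large-batch, expert-parallel serving setups like the one in the linked writeup are built to reach.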

replies(1): >>45059554 #
ma2rten No.45059554
You can also look at the prices of open-source models on OpenRouter, which are a fraction of the cost of closed-source models. This is a heavily commoditized market, so I would expect it to reflect the true cost plus a small margin.
replies(1): >>45060798 #
pama No.45060798
If you make careful calculations and estimate the theoretical margins for inference alone of most of the big open models on OpenRouter, the margins are typically crazy high if the OpenRouter providers serve at scale (north of 800% for most of the large models). The high prices probably reflect salaries, capital investments, and amortization of other expenses such as free serving or periods of only partial serving occupancy. It is also sometimes hard to keep a uniformly high load because of user preferences that don't get covered at any price, e.g. maximal context length (which costs output throughput), latency, and time to first token, but also things like privacy guarantees, or simply users switching quickly to the next best model. I have always thought that centralized inference is the real goldmine of AI, because you get so much value at scale for hardly any cost.
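A minimal sketch of that margin arithmetic, where the 2.00 USD list price is purely illustrative (not an actual OpenRouter quote) and the 0.20 USD serving cost comes from the at-scale estimate upthread:

    at_scale_cost_per_mtok = 0.20   # USD, at-scale decode cost estimated upthread
    list_price_per_mtok = 2.00      # USD, assumed list price for a large open model

    margin = (list_price_per_mtok - at_scale_cost_per_mtok) / at_scale_cost_per_mtok
    print(f"theoretical gross margin: {margin:.0%}")   # 900% under these assumptions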