
507 points by martinald | 2 comments
jsnell No.45051797
I don't believe the asymmetry between prefill and decode is that large. If it were, it would make no sense for most of the providers to have separate pricing for prefill with cache hits vs. without.
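To put rough numbers on that argument, here's a back-of-the-envelope sketch. The prices below are made-up illustrative figures, not any provider's actual rate card; the idea is that the gap between cache-miss and cache-hit input pricing roughly prices the prefill compute a cache hit skips, and comparing that to the output price gives a ballpark decode:prefill cost ratio per token. If prefill really were ~1000x cheaper than decode, that discount would be too small to be worth metering.

    # Hypothetical per-million-token prices (illustrative only, not a real rate card)
    price_input_cache_miss = 0.27  # $ per 1M input tokens, cache miss
    price_input_cache_hit = 0.07   # $ per 1M input tokens, cache hit
    price_output = 1.10            # $ per 1M output tokens

    # A cache hit skips prefill compute, so the discount roughly prices that compute.
    implied_prefill_cost = price_input_cache_miss - price_input_cache_hit  # ~$0.20 per 1M tokens

    # Output price is a loose upper bound on per-token decode cost (it includes margin/overhead).
    ratio = price_output / implied_prefill_cost
    print(f"implied decode:prefill cost per token <= ~{ratio:.1f}x")  # single digits, not 1000x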

Given the analysis is based on R1, Deepseek's actual in-production numbers seem highly relevant: https://github.com/deepseek-ai/open-infra-index/blob/main/20...

(But yes, they claim 80% margins on the compute in that article.)

> When established players emphasize massive costs and technical complexity, it discourages competition and investment in alternatives

But it's not the established players emphasizing the costs! They're typically saying that inference is profitable. Instead the false claims about high costs and unprofitability are part of the anti-AI crowd's standard talking points.

martinald No.45051921
Yes. I was really surprised at this myself (author here). If you have some better numbers, I'm all ears. Even on my lowly 9070XT I get 20x the tok/s for input vs. output, and I'm not doing batching or anything locally.

I think the cache hit vs. miss pricing makes sense at >100k tokens, where you start getting compute-bound.
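If you want to sanity-check a ratio like that locally, here's a minimal sketch; every number in it is a hypothetical placeholder, so substitute your own measurements from whatever runner you use:

    # Convert two easy local measurements into prefill/decode throughput and their ratio.
    # All numbers below are hypothetical placeholders; substitute your own measurements.
    prompt_tokens = 2048          # tokens in the prompt you fed the model
    time_to_first_token_s = 2.5   # wall-clock seconds until the first output token
    generated_tokens = 256        # tokens generated after the first one
    generation_time_s = 6.4       # wall-clock seconds spent generating them

    prefill_tps = prompt_tokens / time_to_first_token_s   # ~820 tok/s
    decode_tps = generated_tokens / generation_time_s     # ~40 tok/s

    print(f"prefill: {prefill_tps:,.0f} tok/s, decode: {decode_tps:,.0f} tok/s, "
          f"ratio: {prefill_tps / decode_tps:.0f}x")

Note that a single unbatched stream makes decode look much slower relative to prefill than a batched production deployment would, since decode throughput scales with batch size until it becomes compute-limited.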

1. jsnell No.45052374
I linked to the writeup by Deepseek with their actual numbers from production, and you want "better numbers" than that?!

> Each H800 node delivers an average throughput of ~73.7k tokens/s input (including cache hits) during prefilling or ~14.8k tokens/s output during decoding.

That's a 5x difference, not 1000x. It also lines up with their pricing, as one would expect.

(The decode throughputs they give are roughly equal to yours, but you're claiming prefill performance 200x higher than they can achieve.)
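Worked out from the quoted figures (the 8-GPUs-per-node number is the standard H800 node configuration, taken as an assumption here):

    # Per-node throughput from the DeepSeek writeup quoted above.
    prefill_tps_node = 73_700  # input tok/s per H800 node (incl. cache hits)
    decode_tps_node = 14_800   # output tok/s per H800 node

    print(f"prefill:decode throughput ratio ~ {prefill_tps_node / decode_tps_node:.1f}x")  # ~5.0x

    # Assuming the usual 8 GPUs per H800 node:
    gpus_per_node = 8
    print(f"per GPU: ~{prefill_tps_node / gpus_per_node:,.0f} tok/s prefill, "
          f"~{decode_tps_node / gpus_per_node:,.0f} tok/s decode")
    # => roughly 9,200 prefill and 1,850 decode tok/s per GPU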

2. smarterclayton No.45053309
A good rule of thumb is that a prefill token is about 1/6th the compute cost of a decode token, and that you can get about 15k prefill tokens a second on Llama3 8B on a single H100. Bigger models will require more compute per token, and quantization like FP8 or FP4 will require less.
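A rough sketch of what that rule of thumb implies for per-token cost; the H100 rental price is an assumed illustrative figure, and this ignores batching efficiency, memory-bandwidth limits on decode, and margins:

    # Rule of thumb from above: ~15k prefill tok/s on Llama3 8B on one H100,
    # with a decode token costing ~6x the compute of a prefill token.
    prefill_tps = 15_000
    decode_tps_compute_bound = prefill_tps / 6  # ~2,500 tok/s if purely compute-limited

    h100_cost_per_hour = 2.50  # $/hr, assumed rental price (illustrative only)
    cost_per_m_prefill = h100_cost_per_hour / (prefill_tps * 3600 / 1e6)
    cost_per_m_decode = h100_cost_per_hour / (decode_tps_compute_bound * 3600 / 1e6)

    print(f"~${cost_per_m_prefill:.3f} per 1M prefill tokens")  # ~$0.046
    print(f"~${cost_per_m_decode:.3f} per 1M decode tokens")    # ~$0.278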