
507 points martinald | 2 comments
_sword ◴[] No.45055003[source]
I've done the modeling on this a few times and I always get to a place where inference can run at 50%+ gross margins, depending mostly on GPU depreciation and how good the host is at optimizing utilization. The big swing factor is whether you count model training costs as part of the calculation. If model training isn't capitalized and amortized, margins are great. If it does have to be amortized into the calculation... yikes
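Roughly the shape of the calculation, with made-up illustrative numbers (not my actual inputs):

    # Toy margin model -- every number here is an illustrative assumption.
    def inference_gross_margin(
        revenue_per_m_tokens=10.00,      # $ charged per million tokens served (assumed)
        tokens_per_gpu_hour=1.0e6,       # sustained throughput per GPU (assumed)
        gpu_cost_per_hour=2.50,          # depreciation + power + hosting per GPU-hour (assumed)
        utilization=0.60,                # fraction of GPU-hours actually serving traffic
        training_amort_per_m_tokens=0.0  # set > 0 to amortize training cost over tokens served
    ):
        tokens_served_m = tokens_per_gpu_hour * utilization / 1e6
        revenue = tokens_served_m * revenue_per_m_tokens
        cost = gpu_cost_per_hour + tokens_served_m * training_amort_per_m_tokens
        return (revenue - cost) / revenue

    print(f"serving only:            {inference_gross_margin():.0%}")                                # ~58%
    print(f"with training amortized: {inference_gross_margin(training_amort_per_m_tokens=3.00):.0%}")  # ~28%

Same serving setup, and the margin roughly halves once training has to be recovered per token.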
replies(7): >>45055030 #>>45055275 #>>45055536 #>>45055820 #>>45055835 #>>45056242 #>>45056523 #
1. ozgune ◴[] No.45055835[source]
I agree that you could get to high margins, but I think the modeling holds only if you're an AI lab operating at scale with a setup tuned for your model(s). I think the most open study on this one is from the DeepSeek team: https://github.com/deepseek-ai/open-infra-index/blob/main/20...

For others, I think the picture is different. When we ran benchmarks on DeepSeek-R1 on 8x H200 SXM using vLLM, we got up to 12K total tok/s (concurrency 200, input:output ratio of 6:1). If you're spiking up to 100-200K tok/s, you need a lot of GPUs to cover that, and then those GPUs sit idle most of the time.
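To make that concrete, a quick back-of-envelope using the 12K tok/s figure above (the average-load fraction is just an assumption for illustration):

    import math

    node_tok_s = 12_000          # measured: DeepSeek-R1 on 8x H200, vLLM, concurrency 200
    peak_tok_s = 150_000         # middle of the 100-200K tok/s spike range
    avg_load_fraction = 0.25     # assumed: average traffic is a quarter of peak

    nodes = math.ceil(peak_tok_s / node_tok_s)   # you have to provision for the spikes
    gpus = nodes * 8
    avg_util = peak_tok_s * avg_load_fraction / (nodes * node_tok_s)

    print(f"{nodes} nodes ({gpus} GPUs), average utilization ~{avg_util:.0%}")
    # -> 13 nodes (104 GPUs), average utilization ~24%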

I'll read the blog post in more detail, but I don't think the following assumptions hold outside of AI labs.

* 100% utilization (no spikes, balanced usage between day/night or weekdays)
* Input processing is free (~$0.001 per million tokens)
* DeepSeek fits into H100 cards in a way that network isn't the bottleneck

replies(1): >>45057000 #
2. _sword ◴[] No.45057000[source]
I was modeling configurations purpose-built for running specific models on specific workloads. I was trying to figure out how much of a gross margin drag some software companies would take on if they hosted their own models and served them up as APIs or as integrated copilots alongside their other offerings.
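For a rough sense of the drag, a toy blended-margin calculation (all figures invented for illustration):

    # Toy blended-margin calculation -- all figures are assumptions, not real data.
    software_rev, software_margin = 100.0, 0.80   # core product at 80% gross margin
    ai_rev, ai_margin = 20.0, 0.30                # self-hosted copilot revenue served at 30% margin

    blended = (software_rev * software_margin + ai_rev * ai_margin) / (software_rev + ai_rev)
    print(f"blended gross margin: {blended:.0%}")  # -> 72%, down from 80%

Even a modest slice of low-margin inference revenue pulls the blended number down noticeably.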