281 points GabrielBianconi | 3 comments
brilee No.45065876
For those commenting on cost per token:

This throughput assumes 100% utilization. A bunch of things raise the cost at scale:

- There are no on-demand GPUs at this scale; you have to rent them on multi-year contracts. So you have to lock in some number of GPUs for your maximum throughput (or some sufficiently high percentile), not your average throughput. Your peak throughput during west coast business hours is probably 2-3x higher than throughput at tail hours (east coast morning, west coast evenings).

- GPUs are often regionally locked due to data-residency and latency constraints, so it's difficult to utilize them overnight: Asia doesn't want its data sent to the US, and the US doesn't want its data sent to Asia.

These two factors mean that GPU utilization comes in at 10-20% (rough math sketched below). Now, if you're a massive company that spends a lot of money on training new models, you could conceivably slot RL inference or model training into those off-peak hours, maximizing utilization.
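
A back-of-the-envelope sketch of that utilization math, in Python. Every number here is an illustrative assumption, not a measurement:

    # Rough utilization estimate given peak provisioning and regional locking.
    peak_to_avg = 2.5           # assumed: peak demand is ~2-3x the tail
    regional_active_hours = 12  # assumed: a region-locked GPU sees demand ~half the day

    # You provision for peak, so average demand fills only 1/peak_to_avg of
    # capacity, and only during the hours the region is actually awake.
    utilization = (1 / peak_to_avg) * (regional_active_hours / 24)
    print(f"estimated utilization: {utilization:.0%}")  # ~20%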

But for companies specializing purely in inference, I would _not_ assume that these 90% margins are real. I would guess that even when it seems "10x cheaper", you're only seeing margins of 50%.
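
A hedged sketch of why the margin collapses: with a hypothetical price of 10x the full-utilization serving cost, idle GPUs still bill, so effective cost scales with 1/utilization:

    # Margin vs. utilization, all numbers assumed (price = 10x cost at 100% util).
    price = 10.0             # assumed revenue per 1M tokens
    cost_at_full_util = 1.0  # assumed cost per 1M tokens at 100% utilization

    for utilization in (1.0, 0.20, 0.10):
        effective_cost = cost_at_full_util / utilization  # idle time isn't free
        margin = (price - effective_cost) / price
        print(f"utilization {utilization:.0%}: margin {margin:.0%}")
    # 100% -> 90%, 20% -> 50%, 10% -> 0%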

replies(7): >>45067585 #>>45067903 #>>45067926 #>>45068175 #>>45068222 #>>45072198 #>>45073200 #
parhamn No.45068222
Do we know how big the "batch processing" market is? I know the major providers offer 50%+ off for off-peak processing.

I assumed it exists partly to correct this problem, and on the surface it seems like it'd be useful for big-data shops where process-eventually is good enough, i.e. it could be a relatively big market. Is it?

replies(1): >>45069433 #
1. sdesol No.45069433
I don't think you need to be big data to benefit.

A major issue we have right now is that we want the coding process to be more "Agentic", but we don't have an easy way for LLMs to determine what to pull into context to solve a problem. This is the problem I'm working on with my personal AI search assistant, which I describe below:

https://github.com/gitsense/chat/blob/main/packages/chat/wid...

Analyzers are the "brains" of my search, but generating the analysis is both tedious and potentially costly. I'm working on the tedious part, and with batch processing you can probably analyze thousands of files for under $5 with Gemini 2.5 Flash.
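
To sanity-check the "under $5" figure, a quick cost sketch. The per-token prices and token counts below are assumptions for illustration (batch tiers are typically ~50% off on-demand rates); check the current Gemini pricing page before relying on them:

    # Estimated batch cost for analyzing a few thousand files.
    files = 5_000
    avg_input_tokens = 2_000   # assumed: file contents plus analysis prompt
    avg_output_tokens = 500    # assumed: structured analysis per file

    input_price = 0.15 / 1e6   # assumed $/token at batch rates
    output_price = 1.25 / 1e6  # assumed $/token at batch rates

    cost = files * (avg_input_tokens * input_price
                    + avg_output_tokens * output_price)
    print(f"estimated batch cost: ${cost:.2f}")  # ~$4.62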

With batch processing and the ability to continuously analyze tens of thousands of files, I can see companies wanting to make "Agentic" coding smarter, which should help with GPU utilization and drive down the cost of software development.

replies(1): >>45073205 #
2. saagarjha No.45073205
You sound like you are talking about something completely different.
replies(2): >>45073477 #>>45075958 #
3. sdesol No.45075958
No, what I'm saying is that there are more applications for batch processing that will help with utilization. I can see developers and companies using off-hours processing to prep their data for agentic coding.