167 points by xnx | 11 comments

dsjoerg ◴[] No.44527801[source]
We used the previous version of this batch mode, which went through BigQuery. It didn't work well for us at the time because we were in development mode and we needed faster cycle time to iterate and learn. Sometimes the response would come back much faster than 24 hours, but sometimes not. There was no visibility offered into what response time you would get; just submit and wait.

You have to be pretty darn sure that your job is going to do exactly what you want to be able to wait 24 hours for a response. It's like going back to the punched-card era. If I could get even 1% of the batch in a quicker response and then the rest more slowly, that would have made a big difference.
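
A minimal sketch of that kind of split, assuming hypothetical run_sync(), looks_reasonable(), and submit_batch() helpers standing in for whichever vendor API is actually in use:

    import random

    def validate_then_batch(requests, sample_frac=0.01):
        """Send a small random sample through the interactive API first,
        then commit the remainder to the 24h batch queue."""
        sample_size = max(1, int(len(requests) * sample_frac))
        sample = random.sample(requests, sample_size)

        # Fast feedback: if the prompt is wrong, learn it in minutes, not in 24 hours.
        for req in sample:
            result = run_sync(req)                # hypothetical interactive call
            if not looks_reasonable(result):      # hypothetical sanity check
                raise RuntimeError("output looks wrong; fix the prompt before batching")

        remainder = [r for r in requests if r not in sample]
        return submit_batch(remainder)            # hypothetical batch submission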

replies(4): >>44527819 #>>44528277 #>>44528385 #>>44530651 #
1. cpard ◴[] No.44527819[source]
It seems that the 24h SLA is standard for batch inference among vendors, and I wonder how useful it can be when you have no visibility into when the job will be delivered.

I wonder why they do that and who is actually getting value out of these batch APIs.

Thanks for sharing your experience!

replies(5): >>44527850 #>>44527911 #>>44528102 #>>44528329 #>>44530652 #
2. vineyardmike ◴[] No.44527850[source]
It’s like most batch processes: it’s not useful if you don’t know what the response will be and you’re iterating interactively. But for data pipelines, analytics workloads, etc., you can handle that delay because no one is waiting on the response.

I’m a developer working on a product that lets users upload content. The upload is not time sensitive. We pass the content through a review pipeline where we do moderation, analysis, and some business-specific checks that the uploaded content is relevant. We’re migrating some of that to an LLM-based approach because (in testing) the results are just as good, and tweaking a prompt is easier than updating code. We’ll probably use a batch API for this and accept that content can take 24 hours to be audited.
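
A rough sketch of what that submission step could look like, assuming hypothetical submit_batch() and mark_uploads_pending_review() helpers; the prompt and field names are made up for illustration:

    MODERATION_PROMPT = (
        "Review the following user upload. Respond with JSON: "
        '{"allowed": bool, "reasons": [...], "relevant_to_product": bool}'
    )

    def enqueue_for_review(uploads):
        """Build one batch request per upload and hand the whole set to the
        batch API; results are consumed whenever they arrive, up to 24h later."""
        requests = [
            {
                "custom_id": upload["id"],            # lets results map back to uploads
                "prompt": f"{MODERATION_PROMPT}\n\n{upload['text']}",
            }
            for upload in uploads
        ]
        job_id = submit_batch(requests)               # hypothetical batch call
        mark_uploads_pending_review(uploads, job_id)  # hypothetical bookkeeping
        return job_id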

replies(1): >>44528900 #
3. 3eb7988a1663 ◴[] No.44527911[source]
Think of it like you have a large queue of work to be done (e.g., summarize N decades of historical documents). There is little urgency to the outcome because the bolus is so large. You just want to maintain steady progress on the backlog, where cost optimization is more important than timing.
replies(1): >>44528950 #
4. YetAnotherNick ◴[] No.44528102[source]
Contrary to other comments, it's likely not because of queueing or general batch-processing reasons. I think it's because LLMs are unique in that they require a lot of fixed nodes due to VRAM requirements, and hence are harder to autoscale. So the batch jobs are likely executed when there are free resources left over from the interactive servers.
replies(2): >>44528956 #>>44536031 #
5. jampa ◴[] No.44528329[source]
> who is actually getting value out of these batch APIs

I used the batch API extensively for my side project, where I wanted to ingest a large number of images, extract descriptions, and create tags for searching. Once you have the right prompt and the output is good, you can just use the Batch API for your pipeline. For any non-time-sensitive operation, it is excellent.
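
The consumption side of a pipeline like that might look roughly like this; fetch_batch_results(), the per-line JSON shape, and the search_index object are all assumptions rather than any specific vendor's format:

    import json

    def index_batch_results(job_id, search_index):
        """Read completed batch output and index each image's description and
        tags; assumes one JSON object per result line, keyed by custom_id."""
        for line in fetch_batch_results(job_id):      # hypothetical: yields raw result lines
            record = json.loads(line)
            image_id = record["custom_id"]            # maps back to the source image
            payload = json.loads(record["response_text"])
            search_index.add(
                image_id,
                description=payload["description"],
                tags=payload["tags"],
            )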

replies(1): >>44528924 #
6. cpard ◴[] No.44528900[source]
Yeah, I get that part of batch, but even with batch processing you usually want to have some sense of when the data will be done, especially when downstream processes depend on it.

The other part that I think makes batch LLM inference unique is that the results are not deterministic. That's why I think the parent's point matters: at least some of the data should be available earlier, even if the rest arrives within 24h.

7. cpard ◴[] No.44528924[source]
What you describe makes total sense. I think the tricky part is the "non-time-sensitive operations": even in an environment where you don't care about having results in minutes, you have pipelines that run regularly and there are dependencies on them.

Maybe I'm just thinking too much in data engineering terms here.

8. cpard ◴[] No.44528950[source]
Yes, what you describe feels like a one-off job that you want to run, which is big and also not time critical.

Here's an example:

If you are a TV broadcaster and you want to summarize and annotate the content generated in the past 12 hours, you most probably need access to the summaries of the previous 12 hours too.

Now if you submit a batch job for the first 12 hours of content, you might end up in a situation where you want to process the next batch but the previous one hasn't been delivered yet.

And IMO that's fine, as long as you somehow know whether it will take more than 12h to complete; but as it stands, it might be delivered to you in 1h or in 23h.

That's the part of these batch APIs that I find hard to understand: how do you use them in a production environment outside of one-off jobs?
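
One hedged sketch of handling that dependency: block the next window's submission on the previous job and fail loudly if the 24h window is blown. get_batch_status(), fetch_batch_results(), attach_context(), and submit_batch() are hypothetical stand-ins:

    import time

    def submit_next_window(window_requests, previous_job_id,
                           poll_seconds=300, deadline_seconds=24 * 3600):
        """Submit the next 12h batch only once the previous one has landed,
        since its summaries are an input; raise if the SLA is blown."""
        waited = 0
        while get_batch_status(previous_job_id) != "completed":  # hypothetical status call
            if waited >= deadline_seconds:
                raise TimeoutError(f"batch {previous_job_id} missed the 24h window")
            time.sleep(poll_seconds)
            waited += poll_seconds

        previous_summaries = fetch_batch_results(previous_job_id)  # hypothetical
        enriched = [attach_context(req, previous_summaries)        # hypothetical context join
                    for req in window_requests]
        return submit_batch(enriched)                              # hypothetical batch call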

9. cpard ◴[] No.44528956[source]
That makes total sense, and what it implies is that interactive inference >>> batch inference in the market today in terms of demand.
10. dist-epoch ◴[] No.44530652[source]
> you have no visibility on when the job will be delivered

You do have visibility: within 24 hours. So don't submit requests that you need back in 10 hours.

11. dekhn ◴[] No.44536031[source]
Yes, almost certainly in this case Google sees traffic die off when a data center is in the dark. Specifically, there is a diurnal cycle of traffic, and Google usually routes users to nearby resources. So, late at night, all those backends that were running hot serving low-latency, near-real-time replies to users can instead switch over to processing batches. When I built an idle-cycle harvester at Google, I thought most of the free cycles would come from low-usage periods, but it turned out that some clusters were just massively underutilized and had free resources all 24 hours.