Batch Mode in the Gemini API: Process More for Less

(developers.googleblog.com)

1. tripplyons ◴[11 Jul 25 01:45 UTC] No.44527652[source]▶

>>44492014 (OP) #

For those who aren't aware, OpenAI has a very similar batch mode (50% discount if you wait up to 24 hours): https://platform.openai.com/docs/api-reference/batch

It's nice to see competition in this space. AI is getting cheaper and cheaper!

replies(4): >>44528108 #>>44528444 #>>44528451 #>>44532342 #

2. dsjoerg ◴[11 Jul 25 02:11 UTC] No.44527801[source]▶

>>44492014 (OP) #

We used the previous version of this batch mode, which went through BigQuery. It didn't work well for us at the time because we were in development mode and we needed faster cycle time to iterate and learn. Sometimes the response would come back much faster than 24 hours, but sometimes not. There was no visibility offered into what response time you would get; just submit and wait.

You have to be pretty darn sure that your job is going to do exactly what you want to be able to wait 24 hours for a response. It's like going back to the punched-card era. If I could get even 1% of the batch in a quicker response and then the rest more slowly, that would have made a big difference.

replies(4): >>44527819 #>>44528277 #>>44528385 #>>44530651 #

3. cpard ◴[11 Jul 25 02:14 UTC] No.44527819[source]▶

>>44527801 #

It seems that the 24h SLA is standard for batch inference among the vendors and I wonder how useful it can be when you have no visibility on when the job will be delivered.

I wonder why they do that and who is actually getting value out of these batch APIs.

Thanks for sharing your experience!

replies(5): >>44527850 #>>44527911 #>>44528102 #>>44528329 #>>44530652 #

4. vineyardmike ◴[11 Jul 25 02:20 UTC] No.44527850{3}[source]▶

>>44527819 #

It’s like most batch processes: it’s not useful if you don’t know what the response will be and you’re iterating interactively. But for data pipelines, analytics workloads, etc., you can handle that delay because no one is waiting on the response.

I’m a developer working on a product that lets users upload content. This upload is not time sensitive. We pass the content through a review pipeline, where we do moderation and analysis, plus some business-specific checks that the user uploaded relevant content. We’re migrating some of that to an LLM-based approach because (in testing) the results are just as good, and tweaking a prompt is easier than updating code. We’ll probably use a batch API for this and accept that content can take 24 hours to be audited.

replies(1): >>44528900 #

5. 3eb7988a1663 ◴[11 Jul 25 02:32 UTC] No.44527911{3}[source]▶

>>44527819 #

Think of it like you have a large queue of work to be done (eg summarize N decades of historical documents). There is little urgency to the outcome because the bolus is so large. You just want to maintain steady progress on the backlog where cost optimization is more important than timing.

replies(1): >>44528950 #

6. nnx ◴[11 Jul 25 02:39 UTC] No.44527942[source]▶

>>44492014 (OP) #

It would be nice if OpenRouter supported batch mode too, sending a batch and letting OpenRouter find the best provider for the batch within given price and response time.

7. YetAnotherNick ◴[11 Jul 25 03:15 UTC] No.44528102{3}[source]▶

>>44527819 #

Contrary to other comments, it's likely not because of queueing or general batch reasons. I think it is because LLMs are unique in that they require a lot of fixed nodes because of VRAM requirements, and hence are harder to autoscale. So the batch jobs are likely executed when there are free resources on the interactive servers.

replies(2): >>44528956 #>>44536031 #

8. fantispug ◴[11 Jul 25 03:16 UTC] No.44528108[source]▶

>>44527652 #

Yes, this seems to be a common capability - Anthropic and Mistral have something very similar as do resellers like AWS Bedrock.

I guess it lets them better utilise their hardware in quiet times throughout the day. It's interesting they all picked 50% discount.

replies(3): >>44528237 #>>44529423 #>>44532883 #

9. qrian ◴[11 Jul 25 03:50 UTC] No.44528237{3}[source]▶

>>44528108 #

Bedrock has a batch mode, but only for Claude 3.5, which is about a year old, so it isn't very useful.

10. serjester ◴[11 Jul 25 04:00 UTC] No.44528277[source]▶

>>44527801 #

We've submitted tens of millions of requests at a time and never had it take longer than a couple hours - I think the zone you submit to plays a role.

11. jampa ◴[11 Jul 25 04:14 UTC] No.44528329{3}[source]▶

>>44527819 #

> who is actually getting value out of these batch APIs

I used the batch API extensively for my side project, where I wanted to ingest a large amount of images, extract descriptions, and create tags for searching. After you get the right prompt, and the output is good, you can just use the Batch API for your pipeline. For any non-time-sensitive operations, it is excellent.

replies(1): >>44528924 #

12. pugio ◴[11 Jul 25 04:20 UTC] No.44528356[source]▶

>>44492014 (OP) #

Hah, I've been wrestling with this ALL DAY. Another example of Phenomenal Cosmic Powers (AI) combined with itty bitty docs (typical of Google). The main endpoint ("https://generativelanguage.googleapis.com/v1beta/models/gemi...") doesn't even have actual REST documentation in the API. The Python API has 3 different versions of the same types. One of the main ones (`GenerateContentRequest`) isn't available in the newest path (`google.genai.types`) so you need to find it in an older version, but then you start getting version mismatch errors, and then pydantic errors, until you finally decide to just cross your fingers and submit raw JSON, only to get opaque API errors.

So, if anybody else is frustrated and not finding anything online about this, here are a few things I learned, specifically for structured output generation (which is a main use case for batching) - the individual request JSON should resolve to this:

```json
{
  "request": {
    "contents": [
      { "parts": [ { "text": "Give me the main output please" } ] }
    ],
    "system_instruction": {
      "parts": [ { "text": "You are a main output maker." } ]
    },
    "generation_config": {
      "response_mime_type": "application/json",
      "response_json_schema": {
        "type": "object",
        "properties": {
          "output1": { "type": "string" },
          "output2": { "type": "string" }
        },
        "required": [ "output1", "output2" ]
      }
    }
  },
  "metadata": { "key": "my_id" }
}
```

To get actual structured output, don't just do `generation_config.response_schema`, you need to include the mime-type, and the key should be `response_json_schema`. Any other combination will either throw opaque errors or won't trigger Structured Output (and will contain the usual LLM intros "I'm happy to do this for you...").

So you upload a .jsonl file with the above JSON, and then you try to submit it for a batch job. If something is wrong with your file, you'll get a "400" and no other info. If something is wrong with the request submission you'll get a 400 with "Invalid JSON payload received. Unknown name \"file_name\" at 'batch.input_config.requests': Cannot find field."
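Since a bad file only gets you a bare 400, it can help to check each line locally before uploading. Here is a minimal sketch in Python, assuming the request shape shown above. `build_line` and `validate_line` are hypothetical helpers (not part of any Google SDK), and the checks only encode the constraints described in this comment, not the full API schema.

```python
import json


def build_line(key, prompt, system_text, schema):
    """Serialize one batch request in the shape shown above (one JSONL line)."""
    record = {
        "request": {
            "contents": [{"parts": [{"text": prompt}]}],
            "system_instruction": {"parts": [{"text": system_text}]},
            "generation_config": {
                "response_mime_type": "application/json",
                "response_json_schema": schema,
            },
        },
        "metadata": {"key": key},
    }
    # json.dumps emits a single line by default, which is what JSONL needs.
    return json.dumps(record)


def validate_line(line):
    """Return a list of problems found in one JSONL line (empty if none)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top level is not an object"]
    request = record.get("request")
    if not isinstance(request, dict):
        return ["missing 'request' object"]

    problems = []
    if not request.get("contents"):
        problems.append("missing 'request.contents'")
    config = request.get("generation_config") or {}
    # Structured output needs both the mime type and response_json_schema.
    if config.get("response_mime_type") != "application/json":
        problems.append("response_mime_type is not 'application/json'")
    if "response_json_schema" not in config:
        problems.append("missing 'response_json_schema'")
    if "key" not in (record.get("metadata") or {}):
        problems.append("missing 'metadata.key'")
    return problems
```

Running `validate_line` over every line of the .jsonl file before upload at least tells you which line and which field is off, which the API apparently won't.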

I got the above error endless times when trying their exact sample code:

```
BATCH_INPUT_FILE='files/123456' # File ID
curl https://generativelanguage.googleapis.com/v1beta/models/gemi... \
-X POST \
-H "x-goog-api-key: $GEMINI_API_KEY" \
-H "Content-Type:application/json" \
-d "{
    'batch': {
        'display_name': 'my-batch-requests',
        'input_config': {
            'requests': {
                'file_name': ${BATCH_INPUT_FILE}
            }
        }
    }
}"
```

Finally got the job submission working via the python api (`file_batch_job = client.batches.create()`), but remember, if something is wrong with the file you're submitting, they won't tell you what, or how.

replies(1): >>44529717 #

13. Jensson ◴[11 Jul 25 04:27 UTC] No.44528385[source]▶

>>44527801 #

> If I could get even 1% of the batch in a quicker response and then the rest more slowly, that would have made a big difference.

You can do this, just send 1% using the regular API.
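A rough sketch of that split in Python, assuming your requests are already in a list. `split_canary` is a hypothetical helper; sending the canary through the regular API and the rest through batch is left to whatever client you use.

```python
import random


def split_canary(requests, fraction=0.01, seed=0):
    """Split requests into a small canary sample and the remaining bulk."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not requests:
        return [], []
    # Always keep at least one request in the canary set.
    count = max(1, round(len(requests) * fraction))
    rng = random.Random(seed)
    picked = set(rng.sample(range(len(requests)), count))
    canary = [r for i, r in enumerate(requests) if i in picked]
    bulk = [r for i, r in enumerate(requests) if i not in picked]
    return canary, bulk
```

Sampling at random (rather than taking the first 1%) gives a better chance that the canary set looks like the rest of the data.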

replies(1): >>44530433 #

14. great_psy ◴[11 Jul 25 04:34 UTC] No.44528424[source]▶

>>44492014 (OP) #

Is this an indication of the peak of the AI bubble ?

In a way this is saying that there are some GPUs just sitting around so they would rather get 50% than nothing for their use.

replies(2): >>44528499 #>>44528501 #

15. bayesianbot ◴[11 Jul 25 04:38 UTC] No.44528444[source]▶

>>44527652 #

DeepSeek has gone a slightly different route - they give an automatic 75% discount between 16:30 and 00:30 UTC

https://api-docs.deepseek.com/quick_start/pricing
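The window quoted above wraps past midnight UTC, which is easy to get wrong when scheduling jobs around it. A small Python sketch, assuming the 16:30-00:30 UTC hours from this comment (check the pricing page for current values); `in_off_peak` is a hypothetical helper, and treating the end of the window as exclusive is my assumption.

```python
from datetime import datetime, time, timezone

OFF_PEAK_START = time(16, 30)  # UTC, as quoted in the comment above
OFF_PEAK_END = time(0, 30)     # UTC, on the following day


def in_off_peak(moment):
    """Return True if an aware datetime falls in the 16:30-00:30 UTC window."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    t = moment.astimezone(timezone.utc).time()
    # The window crosses midnight, so it is the union of two ranges:
    # [16:30, 24:00) and [00:00, 00:30).
    return t >= OFF_PEAK_START or t < OFF_PEAK_END
```

For example, `in_off_peak(datetime.now(timezone.utc))` can gate whether a non-urgent job is sent now or held for later.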

16. dlvhdr ◴[11 Jul 25 04:39 UTC] No.44528451[source]▶

>>44527652 #

The latest price increases beg to differ

replies(2): >>44529179 #>>44530641 #

17. graeme ◴[11 Jul 25 04:48 UTC] No.44528499[source]▶

>>44528424 #

Seems more like electricity pricing, which has peak and offpeak pricing for most business customers.

To handle peak daily load you need capacity that goes unused in offpeak hours.

18. reasonableklout ◴[11 Jul 25 04:48 UTC] No.44528501[source]▶

>>44528424 #

Why do you think that this means "idle GPU" rather than a company recognizing a growing need and allocating resources toward it?

It's cheaper because it's a different market with different needs which can be served by systems optimizing for throughput instead of latency. Feels like you're looking for something that's not there.

19. dmitry-vsl ◴[11 Jul 25 05:25 UTC] No.44528650[source]▶

>>44492014 (OP) #

Is it possible to use batch mode with fine-tuned models?

20. cpard ◴[11 Jul 25 06:19 UTC] No.44528900{4}[source]▶

>>44527850 #

yeah I get that part of batch, but even with batch processing, you usually want to have some kind of sense of when the data will be done. Especially when downstream processes depend on that.

The other part that I think makes batch LLM inference unique is that the results are not deterministic. That's why I think the parent has a point: at least some of the data should be available earlier, even if the rest only arrives within 24h.

21. cpard ◴[11 Jul 25 06:23 UTC] No.44528924{4}[source]▶

>>44528329 #

What you describe makes total sense. I think that the tricky part is the "non-time-sensitive operations", in an environment where even if you don't care to have results in minutes, you have pipelines that run regularly and there are dependencies on them.

Maybe I'm just thinking too much in data engineering terms here.

22. cpard ◴[11 Jul 25 06:28 UTC] No.44528950{4}[source]▶

>>44527911 #

yes, what you describe feels like a one off job that you want to run, which is big and also not time critical.

Here's an example:

If you are a TV broadcaster and you want to summarize and annotate the content generated in the past 12 hours you most probably need to have access to the summaries of the previous 12 hours too.

Now if you submit a batch job for the first 12 hours of content, you might end up in a situation where you want to process the next batch but the previous one is not delivered yet.

And imo that's fine as long as you somehow know that it will take more than 12h to complete but it might be delivered to you in 1h or in 23h.

That's the part of these batch APIs that I find hard to understand: how do you use them in a production environment outside of one-off jobs?

23. cpard ◴[11 Jul 25 06:30 UTC] No.44528956{4}[source]▶

>>44528102 #

that makes total sense and what it entails is that interactive inference >>> batch inference in the market today in terms of demand.

24. dmos62 ◴[11 Jul 25 07:09 UTC] No.44529179{3}[source]▶

>>44528451 #

What price increases?

replies(1): >>44529317 #

25. rvnx ◴[11 Jul 25 07:33 UTC] No.44529317{4}[source]▶

>>44529179 #

I guess the Gemini price increase

replies(1): >>44531095 #

26. calaphos ◴[11 Jul 25 07:51 UTC] No.44529423{3}[source]▶

>>44528108 #

Inference throughput scales really well with larger batch sizes (at the cost of latency) due to rising arithmetic intensity and the fact that it's almost always memory-bandwidth limited.
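To put rough numbers on that, here is a back-of-envelope Python sketch for a single weight matrix. The model is my simplification, not anything from a vendor: one `d_in` x `d_out` matrix multiply in 2-byte precision, counting only weight and activation traffic, and ignoring attention, KV cache, and kernel overheads.

```python
def arithmetic_intensity(batch, d_in, d_out, bytes_per_value=2):
    """FLOPs per byte moved for one (batch x d_in) @ (d_in x d_out) matmul."""
    # One multiply and one add per weight, per row of the batch.
    flops = 2 * batch * d_in * d_out
    # The weights are read once no matter how many rows share them.
    weight_bytes = bytes_per_value * d_in * d_out
    # Inputs are read and outputs are written once per row.
    activation_bytes = bytes_per_value * batch * (d_in + d_out)
    return flops / (weight_bytes + activation_bytes)
```

With `d_in = d_out = 4096`, this gives just under 1 FLOP per byte at batch size 1 and about 62 at batch size 64. Under this simplified model, the same weight read is amortized over more work as the batch grows, which is why throughput improves as long as memory bandwidth is the bottleneck.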

27. segalord ◴[11 Jul 25 07:54 UTC] No.44529444[source]▶

>>44492014 (OP) #

Man, Google's offerings are so inconsistent. Batch processing has been available on Vertex for a while now. I don't really get why they have two different offerings in Vertex and Gemini; both are equally inaccessible.

replies(2): >>44530311 #>>44530805 #

28. TheTaytay ◴[11 Jul 25 08:39 UTC] No.44529717[source]▶

>>44528356 #

Thank you for posting this! (When I run into errors with posted sample code, I spend WAY too long assuming it’s my fault.)

29. druskacik ◴[11 Jul 25 08:47 UTC] No.44529776[source]▶

>>44492014 (OP) #

I've been using OpenAI's batch API for some time, then replaced it with Mistral's batch API because it was cheaper (Mistral Small with $0.10 / $0.20 per million tokens was perfect for my use case). This makes me rethink my choice, e.g. Gemini 2.5 Flash-Lite seems to be a better model[0] with only a slight price increase.

[0] https://artificialanalysis.ai/leaderboards/models

30. tucnak ◴[11 Jul 25 09:14 UTC] No.44529985[source]▶

>>44492014 (OP) #

I really hope it means that 2.5 models will be available for batching in Vertex, too. We had spent quite a bit of effort making it work with BigQuery, and it's really cool when it works. There's an edge case, though, where it doesn't work: when the batch also refers to a cached prompt. We did report this a few months ago.

31. anupj ◴[11 Jul 25 09:25 UTC] No.44530084[source]▶

>>44492014 (OP) #

Batch Mode for the Gemini API feels like Google’s way of asking, “What if we made AI more affordable and slower, but at massive scale?” Now you can process 10,000 prompts like “Summarize each customer review in one line” for half the cost, provided you’re willing to wait until tomorrow for the results.

replies(4): >>44530624 #>>44531272 #>>44533342 #>>44534982 #

32. nikolayasdf123 ◴[11 Jul 25 09:55 UTC] No.44530311[source]▶

>>44529444 #

omg I realized this is not Vertex AI face-palm

33. Implicated ◴[11 Jul 25 10:13 UTC] No.44530433{3}[source]▶

>>44528385 #

I was also rather puzzled at this comment - why not dev against real time endpoints and batch when you've got things where you need them?

34. kerisi ◴[11 Jul 25 10:14 UTC] No.44530436[source]▶

>>44492014 (OP) #

I've been using this with nothing notable to mention besides there seems to be a common bug where you receive an empty text response.

https://discuss.ai.google.dev/t/gemini-2-5-pro-with-empty-re...

35. dist-epoch ◴[11 Jul 25 10:43 UTC] No.44530624[source]▶

>>44530084 #

Most LLM providers have batch mode. Not sure why you are calling them out.

replies(1): >>44534996 #

36. dist-epoch ◴[11 Jul 25 10:45 UTC] No.44530641{3}[source]▶

>>44528451 #

Only because Flash was mispriced to start with. It was set too cheap compared with its capabilities. They didn't raise the price of Pro.

37. lazharichir ◴[11 Jul 25 10:47 UTC] No.44530651[source]▶

>>44527801 #

You can also do gemini flash lite for a subset and then batch the rest with flash or pro

38. dist-epoch ◴[11 Jul 25 10:48 UTC] No.44530652{3}[source]▶

>>44527819 #

> you have no visibility on when the job will be delivered

You do have - within 24 hours. So don't submit requests you need in 10 hours.

39. rockwotj ◴[11 Jul 25 11:10 UTC] No.44530805[source]▶

>>44529444 #

It’s because Vertex is the “enterprise” offering that is HIPAA compliant, etc. That is why Vertex only has explicit prompt caching and not implicit, etc. Vertex usage is never used for training or model feedback, but Gemini API usage is. Basically the Gemini API is Google’s way of being able to move faster like OpenAI and the other foundation model providers, while still having an enterprise offering. Go check Anthropic’s documentation: they even say that if you have enterprise or regulatory needs, go use Bedrock or Vertex.

replies(1): >>44532647 #

40. lopuhin ◴[11 Jul 25 11:33 UTC] No.44530959[source]▶

>>44492014 (OP) #

I find OpenAI's new flex processing more attractive, as it has the same 50% discount but lets you use the same API as regular chat mode, so you can still do stuff where the Batch API won't work (e.g. evaluating agents). In practice I found it to work well enough when paired with client-side request caching: https://platform.openai.com/docs/guides/flex-processing?api-...
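For anyone wondering what client-side request caching can look like, here is a minimal in-memory sketch in Python. It is not tied to any SDK: `RequestCache` is a hypothetical class, and `send` is whatever function you pass in to make the real API call.

```python
import hashlib
import json


class RequestCache:
    """In-memory cache keyed by a hash of the canonical request payload."""

    def __init__(self, send):
        self._send = send  # callable that performs the real API request
        self._store = {}
        self.hits = 0

    @staticmethod
    def key_for(payload):
        # Sort keys so logically identical payloads hash to the same value.
        canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    def call(self, payload):
        key = self.key_for(payload)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        response = self._send(payload)
        self._store[key] = response
        return response
```

This only makes sense when reusing an earlier response for an identical request is acceptable, e.g. when re-running an eval after a slow or failed run. A persistent store (a file or database) would be needed to survive restarts.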

replies(1): >>44531047 #

41. irthomasthomas ◴[11 Jul 25 11:46 UTC] No.44531047[source]▶

>>44530959 #

It's nice that they stack the batch pricing and caching discount. I asked the Google guy if they did the same but got no reply, so probably not.

Edit: Anthropic also stacks batching and caching discounts

42. dmos62 ◴[11 Jul 25 11:53 UTC] No.44531095{5}[source]▶

>>44529317 #

Ah, 2.5 flash non-thinking price was increased to match the price of 2.5 flash thinking.

replies(1): >>44532762 #

43. diggan ◴[11 Jul 25 12:16 UTC] No.44531272[source]▶

>>44530084 #

> Now you can process 10,000 prompts like “Summarize each customer review in one line” for half the cost, provided you’re willing to wait until tomorrow for the results.

Sounds like a great option to have available? Not every task I use LLMs for needs an immediate response, and if I wasn't using local models for those things, getting a 50% discount and having to wait a day sounds like a fine tradeoff.

44. laborcontract ◴[11 Jul 25 14:08 UTC] No.44532342[source]▶

>>44527652 #

One open secret is that batch mode generations often take much less than 24 hours. I've done a lot of generations where I get my results within 5ish minutes.

replies(1): >>44537409 #

45. Deathmax ◴[11 Jul 25 14:39 UTC] No.44532647{3}[source]▶

>>44530805 #

Vertex's offering of Gemini very much does implicit caching, and always has [1]. The recent addition of applying implicit cache hit discounts also works on Vertex, as long as you don't use the `global` endpoint and hit one of the regional endpoints.

[1]: http://web.archive.org/web/20240517173258/https://cloud.goog..., "By default Google caches a customer's inputs and outputs for Gemini models to accelerate responses to subsequent prompts from the customer. Cached contents are stored for up to 24 hours."

46. Workaccount2 ◴[11 Jul 25 14:48 UTC] No.44532762{6}[source]▶

>>44531095 #

No, 2.5 flash non-thinking was replaced with 2.5 flash lite, and 2.5 flash thinking had its cost rebalanced (input price increased/output price decreased)

2.5 flash non-thinking doesn't exist anymore. People call it a price increase but it's just confusion about what Google did.

replies(1): >>44536758 #

47. briangriffinfan ◴[11 Jul 25 14:58 UTC] No.44532883{3}[source]▶

>>44528108 #

50% is my personal threshold for a discount going from not worth it to worth it.

48. XTXinverseXTY ◴[11 Jul 25 15:33 UTC] No.44533342[source]▶

>>44530084 #

This is an extremely common use case.

Reading your comment history: are you an LLM?

https://news.ycombinator.com/item?id=44531907

https://news.ycombinator.com/item?id=44531868

49. okdood64 ◴[11 Jul 25 17:39 UTC] No.44534982[source]▶

>>44530084 #

I don't understand the point you're making. This has been a commonly used offering since cloud blew up.

https://aws.amazon.com/ec2/spot/

50. okdood64 ◴[11 Jul 25 17:40 UTC] No.44534996{3}[source]▶

>>44530624 #

I'll take it further. Regular cloud compute has had batch workload capabilities at cheaper rates since forever, as well.

51. dekhn ◴[11 Jul 25 19:30 UTC] No.44536031{4}[source]▶

>>44528102 #

Yes, almost certainly in this case Google sees traffic die off when a data center is in the dark. Specifically, there is a diurnal cycle of traffic, and Google usually routes users to close-by resources. So, late at night, all those backends which were running hot doing low-latency replies to users in near-real-time can instead switch over to processing batches. When I built an idle cycle harvester at Google, I thought most of the free cycles would come from low-usage periods, but it turned out that some clusters were just massively underutilized and had free resources all 24 hours.

52. sunaookami ◴[11 Jul 25 21:03 UTC] No.44536758{7}[source]▶

>>44532762 #

They try to frame it as such but 2.5 Flash Lite is not the same as 2.5 Flash without thinking. It's worse.

53. ridgewell ◴[11 Jul 25 22:28 UTC] No.44537409{3}[source]▶

>>44532342 #

To my understanding, it can depend a lot on the shape of your batch. A small batch job can be scheduled a lot quicker than a large batch job that has to wait for just the right moment when capacity fits.