Starting April 29, 2025, Gemini 1.5 Pro and Gemini 1.5 Flash models are not available in projects that have no prior usage of these models, including new projects. For details, see Model versions and lifecycle.
Use batch prediction for asynchronous, high-throughput, and cost-effective inference for your large-scale data processing needs.
This page covers the following topics:
Quotas and limits: Understand the operational constraints, including queue times, request limits, and processing times.
Best practices: Get recommendations for maximizing throughput and optimizing costs for your batch prediction jobs.
Why use batch prediction?
In many real-world scenarios, you don't need an immediate response from a
language model. Instead, you might have a large dataset of prompts that you need
to process efficiently and affordably. This is an ideal use case for batch prediction.
Key benefits include:
Cost-effectiveness: Batch processing costs 50% less than real-time inference, which makes it ideal for large-scale, non-urgent tasks.
High rate limits: You can process hundreds of thousands of requests in a single batch, at a higher rate limit than the real-time Gemini API.
Simplified workflow: Instead of managing a complex pipeline of individual real-time requests, you can submit a single batch job and retrieve the results when processing is complete, as shown in the sketch after this list. The service handles format validation, parallelizes requests, and automatically retries to help ensure a high completion rate. Jobs typically complete within 24 hours of starting.
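As an illustration of that single-submission workflow, the following sketch uses the Google Gen AI SDK (`google-genai`) to create a batch job against a Cloud Storage input file. The project ID, bucket paths, and model version are placeholders, and the exact config fields are assumptions to verify against your SDK version.

```python
from google import genai
from google.genai.types import CreateBatchJobConfig

# Placeholder project, location, and Cloud Storage paths; replace with your own.
client = genai.Client(vertexai=True, project="my-project", location="us-central1")

job = client.batches.create(
    model="gemini-2.0-flash-001",               # any Gemini model that supports batch prediction
    src="gs://my-bucket/batch/requests.jsonl",  # JSONL file of request objects
    config=CreateBatchJobConfig(
        dest="gs://my-bucket/batch/output/",    # prefix where results are written
    ),
)
print(job.name, job.state)  # for example: projects/.../batchPredictionJobs/123 JOB_STATE_PENDING
```

You submit the job once and come back for the results, rather than orchestrating and retrying thousands of individual real-time calls yourself.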
Batch prediction is optimized for the following large-scale processing tasks:
Content generation: Generate product descriptions, social media posts, or other creative text in bulk.
Data annotation and classification: Classify user reviews, categorize documents, or perform sentiment analysis on a large corpus of text.
Offline analysis: Summarize articles, extract key information from reports, or translate documents at scale.
Gemini models that support batch predictions
The following base and tuned Gemini models support batch predictions:
Quotas and limits
Quota: The batch service doesn't have predefined quota limits. It uses a large, shared pool of resources that are dynamically allocated based on availability and real-time demand. During periods of high traffic, your batch requests might be queued until capacity becomes available.
Queue time: If a batch job is queued due to high traffic, it remains in the queue for up to 72 hours before it expires.
Request limits: A single batch job can include up to 200,000 requests. For inputs from Cloud Storage, the file size limit is 1 GB.
Processing time: The service processes batch jobs asynchronously; they are not designed for real-time applications. Most jobs complete within 24 hours after they start running, not including queue time. After 24 hours, the service cancels incomplete jobs, and you are charged only for completed requests.
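To make the Cloud Storage request limit concrete, the sketch below builds a small JSONL input file where each line wraps one request body. The field names follow the commonly documented request format but should be treated as assumptions to verify against the current reference; the prompts and file names are placeholders.

```python
import json

# Each line is one request; a single batch job can hold up to 200,000 of these,
# and a Cloud Storage input file can be at most 1 GB.
prompts = [
    "Summarize the attached quarterly report in three bullet points.",
    "Classify this review as positive, neutral, or negative: 'Great battery life.'",
]

with open("requests.jsonl", "w") as f:
    for prompt in prompts:
        line = {
            "request": {
                "contents": [
                    {"role": "user", "parts": [{"text": prompt}]},
                ],
            },
        }
        f.write(json.dumps(line) + "\n")

# Upload requests.jsonl to Cloud Storage (for example, gs://my-bucket/batch/requests.jsonl)
# and reference that URI as the job's input source when you create the batch job.
```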
Best practices
To get the most out of batch prediction with Gemini, follow these best practices:
Combine jobs: To maximize throughput, combine smaller jobs into one large
job, within system limits. For example, submitting one batch job with 200,000
requests provides better throughput than 1,000 jobs with 200 requests each.
Monitor job status: You can monitor job progress by using the API, SDK, or UI. For more information, see monitor the job status. If a job fails, check the error messages to diagnose and troubleshoot the issue. A polling sketch follows this list.
Optimize for cost: Take advantage of the cost savings of batch processing for any tasks that don't require an immediate response.
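As a sketch of the monitoring step, the loop below polls a batch job by name with the Google Gen AI SDK until it reaches a terminal state. The job name is a placeholder, and the state values and their string form are assumptions to check against the SDK version you use.

```python
import time

from google import genai

client = genai.Client(vertexai=True, project="my-project", location="us-central1")

# Replace with the name returned when the job was created, for example:
# "projects/my-project/locations/us-central1/batchPredictionJobs/123".
job_name = "projects/my-project/locations/us-central1/batchPredictionJobs/123"

# Assumed terminal states; verify against the SDK's JobState enum.
terminal_states = {"JOB_STATE_SUCCEEDED", "JOB_STATE_FAILED", "JOB_STATE_CANCELLED"}

while True:
    job = client.batches.get(name=job_name)
    print("Current state:", job.state)
    # job.state may be an enum or a plain string depending on the SDK version,
    # so compare against its trailing name.
    if str(job.state).split(".")[-1] in terminal_states:
        break
    time.sleep(60)  # batch jobs are asynchronous; poll sparingly
```

Polling every minute or less often is enough for jobs that take hours; reserve tighter loops for real-time workloads served by the standard API.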
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Hard to understand","hardToUnderstand","thumb-down"],["Incorrect information or sample code","incorrectInformationOrSampleCode","thumb-down"],["Missing the information/samples I need","missingTheInformationSamplesINeed","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2025-08-21 UTC."],[],[],null,[]]