Batch prediction with Gemini

This page shows you how to use batch prediction with Gemini models to process large datasets efficiently. Batch prediction provides asynchronous, high-throughput, and cost-effective inference for your large-scale data processing needs.

Why use batch prediction?

In many real-world scenarios, you don't need an immediate response from a language model. Instead, you might have a large dataset of prompts that you need to process efficiently and affordably. This is an ideal use case for batch prediction.

Key benefits include:

  • Cost-effectiveness: Batch processing costs 50% less than real-time inference, which makes it ideal for large-scale, non-urgent tasks.
  • High rate limits: You can process hundreds of thousands of requests in a single batch with a higher rate limit than the real-time Gemini API.
  • Simplified workflow: Instead of managing a complex pipeline of individual real-time requests, you can submit a single batch job and retrieve the results when processing is complete. The service handles format validation, parallelizes requests, and automatically retries to help ensure a high completion rate. Jobs typically complete within 24 hours of starting. A sketch of this workflow follows this list.
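
For example, the following is a minimal sketch of that workflow using the Vertex AI SDK for Python (vertexai.batch_prediction). The project ID, Cloud Storage paths, and model version are placeholders, and attribute names can vary across SDK versions; check the reference for your installed version.

```python
import time

import vertexai
from vertexai.batch_prediction import BatchPredictionJob

# Placeholder project, region, model version, and Cloud Storage paths.
vertexai.init(project="your-project-id", location="us-central1")

# Submit a single batch job that points at a JSONL file of requests.
job = BatchPredictionJob.submit(
    source_model="gemini-1.5-flash-002",
    input_dataset="gs://your-bucket/batch_requests.jsonl",
    output_uri_prefix="gs://your-bucket/batch_output/",
)
print(f"Submitted job: {job.resource_name}")

# Poll until the job leaves the queued and running states.
while not job.has_ended:
    time.sleep(60)
    job.refresh()

if job.has_succeeded:
    print(f"Results written to: {job.output_location}")
else:
    print(f"Job failed: {job.error}")
```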

Batch prediction is optimized for the following large-scale processing tasks:

  • Content generation: Generate product descriptions, social media posts, or other creative text in bulk.
  • Data annotation and classification: Classify user reviews, categorize documents, or perform sentiment analysis on a large corpus of text.
  • Offline analysis: Summarize articles, extract key information from reports, or translate documents at scale.

Gemini models that support batch predictions

The following base and tuned Gemini models support batch predictions:

Quotas and limits

Batch prediction has the following limitations:

  • Quota: The batch service doesn't have predefined quota limits. It uses a large, shared pool of resources that are dynamically allocated based on availability and real-time demand. During periods of high traffic, your batch requests might be queued until capacity becomes available.
  • Queue time: If a batch job is queued due to high traffic, it remains in the queue for up to 72 hours before it expires.
  • Request limits: A single batch job can include up to 200,000 requests. For inputs from Cloud Storage, the file size limit is 1 GB. A sketch of the input file format follows this list.
  • Processing time: The service processes batch jobs asynchronously; batch prediction is not designed for real-time applications. Most jobs complete within 24 hours after they start running, not including queue time. After 24 hours, the service cancels incomplete jobs, and you are charged only for completed requests.
  • Unsupported features: Batch prediction does not support Context Caching, RAG, or Global endpoints.
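
For Cloud Storage input, each line of the JSONL file is one standalone request. The following sketch builds a small input file in Python; it assumes the request wrapper and GenerateContent-style contents fields used by Vertex AI batch input files, and the prompts and generation settings are placeholders.

```python
import json

# Hypothetical prompts to process in one batch.
prompts = [
    "Write a product description for a stainless steel water bottle.",
    "Classify this review as positive, neutral, or negative: 'Arrived late but works great.'",
]

# Each line wraps a GenerateContent-style request body in a "request" field.
with open("batch_requests.jsonl", "w") as f:
    for prompt in prompts:
        line = {
            "request": {
                "contents": [{"role": "user", "parts": [{"text": prompt}]}],
                "generationConfig": {"temperature": 0.4},
            }
        }
        f.write(json.dumps(line) + "\n")
```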

Best practices

To get the most out of batch prediction with Gemini, follow these best practices:

  • Combine jobs: To maximize throughput, combine smaller jobs into one large job, within system limits. For example, submitting one batch job with 200,000 requests provides better throughput than 1,000 jobs with 200 requests each.
  • Monitor job status: You can monitor job progress by using the API, SDK, or UI. For more information, see monitor the job status. If a job fails, check the error messages to diagnose and troubleshoot the issue. A monitoring sketch follows this list.
  • Optimize for cost: Take advantage of the cost savings of batch processing for any tasks that don't require an immediate response.
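
As a sketch of SDK-based monitoring, the following snippet retrieves a previously submitted job by its resource name and reports whether it has finished. The project ID and job ID are placeholders, and the constructor and attribute names are based on the vertexai.batch_prediction module; verify them against your SDK version.

```python
import vertexai
from vertexai.batch_prediction import BatchPredictionJob

vertexai.init(project="your-project-id", location="us-central1")

# Hypothetical resource name of a previously submitted batch job.
job = BatchPredictionJob(
    "projects/your-project-id/locations/us-central1/batchPredictionJobs/1234567890"
)

if not job.has_ended:
    print("Job is still queued or running.")
elif job.has_succeeded:
    print(f"Job succeeded; output: {job.output_location}")
else:
    print(f"Job failed: {job.error}")
```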

What's next