Generative AI on Vertex AI rate limits

A quota restricts how much of a shared Google Cloud resource your Google Cloud project can use, including hardware, software, and network components. Quotas are part of a system that does the following:

  • Monitors your use or consumption of Google Cloud products and services.
  • Restricts your consumption of those resources, for reasons that include ensuring fairness and reducing spikes in usage.
  • Maintains configurations that automatically enforce prescribed restrictions.
  • Provides a means to request or make changes to the quota.

In most cases, when a quota is exceeded, the system immediately blocks access to the relevant Google Cloud resource, and the task that you're trying to perform fails. Quotas generally apply to each Google Cloud project and are shared across all applications and IP addresses that use that project.

Quotas by region and model

The requests per minute (RPM) quota applies to a base model and to all versions, identifiers, and tuned versions of that model. For example, each of the following pairs counts as two requests toward the RPM quota of the base model, gemini-1.0-pro:

  • A request to gemini-1.0-pro and a request to gemini-1.0-pro-001.
  • A request to gemini-1.0-pro-001 and a request to gemini-1.0-pro-002.
  • A request to gemini-1.0-pro-001 and a request to a tuned model based on it, such as one named my-tuned-chat-model.
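
The rollup rule above can be sketched in a few lines of Python. This is purely illustrative (not a Google SDK API); the model names and the tuned model my-tuned-chat-model come from the example in the text, and the version-suffix rule is an assumption based on the identifiers shown:

```python
# Sketch: how RPM quota accounting rolls requests up to a base model.
from collections import Counter

def base_model(model_id: str, tuned_model_bases: dict[str, str]) -> str:
    """Resolve a request's model identifier to the base model that owns the quota."""
    # A tuned model counts against the base model it was tuned from.
    if model_id in tuned_model_bases:
        model_id = tuned_model_bases[model_id]
    # A versioned identifier such as gemini-1.0-pro-001 counts against its
    # base model, gemini-1.0-pro (strip a trailing numeric version suffix).
    stem, _, suffix = model_id.rpartition("-")
    if stem and suffix.isdigit():
        return stem
    return model_id

# my-tuned-chat-model is the hypothetical tuned model from the text.
tuned = {"my-tuned-chat-model": "gemini-1.0-pro-001"}
requests = ["gemini-1.0-pro", "gemini-1.0-pro-001",
            "gemini-1.0-pro-002", "my-tuned-chat-model"]
rpm_counts = Counter(base_model(m, tuned) for m in requests)
print(rpm_counts["gemini-1.0-pro"])  # all four requests share one RPM quota
```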

The quotas apply to Generative AI on Vertex AI requests for a given Google Cloud project and supported region.

To view the quotas in the Google Cloud console, do the following:

  1. In the Google Cloud console, go to the IAM & Admin Quotas page.

  2. In the Filter field, specify the dimension or metric.
Dimension (model identifier)                               Metric (quota identifier for Gemini models)
base_model: gemini-1.5-flash, base_model: gemini-1.5-pro   You can request adjustments in the following:
All other models                                           You can adjust only one quota:

Choose a region to view the quota limits for each available model:

Rate limits

The following rate limits apply to the listed models across all regions for the metric generate_content_input_tokens_per_minute_per_base_model:

Base model Tokens per minute
base_model: gemini-1.5-flash 4M (4,000,000)
base_model: gemini-1.5-pro 4M (4,000,000)
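
One way to stay under a tokens-per-minute limit such as the 4,000,000 TPM above is to track usage client-side. The following sliding-window limiter is a minimal sketch; the class and method names are illustrative and not part of any Google SDK:

```python
# Sketch: a client-side sliding-window limiter for a tokens-per-minute quota.
import collections

class TokenRateLimiter:
    def __init__(self, tokens_per_minute: int):
        self.limit = tokens_per_minute
        self.window = collections.deque()  # (timestamp, tokens) pairs

    def try_acquire(self, tokens: int, now: float) -> bool:
        """Return True if `tokens` fits within the current 60-second window."""
        while self.window and now - self.window[0][0] >= 60:
            self.window.popleft()  # drop entries older than one minute
        used = sum(t for _, t in self.window)
        if used + tokens > self.limit:
            return False  # caller should wait or shed load
        self.window.append((now, tokens))
        return True

limiter = TokenRateLimiter(4_000_000)
print(limiter.try_acquire(3_500_000, now=0.0))   # True
print(limiter.try_acquire(1_000_000, now=10.0))  # False: would exceed 4M in the window
print(limiter.try_acquire(1_000_000, now=61.0))  # True: the first entry has aged out
```

Note that client-side throttling only reduces the chance of hitting the server-side quota; the service remains the source of truth.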

Batch quotas

The following quotas and limits are the same across the regions for Generative AI on Vertex AI batch prediction jobs:

Quota Value
textembedding_gecko_concurrent_batch_prediction_jobs 4

Custom-trained model quotas

The following quotas apply to Generative AI on Vertex AI tuned models for a given project and region:

Quota Value
Restricted image training TPU V3 pod cores per region
  * Supported region: europe-west4
Restricted image training Nvidia A100 80GB GPUs per region
  * Supported region: us-central1
  * Supported region: us-east4

* Tuning scenarios reserve accelerators in specific regions, so tuning quotas are supported only in those regions and must be requested there.

Online evaluation quotas

The online evaluation service uses the text-bison model as an autorater, with Google-proprietary prompts and mechanisms, to ensure consistent and objective evaluation for model-based metrics.

A single evaluation request for a model-based metric might result in multiple underlying requests to the online prediction service. Each model's quota is calculated per project, which means that any requests directed to text-bison, whether for model inference or for model-based evaluation, count toward the same quota. The quota for the evaluation service and the quota for the underlying autorater model are set separately, as shown in the following table.

Request quota Default quota
Online evaluation service requests per minute 1,000 requests per project per region
Online prediction requests per minute for base_model: text-bison 1,600 requests per project per region

If you receive an error related to quotas while using the online evaluation service, you might need to file a quota increase request. See View and Manage Quotas for more information.

Limit Value
Online evaluation service request timeout 60 seconds

First-time users of the online evaluation service within a new project might experience an initial setup delay generally up to two minutes. This is a one-time process. If your first request fails, wait a few minutes and then retry. Subsequent evaluation requests typically complete within 60 seconds.
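The retry guidance above can be wrapped in a small helper. This is a generic sketch, not an official client: send_eval_request is a placeholder for your actual call to the online evaluation service, and the delay and attempt count are illustrative:

```python
# Sketch: retrying a first evaluation request to absorb the one-time setup delay.
import time

def call_with_retries(send_eval_request, attempts=3, base_delay=120.0, sleep=time.sleep):
    """Call the service, waiting between attempts if a request fails."""
    for attempt in range(attempts):
        try:
            return send_eval_request()
        except Exception:
            if attempt == attempts - 1:
                raise  # exhausted retries; surface the error
            sleep(base_delay)  # "wait a few minutes and then retry"

# Demo with a fake request that fails once, as a first request might.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("initial setup in progress")
    return "ok"

print(call_with_retries(flaky, sleep=lambda s: None))  # ok
```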

The maximum input and output tokens for model-based metrics are limited by the model used as the autorater. See Model information | Generative AI on Vertex AI | Google Cloud for the limits of the relevant models.

LlamaIndex on Vertex AI quotas for RAG

The following quotas are for performing retrieval-augmented generation (RAG) by using LlamaIndex on Vertex AI:

Service Quota
LlamaIndex on Vertex AI data management APIs 60 requests per minute (RPM)
RetrievalContexts API 1,500 RPM
base_model: textembedding-gecko 1,500 RPM
Online prediction requests1 30,000 RPM
Data ingestion 1,000 files

1This quota applies for public endpoints only. Private endpoints have unlimited requests per minute.

Pipeline evaluation quotas

If you receive an error related to quotas while using the evaluation pipelines service, you might need to file a quota increase request. See View and Manage Quotas for more information.

The evaluation pipelines service uses Vertex AI Pipelines to run PipelineJobs. See relevant quotas for Vertex AI Pipelines. The following are general quota recommendations:

Service Quota Recommendation
Vertex AI API Concurrent LLM batch prediction jobs per region Pointwise: 1 * num_concurrent_pipelines; Pairwise: 2 * num_concurrent_pipelines
Vertex AI API Evaluation requests per minute per region 1000 * num_concurrent_pipelines

Additionally, when calculating model-based evaluation metrics, the autorater might hit quota issues. The relevant quota depends on which autorater was used:

Quota Base model Recommendation
Online prediction requests per base model per minute per region text-bison 60 * num_concurrent_pipelines
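
The recommended values above are simple multiples of the number of pipelines you plan to run concurrently. A sketch of the arithmetic (the function and key names are illustrative; the multipliers come from the tables in this section):

```python
# Sketch: deriving quota recommendations from num_concurrent_pipelines.
def recommended_quotas(num_concurrent_pipelines: int, pairwise: bool) -> dict:
    # Pairwise evaluation runs two batch prediction jobs per pipeline.
    batch_multiplier = 2 if pairwise else 1
    return {
        "concurrent_llm_batch_prediction_jobs": batch_multiplier * num_concurrent_pipelines,
        "evaluation_requests_per_minute": 1000 * num_concurrent_pipelines,
        "text_bison_online_prediction_rpm": 60 * num_concurrent_pipelines,
    }

print(recommended_quotas(5, pairwise=True))
# {'concurrent_llm_batch_prediction_jobs': 10, 'evaluation_requests_per_minute': 5000, 'text_bison_online_prediction_rpm': 300}
```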

Vertex AI Pipelines

Each tuning job uses Vertex AI Pipelines. For more information, see Vertex AI Pipelines quotas and limits.

Dynamic shared quota

For services that support dynamic shared quota, Google Cloud distributes on-demand capacity among all queries being processed. This capability eliminates the need for you to submit quota increase requests (QIRs).

To apply a consumer override to your project as a cost control measure and to prevent budget overruns, see Creating a consumer quota override.

If you require a specified maximum amount of throughput, contact your Google Cloud account representative about provisioned throughput.

You can also monitor your usage through Quotas & System Limits in your Google Cloud console.

For information about models that support dynamic shared quota, see Use the Claude models from Anthropic.

Example of how dynamic shared quota works

Google Cloud considers the available capacity in a specific area, such as North America, and how many customers are sending requests. Consider customer A, who sends 25 queries per minute (QPM), and customer B, who sends 25 QPM. The service can support 100 QPM. If customer A increases the rate of their queries to 75 QPM, then dynamic shared quota supports the increase. If customer A increases the rate of their queries to 100 QPM, then dynamic shared quota throttles customer A down to 75 QPM so that customer B can continue to be served at 25 QPM.
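
One simple way to model the behavior in this example is the following sketch. The actual allocation policy is not specified here; this only reproduces the numbers from the scenario above, throttling the heaviest sender just enough to keep serving everyone else:

```python
# Sketch: dynamic shared quota behavior from the worked example.
def allocate(capacity_qpm: int, demand: dict[str, int]) -> dict[str, int]:
    total = sum(demand.values())
    if total <= capacity_qpm:
        return dict(demand)  # everyone gets what they asked for
    # Throttle the largest sender down to whatever headroom remains
    # after serving all other customers in full.
    heavy = max(demand, key=demand.get)
    others = total - demand[heavy]
    alloc = dict(demand)
    alloc[heavy] = capacity_qpm - others
    return alloc

print(allocate(100, {"A": 75, "B": 25}))   # {'A': 75, 'B': 25}
print(allocate(100, {"A": 100, "B": 25}))  # {'A': 75, 'B': 25}: A is throttled to 75
```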

To troubleshoot errors that might occur with the use of dynamic shared quota, see Troubleshoot quota errors.

Quota increases

If you want to increase any of your quotas for Generative AI on Vertex AI, you can use the Google Cloud console to request a quota increase. To learn more about quotas, see Work with quotas.

What's next