Quotas and system limits

This document lists the quotas and system limits that apply to Vertex AI Agent Builder.

Quotas have default values, but you can typically request adjustments.
System limits are fixed values that can't be changed.

Google Cloud uses quotas to help ensure fairness and reduce spikes in resource use and availability. A quota restricts how much of a Google Cloud resource your Google Cloud project can use. Quotas apply to a range of resource types, including hardware, software, and network components. For example, quotas can restrict the number of API calls to a service, the number of load balancers used concurrently by your project, or the number of projects that you can create. Quotas protect the community of Google Cloud users by preventing the overloading of services. Quotas also help you to manage your own Google Cloud resources.

The Cloud Quotas system does the following:

- Monitors your consumption of Google Cloud products and services.
- Restricts your consumption of those resources.
- Provides a way to request changes to the quota value and automate quota adjustments.

In most cases, when you attempt to consume more of a resource than its quota allows, the system blocks access to the resource, and the task that you're trying to perform fails.

Quotas generally apply at the Google Cloud project level. Your use of a resource in one project doesn't affect your available quota in another project. Within a Google Cloud project, quotas are shared across all applications and IP addresses.

For more information, see the Cloud Quotas overview.

Vertex AI Agent Engine quotas

The following quotas apply to Vertex AI Agent Engine for a given project in each region:

Description	Quota	Metric
Create, delete, or update Vertex AI Agent Engine resources per minute	10	`aiplatform.googleapis.com/reasoning_engine_service_write_requests`
Create, delete, or update Vertex AI Agent Engine sessions per minute	100	`aiplatform.googleapis.com/session_write_requests`
`Query` or `StreamQuery` Vertex AI Agent Engine per minute	90	`aiplatform.googleapis.com/reasoning_engine_service_query_requests`
Append event to Vertex AI Agent Engine sessions per minute	300	`aiplatform.googleapis.com/session_event_append_requests`
Maximum number of Vertex AI Agent Engine resources	100	`aiplatform.googleapis.com/reasoning_engine_service_entities`
Create, delete, or update Vertex AI Agent Engine memory resources per minute	100	`aiplatform.googleapis.com/memory_bank_write_requests`
Get, list, or retrieve from Vertex AI Agent Engine Memory Bank per minute	300	`aiplatform.googleapis.com/memory_bank_read_requests`
Sandbox environment (Code Execution) execute requests per minute	1000	`aiplatform.googleapis.com/sandbox_environment_execute_requests`
Sandbox environment (Code Execution) entities per region	1000	`aiplatform.googleapis.com/sandbox_environment_entities`
A2A Agent post requests like `sendMessage` and `cancelTask` per minute	60	`aiplatform.googleapis.com/a2a_agent_post_requests`
A2A Agent get requests like `getTask` and `getCard` per minute	600	`aiplatform.googleapis.com/a2a_agent_get_requests`
Concurrent live bidirectional connections using the `BidiStreamQuery` API per minute	10	`aiplatform.googleapis.com/reasoning_engine_service_concurrent_query_requests`
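
One way to stay under a per-minute quota such as the 90 `Query` or `StreamQuery` requests listed above is to throttle calls on the client. The following is a minimal sketch, not part of any Google Cloud SDK: the `PerMinuteLimiter` class and its sliding 60-second window are assumptions for illustration, and the service may measure its window differently.

```python
import time
from collections import deque


class PerMinuteLimiter:
    """Hypothetical client-side limiter: allows at most max_per_minute
    calls in any sliding 60-second window."""

    def __init__(self, max_per_minute, clock=time.monotonic):
        self.max_per_minute = max_per_minute
        self.clock = clock      # injectable so the limiter can be tested
        self._sent = deque()    # timestamps of calls that were allowed

    def try_acquire(self):
        """Return True and record the call if it fits in the window."""
        now = self.clock()
        # Drop timestamps that have left the 60-second window.
        while self._sent and now - self._sent[0] >= 60.0:
            self._sent.popleft()
        if len(self._sent) < self.max_per_minute:
            self._sent.append(now)
            return True
        return False


# Example: mirror the default query quota of 90 requests per minute.
limiter = PerMinuteLimiter(max_per_minute=90)
if limiter.try_acquire():
    pass  # send the query here
```

Because quotas are shared across all applications in a project, a limiter like this only bounds the traffic of the process that owns it.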

Quota management for production loads

As your traffic scales, you will likely need to request increases for specific Vertex AI API quotas to avoid 429 Resource Exhausted errors. You can proactively configure your runtime and increase your quotas to keep your Vertex AI Agent Engine Runtime responsive, scalable, and reliable under production load.
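
Until a quota increase is in place, a client can soften occasional 429 errors by retrying with exponential backoff. The following is a minimal sketch under stated assumptions: `QuotaExceededError` is a hypothetical stand-in for whatever exception your client library raises on a 429 response, and `call_with_backoff` is an illustrative helper, not an SDK function.

```python
import random
import time


class QuotaExceededError(Exception):
    """Hypothetical stand-in for an HTTP 429 (Resource Exhausted) error."""


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, max_delay=60.0,
                      sleep=time.sleep):
    """Call fn(), retrying on QuotaExceededError.

    The wait doubles after each failed attempt (capped at max_delay),
    plus random jitter of up to half the delay so that many clients
    don't retry in lockstep.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except QuotaExceededError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
```

Retries only smooth over short bursts. Sustained traffic above the quota still needs a quota increase.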

For information about how to optimize and scale Vertex AI Agent Engine performance, see Optimize and scale Vertex AI Agent Engine Runtime performance.

Use the following steps to estimate your peak quota requirements:

1. Define your variables:
   - U: Peak concurrent users (for example, 250).
   - X: Average requests per user per minute (for example, 2).
   - Y: Average session events generated per request (for example, 12 for a complex chain that involves multiple tool calls).
2. Calculate your peak load:
   - Peak queries per minute (QPM): U * X
   - Peak session events per minute: Peak QPM * Y
3. Request a quota with a buffer: When you request a quota increase, add a buffer (for example, 50%) on top of your calculated peak to handle unexpected spikes.
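
The estimation steps above can be sketched as a small calculation. The function and field names below are illustrative assumptions. Only the formulas (U * X, peak QPM * Y, and the buffer) come from the steps.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class QuotaEstimate:
    peak_qpm: float
    peak_session_events_per_minute: float
    recommended_query_quota: int
    recommended_session_event_quota: int


def estimate_quotas(peak_users, requests_per_user_per_minute,
                    events_per_request, buffer=0.5):
    """Estimate peak load and buffered quota values.

    peak_users is U, requests_per_user_per_minute is X, and
    events_per_request is Y. buffer=0.5 adds 50% headroom.
    """
    peak_qpm = peak_users * requests_per_user_per_minute   # U * X
    peak_events = peak_qpm * events_per_request            # peak QPM * Y
    return QuotaEstimate(
        peak_qpm=peak_qpm,
        peak_session_events_per_minute=peak_events,
        # Round up so the requested quota never falls below the target.
        recommended_query_quota=math.ceil(peak_qpm * (1 + buffer)),
        recommended_session_event_quota=math.ceil(peak_events * (1 + buffer)),
    )


# Example values from the steps: U = 250, X = 2, Y = 12.
estimate = estimate_quotas(250, 2, 12)
print(estimate)
```

With the example values, this gives a peak of 500 QPM and 6,000 session events per minute, and buffered values of 750 and 9,000.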

The following table shows calculations for key performance-related quotas for Vertex AI Agent Engine, using the example values from the preceding steps: 250 peak concurrent users (U), 2 requests per user per minute (X), and 12 session events per request (Y).

Quota name	Quota description	Base calculation (peak)	Recommended value (with 50% buffer)
Query Agent Engine per minute (`aiplatform.googleapis.com/reasoning_engine_service_query_requests`)	The total number of `query` or `stream_query` calls that your agent can receive per minute.	`250 users * 2 req/min = 500 QPM`	`500 * 1.5 = 750`
Append session events per minute (`aiplatform.googleapis.com/session_event_append_requests`)	The number of turns or events within all ongoing sessions. A single query can generate multiple session events in a chain, for example: Call LLM. LLM response: use tool. Execute tool. Call the LLM with tool response. LLM gives the final response.	`500 QPM * 12 events/req = 6,000`	`6,000 * 1.5 = 9,000`
Session writes per minute (`aiplatform.googleapis.com/session_write_requests`)	The rate of creating or updating session resources. This is typically less than or equal to the query rate.	Typically <= Peak QPM (`500`)	Typically <= query quota (`750`)
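
To see which quotas the example workload would outgrow, you can compare the recommended values against the default quotas listed earlier in this document. The following is an illustrative sketch: the helper name is an assumption, and the session write target uses the upper bound of 750 from the table, which your own workload may not reach.

```python
# Default per-minute quotas, copied from the quota table in this document.
DEFAULT_QUOTAS = {
    "aiplatform.googleapis.com/reasoning_engine_service_query_requests": 90,
    "aiplatform.googleapis.com/session_write_requests": 100,
    "aiplatform.googleapis.com/session_event_append_requests": 300,
}

# Recommended values for the example workload (with the 50% buffer).
RECOMMENDED = {
    "aiplatform.googleapis.com/reasoning_engine_service_query_requests": 750,
    "aiplatform.googleapis.com/session_write_requests": 750,  # upper bound
    "aiplatform.googleapis.com/session_event_append_requests": 9000,
}


def quotas_needing_increase(defaults, recommended):
    """Return {metric: (default, target)} for targets above the default."""
    return {
        metric: (defaults[metric], target)
        for metric, target in recommended.items()
        if target > defaults[metric]
    }


for metric, (default, target) in quotas_needing_increase(
        DEFAULT_QUOTAS, RECOMMENDED).items():
    print(f"{metric}: default {default}, request {target}")
```

For the example workload, all three metrics exceed their defaults, so each one would need a quota adjustment request.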

Request a quota adjustment

To adjust most quotas, use the Google Cloud console. For more information, see Request a quota adjustment.

Vertex AI Agent Engine Express mode quotas

Vertex AI Free Tier express mode users can use Vertex AI Agent Engine services at no cost, subject to the following quotas. For more information about the Free Tier and express mode, see the Vertex AI in Express mode overview. The following quotas apply to Vertex AI Agent Engine for a given express mode project in each region:

Description	Quota	Metric
Maximum number of Vertex AI Agent Engine resources	10	`aiplatform.googleapis.com/reasoning_engine_service_entities`
Create, delete, or update Vertex AI Agent Engine resources per minute	10	`aiplatform.googleapis.com/reasoning_engine_service_write_requests`
`Query` or `StreamQuery` Vertex AI Agent Engine per minute	10	`aiplatform.googleapis.com/reasoning_engine_service_query_requests`
Concurrent live bidirectional connections using the `BidiStreamQuery` API per minute	1	`aiplatform.googleapis.com/reasoning_engine_service_concurrent_query_requests`
Create, delete, or update Vertex AI Agent Engine sessions per minute	10	`aiplatform.googleapis.com/session_write_requests`
Append event to Vertex AI Agent Engine sessions per minute	30	`aiplatform.googleapis.com/session_event_append_requests`
Create, delete, or update Vertex AI Agent Engine memory resources per minute	10	`aiplatform.googleapis.com/memory_bank_write_requests`
Get, list, or retrieve from Vertex AI Agent Engine Memory Bank per minute	10	`aiplatform.googleapis.com/memory_bank_read_requests`