Provisioned Throughput is a fixed-cost subscription, with a monthly or weekly term, that reserves throughput for supported generative AI models on Vertex AI. To reserve your throughput, you must specify the model and the available location in which the model runs.
This page explains when to use Provisioned Throughput, how it works, and how to subscribe.
Supported models
The following tables show the models that support Provisioned Throughput, the throughput for each generative AI scale unit (GSU), and the burndown rates for each model.
Google models
This table shows the throughput, purchase increment, and burndown rates for Google models that support Provisioned Throughput. The Google models are measured in characters per second, which is defined as your prompt input and generated text output characters across all requests per second.
Model | Throughput per GSU (chars/sec) | Minimum GSU purchase increment | Burndown rates |
---|---|---|---|
Gemini 1.5 Flash | Less than or equal to 128,000 context window: 54,000; greater than 128,000 context window: 27,000 | 1 | Less than or equal to 128,000 context window: 1 input char = 1 char; 1 output char = 4 chars; 1 image = 1,067 chars; 1 video per second = 1,067 chars; 1 audio per second = 107 chars. Greater than 128,000 context window: 1 input char = 2 chars; 1 output char = 8 chars; 1 image = 2,134 chars; 1 video per second = 2,134 chars; 1 audio per second = 214 chars |
Gemini 1.5 Pro | 800 | 1 | Less than or equal to 128,000 context window: 1 input char = 1 char; 1 output char = 3 chars; 1 image = 1,052 chars; 1 video per second = 1,052 chars; 1 audio per second = 100 chars. Greater than 128,000 context window: 1 input char = 2 chars; 1 output char = 6 chars; 1 image = 2,104 chars; 1 video per second = 2,104 chars; 1 audio per second = 200 chars |
Gemini 1.0 Pro | 8,000 | 1 | 1 input char = 1 char; 1 output char = 3 chars; 1 image = 20,000 chars; 1 video per second = 16,000 chars |
Imagen 3 | 0.025 (throughput is measured in images/sec instead of chars/sec) | 1 | Only output images count toward your Provisioned Throughput quota. |
Imagen 3 Fast | 0.05 (throughput is measured in images/sec instead of chars/sec) | 1 | Only output images count toward your Provisioned Throughput quota. |
Imagen 2 | 0.05 (throughput is measured in images/sec instead of chars/sec) | 1 | Only output images count toward your Provisioned Throughput quota. |
Imagen 2 Edit | 0.05 (throughput is measured in images/sec instead of chars/sec) | 1 | Only output images count toward your Provisioned Throughput quota. |
MedLM medium | 2,000 | 1 | 1 input char = 1 char; 1 output char = 2 chars |
MedLM large | 200 | 1 | 1 input char = 1 char; 1 output char = 3 chars |
MedLM large 1.5 | 200 | 1 | 1 input char = 1 char; 1 output char = 3 chars |
For more information about supported locations, see Available locations.
You can upgrade to new models as they are made available. For information about model availability and discontinuation dates, see Google models.
Preview features
The preview features for Provisioned Throughput require access approval. To request access, fill out and submit the Provisioned Throughput access control form.
The Preview version provides the following for Google models:
- Provisioned Throughput can be applied to both base models and supervised fine-tuned versions of those base models.
- Supervised fine-tuned model endpoints and their corresponding base model count toward the same Provisioned Throughput quota. For example, Provisioned Throughput purchased for gemini-1.5-pro-002 for a specific project prioritizes requests that are made from supervised fine-tuned versions of gemini-1.5-pro-002 created within that project. Use the appropriate header to control traffic behavior.
- Provisioned Throughput can be purchased for a one-week term instead of a monthly subscription, with the option to provide a start date within two weeks of placing your order.
Google legacy models
See Legacy models that support Provisioned Throughput.
Partner models
This table shows the throughput, purchase increment, and burndown rates for partner models that support Provisioned Throughput. Claude models are measured in tokens per second, which is defined as a total of input and output tokens across all requests per second.
Model | Throughput per GSU (tokens/sec) | Minimum GSU purchase | GSU purchase increment | Burndown rates |
---|---|---|---|---|
Anthropic's Claude 3.5 Sonnet v2 | 350 | 25 | 1 | 1 input token = 1 token 1 output token = 5 tokens |
Anthropic's Claude 3.5 Haiku | 2,000 | 10 | 1 | 1 input token = 1 token 1 output token = 5 tokens |
Anthropic's Claude 3 Opus | 70 | 35 | 1 | 1 input token = 1 token 1 output token = 5 tokens |
Anthropic's Claude 3 Haiku | 4,200 | 5 | 1 | 1 input token = 1 token 1 output token = 5 tokens |
Anthropic's Claude 3.5 Sonnet | 350 | 25 | 1 | 1 input token = 1 token 1 output token = 5 tokens |
Anthropic's Claude 3 Sonnet | 350 | 25 | 1 | 1 input token = 1 token 1 output token = 5 tokens |
For more information about supported locations, see Available locations.
When to use Provisioned Throughput
If any of the following considerations apply to your use case, consider using Provisioned Throughput:
- Your critical workloads consistently require high throughput. Throughput measurement depends on the model.
- You are building real-time generative AI production applications, such as chatbots and agents.
- Your throughput needs exceed 20,000 characters per second.
- You want to provide a consistent and predictable experience for users of your applications.
- You want deterministic generative AI costs by paying a fixed monthly or weekly price with control of overages.
Provisioned Throughput is one of two ways to consume your generative AI models. The second way is pay-as-you-go, which is also referred to as on-demand.
How Provisioned Throughput is measured
This section explains the concepts of generative AI scale unit (GSU) and burndown rates. Provisioned Throughput is calculated and priced using GSUs and burndown rates.
A generative AI scale unit (GSU) is a measure of throughput for your prompts and responses. The number of GSUs that you purchase specifies how much throughput is provisioned for a model.
To produce a standard unit across models, all inputs and outputs are converted to input characters per second (throughput) using model-specific ratios called burndown rates.
Different models use different amounts of throughput. For information about the minimum GSU purchase amount and increments for each model, see Supported models and burndown rates in this document.
This equation demonstrates how throughput is calculated:
inputs_per_query = inputs_across_modalities_converted_using_burndown_rates
outputs_per_query = outputs_across_modalities_converted_using_burndown_rates
throughput_per_second = (inputs_per_query + outputs_per_query) * queries_per_second
The calculated throughput per second determines how many GSUs that you need for your use case.
Example of estimating your Provisioned Throughput needs
To estimate your Provisioned Throughput needs, use the estimation tool in the Google Cloud console. The following example illustrates the process of estimating the amount of Provisioned Throughput for your model. The region isn't considered in the estimation calculations.
- Gather your requirements.
  In this example, your requirement is to ensure that you can send 2,000 characters with 2 images and receive 300 characters of output for 10 queries per second using gemini-1.5-flash.
  This step means that you understand your use case, because you have identified the size of your inputs and outputs, the number of queries per second (QPS), and your model.
- To estimate your throughput, specify your model. In this example, your model is gemini-1.5-flash.
- Specify the type of input, and identify the burndown rate. Use the burndown rates table to identify the burndown rate based on your type of input.
  An image's burndown rate for the gemini-1.5-flash model is 1,067 characters (for a context window of 128,000 or less).
- Calculate your throughput.
  - Multiply the number of images by the burndown rate for the input type for your specific model.
    2 images * 1,067 input characters per image = 2,134 input characters
  - Your total number of output characters is 300. Return to the burndown rates table, and find the burndown rate for output characters (four characters per output character) for your specific model (gemini-1.5-flash).
    300 output characters * 4 characters per output character = 1,200 converted input characters
  - Add your totals together.
    2,000 input characters + 2,134 converted input characters for the images + 1,200 converted input characters for the output = 5,334 converted input characters per query
  - Multiply the characters per query by your expected queries per second to get the total throughput per second.
    5,334 converted input characters per query * 10 QPS = 53,340 total converted input characters per second
- Calculate your GSUs.
  The GSUs are the total throughput per second divided by the throughput per GSU from the burndown table.
  53,340 total converted input chars per second ÷ 54,000 throughput per GSU = 0.988 GSUs
  The minimum GSU purchase increment for gemini-1.5-flash is 1, so rounding 0.988 up to 1 GSU meets your requirement.
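The estimate above can be reproduced with a few lines of Python. This is a minimal sketch, not the console's estimation tool: the constants are copied from the Google models table for gemini-1.5-flash with a context window of 128,000 or less, and the function name `estimate_gsus` is a hypothetical helper introduced only for illustration.

```python
import math

# Figures from the Google models table for gemini-1.5-flash
# (context window of 128,000 or less).
THROUGHPUT_PER_GSU = 54_000   # chars/sec provided by one GSU
IMAGE_BURNDOWN = 1_067        # chars charged per input image
OUTPUT_CHAR_BURNDOWN = 4      # chars charged per output char


def estimate_gsus(input_chars, images, output_chars, qps):
    """Return (converted chars per second, exact GSUs, GSUs to purchase)."""
    per_query = (
        input_chars                             # 1 input char = 1 char
        + images * IMAGE_BURNDOWN               # images converted to chars
        + output_chars * OUTPUT_CHAR_BURNDOWN   # output converted to chars
    )
    per_second = per_query * qps
    exact = per_second / THROUGHPUT_PER_GSU
    # Round up to the purchase increment of 1 GSU.
    return per_second, exact, math.ceil(exact)


per_second, exact, to_buy = estimate_gsus(2_000, 2, 300, 10)
print(per_second, round(exact, 3), to_buy)  # 53340 0.988 1
```

For a different model, replace the three constants with that model's row in the table. Video and audio inputs would add further terms of the same form.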
What to consider before subscribing
To help you decide whether you want to subscribe to Provisioned Throughput, review this list of details about the subscription:
You can't cancel your order.
Your Provisioned Throughput purchase is a commitment, which means that you can't cancel the order. However, you can increase the number of purchased GSUs. If you accidentally purchase a commitment or there's a problem with your configuration, contact your Google Cloud account representative for assistance.
You can auto-renew your subscription.
When you submit your order, you can choose to auto-renew your subscription at the end of its term, or let the subscription expire. You can cancel the auto-renew process. To stop your subscription before it auto-renews, cancel the auto-renewal 30 days before the start of the next term.
You can configure monthly subscriptions to renew automatically each month. Weekly terms don't support automatic renewal.
If you need assistance with this process, contact your Google Cloud account representative.
You can change your model version or region with notice.
Provisioned Throughput is enabled after you've chosen your project, region, model, and version. You can change your model version within the same model publisher or region with a 10-business-day notice by contacting your Google Cloud account representative for assistance. For example, you can switch between Google's models. You can switch between partner A's models. You can switch between partner B's models. You can't switch between Google, partner A, and partner B's models.
There is no downtime when you switch to Provisioned Throughput from pay-as-you-go.
There is no downtime when you switch between models for a Provisioned Throughput order. However, lead time is required to acquire the throughput.
By default, the overage is billed as pay-as-you-go.
If your throughput exceeds your Provisioned Throughput order amount, overages are processed and billed as pay-as-you-go. You can control overages on a per-request basis. For more information, see Use the REST API.
Requests are prioritized.
Requests from Provisioned Throughput customers are prioritized and serviced first before on-demand requests.
You must commit to a minimum usage and payment.
Minimum usage is dependent on the generative AI model that you select. Any usage beyond the purchased throughput rate isn't assured and is serviced on a reasonable-efforts basis.
Throughput doesn't accumulate.
Any unused throughput doesn't accumulate or carry over to the next month.
Provisioned Throughput is measured on characters or tokens per second.
Provisioned Throughput is measured on characters or tokens per second, not on queries per minute (QPM). As a result, the amount of Provisioned Throughput that you need depends on your use case's query size and QPM.
Provisioned Throughput checks your quota.
Your Provisioned Throughput quota is checked each time you make a request within your quota window. For gemini-1.5-flash-002 and gemini-1.5-pro-002 models, the quota window is 30 seconds. This means that you might temporarily experience prioritized traffic that exceeds your quota amount on a per-second basis in some cases, but you shouldn't exceed your quota on a 30-second basis. The quota window for other models is one minute.
Supervised fine-tuned model endpoints and their corresponding base model count toward the same Provisioned Throughput quota. This is a Preview feature. To request access, fill out and submit the Provisioned Throughput access control form.
For example, Provisioned Throughput purchased for gemini-1.5-pro-002 for a specific project prioritizes requests made from supervised fine-tuned versions of gemini-1.5-pro-002 created within that project. Use the appropriate header to control traffic behavior.
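To make the quota-window idea concrete, the following sketch checks whether per-second consumption fits within a windowed budget. It is illustrative only and doesn't describe how Vertex AI enforces quota: it assumes fixed, non-overlapping windows (the text above doesn't say whether windows are fixed or sliding), and the function names are hypothetical.

```python
def window_totals(per_second_usage, window_seconds):
    """Sum consumption over consecutive, non-overlapping windows."""
    return [
        sum(per_second_usage[i:i + window_seconds])
        for i in range(0, len(per_second_usage), window_seconds)
    ]


def within_quota(per_second_usage, quota_per_second, window_seconds=30):
    """True if every window's total stays within quota_per_second * window.

    A trailing partial window is compared against the full-window budget,
    which is lenient but adequate for this sketch.
    """
    budget = quota_per_second * window_seconds
    totals = window_totals(per_second_usage, window_seconds)
    return all(total <= budget for total in totals)


# One GSU of gemini-1.5-flash provides 54,000 chars/sec. A 5-second burst
# at double that rate still fits, because the other 25 seconds are quieter.
usage = [108_000] * 5 + [40_000] * 25
print(within_quota(usage, 54_000))  # True
```

The example shows why a short per-second spike can still be prioritized: only the total across the window is compared with the budget.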
Purchase Provisioned Throughput
This section provides the permissions you must have to place or to view a Provisioned Throughput order, and the instructions for placing and viewing your orders.
Permissions
To subscribe to Provisioned Throughput, you must have one of the following permissions assigned to your project, which lets you list and place new orders.
- aiplatform.googleapis.com/provisionedThroughputAdmin: Specific to Provisioned Throughput.
- aiplatform.googleapis.com/admin: Gives administrative rights to every resource in Vertex AI.
This role lets you only list your orders:
- aiplatform.googleapis.com/viewer
Place a Provisioned Throughput order
Before you place your order to use Imagen models, submit the Request to grant permissions form to be granted permissions.
Before you place an order to use MedLM-large-1.5, contact your Google Cloud account representative to request access.
If you expect your QPM to exceed 30,000, then to maximize your Provisioned Throughput order, request an increase to your default Vertex AI system quota using the following information:
- Service: The Vertex AI API.
- Name: Online prediction requests per minute per region
- Service type: A quota.
- Dimensions: The region where you ordered Provisioned Throughput.
- Value: This is your chosen online-prediction traffic limit.
Follow these steps to purchase Provisioned Throughput:
Console
- In the Google Cloud console, go to the Provisioned Throughput page.
- To start a new order, click Create.
- Enter an Order name.
- Select the Model.
- Select the Region.
- Enter the Number of generative AI scale units (GSUs) that you want to purchase. If you need to estimate the number of GSUs, click the Estimation tool.
  - Select your Model.
  - Enter the number of Queries per second.
  - Enter the number of Input characters per query.
  - Enter the number of Input images per query.
  - Enter the number of Video seconds per query.
  - Enter the number of Audio seconds per query.
  - Enter the number of Output characters per query.
  - If you want to use the values that you entered into the estimation tool, click Use calculated.
- Select your Term.
  If you choose one week, you can provide a start date and time that falls within two weeks of placing the order. If you don't provide a start date and time, the order is processed as soon as capacity is confirmed to be available. Requested start dates and times are processed on a best-effort basis, and orders aren't guaranteed to be fulfilled by these dates until the order status is set to Approved.
  If your requested start date is too close to the current date, your order might be approved and activated after your requested start date. In that case, your end date remains seven days from the activation date.
- Select your Renewal option.
- Click Continue.
- In the Summary section, review the price and throughput estimates for your order. Read the terms listed and linked in the form.
- To finalize your order, click Confirm.
Check order status
After you submit your Provisioned Throughput order, the order status might appear as one of the following:
- Pending review: You placed your order. Because approval depends on available capacity to provision your order, your order is waiting for review and approval. For more information about the status of your pending order, contact your Google Cloud account representative.
- Approved: Google has approved your order.
- Active: Google has activated your order, and billing has started.
- Expired: Your order has expired.
View Provisioned Throughput orders
Follow these steps to view your Provisioned Throughput orders:
Console
- In the Google Cloud console, go to the Provisioned Throughput page.
- Select the Region. Your list of orders appears.
Use Provisioned Throughput
This section explains how to control overages or bypass Provisioned Throughput and how to monitor the usage of Provisioned Throughput.
Control overages or bypass Provisioned Throughput
Use the REST API to control overages when you exceed your purchased throughput or to bypass Provisioned Throughput on a per-request basis.
Read through each option to determine what you must do to meet your use case.
Default behavior
If you exceed your purchased amount of throughput, the overages go to on-demand and are billed at the pay-as-you-go rate. After your Provisioned Throughput order is active, the default behavior takes place automatically. You don't have to change your code to begin consuming your order.
This curl example demonstrates the default behavior.
! curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
$URL \
-d '{"contents": [{"role": "user", "parts": [{"text": "Hello."}]}]}'
Use only Provisioned Throughput
If you are managing costs by avoiding on-demand charges, use only Provisioned Throughput. Requests that exceed the Provisioned Throughput order amount return a 429 error.
This curl example demonstrates how you can use the REST API to use your Provisioned Throughput subscription only, with overages returning a 429 error.
Set the X-Vertex-AI-LLM-Request-Type header to dedicated.
! curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
-H "X-Vertex-AI-LLM-Request-Type: dedicated" \
$URL \
-d '{"contents": [{"role": "user", "parts": [{"text": "Hello."}]}]}'
Use only pay-as-you-go
This is also referred to as using on-demand. Requests bypass the Provisioned Throughput order and are sent directly to pay-as-you-go. This might be useful for experiments or applications that are in development.
This curl example demonstrates how you can use the REST API to bypass Provisioned Throughput, and use only pay-as-you-go.
Set the X-Vertex-AI-LLM-Request-Type header to shared.
! curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
-H "X-Vertex-AI-LLM-Request-Type: shared" \
$URL \
-d '{"contents": [{"role": "user", "parts": [{"text": "Hello."}]}]}'
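The three curl examples above differ only in the X-Vertex-AI-LLM-Request-Type header, so the same choice can be sketched in Python with the standard library. This is a hedged illustration under assumptions: `build_headers`, `build_request`, and the `mode` names are hypothetical helpers, and the endpoint URL and access token (the `$URL` and gcloud token in the curl examples) must be supplied by the caller.

```python
import json
import urllib.request

REQUEST_TYPE_HEADER = "X-Vertex-AI-LLM-Request-Type"


def build_headers(access_token, mode="default"):
    """Build request headers for one of three routing modes.

    default   -> no header; overages are billed as pay-as-you-go
    dedicated -> Provisioned Throughput only; overages return 429
    shared    -> pay-as-you-go only; bypasses Provisioned Throughput
    """
    headers = {
        "Authorization": f"Bearer {access_token}",
        "Content-Type": "application/json",
    }
    if mode in ("dedicated", "shared"):
        headers[REQUEST_TYPE_HEADER] = mode
    elif mode != "default":
        raise ValueError(f"unknown mode: {mode!r}")
    return headers


def build_request(url, access_token, text, mode="default"):
    """Return a POST request with the same JSON body as the curl examples."""
    body = {"contents": [{"role": "user", "parts": [{"text": text}]}]}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers=build_headers(access_token, mode),
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. The sketch stops at building the request so that it can be inspected without network access.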
Monitor Provisioned Throughput
You can monitor your Provisioned Throughput usage through monitoring metrics and on a per-request basis.
Response headers
If a request was processed using Provisioned Throughput, the following HTTP header is present in the response. This line of code applies only to the generateContent API call.
{"X-Vertex-AI-LLM-Request-Type": "dedicated"}
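A client can inspect this header to record, per request, whether Provisioned Throughput served it. The following is a minimal sketch: the function name is hypothetical, and it assumes the response headers are available as a mapping with an `items()` method, such as a dict or the headers object that Python's `urllib` returns.

```python
def served_by_provisioned_throughput(response_headers):
    """Return True if the response reports the 'dedicated' request type.

    HTTP header names are case-insensitive, so names are compared
    in lowercase.
    """
    for name, value in response_headers.items():
        if name.lower() == "x-vertex-ai-llm-request-type":
            return value.strip().lower() == "dedicated"
    # Header absent: the request wasn't reported as dedicated.
    return False


print(served_by_provisioned_throughput(
    {"X-Vertex-AI-LLM-Request-Type": "dedicated"}
))  # True
```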
Metrics
Provisioned Throughput can be monitored using a set of metrics that are measured on the aiplatform.googleapis.com/PublisherModel resource type. Each metric is filterable along the following dimensions:
- type: input, output
- request_type: dedicated, shared
To filter a metric to view the Provisioned Throughput usage, use the dedicated request type. The path prefix for a metric is aiplatform.googleapis.com/publisher/online_serving.
For example, the full path for the /consumed_throughput metric is aiplatform.googleapis.com/publisher/online_serving/consumed_throughput.
The following Cloud Monitoring metrics are available on the aiplatform.googleapis.com/PublisherModel resource:
Metric | Display name | Description |
---|---|---|
/characters | Characters | Input and output character count distribution. |
/character_count | Character count | Accumulated input and output character count. |
/consumed_throughput | Character Throughput | Throughput consumed (accounts for the burndown rate) in characters. |
/model_invocation_count | Model invocation count | Number of model invocations (prediction requests). |
/model_invocation_latencies | Model invocation latencies | Model invocation latencies (prediction latencies). |
/first_token_latencies | First token latencies | Duration from request received to first token returned. |
/tokens | Tokens | Input and output token count distribution. |
/token_count | Token count | Accumulated input and output token count. |
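As an illustration of how the path prefix and the request_type dimension combine, the sketch below assembles a full metric path and a Cloud Monitoring filter string. It rests on an assumption: that the type and request_type dimensions are exposed as metric labels with those exact keys. Verify this in Metrics Explorer before relying on it. The helper names are hypothetical.

```python
METRIC_PREFIX = "aiplatform.googleapis.com/publisher/online_serving"
RESOURCE_TYPE = "aiplatform.googleapis.com/PublisherModel"


def metric_path(name):
    """Join the documented prefix with a metric name such as '/tokens'."""
    return f"{METRIC_PREFIX}/{name.lstrip('/')}"


def provisioned_throughput_filter(name, io_type=None):
    """Build a filter that keeps only 'dedicated' request-type samples.

    io_type may be 'input' or 'output' to narrow the 'type' dimension.
    The label keys are assumed from the dimensions listed above.
    """
    clauses = [
        f'metric.type = "{metric_path(name)}"',
        f'resource.type = "{RESOURCE_TYPE}"',
        'metric.labels.request_type = "dedicated"',
    ]
    if io_type is not None:
        clauses.append(f'metric.labels.type = "{io_type}"')
    return " AND ".join(clauses)


print(provisioned_throughput_filter("/consumed_throughput"))
```

The resulting string is meant as a starting point to paste into a Cloud Monitoring query or pass to a client library's time-series listing call.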
Troubleshoot Provisioned Throughput
To correct a 429 error generated by Provisioned Throughput, do one of the following:
- Use the default example, which doesn't set a header in prediction requests. Any overages are processed on-demand and billed as pay-as-you-go.
- Increase the number of GSUs in your Provisioned Throughput subscription.
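The first option can also be applied per request. The sketch below tries Provisioned Throughput only and, on a 429, resends the request without the header so that it follows the default behavior. It is a sketch under assumptions: `send` is a hypothetical callable, supplied by the caller, that performs the HTTP request with the given extra headers and returns a `(status_code, body)` tuple. `on_overage` is an optional hypothetical hook for counting or logging overages.

```python
def send_with_fallback(send, payload, on_overage=None):
    """Try Provisioned Throughput only; on HTTP 429, retry with default routing."""
    status, body = send(payload, {"X-Vertex-AI-LLM-Request-Type": "dedicated"})
    if status == 429:
        if on_overage is not None:
            on_overage(payload)  # e.g. increment a counter or write a log line
        # No header: the request follows the default behavior, so any
        # overage is processed on-demand and billed as pay-as-you-go.
        status, body = send(payload, {})
    return status, body


# Usage with a stand-in sender that rejects dedicated traffic.
def fake_send(payload, extra_headers):
    if extra_headers.get("X-Vertex-AI-LLM-Request-Type") == "dedicated":
        return 429, "quota exceeded"
    return 200, "ok"


print(send_with_fallback(fake_send, {"text": "Hello."}))  # (200, 'ok')
```

Compared with sending no header at all, this makes overages visible to the application before they are billed, at the cost of one extra round trip whenever the order is exhausted.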
What's next
- Contact your Google Cloud account representative to place a Provisioned Throughput order or to increase the number of GSUs on an existing order.
- For more information about troubleshooting error 429 when using dynamic shared quota or Provisioned Throughput, see Error code 429.
- To learn more about dynamic shared quota (DSQ), see Dynamic shared quota.