Use Provisioned Throughput

This guide explains how to use Provisioned Throughput. You learn how to do the following:

• Understand how Provisioned Throughput checks your usage against your quota.
• Control overages or bypass Provisioned Throughput on a per-request basis.
• Monitor your Provisioned Throughput usage with metrics and dashboards.
• Set alerts to help manage your traffic.

How Provisioned Throughput works

Provisioned Throughput works by checking your usage against your quota during a defined quota enforcement period.

Provisioned Throughput quota checking

Your maximum quota for Provisioned Throughput is calculated by multiplying the number of purchased generative AI scale units (GSUs) by the throughput per GSU. Vertex AI checks your usage against this quota for each request made within your quota enforcement period. The quota enforcement period is the frequency at which the maximum quota is enforced.

When a request is received, the true response size is unknown. To prioritize response speed for real-time applications, Provisioned Throughput estimates the output token size.

If the initial estimate exceeds your available maximum quota, Vertex AI processes the request as pay-as-you-go. Otherwise, the request is processed using Provisioned Throughput.

After the response is generated and the true output token size is known, Vertex AI reconciles the actual usage and quota. The difference between the estimate and the actual usage is added back to your available Provisioned Throughput quota.
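The following sketch models this admission-and-reconciliation logic for intuition only. All names here are hypothetical; Vertex AI performs this accounting internally.

# Hypothetical model of Provisioned Throughput admission and reconciliation.

class QuotaTracker:
    def __init__(self, gsus: int, throughput_per_gsu: int):
        # Maximum quota = purchased GSUs * throughput per GSU.
        self.max_quota = gsus * throughput_per_gsu
        self.available = self.max_quota

    def admit(self, input_tokens: int, estimated_output_tokens: int) -> str:
        # Decide how to process a request before the true output size is known.
        estimate = input_tokens + estimated_output_tokens
        if estimate > self.available:
            # Overage: process the request as pay-as-you-go (default behavior).
            return "pay-as-you-go"
        self.available -= estimate
        return "provisioned-throughput"

    def reconcile(self, estimated_output_tokens: int, actual_output_tokens: int) -> None:
        # After the response is generated, add the difference between the
        # estimate and the actual usage back to the available quota.
        self.available += estimated_output_tokens - actual_output_tokens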

Provisioned Throughput quota enforcement period

For gemini-2.0-flash-lite and gemini-2.0-flash models, the quota enforcement period can be up to 30 seconds and is subject to change. This means you might temporarily have traffic that exceeds your per-second quota, but your usage shouldn't exceed your quota over a 30-second period. These periods are based on the Vertex AI internal clock and are independent of when you make requests.

For example, if you purchase one GSU of gemini-2.0-flash-001, you get 3,360 tokens per second of always-on throughput. Your usage can't exceed 100,800 tokens within a 30-second enforcement period, which is calculated as follows:

3,360 tokens per second * 30 seconds = 100,800 tokens

If you submit a single request that consumes 8,000 tokens in one second, it might still be processed as a Provisioned Throughput request. Even though this exceeds your 3,360 tokens-per-second limit, it doesn't exceed the 100,800 tokens-per-30-seconds threshold.
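The same arithmetic in code, as a quick illustrative check rather than an API call:

tokens_per_second = 3360   # throughput for one GSU of gemini-2.0-flash-001
enforcement_period = 30    # seconds; subject to change

budget = tokens_per_second * enforcement_period
print(budget)              # 100800 tokens per 30-second period

# An 8,000-token request in one second exceeds the per-second rate but
# fits within the 30-second budget, so it can still be processed as a
# Provisioned Throughput request.
print(8000 <= budget)      # True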

Control overages or bypass Provisioned Throughput

You can use the API to control overages when you exceed your purchased throughput or to bypass Provisioned Throughput on a per-request basis.

• Default behavior: Requests that exceed your purchased throughput are processed as on-demand and billed at the pay-as-you-go rate. Use this option to process all requests and accept potential on-demand costs for traffic spikes.
• Use only Provisioned Throughput: Requests that exceed your purchased throughput return an error instead of being processed as on-demand. Use this option to strictly manage costs by preventing on-demand charges.
• Use only pay-as-you-go: Requests bypass your purchased throughput and are processed directly as on-demand. Use this option for experiments or development workloads where you don't want to consume your provisioned capacity.

Default behavior

By default, if you exceed your purchased throughput, Vertex AI processes the overage traffic on-demand and bills it at the pay-as-you-go rate. This behavior is automatic after your Provisioned Throughput order is active in a region. No code changes are required.

Use only Provisioned Throughput

To avoid on-demand charges, you can configure your requests to use only Provisioned Throughput. Any request that exceeds your purchased throughput returns an HTTP 429 error.

When sending requests to the API, set the X-Vertex-AI-LLM-Request-Type HTTP header to dedicated.
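For example, with the Gen AI SDK (shown in full in the Example section later in this guide), you can set the header when you create the client. The following is a sketch; the 429 handling assumes the google-genai errors module and is illustrative:

from google import genai
from google.genai import errors
from google.genai.types import HttpOptions

# Requests from this client are served only by Provisioned Throughput.
client = genai.Client(
    http_options=HttpOptions(
        api_version="v1",
        headers={"X-Vertex-AI-LLM-Request-Type": "dedicated"},
    )
)

try:
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents="How does AI work?",
    )
    print(response.text)
except errors.APIError as e:
    if e.code == 429:
        # Traffic beyond your purchased throughput is rejected instead of
        # spilling over to pay-as-you-go.
        print("Provisioned Throughput quota exceeded.")
    else:
        raise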

Use only pay-as-you-go

You can bypass your Provisioned Throughput order and send requests directly to pay-as-you-go. This approach is useful for experiments or development workloads.

When sending requests to the API, set the X-Vertex-AI-LLM-Request-Type HTTP header to shared.

Example

Python

Install

pip install --upgrade google-genai

To learn more, see the SDK reference documentation.

Set environment variables to use the Gen AI SDK with Vertex AI:

# Replace the `GOOGLE_CLOUD_PROJECT` and `GOOGLE_CLOUD_LOCATION` values
# with appropriate values for your project.
export GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
export GOOGLE_CLOUD_LOCATION=us-central1
export GOOGLE_GENAI_USE_VERTEXAI=True

from google import genai
from google.genai.types import HttpOptions

client = genai.Client(
    http_options=HttpOptions(
        api_version="v1",
        headers={
            # Options:
            # - "dedicated": Use Provisioned Throughput
            # - "shared": Use pay-as-you-go
            # https://cloud.google.com/vertex-ai/generative-ai/docs/use-provisioned-throughput
            "X-Vertex-AI-LLM-Request-Type": "shared"
        },
    )
)
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="How does AI work?",
)
print(response.text)
# Example response:
# Okay, let's break down how AI works. It's a broad field, so I'll focus on the ...
#
# Here's a simplified overview:
# ...

REST

After you set up your environment, you can use REST to test a text prompt. The following sample sends a request to the publisher model endpoint.

# Set the X-Vertex-AI-LLM-Request-Type header to dedicated or shared.
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -H "X-Vertex-AI-LLM-Request-Type: dedicated" \
  $URL \
  -d '{"contents": [{"role": "user", "parts": [{"text": "Hello."}]}]}'

Monitor Provisioned Throughput

You can monitor your Provisioned Throughput usage with metrics on the aiplatform.googleapis.com/PublisherModel resource type.

Dimensions

You can filter metrics by using the following dimensions:

• type: input or output.
• request_type: One of the following:
    • dedicated: Traffic is processed using Provisioned Throughput.
    • spillover: Traffic is processed as pay-as-you-go quota after you exceed your Provisioned Throughput quota.
    • shared: If Provisioned Throughput is active, traffic is processed as pay-as-you-go quota because of the shared HTTP header. If Provisioned Throughput isn't active, traffic is processed as pay-as-you-go by default.

Path prefix

The path prefix for a metric is aiplatform.googleapis.com/publisher/online_serving.

For example, the full path for the /consumed_throughput metric is aiplatform.googleapis.com/publisher/online_serving/consumed_throughput.

Metrics

The following Cloud Monitoring metrics are available on the aiplatform.googleapis.com/PublisherModel resource for the Gemini models. Filter on the dedicated request_type value to view your Provisioned Throughput usage.

• /dedicated_gsu_limit (Limit (GSU)): Dedicated limit in GSUs. Use this metric to understand your Provisioned Throughput maximum quota in GSUs.
• /tokens (Tokens): Input and output token count distribution.
• /token_count (Token count): Accumulated input and output token count.
• /consumed_token_throughput (Token throughput): Throughput usage, which accounts for the burndown rate in tokens and incorporates quota reconciliation. See Provisioned Throughput quota checking. Use this metric to understand how your Provisioned Throughput quota was used.
• /dedicated_token_limit (Limit (tokens per second)): Dedicated limit in tokens per second. Use this metric to understand your Provisioned Throughput maximum quota for token-based models.
• /characters (Characters): Input and output character count distribution.
• /character_count (Character count): Accumulated input and output character count.
• /consumed_throughput (Character throughput): Throughput usage, which accounts for the burndown rate in characters and incorporates quota reconciliation. See Provisioned Throughput quota checking. Use this metric to understand how your Provisioned Throughput quota was used. For token-based models, this metric is equivalent to the throughput consumed in tokens multiplied by 4.
• /dedicated_character_limit (Limit (characters per second)): Dedicated limit in characters per second. Use this metric to understand your Provisioned Throughput maximum quota for character-based models.
• /model_invocation_count (Model invocation count): Number of model invocations (prediction requests).
• /model_invocation_latencies (Model invocation latencies): Model invocation latencies (prediction latencies).
• /first_token_latencies (First token latencies): Duration from when the request is received to when the first token is returned.

For Anthropic models, a filter for Provisioned Throughput is also available, but only for the tokens/token_count metric.
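You can also read these metrics programmatically. The following sketch uses the Cloud Monitoring API client library (google-cloud-monitoring) to list the last hour of dedicated token throughput; PROJECT_ID is a placeholder, and the request_type label filter assumes the dimension described earlier:

import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 3600}, "end_time": {"seconds": now}}
)

# List token throughput that was processed as Provisioned Throughput
# (request_type = dedicated) over the last hour.
results = client.list_time_series(
    request={
        "name": "projects/PROJECT_ID",
        "filter": (
            'metric.type = "aiplatform.googleapis.com/publisher/'
            'online_serving/consumed_token_throughput" '
            'AND metric.labels.request_type = "dedicated"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    for point in series.points:
        print(point.interval.end_time, point.value.double_value)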

Dashboards

Default monitoring dashboards for Provisioned Throughput provide metrics that help you understand your usage and utilization. To access the dashboards, follow these steps:

  1. In the Google Cloud console, go to the Provisioned Throughput page.

    Go to Provisioned Throughput

  2. To view the Provisioned Throughput utilization of each model across your orders, select the Utilization summary tab.

    In the Provisioned Throughput utilization by model table, you can view the following for the selected time range:

    • The total number of GSUs that you had.
    • Your peak throughput usage in GSUs.
    • Your average GSU utilization.
    • The number of times that you reached your Provisioned Throughput limit.
  3. Select a model from the Provisioned Throughput utilization by model table to see more metrics specific to the selected model.

Limitations of the dashboard

The dashboard might display unexpected results for fluctuating, spiky, or infrequent traffic (for example, less than one query per second). This can be caused by the following:

  • Time ranges: Using time ranges larger than 12 hours can lead to a less accurate representation of the quota enforcement period. Throughput metrics and their derivatives, such as utilization, display averages across alignment periods that are based on the selected time range. When the time range expands, each alignment period also expands. Because quota enforcement is calculated at a sub-minute level, setting the time range to 12 hours or less results in minute-level data that is more comparable to the actual quota enforcement period. For more information, see Alignment: within-series regularization and Regularizing time intervals.
  • Concurrent requests: If you submit multiple requests at the same time, monitoring aggregations might affect your ability to filter down to specific requests.
  • Reporting latency: Provisioned Throughput throttles traffic when you make a request but reports usage metrics after the quota is reconciled.
  • Period alignment: Provisioned Throughput quota enforcement periods are independent from and might not align with monitoring aggregation periods or request-or-response periods.
  • Error messages: If no errors occur, you might still see an error message in the error rate chart, such as An error occurred requesting data. One or more resources could not be found.

Monitor Genmedia models

The metrics for the Veo 3 and Imagen models express throughput in tokens, as follows:

  • For Veo models: 100 tokens = 1 video second
  • For Imagen models: 1 token = 1 image

For example, if you're monitoring your Provisioned Throughput usage for the Veo 3 model, the /consumed_token_throughput metric represents the video seconds throughput and the /dedicated_token_limit represents the dedicated limit in video seconds per second.

For information about the burndown rates for each model, see Supported models. For example, if you're using Veo 3, then 1 output video+audio second = 1.6 output video seconds. Therefore, in this case, 1 video+audio second is equivalent to 160 tokens.
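As a quick illustration, here's that conversion in code. The function is hypothetical and uses only the rates stated above:

TOKENS_PER_VIDEO_SECOND = 100  # Veo models: 100 tokens = 1 video second
VIDEO_AUDIO_BURNDOWN = 1.6     # Veo 3: 1 video+audio second = 1.6 video seconds

def veo3_output_tokens(output_seconds: float, with_audio: bool) -> float:
    # Convert Veo 3 output seconds into tokens for quota arithmetic.
    video_seconds = output_seconds * (VIDEO_AUDIO_BURNDOWN if with_audio else 1.0)
    return video_seconds * TOKENS_PER_VIDEO_SECOND

print(veo3_output_tokens(1, with_audio=True))  # 160.0, matching the example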

Alerting

You can set default alerts to help manage your traffic usage.

Enable alerts

To enable alerts in the dashboard, follow these steps:

  1. In the Google Cloud console, go to the Provisioned Throughput page.

    Go to Provisioned Throughput

  2. To view the Provisioned Throughput utilization of each model across your orders, select the Utilization summary tab.

  3. Select Recommended alerts, and the following alerts display:

    • Provisioned Throughput Usage Reached Limit
    • Provisioned Throughput Utilization Exceeded 80%
    • Provisioned Throughput Utilization Exceeded 90%
  4. Select the checkboxes for the alerts that help you manage your traffic.

View more alert details

To view more information about alerts, follow these steps:

  1. Go to the Integrations page.

    Go to Integrations

  2. Enter vertex into the Filter field and press Enter. Google Vertex AI appears.

  3. To view more information, click View details. The Google Vertex AI details pane displays.

  4. Select the Alerts tab, where you can select an Alert Policy template.

What's next