Use Provisioned Throughput

This page explains how to control overages, how to bypass Provisioned Throughput on a per-request basis, and how to monitor your Provisioned Throughput usage.

Control overages or bypass Provisioned Throughput

Use the REST API to control overages when you exceed your purchased throughput or to bypass Provisioned Throughput on a per-request basis.

Review each option to determine which behavior fits your use case.

Default behavior

If you exceed your purchased amount of throughput, overages are served on-demand and billed at the pay-as-you-go rate. The default behavior takes effect automatically after your Provisioned Throughput order is active; you don't have to change your code to begin consuming your order.

This curl example demonstrates the default behavior.

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  $URL \
  -d '{"contents": [{"role": "user", "parts": [{"text": "Hello."}]}]}'

Use only Provisioned Throughput

If you are managing costs by avoiding on-demand charges, use only Provisioned Throughput. Requests that exceed your Provisioned Throughput order amount return a 429 error.

This curl example demonstrates how you can use the REST API to consume only your Provisioned Throughput order, with overages returning a 429 error.

Set the X-Vertex-AI-LLM-Request-Type header to dedicated.

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -H "X-Vertex-AI-LLM-Request-Type: dedicated" \
  $URL \
  -d '{"contents": [{"role": "user", "parts": [{"text": "Hello."}]}]}'

Use only pay-as-you-go

This is also referred to as using on-demand. Requests bypass the Provisioned Throughput order and are sent directly to pay-as-you-go. This might be useful for experiments or applications that are in development.

This curl example demonstrates how you can use the REST API to bypass Provisioned Throughput and use only pay-as-you-go.

Set the X-Vertex-AI-LLM-Request-Type header to shared.

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -H "X-Vertex-AI-LLM-Request-Type: shared" \
  $URL \
  -d '{"contents": [{"role": "user", "parts": [{"text": "Hello."}]}]}'

Monitor Provisioned Throughput

You can monitor your Provisioned Throughput usage in aggregate through Cloud Monitoring metrics and on a per-request basis through response headers.

Response headers

If a request was processed using Provisioned Throughput, the following HTTP header is present in the response. The header applies only to the generateContent API call.

  {"X-Vertex-AI-LLM-Request-Type": "dedicated"}

Metrics

Provisioned Throughput can be monitored using a set of metrics that are measured on the aiplatform.googleapis.com/PublisherModel resource type. Each metric is filterable along the following dimensions:

  • type: input, output
  • request_type: dedicated, shared

To view Provisioned Throughput usage, filter a metric by the dedicated request type. The path prefix for a metric is aiplatform.googleapis.com/publisher/online_serving.

For example, the full path for the /consumed_throughput metric is aiplatform.googleapis.com/publisher/online_serving/consumed_throughput.
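To read this metric programmatically, you can query the Cloud Monitoring API's timeSeries.list method with a filter on the dedicated request type. This is a sketch, assuming PROJECT_ID, START_TIME, and END_TIME (RFC 3339 timestamps) are placeholders that you replace with your own values:

  # Query consumed throughput attributed to Provisioned Throughput (dedicated).
  curl -G "https://monitoring.googleapis.com/v3/projects/PROJECT_ID/timeSeries" \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    --data-urlencode 'filter=metric.type = "aiplatform.googleapis.com/publisher/online_serving/consumed_throughput" AND metric.label.request_type = "dedicated"' \
    --data-urlencode "interval.startTime=START_TIME" \
    --data-urlencode "interval.endTime=END_TIME"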

The following Cloud Monitoring metrics are available on the aiplatform.googleapis.com/PublisherModel resource for Gemini models and have a filter for Provisioned Throughput usage:

Metric | Display name | Description
/characters | Characters | Input and output character count distribution.
/character_count | Character count | Accumulated input and output character count.
/consumed_throughput | Character Throughput | Throughput consumed (accounts for the burndown rate) in characters.
/model_invocation_count | Model invocation count | Number of model invocations (prediction requests).
/model_invocation_latencies | Model invocation latencies | Model invocation latencies (prediction latencies).
/first_token_latencies | First token latencies | Duration from request received to first token returned.
/tokens | Tokens | Input and output token count distribution.
/token_count | Token count | Accumulated input and output token count.

Anthropic models also have a filter for Provisioned Throughput, but only for the /tokens and /token_count metrics.

What's next